Patchwork [4,of,7] revlog: add test to check raw processing is sane

login
register
mail settings
Submitter Jun Wu
Date March 28, 2017, 7:49 a.m.
Message ID <a9d87712bec99abe109c.1490687345@localhost.localdomain>
Download mbox | patch
Permalink /patch/19778/
State Changes Requested
Headers show

Comments

Jun Wu - March 28, 2017, 7:49 a.m.
# HG changeset patch
# User Jun Wu <quark@fb.com>
# Date 1490677106 25200
#      Mon Mar 27 21:58:26 2017 -0700
# Node ID a9d87712bec99abe109c155948ee4b7f1f5ec208
# Parent  d557aaee6ada70bf51fcc9d4d05d07a54d8f2d4e
# Available At https://bitbucket.org/quark-zju/hg-draft
#              hg pull https://bitbucket.org/quark-zju/hg-draft -r a9d87712bec9
revlog: add test to check raw processing is sane

This test fails in various ways, if any of the previous 3 patches is
missing.

Note: I was first putting the test at the first patch. Then found it got
too much churn with the fixes, and the test file diff is not that obvious
explaining what happened. So I've decided to put the test after behavior
fix, before performance fix.
Ryan McElroy - March 29, 2017, 12:40 p.m.
On 3/28/17 8:49 AM, Jun Wu wrote:
> # HG changeset patch
> # User Jun Wu <quark@fb.com>
> # Date 1490677106 25200
> #      Mon Mar 27 21:58:26 2017 -0700
> # Node ID a9d87712bec99abe109c155948ee4b7f1f5ec208
> # Parent  d557aaee6ada70bf51fcc9d4d05d07a54d8f2d4e
> revlog: add test to check raw processing is sane

Thanks for adding this test. I have a bunch of comments to make the test 
more understandable (both the code and the output).

>
> This test fails in various ways, if any of the previous 3 patches is
> missing.

nit: uneccesary comma makes this sentence harder to understand

>
> Note: I was first putting the test at the first patch. Then found it got
> too much churn with the fixes, and the test file diff is not that obvious
> explaining what happened. So I've decided to put the test after behavior
> fix, before performance fix.

I agree that with all the changes that show up in this test (I've 
applied and ran it locally), this is the right choice if we can't make 
the test output more self-explanatory.

I have some suggestions inline that might make the output of this test 
easier to understand. Then you could see if the test changes are 
"self-explanatory" or not. I think it would be pretty useful to see how 
output changes over time, although it's not worth blocking these fixes 
if that proves to be hard to do.

>
> diff --git a/tests/test-revlog-raw.py b/tests/test-revlog-raw.py
> new file mode 100644
> --- /dev/null
> +++ b/tests/test-revlog-raw.py
> @@ -0,0 +1,121 @@
> +# test revlog interaction about raw data (flagprocessor)
> +
> +from __future__ import absolute_import, print_function
> +
> +from mercurial import (
> +    encoding,
> +    extensions,
> +    node,
> +    revlog,
> +    transaction,
> +    vfs,
> +)
> +
> +# TESTTMP is optional - so the test can run outside run-tests.py
> +tvfs = vfs.vfs(encoding.environ.get('TESTTMP', b'/tmp'))
> +
> +# format utilities
> +def rightpad(data, length):
> +    return (data + b' ' * length)[:length]
> +
> +def normalize(data):
> +    data = data.replace(b'LINEPREFIX_', b'').replace(b'\n', b'')
> +    return rightpad(data, 7)

Why is this LINEPREFIX_ thing here? Make it obvious or add a comment.

> +
> +# register a revlog processor for EXTSTORED
> +def readprocessor(self, text):
> +    return text[2:], True
> +
> +def writeprocessor(self, text):
> +    return b'+\n' + text, False

I'd suggest having the writeprocessor add a "RAW:" prefix rather than 
this "+\n" thing. That will make it more clear what is going on I think. 
The the readprocessor can strip off 4 chars and you don't need such a 
funky normalize function.

Unless you need a strange character like newline for some reason? If so, 
you should add a comment to that effect and explain why it's needed. If 
you indeed need a "strange" character, is there another character that 
doesn't have the same output issues that still has the "strange" 
property you need?

If you need to keep newline, I think normalize should be changed to 
replace newlines with "\\n", rather than "", so they are "visible" to 
the test output.

> +
> +def rawprocessor(self, text):
> +    # do not check hash
> +    return False
> +
> +revlog.addflagprocessor(revlog.REVIDX_EXTSTORED,
> +                        (readprocessor, writeprocessor, rawprocessor))
> +
> +# replace revlog.hash so the test is sensitive to anything that goes wrong
> +def _hash(orig, text, p1, p2):
> +    r = orig(text, p1, p2)
> +    print('  # %s = hash(%s, %s)'
> +          % (node.short(r), node.short(p1), normalize(text).strip()))

Wait, you "normalize" the text and then strip() it? What's the point of 
the padding then?

> +    return r
> +
> +extensions.wrapfunction(revlog, 'hash', _hash)
> +
> +# writing revlog requires a transaction
> +def report(msg):
> +    print('transaction: %s' % msg.strip())
This output is never seen in your test output. What's the point of this 
print statement?

> +
> +tr = transaction.transaction(report, tvfs, {'plain': tvfs}, b'journal')
> +
> +def newrevlog(name='foo.i'):
> +    return revlog.revlog(tvfs, 'foo.i')
You should pass along the name, or not accept name as a parameter.

> +
> +rl = newrevlog()
> +
> +# We are going to construct something like:
> +# rev 0: delta-base
> +# rev 1: delta against rev 0, with special flag
> +# rev 2: delta against rev 1, without special flag

Good comment, thanks!

> +
> +def appendrev(rlog, data, flags=revlog.REVIDX_DEFAULT_FLAGS):
> +    nextrev = len(rlog)
> +    p1 = rlog.node(nextrev - 1)
> +    p2 = node.nullid
> +    try:
> +        n = rl.addrevision(data, tr, nextrev, p1, p2, flags=flags)
> +        print(b'Appended rev %d as %s' % (nextrev, node.short(n)))
> +    except Exception as ex:
> +        print(b'Failed to append rev %d: %s' % (nextrev, ex))
> +
> +# revision 0
> +data = b''.join(b'LINEPREFIX_%d\n' % i for i in range(5))
> +appendrev(rl, data)

What is the purpose of using LINEPREFIX_ and \n? Why not just keep the 
data nice and short and on one line so we can see the whole thing 
without normalizing?

> +
> +# revision 1
> +data = data.replace(b'3\n', b'T\n3\n')
> +appendrev(rl, data, flags=revlog.REVIDX_EXTSTORED)
> +
> +# revision 2
> +data = data.replace(b'1\n', b'').replace('4\n', '5')
> +appendrev(rl, data)

Is there a method to these replacements, or are you just making changes 
of some sort? If there's a method, please explain it in a comment. If 
it's just random changes, make the smallest possible changes that lead 
to the result.

My guess now that I've been thinking about this: the LINEPREFIX_ stuff 
it to make sure that there's enough "content" that revs 1 and 2 are 
delta-encoded instead of fulltext encoded. Is that right? Then you need 
to call it out in a comment.

The minimal changes would probably be just a character changed at the 
end of the line.

> +
> +tr.close()
> +
> +# examine content of the revlog
> +print(b'\nREV RAW DATA    CACHE BEFORE -> AFTER')

It took me a while looking at this header and the data to see how 
everything fit together here. I think we should be a little more 
explicit that CACHE has a rev and text data in its output. I'd just 
spread things out more so that the output of the test is more obvious.

> +
> +def describecache(rlog):
> +    if rlog._cache is None:
> +        cache = b'None'
> +    else:
> +        n, r, text = rlog._cache
> +        cache = b'%d %s' % (r, normalize(text))
> +    return rightpad(cache, 9)
> +
> +def printrev(rlog, rev, raw):
> +    cachebefore = describecache(rlog)
> +    try:
> +        data = rlog.revision(rev, raw=raw)
> +    except Exception:
> +        data = b'Error'
> +    cacheafter = describecache(rlog).strip()
> +    print(b'%3d %s   %s %s       %s'
> +          % (rev, raw and 'y' or ' ', normalize(data), cachebefore, cacheafter))
I'd prefer an 'n' for not raw for clarity.

> +
> +for revorder in [[1, 2], [2, 1]]:
> +    for raworder in [[True], [False], [True, False], [False, True]]:
> +        rlog = newrevlog()
> +        for rev in revorder:
> +            for raw in raworder:
> +                printrev(rlog, rev, raw)
> +        print()

It's great to have each of these cases covered, but I'd argue that the 
test would be more readable if you only had one case that exposed the 
issue and the fix, and then added the rest of the cases once the issues 
are all fixed to prevent regressions.

> +
> +# when run outside run-tests.py, clean-ups are needed
> +if 'TESTTMP' not in encoding.environ:
> +    rl = None
> +    tvfs.tryunlink('foo.i')
> +    tvfs.tryunlink('foo.d')

Hopefully we don't ever introduce a file by these names...

> diff --git a/tests/test-revlog-raw.py.out b/tests/test-revlog-raw.py.out
> new file mode 100644
> --- /dev/null
> +++ b/tests/test-revlog-raw.py.out
> @@ -0,0 +1,58 @@
> +  # ea4d2d9efcc9 = hash(000000000000, 01234)
> +  # ea4d2d9efcc9 = hash(000000000000, 01234)
> +Appended rev 0 as ea4d2d9efcc9
> +  # 7bc003d2bcd3 = hash(ea4d2d9efcc9, 012T34)
> +Appended rev 1 as 7bc003d2bcd3
> +  # 3338e2e7da35 = hash(7bc003d2bcd3, 02T35)
> +  # 3338e2e7da35 = hash(7bc003d2bcd3, 02T35)
> +Appended rev 2 as 3338e2e7da35
> +
> +REV RAW DATA    CACHE BEFORE -> AFTER
> +  1 y   +012T34 None            1 +012T34
> +  # 3338e2e7da35 = hash(7bc003d2bcd3, 02T35)
> +  2 y   02T35   1 +012T34       2 02T35
> +
> +  # 7bc003d2bcd3 = hash(ea4d2d9efcc9, 012T34)
> +  1     012T34  None            1 +012T34
> +  # 3338e2e7da35 = hash(7bc003d2bcd3, 02T35)
> +  2     02T35   1 +012T34       2 02T35
> +
> +  1 y   +012T34 None            1 +012T34
> +  # 7bc003d2bcd3 = hash(ea4d2d9efcc9, 012T34)
> +  1     012T34  1 +012T34       1 +012T34
> +  # 3338e2e7da35 = hash(7bc003d2bcd3, 02T35)
> +  2 y   02T35   1 +012T34       2 02T35
> +  # 3338e2e7da35 = hash(7bc003d2bcd3, 02T35)
> +  2     02T35   2 02T35         2 02T35
> +
> +  # 7bc003d2bcd3 = hash(ea4d2d9efcc9, 012T34)
> +  1     012T34  None            1 +012T34
> +  1 y   +012T34 1 +012T34       1 +012T34
> +  # 3338e2e7da35 = hash(7bc003d2bcd3, 02T35)
> +  2     02T35   1 +012T34       2 02T35
> +  2 y   02T35   2 02T35         2 02T35
> +
> +  # 3338e2e7da35 = hash(7bc003d2bcd3, 02T35)
> +  2 y   02T35   None            2 02T35
> +  1 y   +012T34 2 02T35         1 +012T34
> +
> +  # 3338e2e7da35 = hash(7bc003d2bcd3, 02T35)
> +  2     02T35   None            2 02T35
> +  # 7bc003d2bcd3 = hash(ea4d2d9efcc9, 012T34)
> +  1     012T34  2 02T35         1 +012T34
> +
> +  # 3338e2e7da35 = hash(7bc003d2bcd3, 02T35)
> +  2 y   02T35   None            2 02T35
> +  # 3338e2e7da35 = hash(7bc003d2bcd3, 02T35)
> +  2     02T35   2 02T35         2 02T35
> +  1 y   +012T34 2 02T35         1 +012T34
> +  # 7bc003d2bcd3 = hash(ea4d2d9efcc9, 012T34)
> +  1     012T34  1 +012T34       1 +012T34
> +
> +  # 3338e2e7da35 = hash(7bc003d2bcd3, 02T35)
> +  2     02T35   None            2 02T35
> +  2 y   02T35   2 02T35         2 02T35
> +  # 7bc003d2bcd3 = hash(ea4d2d9efcc9, 012T34)
> +  1     012T34  2 02T35         1 +012T34
> +  1 y   +012T34 1 +012T34       1 +012T34
> +
Jun Wu - March 29, 2017, 3:34 p.m.
Excerpts from Ryan McElroy's message of 2017-03-29 13:40:41 +0100:
> On 3/28/17 8:49 AM, Jun Wu wrote:
> > # HG changeset patch
> > # User Jun Wu <quark@fb.com>
> > # Date 1490677106 25200
> > #      Mon Mar 27 21:58:26 2017 -0700
> > # Node ID a9d87712bec99abe109c155948ee4b7f1f5ec208
> > # Parent  d557aaee6ada70bf51fcc9d4d05d07a54d8f2d4e
> > revlog: add test to check raw processing is sane
> 
> Thanks for adding this test. I have a bunch of comments to make the test 
> more understandable (both the code and the output).
> 
> >
> > This test fails in various ways, if any of the previous 3 patches is
> > missing.
> 
> nit: uneccesary comma makes this sentence harder to understand
> 
> >
> > Note: I was first putting the test at the first patch. Then found it got
> > too much churn with the fixes, and the test file diff is not that obvious
> > explaining what happened. So I've decided to put the test after behavior
> > fix, before performance fix.
> 
> I agree that with all the changes that show up in this test (I've 
> applied and ran it locally), this is the right choice if we can't make 
> the test output more self-explanatory.
> 
> I have some suggestions inline that might make the output of this test 
> easier to understand. Then you could see if the test changes are 
> "self-explanatory" or not. I think it would be pretty useful to see how 
> output changes over time, although it's not worth blocking these fixes 
> if that proves to be hard to do.
> 
> >
> > diff --git a/tests/test-revlog-raw.py b/tests/test-revlog-raw.py
> > new file mode 100644
> > --- /dev/null
> > +++ b/tests/test-revlog-raw.py
> > @@ -0,0 +1,121 @@
> > +# test revlog interaction about raw data (flagprocessor)
> > +
> > +from __future__ import absolute_import, print_function
> > +
> > +from mercurial import (
> > +    encoding,
> > +    extensions,
> > +    node,
> > +    revlog,
> > +    transaction,
> > +    vfs,
> > +)
> > +
> > +# TESTTMP is optional - so the test can run outside run-tests.py
> > +tvfs = vfs.vfs(encoding.environ.get('TESTTMP', b'/tmp'))
> > +
> > +# format utilities
> > +def rightpad(data, length):
> > +    return (data + b' ' * length)[:length]
> > +
> > +def normalize(data):
> > +    data = data.replace(b'LINEPREFIX_', b'').replace(b'\n', b'')
> > +    return rightpad(data, 7)
> 
> Why is this LINEPREFIX_ thing here? Make it obvious or add a comment.

Shorter text will let mercurial think storing "full text" is better than
deltas and won't trigger the bugs.

> 
> > +
> > +# register a revlog processor for EXTSTORED
> > +def readprocessor(self, text):
> > +    return text[2:], True
> > +
> > +def writeprocessor(self, text):
> > +    return b'+\n' + text, False
> 
> I'd suggest having the writeprocessor add a "RAW:" prefix rather than 
> this "+\n" thing. That will make it more clear what is going on I think. 
> The the readprocessor can strip off 4 chars and you don't need such a 
> funky normalize function.

The motivation is to make the output short. The content does not matter. It
has to be a new line to demonstrate some delta application failures.

> 
> Unless you need a strange character like newline for some reason? If so, 
> you should add a comment to that effect and explain why it's needed. If 
> you indeed need a "strange" character, is there another character that 
> doesn't have the same output issues that still has the "strange" 
> property you need?
> 
> If you need to keep newline, I think normalize should be changed to 
> replace newlines with "\\n", rather than "", so they are "visible" to 
> the test output.

I'm not convinced that it's a good idea. The test output is made to be short
and readable, so we can show density information in a comparison table. It's
never meant to show the raw data byte-to-byte.

> > +
> > +def rawprocessor(self, text):
> > +    # do not check hash
> > +    return False
> > +
> > +revlog.addflagprocessor(revlog.REVIDX_EXTSTORED,
> > +                        (readprocessor, writeprocessor, rawprocessor))
> > +
> > +# replace revlog.hash so the test is sensitive to anything that goes wrong
> > +def _hash(orig, text, p1, p2):
> > +    r = orig(text, p1, p2)
> > +    print('  # %s = hash(%s, %s)'
> > +          % (node.short(r), node.short(p1), normalize(text).strip()))
> 
> Wait, you "normalize" the text and then strip() it? What's the point of 
> the padding then?
> 
> > +    return r
> > +
> > +extensions.wrapfunction(revlog, 'hash', _hash)
> > +
> > +# writing revlog requires a transaction
> > +def report(msg):
> > +    print('transaction: %s' % msg.strip())
> This output is never seen in your test output. What's the point of this 
> print statement?

Probably not needed.

> 
> > +
> > +tr = transaction.transaction(report, tvfs, {'plain': tvfs}, b'journal')
> > +
> > +def newrevlog(name='foo.i'):
> > +    return revlog.revlog(tvfs, 'foo.i')
> You should pass along the name, or not accept name as a parameter.

My failure.

> 
> > +
> > +rl = newrevlog()
> > +
> > +# We are going to construct something like:
> > +# rev 0: delta-base
> > +# rev 1: delta against rev 0, with special flag
> > +# rev 2: delta against rev 1, without special flag
> 
> Good comment, thanks!
> 
> > +
> > +def appendrev(rlog, data, flags=revlog.REVIDX_DEFAULT_FLAGS):
> > +    nextrev = len(rlog)
> > +    p1 = rlog.node(nextrev - 1)
> > +    p2 = node.nullid
> > +    try:
> > +        n = rl.addrevision(data, tr, nextrev, p1, p2, flags=flags)
> > +        print(b'Appended rev %d as %s' % (nextrev, node.short(n)))
> > +    except Exception as ex:
> > +        print(b'Failed to append rev %d: %s' % (nextrev, ex))
> > +
> > +# revision 0
> > +data = b''.join(b'LINEPREFIX_%d\n' % i for i in range(5))
> > +appendrev(rl, data)
> 
> What is the purpose of using LINEPREFIX_ and \n? Why not just keep the 
> data nice and short and on one line so we can see the whole thing 
> without normalizing?
> 
> > +
> > +# revision 1
> > +data = data.replace(b'3\n', b'T\n3\n')
> > +appendrev(rl, data, flags=revlog.REVIDX_EXTSTORED)
> > +
> > +# revision 2
> > +data = data.replace(b'1\n', b'').replace('4\n', '5')
> > +appendrev(rl, data)
> 
> Is there a method to these replacements, or are you just making changes 
> of some sort? If there's a method, please explain it in a comment. If 
> it's just random changes, make the smallest possible changes that lead 
> to the result.
> 
> My guess now that I've been thinking about this: the LINEPREFIX_ stuff 
> it to make sure that there's enough "content" that revs 1 and 2 are 
> delta-encoded instead of fulltext encoded. Is that right? Then you need 
> to call it out in a comment.
> 
> The minimal changes would probably be just a character changed at the 
> end of the line.
> 
> > +
> > +tr.close()
> > +
> > +# examine content of the revlog
> > +print(b'\nREV RAW DATA    CACHE BEFORE -> AFTER')
> 
> It took me a while looking at this header and the data to see how 
> everything fit together here. I think we should be a little more 
> explicit that CACHE has a rev and text data in its output. I'd just 
> spread things out more so that the output of the test is more obvious.
> 
> > +
> > +def describecache(rlog):
> > +    if rlog._cache is None:
> > +        cache = b'None'
> > +    else:
> > +        n, r, text = rlog._cache
> > +        cache = b'%d %s' % (r, normalize(text))
> > +    return rightpad(cache, 9)
> > +
> > +def printrev(rlog, rev, raw):
> > +    cachebefore = describecache(rlog)
> > +    try:
> > +        data = rlog.revision(rev, raw=raw)
> > +    except Exception:
> > +        data = b'Error'
> > +    cacheafter = describecache(rlog).strip()
> > +    print(b'%3d %s   %s %s       %s'
> > +          % (rev, raw and 'y' or ' ', normalize(data), cachebefore, cacheafter))
> I'd prefer an 'n' for not raw for clarity.
> 
> > +
> > +for revorder in [[1, 2], [2, 1]]:
> > +    for raworder in [[True], [False], [True, False], [False, True]]:
> > +        rlog = newrevlog()
> > +        for rev in revorder:
> > +            for raw in raworder:
> > +                printrev(rlog, rev, raw)
> > +        print()
> 
> It's great to have each of these cases covered, but I'd argue that the 
> test would be more readable if you only had one case that exposed the 
> issue and the fix, and then added the rest of the cases once the issues 
> are all fixed to prevent regressions.

Although that's doable, that requires significant time. And the bugs are
cancelling each other somehow: When you fix none, test 1 fails. When you fix
test 1, test 2 previously passed fails. Since the bugs won't be able to be
fixed incrementally (fixing the second introduce regression on the first),
I'll prefer keeping it this way, that is to test all cases after the fixes.

> 
> > +
> > +# when run outside run-tests.py, clean-ups are needed
> > +if 'TESTTMP' not in encoding.environ:
> > +    rl = None
> > +    tvfs.tryunlink('foo.i')
> > +    tvfs.tryunlink('foo.d')
> 
> Hopefully we don't ever introduce a file by these names...
> 
> > diff --git a/tests/test-revlog-raw.py.out b/tests/test-revlog-raw.py.out
> > new file mode 100644
> > --- /dev/null
> > +++ b/tests/test-revlog-raw.py.out
> > @@ -0,0 +1,58 @@
> > +  # ea4d2d9efcc9 = hash(000000000000, 01234)
> > +  # ea4d2d9efcc9 = hash(000000000000, 01234)
> > +Appended rev 0 as ea4d2d9efcc9
> > +  # 7bc003d2bcd3 = hash(ea4d2d9efcc9, 012T34)
> > +Appended rev 1 as 7bc003d2bcd3
> > +  # 3338e2e7da35 = hash(7bc003d2bcd3, 02T35)
> > +  # 3338e2e7da35 = hash(7bc003d2bcd3, 02T35)
> > +Appended rev 2 as 3338e2e7da35
> > +
> > +REV RAW DATA    CACHE BEFORE -> AFTER
> > +  1 y   +012T34 None            1 +012T34
> > +  # 3338e2e7da35 = hash(7bc003d2bcd3, 02T35)
> > +  2 y   02T35   1 +012T34       2 02T35
> > +
> > +  # 7bc003d2bcd3 = hash(ea4d2d9efcc9, 012T34)
> > +  1     012T34  None            1 +012T34
> > +  # 3338e2e7da35 = hash(7bc003d2bcd3, 02T35)
> > +  2     02T35   1 +012T34       2 02T35
> > +
> > +  1 y   +012T34 None            1 +012T34
> > +  # 7bc003d2bcd3 = hash(ea4d2d9efcc9, 012T34)
> > +  1     012T34  1 +012T34       1 +012T34
> > +  # 3338e2e7da35 = hash(7bc003d2bcd3, 02T35)
> > +  2 y   02T35   1 +012T34       2 02T35
> > +  # 3338e2e7da35 = hash(7bc003d2bcd3, 02T35)
> > +  2     02T35   2 02T35         2 02T35
> > +
> > +  # 7bc003d2bcd3 = hash(ea4d2d9efcc9, 012T34)
> > +  1     012T34  None            1 +012T34
> > +  1 y   +012T34 1 +012T34       1 +012T34
> > +  # 3338e2e7da35 = hash(7bc003d2bcd3, 02T35)
> > +  2     02T35   1 +012T34       2 02T35
> > +  2 y   02T35   2 02T35         2 02T35
> > +
> > +  # 3338e2e7da35 = hash(7bc003d2bcd3, 02T35)
> > +  2 y   02T35   None            2 02T35
> > +  1 y   +012T34 2 02T35         1 +012T34
> > +
> > +  # 3338e2e7da35 = hash(7bc003d2bcd3, 02T35)
> > +  2     02T35   None            2 02T35
> > +  # 7bc003d2bcd3 = hash(ea4d2d9efcc9, 012T34)
> > +  1     012T34  2 02T35         1 +012T34
> > +
> > +  # 3338e2e7da35 = hash(7bc003d2bcd3, 02T35)
> > +  2 y   02T35   None            2 02T35
> > +  # 3338e2e7da35 = hash(7bc003d2bcd3, 02T35)
> > +  2     02T35   2 02T35         2 02T35
> > +  1 y   +012T34 2 02T35         1 +012T34
> > +  # 7bc003d2bcd3 = hash(ea4d2d9efcc9, 012T34)
> > +  1     012T34  1 +012T34       1 +012T34
> > +
> > +  # 3338e2e7da35 = hash(7bc003d2bcd3, 02T35)
> > +  2     02T35   None            2 02T35
> > +  2 y   02T35   2 02T35         2 02T35
> > +  # 7bc003d2bcd3 = hash(ea4d2d9efcc9, 012T34)
> > +  1     012T34  2 02T35         1 +012T34
> > +  1 y   +012T34 1 +012T34       1 +012T34
> > +
Ryan McElroy - March 30, 2017, 8:03 a.m.
On 3/29/17 4:34 PM, Jun Wu wrote:
> Although that's doable, that requires significant time. And the bugs are
> cancelling each other somehow: When you fix none, test 1 fails. When you fix
> test 1, test 2 previously passed fails. Since the bugs won't be able to be
> fixed incrementally (fixing the second introduce regression on the first),
> I'll prefer keeping it this way, that is to test all cases after the fixes.

This is a great point. Carry on.

Patch

diff --git a/tests/test-revlog-raw.py b/tests/test-revlog-raw.py
new file mode 100644
--- /dev/null
+++ b/tests/test-revlog-raw.py
@@ -0,0 +1,121 @@ 
+# test revlog interaction about raw data (flagprocessor)
+
+from __future__ import absolute_import, print_function
+
+from mercurial import (
+    encoding,
+    extensions,
+    node,
+    revlog,
+    transaction,
+    vfs,
+)
+
+# TESTTMP is optional - so the test can run outside run-tests.py
+tvfs = vfs.vfs(encoding.environ.get('TESTTMP', b'/tmp'))
+
+# format utilities
+def rightpad(data, length):
+    return (data + b' ' * length)[:length]
+
+def normalize(data):
+    data = data.replace(b'LINEPREFIX_', b'').replace(b'\n', b'')
+    return rightpad(data, 7)
+
+# register a revlog processor for EXTSTORED
+def readprocessor(self, text):
+    return text[2:], True
+
+def writeprocessor(self, text):
+    return b'+\n' + text, False
+
+def rawprocessor(self, text):
+    # do not check hash
+    return False
+
+revlog.addflagprocessor(revlog.REVIDX_EXTSTORED,
+                        (readprocessor, writeprocessor, rawprocessor))
+
+# replace revlog.hash so the test is sensitive to anything that goes wrong
+def _hash(orig, text, p1, p2):
+    r = orig(text, p1, p2)
+    print('  # %s = hash(%s, %s)'
+          % (node.short(r), node.short(p1), normalize(text).strip()))
+    return r
+
+extensions.wrapfunction(revlog, 'hash', _hash)
+
+# writing revlog requires a transaction
+def report(msg):
+    print('transaction: %s' % msg.strip())
+
+tr = transaction.transaction(report, tvfs, {'plain': tvfs}, b'journal')
+
+def newrevlog(name='foo.i'):
+    return revlog.revlog(tvfs, 'foo.i')
+
+rl = newrevlog()
+
+# We are going to construct something like:
+# rev 0: delta-base
+# rev 1: delta against rev 0, with special flag
+# rev 2: delta against rev 1, without special flag
+
+def appendrev(rlog, data, flags=revlog.REVIDX_DEFAULT_FLAGS):
+    nextrev = len(rlog)
+    p1 = rlog.node(nextrev - 1)
+    p2 = node.nullid
+    try:
+        n = rl.addrevision(data, tr, nextrev, p1, p2, flags=flags)
+        print(b'Appended rev %d as %s' % (nextrev, node.short(n)))
+    except Exception as ex:
+        print(b'Failed to append rev %d: %s' % (nextrev, ex))
+
+# revision 0
+data = b''.join(b'LINEPREFIX_%d\n' % i for i in range(5))
+appendrev(rl, data)
+
+# revision 1
+data = data.replace(b'3\n', b'T\n3\n')
+appendrev(rl, data, flags=revlog.REVIDX_EXTSTORED)
+
+# revision 2
+data = data.replace(b'1\n', b'').replace('4\n', '5')
+appendrev(rl, data)
+
+tr.close()
+
+# examine content of the revlog
+print(b'\nREV RAW DATA    CACHE BEFORE -> AFTER')
+
+def describecache(rlog):
+    if rlog._cache is None:
+        cache = b'None'
+    else:
+        n, r, text = rlog._cache
+        cache = b'%d %s' % (r, normalize(text))
+    return rightpad(cache, 9)
+
+def printrev(rlog, rev, raw):
+    cachebefore = describecache(rlog)
+    try:
+        data = rlog.revision(rev, raw=raw)
+    except Exception:
+        data = b'Error'
+    cacheafter = describecache(rlog).strip()
+    print(b'%3d %s   %s %s       %s'
+          % (rev, raw and 'y' or ' ', normalize(data), cachebefore, cacheafter))
+
+for revorder in [[1, 2], [2, 1]]:
+    for raworder in [[True], [False], [True, False], [False, True]]:
+        rlog = newrevlog()
+        for rev in revorder:
+            for raw in raworder:
+                printrev(rlog, rev, raw)
+        print()
+
+# when run outside run-tests.py, clean-ups are needed
+if 'TESTTMP' not in encoding.environ:
+    rl = None
+    tvfs.tryunlink('foo.i')
+    tvfs.tryunlink('foo.d')
diff --git a/tests/test-revlog-raw.py.out b/tests/test-revlog-raw.py.out
new file mode 100644
--- /dev/null
+++ b/tests/test-revlog-raw.py.out
@@ -0,0 +1,58 @@ 
+  # ea4d2d9efcc9 = hash(000000000000, 01234)
+  # ea4d2d9efcc9 = hash(000000000000, 01234)
+Appended rev 0 as ea4d2d9efcc9
+  # 7bc003d2bcd3 = hash(ea4d2d9efcc9, 012T34)
+Appended rev 1 as 7bc003d2bcd3
+  # 3338e2e7da35 = hash(7bc003d2bcd3, 02T35)
+  # 3338e2e7da35 = hash(7bc003d2bcd3, 02T35)
+Appended rev 2 as 3338e2e7da35
+
+REV RAW DATA    CACHE BEFORE -> AFTER
+  1 y   +012T34 None            1 +012T34
+  # 3338e2e7da35 = hash(7bc003d2bcd3, 02T35)
+  2 y   02T35   1 +012T34       2 02T35
+
+  # 7bc003d2bcd3 = hash(ea4d2d9efcc9, 012T34)
+  1     012T34  None            1 +012T34
+  # 3338e2e7da35 = hash(7bc003d2bcd3, 02T35)
+  2     02T35   1 +012T34       2 02T35
+
+  1 y   +012T34 None            1 +012T34
+  # 7bc003d2bcd3 = hash(ea4d2d9efcc9, 012T34)
+  1     012T34  1 +012T34       1 +012T34
+  # 3338e2e7da35 = hash(7bc003d2bcd3, 02T35)
+  2 y   02T35   1 +012T34       2 02T35
+  # 3338e2e7da35 = hash(7bc003d2bcd3, 02T35)
+  2     02T35   2 02T35         2 02T35
+
+  # 7bc003d2bcd3 = hash(ea4d2d9efcc9, 012T34)
+  1     012T34  None            1 +012T34
+  1 y   +012T34 1 +012T34       1 +012T34
+  # 3338e2e7da35 = hash(7bc003d2bcd3, 02T35)
+  2     02T35   1 +012T34       2 02T35
+  2 y   02T35   2 02T35         2 02T35
+
+  # 3338e2e7da35 = hash(7bc003d2bcd3, 02T35)
+  2 y   02T35   None            2 02T35
+  1 y   +012T34 2 02T35         1 +012T34
+
+  # 3338e2e7da35 = hash(7bc003d2bcd3, 02T35)
+  2     02T35   None            2 02T35
+  # 7bc003d2bcd3 = hash(ea4d2d9efcc9, 012T34)
+  1     012T34  2 02T35         1 +012T34
+
+  # 3338e2e7da35 = hash(7bc003d2bcd3, 02T35)
+  2 y   02T35   None            2 02T35
+  # 3338e2e7da35 = hash(7bc003d2bcd3, 02T35)
+  2     02T35   2 02T35         2 02T35
+  1 y   +012T34 2 02T35         1 +012T34
+  # 7bc003d2bcd3 = hash(ea4d2d9efcc9, 012T34)
+  1     012T34  1 +012T34       1 +012T34
+
+  # 3338e2e7da35 = hash(7bc003d2bcd3, 02T35)
+  2     02T35   None            2 02T35
+  2 y   02T35   2 02T35         2 02T35
+  # 7bc003d2bcd3 = hash(ea4d2d9efcc9, 012T34)
+  1     012T34  2 02T35         1 +012T34
+  1 y   +012T34 1 +012T34       1 +012T34
+