Patchwork [2,of,2,STABLE] log: copy the way of ancestor traversal to --follow matcher (issue5376)

Submitter Yuya Nishihara
Date Sept. 24, 2016, 2 p.m.
Message ID <4237fd4f828807a9f8d7.1474725638@mimosa>
Permalink /patch/16778/
State Accepted

Comments

Yuya Nishihara - Sept. 24, 2016, 2 p.m.
# HG changeset patch
# User Yuya Nishihara <yuya@tcha.org>
# Date 1474714703 -32400
#      Sat Sep 24 19:58:23 2016 +0900
# Branch stable
# Node ID 4237fd4f828807a9f8d711625da00b24a2ca9461
# Parent  bf616dd17b4244d4d5fc92b0ada65404596fe4b2
# EXP-Topic followmatcher
log: copy the way of ancestor traversal to --follow matcher (issue5376)

We can't use fctx.linkrev() because follow() revset tries hard to simulate
the traversal of changelog DAG, not filelog DAG. This patch fixes
_makefollowlogfilematcher() to walk file ancestors in the same way as
revset._follow().

I'll factor out a common function in future patches.
Jun Wu - Sept. 24, 2016, 5:59 p.m.
Thanks for the clean fix! I was trying a different approach but couldn't get
it right easily. These patches should be the correct fix. Marked
Pre-Reviewed.

A tricky case could be when multiple files are followed, for example:

  $ hg init repo
  $ cd repo
  $ for i in 1 2; do
  >   echo $i >> a
  >   hg commit -A a -m $i
  > done
  $ hg update 0 -q
  $ echo 2 >> a
  $ echo 1 > b
  $ hg commit -A b -A a -m '2 with b' -q
  $ hg update 1 -q
  $ echo 1 > b
  $ echo 1 > a
  $ hg commit -A b a -m 'b'
  $ hg merge 2 -q
  $ echo 1 > b
  $ hg commit -m merge
  $ hg log -G -p -T '{rev}:{node|short} {desc}'        # more patches
  $ hg log -G -p -T '{rev}:{node|short} {desc}' -f a b # fewer patches

At first sight, it seemed wrong to output fewer patches for the second
command. But on second thought, I believe showing fewer patches is the
sane and desired behavior.

- off topic -

By the way, I just encountered an _adjustlinkrev-related perf issue when
working on fastannotate internally. introrev (_adjustlinkrev) can take
seconds (and sadly it just returns ctx.linkrev() most of the time). Seconds
are unacceptable compared with the annotate logic, which only takes tens of
milliseconds.

I think a stupid way to make it better is to have a map of {fctxnode:
[linknode]}. To make the map small, it only stores entries where
len([linknode]) > 1. The challenge would be how to keep the map up-to-date.
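
A minimal sketch of building such a map for a single file, brute force and
for illustration only (it records revs instead of linknodes for brevity,
and the function name is made up):

  from collections import defaultdict

  def builddupmap(repo, path):
      # for every filenode of "path", collect the changesets that list
      # the file as changed, i.e. the candidate introducing revisions
      intros = defaultdict(list)
      for rev in repo:
          ctx = repo[rev]
          if path in ctx.files() and path in ctx:
              intros[ctx.filenode(path)].append(rev)
      # keep only the ambiguous entries; a single candidate means the
      # stored linkrev is already accurate
      return dict((fnode, revs) for fnode, revs in intros.items()
                  if len(revs) > 1)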


Excerpts from Yuya Nishihara's message of 2016-09-24 23:00:38 +0900:
> # HG changeset patch
> # User Yuya Nishihara <yuya@tcha.org>
> # Date 1474714703 -32400
> #      Sat Sep 24 19:58:23 2016 +0900
> # Branch stable
> # Node ID 4237fd4f828807a9f8d711625da00b24a2ca9461
> # Parent  bf616dd17b4244d4d5fc92b0ada65404596fe4b2
> # EXP-Topic followmatcher
> log: copy the way of ancestor traversal to --follow matcher (issue5376)
> 
> We can't use fctx.linkrev() because follow() revset tries hard to simulate
> the traversal of changelog DAG, not filelog DAG. This patch fixes
> _makefollowlogfilematcher() to walk file ancestors in the same way as
> revset._follow().
> 
> I'll factor out a common function in future patches.
> 
> diff --git a/mercurial/cmdutil.py b/mercurial/cmdutil.py
> --- a/mercurial/cmdutil.py
> +++ b/mercurial/cmdutil.py
> @@ -1940,7 +1940,7 @@ def _makefollowlogfilematcher(repo, file
>      # --follow, we want the names of the ancestors of FILE in the
>      # revision, stored in "fcache". "fcache" is populated by
>      # reproducing the graph traversal already done by --follow revset
> -    # and relating linkrevs to file names (which is not "correct" but
> +    # and relating revs to file names (which is not "correct" but
>      # good enough).
>      fcache = {}
>      fcacheready = [False]
> @@ -1949,9 +1949,9 @@ def _makefollowlogfilematcher(repo, file
>      def populate():
>          for fn in files:
>              fctx = pctx[fn]
> -            fcache.setdefault(fctx.linkrev(), set()).add(fctx.path())
> +            fcache.setdefault(fctx.introrev(), set()).add(fctx.path())
>              for c in fctx.ancestors(followfirst=followfirst):
> -                fcache.setdefault(c.linkrev(), set()).add(c.path())
> +                fcache.setdefault(c.rev(), set()).add(c.path())
>  
>      def filematcher(rev):
>          if not fcacheready[0]:
> diff --git a/tests/test-log.t b/tests/test-log.t
> --- a/tests/test-log.t
> +++ b/tests/test-log.t
> @@ -920,6 +920,78 @@ log -r tip --stat
>  
>    $ cd ..
>  
> +log --follow --patch FILE in repository where linkrev isn't trustworthy
> +(issue5376)
> +
> +  $ hg init follow-dup
> +  $ cd follow-dup
> +  $ cat <<EOF >> .hg/hgrc
> +  > [ui]
> +  > logtemplate = '=== {rev}: {desc}\n'
> +  > [diff]
> +  > nodates = True
> +  > EOF
> +  $ echo 0 >> a
> +  $ hg ci -qAm 'a0'
> +  $ echo 1 >> a
> +  $ hg ci -m 'a1'
> +  $ hg up -q 0
> +  $ echo 1 >> a
> +  $ touch b
> +  $ hg ci -qAm 'a1 with b'
> +  $ echo 3 >> a
> +  $ hg ci -m 'a3'
> +
> + fctx.rev() == 2, but fctx.linkrev() == 1
> +
> +  $ hg log -pf a
> +  === 3: a3
> +  diff -r 4ea02ba94d66 -r e7a6331a34f0 a
> +  --- a/a
> +  +++ b/a
> +  @@ -1,2 +1,3 @@
> +   0
> +   1
> +  +3
> +  
> +  === 2: a1 with b
> +  diff -r 49b5e81287e2 -r 4ea02ba94d66 a
> +  --- a/a
> +  +++ b/a
> +  @@ -1,1 +1,2 @@
> +   0
> +  +1
> +  
> +  === 0: a0
> +  diff -r 000000000000 -r 49b5e81287e2 a
> +  --- /dev/null
> +  +++ b/a
> +  @@ -0,0 +1,1 @@
> +  +0
> +  
> +
> + fctx.introrev() == 2, but fctx.linkrev() == 1
> +
> +  $ hg up -q 2
> +  $ hg log -pf a
> +  === 2: a1 with b
> +  diff -r 49b5e81287e2 -r 4ea02ba94d66 a
> +  --- a/a
> +  +++ b/a
> +  @@ -1,1 +1,2 @@
> +   0
> +  +1
> +  
> +  === 0: a0
> +  diff -r 000000000000 -r 49b5e81287e2 a
> +  --- /dev/null
> +  +++ b/a
> +  @@ -0,0 +1,1 @@
> +  +0
> +  
> +
> +  $ cd ..
> +
>  Test that log should respect the order of -rREV even if multiple OR conditions
>  are specified (issue5100):
>
Yuya Nishihara - Sept. 25, 2016, 3:31 p.m.
On Sat, 24 Sep 2016 18:59:45 +0100, Jun Wu wrote:
> A tricky case could be when multiple files are followed, for example:
> 
>   $ hg init repo
>   $ cd repo
>   $ for i in 1 2; do
>   >   echo $i >> a
>   >   hg commit -A a -m $i
>   > done
>   $ hg update 0 -q
>   $ echo 2 >> a
>   $ echo 1 > b
>   $ hg commit -A b -A a -m '2 with b' -q
>   $ hg update 1 -q
>   $ echo 1 > b
>   $ echo 1 > a
>   $ hg commit -A b a -m 'b'
>   $ hg merge 2 -q
>   $ echo 1 > b
>   $ hg commit -m merge
>   $ hg log -G -p -T '{rev}:{node|short} {desc}'        # more patches
>   $ hg log -G -p -T '{rev}:{node|short} {desc}' -f a b # fewer patches
> 
> At first sight, it seemed wrong to output fewer patches for the second
> command. But on second thought, I believe showing fewer patches is the
> sane and desired behavior.

I'm not sure which is correct, but that's what follow() is currently doing,
so it's reasonable behavior. repo[4]['b'].ancestors() can't walk both parents
right now.

> - off topic -
> 
> By the way, I just encountered an _adjustlinkrev-related perf issue when
> working on fastannotate internally. introrev (_adjustlinkrev) can take
> seconds (and sadly it just returns ctx.linkrev() most of the time). Seconds
> are unacceptable compared with the annotate logic, which only takes tens of
> milliseconds.
> 
> I think a stupid way to make it better is to have a map of {fctxnode:
> [linknode]}. To make the map small, it only stores entries where
> len([linknode]) > 1. The challenge would be how to keep the map up-to-date.

That will reduce some cost of _adjustlinkrev(), but maybe we'll still need to
find which linknode is the ancestor of the annotated revision.
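
A rough sketch of that remaining step, assuming the candidate revs are
already cached somewhere (building a full ancestor set is the naive part):

  def pickintrorev(repo, candidaterevs, baserev):
      # keep only candidates that are ancestors of the revision we are
      # annotating from, and prefer the earliest one
      cl = repo.changelog
      ancs = set(cl.ancestors([baserev], inclusive=True))
      valid = sorted(r for r in candidaterevs if r in ancs)
      return valid[0] if valid else None
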
Pierre-Yves David - Sept. 27, 2016, 2 p.m.
On 09/24/2016 04:00 PM, Yuya Nishihara wrote:
> # HG changeset patch
> # User Yuya Nishihara <yuya@tcha.org>
> # Date 1474714703 -32400
> #      Sat Sep 24 19:58:23 2016 +0900
> # Branch stable
> # Node ID 4237fd4f828807a9f8d711625da00b24a2ca9461
> # Parent  bf616dd17b4244d4d5fc92b0ada65404596fe4b2
> # EXP-Topic followmatcher
> log: copy the way of ancestor traversal to --follow matcher (issue5376)
>
> We can't use fctx.linkrev() because follow() revset tries hard to simulate
> the traversal of changelog DAG, not filelog DAG. This patch fixes
> _makefollowlogfilematcher() to walk file ancestors in the same way as
> revset._follow().
>
> I'll factor out a common function in future patches.

Pushed, thanks. (Also, thanks to Jun for the pre-review.)

Cheers,
Pierre-Yves David - Sept. 27, 2016, 2:08 p.m.
On 09/24/2016 07:59 PM, Jun Wu wrote:
> Thanks for the clean fix! I was trying a different approach but cannot get
> it right easily. These patches should be the correct fix. Marked
> Pre-Reviewed.
>
> A tricky case could be when multiple files are followed, for example:
>
>   $ hg init repo
>   $ cd repo
>   $ for i in 1 2; do
>   >   echo $i >> a
>   >   hg commit -A a -m $i
>   > done
>   $ hg update 0 -q
>   $ echo 2 >> a
>   $ echo 1 > b
>   $ hg commit -A b -A a -m '2 with b' -q
>   $ hg update 1 -q
>   $ echo 1 > b
>   $ echo 1 > a
>   $ hg commit -A b a -m 'b'
>   $ hg merge 2 -q
>   $ echo 1 > b
>   $ hg commit -m merge
>   $ hg log -G -p -T '{rev}:{node|short} {desc}'        # more patches
>   $ hg log -G -p -T '{rev}:{node|short} {desc}' -f a b # fewer patches
>
> At first sight, it seemed wrong to output fewer patches for the second
> command. But on second thought, I believe showing fewer patches is the
> sane and desired behavior.
>
> - off topic -
>
> By the way, I just encountered an _adjustlinkrev-related perf issue when
> working on fastannotate internally. introrev (_adjustlinkrev) can take
> seconds (and sadly it just returns ctx.linkrev() most of the time). Seconds
> are unacceptable compared with the annotate logic, which only takes tens of
> milliseconds.

Did you check that the ancestry context was properly inherited from one
check to another? This helps with the cost of _adjustlinkrev.

> I think a stupid way to make it better is to have a map of {fctxnode:
> [linknode]}. To make the map small, it only stores entries where
> len([linknode]) > 1. The challenge would be how to keep the map up-to-date.

A smaller step would be to have a fast way to know if a fnode is used 
more than once. We can skip linkrev adjustment for anything with only 
one entry.

A "multiple entry flag" have this interesting similarity with other 
proper with are considering bitmap for: "public", "obs". It is a 
"forward only" property and we could reuse transaction safe logic for 
all of them.


Building a very basic version of this (e.g. no cache invalidation, simple
JSON on disk) would be a good way to get some numbers and see how much it
actually improves performance.
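
For example, a prototype could be as small as this (the cache path and the
function names are made up):

  import json

  def loadmulti(repo):
      # {hex fnode: True} only for filenodes reused by more than one
      # changeset; anything absent can skip linkrev adjustment
      try:
          return json.load(repo.vfs('cache/multilinkrev.json'))
      except IOError:
          return {}

  def savemulti(repo, multi):
      # deliberately no invalidation - just good enough to gather numbers
      repo.vfs.write('cache/multilinkrev.json', json.dumps(multi))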

Cheers,
Jun Wu - Sept. 27, 2016, 3:24 p.m.
I changed the subject to reflect the actual topic.

The map, or whatever structure replaces one single linkrev with a list of
"introrev"s (or flags indicating it is not unique), is just a hack.

The long-term solution would be something different: ideally, each fctx has
one unique linkrev, and any filelog is a subgraph of the changelog.

It seems we have to add some noise when creating a new filenode to address
the problem. But a hash change is a big BC. So one crazier idea is to extend
the hash - we still have 16 bytes unused for nodeids everywhere. So it would
be like:

  - filenode would have extra N bytes for each revision, purely noise
  - manifest would be the complex part - we need to record those N bytes for
    each file, while we don't want to have BC or hash change.
    The solution would be to have a separate manifest just storing
    the noise bytes for each file path. So the hash of the main manifest does
    not change - thus compatible with old clients.
    Manifest nodeid would be extended by N bytes as well.
  - changelog nodeid would have a similar change - extra N bytes
    representing the noise.

That sounds ugly anyway. If we can make a big BC on the hash (just like
treemanifest), I think it may be the time to change more things, like:

  - Include the order of p1, p2 in the hash
  - Like git, hash file blobs without parentids, say it's blobid, so we can
    reuse blobs for copies, reverts, and across repos.
  - Generate some noise (like, uuid) for each changeset being created
    (thanks Durham for this idea), in filelog, let nodeid = hash((blobid,
    p1, p2, intro-changeset-uuid)). The uuid could be reused for the amend
    case.

Notes:

  - The noise is a dig at my own "reduce-noise" attempt. I was planning to
    move amend_source etc. to obsstore. But on second thought, I
    believe we have to have some kind of noise for the performance of
    single-file traversal.
  - We could provide a command to hash the tree if we have the blob layer,
    for users who want to check if two commits have exactly the same file
    contents.

Excerpts from Pierre-Yves David's message of 2016-09-27 16:08:33 +0200:
> On 09/24/2016 07:59 PM, Jun Wu wrote:
> > By the way, I just encountered an _adjustlinkrev-related perf issue when
> > working on fastannotate internally. introrev (_adjustlinkrev) can take
> > seconds (and sadly it just returns ctx.linkrev() most of the time). Seconds
> > are unacceptable compared with the annotate logic, which only takes tens of
> > milliseconds.
> 
> Did you check that the ancestry context was properly inherited from one
> check to another? This helps with the cost of _adjustlinkrev.
> 
> > I think a stupid way to make it better is to have a map of {fctxnode:
> > [linknode]}. To make the map small, it only stores entries where
> > len([linknode]) > 1. The challenge would be how to keep the map up-to-date.
> 
> A smaller step would be to have a fast way to know if a fnode is used 
> more than once. We can skip linkrev adjustment for anything with only 
> one entry.
> 
> A "multiple entry flag" have this interesting similarity with other 
> proper with are considering bitmap for: "public", "obs". It is a 
> "forward only" property and we could reuse transaction safe logic for 
> all of them.
> 
> 
> Building a very basic version of this (e.g. no cache invalidation, simple
> JSON on disk) would be a good way to get some numbers and see how much it
> actually improves performance.
Gregory Szorc - Sept. 27, 2016, 5:04 p.m.
On Tue, Sep 27, 2016 at 8:24 AM, Jun Wu <quark@fb.com> wrote:

> I changed the subject to reflect the actual topic.
>
> The map or whatever structure to replace one single linkrev to a list of
> "introrev"s (or flags indicating it is not unique) is just a hack.
>
> The long-term solution would be something different, ideally each fctx has
> one unique linkrev, and any filelog is a subgraph of the changelog.
>
> It seems we have to add some noise when creating a new filenode to address
> the problem. But hash change is a big BC. So one crazier idea is to extend
> the hash - we still have 16 bytes unused for nodeid everywhere. So it would
> be like:
>
>   - filenode would have extra N bytes for each revision, purely noise
>   - manifest would be the complex part - we need to record those N bytes for
>     each file, while we don't want to have BC or hash change.
>     The solution would be like to have a separated manifest just storing
>     those noises for each file path. So the hash of the main manifest does
>     not change - thus compatible with old clients.
>     Manifest nodeid would be extended by N bytes as well.
>   - changelog nodeid would have a similar change - extra N bytes
>     representing the noise.
>
> That sounds ugly anyway. If we can make a big BC on the hash (just like
> treemanifest), I think it may be the time to change more things, like:
>
>   - Include the order of p1, p2 in the hash
>   - Like git, hash file blobs without parentids, say it's blobid, so we can
>     reuse blobs for copies, reverts, and across repos.
>   - Generate some noise (like, uuid) for each changeset being created
>     (thanks Durham for this idea), in filelog, let nodeid = hash((blobid,
>     p1, p2, intro-changeset-uuid)). The uuid could be reused for the amend
>     case.
>

As you noted, things get real ugly real quick. There are also implications
for changegroup transfer, which does hardcode 20 bytes for the hash.

I tend to think of linkrev as just another piece of revision-derived
metadata that could be "cached" better. When you view it through that lens,
you realize there are tons of other things we could also throw into a
generic "derived from revision" cache. This could even include parts of
changeset data, such as the author, branch, and files changed. Depending on
how the cache is implemented (say you used SQLite), this could result in
some significant performance wins for certain operations. e.g. an index of
file changes in changesets would make `hg log --removed` significantly
faster. If you moved the changelog to something that wasn't a revlog, it
also makes shallow and narrow clone much easier to implement since
out-of-order insertions would be allowed.
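
For instance, a toy version of the files-changed index could look like this
(the schema is invented here, not a proposal for the real format):

  import sqlite3

  def opencache(path):
      db = sqlite3.connect(path)
      db.execute('CREATE TABLE IF NOT EXISTS filechanges '
                 '(path TEXT, rev INTEGER)')
      db.execute('CREATE INDEX IF NOT EXISTS filechanges_path '
                 'ON filechanges (path)')
      return db

  def record(db, rev, files):
      # populated once per changeset, e.g. while applying a changegroup
      db.executemany('INSERT INTO filechanges VALUES (?, ?)',
                     [(f, rev) for f in files])
      db.commit()

  def revswithfile(db, path):
      # would turn `hg log --removed FILE` into an index lookup
      return [r for (r,) in
              db.execute('SELECT rev FROM filechanges WHERE path = ?',
                         (path,))]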

Thinking even bigger picture, there seem to be enough potential areas for
improvement across many areas to - and I don't say this lightly - consider
a new store format. Regardless of whether we decide to take that giant
leap, I think it is important to keep a list of "known problems" and
"things we would fix if we implemented things from scratch." Perhaps the
"internals" help pages could start documenting these things?


>
> Notes:
>
>   - The noise is a dig at my own "reduce-noise" attempt. I was planning to
>     move amend_source etc. to obsstore. But on second thought, I
>     believe we have to have some kind of noise for the performance of
>     single-file traversal.
>   - We could provide a command to hash the tree if we have the blob layer,
>     for users who want to check if two commits have exactly the same file
>     contents.
>
> Excerpts from Pierre-Yves David's message of 2016-09-27 16:08:33 +0200:
> > On 09/24/2016 07:59 PM, Jun Wu wrote:
> > > By the way, I just encountered an _adjustlinkrev-related perf issue when
> > > working on fastannotate internally. introrev (_adjustlinkrev) can take
> > > seconds (and sadly it just returns ctx.linkrev() most of the time). Seconds
> > > are unacceptable compared with the annotate logic, which only takes tens of
> > > milliseconds.
> >
> > Did you check that the ancestry context was properly inherited from one
> > check to another? This helps with the cost of _adjustlinkrev.
> >
> > > I think a stupid way to make it better is to have a map of {fctxnode:
> > > [linknode]}. To make the map small, it only stores entries where
> > > len([linknode]) > 1. The challenge would be how to keep the map
> > > up-to-date.
> >
> > A smaller step would be to have a fast way to know if a fnode is used
> > more than once. We can skip linkrev adjustment for anything with only
> > one entry.
> >
> > A "multiple entry flag" have this interesting similarity with other
> > proper with are considering bitmap for: "public", "obs". It is a
> > "forward only" property and we could reuse transaction safe logic for
> > all of them.
> >
> >
> > Building a very basic version of this (e.g. no cache invalidation, simple
> > JSON on disk) would be a good way to get some numbers and see how much it
> > actually improves performance.
Durham Goode - Sept. 27, 2016, 8:31 p.m.
On 9/27/16 8:24 AM, Jun Wu wrote:
> I changed the subject to reflect the actual topic.
>
> The map or whatever structure to replace one single linkrev to a list of
> "introrev"s (or flags indicating it is not unique) is just a hack.
Sure it's imperfect, but it won't require any BC breakages, and it will 
likely fix the problem for the majority of use cases (by letting us only 
scan the changelog .i for potential ancestor linknodes and avoid 
scanning the changelog .d).
>
> The long-term solution would be something different, ideally each fctx has
> one unique linkrev, and any filelog is a subgraph of the changelog.
>
> It seems we have to add some noise when creating a new filenode to address
> the problem. But hash change is a big BC. So one crazier idea is to extend
> the hash - we still have 16 bytes unused for nodeid everywhere. So it would
> be like:
>
>    - filenode would have extra N bytes for each revision, purely noise
>    - manifest would be the complex part - we need to record those N bytes for
>      each file, while we don't want to have BC or hash change.
>      The solution would be like to have a separated manifest just storing
>      those noises for each file path. So the hash of the main manifest does
>      not change - thus compatible with old clients.
>      Manifest nodeid would be extended by N bytes as well.
>    - changelog nodeid would have a similar change - extra N bytes
>      representing the noise.
I don't fully understand.  You say the extra bytes don't affect the hash 
of the main manifest.  If it's not affecting the hash, how is it fixing 
the problem?  If it does affect the hash, where does it affect it.
>
> That sounds ugly anyway. If we can make a big BC on the hash (just like
> treemanifest), I think it may be the time to change more things, like:
I honestly don't think we'll be able to make a BC breaking hash change 
in upstream Mercurial.  Nor do I think the benefits outweigh the 
downsides for the community.
>
>    - Include the order of p1, p2 in the hash
The order is already p1-then-p2. Why does it need recording?
>    - Like git, hash file blobs without parentids, say it's blobid, so we can
>      reuse blobs for copies, reverts, and across repos.
>    - Generate some noise (like, uuid) for each changeset being created
>      (thanks Durham for this idea), in filelog, let nodeid = hash((blobid,
>      p1, p2, intro-changeset-uuid)). The uuid could be reused for the amend
>      case.
(this was Matt's idea.  I just regurgitated it)
>
> Notes:
>
>    - The noise is a dig at my own "reduce-noise" attempt. I was planning to
>      move amend_source etc. to obsstore. But after a second thought, I
>      believe we have to have some kind of noise for the performance of
>      single-file traversal.
>    - We could provide a command to hash the tree if we have the blob layer,
>      for users who want to check if two commits have the exactly same file
>      contents.
>
> Excerpts from Pierre-Yves David's message of 2016-09-27 16:08:33 +0200:
>> On 09/24/2016 07:59 PM, Jun Wu wrote:
>>> By the way, I just encountered an _adjustlinkrev-related perf issue when
>>> working on fastannotate internally. introrev (_adjustlinkrev) can take
>>> seconds (and sadly it just returns ctx.linkrev() most of the time). Seconds
>>> are unacceptable compared with the annotate logic, which only takes tens of
>>> milliseconds.
>> Did you check that the ancestry context was properly inherited from one
>> check to another? This helps with the cost of _adjustlinkrev.
>>
>>> I think a stupid way to make it better is to have a map of {fctxnode:
>>> [linknode]}. To make the map small, it only stores entries where
>>> len([linknode]) > 1. The challenge would be how to keep the map up-to-date.
>> A smaller step would be to have a fast way to know if a fnode is used
>> more than once. We can skip linkrev adjustment for anything with only
>> one entry.
>>
>> A "multiple entry flag" have this interesting similarity with other
>> proper with are considering bitmap for: "public", "obs". It is a
>> "forward only" property and we could reuse transaction safe logic for
>> all of them.
>>
>>
>> Building a very basic version of this (e.g. no cache invalidation, simple
>> JSON on disk) would be a good way to get some numbers and see how much it
>> actually improves performance.
Jun Wu - Sept. 27, 2016, 9:15 p.m.
Excerpts from Durham Goode's message of 2016-09-27 13:31:22 -0700:
> On 9/27/16 8:24 AM, Jun Wu wrote:
> > I changed the subject to reflect the actual topic.
> >
> > The map or whatever structure to replace one single linkrev to a list of
> > "introrev"s (or flags indicating it is not unique) is just a hack.
> Sure it's imperfect, but it won't require any BC breakages, and it will 
> likely fix the problem for the majority of use cases (by letting us only 
> scan the changelog .i for potential ancestor linknodes and avoid 
> scanning the changelog .d).

Suppose the map is {fctx: [linkrev]}.

First, the map should be as small as possible, so it shouldn't contain an
fctx whose linkrev list contains only one revision - in that case, the
linkrev is accurate.

Then the problem becomes how to make sure the map is up-to-date. It's not
easy, imo. For example, say Alice has a map storing {fctx: [linkrev]} for
len([linkrev]) > 1, and Bob also has such a map locally. When Bob pulls from
Alice, it is possible that we need new entries that exist in neither
Alice's nor Bob's map - how do you find those entries in an efficient
way?

> > The long-term solution would be something different, ideally each fctx has
> > one unique linkrev, and any filelog is a subgraph of the changelog.
> >
> > It seems we have to add some noise when creating a new filenode to address
> > the problem. But hash change is a big BC. So one crazier idea is to extend
> > the hash - we still have 16 bytes unused for nodeid everywhere. So it would
> > be like:
> >
> >    - filenode would have extra N bytes for each revision, purely noise
> >    - manifest would be the complex part - we need to record those N bytes for
> >      each file, while we don't want to have BC or hash change.
> >      The solution would be like to have a separated manifest just storing
> >      those noises for each file path. So the hash of the main manifest does
> >      not change - thus compatible with old clients.
> >      Manifest nodeid would be extended by N bytes as well.
> >    - changelog nodeid would have a similar change - extra N bytes
> >      representing the noise.
> I don't fully understand.  You say the extra bytes don't affect the hash 
> of the main manifest.  If it's not affecting the hash, how is it fixing 
> the problem?  If it does affect the hash, where does it affect it.

It won't affect the first 20 bytes of the manifest hash. For example, a
manifest entry is like:
  
  path: 20-byte hash, 8-byte noise
  ^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^^^
   |                    |
   |                    +- only affect the extra 8 bytes of a manifest hash
   +- only affect first 20 bytes of a manifest hash

Manifest hash: 20-byte 8-byte
               ^^^^^^^ ^^^^^^
                |       |
                |       +- only read by new clients
                +-- compatible with old clients

Similarly for the changelog hash - the first 20 bytes are for old clients,
while the extra 8 bytes are for new clients.
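
In code, the split would look something like this (lengths and names are
illustrative only):

  NODELEN, NOISELEN = 20, 8

  def splitnode(extnode):
      # old clients read (and hash) only the first 20 bytes; the trailing
      # noise bytes exist purely to disambiguate otherwise-equal filenodes
      assert len(extnode) == NODELEN + NOISELEN
      return extnode[:NODELEN], extnode[NODELEN:]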

> > That sounds ugly anyway. If we can make a big BC on the hash (just like
> > treemanifest), I think it may be the time to change more things, like:
> I honestly don't think we'll be able to make a BC breaking hash change 
> in upstream Mercurial.  Nor do I think the benefits outweigh the 
> downsides for the community.

Umm... I was just wondering because treemanifest is a hash BC. Since we are
in the process of migrating to treemanifest, we still have a chance to make
other hash changes along with the migration. Although it seems unlikely to
happen.

> >    - Include the order of p1, p2 in the hash
> The order is already p1-then-p2. Why does it need recording?

Currently we hash sorted([p1, p2]), so if you swap p1 and p2 directly in
revlog, "hg verify" still passes. Since the order does affect some behavior,
like "hg diff", it's better to "include the order in the hash" so if you
swap, "hg verify" will complain.

> >    - Like git, hash file blobs without parentids, say it's blobid, so we can
> >      reuse blobs for copies, reverts, and across repos.
> >    - Generate some noise (like, uuid) for each changeset being created
> >      (thanks Durham for this idea), in filelog, let nodeid = hash((blobid,
> >      p1, p2, intro-changeset-uuid)). The uuid could be reused for the amend
> >      case.
> (this was Matt's idea.  I just regurgitated it)
Jun Wu - Sept. 27, 2016, 9:52 p.m.
Excerpts from Gregory Szorc's message of 2016-09-27 10:04:43 -0700:
> As you noted, things get real ugly real quick. There are also implications
> for changegroup transfer, which does hardcode 20 bytes for the hash.

The hardcoded 20 bytes are actually what I want - ideally, old clients only
read 20 bytes *everywhere*, so if we increase the hash length, it would be a
non-BC for old clients. That said, I'm not encouraging that we go this way.

> I tend to think of linkrev as just another piece of revision-derived
> metadata that could be "cached" better. When you view it through that lens,
> you realize there are tons of other things we could also throw into a
> generic "derived from revision" cache. This could even include parts of
> changeset data, such as the author, branch, and files changed. Depending on
> how the cache is implemented (say you used SQLite), this could result in
> some significant performance wins for certain operations. e.g. an index of
> file changes in changesets would make `hg log --removed` significantly
> faster. If you moved the changelog to something that wasn't a revlog, it
> also makes shallow and narrow clone much easier to implement since
> out-of-order insertions would be allowed.

I'm +1 for things that do not introduce RNG noise, if possible. (I'm
starting to reconsider the plan to move "amend_source" to "obsstore".)

One of my earlier ideas that I haven't posted here is to build extra
"linknode-manifest"s. Every "linknode-manifest" is associated to a
changelog revision. The linknode-manifest is like a manifest, but stores
different things:

  fctx.path(): fctx.introrev()

First, it is a cache. Second, it could be seen as "revision-derived" data.
The nice things about this approach are:
  - No RNG noise
  - No BC
  - No cache-invalidation issue - once built, 100% correct permanently

Some people think disk usage is a concern. I think it's "acceptable", or if
we really want, we could merge this info with the normal manifest on disk.

If we cannot break BC, I actually prefer this solution. This pattern also applies
to "how to make git scale".

> Thinking even bigger picture, there seem to be enough potential areas for
> improvement across many areas to - and I don't say this lightly - consider
> a new store format. Regardless of whether we decide to take that giant
> leap, I think it is important to keep a list of "known problems" and
> "things we would fix if we implemented things from scratch." Perhaps the
> "internals" help pages could start documenting these things?

I will update the wiki later.
Durham Goode - Sept. 27, 2016, 11:29 p.m.
On 9/27/16 2:15 PM, Jun Wu wrote:
> Excerpts from Durham Goode's message of 2016-09-27 13:31:22 -0700:
>> On 9/27/16 8:24 AM, Jun Wu wrote:
>>> I changed the subject to reflect the actual topic.
>>>
>>> The map or whatever structure to replace one single linkrev to a list of
>>> "introrev"s (or flags indicating it is not unique) is just a hack.
>> Sure it's imperfect, but it won't require any BC breakages, and it will
>> likely fix the problem for the majority of use cases (by letting us only
>> scan the changelog .i for potential ancestor linknodes and avoid
>> scanning the changelog .d).
> Suppose the map is {fctx: [linkrev]}.
>
> First, the map should be as small as possible, so it shouldn't contain fctx
> where its linkrev list contains only one revision - in that case, linkrev is
> accurate.
>
> Then the problem becomes how to make sure the map is up-to-date. It's not
> easy imo. For example, say Alice has a map storing {fctx: [linkrev]} for
> len([linkrev]) > 1, Bob also has such a map locally. When Bob pulls from
> Alice, it is possible that we need new entries that exist in neither
> Alice's nor Bob's map - how do you find those entries in an efficient
> way?
The map is just there to make the initial step of adjustlinkrev more 
likely to succeed.  When it scans the changelog.i, it knows of more 
nodes it can search for.  If it finds none of those, it reverts back to 
the old scan-changelog.d behavior and fills the cache with the result.  
commit, rebase, amend, etc. can also fill the cache as commits are
added.
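
In other words, something like the following flow, where both scan functions
are hypothetical stand-ins for the existing .i and .d walks:

  def findintrorev(repo, path, fnode, cache):
      # cheap pass: scan changelog.i, helped by the cached candidates
      rev = scanindex(repo, path, fnode, cache.get(fnode, ()))
      if rev is None:
          # slow pass: scan changelog.d, then remember the answer
          rev = scandata(repo, path, fnode)
          cache.setdefault(fnode, []).append(rev)
      return rev
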
>
>>> The long-term solution would be something different, ideally each fctx has
>>> one unique linkrev, and any filelog is a subgraph of the changelog.
>>>
>>> It seems we have to add some noise when creating a new filenode to address
>>> the problem. But hash change is a big BC. So one crazier idea is to extend
>>> the hash - we still have 16 bytes unused for nodeid everywhere. So it would
>>> be like:
>>>
>>>     - filenode would have extra N bytes for each revision, purely noise
>>>     - manifest would be the complex part - we need to record those N bytes for
>>>       each file, while we don't want to have BC or hash change.
>>>       The solution would be like to have a separated manifest just storing
>>>       those noises for each file path. So the hash of the main manifest does
>>>       not change - thus compatible with old clients.
>>>       Manifest nodeid would be extended by N bytes as well.
>>>     - changelog nodeid would have a similar change - extra N bytes
>>>       representing the noise.
>> I don't fully understand.  You say the extra bytes don't affect the hash
>> of the main manifest.  If it's not affecting the hash, how is it fixing
>> the problem?  If it does affect the hash, where does it affect it.
> It won't affect the first 20-bytes of the manifest hash. For example, a
> manifest entry is like:
>    
>    path: 20-byte hash, 8-byte noise
>    ^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^^^
>     |                    |
>     |                    +- only affect the extra 8 bytes of a manifest hash
>     +- only affect first 20 bytes of a manifest hash
>
> Manifest hash: 20-byte 8-byte
>                 ^^^^^^^ ^^^^^^
>                  |       |
>                  |       +- only read by new clients
>                  +-- compatible with old clients
>
> Similar to changelog hash - first 20 bytes are for old clients, while the
> extra 8 bytes are for new clients.
I'd have to think about this some more.  It sounds a little scary, and 
I'm not sure how it would work with the wireprotocol (does the wire 
protocol also have 12 bytes of free space?)
>>>     - Include the order of p1, p2 in the hash
>> The order is already p1-then-p2. Why does it need recording?
> Currently we hash sorted([p1, p2]), so if you swap p1 and p2 directly in
> revlog, "hg verify" still passes. Since the order does affect some behavior,
> like "hg diff", it's better to "include the order in the hash" so if you
> swap, "hg verify" will complain.
Ugh, I didn't realize they were sorted.
Jun Wu - Sept. 28, 2016, 8:18 a.m.
Excerpts from Durham Goode's message of 2016-09-27 16:29:42 -0700:
> The map is just there to make the initial step of adjustlinkrev more 
> likely to succeed.  When it scans the changelog.i, it knows of more 
> nodes it can search for.

This may change the behavior of "_adjustlinkrev" - instead of returning the
first ancestor satisfying the condition, it may return the second, or the
third etc.

> If it finds none of those, it reverts back to the old scan-changelog.d
> behavior and fills the cache with the result.

_adjustlinkrev does not have the full view of all "introrev"s a fctx can
have, since it only walks part of the changelog graph. Therefore it only
fills in part of the introrevs - the rest are "yet to be discovered".

> commit, rebase, amend, etc can also be filling the cache as commits are
> added.

What if the linkrev does not need to be adjusted (i.e. it is unique)? It
seems it will fall back to the existing slow changelog walk in this case.
Since a unique linkrev is not rare, I guess this approach does not help
much.

Under the hood, I think it's the same question - cache invalidation - it's
really hard to mark "this map entry is up-to-date and all the introrevs of
the fctx are here, so a slow changelog walk is unnecessary" vs. "this map
entry contains just a part of the introrevs, so you still need a slow
changelog walk".

> I'd have to think about this some more.  It sounds a little scary, and 
> I'm not sure how it would work with the wireprotocol (does the wire 
> protocol also have 12 bytes of free space?)

I didn't think too much about the wireprotocol. In the worst case, we assume
old clients by default and translate the manifest to the old format on the
fly; new clients need to explicitly tell the peer they are new.

That said, personally I also think this hash hack a bit scary, and prefer a
smaller change mentioned in:
https://www.mercurial-scm.org/pipermail/mercurial-devel/2016-September/088581.html
Durham Goode - Sept. 28, 2016, 4:23 p.m.
On 9/28/16 1:18 AM, Jun Wu wrote:
> Excerpts from Durham Goode's message of 2016-09-27 16:29:42 -0700:
>> The map is just there to make the initial step of adjustlinkrev more
>> likely to succeed.  When it scans the changelog.i, it knows of more
>> nodes it can search for.
> This may change the behavior of "_adjustlinkrev" - instead of returning the
> first ancestor satisfying the condition, it may return the second, or the
> third etc.
We could continue the .i scan until we reach the bottommost available
linkrev, so we could return the earliest one available.
>> commit, rebase, amend, etc can also be filling the cache as commits are
>> added.
> What if the linkrev does not need to be adjusted (i.e. it is unique)? It
> seems it will fall back to the existing slow changelog walk in this case.
> Since a unique linkrev is not rare, I guess this approach does not help
> much.
If it does not need adjustment, the changelog.i walk handles that case 
in a couple milliseconds.  It's the changelog.d walk that we want to 
avoid (where it looks at the contents of each commit), and the 
filectx:[linknodes] cache means the .i case will work in the vast 
majority of cases.
>
> Under the hood, I think it's the same question - cache invalidation - it's
> really hard to mark "this map entry is up-to-date and all the introrevs of
> the fctx is here, so slow changelog walk is unnecessary", vs. "this map
> entry contains just a part of the introrevs, so you still need a slow
> changelog walk".
There are two types of changelog walks.  Walking the .i is fast, walking 
the .d is slow.  I think we need to disambiguate here.  This cache only 
helps with the slow case.
Jun Wu - Sept. 28, 2016, 7:22 p.m.
I think there are some misunderstandings. Let me try again. Correct me if I
make mistakes or am unclear.

> When it scans the changelog.i, it knows of more nodes it can search for.
> If it finds none of those, it reverts back to the old scan-changelog.d
> behavior and fills the cache with the result.

IIUC, this means the map is populated on demand, and we probably don't want
to update the map for cases like "hg pull", which can be expensive. Having
the map always up-to-date also means the map can be very large, which is
probably unwanted.

Excerpts from Durham Goode's message of 2016-09-28 09:23:44 -0700:
> On 9/28/16 1:18 AM, Jun Wu wrote:
> > Excerpts from Durham Goode's message of 2016-09-27 16:29:42 -0700:
> >> The map is just there to make the initial step of adjustlinkrev more
> >> likely to succeed.  When it scans the changelog.i, it knows of more
> >> nodes it can search for.
> > This may change the behavior of "_adjustlinkrev" - instead of returning the
> > first ancestor satisfying the condition, it may return the second, or the
> > third etc.
> We could continue the .i scan until we reach the bottom most available 
> linkrev, so we could return the earliest one available.

This happens in a merge where a file is the same on both sides. See the
graph below. Changesets "b", "c", and "d" all map to filerev 1, while "a"
maps to filerev 0:

  d
  |\
  b c         1 linkrevs: ['b', 'c']
  |/          |
  a           0 linkrevs: ['a']

  changelog   filelog

"d"'s ancestors will include both "b" and "c", in some order. Both "b" and
"c" are valid results of adjusted linkrevs. We may pick a random one
depending on what the map currently has - if the map only contains "b", we
return "b", if the map only contains "c", we return "c". If the map missed
this entry, we fallback the old "_adjustlinkrev" logic and fill the map.

> >> commit, rebase, amend, etc can also be filling the cache as commits are
> >> added.
> > What if the linkrev does not need to be adjusted (i.e. it is unique)? It
> > seems it will fall back to the existing slow changelog walk in this case.
> > Since a unique linkrev is not rare, I guess this approach does not help
> > much.
> If it does not need adjustment, the changelog.i walk handles that case 
> in a couple milliseconds.  It's the changelog.d walk that we want to 
> avoid (where it looks at the contents of each commit), and the 
> filectx:[linknodes] cache means the .i case will work in the vast 
> majority of cases.

IIUC, this requires a linkrev (that the walk of .i will encounter) to exist
in the map. Otherwise, we fall back to the .d walk. This means that the
first time the user logs / annotates random files that they didn't touch,
it's likely we still need the .d walk.
Jun Wu - Sept. 29, 2016, 3:04 p.m.
Excerpts from Jun Wu's message of 2016-09-28 20:22:54 +0100:
> > If it does not need adjustment, the changelog.i walk handles that case 
> > in a couple milliseconds.  It's the changelog.d walk that we want to 
> > avoid (where it looks at the contents of each commit), and the 
> > filectx:[linknodes] cache means the .i case will work in the vast 
> > majority of cases.
> 
> IIUC, this requires a linkrev (that the walk of .i will encounter) to exist
> in the map. Otherwise, we fall back to the .d walk. This means that the
> first time the user logs / annotates random files that they didn't touch,
> it's likely we still need the .d walk.

I think I made a mistake here. The current code already has logic to try .i
first, .d next. So assuming linkrev is accurate / unique, the perf is not a
big concern. An extra on-demand built map may only help with relatively rare
cases where the linkrev is not unique and is locally generated - seems not
common to me. I will try to get more real world perf data to see if it is
really a problem.
Jun Wu - Sept. 30, 2016, 3:20 p.m.
Suppose the repo has a linear history; then the only reason we need a .d
walk is obsoleted changesets.

Consider the simple amend case, where 6 is replaced by 7:

  o    7 : filerev=1
  |
  | x  6 : filerev=1        o  1 : linkrev=6
  |/                        |
  o    5 : filerev=0        o  0 : linkrev=5
  changelog                 filelog

It's obvious that if we rewrite linkrev from 6 to 7 in this case, we no
longer need .d walk, since the new linkrev will be in the visible branch.

Hence I suggest rewriting the linkrev when creating an obsolescence marker:
if the linkrev pointed to an obsoleted changeset, and one of the successors
has the same filerev, rewrite the linkrev to the earliest successor with
the same filerev.

It could be done synchronously (rewrite immediately when a marker is being
created) or asynchronously (cleanup in crontab), if the former is too
expensive or has issues with transactions.
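
A sketch of the rule ("setlinkrev" is hypothetical - today's revlogs are
append-only, so that write is exactly the part needing a safe home):

  def rewritelinkrev(repo, path, obsoleted, successors):
      fl = repo.file(path)
      old = repo[obsoleted]
      if path not in old:
          return
      frev = fl.rev(old.filenode(path))
      if fl.linkrev(frev) != obsoleted:
          return  # the linkrev does not point at the obsoleted changeset
      for succ in sorted(successors):  # earliest successor first
          sctx = repo[succ]
          if path in sctx and sctx.filenode(path) == old.filenode(path):
              fl.setlinkrev(frev, succ)  # hypothetical in-place update
              return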

Excerpts from Jun Wu's message of 2016-09-29 16:04:42 +0100:
> Excerpts from Jun Wu's message of 2016-09-28 20:22:54 +0100:
> > > If it does not need adjustment, the changelog.i walk handles that case 
> > > in a couple milliseconds.  It's the changelog.d walk that we want to 
> > > avoid (where it looks at the contents of each commit), and the 
> > > filectx:[linknodes] cache means the .i case will work in the vast 
> > > majority of cases.
> > 
> > IIUC, this requires a linkrev (that the walk of .i will encounter) to exist
> > in the map. Otherwise, we fall back to the .d walk. This means that the
> > first time the user logs / annotates random files that they didn't touch,
> > it's likely we still need the .d walk.
> 
> I think I made a mistake here. The current code already has logic to try .i
> first, .d next. So assuming linkrev is accurate / unique, the perf is not a
> big concern. An extra on-demand built map may only help with relatively rare
> cases where the linkrev is not unique and is locally generated - seems not
> common to me. I will try to get more real world perf data to see if it is
> really a problem.
Durham Goode - Sept. 30, 2016, 3:31 p.m.
On 9/30/16 8:20 AM, Jun Wu wrote:
> Suppose the repo has a linear history; then the only reason we need a .d
> walk is obsoleted changesets.
I don't think we can assume the repo is linear though.  Even for a repo 
without merges, if a user performs a graft onto a release branch, that 
can result in duplicate linkrevs without any obs markers being created. 
(and at that point we can't blindly rewrite the linkrev if a future obs 
marker comes along, because we don't know what other commits have file 
nodes that use that linkrev).
>
> Consider the simple amend case, where 6 is replaced by 7:
>
>    o    7 : filerev=1
>    |
>    | x  6 : filerev=1        o  1 : linkrev=6
>    |/                        |
>    o    5 : filerev=0        o  0 : linkrev=5
>    changelog                 filelog
>
> It's obvious that if we rewrite linkrev from 6 to 7 in this case, we no
> longer need .d walk, since the new linkrev will be in the visible branch.
>
> Hence I suggest rewriting the linkrev when creating an obsolescence marker:
> if the linkrev pointed to an obsoleted changeset, and one of the successors
> has the same filerev, rewrite the linkrev to the earliest successor with
> the same filerev.
Rewrite it where? In the revlog?  That sounds dangerous.  Other 
algorithms currently depend on the linkrev pointing at the earliest 
instance, so I don't think we'd want to break that invariant entirely.
>
> It could be done synchronously (rewrite immediately when a marker is being
> created) or asynchronously (cleanup in crontab), if the former is too
> expensive or has issues with transactions.
Jun Wu - Sept. 30, 2016, 4:22 p.m.
Excerpts from Durham Goode's message of 2016-09-30 08:31:17 -0700:
> On 9/30/16 8:20 AM, Jun Wu wrote:
> > Suppose the repo has a linear history; then the only reason we need a .d
> > walk is obsoleted changesets.
> I don't think we can assume the repo is linear though.  Even for a repo 
> without merges, if a user performs a graft onto a release branch, that 

note: I should have added "single head" as a requirement of being "linear". 
But the approach should be helpful even if the repo is not linear, see
below.

> can result in duplicate linkrevs without any obs markers being created. 
> (and at that point we can't blindly rewrite the linkrev if a future obs 
> marker comes along, because we don't know what other commits have file 
> nodes that use that linkrev).

If we do nothing when obs markers are being created and the filerev points
to obsoleted changesets, we will *always* fall back to a .d walk later. This
change will reduce the chance from "always" to "maybe".

The graft case is a valid concern. But it could be detected easily: only
rewrite the linkrev to the successor if the successor modifies (introduces)
the file. That's just a quick "if file in changelog[succ].files" test.
Otherwise we can either do a .d walk (it should be cheaper than usual, as we
expect the first changeset modifying the file to match if the history is
linear) or not rewrite the linkrev this time for better perf (i.e.
assuming the graft case is much less common than the obsolete case).

> > Consider the simple amend case, where 6 is replaced by 7:
> >
> >    o    7 : filerev=1
> >    |
> >    | x  6 : filerev=1        o  1 : linkrev=6
> >    |/                        |
> >    o    5 : filerev=0        o  0 : linkrev=5
> >    changelog                 filelog
> >
> > It's obvious that if we rewrite linkrev from 6 to 7 in this case, we no
> > longer need .d walk, since the new linkrev will be in the visible branch.
> >
> > Hence I suggest rewriting the linkrev when creating an obsolescence marker:
> > if the linkrev pointed to an obsoleted changeset, and one of the successors
> > has the same filerev, rewrite the linkrev to the earliest successor with
> > the same filerev.
> Rewrite it where? In the revlog?  That sounds dangerous.  Other 
> algorithms currently depend on the linkrev pointing at the earliest 
> instance, so I don't think we'd want to break that invariant entirely.

Yeah, if the revlog must be strictly "append-only", we need to find a
different place to write the info (like an extra map, but that increases the
complexity significantly).

That's also why I mentioned "asynchronously": let an "fsck"-like "cleanup"
program do the job. In fact, since it's just touching 4 bytes of a file
without moving content around, I think it should be much safer than things
like "strip".

In general, the high-level ideas here are:
1. a linkrev pointing to a hidden revision is probably never what we want
2. if we can know a better "linkrev" easily, we should record it
   somewhere.
3. the better "linkrev" in these cases is good enough for linear repos.
4. hooking "obsolete marker creation" seems brilliant and handles a lot of
   real world cases; for example, it fits well with "pullcreatemarkers".

Whether we do it directly in the revlog or not is an implementation detail.
I'd prefer changing the revlog since it's simple, straightforward and fits
well with the other parts.

Patch

diff --git a/mercurial/cmdutil.py b/mercurial/cmdutil.py
--- a/mercurial/cmdutil.py
+++ b/mercurial/cmdutil.py
@@ -1940,7 +1940,7 @@  def _makefollowlogfilematcher(repo, file
     # --follow, we want the names of the ancestors of FILE in the
     # revision, stored in "fcache". "fcache" is populated by
     # reproducing the graph traversal already done by --follow revset
-    # and relating linkrevs to file names (which is not "correct" but
+    # and relating revs to file names (which is not "correct" but
     # good enough).
     fcache = {}
     fcacheready = [False]
@@ -1949,9 +1949,9 @@  def _makefollowlogfilematcher(repo, file
     def populate():
         for fn in files:
             fctx = pctx[fn]
-            fcache.setdefault(fctx.linkrev(), set()).add(fctx.path())
+            fcache.setdefault(fctx.introrev(), set()).add(fctx.path())
             for c in fctx.ancestors(followfirst=followfirst):
-                fcache.setdefault(c.linkrev(), set()).add(c.path())
+                fcache.setdefault(c.rev(), set()).add(c.path())
 
     def filematcher(rev):
         if not fcacheready[0]:
diff --git a/tests/test-log.t b/tests/test-log.t
--- a/tests/test-log.t
+++ b/tests/test-log.t
@@ -920,6 +920,78 @@  log -r tip --stat
 
   $ cd ..
 
+log --follow --patch FILE in repository where linkrev isn't trustworthy
+(issue5376)
+
+  $ hg init follow-dup
+  $ cd follow-dup
+  $ cat <<EOF >> .hg/hgrc
+  > [ui]
+  > logtemplate = '=== {rev}: {desc}\n'
+  > [diff]
+  > nodates = True
+  > EOF
+  $ echo 0 >> a
+  $ hg ci -qAm 'a0'
+  $ echo 1 >> a
+  $ hg ci -m 'a1'
+  $ hg up -q 0
+  $ echo 1 >> a
+  $ touch b
+  $ hg ci -qAm 'a1 with b'
+  $ echo 3 >> a
+  $ hg ci -m 'a3'
+
+ fctx.rev() == 2, but fctx.linkrev() == 1
+
+  $ hg log -pf a
+  === 3: a3
+  diff -r 4ea02ba94d66 -r e7a6331a34f0 a
+  --- a/a
+  +++ b/a
+  @@ -1,2 +1,3 @@
+   0
+   1
+  +3
+  
+  === 2: a1 with b
+  diff -r 49b5e81287e2 -r 4ea02ba94d66 a
+  --- a/a
+  +++ b/a
+  @@ -1,1 +1,2 @@
+   0
+  +1
+  
+  === 0: a0
+  diff -r 000000000000 -r 49b5e81287e2 a
+  --- /dev/null
+  +++ b/a
+  @@ -0,0 +1,1 @@
+  +0
+  
+
+ fctx.introrev() == 2, but fctx.linkrev() == 1
+
+  $ hg up -q 2
+  $ hg log -pf a
+  === 2: a1 with b
+  diff -r 49b5e81287e2 -r 4ea02ba94d66 a
+  --- a/a
+  +++ b/a
+  @@ -1,1 +1,2 @@
+   0
+  +1
+  
+  === 0: a0
+  diff -r 000000000000 -r 49b5e81287e2 a
+  --- /dev/null
+  +++ b/a
+  @@ -0,0 +1,1 @@
+  +0
+  
+
+  $ cd ..
+
 Test that log should respect the order of -rREV even if multiple OR conditions
 are specified (issue5100):