Patchwork A thought for dealing with linkrevs

login
register
mail settings
Submitter Matt Mackall
Date Sept. 18, 2014, 4:08 a.m.
Message ID <1411013310.8277.25.camel@calx>
Download mbox | patch
Permalink /patch/5864/
State Not Applicable
Headers show

Comments

Matt Mackall - Sept. 18, 2014, 4:08 a.m.
So I've been thinking about the linkrev problem a bit again. I've
occasionally suggested we can deduce the possible positions of the
missing linkrevs from the shape of the changelog graph as compared to
the filelog graph. This turned out to be complicated so it never really
went anywhere.

I'm now tinkering with the idea of treating the linkrev as a simple
hint: "this is one possibility, and any aliases are numerically after
it". So if we're interested in following the history of file F from
changeset C, we can say:

- is F.linkrev an ancestor of C? great, done
- otherwise, let's look in that vicinity for changesets touching F

Since the usual case for caring about this is log -f or annotate, which
incrementally walk back through file history, we can imagine a
"corrector" object that knows where we're following from and can correct
related linkrevs. We can even imagine this state being automatically
shared as we traverse contexts, thus making fctx.linkrev() magically do
the right thing.

I tried hacking this up, and the below code will incrementally correct
all the linkrevs in the 2639 ancestors of mercurial/commands.py in
0.13s, including fixing up a ton of obsolete linkrevs, which seems like
a pretty good start. There are probably a bunch of ways it could be
faster/smarter.

(One can also imagine an alternate corrector that uses a persistent
cache.)
Pierre-Yves David - Sept. 18, 2014, 7:44 p.m.
On 09/17/2014 09:08 PM, Matt Mackall wrote:
> So I've been thinking about the linkrev problem a bit again. I've
> occasionally suggested we can deduce the possible positions of the
> missing linkrevs from the shape of the changelog graph as compared to
> the filelog graph. This turned out to be complicated so it never really
> went anywhere.
>
> I'm now tinkering with the idea of treating the linkrev as a simple
> hint: "this is one possibility, and any aliases are numerically after
> it". So if we're interested in following the history of file F from
> changeset C, we can say:
>
> - is F.linkrev an ancestor of C? great, done
> - otherwise, let's look in that vicinity for changesets touching F
>
> Since the usual case for caring about this is log -f or annotate, which
> incrementally walk back through file history, we can imagine a
> "corrector" object that knows where we're following from and can correct
> related linkrevs. We can even imagine this state being automatically
> shared as we traverse contexts, thus making fctx.linkrev() magically do
> the right thing.
>
> I tried hacking this up, and the below code will incrementally correct
> all the linkrevs in the 2639 ancestors of mercurial/commands.py in
> 0.13s, including fixing up a ton of obsolete linkrevs, which seems like
> a pretty good start. There are probably a bunch of ways it could be
> faster/smarter.

This a pretty cool idea. I love how it provide a bounded solution 
without too much complexity added. Do you have the timing without the 
linkcorrector to get an idea of the slow down ?
Matt Mackall - Sept. 18, 2014, 8:53 p.m.
On Thu, 2014-09-18 at 12:44 -0700, Pierre-Yves David wrote:
> 
> On 09/17/2014 09:08 PM, Matt Mackall wrote:
> > So I've been thinking about the linkrev problem a bit again. I've
> > occasionally suggested we can deduce the possible positions of the
> > missing linkrevs from the shape of the changelog graph as compared to
> > the filelog graph. This turned out to be complicated so it never really
> > went anywhere.
> >
> > I'm now tinkering with the idea of treating the linkrev as a simple
> > hint: "this is one possibility, and any aliases are numerically after
> > it". So if we're interested in following the history of file F from
> > changeset C, we can say:
> >
> > - is F.linkrev an ancestor of C? great, done
> > - otherwise, let's look in that vicinity for changesets touching F
> >
> > Since the usual case for caring about this is log -f or annotate, which
> > incrementally walk back through file history, we can imagine a
> > "corrector" object that knows where we're following from and can correct
> > related linkrevs. We can even imagine this state being automatically
> > shared as we traverse contexts, thus making fctx.linkrev() magically do
> > the right thing.
> >
> > I tried hacking this up, and the below code will incrementally correct
> > all the linkrevs in the 2639 ancestors of mercurial/commands.py in
> > 0.13s, including fixing up a ton of obsolete linkrevs, which seems like
> > a pretty good start. There are probably a bunch of ways it could be
> > faster/smarter.
> 
> This a pretty cool idea. I love how it provide a bounded solution 
> without too much complexity added. Do you have the timing without the 
> linkcorrector to get an idea of the slow down ?

.03s to simply follow the filelog and get linkrevs
.16s to do the above while correcting links

Compared to a full annotate, which is taking around 7s (including 1s of
parsing obsmarkers...), it's pretty manageable. The current version has
an O(revs) component, but it's still much faster than reading the entire
changelog.
Durham Goode - Sept. 19, 2014, 5:08 p.m.
On 9/17/14, 9:08 PM, Matt Mackall wrote:
> So I've been thinking about the linkrev problem a bit again. I've
> occasionally suggested we can deduce the possible positions of the
> missing linkrevs from the shape of the changelog graph as compared to
> the filelog graph. This turned out to be complicated so it never really
> went anywhere.
>
> I'm now tinkering with the idea of treating the linkrev as a simple
> hint: "this is one possibility, and any aliases are numerically after
> it". So if we're interested in following the history of file F from
> changeset C, we can say:
>
> - is F.linkrev an ancestor of C? great, done
> - otherwise, let's look in that vicinity for changesets touching F
>
> Since the usual case for caring about this is log -f or annotate, which
> incrementally walk back through file history, we can imagine a
> "corrector" object that knows where we're following from and can correct
> related linkrevs. We can even imagine this state being automatically
> shared as we traverse contexts, thus making fctx.linkrev() magically do
> the right thing.

I've thought about this a bit, and I think it could work for 
remotefilelog too.

One of the issues I have in remotefilelog, is that when I receive a 
changegroup, I need to immediately know which changelog revs introduce 
which filelog revs.  Right now I can only see the first changelog rev 
that produced a given filelog rev.  If we have this linkcorrector thing, 
we could do the correction on the server side and include a bundle2 part 
with 1->N (filelogrev->changelogrevs) mapping for that changegroup.  
That might be useful for non-remotefilelog cases too, for clients to 
build an early cache of corrected linkrevs.

The other linkrev issue I have in remotefilelog, is that the file 
revision blobs produced by the server include a history of that file 
revision that contains the server's linkrev'd commit.  So we assume the 
client pulls every server head, so that they are guaranteed to have 
whatever commit the server's linkrev pointed at.  If we have a 
linkcorrector, I could probably use it to resolve linkrevs when the 
client hasn't pulled everything.  I'd have to override it with my own 
implementation (since I don't actually have a filelog with linkrevs), 
but the concept should still apply.
Augie Fackler - Sept. 24, 2014, 3:25 p.m.
On Wed, Sep 17, 2014 at 11:08:30PM -0500, Matt Mackall wrote:
> So I've been thinking about the linkrev problem a bit again. I've
> occasionally suggested we can deduce the possible positions of the
> missing linkrevs from the shape of the changelog graph as compared to
> the filelog graph. This turned out to be complicated so it never really
> went anywhere.
>
> I'm now tinkering with the idea of treating the linkrev as a simple
> hint: "this is one possibility, and any aliases are numerically after
> it". So if we're interested in following the history of file F from
> changeset C, we can say:
>
> - is F.linkrev an ancestor of C? great, done
> - otherwise, let's look in that vicinity for changesets touching F
>
> Since the usual case for caring about this is log -f or annotate, which
> incrementally walk back through file history, we can imagine a
> "corrector" object that knows where we're following from and can correct
> related linkrevs. We can even imagine this state being automatically
> shared as we traverse contexts, thus making fctx.linkrev() magically do
> the right thing.

Per-file log also cares about this. martinvonz and I recently had some
surprising 'hg log' output in the presence of evolution when looking
at the log for a single file.

(That is: -f doesn't really matter - vanilla log on a single file will
also give wrong answers AFAICT.)

>
> I tried hacking this up, and the below code will incrementally correct
> all the linkrevs in the 2639 ancestors of mercurial/commands.py in
> 0.13s, including fixing up a ton of obsolete linkrevs, which seems like
> a pretty good start. There are probably a bunch of ways it could be
> faster/smarter.
>
> (One can also imagine an alternate corrector that uses a persistent
> cache.)
>
> diff -r 48791c2bea1c mercurial/context.py
> --- a/mercurial/context.py	Tue Sep 16 14:49:56 2014 -0500
> +++ b/mercurial/context.py	Wed Sep 17 22:32:49 2014 -0500
> @@ -17,6 +17,75 @@
>
>  propertycache = util.propertycache
>
> +class linkcorrector(object):
> +    """report (an) appropriate linkrev for file revisions in ancestors of
> +    ctx
> +    """
> +    def __init__(self, ctx, filelog):
> +        self._start = ctx
> +        self._filelog = filelog
> +        self._linkcache = {} # filerev -> rev
> +        self._visit = []
> +        self._scanpos = ctx.rev() + 1
> +        self._ancestors = set([ctx.rev()]) # known ancestors
> +
> +    def correct(self, fctx):
> +        fr = fctx._filerev
> +        linkcache = self._linkcache
> +
> +        # fastest path: already cached
> +        if fr in linkcache:
> +            return linkcache[fr]
> +
> +        # get the actual linkrev
> +        flog = self._filelog
> +        lr = flog.linkrev(fr)
> +
> +        # if linkrev is a known ancestor, return immediately
> +        ancestors = self._ancestors
> +        if lr in ancestors:
> +            linkcache[fr] = lr
> +            return lr
> +
> +        # find new revs visit (without unpacking)
> +        # if we get lucky, we'll hit linkrev and be done
> +        repo = self._start._repo
> +        parentrevs = repo.changelog.parentrevs
> +        vadd = self._visit.append
> +        while True:
> +            r = self._scanpos = self._scanpos - 1
> +            if r in ancestors:
> +                for p in parentrevs(r):
> +                    ancestors.add(p)
> +                if r == lr:
> +                    # good, cache and return
> +                    linkcache[fr] = lr
> +                    return lr
> +                # examine later
> +                vadd(r)
> +            if r <= lr or r == 0:
> +                break
> +
> +        # check possible revs
> +        path = fctx._path
> +        fnode = fctx._filenode
> +        visit = self._visit
> +        while visit:
> +            # these are added breadfirst, take ones from the end,
> +            # hopefully closest to linkrev
> +            r = visit.pop()
> +            ctx = repo[r]
> +            if path in ctx.files():
> +                # maybe fast-pathed
> +                newfnode = ctx.filenode(path)
> +                if fnode == newfnode:
> +                    # hit!
> +                    linkcache[fr] = r
> +                    return r
> +                else:
> +                    # miss, cache for later
> +                    linkcache[flog.rev(newfnode)] = r
> +
>  class basectx(object):
>      """A basectx object represents the common logic for its children:
>      changectx: read-only context that is already present in the repo,
>
> --
> Mathematics is the supreme nostalgia of our time.
>
>
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@selenic.com
> http://selenic.com/mailman/listinfo/mercurial-devel

Patch

diff -r 48791c2bea1c mercurial/context.py
--- a/mercurial/context.py	Tue Sep 16 14:49:56 2014 -0500
+++ b/mercurial/context.py	Wed Sep 17 22:32:49 2014 -0500
@@ -17,6 +17,75 @@ 
 
 propertycache = util.propertycache
 
+class linkcorrector(object):
+    """report (an) appropriate linkrev for file revisions in ancestors of
+    ctx
+    """
+    def __init__(self, ctx, filelog):
+        self._start = ctx
+        self._filelog = filelog
+        self._linkcache = {} # filerev -> rev
+        self._visit = []
+        self._scanpos = ctx.rev() + 1
+        self._ancestors = set([ctx.rev()]) # known ancestors
+
+    def correct(self, fctx):
+        fr = fctx._filerev
+        linkcache = self._linkcache
+
+        # fastest path: already cached
+        if fr in linkcache:
+            return linkcache[fr]
+
+        # get the actual linkrev
+        flog = self._filelog
+        lr = flog.linkrev(fr)
+
+        # if linkrev is a known ancestor, return immediately
+        ancestors = self._ancestors
+        if lr in ancestors:
+            linkcache[fr] = lr
+            return lr
+
+        # find new revs visit (without unpacking)
+        # if we get lucky, we'll hit linkrev and be done
+        repo = self._start._repo
+        parentrevs = repo.changelog.parentrevs
+        vadd = self._visit.append
+        while True:
+            r = self._scanpos = self._scanpos - 1
+            if r in ancestors:
+                for p in parentrevs(r):
+                    ancestors.add(p)
+                if r == lr:
+                    # good, cache and return
+                    linkcache[fr] = lr
+                    return lr
+                # examine later
+                vadd(r)
+            if r <= lr or r == 0:
+                break
+
+        # check possible revs
+        path = fctx._path
+        fnode = fctx._filenode
+        visit = self._visit
+        while visit:
+            # these are added breadfirst, take ones from the end,
+            # hopefully closest to linkrev
+            r = visit.pop()
+            ctx = repo[r]
+            if path in ctx.files():
+                # maybe fast-pathed
+                newfnode = ctx.filenode(path)
+                if fnode == newfnode:
+                    # hit!
+                    linkcache[fr] = r
+                    return r
+                else:
+                    # miss, cache for later
+                    linkcache[flog.rev(newfnode)] = r
+
 class basectx(object):
     """A basectx object represents the common logic for its children:
     changectx: read-only context that is already present in the repo,