Patchwork lfs: add a progress bar when searching for blobs to upload

login
register
mail settings
Submitter via Mercurial-devel
Date Aug. 31, 2018, 6:16 a.m.
Message ID <CAESOdVDm7Km78pB2XOeLprP3VE0b0kExdxhmUNR_QNbRjEACiA@mail.gmail.com>
Download mbox | patch
Permalink /patch/34203/
State New
Headers show

Comments

via Mercurial-devel - Aug. 31, 2018, 6:16 a.m.
On Thu, Aug 30, 2018 at 11:02 PM Matt Harbison <mharbison72@gmail.com>
wrote:

>
> On Aug 31, 2018, at 1:19 AM, Martin von Zweigbergk <martinvonz@google.com>
> wrote:
>
>
>
> On Thu, Aug 30, 2018 at 10:14 PM Martin von Zweigbergk <
> martinvonz@google.com> wrote:
>
>>
>>
>> On Wed, Aug 29, 2018 at 8:18 PM Matt Harbison <mharbison72@gmail.com>
>> wrote:
>>
>>> On Fri, 24 Aug 2018 18:18:32 -0400, Matt Harbison <mharbison72@gmail.com>
>>>
>>> wrote:
>>>
>>> > # HG changeset patch
>>> > # User Matt Harbison <matt_harbison@yahoo.com>
>>> > # Date 1535147146 14400
>>> > #      Fri Aug 24 17:45:46 2018 -0400
>>> > # Node ID 76eca3ae345b261c0049d16269cdf991a31af21a
>>> > # Parent  c9a3f7f5c0235e3ae35135818c48ec5ea006de37
>>> > lfs: add a progress bar when searching for blobs to upload
>>> >
>>> > The search itself can take an extreme amount of time if there are a
>>> lot
>>> > of
>>> > revisions involved.  I've got a local repo that took 6 minutes to
>>> push
>>> > 1850
>>> > commits, and 60% of that time was spent here (there are ~70K files):
>>> >
>>> >      \ 58.1%  wrapper.py:     extractpointers      line 297:  pointers
>>> =
>>> > extractpointers(...
>>> >        | 57.7%  wrapper.py:     pointersfromctx    line 352:  for p
>>> in
>>> > pointersfromctx(ct...
>>> >        | 57.4%  wrapper.py:     pointerfromctx     line 397:  p =
>>> > pointerfromctx(ctx, f, ...
>>> >          \ 38.7%  context.py:     __contains__     line 368:  if f
>>> not
>>> > in ctx:
>>> >            | 38.7%  util.py:        __get__        line 82:  return
>>> key
>>> > in self._manifest
>>> >            | 38.7%  context.py:     _manifest      line 1416:  result
>>> =
>>> > self.func(obj)
>>> >            | 38.7%  manifest.py:    read           line 472:  return
>>> > self._manifestctx.re...
>>> >              \ 25.6%  revlog.py:      revision     line 1562:  text =
>>> > rl.revision(self._node)
>>> >                \ 12.8%  revlog.py:      _chunks    line 2217:  bins =
>>> > self._chunks(chain, ...
>>> >                  | 12.0%  revlog.py:      decompressline 2112:
>>> > ladd(decomp(buffer(data, ch...
>>> >                \  7.8%  revlog.py:      checkhash  line 2232:
>>> > self.checkhash(text, node, ...
>>> >                  |  7.8%  revlog.py:      hash     line 2315:  if
>>> node
>>> > != self.hash(text, ...
>>> >                  |  7.8%  revlog.py:      hash     line 2242:  return
>>> > hash(text, p1, p2)
>>> >              \ 12.0%  manifest.py:    __init__     line 1565:
>>> > self._data = manifestdict(t...
>>> >          \ 16.8%  context.py:     filenode         line 378:  if not
>>> > _islfs(fctx.filelog(...
>>> >            | 15.7%  util.py:        __get__        line 706:  return
>>> > self._filelog
>>> >            | 14.8%  context.py:     _filelog       line 1416:  result
>>> =
>>> > self.func(obj)
>>> >            | 14.8%  localrepo.py:   file           line 629:  return
>>> > self._repo.file(self...
>>> >            | 14.8%  filelog.py:     __init__       line 1134:  return
>>> > filelog.filelog(self...
>>> >            | 14.5%  revlog.py:      __init__       line 24:
>>> > censorable=True)
>>>
>>> Any ideas how to trim down some of this overhead?
>>
>>
>> You can possibly save on some of that manifest-reading time by calling
>> manifestlog.readfast() like changegroup (and verify, I think) does.
>>
>>
>>>   revset._matchfiles()
>>> has a comment about reading the changelog directly because of the
>>> overhead
>>> of creating changectx[1].  I think that could work here too, but falls
>>> apart because of the need to access the filelogs too.
>>
>>
> I don't see changectx-creation in the profile output. What makes you think
> that's a significant cost here?
>
>
> The footnote below is to a comment that says it’s expensive to create each
> one, which is basically what happened here in this case. Maybe the comment
> is stale now?  I wasn’t sure why it would be that much more expensive
> either, but reading the changelog seems more direct.
>

I saw that comment (in the link), but since changelog-reading doesn't seem
to appear in the profiling output, it doesn't seem like the first thing to
tackle at least. Since most of the time seems to be spent reading
manifests, I think readfast() might help. Can you see if the following
patch helps? (The special handling of the working copy is because
workingctx apparently doesn't have a manifestctx() method. We should
perhaps fix that.)



>   It seems like
>>> reading the changelog and accessing the filelogs directly here is too
>>> low
>>> level, especially with @indygreg trying to add support for non-filelog
>>> storage.
>>>
>>> [1]
>>>
>>> https://www.mercurial-scm.org/repo/hg/file/6f38284b23f4/mercurial/revset.py#l1113
>>> _______________________________________________
>>> Mercurial-devel mailing list
>>> Mercurial-devel@mercurial-scm.org
>>> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
>>>
>>
via Mercurial-devel - Aug. 31, 2018, 6:23 a.m.
On Thu, Aug 30, 2018 at 11:16 PM Martin von Zweigbergk <
martinvonz@google.com> wrote:

>
>
> On Thu, Aug 30, 2018 at 11:02 PM Matt Harbison <mharbison72@gmail.com>
> wrote:
>
>>
>> On Aug 31, 2018, at 1:19 AM, Martin von Zweigbergk <martinvonz@google.com>
>> wrote:
>>
>>
>>
>> On Thu, Aug 30, 2018 at 10:14 PM Martin von Zweigbergk <
>> martinvonz@google.com> wrote:
>>
>>>
>>>
>>> On Wed, Aug 29, 2018 at 8:18 PM Matt Harbison <mharbison72@gmail.com>
>>> wrote:
>>>
>>>> On Fri, 24 Aug 2018 18:18:32 -0400, Matt Harbison <
>>>> mharbison72@gmail.com>
>>>> wrote:
>>>>
>>>> > # HG changeset patch
>>>> > # User Matt Harbison <matt_harbison@yahoo.com>
>>>> > # Date 1535147146 14400
>>>> > #      Fri Aug 24 17:45:46 2018 -0400
>>>> > # Node ID 76eca3ae345b261c0049d16269cdf991a31af21a
>>>> > # Parent  c9a3f7f5c0235e3ae35135818c48ec5ea006de37
>>>> > lfs: add a progress bar when searching for blobs to upload
>>>> >
>>>> > The search itself can take an extreme amount of time if there are a
>>>> lot
>>>> > of
>>>> > revisions involved.  I've got a local repo that took 6 minutes to
>>>> push
>>>> > 1850
>>>> > commits, and 60% of that time was spent here (there are ~70K files):
>>>> >
>>>> >      \ 58.1%  wrapper.py:     extractpointers      line 297:
>>>> pointers =
>>>> > extractpointers(...
>>>> >        | 57.7%  wrapper.py:     pointersfromctx    line 352:  for p
>>>> in
>>>> > pointersfromctx(ct...
>>>> >        | 57.4%  wrapper.py:     pointerfromctx     line 397:  p =
>>>> > pointerfromctx(ctx, f, ...
>>>> >          \ 38.7%  context.py:     __contains__     line 368:  if f
>>>> not
>>>> > in ctx:
>>>> >            | 38.7%  util.py:        __get__        line 82:  return
>>>> key
>>>> > in self._manifest
>>>> >            | 38.7%  context.py:     _manifest      line 1416:  result
>>>> =
>>>> > self.func(obj)
>>>> >            | 38.7%  manifest.py:    read           line 472:  return
>>>> > self._manifestctx.re...
>>>> >              \ 25.6%  revlog.py:      revision     line 1562:  text
>>>> =
>>>> > rl.revision(self._node)
>>>> >                \ 12.8%  revlog.py:      _chunks    line 2217:  bins
>>>> =
>>>> > self._chunks(chain, ...
>>>> >                  | 12.0%  revlog.py:      decompressline 2112:
>>>> > ladd(decomp(buffer(data, ch...
>>>> >                \  7.8%  revlog.py:      checkhash  line 2232:
>>>> > self.checkhash(text, node, ...
>>>> >                  |  7.8%  revlog.py:      hash     line 2315:  if
>>>> node
>>>> > != self.hash(text, ...
>>>> >                  |  7.8%  revlog.py:      hash     line 2242:
>>>> return
>>>> > hash(text, p1, p2)
>>>> >              \ 12.0%  manifest.py:    __init__     line 1565:
>>>> > self._data = manifestdict(t...
>>>> >          \ 16.8%  context.py:     filenode         line 378:  if not
>>>> > _islfs(fctx.filelog(...
>>>> >            | 15.7%  util.py:        __get__        line 706:  return
>>>> > self._filelog
>>>> >            | 14.8%  context.py:     _filelog       line 1416:  result
>>>> =
>>>> > self.func(obj)
>>>> >            | 14.8%  localrepo.py:   file           line 629:  return
>>>> > self._repo.file(self...
>>>> >            | 14.8%  filelog.py:     __init__       line 1134:
>>>> return
>>>> > filelog.filelog(self...
>>>> >            | 14.5%  revlog.py:      __init__       line 24:
>>>> > censorable=True)
>>>>
>>>> Any ideas how to trim down some of this overhead?
>>>
>>>
>>> You can possibly save on some of that manifest-reading time by calling
>>> manifestlog.readfast() like changegroup (and verify, I think) does.
>>>
>>>
>>>>   revset._matchfiles()
>>>> has a comment about reading the changelog directly because of the
>>>> overhead
>>>> of creating changectx[1].  I think that could work here too, but falls
>>>> apart because of the need to access the filelogs too.
>>>
>>>
>> I don't see changectx-creation in the profile output. What makes you
>> think that's a significant cost here?
>>
>>
>> The footnote below is to a comment that says it’s expensive to create
>> each one, which is basically what happened here in this case. Maybe the
>> comment is stale now?  I wasn’t sure why it would be that much more
>> expensive either, but reading the changelog seems more direct.
>>
>
> I saw that comment (in the link), but since changelog-reading doesn't seem
> to appear in the profiling output, it doesn't seem like the first thing to
> tackle at least. Since most of the time seems to be spent reading
> manifests, I think readfast() might help. Can you see if the following
> patch helps? (The special handling of the working copy is because
> workingctx apparently doesn't have a manifestctx() method. We should
> perhaps fix that.)
>
> diff --git a/hgext/lfs/wrapper.py b/hgext/lfs/wrapper.py
> --- a/hgext/lfs/wrapper.py
> +++ b/hgext/lfs/wrapper.py
> @@ -259,7 +259,11 @@ def _prefetchfiles(repo, revs, match):
>
>      for rev in revs:
>          ctx = repo[rev]
> -        for f in ctx.walk(match):
> +        if ctx.rev() is None:
> +            manifest = ctx.manifest()
> +        else:
> +            manifest = ctx.manifestctx().readfast()
> +        for f in manifest.walk(match):
>              p = pointerfromctx(ctx, f)
>              if p and p.oid() not in oids and not localstore.has(p.oid()):
>                  p.filename = f
>
>
Never mind, seems I patched the wrong method :P Hopefully you get the idea
and you can figure out how to do it yourself in pointersfromctx()
and pointerfromctx(). You want to avoid the "if f not in ctx" there.


>
>>   It seems like
>>>> reading the changelog and accessing the filelogs directly here is too
>>>> low
>>>> level, especially with @indygreg trying to add support for non-filelog
>>>> storage.
>>>>
>>>> [1]
>>>>
>>>> https://www.mercurial-scm.org/repo/hg/file/6f38284b23f4/mercurial/revset.py#l1113
>>>> _______________________________________________
>>>> Mercurial-devel mailing list
>>>> Mercurial-devel@mercurial-scm.org
>>>> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
>>>>
>>>
Matt Harbison - Sept. 1, 2018, 3:28 a.m.
On Fri, 31 Aug 2018 02:23:47 -0400, Martin von Zweigbergk  
<martinvonz@google.com> wrote:

> On Thu, Aug 30, 2018 at 11:16 PM Martin von Zweigbergk <
> martinvonz@google.com> wrote:
>
>>
>>
>> On Thu, Aug 30, 2018 at 11:02 PM Matt Harbison <mharbison72@gmail.com>
>> wrote:
>>
>>>
>>> On Aug 31, 2018, at 1:19 AM, Martin von Zweigbergk  
>>> <martinvonz@google.com>
>>> wrote:
>>>
>>>
>>>
>>> On Thu, Aug 30, 2018 at 10:14 PM Martin von Zweigbergk <
>>> martinvonz@google.com> wrote:
>>>
>>>>
>>>>
>>>> On Wed, Aug 29, 2018 at 8:18 PM Matt Harbison <mharbison72@gmail.com>
>>>> wrote:
>>>>
>>>>> On Fri, 24 Aug 2018 18:18:32 -0400, Matt Harbison <
>>>>> mharbison72@gmail.com>
>>>>> wrote:
>>>>>
>>>>> > # HG changeset patch
>>>>> > # User Matt Harbison <matt_harbison@yahoo.com>
>>>>> > # Date 1535147146 14400
>>>>> > #      Fri Aug 24 17:45:46 2018 -0400
>>>>> > # Node ID 76eca3ae345b261c0049d16269cdf991a31af21a
>>>>> > # Parent  c9a3f7f5c0235e3ae35135818c48ec5ea006de37
>>>>> > lfs: add a progress bar when searching for blobs to upload
>>>>> >
>>>>> > The search itself can take an extreme amount of time if there are a
>>>>> lot
>>>>> > of
>>>>> > revisions involved.  I've got a local repo that took 6 minutes to
>>>>> push
>>>>> > 1850
>>>>> > commits, and 60% of that time was spent here (there are ~70K  
>>>>> files):
>>>>> >
>>>>> >      \ 58.1%  wrapper.py:     extractpointers      line 297:
>>>>> pointers =
>>>>> > extractpointers(...
>>>>> >        | 57.7%  wrapper.py:     pointersfromctx    line 352:  for p
>>>>> in
>>>>> > pointersfromctx(ct...
>>>>> >        | 57.4%  wrapper.py:     pointerfromctx     line 397:  p =
>>>>> > pointerfromctx(ctx, f, ...
>>>>> >          \ 38.7%  context.py:     __contains__     line 368:  if f
>>>>> not
>>>>> > in ctx:
>>>>> >            | 38.7%  util.py:        __get__        line 82:  return
>>>>> key
>>>>> > in self._manifest
>>>>> >            | 38.7%  context.py:     _manifest      line 1416:   
>>>>> result
>>>>> =
>>>>> > self.func(obj)
>>>>> >            | 38.7%  manifest.py:    read           line 472:   
>>>>> return
>>>>> > self._manifestctx.re...
>>>>> >              \ 25.6%  revlog.py:      revision     line 1562:  text
>>>>> =
>>>>> > rl.revision(self._node)
>>>>> >                \ 12.8%  revlog.py:      _chunks    line 2217:  bins
>>>>> =
>>>>> > self._chunks(chain, ...
>>>>> >                  | 12.0%  revlog.py:      decompressline 2112:
>>>>> > ladd(decomp(buffer(data, ch...
>>>>> >                \  7.8%  revlog.py:      checkhash  line 2232:
>>>>> > self.checkhash(text, node, ...
>>>>> >                  |  7.8%  revlog.py:      hash     line 2315:  if
>>>>> node
>>>>> > != self.hash(text, ...
>>>>> >                  |  7.8%  revlog.py:      hash     line 2242:
>>>>> return
>>>>> > hash(text, p1, p2)
>>>>> >              \ 12.0%  manifest.py:    __init__     line 1565:
>>>>> > self._data = manifestdict(t...
>>>>> >          \ 16.8%  context.py:     filenode         line 378:  if  
>>>>> not
>>>>> > _islfs(fctx.filelog(...
>>>>> >            | 15.7%  util.py:        __get__        line 706:   
>>>>> return
>>>>> > self._filelog
>>>>> >            | 14.8%  context.py:     _filelog       line 1416:   
>>>>> result
>>>>> =
>>>>> > self.func(obj)
>>>>> >            | 14.8%  localrepo.py:   file           line 629:   
>>>>> return
>>>>> > self._repo.file(self...
>>>>> >            | 14.8%  filelog.py:     __init__       line 1134:
>>>>> return
>>>>> > filelog.filelog(self...
>>>>> >            | 14.5%  revlog.py:      __init__       line 24:
>>>>> > censorable=True)
>>>>>
>>>>> Any ideas how to trim down some of this overhead?
>>>>
>>>>
>>>> You can possibly save on some of that manifest-reading time by calling
>>>> manifestlog.readfast() like changegroup (and verify, I think) does.
>>>>
>>>>
>>>>>   revset._matchfiles()
>>>>> has a comment about reading the changelog directly because of the
>>>>> overhead
>>>>> of creating changectx[1].  I think that could work here too, but  
>>>>> falls
>>>>> apart because of the need to access the filelogs too.
>>>>
>>>>
>>> I don't see changectx-creation in the profile output. What makes you
>>> think that's a significant cost here?
>>>
>>>
>>> The footnote below is to a comment that says it’s expensive to create
>>> each one, which is basically what happened here in this case. Maybe the
>>> comment is stale now?  I wasn’t sure why it would be that much more
>>> expensive either, but reading the changelog seems more direct.
>>>
>>
>> I saw that comment (in the link), but since changelog-reading doesn't  
>> seem
>> to appear in the profiling output, it doesn't seem like the first thing  
>> to
>> tackle at least. Since most of the time seems to be spent reading
>> manifests, I think readfast() might help. Can you see if the following
>> patch helps? (The special handling of the working copy is because
>> workingctx apparently doesn't have a manifestctx() method. We should
>> perhaps fix that.)
>>
>> diff --git a/hgext/lfs/wrapper.py b/hgext/lfs/wrapper.py
>> --- a/hgext/lfs/wrapper.py
>> +++ b/hgext/lfs/wrapper.py
>> @@ -259,7 +259,11 @@ def _prefetchfiles(repo, revs, match):
>>
>>      for rev in revs:
>>          ctx = repo[rev]
>> -        for f in ctx.walk(match):
>> +        if ctx.rev() is None:
>> +            manifest = ctx.manifest()
>> +        else:
>> +            manifest = ctx.manifestctx().readfast()
>> +        for f in manifest.walk(match):
>>              p = pointerfromctx(ctx, f)
>>              if p and p.oid() not in oids and not  
>> localstore.has(p.oid()):
>>                  p.filename = f
>>
>>
> Never mind, seems I patched the wrong method :P Hopefully you get the  
> idea
> and you can figure out how to do it yourself in pointersfromctx()
> and pointerfromctx(). You want to avoid the "if f not in ctx" there.

OK, thanks!  I'll submit something as an RFC after the current tests  
finish, because I saw some wildly differing results and will probably need  
help figuring out how to accurately benchmark this.  (e.g., the first run  
with the change was 9 minutes, then the next run without it 5:34, and then  
I ran with the change again, and it was 5:09.  This is from an nvme drive,  
so I can't believe caching could have that much influence.)

Is there any documentation for this sort of thing?  There are not a lot of  
function docs in this area.  I've got a high level understanding of  
changelog -> manifest -> filelog relationships, and internals.revlog  
provides some format detail.  But I'm pretty lost looking at the code.   
ctx.manifestctx().readfast() returns a manifest, and is annotated in  
localrepo with '00manifest.i'.  `hg debugdata` on that file and `hg  
manifest` both print out every single file in a revision.  But this  
manifest only contains added and modified files, so it's like it has been  
logically AND'd with the changelog.  That's very helpful here, but not at  
all obvious.

>>>   It seems like
>>>>> reading the changelog and accessing the filelogs directly here is too
>>>>> low
>>>>> level, especially with @indygreg trying to add support for  
>>>>> non-filelog
>>>>> storage.
>>>>>
>>>>> [1]
>>>>>
>>>>> https://www.mercurial-scm.org/repo/hg/file/6f38284b23f4/mercurial/revset.py#l1113
>>>>> _______________________________________________
>>>>> Mercurial-devel mailing list
>>>>> Mercurial-devel@mercurial-scm.org
>>>>> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
>>>>>

Patch

diff --git a/hgext/lfs/wrapper.py b/hgext/lfs/wrapper.py
--- a/hgext/lfs/wrapper.py
+++ b/hgext/lfs/wrapper.py
@@ -259,7 +259,11 @@  def _prefetchfiles(repo, revs, match):

     for rev in revs:
         ctx = repo[rev]
-        for f in ctx.walk(match):
+        if ctx.rev() is None:
+            manifest = ctx.manifest()
+        else:
+            manifest = ctx.manifestctx().readfast()
+        for f in manifest.walk(match):
             p = pointerfromctx(ctx, f)
             if p and p.oid() not in oids and not localstore.has(p.oid()):
                 p.filename = f