Patchwork lfs: add a progress bar when searching for blobs to upload

Submitter Matt Harbison
Date Aug. 24, 2018, 10:18 p.m.
Message ID <76eca3ae345b261c0049.1535149112@mharbison-pc.attotech.com>
Permalink /patch/34035/
State Accepted

Comments

Matt Harbison - Aug. 24, 2018, 10:18 p.m.
# HG changeset patch
# User Matt Harbison <matt_harbison@yahoo.com>
# Date 1535147146 14400
#      Fri Aug 24 17:45:46 2018 -0400
# Node ID 76eca3ae345b261c0049d16269cdf991a31af21a
# Parent  c9a3f7f5c0235e3ae35135818c48ec5ea006de37
lfs: add a progress bar when searching for blobs to upload

The search itself can take an extreme amount of time if there are a lot of
revisions involved.  I've got a local repo that took 6 minutes to push 1850
commits, and 60% of that time was spent here (there are ~70K files):

     \ 58.1%  wrapper.py:     extractpointers      line 297:  pointers = extractpointers(...
       | 57.7%  wrapper.py:     pointersfromctx    line 352:  for p in pointersfromctx(ct...
       | 57.4%  wrapper.py:     pointerfromctx     line 397:  p = pointerfromctx(ctx, f, ...
         \ 38.7%  context.py:     __contains__     line 368:  if f not in ctx:
           | 38.7%  util.py:        __get__        line 82:  return key in self._manifest
           | 38.7%  context.py:     _manifest      line 1416:  result = self.func(obj)
           | 38.7%  manifest.py:    read           line 472:  return self._manifestctx.re...
             \ 25.6%  revlog.py:      revision     line 1562:  text = rl.revision(self._node)
               \ 12.8%  revlog.py:      _chunks    line 2217:  bins = self._chunks(chain, ...
                  | 12.0%  revlog.py:      decompress line 2112:  ladd(decomp(buffer(data, ch...
               \  7.8%  revlog.py:      checkhash  line 2232:  self.checkhash(text, node, ...
                 |  7.8%  revlog.py:      hash     line 2315:  if node != self.hash(text, ...
                 |  7.8%  revlog.py:      hash     line 2242:  return hash(text, p1, p2)
             \ 12.0%  manifest.py:    __init__     line 1565:  self._data = manifestdict(t...
         \ 16.8%  context.py:     filenode         line 378:  if not _islfs(fctx.filelog(...
           | 15.7%  util.py:        __get__        line 706:  return self._filelog
           | 14.8%  context.py:     _filelog       line 1416:  result = self.func(obj)
           | 14.8%  localrepo.py:   file           line 629:  return self._repo.file(self...
           | 14.8%  filelog.py:     __init__       line 1134:  return filelog.filelog(self...
           | 14.5%  revlog.py:      __init__       line 24:  censorable=True)
Yuya Nishihara - Aug. 25, 2018, 8:21 a.m.
On Fri, 24 Aug 2018 18:18:32 -0400, Matt Harbison wrote:
> # HG changeset patch
> # User Matt Harbison <matt_harbison@yahoo.com>
> # Date 1535147146 14400
> #      Fri Aug 24 17:45:46 2018 -0400
> # Node ID 76eca3ae345b261c0049d16269cdf991a31af21a
> # Parent  c9a3f7f5c0235e3ae35135818c48ec5ea006de37
> lfs: add a progress bar when searching for blobs to upload

Queued, thanks.

> --- a/hgext/lfs/wrapper.py
> +++ b/hgext/lfs/wrapper.py
> @@ -343,11 +343,18 @@ def extractpointers(repo, revs):
>      """return a list of lfs pointers added by given revs"""
>      repo.ui.debug('lfs: computing set of blobs to upload\n')
>      pointers = {}
> -    for r in revs:
> -        ctx = repo[r]
> -        for p in pointersfromctx(ctx).values():
> -            pointers[p.oid()] = p
> -    return sorted(pointers.values())
> +
> +    progress = repo.ui.makeprogress(_('lfs search'), _('changesets'), len(revs))
> +
> +    try:

This could be "with ... as progress:", but I have no preference.
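The `with ... as progress:` form would look roughly like the sketch below. `FakeProgress` is a stand-in for the object returned by `repo.ui.makeprogress()` (which needs a real ui), and `pointersfromrev` stands in for `pointersfromctx`; the point is only that `__exit__` calls `complete()`, so the explicit `try/finally` disappears.

```python
class FakeProgress:
    """Stand-in for the object repo.ui.makeprogress() returns; the real
    one also supports the context-manager protocol, completing on exit."""

    def __init__(self, topic, unit, total):
        self.topic, self.unit, self.total = topic, unit, total
        self.pos = 0
        self.completed = False

    def increment(self):
        self.pos += 1

    def complete(self):
        self.completed = True

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        # complete() runs even if the loop body raised, like try/finally
        self.complete()

def extractpointers(revs, pointersfromrev):
    """Collect unique pointers from revs, bumping progress per changeset."""
    pointers = {}
    with FakeProgress('lfs search', 'changesets', len(revs)) as progress:
        for r in revs:
            for p in pointersfromrev(r):
                pointers[p] = p
            progress.increment()
    return sorted(pointers.values())
```

With a toy pointer function, `extractpointers([1, 2, 3], lambda r: [r % 2])` deduplicates to `[0, 1]`, and the progress object is marked complete when the `with` block exits.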
Matt Harbison - Aug. 30, 2018, 3:17 a.m.
On Fri, 24 Aug 2018 18:18:32 -0400, Matt Harbison <mharbison72@gmail.com>  
wrote:

> # HG changeset patch
> # User Matt Harbison <matt_harbison@yahoo.com>
> # Date 1535147146 14400
> #      Fri Aug 24 17:45:46 2018 -0400
> # Node ID 76eca3ae345b261c0049d16269cdf991a31af21a
> # Parent  c9a3f7f5c0235e3ae35135818c48ec5ea006de37
> lfs: add a progress bar when searching for blobs to upload
>

Any ideas how to trim down some of this overhead?  revset._matchfiles()  
has a comment about reading the changelog directly because of the overhead  
of creating changectx[1].  I think that could work here too, but falls  
apart because of the need to access the filelogs too.  It seems like  
reading the changelog and accessing the filelogs directly here is too low  
level, especially with @indygreg trying to add support for non-filelog  
storage.

[1]  
https://www.mercurial-scm.org/repo/hg/file/6f38284b23f4/mercurial/revset.py#l1113
Yuya Nishihara - Aug. 30, 2018, 11:41 a.m.
On Wed, 29 Aug 2018 23:17:45 -0400, Matt Harbison wrote:
> On Fri, 24 Aug 2018 18:18:32 -0400, Matt Harbison <mharbison72@gmail.com>  
> wrote:
> 
> > # HG changeset patch
> > # User Matt Harbison <matt_harbison@yahoo.com>
> > # Date 1535147146 14400
> > #      Fri Aug 24 17:45:46 2018 -0400
> > # Node ID 76eca3ae345b261c0049d16269cdf991a31af21a
> > # Parent  c9a3f7f5c0235e3ae35135818c48ec5ea006de37
> > lfs: add a progress bar when searching for blobs to upload
> >
> 
> Any ideas how to trim down some of this overhead?  revset._matchfiles()  
> has a comment about reading the changelog directly because of the overhead  
> of creating changectx[1].  I think that could work here too, but falls  
> apart because of the need to access the filelogs too.  It seems like  
> reading the changelog and accessing the filelogs directly here is too low  
> level, especially with @indygreg trying to add support for non-filelog  
> storage.

Is there any way to filter lfs files without reading filelog? I suspect it
would spend time scanning each filelog revision.

FWIW, I think it's okay to use storage-level API to scan lfs pointers
linearly.
Matt Harbison - Aug. 31, 2018, 4:49 a.m.
On Thu, 30 Aug 2018 07:41:02 -0400, Yuya Nishihara <yuya@tcha.org> wrote:

> On Wed, 29 Aug 2018 23:17:45 -0400, Matt Harbison wrote:
>> On Fri, 24 Aug 2018 18:18:32 -0400, Matt Harbison  
>> <mharbison72@gmail.com>
>> wrote:
>>
>> > # HG changeset patch
>> > # User Matt Harbison <matt_harbison@yahoo.com>
>> > # Date 1535147146 14400
>> > #      Fri Aug 24 17:45:46 2018 -0400
>> > # Node ID 76eca3ae345b261c0049d16269cdf991a31af21a
>> > # Parent  c9a3f7f5c0235e3ae35135818c48ec5ea006de37
>> > lfs: add a progress bar when searching for blobs to upload
>> >
>>
>> Any ideas how to trim down some of this overhead?  revset._matchfiles()
>> has a comment about reading the changelog directly because of the  
>> overhead
>> of creating changectx[1].  I think that could work here too, but falls
>> apart because of the need to access the filelogs too.  It seems like
>> reading the changelog and accessing the filelogs directly here is too  
>> low
>> level, especially with @indygreg trying to add support for non-filelog
>> storage.
>
> Is there any way to filter lfs files without reading filelog? I suspect  
> it would spend time scanning each filelog revision.

I don't think so.  It looks like REVIDX_EXTSTORED is added to the flags
when a filelog revision is added, and that's really the only marker.

> FWIW, I think it's okay to use storage-level API to scan lfs pointers
> linearly.
via Mercurial-devel - Aug. 31, 2018, 5:14 a.m.
On Wed, Aug 29, 2018 at 8:18 PM Matt Harbison <mharbison72@gmail.com> wrote:

> On Fri, 24 Aug 2018 18:18:32 -0400, Matt Harbison <mharbison72@gmail.com>
>
> wrote:
>
> > # HG changeset patch
> > # User Matt Harbison <matt_harbison@yahoo.com>
> > # Date 1535147146 14400
> > #      Fri Aug 24 17:45:46 2018 -0400
> > # Node ID 76eca3ae345b261c0049d16269cdf991a31af21a
> > # Parent  c9a3f7f5c0235e3ae35135818c48ec5ea006de37
> > lfs: add a progress bar when searching for blobs to upload
> >
>
> Any ideas how to trim down some of this overhead?


You can possibly save on some of that manifest-reading time by calling
manifestlog.readfast() like changegroup (and verify, I think) does.
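The `readfast()` idea can be illustrated with a toy model (the names below are stand-ins, not the real manifest API): when a manifest revision is stored as a delta against an already-known parent, reading just the delta touches only the files that changed, instead of rebuilding the full ~70K-entry file list for every changeset.

```python
def readfull(base, delta):
    """Reconstruct the complete manifest: every file, changed or not.
    A value of None in the delta marks a file removed in this revision."""
    full = dict(base)
    full.update({f: n for f, n in delta.items() if n is not None})
    for f, n in delta.items():
        if n is None:
            full.pop(f, None)
    return full

def readfast(base, delta):
    """Return only the entries this revision actually touched."""
    return {f: n for f, n in delta.items() if n is not None}

# Three-file manifest; one change, one add, one removal.
base = {'a.txt': 'n1', 'b.bin': 'n2', 'c.bin': 'n3'}
delta = {'b.bin': 'n4', 'd.bin': 'n5', 'a.txt': None}
```

Here `readfull` materializes all three surviving entries while `readfast` looks at just the two touched ones; with tens of thousands of files per manifest, that difference dominates.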


>   revset._matchfiles()
> has a comment about reading the changelog directly because of the
> overhead
> of creating changectx[1].  I think that could work here too, but falls
> apart because of the need to access the filelogs too.  It seems like
> reading the changelog and accessing the filelogs directly here is too low
> level, especially with @indygreg trying to add support for non-filelog
> storage.
>
> [1]
>
> https://www.mercurial-scm.org/repo/hg/file/6f38284b23f4/mercurial/revset.py#l1113
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@mercurial-scm.org
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
>
via Mercurial-devel - Aug. 31, 2018, 5:19 a.m.
On Thu, Aug 30, 2018 at 10:14 PM Martin von Zweigbergk <martinvonz@google.com> wrote:

>
>
> On Wed, Aug 29, 2018 at 8:18 PM Matt Harbison <mharbison72@gmail.com>
> wrote:
>
>> On Fri, 24 Aug 2018 18:18:32 -0400, Matt Harbison <mharbison72@gmail.com>
>>
>> wrote:
>>
>> > # HG changeset patch
>> > # User Matt Harbison <matt_harbison@yahoo.com>
>> > # Date 1535147146 14400
>> > #      Fri Aug 24 17:45:46 2018 -0400
>> > # Node ID 76eca3ae345b261c0049d16269cdf991a31af21a
>> > # Parent  c9a3f7f5c0235e3ae35135818c48ec5ea006de37
>> > lfs: add a progress bar when searching for blobs to upload
>> >
>>
>> Any ideas how to trim down some of this overhead?
>
>
> You can possibly save on some of that manifest-reading time by calling
> manifestlog.readfast() like changegroup (and verify, I think) does.
>
>
>>   revset._matchfiles()
>> has a comment about reading the changelog directly because of the
>> overhead
>> of creating changectx[1].  I think that could work here too, but falls
>> apart because of the need to access the filelogs too.
>
>
I don't see changectx-creation in the profile output. What makes you think
that's a significant cost here?


>   It seems like
>> reading the changelog and accessing the filelogs directly here is too
>> low
>> level, especially with @indygreg trying to add support for non-filelog
>> storage.
>>
>> [1]
>>
>> https://www.mercurial-scm.org/repo/hg/file/6f38284b23f4/mercurial/revset.py#l1113
>
Matt Harbison - Aug. 31, 2018, 6:02 a.m.
> On Aug 31, 2018, at 1:19 AM, Martin von Zweigbergk <martinvonz@google.com> wrote:
> 
> 
> 
>> On Thu, Aug 30, 2018 at 10:14 PM Martin von Zweigbergk <martinvonz@google.com> wrote:
>> 
>> 
>>> On Wed, Aug 29, 2018 at 8:18 PM Matt Harbison <mharbison72@gmail.com> wrote:
>>> On Fri, 24 Aug 2018 18:18:32 -0400, Matt Harbison <mharbison72@gmail.com>  
>>> wrote:
>>> 
>>> > # HG changeset patch
>>> > # User Matt Harbison <matt_harbison@yahoo.com>
>>> > # Date 1535147146 14400
>>> > #      Fri Aug 24 17:45:46 2018 -0400
>>> > # Node ID 76eca3ae345b261c0049d16269cdf991a31af21a
>>> > # Parent  c9a3f7f5c0235e3ae35135818c48ec5ea006de37
>>> > lfs: add a progress bar when searching for blobs to upload
>>> >
>>> 
>>> Any ideas how to trim down some of this overhead?
>> 
>> You can possibly save on some of that manifest-reading time by calling manifestlog.readfast() like changegroup (and verify, I think) does. 
>>  
>>>   revset._matchfiles()  
>>> has a comment about reading the changelog directly because of the overhead  
>>> of creating changectx[1].  I think that could work here too, but falls  
>>> apart because of the need to access the filelogs too.
> 
> I don't see changectx-creation in the profile output. What makes you think that's a significant cost here?

The footnote below is to a comment that says it’s expensive to create each one, which is basically what happened in this case.  Maybe the comment is stale now?  I wasn’t sure why it would be that much more expensive either, but reading the changelog seems more direct.

>>>   It seems like  
>>> reading the changelog and accessing the filelogs directly here is too low  
>>> level, especially with @indygreg trying to add support for non-filelog  
>>> storage.
>>> 
>>> [1]  
>>> https://www.mercurial-scm.org/repo/hg/file/6f38284b23f4/mercurial/revset.py#l1113
Yuya Nishihara - Aug. 31, 2018, 11:32 a.m.
On Fri, 31 Aug 2018 00:49:09 -0400, Matt Harbison wrote:
> On Thu, 30 Aug 2018 07:41:02 -0400, Yuya Nishihara <yuya@tcha.org> wrote:
> 
> > On Wed, 29 Aug 2018 23:17:45 -0400, Matt Harbison wrote:
> >> On Fri, 24 Aug 2018 18:18:32 -0400, Matt Harbison  
> >> <mharbison72@gmail.com>
> >> wrote:
> >>
> >> > # HG changeset patch
> >> > # User Matt Harbison <matt_harbison@yahoo.com>
> >> > # Date 1535147146 14400
> >> > #      Fri Aug 24 17:45:46 2018 -0400
> >> > # Node ID 76eca3ae345b261c0049d16269cdf991a31af21a
> >> > # Parent  c9a3f7f5c0235e3ae35135818c48ec5ea006de37
> >> > lfs: add a progress bar when searching for blobs to upload
> >> >
> >>
> >> Any ideas how to trim down some of this overhead?  revset._matchfiles()
> >> has a comment about reading the changelog directly because of the  
> >> overhead
> >> of creating changectx[1].  I think that could work here too, but falls
> >> apart because of the need to access the filelogs too.  It seems like
> >> reading the changelog and accessing the filelogs directly here is too  
> >> low
> >> level, especially with @indygreg trying to add support for non-filelog
> >> storage.
> >
> > Is there any way to filter lfs files without reading filelog? I suspect  
> > it would spend time scanning each filelog revision.
> 
> I don't think so.  It looks like REVIDX_EXTSTORED is added to the flags  
> when a file log revision is added, and that's really the only marker.

Maybe we can reuse the process or result of changegroup generation. I don't
have expertise, but it should at least scan the changelog and filelogs
involved. An extra cost to be added would be decompressing revisions which
are marked as REVIDX_EXTSTORED.
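A flag scan along those lines might look like this sketch. The `1 << 13` value matches Mercurial's `REVIDX_EXTSTORED` constant, but the per-revision flags iterable is a stand-in, not the real revlog index API; the idea is that only revisions carrying the flag would need their pointer text decompressed at all.

```python
# Matches mercurial/revlog.py's REVIDX_EXTSTORED at this point in time,
# but double-check against the tree you are working in.
REVIDX_EXTSTORED = 1 << 13

def lfsrevs(flags_by_rev):
    """Yield revision numbers whose index flags mark them as externally
    stored, without reconstructing any revision text."""
    for rev, flags in enumerate(flags_by_rev):
        if flags & REVIDX_EXTSTORED:
            yield rev

# Four revisions; revs 1 and 3 carry the flag (rev 3 alongside others).
flags = [0, REVIDX_EXTSTORED, 0, REVIDX_EXTSTORED | 1]
```

On this toy index, `list(lfsrevs(flags))` picks out revisions 1 and 3, so only those two would go on to the (comparatively cheap, since pointer files are tiny) decompression step.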

Patch

diff --git a/hgext/lfs/wrapper.py b/hgext/lfs/wrapper.py
--- a/hgext/lfs/wrapper.py
+++ b/hgext/lfs/wrapper.py
@@ -343,11 +343,18 @@  def extractpointers(repo, revs):
     """return a list of lfs pointers added by given revs"""
     repo.ui.debug('lfs: computing set of blobs to upload\n')
     pointers = {}
-    for r in revs:
-        ctx = repo[r]
-        for p in pointersfromctx(ctx).values():
-            pointers[p.oid()] = p
-    return sorted(pointers.values())
+
+    progress = repo.ui.makeprogress(_('lfs search'), _('changesets'), len(revs))
+
+    try:
+        for r in revs:
+            ctx = repo[r]
+            for p in pointersfromctx(ctx).values():
+                pointers[p.oid()] = p
+            progress.increment()
+        return sorted(pointers.values())
+    finally:
+        progress.complete()
 
 def pointerfromctx(ctx, f, removed=False):
     """return a pointer for the named file from the given changectx, or None if