Patchwork [RFC] revlog/changegroup: use callbacks for populating efiles

Submitter Gregory Szorc
Date July 10, 2015, 11:46 p.m.
Message ID <bca84aa6ea80b386a01c.1436572002@gps-mbp.local>
Permalink /patch/9951/
State Changes Requested

Comments

Gregory Szorc - July 10, 2015, 11:46 p.m.
# HG changeset patch
# User Gregory Szorc <gregory.szorc@gmail.com>
# Date 1436571961 25200
#      Fri Jul 10 16:46:01 2015 -0700
# Node ID bca84aa6ea80b386a01c245e7f4e4ebdb720cd8c
# Parent  648323f41a89619d9eeeb7287213378c340866c8
revlog/changegroup: use callbacks for populating efiles

This fails with:

$ HGRCPATH=/dev/null time -p ~/src/hg/hg unbundle ../f34a7120f46bdfcf9991a0c89a72753f86622df0.gzip.hg --debug --traceback
adding changesets
add changeset 8ba995b74e18
add changeset 9b2a99adc05e
transaction abort!
rollback completed
Traceback (most recent call last):
  File "/Users/gps/src/hg/mercurial/dispatch.py", line 162, in _runcatch
    return _dispatch(req)
  File "/Users/gps/src/hg/mercurial/dispatch.py", line 894, in _dispatch
    cmdpats, cmdoptions)
  File "/Users/gps/src/hg/mercurial/dispatch.py", line 655, in runcommand
    ret = _runcommand(ui, options, cmd, d)
  File "/Users/gps/src/hg/mercurial/dispatch.py", line 1011, in _runcommand
    return checkargs()
  File "/Users/gps/src/hg/mercurial/dispatch.py", line 980, in checkargs
    return cmdfunc()
  File "/Users/gps/src/hg/mercurial/dispatch.py", line 891, in <lambda>
    d = lambda: util.checksignature(func)(ui, *args, **cmdoptions)
  File "/Users/gps/src/hg/mercurial/util.py", line 792, in check
    return func(*args, **kwargs)
  File "/Users/gps/src/hg/mercurial/commands.py", line 6382, in unbundle
    'bundle:' + fname)
  File "/Users/gps/src/hg/mercurial/changegroup.py", line 761, in addchangegroup
    addrevisioncb=changelogadded)
  File "/Users/gps/src/hg/mercurial/revlog.py", line 1466, in addgroup
    addrevisioncb(self, node)
  File "/Users/gps/src/hg/mercurial/changegroup.py", line 757, in changelogadded
    efiles.update(rl.read(node)[3])
  File "/Users/gps/src/hg/mercurial/changelog.py", line 323, in read
    text = self.revision(node)
  File "/Users/gps/src/hg/mercurial/revlog.py", line 1094, in revision
    bins = self._chunks(chain)
  File "/Users/gps/src/hg/mercurial/revlog.py", line 1015, in _chunks
    ladd(decompress(buffer(data, chunkstart - offset, chunklength)))
  File "/Users/gps/src/hg/mercurial/revlog.py", line 93, in decompress
    raise RevlogError(_("revlog decompress error: %s") % str(e))
RevlogError: revlog decompress error: Error -5 while decompressing data: incomplete or truncated stream
abort: revlog decompress error: Error -5 while decompressing data: incomplete or truncated stream!
Matt Mackall - July 11, 2015, 12:03 a.m.
On Fri, 2015-07-10 at 16:46 -0700, Gregory Szorc wrote:
> # HG changeset patch
> # User Gregory Szorc <gregory.szorc@gmail.com>
> # Date 1436571961 25200
> #      Fri Jul 10 16:46:01 2015 -0700
> # Node ID bca84aa6ea80b386a01c245e7f4e4ebdb720cd8c
> # Parent  648323f41a89619d9eeeb7287213378c340866c8
> revlog/changegroup: use callbacks for populating efiles

Probably needs to use the flush() thingy that the nearby censor code is
using.
Gregory Szorc - July 11, 2015, 6:26 p.m.
On Fri, Jul 10, 2015 at 5:03 PM, Matt Mackall <mpm@selenic.com> wrote:

> On Fri, 2015-07-10 at 16:46 -0700, Gregory Szorc wrote:
> > # HG changeset patch
> > # User Gregory Szorc <gregory.szorc@gmail.com>
> > # Date 1436571961 25200
> > #      Fri Jul 10 16:46:01 2015 -0700
> > # Node ID bca84aa6ea80b386a01c245e7f4e4ebdb720cd8c
> > # Parent  648323f41a89619d9eeeb7287213378c340866c8
> > revlog/changegroup: use callbacks for populating efiles
>
> Probably needs to use the flush() thingy that the nearby censor code is
> using.
>

Calling flush() makes the error go away!

While my SSD equipped MBP didn't appear to exhibit a slowdown unbundling
mozilla-central, flushing file descriptors after applying each changeset
feels a bit excessive and it makes me uncomfortable due to potential
performance implications. I just don't think an extra I/O related syscall
should be required for this feature.

But you have a stronger grasp on these things than me. Is calling flush()
after every changeset add acceptable? Keep in mind this will be the only
behavior for addchangegroup(), regardless of whether progress bars are
displayed.
Matt Mackall - July 11, 2015, 8:17 p.m.
On Sat, 2015-07-11 at 11:26 -0700, Gregory Szorc wrote:
> On Fri, Jul 10, 2015 at 5:03 PM, Matt Mackall <mpm@selenic.com> wrote:
> 
> > On Fri, 2015-07-10 at 16:46 -0700, Gregory Szorc wrote:
> > > # HG changeset patch
> > > # User Gregory Szorc <gregory.szorc@gmail.com>
> > > # Date 1436571961 25200
> > > #      Fri Jul 10 16:46:01 2015 -0700
> > > # Node ID bca84aa6ea80b386a01c245e7f4e4ebdb720cd8c
> > > # Parent  648323f41a89619d9eeeb7287213378c340866c8
> > > revlog/changegroup: use callbacks for populating efiles
> >
> > Probably needs to use the flush() thingy that the nearby censor code is
> > using.
> >
> 
> Calling flush() makes the error go away!
> 
> While my SSD equipped MBP didn't appear to exhibit a slowdown unbundling
> mozilla-central, flushing file descriptors after applying each changeset
> feels a bit excessive and it makes me uncomfortable due to potential
> performance implications. I just don't think an extra I/O related syscall
> should be required for this feature.

The flush is only a flush from the userspace file buffers to kernel-side
buffers, it won't actually force I/O. It'll still be lazily written by
the kernel's caching layer at roughly the same rate as without the
buffering because the userspace buffers are tiny (4k) relative to a modern
system's disk cache. But yes, it is a syscall and syscalls aren't free.
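
The distinction is easy to demonstrate outside Mercurial. In this standalone sketch (plain Python, not hg code), a second handle on the same file cannot see bytes still sitting in the writer's userspace buffer until flush() pushes them kernel-side:

```python
import os
import tempfile

# Standalone illustration (not Mercurial code): a reader handle cannot
# see bytes still sitting in another handle's userspace write buffer
# until flush() hands them to the kernel.
fd, path = tempfile.mkstemp()
os.close(fd)

w = open(path, 'ab')        # buffered writer, like revlog's ifh
w.write(b'revision data')   # lands in the userspace buffer (< 8k default)

r = open(path, 'rb')
before = r.read()           # b'' - the write is not visible yet

w.flush()                   # one syscall: userspace buffer -> kernel
r.seek(0)
after = r.read()            # b'revision data' - now visible

w.close()
r.close()
os.unlink(path)
```

This is the same visibility gap the traceback above runs into: the changelog callback reads back a revision whose compressed chunk is still buffered in the write handle.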

> But you have a stronger grasp on these things than me. Is calling flush()
> after every changeset add acceptable? Keep in mind this will be the only
> behavior for addchangegroup(), regardless of whether progress bars are
> displayed.

Well, it can be conditional on there being a callback.

In the slightly bigger picture, we might want to make the revlog object
aware of the file handles it might need to flush so that rather than
explicitly flushing on the write side, the read side can flush only if
needed. Something like:

for f in self._flushonread:
    f.flush()

..with the appropriate try/finally around the writers.

In the even bigger still picture, we might want to make it optionally
possible for revlog to hold onto file handles for the object's lifetime
to reduce syscall count and filesystem lookup overhead. Historically we
don't do this because Windows.
Gregory Szorc - July 11, 2015, 10:26 p.m.
On Sat, Jul 11, 2015 at 1:17 PM, Matt Mackall <mpm@selenic.com> wrote:

> On Sat, 2015-07-11 at 11:26 -0700, Gregory Szorc wrote:
> > On Fri, Jul 10, 2015 at 5:03 PM, Matt Mackall <mpm@selenic.com> wrote:
> >
> > > On Fri, 2015-07-10 at 16:46 -0700, Gregory Szorc wrote:
> > > > # HG changeset patch
> > > > # User Gregory Szorc <gregory.szorc@gmail.com>
> > > > # Date 1436571961 25200
> > > > #      Fri Jul 10 16:46:01 2015 -0700
> > > > # Node ID bca84aa6ea80b386a01c245e7f4e4ebdb720cd8c
> > > > # Parent  648323f41a89619d9eeeb7287213378c340866c8
> > > > revlog/changegroup: use callbacks for populating efiles
> > >
> > > Probably needs to use the flush() thingy that the nearby censor code is
> > > using.
> > >
> >
> > Calling flush() makes the error go away!
> >
> > While my SSD equipped MBP didn't appear to exhibit a slowdown unbundling
> > mozilla-central, flushing file descriptors after applying each changeset
> > feels a bit excessive and it makes me uncomfortable due to potential
> > performance implications. I just don't think an extra I/O related syscall
> > should be required for this feature.
>
> The flush is only a flush from the userspace file buffers to kernel-side
> buffers, it won't actually force I/O. It'll still be lazily written by
> the kernel's caching layer at roughly the same rate as without the
> buffering because the userspace buffers are tiny (4k) relative to a modern
> system's disk cache. But yes, it is a syscall and syscalls aren't free.
>
> > But you have a stronger grasp on these things than me. Is calling flush()
> > after every changeset add acceptable? Keep in mind this will be the only
> > behavior for addchangegroup(), regardless of whether progress bars are
> > displayed.
>
> Well, it can be conditional on there being a callback.
>

Right. But for changegroup.addchangegroup(), there will always be a
callback so this will be the default behavior for many cases.


>
> In the slightly bigger picture, we might want to make the revlog object
> aware of the file handles it might need to flush so that rather than
> explicitly flushing on the write side, the read side can flush only if
> needed. Something like:
>
> for f in self._flushonread:
>     f.flush()
>
> ..with the appropriate try/finally around the writers.
>
> In the even bigger still picture, we might want to make it optionally
> possible for revlog to hold onto file handles for the object's lifetime
> to reduce syscall count and filesystem lookup overhead. Historically we
> don't do this because Windows.
>

Looking at things a bit more, my original patch series also added
additional flushes because buildtext() inside revlog._addrevision() does
flushing and my "return text" patch made the call to buildtext()
unconditional.

Tracing the system calls when applying a bunch of changesets, I'm seeing
what I think is "odd" behavior. .hg/store/00changelog.i.a is opened,
fstat()d twice, lseek()d twice, read() a few times, maybe write()n, and
closed. This cycle repeats over and over. I'm still trying to grok what is
happening. But it certainly doesn't feel very efficient. I suspect it has
something to do with the flushing, which happens every couple of changesets
inside buildtext(). I'm pretty sure this corresponds to starting/ending a
delta chain.

This is the first time I've dug into the bowels of revlogs. It's a bit
overwhelming. On the bright side, I think I'm starting to identify some
potential performance improvements!
Gregory Szorc - July 13, 2015, 4:16 p.m.
On Sat, Jul 11, 2015 at 3:26 PM, Gregory Szorc <gregory.szorc@gmail.com>
wrote:

> [...]
> Looking at things a bit more, my original patch series also added
> additional flushes because buildtext() inside revlog._addrevision() does
> flushing and my "return text" patch made the call to buildtext()
> unconditional.
>
> Tracing the system calls when applying a bunch of changesets, I'm seeing
> what I think is "odd" behavior. .hg/store/00changelog.i.a is opened,
> fstat()d twice, lseek()d twice, read() a few times, maybe write()n, and
> closed. This cycle repeats over and over. I'm still trying to grok what is
> happening. But it certainly doesn't feel very efficient. I suspect it has
> something to do with the flushing, which happens every couple of changesets
> inside buildtext(). I'm pretty sure this corresponds to starting/ending a
> delta chain.
>
> This is the first time I've dug into the bowels of revlogs. It's a bit
> overwhelming. On the bright side, I think I'm starting to identify some
> potential performance improvements!
>

I tracked down the source of all the extra opens. The stack looks like this:

  /Users/gps/src/hg/mercurial/changegroup.py(757)addchangegroup()
-> srccontent = cl.addgroup(source, csmap, trp)
  /Users/gps/src/hg/mercurial/revlog.py(1461)addgroup()
-> ifh, dfh)
  /Users/gps/src/hg/mercurial/revlog.py(1346)_addrevision()
-> text = buildtext()
  /Users/gps/src/hg/mercurial/revlog.py(1265)buildtext()
-> basetext = self.revision(self.node(baserev))
  /Users/gps/src/hg/mercurial/revlog.py(1094)revision()
-> bins = self._chunks(chain)
  /Users/gps/src/hg/mercurial/revlog.py(1001)_chunks()
-> self._chunkraw(revs[0], revs[-1])
  /Users/gps/src/hg/mercurial/revlog.py(976)_chunkraw()
-> return self._getchunk(start, length)
  /Users/gps/src/hg/mercurial/revlog.py(967)_getchunk()
-> return self._loadchunk(offset, length)
  /Users/gps/src/hg/mercurial/revlog.py(936)_loadchunk()
-> df = self.opener(self.indexfile)

Essentially, every time we start/end a new delta chain, we need to flush
the revlog (+ index), open a new file handle, and read in revisions
necessary to construct the full revision text. In the case of the changelog
during addgroup() (where delta chains are typically 2-4 in length), this
results in tons of extra file I/O. On mozilla-central, we perform ~100,000
flush + opens over ~250,000 changesets.
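
The reconstruction step that forces those reads can be sketched as follows. This is a simplified stand-in for revlog's buildtext()/mdiff machinery, using plain (start, end, data) fragments rather than the real binary delta format:

```python
def applydelta(text, fragments):
    # Apply one delta: each (start, end, data) fragment replaces
    # text[start:end] with data. Simplified stand-in for mdiff.patch().
    out = []
    last = 0
    for start, end, data in fragments:
        out.append(text[last:start])
        out.append(data)
        last = end
    out.append(text[last:])
    return b''.join(out)

def buildtext(basetext, chain):
    # Walk the delta chain from base to tip, as revlog.revision() does.
    # Every delta in the chain must first be read back from disk, which
    # is what triggers the flush + reopen cycle described above.
    text = basetext
    for fragments in chain:
        text = applydelta(text, fragments)
    return text
```

With changelog delta chains only a few entries long, nearly every added changeset starts or ends a chain, which is how the short chains translate into the reopen counts above.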

I'll continue to poke around with creative solutions that don't regress
performance.
Matt Mackall - July 13, 2015, 9:06 p.m.
On Mon, 2015-07-13 at 09:16 -0700, Gregory Szorc wrote:
> [...]
> I tracked down the source of all the extra opens. The stack looks like this:
> 
>   /Users/gps/src/hg/mercurial/changegroup.py(757)addchangegroup()
> -> srccontent = cl.addgroup(source, csmap, trp)
>   /Users/gps/src/hg/mercurial/revlog.py(1461)addgroup()
> -> ifh, dfh)
>   /Users/gps/src/hg/mercurial/revlog.py(1346)_addrevision()
> -> text = buildtext()
>   /Users/gps/src/hg/mercurial/revlog.py(1265)buildtext()
> -> basetext = self.revision(self.node(baserev))
>   /Users/gps/src/hg/mercurial/revlog.py(1094)revision()
> -> bins = self._chunks(chain)
>   /Users/gps/src/hg/mercurial/revlog.py(1001)_chunks()
> -> self._chunkraw(revs[0], revs[-1])
>   /Users/gps/src/hg/mercurial/revlog.py(976)_chunkraw()
> -> return self._getchunk(start, length)
>   /Users/gps/src/hg/mercurial/revlog.py(967)_getchunk()
> -> return self._loadchunk(offset, length)
>   /Users/gps/src/hg/mercurial/revlog.py(936)_loadchunk()
> -> df = self.opener(self.indexfile)
> 
> Essentially, every time we start/end a new delta chain, we need to flush
> the revlog (+ index), open a new file handle, and read in revisions
> necessary to construct the full revision text. In the case of the changelog
> during addgroup() (where delta chains are typically 2-4 in length), this
> results in tons of extra file I/O. On mozilla-central, we perform ~100,000
> flush + opens over ~250,000 changesets.
> 
> I'll continue to poke around with creative solutions that don't regress
> performance.

One possibility would be to prime the chunk cache as we go.
Gregory Szorc - July 13, 2015, 11:36 p.m.
On Mon, Jul 13, 2015 at 2:06 PM, Matt Mackall <mpm@selenic.com> wrote:

> [...]
> One possibility would be to prime the chunk cache as we go.
>

I tried this and it reduced performance significantly. I suspect that
because the chunk cache is an immutable str, priming it as we went
resulted in tons of allocations, copies, and GC pressure.

The next thing I want to try is to cache the full text revision as we go
(probably only for changelog, since doing this for large manifests will be
slow). If that doesn't pan out, rewriting the chunk cache as a mutable
buffer (potentially in C) or restructuring the chunk cache so it is an
iterable of segments feel like viable alternatives.

Also, the chunk cache feels somewhat primitive to me, e.g. it throws away
old data instead of maintaining a sliding window (
https://selenic.com/repo/hg/file/tip/mercurial/revlog.py#l926). It feels
like there is room to optimize this cache. I wouldn't be surprised if
making it a bit smarter would benefit read performance as well.
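
For reference, the shape being criticized is roughly a single-window cache: one contiguous segment, discarded wholesale on any miss. A minimal sketch (hypothetical names, not the real revlog code):

```python
class ChunkCacheSketch(object):
    """Sketch of a single-window read cache in the spirit of revlog's
    chunk cache: one contiguous segment is kept, and any miss throws it
    away and loads a fresh aligned window. Hypothetical names throughout."""

    CHUNK = 65536  # window granularity, like revlog's 64k cache chunks

    def __init__(self, readfn):
        self._read = readfn    # readfn(offset, length) -> bytes
        self._cache = None     # (offset, data) of the last loaded window

    def get(self, offset, length):
        if self._cache is not None:
            o, d = self._cache
            if o <= offset and offset + length <= o + len(d):
                rel = offset - o
                return d[rel:rel + length]          # cache hit
        # miss: discard the old window (no sliding) and load a new
        # aligned window covering the requested range
        start = offset - (offset % self.CHUNK)
        end = ((offset + length + self.CHUNK - 1) // self.CHUNK) * self.CHUNK
        data = self._read(start, end - start)
        self._cache = (start, data)
        rel = offset - start
        return data[rel:rel + length]
```

A smarter variant could keep several windows, or slide the window instead of replacing it, which is the kind of optimization hinted at above.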

Patch

diff --git a/mercurial/changegroup.py b/mercurial/changegroup.py
--- a/mercurial/changegroup.py
+++ b/mercurial/changegroup.py
@@ -752,16 +752,18 @@  def addchangegroup(repo, source, srctype
                                  total=self._total)
                 self._count += 1
         source.callback = prog(_('changesets'), expectedtotal)
 
+        def changelogadded(rl, node):
+            efiles.update(rl.read(node)[3])
+
         source.changelogheader()
-        srccontent = cl.addgroup(source, csmap, trp)
+        srccontent = cl.addgroup(source, csmap, trp,
+                                 addrevisioncb=changelogadded)
         if not (srccontent or emptyok):
             raise util.Abort(_("received changelog group is empty"))
         clend = len(cl)
         changesets = clend - clstart
-        for c in xrange(clstart, clend):
-            efiles.update(repo[c].files())
         efiles = len(efiles)
         repo.ui.progress(_('changesets'), None)
 
         # pull off the manifest group
diff --git a/mercurial/revlog.py b/mercurial/revlog.py
--- a/mercurial/revlog.py
+++ b/mercurial/revlog.py
@@ -1378,15 +1378,18 @@  class revlog(object):
             ifh.write(data[0])
             ifh.write(data[1])
             self.checkinlinesize(transaction, ifh)
 
-    def addgroup(self, bundle, linkmapper, transaction):
+    def addgroup(self, bundle, linkmapper, transaction, addrevisioncb=None):
         """
         add a delta group
 
         given a set of deltas, add them to the revision log. the
         first delta is against its parent, which should be in our
         log, the rest are against the previous delta.
+
+        An optional ``addrevisioncb`` will be called after each added revision.
+        Its arguments are this revlog instance and the added node.
         """
 
         # track the base of the current delta log
         content = []
@@ -1458,8 +1461,10 @@  class revlog(object):
 
                 chain = self._addrevision(node, None, transaction, link,
                                           p1, p2, flags, (baserev, delta),
                                           ifh, dfh)
+                if addrevisioncb:
+                    addrevisioncb(self, node)
                 if not dfh and not self._inline:
                     # addrevision switched from inline to conventional
                     # reopen the index
                     ifh.close()