Patchwork localrepo: introduce persistent caching of revset revision's branch names

login
register
mail settings
Submitter Mads Kiilerich
Date Oct. 15, 2014, 12:33 a.m.
Message ID <85189122b49b31dddbcd.1413333192@ssl.google-analytics.com>
Download mbox | patch
Permalink /patch/6265/
State Superseded
Headers show

Comments

Mads Kiilerich - Oct. 15, 2014, 12:33 a.m.
# HG changeset patch
# User Mads Kiilerich <madski@unity3d.com>
# Date 1413333190 -7200
#      Wed Oct 15 02:33:10 2014 +0200
# Node ID 85189122b49b31dddbcddb7c0925afa019dc4403
# Parent  48c0b101a9de1fdbd638daa858da845cd05a6be7
localrepo: introduce persistent caching of revset revision's branch names

It is expensive to create a changectx and extract the branch name. That shows
up when filtering on branches in revsets.

To speed things up, cache the results on disk. To avoid using too much space,
all branch names are only stored once and each revision references one of these
names. To verify that the cache is valid, we also store the tip hash in the
cache file.

Test case, run on the relatively small hg repo:
  rm -f .hg/cache/branchnames
  HGRCPATH=/dev/null ./hg --time log -r 'branch(stable)&branch(default)'
  HGRCPATH=/dev/null ./hg --time log -r 'branch(stable)&branch(default)'
  HGRCPATH=/dev/null ./hg --time log -r 'branch(stable)&branch(default)'

Before:

time: real 1.230 secs (user 1.220+0.000 sys 0.010+0.000)
time: real 1.200 secs (user 1.180+0.000 sys 0.020+0.000)
time: real 1.180 secs (user 1.170+0.000 sys 0.010+0.000)

After (3 runs):

time: real 1.330 secs (user 1.320+0.000 sys 0.010+0.000)
time: real 0.080 secs (user 0.080+0.000 sys 0.000+0.000)
time: real 0.080 secs (user 0.080+0.000 sys 0.000+0.000)

time: real 1.310 secs (user 1.310+0.000 sys 0.020+0.000)
time: real 0.070 secs (user 0.080+0.000 sys 0.000+0.000)
time: real 0.080 secs (user 0.080+0.000 sys 0.000+0.000)

time: real 1.300 secs (user 1.280+0.000 sys 0.020+0.000)
time: real 0.070 secs (user 0.070+0.000 sys 0.000+0.000)
time: real 0.080 secs (user 0.090+0.000 sys 0.000+0.000)

Tests on bigger repo shows the same: First run is slightly slower while not
getting cache hits, next run will be up to 20 times faster.
Matt Mackall - Oct. 15, 2014, 12:43 a.m.
On Wed, 2014-10-15 at 02:33 +0200, Mads Kiilerich wrote:
> # HG changeset patch
> # User Mads Kiilerich <madski@unity3d.com>
> # Date 1413333190 -7200
> #      Wed Oct 15 02:33:10 2014 +0200
> # Node ID 85189122b49b31dddbcddb7c0925afa019dc4403
> # Parent  48c0b101a9de1fdbd638daa858da845cd05a6be7
> localrepo: introduce persistent caching of revset revision's branch names
> 
> It is expensive to create a changectx and extract the branch name. That shows
> up when filtering on branches in revsets.
> 
> To speed things up, cache the results on disk. To avoid using too much space,
> all branch names are only stored once and each revision references one of these
> names. To verify that the cache is valid, we also store the tip hash in the
> cache file.

If we're going to add such a cache, I think it needs to not need
rebuilding across a strip.
Pierre-Yves David - Oct. 15, 2014, 12:45 a.m.
On 10/14/2014 05:33 PM, Mads Kiilerich wrote:
> # HG changeset patch
> # User Mads Kiilerich <madski@unity3d.com>
> # Date 1413333190 -7200
> #      Wed Oct 15 02:33:10 2014 +0200
> # Node ID 85189122b49b31dddbcddb7c0925afa019dc4403
> # Parent  48c0b101a9de1fdbd638daa858da845cd05a6be7
> localrepo: introduce persistent caching of revset revision's branch names
>
> It is expensive to create a changectx and extract the branch name. That shows
> up when filtering on branches in revsets.
>
> To speed things up, cache the results on disk. To avoid using too much space,
> all branch names are only stored once and each revision references one of these
> names. To verify that the cache is valid, we also store the tip hash in the
> cache file.

I cannot see what is your invalidation strategy here. (no cache key). 
The use in revset should go in another changesets. And we could use this 
in the branchmap computation too. (even if this is probably only about 
filling the cache)
Pierre-Yves David - Oct. 15, 2014, 12:51 a.m.
On 10/14/2014 05:43 PM, Matt Mackall wrote:
> On Wed, 2014-10-15 at 02:33 +0200, Mads Kiilerich wrote:
>> # HG changeset patch
>> # User Mads Kiilerich <madski@unity3d.com>
>> # Date 1413333190 -7200
>> #      Wed Oct 15 02:33:10 2014 +0200
>> # Node ID 85189122b49b31dddbcddb7c0925afa019dc4403
>> # Parent  48c0b101a9de1fdbd638daa858da845cd05a6be7
>> localrepo: introduce persistent caching of revset revision's branch names
>>
>> It is expensive to create a changectx and extract the branch name. That shows
>> up when filtering on branches in revsets.
>>
>> To speed things up, cache the results on disk. To avoid using too much space,
>> all branch names are only stored once and each revision references one of these
>> names. To verify that the cache is valid, we also store the tip hash in the
>> cache file.
>
> If we're going to add such a cache, I think it needs to not need
> rebuilding across a strip.

I'm not sure I get you. Do you mean you want the cache to be permanent 
(so using hash as key instead of rev?) Or do you want it to to be 
properly invalidated in case of strip (some kind of cache key) or any of 
the previous or something else.
Matt Mackall - Oct. 15, 2014, 12:58 a.m.
On Tue, 2014-10-14 at 17:51 -0700, Pierre-Yves David wrote:
> 
> On 10/14/2014 05:43 PM, Matt Mackall wrote:
> > On Wed, 2014-10-15 at 02:33 +0200, Mads Kiilerich wrote:
> >> # HG changeset patch
> >> # User Mads Kiilerich <madski@unity3d.com>
> >> # Date 1413333190 -7200
> >> #      Wed Oct 15 02:33:10 2014 +0200
> >> # Node ID 85189122b49b31dddbcddb7c0925afa019dc4403
> >> # Parent  48c0b101a9de1fdbd638daa858da845cd05a6be7
> >> localrepo: introduce persistent caching of revset revision's branch names
> >>
> >> It is expensive to create a changectx and extract the branch name. That shows
> >> up when filtering on branches in revsets.
> >>
> >> To speed things up, cache the results on disk. To avoid using too much space,
> >> all branch names are only stored once and each revision references one of these
> >> names. To verify that the cache is valid, we also store the tip hash in the
> >> cache file.
> >
> > If we're going to add such a cache, I think it needs to not need
> > rebuilding across a strip.
> 
> I'm not sure I get you. Do you mean you want the cache to be permanent 
> (so using hash as key instead of rev?) Or do you want it to to be 
> properly invalidated in case of strip (some kind of cache key) or any of 
> the previous or something else.

If it takes a minute to build the cache, and takes a minute to rebuild
it after strip.. then it needs rethinking.
Mads Kiilerich - Oct. 15, 2014, 2:58 a.m.
On 10/15/2014 02:58 AM, Matt Mackall wrote:
> On Tue, 2014-10-14 at 17:51 -0700, Pierre-Yves David wrote:
>> On 10/14/2014 05:43 PM, Matt Mackall wrote:
>>> On Wed, 2014-10-15 at 02:33 +0200, Mads Kiilerich wrote:
>>>> # HG changeset patch
>>>> # User Mads Kiilerich <madski@unity3d.com>
>>>> # Date 1413333190 -7200
>>>> #      Wed Oct 15 02:33:10 2014 +0200
>>>> # Node ID 85189122b49b31dddbcddb7c0925afa019dc4403
>>>> # Parent  48c0b101a9de1fdbd638daa858da845cd05a6be7
>>>> localrepo: introduce persistent caching of revset revision's branch names
>>>>
>>>> It is expensive to create a changectx and extract the branch name. That shows
>>>> up when filtering on branches in revsets.
>>>>
>>>> To speed things up, cache the results on disk. To avoid using too much space,
>>>> all branch names are only stored once and each revision references one of these
>>>> names. To verify that the cache is valid, we also store the tip hash in the
>>>> cache file.
>>> If we're going to add such a cache, I think it needs to not need
>>> rebuilding across a strip.
>> I'm not sure I get you. Do you mean you want the cache to be permanent
>> (so using hash as key instead of rev?) Or do you want it to to be
>> properly invalidated in case of strip (some kind of cache key) or any of
>> the previous or something else.
> If it takes a minute to build the cache, and takes a minute to rebuild
> it after strip.. then it needs rethinking.

I think the root cause of the problem is that it takes a minute to 
retrieve this information. The best solution would be if the data could 
be stored in a way where there was no need for this. That would be a 
different trade-off in the basic design of Mercurial and is out of scope.

Another root cause of the problem is that strip works against the basic 
append-only design. I dislike the intrusiveness of phases and obsoletion 
markers but I guess something like that is better than strip. I think it 
is fair enough that a real garbage collection invalidates all caches.

This cache I propose works efficiently with the normal append-only mode 
of operation. The size of the cachefile is currently 4 times the number 
of revisions in the repo (plus each branch name once). (Btw: this array 
is a perfect use case for blosc. The compressed size could probably in 
most cases be significantly smaller than the number of revisions.) The 
current design also works efficiently when operating in rev-land.

The cache file _could_ also store the hash of each revision so it could 
reuse as much as possible after a strip. That would make it at least 5 
times bigger and less efficient. Would you prefer that? Or perhaps some 
"random" synchronization points could be a good trade-off?

A "better" solution could be to maintain it at a lower level, 
integrating it closely with the changelog index. That would however be 
too intrusive, in my opinion.

But mainly: This is just a cache, build on demand. There is no kind of 
lock-in for using it. It can always be ripped out and replaced with 
something else without any other cost that would put us in a worse 
position than if we didn't have it.

It is annoying if a strip requires a rebuild that takes a minute. It is 
however a much bigger problem that a simple query for all changesets on 
a branch takes a minute every time. This patch is about addressing the 
bigger problem. It can always be improved to / replaced by something 
that also solves the smaller problems.

/Mads
Martin von Zweigbergk - Oct. 15, 2014, 4:27 a.m.
On Tue, Oct 14, 2014 at 7:58 PM, Mads Kiilerich <mads@kiilerich.com> wrote:
> On 10/15/2014 02:58 AM, Matt Mackall wrote:
>>
>> On Tue, 2014-10-14 at 17:51 -0700, Pierre-Yves David wrote:
>>>
>>> On 10/14/2014 05:43 PM, Matt Mackall wrote:
>>>>
>>>> On Wed, 2014-10-15 at 02:33 +0200, Mads Kiilerich wrote:
>>>>>
>>>>> # HG changeset patch
>>>>> # User Mads Kiilerich <madski@unity3d.com>
>>>>> # Date 1413333190 -7200
>>>>> #      Wed Oct 15 02:33:10 2014 +0200
>>>>> # Node ID 85189122b49b31dddbcddb7c0925afa019dc4403
>>>>> # Parent  48c0b101a9de1fdbd638daa858da845cd05a6be7
>>>>> localrepo: introduce persistent caching of revset revision's branch
>>>>> names
>>>>>
>>>>> It is expensive to create a changectx and extract the branch name. That
>>>>> shows
>>>>> up when filtering on branches in revsets.

I'm curious what takes most of this time. My limited understanding
suggests only the changelog would be read (no manifest revlog, no
filelog). Correct? Sorry if this is obvious to others. Feel free to
take the discussion elsewhere if that makes more sense.
Gregory Szorc - Oct. 15, 2014, 4:53 a.m.
On 10/14/14 7:58 PM, Mads Kiilerich wrote:
> On 10/15/2014 02:58 AM, Matt Mackall wrote:
>> On Tue, 2014-10-14 at 17:51 -0700, Pierre-Yves David wrote:
>>> On 10/14/2014 05:43 PM, Matt Mackall wrote:
>>>> On Wed, 2014-10-15 at 02:33 +0200, Mads Kiilerich wrote:
>>>>> # HG changeset patch
>>>>> # User Mads Kiilerich <madski@unity3d.com>
>>>>> # Date 1413333190 -7200
>>>>> #      Wed Oct 15 02:33:10 2014 +0200
>>>>> # Node ID 85189122b49b31dddbcddb7c0925afa019dc4403
>>>>> # Parent  48c0b101a9de1fdbd638daa858da845cd05a6be7
>>>>> localrepo: introduce persistent caching of revset revision's branch
>>>>> names
>>>>>
>>>>> It is expensive to create a changectx and extract the branch name.
>>>>> That shows
>>>>> up when filtering on branches in revsets.
>>>>>
>>>>> To speed things up, cache the results on disk. To avoid using too
>>>>> much space,
>>>>> all branch names are only stored once and each revision references
>>>>> one of these
>>>>> names. To verify that the cache is valid, we also store the tip
>>>>> hash in the
>>>>> cache file.
>>>> If we're going to add such a cache, I think it needs to not need
>>>> rebuilding across a strip.
>>> I'm not sure I get you. Do you mean you want the cache to be permanent
>>> (so using hash as key instead of rev?) Or do you want it to to be
>>> properly invalidated in case of strip (some kind of cache key) or any of
>>> the previous or something else.
>> If it takes a minute to build the cache, and takes a minute to rebuild
>> it after strip.. then it needs rethinking.
>
> I think the root cause of the problem is that it takes a minute to
> retrieve this information. The best solution would be if the data could
> be stored in a way where there was no need for this. That would be a
> different trade-off in the basic design of Mercurial and is out of scope.
>
> Another root cause of the problem is that strip works against the basic
> append-only design. I dislike the intrusiveness of phases and obsoletion
> markers but I guess something like that is better than strip. I think it
> is fair enough that a real garbage collection invalidates all caches.
>
> This cache I propose works efficiently with the normal append-only mode
> of operation. The size of the cachefile is currently 4 times the number
> of revisions in the repo (plus each branch name once). (Btw: this array
> is a perfect use case for blosc. The compressed size could probably in
> most cases be significantly smaller than the number of revisions.) The
> current design also works efficiently when operating in rev-land.
>
> The cache file _could_ also store the hash of each revision so it could
> reuse as much as possible after a strip. That would make it at least 5
> times bigger and less efficient. Would you prefer that? Or perhaps some
> "random" synchronization points could be a good trade-off?
>
> A "better" solution could be to maintain it at a lower level,
> integrating it closely with the changelog index. That would however be
> too intrusive, in my opinion.
>
> But mainly: This is just a cache, build on demand. There is no kind of
> lock-in for using it. It can always be ripped out and replaced with
> something else without any other cost that would put us in a worse
> position than if we didn't have it.
>
> It is annoying if a strip requires a rebuild that takes a minute. It is
> however a much bigger problem that a simple query for all changesets on
> a branch takes a minute every time. This patch is about addressing the
> bigger problem. It can always be improved to / replaced by something
> that also solves the smaller problems.

Append only is a nice ideal and just that: an ideal. Things like 
transaction rollbacks are effectively strips. And transaction rollbacks 
can happen when e.g. a server-side hook rejects a push. And if that hook 
(or a hook that ran before) accesses branch data and causes a cache 
update that would trigger invalidation on rollback, the next 
unsuspecting user triggers a fresh cache rebuild and experiences extreme 
latency (if the repo is moderately sized).

We see this at Mozilla with rejected pushes causing branchcache 
rebuilds. On our Try repo, this leads to CPU exhaustion from multiple 
clients all triggering cache generation simultaneously (because there is 
no lock on the cache). This is why we spent time at the Summit coming up 
with a better way to add non-revlog files into transactions.

Whatever you do, please avoid full cache (re)populations wherever possible.
Pierre-Yves David - Oct. 15, 2014, 7:01 a.m.
On 10/14/2014 09:53 PM, Gregory Szorc wrote:
> Append only is a nice ideal and just that: an ideal. Things like
> transaction rollbacks are effectively strips. And transaction rollbacks
> can happen when e.g. a server-side hook rejects a push. And if that hook
> (or a hook that ran before) accesses branch data and causes a cache
> update that would trigger invalidation on rollback, the next
> unsuspecting user triggers a fresh cache rebuild and experiences extreme
> latency (if the repo is moderately sized).

Note that we have plan (and will soon be working on) having this cache 
this cache transactional and gracefully not impacted by aborted transaction.
Pierre-Yves David - Oct. 15, 2014, 7:04 a.m.
On 10/14/2014 09:27 PM, Martin von Zweigbergk wrote:
> On Tue, Oct 14, 2014 at 7:58 PM, Mads Kiilerich <mads@kiilerich.com> wrote:
>> On 10/15/2014 02:58 AM, Matt Mackall wrote:
>>>
>>> On Tue, 2014-10-14 at 17:51 -0700, Pierre-Yves David wrote:
>>>>
>>>> On 10/14/2014 05:43 PM, Matt Mackall wrote:
>>>>>
>>>>> On Wed, 2014-10-15 at 02:33 +0200, Mads Kiilerich wrote:
>>>>>>
>>>>>> # HG changeset patch
>>>>>> # User Mads Kiilerich <madski@unity3d.com>
>>>>>> # Date 1413333190 -7200
>>>>>> #      Wed Oct 15 02:33:10 2014 +0200
>>>>>> # Node ID 85189122b49b31dddbcddb7c0925afa019dc4403
>>>>>> # Parent  48c0b101a9de1fdbd638daa858da845cd05a6be7
>>>>>> localrepo: introduce persistent caching of revset revision's branch
>>>>>> names
>>>>>>
>>>>>> It is expensive to create a changectx and extract the branch name. That
>>>>>> shows
>>>>>> up when filtering on branches in revsets.
>
> I'm curious what takes most of this time. My limited understanding
> suggests only the changelog would be read (no manifest revlog, no
> filelog). Correct? Sorry if this is obvious to others. Feel free to
> take the discussion elsewhere if that makes more sense.

There is three main cost centers:

- reading the changelog data for all commit
- decoding the "extra" field to extract the branch name
- creating a changectx (changeset context object) for each entry.

I'm curious about how each of those cost rank compared to each other 
(and If I missed one).
Mads Kiilerich - Oct. 15, 2014, 11:40 a.m.
On 10/15/2014 02:45 AM, Pierre-Yves David wrote:
>
>
> On 10/14/2014 05:33 PM, Mads Kiilerich wrote:
>> # HG changeset patch
>> # User Mads Kiilerich <madski@unity3d.com>
>> # Date 1413333190 -7200
>> #      Wed Oct 15 02:33:10 2014 +0200
>> # Node ID 85189122b49b31dddbcddb7c0925afa019dc4403
>> # Parent  48c0b101a9de1fdbd638daa858da845cd05a6be7
>> localrepo: introduce persistent caching of revset revision's branch 
>> names
>>
>> It is expensive to create a changectx and extract the branch name. 
>> That shows
>> up when filtering on branches in revsets.
>>
>> To speed things up, cache the results on disk. To avoid using too 
>> much space,
>> all branch names are only stored once and each revision references 
>> one of these
>> names. To verify that the cache is valid, we also store the tip hash 
>> in the
>> cache file.
>
> I cannot see what is your invalidation strategy here.

??? It is right there. Obviously, if the cache not is valid it is 
invalid. But I must assume you are looking for something else. What?

Or to rephrase it: When writing the cache file, it writes the tip 
revision number (+1) and the hash of tip. When reading the file, if that 
revision still has the same hash (even if it no longer is tip), the 
cache is considered valid as far as it goes.

It would probably be better to also explicitly invalidate the cache when 
stripping. I haven't stumbled upon the way to do that yet (haven't tried 
that hard). Any hints?

/Mads
Pierre-Yves David - Oct. 15, 2014, 11:47 a.m.
On 10/15/2014 04:40 AM, Mads Kiilerich wrote:
> On 10/15/2014 02:45 AM, Pierre-Yves David wrote:
>>
>>
>> On 10/14/2014 05:33 PM, Mads Kiilerich wrote:
>>> # HG changeset patch
>>> # User Mads Kiilerich <madski@unity3d.com>
>>> # Date 1413333190 -7200
>>> #      Wed Oct 15 02:33:10 2014 +0200
>>> # Node ID 85189122b49b31dddbcddb7c0925afa019dc4403
>>> # Parent  48c0b101a9de1fdbd638daa858da845cd05a6be7
>>> localrepo: introduce persistent caching of revset revision's branch
>>> names
>>>
>>> It is expensive to create a changectx and extract the branch name.
>>> That shows
>>> up when filtering on branches in revsets.
>>>
>>> To speed things up, cache the results on disk. To avoid using too
>>> much space,
>>> all branch names are only stored once and each revision references
>>> one of these
>>> names. To verify that the cache is valid, we also store the tip hash
>>> in the
>>> cache file.
>>
>> I cannot see what is your invalidation strategy here.
>
> ??? It is right there. Obviously, if the cache not is valid it is
> invalid. But I must assume you are looking for something else. What?
>
> Or to rephrase it: When writing the cache file, it writes the tip
> revision number (+1) and the hash of tip. When reading the file, if that
> revision still has the same hash (even if it no longer is tip), the
> cache is considered valid as far as it goes.

Hum, standard blindness here, my bad… This cache strategy is vulnerable 
to operation (or chain of operation) that change repository content 
without altering the hash number of commit or its hash. We should 
probably use something stronger.

I know that this is used for branchmap as well. So yes branchmap is 
vulnerable too. But branchmap is less vulnerable because it does not 
depends of the order of the revision in the repo (revnum) while you do 
here. We should build something stronger here.
Mads Kiilerich - Oct. 15, 2014, 11:55 a.m.
On 10/15/2014 06:53 AM, Gregory Szorc wrote:
> Append only is a nice ideal and just that: an ideal. Things like 
> transaction rollbacks are effectively strips. And transaction 
> rollbacks can happen when e.g. a server-side hook rejects a push. And 
> if that hook (or a hook that ran before) accesses branch data and 
> causes a cache update that would trigger invalidation on rollback, the 
> next unsuspecting user triggers a fresh cache rebuild and experiences 
> extreme latency (if the repo is moderately sized).

(It sounds like _that_ issue could be mitigated by disabling saving of 
caches while inside transactions and leave it to the next user to update 
the cache. A real solution would however be nice.)

> We see this at Mozilla with rejected pushes causing branchcache 
> rebuilds. On our Try repo, this leads to CPU exhaustion from multiple 
> clients all triggering cache generation simultaneously (because there 
> is no lock on the cache). This is why we spent time at the Summit 
> coming up with a better way to add non-revlog files into transactions.
>
> Whatever you do, please avoid full cache (re)populations wherever 
> possible.

Sure, perfect would be good ... but also its enemy.

If you have a plan for a general solution, shouldn't we use that instead 
of coming up with something special here? Until then, wouldn't it be 
better to give the lucky users the speed-up instead of giving everybody 
the worst case?

/Mads
Mads Kiilerich - Oct. 15, 2014, 12:36 p.m.
On 10/15/2014 01:47 PM, Pierre-Yves David wrote:
>
>
> On 10/15/2014 04:40 AM, Mads Kiilerich wrote:
>> On 10/15/2014 02:45 AM, Pierre-Yves David wrote:
>>>
>>>
>>> On 10/14/2014 05:33 PM, Mads Kiilerich wrote:
>>>> # HG changeset patch
>>>> # User Mads Kiilerich <madski@unity3d.com>
>>>> # Date 1413333190 -7200
>>>> #      Wed Oct 15 02:33:10 2014 +0200
>>>> # Node ID 85189122b49b31dddbcddb7c0925afa019dc4403
>>>> # Parent  48c0b101a9de1fdbd638daa858da845cd05a6be7
>>>> localrepo: introduce persistent caching of revset revision's branch
>>>> names
>>>>
>>>> It is expensive to create a changectx and extract the branch name.
>>>> That shows
>>>> up when filtering on branches in revsets.
>>>>
>>>> To speed things up, cache the results on disk. To avoid using too
>>>> much space,
>>>> all branch names are only stored once and each revision references
>>>> one of these
>>>> names. To verify that the cache is valid, we also store the tip hash
>>>> in the
>>>> cache file.
>>>
>>> I cannot see what is your invalidation strategy here.
>>
>> ??? It is right there. Obviously, if the cache not is valid it is
>> invalid. But I must assume you are looking for something else. What?
>>
>> Or to rephrase it: When writing the cache file, it writes the tip
>> revision number (+1) and the hash of tip. When reading the file, if that
>> revision still has the same hash (even if it no longer is tip), the
>> cache is considered valid as far as it goes.
>
> Hum, standard blindness here, my bad… This cache strategy is 
> vulnerable to operation (or chain of operation) that change repository 
> content without altering the hash number of commit or its hash. We 
> should probably use something stronger.

Yes.

Something like a linearly chained hash of the hashes comes to mind. It 
would however be too slow to validate the whole chain when accessing the 
cache, so it would be vulnerable to the same problems unless it has good 
low level support for invalidation and recomputation. And then, why not 
just invalidate/truncate caches at that point.

Something vector clock or bloom filter could perhaps also be used. A 
simpler solution would probably be an ever-increasing counter that are 
incremented on commits/changegroup additions _and_ on strip (and amend).

> I know that this is used for branchmap as well. So yes branchmap is 
> vulnerable too. But branchmap is less vulnerable because it does not 
> depends of the order of the revision in the repo (revnum) while you do 
> here. We should build something stronger here.

The current branchmap validation scheme can just as easily be tricked if 
the two tipmost revs are unrelated, both are stripped, a new evil commit 
is made on some branch, and then the old tip is pulled back in.

A stronger solution would be nice. When we get it we can apply the same 
solution everywhere. Until we get the "cannot break" solution, I think 
it is ok to extend the use of the existing imperfect but "doesn't break" 
solution.

/Mads
Mads Kiilerich - Oct. 15, 2014, 12:47 p.m.
On 10/15/2014 04:58 AM, Mads Kiilerich wrote:
> The cache file _could_ also store the hash of each revision so it 
> could reuse as much as possible after a strip. That would make it at 
> least 5 times bigger and less efficient. Would you prefer that?

Some further thoughts on this:

Storing the hashes would not be a big deal. The main problem would be 
that given a rev number, we would have to fetch the hash and compare it 
every time we access the cache. It would nice to avoid that ... but 
probably still faster than not caching ...

/Mads
Gregory Szorc - Oct. 15, 2014, 6:02 p.m.
On 10/15/14 4:55 AM, Mads Kiilerich wrote:
> On 10/15/2014 06:53 AM, Gregory Szorc wrote:
>> Append only is a nice ideal and just that: an ideal. Things like
>> transaction rollbacks are effectively strips. And transaction
>> rollbacks can happen when e.g. a server-side hook rejects a push. And
>> if that hook (or a hook that ran before) accesses branch data and
>> causes a cache update that would trigger invalidation on rollback, the
>> next unsuspecting user triggers a fresh cache rebuild and experiences
>> extreme latency (if the repo is moderately sized).
>
> (It sounds like _that_ issue could be mitigated by disabling saving of
> caches while inside transactions and leave it to the next user to update
> the cache. A real solution would however be nice.)

How do you propose we do that? Innocuous read-only-looking code like 
repo.branches() and repo.tags() results in writing cache files. We've 
considered doing crazy things. We could even submit patches to Mercurial 
core. But I'm pretty sure marmoute would reject everything, telling us 
to implement the next generation of transactions instead. He would be right.

>> We see this at Mozilla with rejected pushes causing branchcache
>> rebuilds. On our Try repo, this leads to CPU exhaustion from multiple
>> clients all triggering cache generation simultaneously (because there
>> is no lock on the cache). This is why we spent time at the Summit
>> coming up with a better way to add non-revlog files into transactions.
>>
>> Whatever you do, please avoid full cache (re)populations wherever
>> possible.
>
> Sure, perfect would be good ... but also its enemy.

I'm not asking for perfect: I'm asking for a solution that will scale to 
large repositories. Mercurial scales for us today on all but 
repositories with thousands of heads. The eager cache invalidation 
demonstrated here patch smells like something that will reduce the 
scaling ceiling. (Benchmark the impact of a rollback to mozilla-central 
and mozilla-release to prove me wrong and I'll shut up.)

> If you have a plan for a general solution, shouldn't we use that instead
> of coming up with something special here? Until then, wouldn't it be
> better to give the lucky users the speed-up instead of giving everybody
> the worst case?

The general solution is fixing transactions. See the "Transactions" 
write-up on https://titanpad.com/mercurial32-sprint.
Pierre-Yves David - Oct. 15, 2014, 6:37 p.m.
On 10/15/2014 05:36 AM, Mads Kiilerich wrote:
> On 10/15/2014 01:47 PM, Pierre-Yves David wrote:
>>
>>
>> On 10/15/2014 04:40 AM, Mads Kiilerich wrote:
>>> On 10/15/2014 02:45 AM, Pierre-Yves David wrote:
>>>>
>>>>
>>>> On 10/14/2014 05:33 PM, Mads Kiilerich wrote:
>>>>> # HG changeset patch
>>>>> # User Mads Kiilerich <madski@unity3d.com>
>>>>> # Date 1413333190 -7200
>>>>> #      Wed Oct 15 02:33:10 2014 +0200
>>>>> # Node ID 85189122b49b31dddbcddb7c0925afa019dc4403
>>>>> # Parent  48c0b101a9de1fdbd638daa858da845cd05a6be7
>>>>> localrepo: introduce persistent caching of revset revision's branch
>>>>> names
>>>>>
>>>>> It is expensive to create a changectx and extract the branch name.
>>>>> That shows
>>>>> up when filtering on branches in revsets.
>>>>>
>>>>> To speed things up, cache the results on disk. To avoid using too
>>>>> much space,
>>>>> all branch names are only stored once and each revision references
>>>>> one of these
>>>>> names. To verify that the cache is valid, we also store the tip hash
>>>>> in the
>>>>> cache file.
>>>>
>>>> I cannot see what is your invalidation strategy here.
>>>
>>> ??? It is right there. Obviously, if the cache not is valid it is
>>> invalid. But I must assume you are looking for something else. What?
>>>
>>> Or to rephrase it: When writing the cache file, it writes the tip
>>> revision number (+1) and the hash of tip. When reading the file, if that
>>> revision still has the same hash (even if it no longer is tip), the
>>> cache is considered valid as far as it goes.
>>
>> Hum, standard blindness here, my bad… This cache strategy is
>> vulnerable to operation (or chain of operation) that change repository
>> content without altering the hash number of commit or its hash. We
>> should probably use something stronger.
>
> Yes.
>
> Something like a linearly chained hash of the hashes comes to mind. It
> would however be too slow to validate the whole chain when accessing the
> cache, so it would be vulnerable to the same problems unless it has good
> low level support for invalidation and recomputation. And then, why not
> just invalidate/truncate caches at that point.

I'm thinking about this for some time and I think that storing such 
chained chain of hash in the repo itself could do the trick. (probably 
for each main happen only file in the repo, changelog and obsstore). If 
we also had such data stored we could quickly check if the "combined 
hash" for revnum X match the "combined hash" stored in the hash.

We could extend this logic by storing revnum + "combined hash
  at various checkpoint in your branch cache file to allow quick partial 
truncating.

> Something vector clock or bloom filter could perhaps also be used. A
> simpler solution would probably be an ever-increasing counter that are
> incremented on commits/changegroup additions _and_ on strip (and amend).

Not super fan of the ever increasing counter because you haveto rely on 
all clients bumping it and therefor requires a new .hg/requires entry.

Another approach would be to use a hash of all heads somewhere. But it 
does allow forward update of the cache.

I think the "combined" hash approach is worth poking at.

>> I know that this is used for branchmap as well. So yes branchmap is
>> vulnerable too. But branchmap is less vulnerable because it does not
>> depends of the order of the revision in the repo (revnum) while you do
>> here. We should build something stronger here.
>
> The current branchmap validation scheme can just as easily be tricked if
> the two tipmost revs are unrelated, both are stripped, a new evil commit
> is made on some branch, and then the old tip is pulled back in.

Yes, the current branchmap cache is too fragile too (but a still a bit 
less fragile because it resist to reordering the exact same head)

> A stronger solution would be nice. When we get it we can apply the same
> solution everywhere. Until we get the "cannot break" solution, I think
> it is ok to extend the use of the existing imperfect but "doesn't break"
> solution.

I think the current solution is significantly weaker for you type of 
cash. I do not think it is wise to use it.
Pierre-Yves David - Oct. 15, 2014, 6:37 p.m.
On 10/15/2014 04:55 AM, Mads Kiilerich wrote:
> On 10/15/2014 06:53 AM, Gregory Szorc wrote:
>> Append only is a nice ideal and just that: an ideal. Things like
>> transaction rollbacks are effectively strips. And transaction
>> rollbacks can happen when e.g. a server-side hook rejects a push. And
>> if that hook (or a hook that ran before) accesses branch data and
>> causes a cache update that would trigger invalidation on rollback, the
>> next unsuspecting user triggers a fresh cache rebuild and experiences
>> extreme latency (if the repo is moderately sized).
>
> (It sounds like _that_ issue could be mitigated by disabling saving of
> caches while inside transactions and leave it to the next user to update
> the cache. A real solution would however be nice.)

But if you do that, you got a new issue that read only puller can never 
update the cache and end up recomputing it every time (old issue who 
triggered the current strategy of recomputing the cache)

Patch

diff --git a/mercurial/localrepo.py b/mercurial/localrepo.py
--- a/mercurial/localrepo.py
+++ b/mercurial/localrepo.py
@@ -6,7 +6,7 @@ 
 # GNU General Public License version 2 or any later version.
 from node import hex, nullid, short
 from i18n import _
-import urllib
+import urllib, struct, array
 import peer, changegroup, subrepo, pushkey, obsolete, repoview
 import changelog, dirstate, filelog, manifest, context, bookmarks, phases
 import lock as lockmod
@@ -21,6 +21,13 @@  import branchmap, pathutil
 propertycache = util.propertycache
 filecache = scmutil.filecache
 
+branchcachefn = 'cache/branchnames'
+branchcachever = 2345164374 # TODO: reset when upstreaming
+headerfmt = '>LL20sL'
+headersize = struct.calcsize(headerfmt)
+reffmt = '>H'
+refsize = struct.calcsize(reffmt)
+
 class repofilecache(filecache):
     """All filecache usage on repo are done for logic that should be unfiltered
     """
@@ -179,6 +186,7 @@  class localrepository(object):
     openerreqs = set(('revlogv1', 'generaldelta'))
     requirements = ['revlogv1']
     filtername = None
+    _branchcachedirty = None
 
     # a list of (ui, featureset) functions.
     # only functions defined in module of enabled extensions are invoked
@@ -298,7 +306,7 @@  class localrepository(object):
         self.filteredrevcache = {}
 
     def close(self):
-        pass
+        self._branchcachesave()
 
     def _restrictcapabilities(self, caps):
         # bundle2 is not ready for prime time, drop it unless explicitly
@@ -723,6 +731,80 @@  class localrepository(object):
         repo = (remote and remote.local()) and remote or self
         return repo[key].branch()
 
+    def branch(self, rev):
+        """return name of rev's branch as self[rev].branch() but cache it.
+        First invocation of this method will load cached values and replace
+        this method with a more efficient getter."""
+        try:
+            data = self.vfs.open(branchcachefn).read()
+        except IOError:
+            data = ''
+
+        self._branches = []
+        self._branchrefs = array.array('c') # bytes from struct type reffmt
+        self.__dict__['_branchcachedirty'] = True
+        if len(data) >= headersize:
+            v, revs, headnode, bnsize = struct.unpack_from(headerfmt, data)
+            if (v == branchcachever and
+                (revs - 1) in self and
+                self[revs - 1].node() == headnode and
+                len(data) == headersize + bnsize + revs * refsize
+                ):
+                if bnsize:
+                    self._branches = \
+                        data[headersize : headersize + bnsize].split('\0')
+                self._branchrefs.fromstring(
+                    buffer(data, headersize + bnsize, revs * refsize))
+                self.__dict__['_branchcachedirty'] = False
+            else:
+                self.ui.debug('branch cache was outdated\n')
+        else:
+            self.ui.debug('branch cache was invalid\n')
+
+        if len(self._branchrefs) < len(self) * refsize:
+            self._branchrefs.extend(
+                '\xff' * (len(self) * refsize - len(self._branchrefs)))
+
+        self._branchnamesindex = dict((b, r)
+                                      for r, b in enumerate(self._branches))
+
+        self.__dict__['branch'] = self._branch
+        return self.branch(rev)
+
+    def _branch(self, rev):
+        """return name of rev's branch as self[rev].branch() using cache."""
+        branchref = struct.unpack_from(reffmt, self._branchrefs,
+                                       rev * refsize)[0]
+        if branchref < len(self._branches):
+            return self._branches[branchref]
+        b = self[rev].branch()
+        if b in self._branchnamesindex:
+            branchref = self._branchnamesindex[b]
+        else:
+            branchref = len(self._branches)
+            self._branches.append(b)
+            self._branchnamesindex[b] = branchref
+        struct.pack_into(reffmt, self._branchrefs, rev * refsize, branchref)
+        self.__dict__['_branchcachedirty'] = True
+        return b
+
+    def _branchcachesave(self):
+        """save branch cache if it is dirty"""
+        if self._branchcachedirty:
+            self.ui. debug('writing branch cache\n')
+            try:
+                f = self.vfs.open(branchcachefn, 'w', atomictemp=True)
+                s = '\0'.join(self._branches)
+                revs = len(self._branchrefs) / refsize
+                f.write(struct.pack(headerfmt, branchcachever, revs,
+                        self[revs - 1].node(), len(s)))
+                f.write(s)
+                f.write(self._branchrefs)
+                f.close()
+            except IOError:
+                pass
+            self.__dict__['_branchcachedirty'] = False
+
     def known(self, nodes):
         nm = self.changelog.nodemap
         pc = self._phasecache
@@ -1003,10 +1085,11 @@  class localrepository(object):
         return 0
 
     def invalidatecaches(self):
-
+        # can't use delattr on proxy
         if '_tagscache' in vars(self):
-            # can't use delattr on proxy
             del self.__dict__['_tagscache']
+        if 'branch' in vars(self):
+            del self.__dict__['branch']
 
         self.unfiltered()._branchcaches.clear()
         self.invalidatevolatilesets()
diff --git a/mercurial/revset.py b/mercurial/revset.py
--- a/mercurial/revset.py
+++ b/mercurial/revset.py
@@ -489,16 +489,16 @@  def branch(repo, subset, x):
             # note: falls through to the revspec case if no branch with
             # this name exists
             if pattern in repo.branchmap():
-                return subset.filter(lambda r: matcher(repo[r].branch()))
+                return subset.filter(lambda r: matcher(repo.branch(r)))
         else:
-            return subset.filter(lambda r: matcher(repo[r].branch()))
+            return subset.filter(lambda r: matcher(repo.branch(r)))
 
     s = getset(repo, spanset(repo), x)
     b = set()
     for r in s:
-        b.add(repo[r].branch())
+        b.add(repo.branch(r))
     c = s.__contains__
-    return subset.filter(lambda r: c(r) or repo[r].branch() in b)
+    return subset.filter(lambda r: c(r) or repo.branch(r) in b)
 
 def bumped(repo, subset, x):
     """``bumped()``
@@ -1431,7 +1431,7 @@  def matching(repo, subset, x):
     getfieldfuncs = []
     _funcs = {
         'user': lambda r: repo[r].user(),
-        'branch': lambda r: repo[r].branch(),
+        'branch': lambda r: repo.branch(r),
         'date': lambda r: repo[r].date(),
         'description': lambda r: repo[r].description(),
         'files': lambda r: repo[r].files(),
@@ -1532,9 +1532,9 @@  def sort(repo, subset, x):
             elif k == '-rev':
                 e.append(-r)
             elif k == 'branch':
-                e.append(c.branch())
+                e.append(repo.branch(r))
             elif k == '-branch':
-                e.append(invert(c.branch()))
+                e.append(invert(repo.branch(r)))
             elif k == 'desc':
                 e.append(c.description())
             elif k == '-desc':
diff --git a/tests/test-revset.t b/tests/test-revset.t
--- a/tests/test-revset.t
+++ b/tests/test-revset.t
@@ -744,6 +744,39 @@  issue2437
   $ log 'ancestors(8) and (heads(branch("-a-b-c-")) or heads(branch(é)))'
   4
 
+branchname caching
+
+  $ rm .hg/cache/branchnames
+  $ log 'branch("-a-b-c-")'
+  4
+  $ "$TESTDIR/md5sum.py" .hg/cache/branchnames
+  f9bf54761390e8736e245e7de0b5d9e6  .hg/cache/branchnames
+  $ echo > .hg/cache/branchnames
+  $ log 'branch("-a-b-c-")'
+  4
+  $ "$TESTDIR/md5sum.py" .hg/cache/branchnames
+  f9bf54761390e8736e245e7de0b5d9e6  .hg/cache/branchnames
+  $ echo >> .hg/cache/branchnames
+  $ log 'branch("-a-b-c-")'
+  4
+  $ "$TESTDIR/md5sum.py" .hg/cache/branchnames
+  f9bf54761390e8736e245e7de0b5d9e6  .hg/cache/branchnames
+  $ hg tag tag
+  $ "$TESTDIR/md5sum.py" .hg/cache/branchnames
+  f9bf54761390e8736e245e7de0b5d9e6  .hg/cache/branchnames
+  $ log 'branch("-a-b-c-")'
+  4
+  $ "$TESTDIR/md5sum.py" .hg/cache/branchnames
+  9025ff4c8e9d36f49921ae9b2752e6b9  .hg/cache/branchnames
+  $ hg up -qr '.^'
+  $ hg rollback -qf
+  $ "$TESTDIR/md5sum.py" .hg/cache/branchnames
+  9025ff4c8e9d36f49921ae9b2752e6b9  .hg/cache/branchnames
+  $ log 'branch("-a-b-c-")'
+  4
+  $ "$TESTDIR/md5sum.py" .hg/cache/branchnames
+  f9bf54761390e8736e245e7de0b5d9e6  .hg/cache/branchnames
+
 issue2654: report a parse error if the revset was not completely parsed
 
   $ log '1 OR 2'