Patchwork [1,of,3,v3] branchcache: introduce revbranchcache for caching of revision branch names

Submitter Mads Kiilerich
Date Jan. 8, 2015, 10:33 p.m.
Message ID <7334d72b49781e66cf9f.1420756412@ssl.google-analytics.com>
Permalink /patch/7395/
State Superseded
Delegated to: Pierre-Yves David
Headers show

Comments

Mads Kiilerich - Jan. 8, 2015, 10:33 p.m.
# HG changeset patch
# User Mads Kiilerich <madski@unity3d.com>
# Date 1420671663 -3600
#      Thu Jan 08 00:01:03 2015 +0100
# Node ID 7334d72b49781e66cf9ff5abb3f0497d7c812c28
# Parent  7ad155e13f0f51df8e986a0ec4e58ac9a0ccedbb
branchcache: introduce revbranchcache for caching of revision branch names

Retrieving the branch name of a revision is expensive: very expensive when
creating a changectx and calling .branch() every time, slightly less so when
using changelog.branchinfo().

Now, to speed things up, provide a way to cache the results on disk in an
efficient format. Each branchname is assigned a number, and for each revision
we store the number of the corresponding branch name. The branch names are
stored in a dedicated file which is strictly append only.

Branch names are usually reused across several revisions, and the total list of
branch names will thus be so small that it is feasible to read the whole set of
names before using the cache. It might, however, still be more efficient to use
the changelog directly when retrieving the branch info for a single revision.
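As an illustration of the names-file layout described above, here is a minimal Python sketch. The variable names (`names`, `namesreverse`) are hypothetical, but the NUL-separated, index-by-position format matches the patch's rbc-names description:

```python
# Sketch of the rbc-names layout: NUL-separated branch names,
# where each name's position in the list is its stable index.
names = ["default", "stable", "feature-x"]

# Serialize: the file is strictly append-only, so an index
# assigned to a name never changes once written.
data = "\0".join(names)

# Deserialize: the file is small, so reading and splitting the
# whole thing up front is cheap.
loaded = data.split("\0")
assert loaded == names

# Reverse mapping from name to index; the cache needs both directions.
namesreverse = dict((b, r) for r, b in enumerate(loaded))
assert namesreverse["stable"] == 1
```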

The revision entries are stored in another file. This file is usually append
only, but if the repository has been modified, the file will be truncated and
the relevant parts rewritten on demand.

Each revision entry is 8 bytes, and the whole revision file will thus be 1/8
the size of 00changelog.i.

Each revision entry contains the first 4 bytes of the corresponding node hash.
This is used as a checksum that is always verified before the entry is used.
The check is relatively expensive, but it makes sure history modification is
detected and handled correctly. It will also detect and handle most
corruptions of the revision file.

This is just a cache. A new format can always be introduced if other
requirements or ideas make that seem like a good idea. Rebuilding the cache is
not really more expensive than running, for example, 'hg log -b branchname'
was before this cache was introduced.

This new method is still unused but promises to make some operations several
times faster once it actually is used.

Abandoning Python 2.4 would make it possible to implement this more efficiently
by using struct classes and pack_into. The Python code could probably also be
micro-optimized, or it could be implemented very efficiently in C, where it
would be easy to control the data access.
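As a sketch of what the Python 2.5+ approach alluded to here could look like (hypothetical, not part of the patch): a precompiled struct.Struct writing directly into a mutable buffer with pack_into avoids building intermediate strings on every record update:

```python
from struct import Struct

# Precompiled record format: 4-byte node prefix + 4-byte branch index.
rec = Struct('>4sI')

# Preallocated buffer holding one record per revision (two here).
revs = bytearray(rec.size * 2)

# pack_into writes in place at a byte offset instead of returning a
# new string, so updating one revision's record does not copy the rest.
rec.pack_into(revs, 0 * rec.size, b'\x11\x22\x33\x44', 0)
rec.pack_into(revs, 1 * rec.size, b'\x55\x66\x77\x88', 1)

# unpack_from reads a record at an offset without slicing the buffer.
node, idx = rec.unpack_from(revs, 1 * rec.size)
assert (node, idx) == (b'\x55\x66\x77\x88', 1)
```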
Pierre-Yves David - Jan. 9, 2015, 8:15 a.m.
On 01/08/2015 02:33 PM, Mads Kiilerich wrote:
> # HG changeset patch
> # User Mads Kiilerich <madski@unity3d.com>
> # Date 1420671663 -3600
> #      Thu Jan 08 00:01:03 2015 +0100
> # Node ID 7334d72b49781e66cf9ff5abb3f0497d7c812c28
> # Parent  7ad155e13f0f51df8e986a0ec4e58ac9a0ccedbb
> branchcache: introduce revbranchcache for caching of revision branch names
>
> It is expensive to retrieve the branch name of a revision. Very expensive when
> creating a changectx and calling .branch() every time - slightly less when
> using changelog.branchinfo().
>
> Now, to speed things up, provide a way to cache the results on disk in an
> efficient format. Each branchname is assigned a number, and for each revision
> we store the number of the corresponding branch name. The branch names are
> stored in a dedicated file which is strictly append only.
>
> Branch names are usually reused across several revisions, and the total list of
> branch names will thus be so small that it is feasible to read the whole set of
> names before using the cache. It will however do that it might be more
> efficient to use the changelog for retrieving the branch info for a single
> revision.
>
> The revision entries are stored in another file. This file is usually append
> only, but if the repository has been modified, the file will be truncated and
> the relevant parts rewritten on demand.
>
> The entries for each revision are 8 bytes each, and the whole revision file
> will thus be 1/8 of 00changelog.i.
>
> Each revision entry contains the first 4 bytes of the corresponding node hash.
> This is used as a check sum that always is verified before the entry is used.
> That check is relatively expensive but it makes sure history modification is
> detected and handled correctly. It will also detect and handle most revision
> file corruptions.
>
> This is just a cache. A new format can always be introduced if other
> requirements or ideas make that seem like a good idea. Rebuilding the cache is
> not really more expensive than it was to run for example 'hg log -b branchname'
> before this cache was introduced.
>
> This new method is still unused but promise to make some operations several
> times faster once it actually is used.
>
> Abandoning Python 2.4 would make it possible to implement this more efficiently
> by using struct classes and pack_into. The Python code could probably also be
> micro optimized or it could be implemented very efficiently in C where it would
> be easy to control the data access.
>
> diff --git a/mercurial/branchmap.py b/mercurial/branchmap.py
> --- a/mercurial/branchmap.py
> +++ b/mercurial/branchmap.py
> @@ -9,6 +9,8 @@ from node import bin, hex, nullid, nullr
>   import encoding
>   import util
>   import time
> +from array import array
> +from struct import calcsize, pack, unpack
>
>   def _filename(repo):
>       """name of a branchcache file for a given repo or repoview"""
> @@ -285,3 +287,144 @@ class branchcache(dict):
>           duration = time.time() - starttime
>           repo.ui.log('branchcache', 'updated %s branch cache in %.4f seconds\n',
>                       repo.filtername, duration)
> +
> +# Revision branch info cache
> +
> +_rbcversion = '-v1'
> +_rbcnames = 'cache/rbc-names' + _rbcversion
> +_rbcrevs = 'cache/rbc-revs' + _rbcversion
> +# [4 byte hash prefix][4 byte branch name number with sign bit indicating open]
> +_rbcrecfmt = '>4sI'
> +_rbcrecsize = calcsize(_rbcrecfmt)
> +_rbcnodelen = 4
> +_rbcbranchidxmask = 0x7fffffff
> +_rbccloseflag = 0x80000000
> +
> +class revbranchcache(object):
> +    """Persistent cache, mapping from revision number to branch name and close.
> +    This is a low level cache, independent of filtering.
> +
> +    Branch names are stored in rbc-names in internal encoding separated by 0.
> +    rbc-names is append-only, and each branch name is only stored once and will
> +    thus have a unique index.
> +
> +    The branch info for each revision is stored in rbc-revs as constant size
> +    records. The whole file is read into memory, but it is only 'parsed' on
> +    demand. The file is usually append-only but will be truncated if repo
> +    modification is detected.
> +    The record for each revision contains the first 4 bytes of the
> +    corresponding node hash, and the record is only used if it still matches.
> +    Even a completely trashed rbc-revs fill thus still give the right result
> +    while converging towards full recovery ... assuming no incorrectly matching
> +    node hashes.
> +    The record also contains 4 bytes where 31 bits contains the index of the
> +    branch and the last bit indicate that it is a branch close commit.
> +    The usage pattern for rbc-revs is thus somewhat similar to 00changelog.i
> +    and will grow with it but be 1/8th of its size.
> +    """
> +
> +    def __init__(self, repo):
> +        repo = repo.unfiltered()

This is suspicious. I would assume we should have a single 
'revbranchcache' object per repository. So I would focus on ensuring we 
always access it through the unfiltered object (short version: assign it 
to the unfiltered object in the first place). This means updating the 
next patch.

Turning this line into an assertion seems like a good idea:
(assert repo.filtername is None)


Also, we tend to avoid creating cycles with core attributes of the 
repository. Can we avoid storing the repository object in this one and 
instead pass it to the few relevant functions?

> +        self._ui = repo.ui
> +        self._changelog = repo.changelog

This is going to lead to trouble. 'repo.changelog' is a store cache. 
This means that external events may invalidate the attribute and get it 
recreated. So the 'changelog' object must always be accessed through the 
repository object.

> +        self._vfs = repo.vfs
> +        self._names = [] # branch names in local encoding with static index
> +        self._rbcrevs = array('c') # structs of type _rbcrecfmt
> +        try:
> +            bndata = self._vfs.read(_rbcnames)
> +            self._names = [encoding.tolocal(bn) for bn in bndata.split('\0')]
> +        except (IOError, OSError, util.Abort), inst:
> +            self._ui.debug("couldn't read revision branch cache names: %s\n" %
> +                           inst)

What is the util.Abort here for ?

> +        if self._names:
> +            try:
> +                data = self._vfs.read(_rbcrevs)
> +                self._rbcrevs.fromstring(data)
> +            except (IOError, OSError), inst:
> +                self._ui.debug("couldn't read revision branch cache: %s\n" %
> +                               inst)
> +        # remember number of good records on disk
> +        self._rbcrevslen = min(len(self._rbcrevs) // _rbcrecsize,
> +                               len(self._changelog))
> +        if self._rbcrevslen == 0:
> +            self._names = []
> +        self._rbcnameslen = len(self._names) # number of good names on disk
> +        self._namesreverse = dict((b, r) for r, b in enumerate(self._names))
> +
> +    def branchinfo(self, rev):
> +        """Return branch name and close flag for rev, using and updating
> +        persistent cache."""
> +        rbcrevidx = rev * _rbcrecsize



> +
> +        # if requested rev is missing, add and populate all missing revs
> +        if len(self._rbcrevs) < rbcrevidx + _rbcrecsize:
> +            first = len(self._rbcrevs) // _rbcrecsize
> +            self._rbcrevs.extend('\0' * (len(self._changelog) * _rbcrecsize -
> +                                         len(self._rbcrevs)))
> +            for r in xrange(first, len(self._changelog)):
> +                self._branchinfo(r)

Why do we automatically populate all missing nodes? This is going to be 
a very expensive first population even for a very simple operation that 
accesses branch data for a single changeset. Can we not populate on demand?

> +        # fast path: extract data from cache, use it if node is matching
> +        reponode = self._changelog.node(rev)[:_rbcnodelen]
> +        cachenode, branchidx = unpack(
> +            _rbcrecfmt, buffer(self._rbcrevs, rbcrevidx, _rbcrecsize))
> +        close = bool(branchidx & _rbccloseflag)
> +        if close:
> +            branchidx &= _rbcbranchidxmask
> +        if cachenode == reponode:
> +            return self._names[branchidx], close
> +        # fall back to slow path and make sure it will be written to disk
> +        self._rbcrevslen = min(self._rbcrevslen, rev)
> +        return self._branchinfo(rev)
> +
> +    def _branchinfo(self, rev):
> +        """Retrieve branch info from changelog and update _rbcrevs"""
> +        b, close = self._changelog.branchinfo(rev)
> +        if b in self._namesreverse:
> +            branchidx = self._namesreverse[b]
> +        else:
> +            branchidx = len(self._names)
> +            self._names.append(b)
> +            self._namesreverse[b] = branchidx

Small nits:

This could be:

   branchidx = self._namesreverse.setdefault(b, len(self._names))
   if branchidx >= len(self._names):
       self._names.append(b)
Mads Kiilerich - Jan. 9, 2015, 4:03 p.m.
On 01/09/2015 09:15 AM, Pierre-Yves David wrote:
> On 01/08/2015 02:33 PM, Mads Kiilerich wrote:
>> [patch quoted in full; trimmed]
>> +
>> +    def __init__(self, repo):
>> +        repo = repo.unfiltered()
>
> This is suspicious. I would assume we should have a single 
> 'revbranchcache' object per repository. 

It will only be used for the few operations that do bulk lookups and 
are slow without it. So far that is branchmap calculation and revset 
branch filtering. There will usually be at most one of those per repo 
object lifetime. Storing and reusing the cache object is thus barely relevant.

A later patch could perhaps store and reuse it more aggressively.

> So I would focus on ensure we always access it through the unfiltered 
> object (short version: assign it to the unfiltered object in the first 
> place). This mean updating the next patch.
>
> Turning this line into an assertion seems like a good idea:
> (assert repo.filtername is None)
>
> Also, we tend to avoid creating cycle with core attribute of the 
> repository. Can we not store the repository object in this one and 
> pass it to the few relevant function?

I don't understand the comment. The repo object is _not_ stored anywhere 
in this patch and I don't see any cycles.

>
>> +        self._ui = repo.ui
>> +        self._changelog = repo.changelog
>
> This is going to lead to troubles. 'repo.changelog' is a store cache. 
> This means that external even may invalidate the attribute and get it 
> recreated. So the 'changelog' object must always be accessed through 
> the repository object.

:-( This repo filtering and repoview is a mess and now we have to make 
all other code a mess too. :-(

Ok, I will pass everything as parameters when invoking revbranchcache 
methods.

For now I will focus on introducing the basic cache and use it 
short-lived and in the least intrusive way where we are in a tight loop 
and/or have the write lock. More fancy integration and optimizations can 
be added later.

>
>> +        self._vfs = repo.vfs
>> +        self._names = [] # branch names in local encoding with 
>> static index
>> +        self._rbcrevs = array('c') # structs of type _rbcrecfmt
>> +        try:
>> +            bndata = self._vfs.read(_rbcnames)
>> +            self._names = [encoding.tolocal(bn) for bn in 
>> bndata.split('\0')]
>> +        except (IOError, OSError, util.Abort), inst:
>> +            self._ui.debug("couldn't read revision branch cache 
>> names: %s\n" %
>> +                           inst)
>
> What is the util.Abort here for ?

cut'n'paste error.

>
>> +        if self._names:
>> +            try:
>> +                data = self._vfs.read(_rbcrevs)
>> +                self._rbcrevs.fromstring(data)
>> +            except (IOError, OSError), inst:
>> +                self._ui.debug("couldn't read revision branch cache: 
>> %s\n" %
>> +                               inst)
>> +        # remember number of good records on disk
>> +        self._rbcrevslen = min(len(self._rbcrevs) // _rbcrecsize,
>> +                               len(self._changelog))
>> +        if self._rbcrevslen == 0:
>> +            self._names = []
>> +        self._rbcnameslen = len(self._names) # number of good names 
>> on disk
>> +        self._namesreverse = dict((b, r) for r, b in 
>> enumerate(self._names))
>> +
>> +    def branchinfo(self, rev):
>> +        """Return branch name and close flag for rev, using and 
>> updating
>> +        persistent cache."""
>> +        rbcrevidx = rev * _rbcrecsize
>
>
>
>> +
>> +        # if requested rev is missing, add and populate all missing 
>> revs
>> +        if len(self._rbcrevs) < rbcrevidx + _rbcrecsize:
>> +            first = len(self._rbcrevs) // _rbcrecsize
>> +            self._rbcrevs.extend('\0' * (len(self._changelog) * 
>> _rbcrecsize -
>> +                                         len(self._rbcrevs)))
>> +            for r in xrange(first, len(self._changelog)):
>> +                self._branchinfo(r)
>
> Why do we automatically populate all missing nodes? This is going to 
> be a very expensive first population even for very simple operation 
> that access branch data for a single changeset. Can we not populate on 
> demand?

One reason is that you insisted that the storage should be append-only 
and didn't like the default values in a previous version of this 
feature. We can thus not have any holes in the cache. There is also 
(currently) no good way to write the cache after a revset operation has 
populated it; those operations have to be read-only, and we should aim at 
keeping the cache fully updated in the places where we do write.

Everything should be populated at some point anyway. Doing it all at once 
will not make it more expensive overall.

The array we use for storage will also have to be extended to make room 
for the new entries. Adding entries one by one could imply copying it 
around and is not as efficient as adding all at once.

Revlogs are also faster to walk in-order than doing random seeks.

Writing is also more expensive than reading. By updating everything at 
once we limit the number of writes.

I'm not saying that it is the perfect way to do it, but my gut feeling 
is that it is an ok way to do it. It can work as a baseline and better 
strategies can be implemented later, with the same or another data model.

>
>> +        # fast path: extract data from cache, use it if node is 
>> matching
>> +        reponode = self._changelog.node(rev)[:_rbcnodelen]
>> +        cachenode, branchidx = unpack(
>> +            _rbcrecfmt, buffer(self._rbcrevs, rbcrevidx, _rbcrecsize))
>> +        close = bool(branchidx & _rbccloseflag)
>> +        if close:
>> +            branchidx &= _rbcbranchidxmask
>> +        if cachenode == reponode:
>> +            return self._names[branchidx], close
>> +        # fall back to slow path and make sure it will be written to 
>> disk
>> +        self._rbcrevslen = min(self._rbcrevslen, rev)
>> +        return self._branchinfo(rev)
>> +
>> +    def _branchinfo(self, rev):
>> +        """Retrieve branch info from changelog and update _rbcrevs"""
>> +        b, close = self._changelog.branchinfo(rev)
>> +        if b in self._namesreverse:
>> +            branchidx = self._namesreverse[b]
>> +        else:
>> +            branchidx = len(self._names)
>> +            self._names.append(b)
>> +            self._namesreverse[b] = branchidx
>
> Small nits:
>
> This could be:
>
>   branchidx = self._namesreverse.setdefault(b, len(self._names))
>   if branchidx >= len(self._names):
>       self._names.append(b)

It could. That would save 2 lines. IMO it would also be less explicit 
and less readable. FWIW, it would also be slower.

$ python -m timeit -s'
class self:
   _names = [x for x in "hello world"]
   _namesreverse = dict((j,i) for i,j in enumerate(_names))
' '
for b in "hi":
   if b in self._namesreverse:
     branchidx = self._namesreverse[b]
   else:
     branchidx = len(self._names)
     self._names.append(b)
     self._namesreverse[b] = branchidx
'
1000000 loops, best of 3: 0.334 usec per loop

$ python -m timeit -s'
class self:
   _names = [x for x in "hello world"]
   _namesreverse = dict((j,i) for i,j in enumerate(_names))
' '
for b in "hi":
   branchidx = self._namesreverse.setdefault(b, len(self._names))
   if branchidx >= len(self._names):
     self._names.append(b)
'
1000000 loops, best of 3: 0.532 usec per loop

/Mads
Pierre-Yves David - Jan. 9, 2015, 9:04 p.m.
On 01/09/2015 08:03 AM, Mads Kiilerich wrote:
> On 01/09/2015 09:15 AM, Pierre-Yves David wrote:
>> On 01/08/2015 02:33 PM, Mads Kiilerich wrote:
>>> [patch quoted in full; trimmed]
>>> +
>>> +    def __init__(self, repo):
>>> +        repo = repo.unfiltered()
>>
>> This is suspicious. I would assume we should have a single
>> 'revbranchcache' object per repository.
>
> It will only be used for the few operations that do bulk operations and
> are slow without it. So far that is branchmap calculation and revset
> branch filtering. There will usually only be at most one of them once
> per repo object lifetime. Storing and reusing is thus barely relevant.
>
> A later patch could perhaps store and reuse it more aggressively.

Ok, I misunderstood the lifetime of the object (when looking at the 
other patches). I agree with you that it makes sense to use it for a single 
expensive operation. This is a good first step.

>> So I would focus on ensure we always access it through the unfiltered
>> object (short version: assign it to the unfiltered object in the first
>> place). This mean updating the next patch.
>>
>> Turning this line into an assertion seems like a good idea:
>> (assert repo.filtername is None)
>>
>> Also, we tend to avoid creating cycle with core attribute of the
>> repository. Can we not store the repository object in this one and
>> pass it to the few relevant function?
>
> I don't understand the comment. The repo object is _not_ stored anywhere
> in this patch and I don't see any cycles.

I'm talking about the planned usage in the other patches, where the 
revbranchcache object is store in attribute. Lets discuss it on other 
patches instead.

>>> +        self._ui = repo.ui
>>> +        self._changelog = repo.changelog
>>
>> This is going to lead to troubles. 'repo.changelog' is a store cache.
>> This means that external even may invalidate the attribute and get it
>> recreated. So the 'changelog' object must always be accessed through
>> the repository object.
>
> :-( This repo filtering and repoview is a mess and now we have to make
> all other code a mess too. :-(

This is actually not related to repoview at all. A lot of the localrepo 
attributes are automatically reloaded if the underlying files change. This 
has been a core concept for years.

> Ok, I will pass everything as parameters when invoking revbranchcache
> methods.
>
> For now I will focus on introducing the basic cache and use it
> short-lived and in the least intrusive way where we are in a tight loop
> and/or have the write lock. More fancy integration and optimizations can
> be added later.

+1

>>> +        if self._names:
>>> +            try:
>>> +                data = self._vfs.read(_rbcrevs)
>>> +                self._rbcrevs.fromstring(data)
>>> +            except (IOError, OSError), inst:
>>> +                self._ui.debug("couldn't read revision branch cache:
>>> %s\n" %
>>> +                               inst)
>>> +        # remember number of good records on disk
>>> +        self._rbcrevslen = min(len(self._rbcrevs) // _rbcrecsize,
>>> +                               len(self._changelog))
>>> +        if self._rbcrevslen == 0:
>>> +            self._names = []
>>> +        self._rbcnameslen = len(self._names) # number of good names
>>> on disk
>>> +        self._namesreverse = dict((b, r) for r, b in
>>> enumerate(self._names))
>>> +
>>> +    def branchinfo(self, rev):
>>> +        """Return branch name and close flag for rev, using and
>>> updating
>>> +        persistent cache."""
>>> +        rbcrevidx = rev * _rbcrecsize
>>
>>> +
>>> +        # if requested rev is missing, add and populate all missing
>>> revs
>>> +        if len(self._rbcrevs) < rbcrevidx + _rbcrecsize:
>>> +            first = len(self._rbcrevs) // _rbcrecsize
>>> +            self._rbcrevs.extend('\0' * (len(self._changelog) *
>>> _rbcrecsize -
>>> +                                         len(self._rbcrevs)))
>>> +            for r in xrange(first, len(self._changelog)):
>>> +                self._branchinfo(r)
>>
>> Why do we automatically populate all missing nodes? This is going to
>> be a very expensive first population even for very simple operations
>> that access branch data for a single changeset. Can we not populate on
>> demand?
>
> One reason is that you insisted that the storage should be append only
> and didn't like the default values in a previous version of this
> feature. We can thus not have any holes in the cache. There is also
> (currently) no good way to write the cache after a revset operation has
> populated it; they have to be read-only and we should aim at keeping it
> fully updated in the place where we write.

I think I did not advocate for dropping the magic value. I was just 
advocating for dropping the automatic size update.

> Everything should also be populated at some point. Doing it at once will
> not make it more expensive over all.
>
> The array we use for storage will also have to be extended to make room
> for the new entries. Adding entries one by one could imply copying it
> around and is not as efficient as adding all at once.
>
> Revlogs are also faster to walk in-order than doing random seeks.
>
> Writing is also more expensive than reading. By updating everything at
> once we limit the number of writes.
>
> I'm not saying that it is the perfect way to do it, but my gut feeling
> is that it is an ok way to do it. It can work as a baseline and better
> strategies can be implemented later, with the same or another data model.

Since the writing is handled at an upper level (which can properly do a 
single write), I'm not sure your argument is relevant; we can still do a 
single final write. I'm still worried about the initial update time for 
small operations right after the Mercurial upgrade.

>>> +        # fast path: extract data from cache, use it if node is
>>> matching
>>> +        reponode = self._changelog.node(rev)[:_rbcnodelen]
>>> +        cachenode, branchidx = unpack(
>>> +            _rbcrecfmt, buffer(self._rbcrevs, rbcrevidx, _rbcrecsize))
>>> +        close = bool(branchidx & _rbccloseflag)
>>> +        if close:
>>> +            branchidx &= _rbcbranchidxmask
>>> +        if cachenode == reponode:
>>> +            return self._names[branchidx], close
>>> +        # fall back to slow path and make sure it will be written to
>>> disk
>>> +        self._rbcrevslen = min(self._rbcrevslen, rev)
>>> +        return self._branchinfo(rev)
>>> +
>>> +    def _branchinfo(self, rev):
>>> +        """Retrieve branch info from changelog and update _rbcrevs"""
>>> +        b, close = self._changelog.branchinfo(rev)
>>> +        if b in self._namesreverse:
>>> +            branchidx = self._namesreverse[b]
>>> +        else:
>>> +            branchidx = len(self._names)
>>> +            self._names.append(b)
>>> +            self._namesreverse[b] = branchidx
>>
>> Small nits:
>>
>> This could be:
>>
>>   branchidx = self._namesreverse.setdefault(b, len(self._names))
>>   if branchidx >= len(self._names):
>>       self._names.append(b)
>
> It could. That would save 2 lines. IMO it would also be less explicit
> and less readable. FWIW, it would also be slower.
>
> $ python -m timeit -s'
> class self:
>    _names = [x for x in "hello world"]
>    _namesreverse = dict((j,i) for i,j in enumerate(_names))
> ' '
> for b in "hi":
>    if b in self._namesreverse:
>      branchidx = self._namesreverse[b]
>    else:
>      branchidx = len(self._names)
>      self._names.append(b)
>      self._namesreverse[b] = branchidx
> '
> 1000000 loops, best of 3: 0.334 usec per loop
>
> $ python -m timeit -s'
> class self:
>    _names = [x for x in "hello world"]
>    _namesreverse = dict((j,i) for i,j in enumerate(_names))
> ' '
> for b in "hi":
>    branchidx = self._namesreverse.setdefault(b, len(self._names))
>    if branchidx >= len(self._names):
>      self._names.append(b)
> '
> 1000000 loops, best of 3: 0.532 usec per loop

This was just a nit; I personally find it more readable, and I'm pretty 
sure the performance difference is strongly dominated by other things. 
But this is far from important, so keep it the way you prefer.
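For reference, the two interning patterns being debated behave identically; here they are side by side as standalone Python 3 functions (the names `intern_ifelse`/`intern_setdefault` are made up for this sketch). `setdefault` both looks up an existing index and reserves the next one on a miss, which is the trick behind the shorter variant.

```python
def intern_ifelse(names, namesreverse, b):
    # pattern used in the patch: explicit membership test
    if b in namesreverse:
        return namesreverse[b]
    branchidx = len(names)
    names.append(b)
    namesreverse[b] = branchidx
    return branchidx

def intern_setdefault(names, namesreverse, b):
    # reviewer's suggested variant: setdefault returns the existing
    # index on a hit, or stores and returns len(names) on a miss
    branchidx = namesreverse.setdefault(b, len(names))
    if branchidx >= len(names):
        names.append(b)
    return branchidx
```

Either way, each branch name gets a stable index assigned on first sight, which is the property the rbc-names file relies on.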

Patch

diff --git a/mercurial/branchmap.py b/mercurial/branchmap.py
--- a/mercurial/branchmap.py
+++ b/mercurial/branchmap.py
@@ -9,6 +9,8 @@  from node import bin, hex, nullid, nullr
 import encoding
 import util
 import time
+from array import array
+from struct import calcsize, pack, unpack
 
 def _filename(repo):
     """name of a branchcache file for a given repo or repoview"""
@@ -285,3 +287,144 @@  class branchcache(dict):
         duration = time.time() - starttime
         repo.ui.log('branchcache', 'updated %s branch cache in %.4f seconds\n',
                     repo.filtername, duration)
+
+# Revision branch info cache
+
+_rbcversion = '-v1'
+_rbcnames = 'cache/rbc-names' + _rbcversion
+_rbcrevs = 'cache/rbc-revs' + _rbcversion
+# [4 byte hash prefix][4 byte branch name number with sign bit indicating open]
+_rbcrecfmt = '>4sI'
+_rbcrecsize = calcsize(_rbcrecfmt)
+_rbcnodelen = 4
+_rbcbranchidxmask = 0x7fffffff
+_rbccloseflag = 0x80000000
+
+class revbranchcache(object):
+    """Persistent cache, mapping from revision number to branch name and close.
+    This is a low level cache, independent of filtering.
+
+    Branch names are stored in rbc-names in internal encoding separated by 0.
+    rbc-names is append-only, and each branch name is only stored once and will
+    thus have a unique index.
+
+    The branch info for each revision is stored in rbc-revs as constant size
+    records. The whole file is read into memory, but it is only 'parsed' on
+    demand. The file is usually append-only but will be truncated if repo
+    modification is detected.
+    The record for each revision contains the first 4 bytes of the
+    corresponding node hash, and the record is only used if it still matches.
+    Even a completely trashed rbc-revs file will thus still give the right result
+    while converging towards full recovery ... assuming no incorrectly matching
+    node hashes.
+    The record also contains 4 bytes where 31 bits contain the index of the
+    branch and the last bit indicates that it is a branch close commit.
+    The usage pattern for rbc-revs is thus somewhat similar to 00changelog.i
+    and will grow with it but be 1/8th of its size.
+    """
+
+    def __init__(self, repo):
+        repo = repo.unfiltered()
+        self._ui = repo.ui
+        self._changelog = repo.changelog
+        self._vfs = repo.vfs
+        self._names = [] # branch names in local encoding with static index
+        self._rbcrevs = array('c') # structs of type _rbcrecfmt
+        try:
+            bndata = self._vfs.read(_rbcnames)
+            self._names = [encoding.tolocal(bn) for bn in bndata.split('\0')]
+        except (IOError, OSError, util.Abort), inst:
+            self._ui.debug("couldn't read revision branch cache names: %s\n" %
+                           inst)
+        if self._names:
+            try:
+                data = self._vfs.read(_rbcrevs)
+                self._rbcrevs.fromstring(data)
+            except (IOError, OSError), inst:
+                self._ui.debug("couldn't read revision branch cache: %s\n" %
+                               inst)
+        # remember number of good records on disk
+        self._rbcrevslen = min(len(self._rbcrevs) // _rbcrecsize,
+                               len(self._changelog))
+        if self._rbcrevslen == 0:
+            self._names = []
+        self._rbcnameslen = len(self._names) # number of good names on disk
+        self._namesreverse = dict((b, r) for r, b in enumerate(self._names))
+
+    def branchinfo(self, rev):
+        """Return branch name and close flag for rev, using and updating
+        persistent cache."""
+        rbcrevidx = rev * _rbcrecsize
+
+        # if requested rev is missing, add and populate all missing revs
+        if len(self._rbcrevs) < rbcrevidx + _rbcrecsize:
+            first = len(self._rbcrevs) // _rbcrecsize
+            self._rbcrevs.extend('\0' * (len(self._changelog) * _rbcrecsize -
+                                         len(self._rbcrevs)))
+            for r in xrange(first, len(self._changelog)):
+                self._branchinfo(r)
+
+        # fast path: extract data from cache, use it if node is matching
+        reponode = self._changelog.node(rev)[:_rbcnodelen]
+        cachenode, branchidx = unpack(
+            _rbcrecfmt, buffer(self._rbcrevs, rbcrevidx, _rbcrecsize))
+        close = bool(branchidx & _rbccloseflag)
+        if close:
+            branchidx &= _rbcbranchidxmask
+        if cachenode == reponode:
+            return self._names[branchidx], close
+        # fall back to slow path and make sure it will be written to disk
+        self._rbcrevslen = min(self._rbcrevslen, rev)
+        return self._branchinfo(rev)
+
+    def _branchinfo(self, rev):
+        """Retrieve branch info from changelog and update _rbcrevs"""
+        b, close = self._changelog.branchinfo(rev)
+        if b in self._namesreverse:
+            branchidx = self._namesreverse[b]
+        else:
+            branchidx = len(self._names)
+            self._names.append(b)
+            self._namesreverse[b] = branchidx
+        reponode = self._changelog.node(rev)
+        if close:
+            branchidx |= _rbccloseflag
+        rbcrevidx = rev * _rbcrecsize
+        rec = array('c')
+        rec.fromstring(pack(_rbcrecfmt, reponode, branchidx))
+        self._rbcrevs[rbcrevidx:rbcrevidx + _rbcrecsize] = rec
+        return b, close
+
+    def write(self):
+        """Save branch cache if it is dirty."""
+        if self._rbcnameslen < len(self._names):
+            try:
+                if self._rbcnameslen == 0:
+                    f = self._vfs.open(_rbcnames, 'wb')
+                else:
+                    f = self._vfs.open(_rbcnames, 'ab')
+                    f.write('\0')
+                f.write('\0'.join(encoding.fromlocal(b)
+                                  for b in self._names[self._rbcnameslen:]))
+                f.close()
+            except (IOError, OSError, util.Abort), inst:
+                self._ui.debug("couldn't write revision branch cache names: "
+                               "%s\n" % inst)
+                return
+            self._rbcnameslen = len(self._names)
+
+        start = self._rbcrevslen * _rbcrecsize
+        if start != len(self._rbcrevs):
+            try:
+                f = self._vfs.open(_rbcrevs, 'ab')
+                if f.tell() != start:
+                    f.seek(start)
+                    f.truncate()
+                end = len(self._changelog) * _rbcrecsize
+                f.write(self._rbcrevs[start:end])
+                f.close()
+            except (IOError, OSError, util.Abort), inst:
+                self._ui.debug("couldn't write revision branch cache: %s\n" %
+                               inst)
+                return
+            self._rbcrevslen = len(self._changelog)
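The record layout the docstring describes can be exercised directly with `struct`. A standalone Python 3 sketch using the patch's constants follows; `encoderec` and `decoderec` are illustrative names, not functions in the patch, which inlines this logic in `branchinfo` and `_branchinfo`.

```python
from struct import calcsize, pack, unpack

# constants as defined in the patch
_rbcrecfmt = '>4sI'                # [4-byte node prefix][4-byte branch index]
_rbcrecsize = calcsize(_rbcrecfmt) # 8 bytes per revision
_rbcbranchidxmask = 0x7fffffff
_rbccloseflag = 0x80000000

def encoderec(nodeprefix, branchidx, close):
    """Pack one rbc-revs record: the close flag rides in the sign bit
    of the branch index."""
    if close:
        branchidx |= _rbccloseflag
    return pack(_rbcrecfmt, nodeprefix, branchidx)

def decoderec(rec):
    """Unpack one record into (node prefix, branch index, close flag)."""
    nodeprefix, branchidx = unpack(_rbcrecfmt, rec)
    close = bool(branchidx & _rbccloseflag)
    return nodeprefix, branchidx & _rbcbranchidxmask, close
```

Since each record is exactly 8 bytes, rbc-revs ends up at 1/8th the size of 00changelog.i (whose index entries are 64 bytes), matching the docstring's claim.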