Patchwork [1,of,2,v4] localrepo: persistent caching of branch names

Submitter Mads Kiilerich
Date Oct. 16, 2014, 1:48 a.m.
Message ID <d87009e720063a8d6d80.1413424103@ssl.google-analytics.com>
Permalink /patch/6311/
State Superseded

Comments

Mads Kiilerich - Oct. 16, 2014, 1:48 a.m.
# HG changeset patch
# User Mads Kiilerich <madski@unity3d.com>
# Date 1413424006 -7200
#      Thu Oct 16 03:46:46 2014 +0200
# Node ID d87009e720063a8d6d80afdbe6bb9323e2849030
# Parent  48c0b101a9de1fdbd638daa858da845cd05a6be7
localrepo: persistent caching of branch names

It is expensive to retrieve the branch name. Very expensive when creating a
changectx and calling .branch() - slightly less when using
changelog.branchinfo().

Now, to really speed things up, cache the results on disk. To get efficient
lookup for revisions (constant size records) and avoid storing the same branch
name over and over, store the name of each branch once with a fixed ordering.
For each repo revision, store the node hash and the index of the branch name.
To make it 100% stable against repository mutations, always check the node hash
before using the cache content.

The code for this is kind of similar to the branchmap handling and is placed in
the same module even though the name is not completely spot on.

This new method promises to make some operations up to 20 times faster once it
is actually used.

A simpler approach that didn't store and validate node hashes for every
revision was significantly faster (x2) but could be tricked when modifying
history. The usual worst case would be that the whole cache was invalidated
when the repository history was modified, but when trying very hard it could be
tricked into not noticing changes.
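
For reference, the on-disk layout this gives can be sketched as follows; the
constants mirror the patch below, while the parsecache helper and its names
are purely illustrative:

    # layout: a '>LLL' header (format version, records start, records length),
    # then the '\0'-separated branch names, then one '>20sH' record per
    # revision: (node hash, index into the branch name list).
    import struct

    headerfmt = '>LLL'
    recfmt = '>20sH'
    headersize = struct.calcsize(headerfmt)  # 12 bytes
    recsize = struct.calcsize(recfmt)        # 22 bytes

    def parsecache(data):
        version, recsstart, recslen = struct.unpack_from(headerfmt, data)
        names = data[headersize:recsstart].split('\0')
        branches = []
        for offset in range(recsstart, recsstart + recslen, recsize):
            node, branchidx = struct.unpack_from(recfmt, data, offset)
            # never-written records are '\xff' padding; report them as None
            branches.append(names[branchidx] if branchidx < len(names)
                            else None)
        return names, branches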
Pierre-Yves David - Oct. 16, 2014, 7:27 a.m.
On 10/15/2014 06:48 PM, Mads Kiilerich wrote:
> # HG changeset patch
> # User Mads Kiilerich <madski@unity3d.com>
> # Date 1413424006 -7200
> #      Thu Oct 16 03:46:46 2014 +0200
> # Node ID d87009e720063a8d6d80afdbe6bb9323e2849030
> # Parent  48c0b101a9de1fdbd638daa858da845cd05a6be7
> localrepo: persistent caching of branch names
>
> It is expensive to retrieve the branch name. Very expensive when creating a
> changectx and calling .branch() - slightly less when using
> changelog.branchinfo().
>
> Now, to really speed things up, cache the results on disk. To get efficient
> lookup for revisions (constant size records) and avoid storing the same branch
> name over and ever, store the name of each branch once with a fixed ordering.
> For each repo revision, store the node hash and the index of the branch name.
> To make it 100% stable against repository mutations, always check the node hash
> before using the cache content.
>
> The code for this is kind of similar to the branchmap handling and is placed in
> the same module even though the name is not completely spot on.
>
> This new method promise to make some operations up 20 times faster once it
> actually is used.
>
> A simpler approach that didn't store and validate node hashes for every
> revision was significantly faster (x2) but could be tricked when modifying
> history. The usual worst case would be that the whole cache was invalidated
> when the repository history was modified, but when trying very hard it could be
> tricked into not noticing changes.
>
> diff --git a/mercurial/branchmap.py b/mercurial/branchmap.py
> --- a/mercurial/branchmap.py
> +++ b/mercurial/branchmap.py
> @@ -9,6 +9,7 @@ from node import bin, hex, nullid, nullr
>   import encoding
>   import util
>   import time
> +import struct, array
>
>   def _filename(repo):
>       """name of a branchcache file for a given repo or repoview"""
> @@ -285,3 +286,97 @@ class branchcache(dict):
>           duration = time.time() - starttime
>           repo.ui.log('branchcache', 'updated %s branch cache in %.4f seconds\n',
>                       repo.filtername, duration)
> +
> +class revbranchcache(object):
> +    """Persistent cache mapping from revision number to branch name.
> +    Consistency is guaranteed by verifying the node hash."""
> +
> +    filename = 'cache/branchnames'
> +    formatversion = 2345164374
> +    headerfmt = '>LLL' # file header: version, records start, records length
> +    recfmt = '>20sH' # a record: node hash, branch name reference
> +    headersize = struct.calcsize(headerfmt)
> +    recsize = struct.calcsize(recfmt)

We could use a textual description of the format; I have trouble
identifying how branch names are first introduced.

Your formatversion value looks temporary


> +
> +    def __init__(self, repo):
> +        self._repo = repo
> +        self._loaded = False
> +        self._dirty = False
> +        self._names = [] # branch names referenced from recfmt records
> +        self._records = array.array('c') # bytes with structs of type recfmt
> +
> +    def _load(self):
> +        """Load cached branch names."""
> +        try:
> +            data = self._repo.vfs.open(self.filename).read()
> +        except IOError:
> +            data = ''

Missing the usual:

   if err.errno != errno.ENOENT:
       raise
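
That is, something along the lines of this sketch of the conventional
pattern (readcache and its arguments are illustrative, not code from the
patch):

    import errno

    def readcache(vfs, filename):
        # fall back to an empty cache only when the file is missing;
        # propagate other I/O errors (permissions, disk problems, ...)
        try:
            return vfs.open(filename).read()
        except IOError as err:
            if err.errno != errno.ENOENT:
                raise
            return ''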


> +        self._dirty = True

Not sure why we default to dirty as we just read the data from disk?

> +        reporecslen = len(self._repo) * self.recsize
> +        if len(data) >= self.headersize:
> +            # header
> +            v, recsstart, recslen = struct.unpack_from(self.headerfmt, data)
> +            if v == self.formatversion and len(data) == recsstart + recslen:
> +                # between header and records: \0 separated branch names
> +                if recsstart != self.headersize:
> +                    self._names = \
> +                        data[self.headersize:recsstart].split('\0')

Please use an intermediate variable instead of ugly \
continuation.

Now I know how you handle branch names (but it should still be documented).
I see it as a minor issue that the cache has to be wiped for every new
branch.
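
For example, one possible rewrite of the two lines in question (namesdata
is an illustrative name):

    # bind the slice to a variable first, then split it
    namesdata = data[self.headersize:recsstart]
    self._names = namesdata.split('\0')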

> +                # read records, cap at repo size
> +                self._records.fromstring(
> +                    buffer(data, recsstart, min(recslen, reporecslen)))

Wrong indentation; it should be aligned with the opening parenthesis. You
could use an intermediate variable.

> +                # only dirty if too many records (after strip)
> +                self._dirty = recslen > reporecslen
> +            else:
> +                self._repo.ui.debug('branch cache file was invalid\n')

Consider reversing the logic and putting the short branch of the if/else
first. That avoids having to re-fetch the context from N lines higher up
when reaching this else.

> +        # pad to repo size
> +        if len(self._records) < reporecslen:
> +            self._records.extend(
> +                '\xff' * (reporecslen - len(self._records)))
> +
> +        self._branchnamesindex = dict((b, r)
> +                                      for r, b in enumerate(self._names))
> +        self._node = self._repo.changelog.node
> +        self._branchinfo = self._repo.changelog.branchinfo
> +        self._loaded = True
> +
> +    def branch(self, rev):
> +        """Return branch name of rev, using and updating persistent cache."""
> +        if not self._loaded:
> +            self._load()
> +
> +        node = self._node(rev)
> +        cachenode, branchidx = struct.unpack_from(self.recfmt, self._records,
> +                                                  rev * self.recsize)

The runtime packing and unpacking makes me shiver a bit, but it seems
fairly reasonable.

> +        if cachenode == node and branchidx < len(self._names):
> +            return self._names[branchidx]
> +        b, _close = self._branchinfo(rev)
> +        if b in self._branchnamesindex:
> +            branchidx = self._branchnamesindex[b]
> +        else:
> +            branchidx = len(self._names)
> +            self._names.append(b)
> +            self._branchnamesindex[b] = branchidx
> +        struct.pack_into(self.recfmt, self._records, rev * self.recsize,
> +                         node, branchidx)
> +        self._dirty = True
> +        return b
> +
> +    def save(self):
> +        """Save branch cache if it is dirty."""
> +        if self._dirty:
> +            self._repo.ui.debug('writing branch cache file\n')
> +            try:
> +                f = self._repo.vfs.open(self.filename, 'w', atomictemp=True)
> +                s = '\0'.join(self._names)
> +                f.write(struct.pack(self.headerfmt, self.formatversion,
> +                                    self.headersize + len(s),
> +                                    len(self._records)))
> +                f.write(s)
> +                f.write(self._records)
> +                f.close()
> +            except IOError:

Same here; we need to propagate some of them.

> +                pass
> +            self._dirty = False
> +
> diff --git a/mercurial/localrepo.py b/mercurial/localrepo.py
> --- a/mercurial/localrepo.py
> +++ b/mercurial/localrepo.py
> @@ -297,8 +297,11 @@ class localrepository(object):
>           # - bookmark changes
>           self.filteredrevcache = {}
>
> +        self.revbranchcache = branchmap.revbranchcache(self)
> +
>       def close(self):
> -        pass
> +        if self.revbranchcache:
> +            self.revbranchcache.save()

Should we do that sooner instead, and check whether the data we are about
to write is still valid?

>
>       def _restrictcapabilities(self, caps):
>           # bundle2 is not ready for prime time, drop it unless explicitly
> diff --git a/mercurial/statichttprepo.py b/mercurial/statichttprepo.py
> --- a/mercurial/statichttprepo.py
> +++ b/mercurial/statichttprepo.py
> @@ -141,6 +141,7 @@ class statichttprepository(localrepo.loc
>           self._branchcaches = {}
>           self.encodepats = None
>           self.decodepats = None
> +        self.revbranchcache = None
>
>       def _restrictcapabilities(self, caps):
>           caps = super(statichttprepository, self)._restrictcapabilities(caps)
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@selenic.com
> http://selenic.com/mailman/listinfo/mercurial-devel
>
Mads Kiilerich - Oct. 16, 2014, 2:45 p.m.
On 10/16/2014 09:27 AM, Pierre-Yves David wrote:
> On 10/15/2014 06:48 PM, Mads Kiilerich wrote:
>> # HG changeset patch
>> # User Mads Kiilerich <madski@unity3d.com>
>> # Date 1413424006 -7200
>> #      Thu Oct 16 03:46:46 2014 +0200
>> # Node ID d87009e720063a8d6d80afdbe6bb9323e2849030
>> # Parent  48c0b101a9de1fdbd638daa858da845cd05a6be7
>> localrepo: persistent caching of branch names
>>
>> It is expensive to retrieve the branch name. Very expensive when 
>> creating a
>> changectx and calling .branch() - slightly less when using
>> changelog.branchinfo().
>>
>> Now, to really speed things up, cache the results on disk. To get 
>> efficient
>> lookup for revisions (constant size records) and avoid storing the 
>> same branch
>> name over and ever, store the name of each branch once with a fixed 
>> ordering.
>> For each repo revision, store the node hash and the index of the 
>> branch name.
>> To make it 100% stable against repository mutations, always check the 
>> node hash
>> before using the cache content.
>>
>> The code for this is kind of similar to the branchmap handling and is 
>> placed in
>> the same module even though the name is not completely spot on.
>>
>> This new method promise to make some operations up 20 times faster 
>> once it
>> actually is used.
>>
>> A simpler approach that didn't store and validate node hashes for every
>> revision was significantly faster (x2) but could be tricked when 
>> modifying
>> history. The usual worst case would be that the whole cache was 
>> invalidated
>> when the repository history was modified, but when trying very hard 
>> it could be
>> tricked into not noticing changes.
>>
>> diff --git a/mercurial/branchmap.py b/mercurial/branchmap.py
>> --- a/mercurial/branchmap.py
>> +++ b/mercurial/branchmap.py
>> @@ -9,6 +9,7 @@ from node import bin, hex, nullid, nullr
>>   import encoding
>>   import util
>>   import time
>> +import struct, array
>>
>>   def _filename(repo):
>>       """name of a branchcache file for a given repo or repoview"""
>> @@ -285,3 +286,97 @@ class branchcache(dict):
>>           duration = time.time() - starttime
>>           repo.ui.log('branchcache', 'updated %s branch cache in %.4f 
>> seconds\n',
>>                       repo.filtername, duration)
>> +
>> +class revbranchcache(object):
>> +    """Persistent cache mapping from revision number to branch name.
>> +    Consistency is guaranteed by verifying the node hash."""
>> +
>> +    filename = 'cache/branchnames'
>> +    formatversion = 2345164374
>> +    headerfmt = '>LLL' # file header: version, records start, 
>> records length
>> +    recfmt = '>20sH' # a record: node hash, branch name reference
>> +    headersize = struct.calcsize(headerfmt)
>> +    recsize = struct.calcsize(recfmt)
>
> We could use a textual description of the format I've trouble 
> identifying how branch name are first introduced.
>
> Your formatversion value looks temporary

I can rename it to magic. That makes it more resilient than starting with
the number 0 as many other files do. (Or we could just leave it out. It is
just a cache; errors will be detected, and we would probably just rebuild
the cache after a format change anyway.)

>
>> +
>> +    def __init__(self, repo):
>> +        self._repo = repo
>> +        self._loaded = False
>> +        self._dirty = False
>> +        self._names = [] # branch names referenced from recfmt records
>> +        self._records = array.array('c') # bytes with structs of 
>> type recfmt
>> +
>> +    def _load(self):
>> +        """Load cached branch names."""
>> +        try:
>> +            data = self._repo.vfs.open(self.filename).read()
>> +        except IOError:
>> +            data = ''
>
> missing the usual.
>
>   if err.errno != errno.ENOENT:
>       raise

No. This is just a cache. I really don't care why the read failed. 
Problems with this file should fall back to normal operation, never 
block anything.

> + self._dirty = True
>
> Not sure why we default to dirty as we just read the data from disk?

It will be set to False after the file successfully has been parsed.

>
>> +        reporecslen = len(self._repo) * self.recsize
>> +        if len(data) >= self.headersize:
>> +            # header
>> +            v, recsstart, recslen = 
>> struct.unpack_from(self.headerfmt, data)
>> +            if v == self.formatversion and len(data) == recsstart + 
>> recslen:
>> +                # between header and records: \0 separated branch names
>> +                if recsstart != self.headersize:
>> +                    self._names = \
>> + data[self.headersize:recsstart].split('\0')
>
> Please use an intermediate variable instead of ugly \
> continuation.

Please don't say that. There is nothing in the Mercurial coding style 
about that. It says mostly pep8, and pep8 explicitly allows use of \ 
when there is no obvious way to place () around it. If you want to 
enforce your own ugly style, please make Matt put it in the coding 
style. My attention span is not limited to one textual line and I 
strongly dislike the use of ugly intermediate redundant variables.

> Now I know how you handle branch name (but should stil be documented. 
> I see this as a minor issue that the cache has to be wiped for every 
> new branch.

No. What makes you think that? It is just appended to the list and no 
indices will change.

>
>> +                # read records, cap at repo size
>> +                self._records.fromstring(
>> +                    buffer(data, recsstart, min(recslen, reporecslen)))
>
> wrong identation, should be aligned with the opening parents. You 
> could use and intermediate variable.

That kind of indentation is explicitly allowed in pep8. Please don't 
call it wrong unless you change the rules so it becomes wrong.

>
>> +                # only dirty if too many records (after strip)
>> +                self._dirty = recslen > reporecslen
>> +            else:
>> +                self._repo.ui.debug('branch cache file was invalid\n')
>
> Consider reversing the logic and putting the short branch of the 
> if/else at start. This having having to refetch the context N line 
> higher when meeting this else.

Considered ... but not liked. As it is now, the condition has no 
redundant negation, it focuses on what is important, and the handling of 
the important case comes first.

But the debug statement is arguably not important and I can remove it.

>
>> +        # pad to repo size
>> +        if len(self._records) < reporecslen:
>> +            self._records.extend(
>> +                '\xff' * (reporecslen - len(self._records)))
>> +
>> +        self._branchnamesindex = dict((b, r)
>> +                                      for r, b in 
>> enumerate(self._names))
>> +        self._node = self._repo.changelog.node
>> +        self._branchinfo = self._repo.changelog.branchinfo
>> +        self._loaded = True
>> +
>> +    def branch(self, rev):
>> +        """Return branch name of rev, using and updating persistent 
>> cache."""
>> +        if not self._loaded:
>> +            self._load()
>> +
>> +        node = self._node(rev)
>> +        cachenode, branchidx = struct.unpack_from(self.recfmt, 
>> self._records,
>> +                                                  rev * self.recsize)
>
> The runtime packaing and depacking make me shivers a bit but it seems 
> fairly reasonable.

Unpacking everything up front and storing updates in a different format 
would make me shiver. This way of doing it is quite efficient in time 
and memory. Even more so if Python had mutable arrays of structs.

>
>> +        if cachenode == node and branchidx < len(self._names):
>> +            return self._names[branchidx]
>> +        b, _close = self._branchinfo(rev)
>> +        if b in self._branchnamesindex:
>> +            branchidx = self._branchnamesindex[b]
>> +        else:
>> +            branchidx = len(self._names)
>> +            self._names.append(b)
>> +            self._branchnamesindex[b] = branchidx
>> +        struct.pack_into(self.recfmt, self._records, rev * 
>> self.recsize,
>> +                         node, branchidx)
>> +        self._dirty = True
>> +        return b
>> +
>> +    def save(self):
>> +        """Save branch cache if it is dirty."""
>> +        if self._dirty:
>> +            self._repo.ui.debug('writing branch cache file\n')
>> +            try:
>> +                f = self._repo.vfs.open(self.filename, 'w', 
>> atomictemp=True)
>> +                s = '\0'.join(self._names)
>> +                f.write(struct.pack(self.headerfmt, self.formatversion,
>> +                                    self.headersize + len(s),
>> +                                    len(self._records)))
>> +                f.write(s)
>> +                f.write(self._records)
>> +                f.close()
>> +            except IOError:
>
> Same here need to propage some of them

Same: it is just a cache. Failing to update it should not cause a 
failure, no matter what the reason was.

>
>> +                pass
>> +            self._dirty = False
>> +
>> diff --git a/mercurial/localrepo.py b/mercurial/localrepo.py
>> --- a/mercurial/localrepo.py
>> +++ b/mercurial/localrepo.py
>> @@ -297,8 +297,11 @@ class localrepository(object):
>>           # - bookmark changes
>>           self.filteredrevcache = {}
>>
>> +        self.revbranchcache = branchmap.revbranchcache(self)
>> +
>>       def close(self):
>> -        pass
>> +        if self.revbranchcache:
>> +            self.revbranchcache.save()
>
> Should we do that sooner instead? 

Why and when?

Writes are too expensive to do immediately every time we see a cache 
miss, and this seems to be a simple and well-defined point in time where 
we know there will be no other branch lookups.

Later refactorings can perhaps turn it into something that explicitly 
has to be loaded and saved in order to use it in the few relevant places.

> and check if the data we are about to write are still valid?

Writes are always valid. Worst case is that some branch lookups done by 
others are lost, but they are not that expensive to do again. For all 
practical purposes it will quickly converge towards a fully populated cache.


Thanks for feedback.

/Mads
Pierre-Yves David - Oct. 17, 2014, 5:11 a.m.
On 10/16/2014 07:45 AM, Mads Kiilerich wrote:
> On 10/16/2014 09:27 AM, Pierre-Yves David wrote:
>> On 10/15/2014 06:48 PM, Mads Kiilerich wrote:
>>> # HG changeset patch
>>> # User Mads Kiilerich <madski@unity3d.com>
>>> # Date 1413424006 -7200
>>> #      Thu Oct 16 03:46:46 2014 +0200
>>> # Node ID d87009e720063a8d6d80afdbe6bb9323e2849030
>>> # Parent  48c0b101a9de1fdbd638daa858da845cd05a6be7
>>> localrepo: persistent caching of branch names
>>>
>>> It is expensive to retrieve the branch name. Very expensive when
>>> creating a
>>> changectx and calling .branch() - slightly less when using
>>> changelog.branchinfo().
>>>
>>> Now, to really speed things up, cache the results on disk. To get
>>> efficient
>>> lookup for revisions (constant size records) and avoid storing the
>>> same branch
>>> name over and ever, store the name of each branch once with a fixed
>>> ordering.
>>> For each repo revision, store the node hash and the index of the
>>> branch name.
>>> To make it 100% stable against repository mutations, always check the
>>> node hash
>>> before using the cache content.
>>>
>>> The code for this is kind of similar to the branchmap handling and is
>>> placed in
>>> the same module even though the name is not completely spot on.
>>>
>>> This new method promise to make some operations up 20 times faster
>>> once it
>>> actually is used.
>>>
>>> A simpler approach that didn't store and validate node hashes for every
>>> revision was significantly faster (x2) but could be tricked when
>>> modifying
>>> history. The usual worst case would be that the whole cache was
>>> invalidated
>>> when the repository history was modified, but when trying very hard
>>> it could be
>>> tricked into not noticing changes.
>>>
>>> diff --git a/mercurial/branchmap.py b/mercurial/branchmap.py
>>> --- a/mercurial/branchmap.py
>>> +++ b/mercurial/branchmap.py
>>> @@ -9,6 +9,7 @@ from node import bin, hex, nullid, nullr
>>>   import encoding
>>>   import util
>>>   import time
>>> +import struct, array
>>>
>>>   def _filename(repo):
>>>       """name of a branchcache file for a given repo or repoview"""
>>> @@ -285,3 +286,97 @@ class branchcache(dict):
>>>           duration = time.time() - starttime
>>>           repo.ui.log('branchcache', 'updated %s branch cache in %.4f
>>> seconds\n',
>>>                       repo.filtername, duration)
>>> +
>>> +class revbranchcache(object):
>>> +    """Persistent cache mapping from revision number to branch name.
>>> +    Consistency is guaranteed by verifying the node hash."""
>>> +
>>> +    filename = 'cache/branchnames'
>>> +    formatversion = 2345164374
>>> +    headerfmt = '>LLL' # file header: version, records start,
>>> records length
>>> +    recfmt = '>20sH' # a record: node hash, branch name reference
>>> +    headersize = struct.calcsize(headerfmt)
>>> +    recsize = struct.calcsize(recfmt)
>>
>> We could use a textual description of the format I've trouble
>> identifying how branch name are first introduced.
>>
>> Your formatversion value looks temporary
>
> I can rename it to magic. That makes it more resilient than starting
> with number 0 as many other files do. (Or we could just leave it out. It
> is just a cache and errors will be detected and we would probably just
> rebuild the cache after format change anyway.)

Having this kind of version number to fail fast and gracefully is useful.

Please start it at 0 or 1 (or something sensible that helps track errors
down).

>>> +
>>> +    def __init__(self, repo):
>>> +        self._repo = repo
>>> +        self._loaded = False
>>> +        self._dirty = False
>>> +        self._names = [] # branch names referenced from recfmt records
>>> +        self._records = array.array('c') # bytes with structs of
>>> type recfmt
>>> +
>>> +    def _load(self):
>>> +        """Load cached branch names."""
>>> +        try:
>>> +            data = self._repo.vfs.open(self.filename).read()
>>> +        except IOError:
>>> +            data = ''
>>
>> missing the usual.
>>
>>   if err.errno != errno.ENOENT:
>>       raise
>
> No. This is just a cache. I really don't care why the read failed.
> Problems with this file should fall back to normal operation, never
> block anything.

Why don't you just except Exception then?

I think restricting error catching is valuable, and we should use the
same approach as for any other cache.


>> + self._dirty = True
>>
>> Not sure why we default to dirty as we just read the data from disk?
>
> It will be set to False after the file successfully has been parsed.

An empty cache still seems like something clean to me. I do not get what
the value of initializing to False here is.

>>> +        reporecslen = len(self._repo) * self.recsize
>>> +        if len(data) >= self.headersize:
>>> +            # header
>>> +            v, recsstart, recslen =
>>> struct.unpack_from(self.headerfmt, data)
>>> +            if v == self.formatversion and len(data) == recsstart +
>>> recslen:
>>> +                # between header and records: \0 separated branch names
>>> +                if recsstart != self.headersize:
>>> +                    self._names = \
>>> + data[self.headersize:recsstart].split('\0')
>>
>> Please use an intermediate variable instead of ugly \
>> continuation.
>
> Please don't say that. There is nothing in the Mercurial coding style
> about that. It says mostly pep8, and pep8 explicitly allows use of \
> when there is no obvious way to place () around it. If you want to
> enforce your own ugly style, please make Matt put it in the coding
> style. My attention span is not limited to one textual line and I
> strongly dislike the use of ugly intermediate redundant variables.

Given that there are only 28 such usages of "\" in the whole Mercurial
code base, and that I've already been asked not to use them by other
reviewers (when I was young and foolish), I stick to my request to remove it.

>> Now I know how you handle branch name (but should stil be documented.
>> I see this as a minor issue that the cache has to be wiped for every
>> new branch.
>
> No. What makes you think that? It is just appended to the list and no
> indices will change.

Right, as I understood later in these patches, the whole content is
written to disk every time something changes, so there is no data position
to preserve.

However, this is going to be quite costly for big repos. I'm concerned
about this: 1 million revs means a 20 MB cache. That is not a small thing
to flush to disk every so often.
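
(For scale: struct.calcsize('>20sH') is 22 bytes per record, so 1,000,000
revisions comes to roughly 22 MB of records before adding the
'\0'-separated branch names, in line with the estimate above.)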


>
>>
>>> +                # read records, cap at repo size
>>> +                self._records.fromstring(
>>> +                    buffer(data, recsstart, min(recslen, reporecslen)))
>>
>> wrong identation, should be aligned with the opening parents. You
>> could use and intermediate variable.
>
> That kind of indentation is explicitly allowed in pep8. Please don't
> call it wrong unless you change the rules so it become wrong.

The Mercurial code base is mostly consistent in aligning wrapped lines to
the opening bracket. I guess this comes from Matt using emacs. This has
been consistently enforced by multiple reviewers. Please stay consistent
with the rest of the codebase.

>>> +                # only dirty if too many records (after strip)
>>> +                self._dirty = recslen > reporecslen
>>> +            else:
>>> +                self._repo.ui.debug('branch cache file was invalid\n')
>>
>> Consider reversing the logic and putting the short branch of the
>> if/else at start. This having having to refetch the context N line
>> higher when meeting this else.
>
> Considered ... but not liked. As it is now, the condition has no
> redundant negation, it focus on what is important, and the handling of
> the important case comes first.
>
> But the debug statement is arguably not important and I can remove it.

The debug statement is valuable. Keep it around.

>>> +        # pad to repo size
>>> +        if len(self._records) < reporecslen:
>>> +            self._records.extend(
>>> +                '\xff' * (reporecslen - len(self._records)))
>>> +
>>> +        self._branchnamesindex = dict((b, r)
>>> +                                      for r, b in
>>> enumerate(self._names))
>>> +        self._node = self._repo.changelog.node
>>> +        self._branchinfo = self._repo.changelog.branchinfo
>>> +        self._loaded = True
>>> +
>>> +    def branch(self, rev):
>>> +        """Return branch name of rev, using and updating persistent
>>> cache."""
>>> +        if not self._loaded:
>>> +            self._load()
>>> +
>>> +        node = self._node(rev)
>>> +        cachenode, branchidx = struct.unpack_from(self.recfmt,
>>> self._records,
>>> +                                                  rev * self.recsize)
>>
>> The runtime packing and de-packing make me shivers a bit but it seems
>> fairly reasonable.
>
> Unpacking everything up front and storing updates in a different format
> would make me shiver. This way of doing it is quite efficient in time
> and memory. Even more so if Python had mutable arrays of structs.

My statement above meant: I do not like it much, but it seems a valid 
and efficient option. We can keep it for now.

>
>>
>>> +        if cachenode == node and branchidx < len(self._names):
>>> +            return self._names[branchidx]
>>> +        b, _close = self._branchinfo(rev)
>>> +        if b in self._branchnamesindex:
>>> +            branchidx = self._branchnamesindex[b]
>>> +        else:
>>> +            branchidx = len(self._names)
>>> +            self._names.append(b)
>>> +            self._branchnamesindex[b] = branchidx
>>> +        struct.pack_into(self.recfmt, self._records, rev *
>>> self.recsize,
>>> +                         node, branchidx)
>>> +        self._dirty = True
>>> +        return b
>>> +
>>> +    def save(self):
>>> +        """Save branch cache if it is dirty."""
>>> +        if self._dirty:
>>> +            self._repo.ui.debug('writing branch cache file\n')
>>> +            try:
>>> +                f = self._repo.vfs.open(self.filename, 'w',
>>> atomictemp=True)
>>> +                s = '\0'.join(self._names)
>>> +                f.write(struct.pack(self.headerfmt, self.formatversion,
>>> +                                    self.headersize + len(s),
>>> +                                    len(self._records)))
>>> +                f.write(s)
>>> +                f.write(self._records)
>>> +                f.close()
>>> +            except IOError:
>>
>> Same here need to propage some of them
>
> Same: it is just a cache. Failing to update it should not cause a
> failure, no matter what the reason was.

Same same!

>
>>
>>> +                pass
>>> +            self._dirty = False
>>> +
>>> diff --git a/mercurial/localrepo.py b/mercurial/localrepo.py
>>> --- a/mercurial/localrepo.py
>>> +++ b/mercurial/localrepo.py
>>> @@ -297,8 +297,11 @@ class localrepository(object):
>>>           # - bookmark changes
>>>           self.filteredrevcache = {}
>>>
>>> +        self.revbranchcache = branchmap.revbranchcache(self)
>>> +
>>>       def close(self):
>>> -        pass
>>> +        if self.revbranchcache:
>>> +            self.revbranchcache.save()
>>
>> Should we do that sooner instead?
>
> Why and when?

Hooking into something like unlocking of the repo (if a lock exists) would
let this be used server side (also read: command servers) and limit the
risk of a race where old data overwrites newer data.

The fact that it is used only by revsets now makes it less important, but
this should probably have the same write cycle as the branchmap.
Mads Kiilerich - Oct. 17, 2014, 4:40 p.m.
On 10/17/2014 07:11 AM, Pierre-Yves David wrote:
> On 10/16/2014 07:45 AM, Mads Kiilerich wrote:
>> On 10/16/2014 09:27 AM, Pierre-Yves David wrote:
>>> Your formatversion value looks temporary
>>
>> I can rename it to magic. That makes it more resilient than starting
>> with number 0 as many other files do. (Or we could just leave it out. It
>> is just a cache and errors will be detected and we would probably just
>> rebuild the cache after format change anyway.)
>
> Having this kind of version number to fail fast and gracefully is useful.
>
> Please start it to 0 or 1 (or something sensible that helps tracking 
> error down.

A magic number is even better at failing fast and gracefully and with 
less risk of incorrect acceptance of bad file formats.

>
>>>> +    def _load(self):
>>>> +        """Load cached branch names."""
>>>> +        try:
>>>> +            data = self._repo.vfs.open(self.filename).read()
>>>> +        except IOError:
>>>> +            data = ''
>>>
>>> missing the usual.
>>>
>>>   if err.errno != errno.ENOENT:
>>>       raise
>>
>> No. This is just a cache. I really don't care why the read failed.
>> Problems with this file should fall back to normal operation, never
>> block anything.
>
> Why don't you just except Exception then?

Because it is just IOErrors I don't care about. I do not want to enumerate
the IOErrors I don't care about (EPERM would be another). Other kinds of
errors should still be reported.

> I think restricting error catching is valuable and we should use the 
> same approach than for any other cache.

I see it more like how Python handles .pyc files. They are used and 
(re)written when possible, otherwise silently ignored.

>
>>> + self._dirty = True
>>>
>>> Not sure why we default to dirty as we just read the data from disk?
>>
>> It will be set to False after the file successfully has been parsed.
>
> Empty cache still seems to be something clean to me. 

An empty cache will load successfully and will be clean. It is only a
missing cache that is not clean and will be written the first time.

> I do not get what the value of initializing to False here.

The value is that we have exactly one place in the code where the cache 
has been correctly and cleanly loaded, but more than one place where it 
fails. Instead of marking it dirty when it fails I mark it as clean when 
it doesn't fail.

There might be different opinions on whether a failing load should be 
marked dirty or not. We could postpone the cleanup to the first 
following write, but I prefer to mark it dirty so we get it cleaned up 
ASAP. Postponing the cleanup will make the code two lines simpler so 
that would be fine with me too.

>>> Now I know how you handle branch name (but should stil be documented.
>>> I see this as a minor issue that the cache has to be wiped for every
>>> new branch.
>>
>> No. What makes you think that? It is just appended to the list and no
>> indices will change.
>
> Right, as I understood later in this patches, the whole content is 
> written to disc everytime something change. So there is no data 
> position to preserve.
>
> However, This is going to be quite costly for big repo. I'm concerned 
> about this. 1 millions revs means a 20MB caches. That is not small 
> thing to flush to disk every so often.

Yes. That is one reason to prefer the initial more fragile version that 
doesn't store the hashes. Right now the size will be comparable to the 
size of the changelog index.

It will, however, usually only be written on the first branch filtering
operation after a repository change. And it will only be read/written 
when doing operations that tend to request branch information for "all" 
revisions and thus would be expensive anyway. ("Real" indices mapping 
from branch names to revisions would of course be even better for a 
whole category of queries.)

Instead, we could make the file format append-only, possibly by keeping 
the branch names and indices in different files. That would however take 
the atomicity away and we would have to use locking and a different 
write-back scheme.

Different implementations can be considered later. This is just a cache.

>
>>
>>>
>>>> +                pass
>>>> +            self._dirty = False
>>>> +
>>>> diff --git a/mercurial/localrepo.py b/mercurial/localrepo.py
>>>> --- a/mercurial/localrepo.py
>>>> +++ b/mercurial/localrepo.py
>>>> @@ -297,8 +297,11 @@ class localrepository(object):
>>>>           # - bookmark changes
>>>>           self.filteredrevcache = {}
>>>>
>>>> +        self.revbranchcache = branchmap.revbranchcache(self)
>>>> +
>>>>       def close(self):
>>>> -        pass
>>>> +        if self.revbranchcache:
>>>> +            self.revbranchcache.save()
>>>
>>> Should we do that sooner instead?
>>
>> Why and when?
>
> Hooking to something like unlocking of the repo (if a lock exist) 
> would let this be used server side (also read commands server) and 
> limit the risk of race where old data rewrite newer one.

I understand there are plans/thoughts about moving all cache writing to 
unlocking / transaction commit. This cache can go there together with 
the other caches.

> The fact it is used only by revset now makes it less important but 
> this should probably have the same write cycle than the branchmap.

Except that branchmaps are 100% computed in one go and could be written 
immediately. This cache will only cache results that have been requested, 
and we do not know when there will be no more results.

Other implementations that keep the cache 100% updated all the time can 
be considered later. This is just a cache and the file format and 
implementation can be changed.

/Mads
Mads Kiilerich - Oct. 18, 2014, 12:12 a.m.
On 10/18/2014 12:44 AM, Pierre-Yves David wrote:
> So if the repo is empty, we ensure we write and empty cache file? 
> Seems a bit backward to me.

That is making sure that the cache file always basically matches the repo,
as far as we can cheaply detect. That seems simple and straightforward to
me.

Anyway, if there are other comments on v5, I can resend without any cache
invalidation at all in the cache load.

> I'm still concerned about this. I dunno if we should get it in for the 
> freeze and disable if too impactfull or if we should delay that for 
> early 3.2 cycles.

I think the impact is so big in a positive way that it is worth taking:
10 s for a very simple and realistic real-world case on a real repo goes
down to 0.5 s.

>> I understand there are plans/thoughts about moving all cache writing to
>> unlocking / transaction commit. This cache can go there together with
>> the other caches.
>
> Writing on unlock seems a similar amount of work than writing on 
> close. What about we do it too?

Sure. It sounds like that is something that should be in a separate 
follow-up patch anyway and on its own merits.

This caches the results of operations done on a localrepo instance during
its lifetime, and saving when closing it seems like the simplest and safest
solution. A baseline that can be improved later.

>> Other implementations that keep the cache 100% updated all the time can
>> be considered later. This is just a cache and the file format and
>> implementation can be changed.
>
> If you use this for branchmaps, you will ensure it is fully updated at 
> the same time as branchmap and can mostly stick to its update cycle 

Sure. A nice separate feature that can follow later.

/Mads

Patch

diff --git a/mercurial/branchmap.py b/mercurial/branchmap.py
--- a/mercurial/branchmap.py
+++ b/mercurial/branchmap.py
@@ -9,6 +9,7 @@  from node import bin, hex, nullid, nullr
 import encoding
 import util
 import time
+import struct, array
 
 def _filename(repo):
     """name of a branchcache file for a given repo or repoview"""
@@ -285,3 +286,97 @@  class branchcache(dict):
         duration = time.time() - starttime
         repo.ui.log('branchcache', 'updated %s branch cache in %.4f seconds\n',
                     repo.filtername, duration)
+
+class revbranchcache(object):
+    """Persistent cache mapping from revision number to branch name.
+    Consistency is guaranteed by verifying the node hash."""
+
+    filename = 'cache/branchnames'
+    formatversion = 2345164374
+    headerfmt = '>LLL' # file header: version, records start, records length
+    recfmt = '>20sH' # a record: node hash, branch name reference
+    headersize = struct.calcsize(headerfmt)
+    recsize = struct.calcsize(recfmt)
+
+    def __init__(self, repo):
+        self._repo = repo
+        self._loaded = False
+        self._dirty = False
+        self._names = [] # branch names referenced from recfmt records
+        self._records = array.array('c') # bytes with structs of type recfmt
+
+    def _load(self):
+        """Load cached branch names."""
+        try:
+            data = self._repo.vfs.open(self.filename).read()
+        except IOError:
+            data = ''
+
+        self._dirty = True
+        reporecslen = len(self._repo) * self.recsize
+        if len(data) >= self.headersize:
+            # header
+            v, recsstart, recslen = struct.unpack_from(self.headerfmt, data)
+            if v == self.formatversion and len(data) == recsstart + recslen:
+                # between header and records: \0 separated branch names
+                if recsstart != self.headersize:
+                    self._names = \
+                        data[self.headersize:recsstart].split('\0')
+                # read records, cap at repo size
+                self._records.fromstring(
+                    buffer(data, recsstart, min(recslen, reporecslen)))
+                # only dirty if too many records (after strip)
+                self._dirty = recslen > reporecslen
+            else:
+                self._repo.ui.debug('branch cache file was invalid\n')
+
+        # pad to repo size
+        if len(self._records) < reporecslen:
+            self._records.extend(
+                '\xff' * (reporecslen - len(self._records)))
+
+        self._branchnamesindex = dict((b, r)
+                                      for r, b in enumerate(self._names))
+        self._node = self._repo.changelog.node
+        self._branchinfo = self._repo.changelog.branchinfo
+        self._loaded = True
+
+    def branch(self, rev):
+        """Return branch name of rev, using and updating persistent cache."""
+        if not self._loaded:
+            self._load()
+
+        node = self._node(rev)
+        cachenode, branchidx = struct.unpack_from(self.recfmt, self._records,
+                                                  rev * self.recsize)
+        if cachenode == node and branchidx < len(self._names):
+            return self._names[branchidx]
+        b, _close = self._branchinfo(rev)
+        if b in self._branchnamesindex:
+            branchidx = self._branchnamesindex[b]
+        else:
+            branchidx = len(self._names)
+            self._names.append(b)
+            self._branchnamesindex[b] = branchidx
+        struct.pack_into(self.recfmt, self._records, rev * self.recsize,
+                         node, branchidx)
+        self._dirty = True
+        return b
+
+    def save(self):
+        """Save branch cache if it is dirty."""
+        if self._dirty:
+            self._repo.ui.debug('writing branch cache file\n')
+            try:
+                f = self._repo.vfs.open(self.filename, 'w', atomictemp=True)
+                s = '\0'.join(self._names)
+                f.write(struct.pack(self.headerfmt, self.formatversion,
+                                    self.headersize + len(s),
+                                    len(self._records)))
+                f.write(s)
+                f.write(self._records)
+                f.close()
+            except IOError:
+                pass
+            self._dirty = False
+
diff --git a/mercurial/localrepo.py b/mercurial/localrepo.py
--- a/mercurial/localrepo.py
+++ b/mercurial/localrepo.py
@@ -297,8 +297,11 @@  class localrepository(object):
         # - bookmark changes
         self.filteredrevcache = {}
 
+        self.revbranchcache = branchmap.revbranchcache(self)
+
     def close(self):
-        pass
+        if self.revbranchcache:
+            self.revbranchcache.save()
 
     def _restrictcapabilities(self, caps):
         # bundle2 is not ready for prime time, drop it unless explicitly
diff --git a/mercurial/statichttprepo.py b/mercurial/statichttprepo.py
--- a/mercurial/statichttprepo.py
+++ b/mercurial/statichttprepo.py
@@ -141,6 +141,7 @@  class statichttprepository(localrepo.loc
         self._branchcaches = {}
         self.encodepats = None
         self.decodepats = None
+        self.revbranchcache = None
 
     def _restrictcapabilities(self, caps):
         caps = super(statichttprepository, self)._restrictcapabilities(caps)