Patchwork [3,of,3] tags: preserve filtered .hgtags filenodes in tags cache (issue4550)

login
register
mail settings
Submitter Gregory Szorc
Date Feb. 24, 2015, 9:11 a.m.
Message ID <7f1904705c29ebe7de38.1424769114@gps-mbp.local>
Download mbox | patch
Permalink /patch/7825/
State Changes Requested
Delegated to: Pierre-Yves David
Headers show

Comments

Gregory Szorc - Feb. 24, 2015, 9:11 a.m.
# HG changeset patch
# User Gregory Szorc <gregory.szorc@gmail.com>
# Date 1424769075 28800
#      Tue Feb 24 01:11:15 2015 -0800
# Branch stable
# Node ID 7f1904705c29ebe7de3874f2f03c42e261ed1c96
# Parent  7d72752b8da5bb2482e6eac47545a78ed3fff592
tags: preserve filtered .hgtags filenodes in tags cache (issue4550)

If the tags cache is populated on an unfiltered repository and later
populated on a filtered repository, .hgtags filenode entries for
filtered revisions will disappear from the tags cache because the tags
cache code currently filters out filenode entries for revisions not
known to the current repo object. This behavior results in potentially
expensive recalculation of .hgtags filenode values for filtered
revisions. For evolution users, who create many hidden changesets and
heads, this could result in gradual slowdown, as each hidden head will
add overhead to resolving tags on an unfiltered repo.

This patch makes the tags cache filtered revision aware. Filenode
entries for filtered revisions are preserved during reading and writing.
Entries are only dropped from the tags cache if they don't correspond to
a head, filtered or otherwise.
Ryan McElroy - Feb. 25, 2015, 8:44 p.m.
On 2/24/2015 1:11 AM, Gregory Szorc wrote:
> # HG changeset patch
> # User Gregory Szorc <gregory.szorc@gmail.com>
> # Date 1424769075 28800
> #      Tue Feb 24 01:11:15 2015 -0800
> # Branch stable
> # Node ID 7f1904705c29ebe7de3874f2f03c42e261ed1c96
> # Parent  7d72752b8da5bb2482e6eac47545a78ed3fff592
> tags: preserve filtered .hgtags filenodes in tags cache (issue4550)
>
> If the tags cache is populated on an unfiltered repository and later
> populated on a filtered repository, .hgtags filenode entries for
> filtered revisions will disappear from the tags cache because the tags
> cache code currently filters out filenode entries for revisions not
> known to the current repo object. This behavior results in potentially
> expensive recalculation of .hgtags filenode values for filtered
> revisions. For evolution users, who create many hidden changesets and
> heads, this could result in gradual slowdown, as each hidden head will
> add overhead to resolving tags on an unfiltered repo.
>
> This patch makes the tags cache filtered revision aware. Filenode
> entries for filtered revisions are preserved during reading and writing.
> Entries are only dropped from the tags cache if they don't correspond to
> a head, filtered or otherwise.
>
> diff --git a/mercurial/tags.py b/mercurial/tags.py
> --- a/mercurial/tags.py
> +++ b/mercurial/tags.py
> @@ -246,12 +246,15 @@ def _readtagcache(ui, repo):
>           return (None, None, tags, False)
>       if cachefile:
>           cachefile.close()               # ignore rest of file
>   
> -    repoheads = repo.heads()
> +    ourheads = repo.heads()
> +    repo = repo.unfiltered()
> +    allheads = repo.heads()
> +
>       # Case 2 (uncommon): empty repo; get out quickly and don't bother
>       # writing an empty cache.
> -    if repoheads == [nullid]:
> +    if allheads == [nullid]:
>           return ([], {}, {}, False)
>   
>       # Case 3 (uncommon): cache file missing or empty.
>   
> @@ -268,14 +271,14 @@ def _readtagcache(ui, repo):
>       # exposed".
>       if not len(repo.file('.hgtags')):
>           # No tags have ever been committed, so we can avoid a
>           # potentially expensive search.
> -        return (repoheads, cachefnode, None, True)
> +        return (ourheads, cachefnode, None, True)
>   
>       starttime = time.time()
>   
>       newheads = [head
> -                for head in repoheads
> +                for head in allheads
>                   if head not in set(cacheheads)]
>   
>       # Now we have to lookup the .hgtags filenode for every new head.
>       # This is the most expensive part of finding tags, so performance
> @@ -297,9 +300,9 @@ def _readtagcache(ui, repo):
>              len(cachefnode), len(newheads), duration)
>   
>       # Caller has to iterate over all heads, but can use the filenodes in
>       # cachefnode to get to each .hgtags revision quickly.
> -    return (repoheads, cachefnode, None, True)
> +    return (ourheads, cachefnode, None, True)
>   
>   def _writetagcache(ui, repo, heads, tagfnode, cachetags):
>   
>       try:
> @@ -309,29 +312,39 @@ def _writetagcache(ui, repo, heads, tagf
>   
>       ui.log('tagscache', 'writing tags cache file with %d heads and %d tags\n',
>               len(heads), len(cachetags))
>   
> -    realheads = repo.heads()            # for sanity checks below
> +    # We want to carry forward tagfnode entries that belong to filtered revs,
> +    # even if they aren't in the explicit list of heads. Since entries in the
> +    # cache must be in descending revlog order, we need to merge the sets
> +    # before writing.
> +    #
> +    # When choosing what filenode entries to write, we must consider both the
> +    # filtered and unfiltered views. Otherwise, valid entries may be dropped.
> +    revs = {}
> +    ourheads = set(repo.heads())
> +    repo = repo.unfiltered()
> +    unfilteredheads = set(repo.heads())
> +    allheads = ourheads | unfilteredheads
/me confused: Why isn't unfilteredheads == allheads?
> +    for head, fnode in tagfnode.items():
> +        if head not in allheads:
> +            continue
> +
> +        rev = repo.changelog.rev(head)
> +        revs[rev] = '%s %s' % (hex(head), hex(fnode))
> +
>       for head in heads:
> -        # temporary sanity checks; these can probably be removed
> -        # once this code has been in crew for a few weeks
> -        assert head in repo.changelog.nodemap, \
> -               'trying to write non-existent node %s to tag cache' % short(head)
> -        assert head in realheads, \
> +        assert head in allheads, \
>                  'trying to write non-head %s to tag cache' % short(head)
>           assert head != nullid, \
>                  'trying to write nullid to tag cache'
>   
> -        # This can't fail because of the first assert above.  When/if we
> -        # remove that assert, we might want to catch LookupError here
> -        # and downgrade it to a warning.
> -        rev = repo.changelog.rev(head)
> +        if head not in tagfnode:
> +            rev = repo.changelog.rev(head)
> +            revs[rev] = hex(head)
>   
> -        fnode = tagfnode.get(head)
> -        if fnode:
> -            cachefile.write('%d %s %s\n' % (rev, hex(head), hex(fnode)))
> -        else:
> -            cachefile.write('%d %s\n' % (rev, hex(head)))
> +    for rev, line in sorted(revs.items(), reverse=True):
> +        cachefile.write('%d %s\n' % (rev, line))
>   
>       # Tag names in the cache are in UTF-8 -- which is the whole reason
>       # we keep them in UTF-8 throughout this module.  If we converted
>       # them local encoding on input, we would lose info writing them to
> diff --git a/tests/test-obsolete-tag-cache.t b/tests/test-obsolete-tag-cache.t
> --- a/tests/test-obsolete-tag-cache.t
> +++ b/tests/test-obsolete-tag-cache.t
> @@ -62,8 +62,9 @@ repopulation
>   .hgtags filenodes for hidden heads should be visible (issue4550)
>   
>     $ cat .hg/cache/tags
>     7 eb610439e10e0c6b296f97b59624c2e24fc59e30 b3bce87817fe7ac9dca2834366c1d7534c095cf1
> +  3 c3cb30f2d2cd0aae008cc91a07876e3c5131fd22 b3bce87817fe7ac9dca2834366c1d7534c095cf1
>     
>     55482a6fb4b1881fa8f746fd52cf6f096bb21c89 test1
>     d75775ffbc6bca1794d300f5571272879bd280da test2
>   

This feels like an important fix to me; however I have some suggestions 
to make this rather complicated change more reviewable:

(1) Refactor the names (into ourheads and allheads). This should be a 
trivial thing to review and will make the subsequent diffs smaller and 
focused more on fixing the actual issue
(2) The fixes to each function can be separate as well -- again, it will 
make the cognitive load of reviewing lower, and it will be more 
obviously correct
(3) Then you can turn the test back in the final diff on and *voila* it 
will pass, much happiness will ensue.
Gregory Szorc - Feb. 25, 2015, 9:08 p.m.
On Wed, Feb 25, 2015 at 12:44 PM, Ryan McElroy <rm@fb.com> wrote:

>
> On 2/24/2015 1:11 AM, Gregory Szorc wrote:
>
>> # HG changeset patch
>> # User Gregory Szorc <gregory.szorc@gmail.com>
>> # Date 1424769075 28800
>> #      Tue Feb 24 01:11:15 2015 -0800
>> # Branch stable
>> # Node ID 7f1904705c29ebe7de3874f2f03c42e261ed1c96
>> # Parent  7d72752b8da5bb2482e6eac47545a78ed3fff592
>> tags: preserve filtered .hgtags filenodes in tags cache (issue4550)
>>
>> If the tags cache is populated on an unfiltered repository and later
>> populated on a filtered repository, .hgtags filenode entries for
>> filtered revisions will disappear from the tags cache because the tags
>> cache code currently filters out filenode entries for revisions not
>> known to the current repo object. This behavior results in potentially
>> expensive recalculation of .hgtags filenode values for filtered
>> revisions. For evolution users, who create many hidden changesets and
>> heads, this could result in gradual slowdown, as each hidden head will
>> add overhead to resolving tags on an unfiltered repo.
>>
>> This patch makes the tags cache filtered revision aware. Filenode
>> entries for filtered revisions are preserved during reading and writing.
>> Entries are only dropped from the tags cache if they don't correspond to
>> a head, filtered or otherwise.
>>
>> diff --git a/mercurial/tags.py b/mercurial/tags.py
>> --- a/mercurial/tags.py
>> +++ b/mercurial/tags.py
>> @@ -246,12 +246,15 @@ def _readtagcache(ui, repo):
>>           return (None, None, tags, False)
>>       if cachefile:
>>           cachefile.close()               # ignore rest of file
>>   -    repoheads = repo.heads()
>> +    ourheads = repo.heads()
>> +    repo = repo.unfiltered()
>> +    allheads = repo.heads()
>> +
>>       # Case 2 (uncommon): empty repo; get out quickly and don't bother
>>       # writing an empty cache.
>> -    if repoheads == [nullid]:
>> +    if allheads == [nullid]:
>>           return ([], {}, {}, False)
>>         # Case 3 (uncommon): cache file missing or empty.
>>   @@ -268,14 +271,14 @@ def _readtagcache(ui, repo):
>>       # exposed".
>>       if not len(repo.file('.hgtags')):
>>           # No tags have ever been committed, so we can avoid a
>>           # potentially expensive search.
>> -        return (repoheads, cachefnode, None, True)
>> +        return (ourheads, cachefnode, None, True)
>>         starttime = time.time()
>>         newheads = [head
>> -                for head in repoheads
>> +                for head in allheads
>>                   if head not in set(cacheheads)]
>>         # Now we have to lookup the .hgtags filenode for every new head.
>>       # This is the most expensive part of finding tags, so performance
>> @@ -297,9 +300,9 @@ def _readtagcache(ui, repo):
>>              len(cachefnode), len(newheads), duration)
>>         # Caller has to iterate over all heads, but can use the filenodes
>> in
>>       # cachefnode to get to each .hgtags revision quickly.
>> -    return (repoheads, cachefnode, None, True)
>> +    return (ourheads, cachefnode, None, True)
>>     def _writetagcache(ui, repo, heads, tagfnode, cachetags):
>>         try:
>> @@ -309,29 +312,39 @@ def _writetagcache(ui, repo, heads, tagf
>>         ui.log('tagscache', 'writing tags cache file with %d heads and %d
>> tags\n',
>>               len(heads), len(cachetags))
>>   -    realheads = repo.heads()            # for sanity checks below
>> +    # We want to carry forward tagfnode entries that belong to filtered
>> revs,
>> +    # even if they aren't in the explicit list of heads. Since entries
>> in the
>> +    # cache must be in descending revlog order, we need to merge the sets
>> +    # before writing.
>> +    #
>> +    # When choosing what filenode entries to write, we must consider
>> both the
>> +    # filtered and unfiltered views. Otherwise, valid entries may be
>> dropped.
>> +    revs = {}
>> +    ourheads = set(repo.heads())
>> +    repo = repo.unfiltered()
>> +    unfilteredheads = set(repo.heads())
>> +    allheads = ourheads | unfilteredheads
>>
> /me confused: Why isn't unfilteredheads == allheads?
>

There are two flavors of DAG heads: unfiltered and filtered. If a true DAG
head is hidden, you'll need to record its parent, as it appears as a head
on filtered repos.

This is a case not explicitly tested by this patch series. This code may
even cause some heads to get dropped from the cache. However, that scenario
should be rare and will only result in loss of a handful of heads, not the
potentially hundreds that will be lost if all hidden DAG heads need
recomputed.
Ryan McElroy - Feb. 25, 2015, 9:25 p.m.
On 2/25/2015 1:08 PM, Gregory Szorc wrote:
>
>         + allheads = ourheads | unfilteredheads
>
>     /me confused: Why isn't unfilteredheads == allheads?
>
>
> There are two flavors of DAG heads: unfiltered and filtered. If a true 
> DAG head is hidden, you'll need to record its parent, as it appears as 
> a head on filtered repos.
>
> This is a case not explicitly tested by this patch series. This code 
> may even cause some heads to get dropped from the cache. However, that 
> scenario should be rare and will only result in loss of a handful of 
> heads, not the potentially hundreds that will be lost if all hidden 
> DAG heads need recomputed.

Ah, yes, that makes sense. Thanks for the clarification.
Pierre-Yves David - Feb. 27, 2015, 6:29 p.m.
On 02/24/2015 09:11 AM, Gregory Szorc wrote:
> # HG changeset patch
> # User Gregory Szorc<gregory.szorc@gmail.com>
> # Date 1424769075 28800
> #      Tue Feb 24 01:11:15 2015 -0800
> # Branch stable
> # Node ID 7f1904705c29ebe7de3874f2f03c42e261ed1c96
> # Parent  7d72752b8da5bb2482e6eac47545a78ed3fff592
> tags: preserve filtered .hgtags filenodes in tags cache (issue4550)

It is clear that we need some filter awareness for tags.

However, the level of filter impact what are heads and therefore what 
file version-s- of .hgtags are considered. Therefore it seems strange to 
me that you have been able to build a solution that take filter into 
account -after- the tags have been computed.

So either:
- I missed something in your patch which is quite complexe
- I missed something in my understanding of the issue
- There is something wrong with this patch.

Please find which of this options is true ;-)

Out of my head I would expect we need a "one cache file per filter" 
approach as we have for branch heads. We had a related discussion in 
Munich where you sugguested to cache different step of the operation so 
that we do not need to unpack manifest for all heads when computing 
tags. This could help to share computation between each filter level.

Cheers,
Gregory Szorc - Feb. 27, 2015, 7:11 p.m.
On Fri, Feb 27, 2015 at 10:29 AM, Pierre-Yves David <
pierre-yves.david@ens-lyon.org> wrote:

>
>
> On 02/24/2015 09:11 AM, Gregory Szorc wrote:
>
>> # HG changeset patch
>> # User Gregory Szorc<gregory.szorc@gmail.com>
>> # Date 1424769075 28800
>> #      Tue Feb 24 01:11:15 2015 -0800
>> # Branch stable
>> # Node ID 7f1904705c29ebe7de3874f2f03c42e261ed1c96
>> # Parent  7d72752b8da5bb2482e6eac47545a78ed3fff592
>> tags: preserve filtered .hgtags filenodes in tags cache (issue4550)
>>
>
> It is clear that we need some filter awareness for tags.
>
> However, the level of filter impact what are heads and therefore what file
> version-s- of .hgtags are considered. Therefore it seems strange to me that
> you have been able to build a solution that take filter into account
> -after- the tags have been computed.
>
> So either:
> - I missed something in your patch which is quite complexe
> - I missed something in my understanding of the issue
> - There is something wrong with this patch.
>
> Please find which of this options is true ;-)
>

The most important thing to understand about the tags cache is that it is
two caches in one. The first part is a mapping of head nodes to .hgtags
filenodes. The second part is a cache of the results of resolving tags from
the contents of all the .hgtags for a given set of heads (a view).

This (admittedly hacky) patch impacts just the first part, the filenode
cache. Instead of merely consulting the heads from the repo that was passed
in, we always expand context to include unfiltered heads as well. There
continues to exist multiple views of the resolved tags values, but only the
most recently-computed view is persisted in the cache. I believe this "just
works."



> Out of my head I would expect we need a "one cache file per filter"
> approach as we have for branch heads. We had a related discussion in Munich
> where you sugguested to cache different step of the operation so that we do
> not need to unpack manifest for all heads when computing tags. This could
> help to share computation between each filter level.
>

The tags cache needs overhauled. This is something I've long wanted to do
but can't seem to find the time to actually do.

IMO the tags cache first needs to be split. The .hgtags filenode mapping
needs to be in a separate file and we need to be less aggressive about
throwing away perfectly valid entries. The cache invalidation logic ensures
that once a head is stripped, expensive filenode resolution can occur. We
need to rely on the idempotence of the changeset to filenode value and
cache these mappings very aggressively.

As you said, we also probably want a separate tags cache per filter.

This can be on the table for 3.4.

However, 3.3 caused a regression in tags cache performance when
repositories have hidden commits. This is either due from transaction
changes or from the namespace API. Something is needed to restore old
performance or obsolescence users of large repositories are going to have a
very bad time. This patch is my best effort.

Does this explanation help?

Patch

diff --git a/mercurial/tags.py b/mercurial/tags.py
--- a/mercurial/tags.py
+++ b/mercurial/tags.py
@@ -246,12 +246,15 @@  def _readtagcache(ui, repo):
         return (None, None, tags, False)
     if cachefile:
         cachefile.close()               # ignore rest of file
 
-    repoheads = repo.heads()
+    ourheads = repo.heads()
+    repo = repo.unfiltered()
+    allheads = repo.heads()
+
     # Case 2 (uncommon): empty repo; get out quickly and don't bother
     # writing an empty cache.
-    if repoheads == [nullid]:
+    if allheads == [nullid]:
         return ([], {}, {}, False)
 
     # Case 3 (uncommon): cache file missing or empty.
 
@@ -268,14 +271,14 @@  def _readtagcache(ui, repo):
     # exposed".
     if not len(repo.file('.hgtags')):
         # No tags have ever been committed, so we can avoid a
         # potentially expensive search.
-        return (repoheads, cachefnode, None, True)
+        return (ourheads, cachefnode, None, True)
 
     starttime = time.time()
 
     newheads = [head
-                for head in repoheads
+                for head in allheads
                 if head not in set(cacheheads)]
 
     # Now we have to lookup the .hgtags filenode for every new head.
     # This is the most expensive part of finding tags, so performance
@@ -297,9 +300,9 @@  def _readtagcache(ui, repo):
            len(cachefnode), len(newheads), duration)
 
     # Caller has to iterate over all heads, but can use the filenodes in
     # cachefnode to get to each .hgtags revision quickly.
-    return (repoheads, cachefnode, None, True)
+    return (ourheads, cachefnode, None, True)
 
 def _writetagcache(ui, repo, heads, tagfnode, cachetags):
 
     try:
@@ -309,29 +312,39 @@  def _writetagcache(ui, repo, heads, tagf
 
     ui.log('tagscache', 'writing tags cache file with %d heads and %d tags\n',
             len(heads), len(cachetags))
 
-    realheads = repo.heads()            # for sanity checks below
+    # We want to carry forward tagfnode entries that belong to filtered revs,
+    # even if they aren't in the explicit list of heads. Since entries in the
+    # cache must be in descending revlog order, we need to merge the sets
+    # before writing.
+    #
+    # When choosing what filenode entries to write, we must consider both the
+    # filtered and unfiltered views. Otherwise, valid entries may be dropped.
+    revs = {}
+    ourheads = set(repo.heads())
+    repo = repo.unfiltered()
+    unfilteredheads = set(repo.heads())
+    allheads = ourheads | unfilteredheads
+    for head, fnode in tagfnode.items():
+        if head not in allheads:
+            continue
+
+        rev = repo.changelog.rev(head)
+        revs[rev] = '%s %s' % (hex(head), hex(fnode))
+
     for head in heads:
-        # temporary sanity checks; these can probably be removed
-        # once this code has been in crew for a few weeks
-        assert head in repo.changelog.nodemap, \
-               'trying to write non-existent node %s to tag cache' % short(head)
-        assert head in realheads, \
+        assert head in allheads, \
                'trying to write non-head %s to tag cache' % short(head)
         assert head != nullid, \
                'trying to write nullid to tag cache'
 
-        # This can't fail because of the first assert above.  When/if we
-        # remove that assert, we might want to catch LookupError here
-        # and downgrade it to a warning.
-        rev = repo.changelog.rev(head)
+        if head not in tagfnode:
+            rev = repo.changelog.rev(head)
+            revs[rev] = hex(head)
 
-        fnode = tagfnode.get(head)
-        if fnode:
-            cachefile.write('%d %s %s\n' % (rev, hex(head), hex(fnode)))
-        else:
-            cachefile.write('%d %s\n' % (rev, hex(head)))
+    for rev, line in sorted(revs.items(), reverse=True):
+        cachefile.write('%d %s\n' % (rev, line))
 
     # Tag names in the cache are in UTF-8 -- which is the whole reason
     # we keep them in UTF-8 throughout this module.  If we converted
     # them local encoding on input, we would lose info writing them to
diff --git a/tests/test-obsolete-tag-cache.t b/tests/test-obsolete-tag-cache.t
--- a/tests/test-obsolete-tag-cache.t
+++ b/tests/test-obsolete-tag-cache.t
@@ -62,8 +62,9 @@  repopulation
 .hgtags filenodes for hidden heads should be visible (issue4550)
 
   $ cat .hg/cache/tags
   7 eb610439e10e0c6b296f97b59624c2e24fc59e30 b3bce87817fe7ac9dca2834366c1d7534c095cf1
+  3 c3cb30f2d2cd0aae008cc91a07876e3c5131fd22 b3bce87817fe7ac9dca2834366c1d7534c095cf1
   
   55482a6fb4b1881fa8f746fd52cf6f096bb21c89 test1
   d75775ffbc6bca1794d300f5571272879bd280da test2