Patchwork [2,of,2,V3] treemanifest: make node reuse match flat manifest behavior

login
register
mail settings
Submitter Durham Goode
Date March 2, 2017, 12:37 a.m.
Message ID <f99c019063bea8a0b5f6.1488415055@dev111.prn1.facebook.com>
Download mbox | patch
Permalink /patch/18856/
State Changes Requested
Delegated to: Martin von Zweigbergk
Headers show

Comments

Durham Goode - March 2, 2017, 12:37 a.m.
# HG changeset patch
# User Durham Goode <durham@fb.com>
# Date 1488413981 28800
#      Wed Mar 01 16:19:41 2017 -0800
# Node ID f99c019063bea8a0b5f66c1e79dbd0b98b1326d8
# Parent  9c89185b63f574cad1d4009f6c9dd511a2cba986
treemanifest: make node reuse match flat manifest behavior

In a flat manifest, a node with the same content but different parents is still
considered a new node. In the current tree manifests however, if the content is
the same, we ignore the parents entirely and just reuse the existing node.

In our external treemanifest extension, we want to allow having one treemanifest
for every flat manifests, as a way of easeing the migration to treemanifests. To
make this possible, let's change the root node treemanifest behavior to match
the behavior for flat manifests, so we can have a 1:1 relationship.

While this sounds like a BC breakage, it's not actually a state users can
normally get in because: A) you can't make empty commits, and B) even if you try
to make an empty commit (by making a commit then amending it's changes away),
the higher level commit logic in localrepo.commitctx() forces the commit to use
the original p1 manifest node if no files were changed. So this would only
affect extensions and automation that reached passed the normal
localrepo.commit() logic straight into the manifest logic.
via Mercurial-devel - March 3, 2017, 6:35 p.m.
For now, I'm just adding back mercurial-devel that I accidentally
dropped in my reply (sorry, the default is different between Gmail and
Inbox, and I use both).

On Fri, Mar 3, 2017 at 10:05 AM, Durham Goode <durham@fb.com> wrote:
>
>
> On 3/3/17 9:25 AM, Martin von Zweigbergk wrote:
>>
>> On Wed, Mar 1, 2017 at 4:37 PM, Durham Goode <durham@fb.com> wrote:
>>>
>>> # HG changeset patch
>>> # User Durham Goode <durham@fb.com>
>>> # Date 1488413981 28800
>>> #      Wed Mar 01 16:19:41 2017 -0800
>>> # Node ID f99c019063bea8a0b5f66c1e79dbd0b98b1326d8
>>> # Parent  9c89185b63f574cad1d4009f6c9dd511a2cba986
>>> treemanifest: make node reuse match flat manifest behavior
>>>
>>> In a flat manifest, a node with the same content but different parents is
>>> still
>>> considered a new node. In the current tree manifests however, if the
>>> content is
>>> the same, we ignore the parents entirely and just reuse the existing
>>> node.
>>
>>
>> I think I copied that behavior from filelogs.
>
>
> filelog.add() calls straight to revlog.addrevision() which only exits early
> if the full p1+p2+data hash matches an existing full hash. In fact, it
> couldn't exit early for a content-only match because it would have to
> decompress the p1 content to even check. The only reason treemanifest could
> do it is because it happens to always have the p1 in memory and available
> for comparison.
>
>>>
>>> In our external treemanifest extension, we want to allow having one
>>> treemanifest
>>> for every flat manifests,
>>
>>
>> Nit: did you mean it the other way around? Because it sounds like you
>> can already have one treemanifest per flat manifest, but not one flat
>> manifest per treemanifest. But the next sentence says you want them
>> 1:1, so maybe add "and vice versa" to be clear, or "one unique
>> treemanifest per flat manifest".
>
>
> K, I can add the word unique to it.  When I said 'one x for every y' I meant
> len(x) == len(y), but maybe it's not so clear.
>
>>>  as a way of easeing the migration to treemanifests. To
>>> make this possible, let's change the root node treemanifest behavior to
>>> match
>>> the behavior for flat manifests, so we can have a 1:1 relationship.
>>
>>
>> Could you elaborate on what complications the current implementation
>> causes for you? It may not be necessary to put it in an updated commit
>> message, but I'd like to understand the reason better.
>
>
> As part of our migration to treemanifest, I'm backfilling all the trees for
> our entire repo, and I'm giving the root node the original flat-manifest
> hash, so eventually we can even delete the flat manifest and our old commits
> will still work. To make this possible, I need to have 1 unique treemanifest
> for every flat manifest, so all the old manifest hashes can still be
> referenced.
>
>> It's a little strange to me to use a different condition for the root
>> manifest. If we think that we should always emit a new revision if
>> either content or parents have changed, then maybe we should do that
>> for all directories. If we don't, it seems like we should not do it
>> for the root manifest either. In that case, I'd be happier to extract
>> a function or two that makes it easier for you guys to override it
>> internally instead.
>
>
> One difference between the root manifest and the sub-trees is that
> manifest.py does not control the p1/p2 for the root, but it does control the
> p1/p2 for the sub-trees.  If manifest.py controlled p1/p2 for the root, we
> could be smart and not produce a new tree when the content matches p1 or p2.
> But because an external caller provides p1/p2, if manifest.add(p1, p2,
> content) returns a node that doesn't actually have p1 and p2 as the parents
> (which is what can happen today), then we're not actually respecting what
> the caller asked us to do. That's why we need to respect it for the root,
> but not necessarily for the sub-trees.
>
> Usually Mercurial is good about not calling manifest.add(p1, p2, content)
> when the content matches p1, which is why we don't see a ton of empty
> manifest deltas in the repo. This does become an issue for merge commits
> though, where the code doesn't just want to throw out the p2 information so
> it calls mf.add(p1, p2, same_content), which is how we found the issue in
> the first place.
>
> Technically we could apply this behavior to every sub-tree as well, but that
> introduces complications. For instance, I think it'd mean we have to create
> a new empty sub-tree for every commit, since for every sub-tree the content
> is the same but the parents are not.
>
>>>
>>> While this sounds like a BC breakage, it's not actually a state users can
>>> normally get in because: A) you can't make empty commits, and B) even if
>>> you try
>>> to make an empty commit (by making a commit then amending it's changes
>>> away),
>>> the higher level commit logic in localrepo.commitctx() forces the commit
>>> to use
>>> the original p1 manifest node if no files were changed. So this would
>>> only
>>> affect extensions and automation that reached passed the normal
>>> localrepo.commit() logic straight into the manifest logic.
>>
>>
>> So what does the included test case do? There is no extension involved
>> there.
>
>
> It just verifies the existing behavior. I wrote it hoping it'd expose the
> bug, but it just worked, which led me notice the localrepo.commitctx logic.
>
>
>>
>>>
>>> diff --git a/mercurial/manifest.py b/mercurial/manifest.py
>>> --- a/mercurial/manifest.py
>>> +++ b/mercurial/manifest.py
>>> @@ -1233,7 +1233,7 @@ class manifestrevlog(revlog.revlog):
>>>      def _addtree(self, m, transaction, link, m1, m2, readtree):
>>>          # If the manifest is unchanged compared to one parent,
>>>          # don't write a new revision
>>> -        if m.unmodifiedsince(m1) or m.unmodifiedsince(m2):
>>> +        if self._dir != '' and (m.unmodifiedsince(m1) or
>>> m.unmodifiedsince(m2)):
>>>              return m.node()
>>>          def writesubtree(subm, subp1, subp2):
>>>              sublog = self.dirlog(subm.dir())
>>> @@ -1241,13 +1241,17 @@ class manifestrevlog(revlog.revlog):
>>>                         readtree=readtree)
>>>          m.writesubtrees(m1, m2, writesubtree)
>>>          text = m.dirtext(self._usemanifestv2)
>>> -        # Double-check whether contents are unchanged to one parent
>>> -        if text == m1.dirtext(self._usemanifestv2):
>>> -            n = m1.node()
>>> -        elif text == m2.dirtext(self._usemanifestv2):
>>> -            n = m2.node()
>>> -        else:
>>> +        n = None
>>> +        if self._dir != '':
>>> +            # Double-check whether contents are unchanged to one parent
>>> +            if text == m1.dirtext(self._usemanifestv2):
>>> +                n = m1.node()
>>> +            elif text == m2.dirtext(self._usemanifestv2):
>>> +                n = m2.node()
>>> +
>>> +        if not n:
>>>              n = self.addrevision(text, transaction, link, m1.node(),
>>> m2.node())
>>> +
>>>          # Save nodeid so parent manifest can calculate its nodeid
>>>          m.setnode(n)
>>>          return n
>>> diff --git a/tests/test-treemanifest.t b/tests/test-treemanifest.t
>>> --- a/tests/test-treemanifest.t
>>> +++ b/tests/test-treemanifest.t
>>> @@ -825,3 +825,13 @@ other branch
>>>    added 1 changesets with 1 changes to 1 files (+1 heads)
>>>    (run 'hg heads' to see heads, 'hg merge' to merge)
>>>
>>> +Committing a empty commit does not duplicate root treemanifest
>>> +  $ echo z >> z
>>> +  $ hg commit -Aqm 'pre-empty commit'
>>> +  $ hg rm z
>>> +  $ hg commit --amend -m 'empty commit'
>>> +  saved backup bundle to
>>> $TESTTMP/grafted-dir-repo-clone/.hg/strip-backup/cb99d5717cea-de37743b-amend-backup.hg
>>> (glob)
>>
>>
>> I slightly simpler way of creating a commit with not manifest changes:
>>
>> $ hg branch foo
>> $ hg ci -m 'new branch'
>>
>>> +  $ hg log -r 'tip + tip^' -T '{manifest}\n'
>>> +  1:678d3574b88c
>>> +  1:678d3574b88c
>>> +  $ hg --config extensions.strip= strip -r . -q
via Mercurial-devel - March 11, 2017, 12:04 a.m.
On Fri, Mar 3, 2017 at 10:35 AM, Martin von Zweigbergk
<martinvonz@google.com> wrote:
> For now, I'm just adding back mercurial-devel that I accidentally
> dropped in my reply (sorry, the default is different between Gmail and
> Inbox, and I use both).
>
> On Fri, Mar 3, 2017 at 10:05 AM, Durham Goode <durham@fb.com> wrote:
>>
>>
>> On 3/3/17 9:25 AM, Martin von Zweigbergk wrote:
>>>
>>> On Wed, Mar 1, 2017 at 4:37 PM, Durham Goode <durham@fb.com> wrote:
>>>>
>>>> # HG changeset patch
>>>> # User Durham Goode <durham@fb.com>
>>>> # Date 1488413981 28800
>>>> #      Wed Mar 01 16:19:41 2017 -0800
>>>> # Node ID f99c019063bea8a0b5f66c1e79dbd0b98b1326d8
>>>> # Parent  9c89185b63f574cad1d4009f6c9dd511a2cba986
>>>> treemanifest: make node reuse match flat manifest behavior

Durham and I looked into this a bit more and compared it to how
filelogs and manifests behave. I do wonder if it wouldn't be more
consistent also create new manifest revisions for subdirectories that
have changed on at least one side of the merge, but I also don't know
that it matters. We can do that later. So, I've queued this patch,
thanks.

>>>>
>>>> In a flat manifest, a node with the same content but different parents is
>>>> still
>>>> considered a new node. In the current tree manifests however, if the
>>>> content is
>>>> the same, we ignore the parents entirely and just reuse the existing
>>>> node.
>>>
>>>
>>> I think I copied that behavior from filelogs.
>>
>>
>> filelog.add() calls straight to revlog.addrevision() which only exits early
>> if the full p1+p2+data hash matches an existing full hash. In fact, it
>> couldn't exit early for a content-only match because it would have to
>> decompress the p1 content to even check. The only reason treemanifest could
>> do it is because it happens to always have the p1 in memory and available
>> for comparison.
>>
>>>>
>>>> In our external treemanifest extension, we want to allow having one
>>>> treemanifest
>>>> for every flat manifests,
>>>
>>>
>>> Nit: did you mean it the other way around? Because it sounds like you
>>> can already have one treemanifest per flat manifest, but not one flat
>>> manifest per treemanifest. But the next sentence says you want them
>>> 1:1, so maybe add "and vice versa" to be clear, or "one unique
>>> treemanifest per flat manifest".
>>
>>
>> K, I can add the word unique to it.  When I said 'one x for every y' I meant
>> len(x) == len(y), but maybe it's not so clear.
>>
>>>>  as a way of easeing the migration to treemanifests. To
>>>> make this possible, let's change the root node treemanifest behavior to
>>>> match
>>>> the behavior for flat manifests, so we can have a 1:1 relationship.
>>>
>>>
>>> Could you elaborate on what complications the current implementation
>>> causes for you? It may not be necessary to put it in an updated commit
>>> message, but I'd like to understand the reason better.
>>
>>
>> As part of our migration to treemanifest, I'm backfilling all the trees for
>> our entire repo, and I'm giving the root node the original flat-manifest
>> hash, so eventually we can even delete the flat manifest and our old commits
>> will still work. To make this possible, I need to have 1 unique treemanifest
>> for every flat manifest, so all the old manifest hashes can still be
>> referenced.
>>
>>> It's a little strange to me to use a different condition for the root
>>> manifest. If we think that we should always emit a new revision if
>>> either content or parents have changed, then maybe we should do that
>>> for all directories. If we don't, it seems like we should not do it
>>> for the root manifest either. In that case, I'd be happier to extract
>>> a function or two that makes it easier for you guys to override it
>>> internally instead.
>>
>>
>> One difference between the root manifest and the sub-trees is that
>> manifest.py does not control the p1/p2 for the root, but it does control the
>> p1/p2 for the sub-trees.  If manifest.py controlled p1/p2 for the root, we
>> could be smart and not produce a new tree when the content matches p1 or p2.
>> But because an external caller provides p1/p2, if manifest.add(p1, p2,
>> content) returns a node that doesn't actually have p1 and p2 as the parents
>> (which is what can happen today), then we're not actually respecting what
>> the caller asked us to do. That's why we need to respect it for the root,
>> but not necessarily for the sub-trees.
>>
>> Usually Mercurial is good about not calling manifest.add(p1, p2, content)
>> when the content matches p1, which is why we don't see a ton of empty
>> manifest deltas in the repo. This does become an issue for merge commits
>> though, where the code doesn't just want to throw out the p2 information so
>> it calls mf.add(p1, p2, same_content), which is how we found the issue in
>> the first place.
>>
>> Technically we could apply this behavior to every sub-tree as well, but that
>> introduces complications. For instance, I think it'd mean we have to create
>> a new empty sub-tree for every commit, since for every sub-tree the content
>> is the same but the parents are not.
>>
>>>>
>>>> While this sounds like a BC breakage, it's not actually a state users can
>>>> normally get in because: A) you can't make empty commits, and B) even if
>>>> you try
>>>> to make an empty commit (by making a commit then amending it's changes
>>>> away),
>>>> the higher level commit logic in localrepo.commitctx() forces the commit
>>>> to use
>>>> the original p1 manifest node if no files were changed. So this would
>>>> only
>>>> affect extensions and automation that reached passed the normal
>>>> localrepo.commit() logic straight into the manifest logic.
>>>
>>>
>>> So what does the included test case do? There is no extension involved
>>> there.
>>
>>
>> It just verifies the existing behavior. I wrote it hoping it'd expose the
>> bug, but it just worked, which led me notice the localrepo.commitctx logic.
>>
>>
>>>
>>>>
>>>> diff --git a/mercurial/manifest.py b/mercurial/manifest.py
>>>> --- a/mercurial/manifest.py
>>>> +++ b/mercurial/manifest.py
>>>> @@ -1233,7 +1233,7 @@ class manifestrevlog(revlog.revlog):
>>>>      def _addtree(self, m, transaction, link, m1, m2, readtree):
>>>>          # If the manifest is unchanged compared to one parent,
>>>>          # don't write a new revision
>>>> -        if m.unmodifiedsince(m1) or m.unmodifiedsince(m2):
>>>> +        if self._dir != '' and (m.unmodifiedsince(m1) or
>>>> m.unmodifiedsince(m2)):
>>>>              return m.node()
>>>>          def writesubtree(subm, subp1, subp2):
>>>>              sublog = self.dirlog(subm.dir())
>>>> @@ -1241,13 +1241,17 @@ class manifestrevlog(revlog.revlog):
>>>>                         readtree=readtree)
>>>>          m.writesubtrees(m1, m2, writesubtree)
>>>>          text = m.dirtext(self._usemanifestv2)
>>>> -        # Double-check whether contents are unchanged to one parent
>>>> -        if text == m1.dirtext(self._usemanifestv2):
>>>> -            n = m1.node()
>>>> -        elif text == m2.dirtext(self._usemanifestv2):
>>>> -            n = m2.node()
>>>> -        else:
>>>> +        n = None
>>>> +        if self._dir != '':
>>>> +            # Double-check whether contents are unchanged to one parent
>>>> +            if text == m1.dirtext(self._usemanifestv2):
>>>> +                n = m1.node()
>>>> +            elif text == m2.dirtext(self._usemanifestv2):
>>>> +                n = m2.node()
>>>> +
>>>> +        if not n:
>>>>              n = self.addrevision(text, transaction, link, m1.node(),
>>>> m2.node())
>>>> +
>>>>          # Save nodeid so parent manifest can calculate its nodeid
>>>>          m.setnode(n)
>>>>          return n
>>>> diff --git a/tests/test-treemanifest.t b/tests/test-treemanifest.t
>>>> --- a/tests/test-treemanifest.t
>>>> +++ b/tests/test-treemanifest.t
>>>> @@ -825,3 +825,13 @@ other branch
>>>>    added 1 changesets with 1 changes to 1 files (+1 heads)
>>>>    (run 'hg heads' to see heads, 'hg merge' to merge)
>>>>
>>>> +Committing a empty commit does not duplicate root treemanifest
>>>> +  $ echo z >> z
>>>> +  $ hg commit -Aqm 'pre-empty commit'
>>>> +  $ hg rm z
>>>> +  $ hg commit --amend -m 'empty commit'
>>>> +  saved backup bundle to
>>>> $TESTTMP/grafted-dir-repo-clone/.hg/strip-backup/cb99d5717cea-de37743b-amend-backup.hg
>>>> (glob)
>>>
>>>
>>> I slightly simpler way of creating a commit with not manifest changes:
>>>
>>> $ hg branch foo
>>> $ hg ci -m 'new branch'
>>>
>>>> +  $ hg log -r 'tip + tip^' -T '{manifest}\n'
>>>> +  1:678d3574b88c
>>>> +  1:678d3574b88c
>>>> +  $ hg --config extensions.strip= strip -r . -q

Patch

diff --git a/mercurial/manifest.py b/mercurial/manifest.py
--- a/mercurial/manifest.py
+++ b/mercurial/manifest.py
@@ -1233,7 +1233,7 @@  class manifestrevlog(revlog.revlog):
     def _addtree(self, m, transaction, link, m1, m2, readtree):
         # If the manifest is unchanged compared to one parent,
         # don't write a new revision
-        if m.unmodifiedsince(m1) or m.unmodifiedsince(m2):
+        if self._dir != '' and (m.unmodifiedsince(m1) or m.unmodifiedsince(m2)):
             return m.node()
         def writesubtree(subm, subp1, subp2):
             sublog = self.dirlog(subm.dir())
@@ -1241,13 +1241,17 @@  class manifestrevlog(revlog.revlog):
                        readtree=readtree)
         m.writesubtrees(m1, m2, writesubtree)
         text = m.dirtext(self._usemanifestv2)
-        # Double-check whether contents are unchanged to one parent
-        if text == m1.dirtext(self._usemanifestv2):
-            n = m1.node()
-        elif text == m2.dirtext(self._usemanifestv2):
-            n = m2.node()
-        else:
+        n = None
+        if self._dir != '':
+            # Double-check whether contents are unchanged to one parent
+            if text == m1.dirtext(self._usemanifestv2):
+                n = m1.node()
+            elif text == m2.dirtext(self._usemanifestv2):
+                n = m2.node()
+
+        if not n:
             n = self.addrevision(text, transaction, link, m1.node(), m2.node())
+
         # Save nodeid so parent manifest can calculate its nodeid
         m.setnode(n)
         return n
diff --git a/tests/test-treemanifest.t b/tests/test-treemanifest.t
--- a/tests/test-treemanifest.t
+++ b/tests/test-treemanifest.t
@@ -825,3 +825,13 @@  other branch
   added 1 changesets with 1 changes to 1 files (+1 heads)
   (run 'hg heads' to see heads, 'hg merge' to merge)
 
+Committing a empty commit does not duplicate root treemanifest
+  $ echo z >> z
+  $ hg commit -Aqm 'pre-empty commit'
+  $ hg rm z
+  $ hg commit --amend -m 'empty commit'
+  saved backup bundle to $TESTTMP/grafted-dir-repo-clone/.hg/strip-backup/cb99d5717cea-de37743b-amend-backup.hg (glob)
+  $ hg log -r 'tip + tip^' -T '{manifest}\n'
+  1:678d3574b88c
+  1:678d3574b88c
+  $ hg --config extensions.strip= strip -r . -q