Submitter | phabricator |
---|---|
Date | March 5, 2020, 6:15 p.m. |
Message ID | <differential-rev-PHID-DREV-pthbwawwnh3jgo5lznje-req@mercurial-scm.org> |
Download | mbox | patch |
Permalink | /patch/45526/ |
State | Superseded |
Headers | show |
Comments
martinvonz added a comment. > In 99ebde4fec99 <https://phab.mercurial-scm.org/rHG99ebde4fec9967c2e80f85702129abf874eefe74>, we changed the list of files stored into the files field. > This lead to the changeset centric copy algorithm to break in various merge > situation involving merge. Could you explain why it broke? It's hard to review this patch without really knowing what the problem or the solution is. > The new implementation focus on correctness. Performance improvement will come > later. How much slower is it? Could you run some of those benchmarks you have run on previous patches touching this code? How do you plan to improve it? INLINE COMMENTS > copies.py:277 > cl = repo.changelog > + isancestor = cl.isancestorrev # XXX we should had chaching to this. > missingrevs = cl.findmissingrevs(common=[a.rev()], heads=[b.rev()]) s/had/add/? > test-copies-chain-merge.t:743-744 > + d (filelog !) > + b (compatibility !) > + b (sidedata !) > R b nit: combine into `(no-filelog !)`? REPOSITORY rHG Mercurial CHANGES SINCE LAST ACTION https://phab.mercurial-scm.org/D8244/new/ REVISION DETAIL https://phab.mercurial-scm.org/D8244 To: marmoute, #hg-reviewers Cc: martinvonz, mercurial-devel
marmoute added a comment. In D8244#122806 <https://phab.mercurial-scm.org/D8244#122806>, @martinvonz wrote: >> In 99ebde4fec99 <https://phab.mercurial-scm.org/rHG99ebde4fec9967c2e80f85702129abf874eefe74>, we changed the list of files stored into the files field. >> This lead to the changeset centric copy algorithm to break in various merge >> situation involving merge. > > Could you explain why it broke? It's hard to review this patch without really knowing what the problem or the solution is. Outdated information from p1 could overwrite newer information from p2. >> The new implementation focus on correctness. Performance improvement will come >> later. > > How much slower is it? Could you run some of those benchmarks you have run on previous patches touching this code? How do you plan to improve it? There are two new calls that might degrade performance. This `isancestor` call that is easy to cache in memory. And the "ismerged" logic that is easy to cache on disk with the rest of the copy related information (the case is rare). There is some win to have in python, but the main win will be the move the Rust algorithm (that need to be updated with the new logic). Moving to rust give a very large performance boost on the slow cases (usually over 10x, sometime 100x IIRC). That is the one I care about. REPOSITORY rHG Mercurial CHANGES SINCE LAST ACTION https://phab.mercurial-scm.org/D8244/new/ REVISION DETAIL https://phab.mercurial-scm.org/D8244 To: marmoute, #hg-reviewers Cc: martinvonz, mercurial-devel
martinvonz added a comment. In D8244#122821 <https://phab.mercurial-scm.org/D8244#122821>, @marmoute wrote: > In D8244#122806 <https://phab.mercurial-scm.org/D8244#122806>, @martinvonz wrote: > >>> In 99ebde4fec99 <https://phab.mercurial-scm.org/rHG99ebde4fec9967c2e80f85702129abf874eefe74>, we changed the list of files stored into the files field. >>> This lead to the changeset centric copy algorithm to break in various merge >>> situation involving merge. >> >> Could you explain why it broke? It's hard to review this patch without really knowing what the problem or the solution is. > > Outdated information from p1 could overwrite newer information from p2. > >>> The new implementation focus on correctness. Performance improvement will come >>> later. >> >> How much slower is it? Could you run some of those benchmarks you have run on previous patches touching this code? How do you plan to improve it? > > There are two new calls that might degrade performance. This `isancestor` call that is easy to cache in memory. And the "ismerged" logic that is easy to cache on disk with the rest of the copy related information (the case is rare). > There is some win to have in python, but the main win will be the move the Rust algorithm (that need to be updated with the new logic). Moving to rust give a very large performance boost on the slow cases (usually over 10x, sometime 100x IIRC). That is the one I care about. I'll make the performance impact more concrete myself. I picked two quite arbitrary tags in the mozilla-unified repo and this is what I saw: Before this patch: $ python3 ~/hg/hg perfpathcopies FIREFOX_BETA_44_END FIREFOX_BETA_54_END ! wall 5.279230 comb 5.270000 user 5.250000 sys 0.020000 (best of 3) After this patch: $ python3 ~/hg/hg perfpathcopies FIREFOX_BETA_44_END FIREFOX_BETA_54_END ! wall 8.277523 comb 8.280000 user 8.170000 sys 0.110000 (best of 3) Could you share some more benchmarking data? I know you had a set of commits that you've used before (and that you've asked me to use for benchmarking my patches to `copies.py` against). It's quite a significant slowdown for the case I tested above, but I'm fine with it since it fixes a bug. I'd just like to see how it behaves in other cases. INLINE COMMENTS > martinvonz wrote in copies.py:277 > s/had/add/? Not actually done, it seems, but not a big deal anyway. REPOSITORY rHG Mercurial CHANGES SINCE LAST ACTION https://phab.mercurial-scm.org/D8244/new/ REVISION DETAIL https://phab.mercurial-scm.org/D8244 To: marmoute, #hg-reviewers Cc: martinvonz, mercurial-devel
marmoute added a comment. In D8244#123560 <https://phab.mercurial-scm.org/D8244#123560>, @martinvonz wrote: > In D8244#122821 <https://phab.mercurial-scm.org/D8244#122821>, @marmoute wrote: > >> In D8244#122806 <https://phab.mercurial-scm.org/D8244#122806>, @martinvonz wrote: >> >>>> In 99ebde4fec99 <https://phab.mercurial-scm.org/rHG99ebde4fec9967c2e80f85702129abf874eefe74>, we changed the list of files stored into the files field. >>>> This lead to the changeset centric copy algorithm to break in various merge >>>> situation involving merge. >>> >>> Could you explain why it broke? It's hard to review this patch without really knowing what the problem or the solution is. >> >> Outdated information from p1 could overwrite newer information from p2. >> >>>> The new implementation focus on correctness. Performance improvement will come >>>> later. >>> >>> How much slower is it? Could you run some of those benchmarks you have run on previous patches touching this code? How do you plan to improve it? >> >> There are two new calls that might degrade performance. This `isancestor` call that is easy to cache in memory. And the "ismerged" logic that is easy to cache on disk with the rest of the copy related information (the case is rare). >> There is some win to have in python, but the main win will be the move the Rust algorithm (that need to be updated with the new logic). Moving to rust give a very large performance boost on the slow cases (usually over 10x, sometime 100x IIRC). That is the one I care about. > > I'll make the performance impact more concrete myself. I picked two quite arbitrary tags in the mozilla-unified repo and this is what I saw: > Before this patch: > > $ python3 ~/hg/hg perfpathcopies FIREFOX_BETA_44_END FIREFOX_BETA_54_END > ! wall 5.279230 comb 5.270000 user 5.250000 sys 0.020000 (best of 3) > > After this patch: > > $ python3 ~/hg/hg perfpathcopies FIREFOX_BETA_44_END FIREFOX_BETA_54_END > ! wall 8.277523 comb 8.280000 user 8.170000 sys 0.110000 (best of 3) > > Could you share some more benchmarking data? I know you had a set of commits that you've used before (and that you've asked me to use for benchmarking my patches to `copies.py` against). It's quite a significant slowdown for the case I tested above, but I'm fine with it since it fixes a bug. I'd just like to see how it behaves in other cases. We now have some automatic benchmark setup for copies tracing, however we don't have any reference repositories with the necessary data for changeset centric copy tracing. Building a reference takes some manual operation and a lots of CPU time. So I am planning to build some once the format is more finalized. I am not too worried about the current performance number because there are multiple easy optimization. In addition the rust version of the previous algorithm proved massively more efficient, so I have good hope for this one too. The existing reference can still be used to gather various useful pairs of revision to run manual benchmark on. You can see them using the following command in a setup scmperf repo. $ grep -A 3 copies repos/*.benchrepo To setup an scmperrepo, you can use: $ hg clone https://foss.heptapod.net/mercurial/scmperf $ cd scmperf $ ./script/setup-repos default.repos REPOSITORY rHG Mercurial CHANGES SINCE LAST ACTION https://phab.mercurial-scm.org/D8244/new/ REVISION DETAIL https://phab.mercurial-scm.org/D8244 To: marmoute, #hg-reviewers Cc: martinvonz, mercurial-devel
marmoute added a comment. Gentle ping on this. This is still ready for review. REPOSITORY rHG Mercurial CHANGES SINCE LAST ACTION https://phab.mercurial-scm.org/D8244/new/ REVISION DETAIL https://phab.mercurial-scm.org/D8244 To: marmoute, #hg-reviewers Cc: mercurial-patches, martinvonz, mercurial-devel
martinvonz added a comment.
In D8244#126461 <https://phab.mercurial-scm.org/D8244#126461>, @marmoute wrote:
> Gentle ping on this. This is still ready for review.
I'm still a bit worried about the performance regression and I was hoping to get more information about that. Now is the beginning of the next cycle, so I guess it's a good time to queue this now and test it on Google users. I'll review this soon.
REPOSITORY
rHG Mercurial
CHANGES SINCE LAST ACTION
https://phab.mercurial-scm.org/D8244/new/
REVISION DETAIL
https://phab.mercurial-scm.org/D8244
To: marmoute, #hg-reviewers
Cc: mercurial-patches, martinvonz, mercurial-devel
marmoute added a comment. In D8244#126503 <https://phab.mercurial-scm.org/D8244#126503>, @martinvonz wrote: > In D8244#126461 <https://phab.mercurial-scm.org/D8244#126461>, @marmoute wrote: > >> Gentle ping on this. This is still ready for review. > > I'm still a bit worried about the performance regression and I was hoping to get more information about that. Now is the beginning of the next cycle, so I guess it's a good time to queue this now and test it on Google users. I'll review this soon. I started poking at the performance and [spoiler] and this will be fine :-) Do not hold your breath, because there is a long road between "prototype to get numbers" and "clean implementation". If you can queue this, I would appreciate building on a more solid ground. For example, on the worst pypy case I have been playing with: Before the fix: ! wall 1.765737 comb 1.760000 user 1.690000 sys 0.070000 (median of 6) After the fix: ! wall 29.983194 comb 29.880000 user 29.780000 sys 0.100000 (median of 3) With curent state of speedup: ! wall 2.135641 comb 2.140000 user 2.120000 sys 0.020000 (median of 5) REPOSITORY rHG Mercurial CHANGES SINCE LAST ACTION https://phab.mercurial-scm.org/D8244/new/ REVISION DETAIL https://phab.mercurial-scm.org/D8244 To: marmoute, #hg-reviewers Cc: mercurial-patches, martinvonz, mercurial-devel
Patch
diff --git a/tests/test-copies-chain-merge.t b/tests/test-copies-chain-merge.t --- a/tests/test-copies-chain-merge.t +++ b/tests/test-copies-chain-merge.t @@ -1,3 +1,5 @@ +#testcases filelog compatibility sidedata + ===================================================== Test Copy tracing for chain of copies involving merge ===================================================== @@ -6,6 +8,7 @@ are involved. It cheks we do not have unwanted update of behavior and that the different options to retrieve copies behave correctly. + Setup ===== @@ -18,6 +21,22 @@ > logtemplate={rev} {desc}]\n > EOF +#if compatibility + $ cat >> $HGRCPATH << EOF + > [experimental] + > copies.read-from = compatibility + > EOF +#endif + +#if sidedata + $ cat >> $HGRCPATH << EOF + > [format] + > exp-use-side-data = yes + > exp-use-copies-side-data-changeset = yes + > EOF +#endif + + $ hg init repo-chain $ cd repo-chain @@ -672,17 +691,26 @@ 0 4 0dd616bc7ab1 000000000000 000000000000 1 18 6da5a2eecb9c 000000000000 000000000000 2 19 eb806e34ef6b 0dd616bc7ab1 6da5a2eecb9c + +# Here the filelog based implementation is not looking at the rename +# information (because the file exist on both side). However the changelog +# based on works fine. We have different output. + $ hg status --copies --rev 'desc("a-2")' --rev 'desc("mAEm-0")' M f + b (no-filelog !) R b $ hg status --copies --rev 'desc("a-2")' --rev 'desc("mEAm-0")' M f + b (no-filelog !) R b $ hg status --copies --rev 'desc("e-2")' --rev 'desc("mAEm-0")' M f + d (no-filelog !) R d $ hg status --copies --rev 'desc("e-2")' --rev 'desc("mEAm-0")' M f + d (no-filelog !) R d $ hg status --copies --rev 'desc("i-2")' --rev 'desc("a-2")' A f @@ -692,6 +720,18 @@ A f b R b + +# From here, we run status against revision where both source file exists. +# +# The filelog based implementation picks an arbitrary side based on revision +# numbers. So the same side "wins" whatever the parents order is. This is +# sub-optimal because depending on revision numbers means the result can be +# different from one repository to the next. +# +# The changeset based algorithm use the parent order to break tie on conflicting +# information and will have a different order depending on who is p1 and p2. +# That order is stable accross repositories. (data from p1 prevails) + $ hg status --copies --rev 'desc("i-2")' --rev 'desc("mAEm-0")' A f d @@ -699,7 +739,9 @@ R d $ hg status --copies --rev 'desc("i-2")' --rev 'desc("mEAm-0")' A f - d + d (filelog !) + b (compatibility !) + b (sidedata !) R b R d $ hg status --copies --rev 'desc("i-0")' --rev 'desc("mAEm-0")' @@ -709,7 +751,8 @@ R b $ hg status --copies --rev 'desc("i-0")' --rev 'desc("mEAm-0")' A f - a + a (filelog !) + b (no-filelog !) R a R b @@ -733,21 +776,25 @@ R h $ hg status --copies --rev 'desc("b-1")' --rev 'desc("mBFm-0")' M d + h (no-filelog !) R h $ hg status --copies --rev 'desc("f-2")' --rev 'desc("mBFm-0")' M b $ hg status --copies --rev 'desc("f-1")' --rev 'desc("mBFm-0")' M b M d + i (no-filelog !) R i $ hg status --copies --rev 'desc("b-1")' --rev 'desc("mFBm-0")' M d + h (no-filelog !) R h $ hg status --copies --rev 'desc("f-2")' --rev 'desc("mFBm-0")' M b $ hg status --copies --rev 'desc("f-1")' --rev 'desc("mFBm-0")' M b M d + i (no-filelog !) R i The following graphlog is wrong, the "a -> c -> d" chain was overwritten and should not appear. @@ -777,9 +824,15 @@ Unlike in the 'BD/DB' cases, an actuall merge happened here. So we should consider history and rename on both branch of the merge. +One side of the merge have a long history with rename. The other side of the +merge point to a new file with a smaller history. Each side is "valid". + +(and again the filelog based algorithm only explore one, with a pick based on +revision numbers) + $ hg status --copies --rev 'desc("i-0")' --rev 'desc("mDGm-0")' A d - a + a (filelog !) R a $ hg status --copies --rev 'desc("i-0")' --rev 'desc("mGDm-0")' A d @@ -840,7 +893,8 @@ $ hg status --copies --rev 'desc("i-0")' --rev 'desc("mFGm-0")' A d - a + h (no-filelog !) + a (filelog !) R a R h $ hg status --copies --rev 'desc("i-0")' --rev 'desc("mGFm-0")' @@ -854,15 +908,19 @@ M d $ hg status --copies --rev 'desc("f-1")' --rev 'desc("mFGm-0")' M d + i (no-filelog !) R i $ hg status --copies --rev 'desc("f-1")' --rev 'desc("mGFm-0")' M d + i (no-filelog !) R i $ hg status --copies --rev 'desc("g-1")' --rev 'desc("mFGm-0")' M d + h (no-filelog !) R h $ hg status --copies --rev 'desc("g-1")' --rev 'desc("mGFm-0")' M d + h (no-filelog !) R h $ hg log -Gfr 'desc("mFGm-0")' d diff --git a/mercurial/copies.py b/mercurial/copies.py --- a/mercurial/copies.py +++ b/mercurial/copies.py @@ -183,10 +183,27 @@ * p1copies: mapping of copies from p1 * p2copies: mapping of copies from p2 * removed: a list of removed files + * ismerged: a callback to know if file was merged in that revision """ cl = repo.changelog parents = cl.parentrevs + def get_ismerged(rev): + ctx = repo[rev] + + def ismerged(path): + if path not in ctx.files(): + return False + fctx = ctx[path] + parents = fctx._filelog.parents(fctx._filenode) + nb_parents = 0 + for n in parents: + if n != node.nullid: + nb_parents += 1 + return nb_parents >= 2 + + return ismerged + if repo.filecopiesmode == b'changeset-sidedata': changelogrevision = cl.changelogrevision flags = cl.flags @@ -233,7 +250,8 @@ p1copies = {} p2copies = {} removed = [] - return p1, p2, p1copies, p2copies, removed + + return p1, p2, p1copies, p2copies, removed, get_ismerged(rev) else: @@ -242,7 +260,7 @@ ctx = repo[rev] p1copies, p2copies = ctx._copies removed = ctx.filesremoved() - return p1, p2, p1copies, p2copies, removed + return p1, p2, p1copies, p2copies, removed, get_ismerged(rev) return revinfo @@ -256,6 +274,7 @@ revinfo = _revinfogetter(repo) cl = repo.changelog + isancestor = cl.isancestorrev # XXX we should had chaching to this. missingrevs = cl.findmissingrevs(common=[a.rev()], heads=[b.rev()]) mrset = set(missingrevs) roots = set() @@ -283,10 +302,14 @@ iterrevs.update(roots) iterrevs.remove(b.rev()) revs = sorted(iterrevs) - return _combinechangesetcopies(revs, children, b.rev(), revinfo, match) + return _combinechangesetcopies( + revs, children, b.rev(), revinfo, match, isancestor + ) -def _combinechangesetcopies(revs, children, targetrev, revinfo, match): +def _combinechangesetcopies( + revs, children, targetrev, revinfo, match, isancestor +): """combine the copies information for each item of iterrevs revs: sorted iterable of revision to visit @@ -305,7 +328,7 @@ # this is a root copies = {} for i, c in enumerate(children[r]): - p1, p2, p1copies, p2copies, removed = revinfo(c) + p1, p2, p1copies, p2copies, removed, ismerged = revinfo(c) if r == p1: parent = 1 childcopies = p1copies @@ -319,9 +342,12 @@ } newcopies = copies if childcopies: - newcopies = _chain(newcopies, childcopies) - # _chain makes a copies, we can avoid doing so in some - # simple/linear cases. + newcopies = copies.copy() + for dest, source in pycompat.iteritems(childcopies): + prev = copies.get(source) + if prev is not None and prev[1] is not None: + source = prev[1] + newcopies[dest] = (c, source) assert newcopies is not copies for f in removed: if f in newcopies: @@ -330,7 +356,7 @@ # branches. when there are no other branches, this # could be avoided. newcopies = copies.copy() - del newcopies[f] + newcopies[f] = (c, None) othercopies = all_copies.get(c) if othercopies is None: all_copies[c] = newcopies @@ -338,21 +364,55 @@ # we are the second parent to work on c, we need to merge our # work with the other. # - # Unlike when copies are stored in the filelog, we consider - # it a copy even if the destination already existed on the - # other branch. It's simply too expensive to check if the - # file existed in the manifest. - # # In case of conflict, parent 1 take precedence over parent 2. # This is an arbitrary choice made anew when implementing # changeset based copies. It was made without regards with # potential filelog related behavior. if parent == 1: - othercopies.update(newcopies) + _merge_copies_dict( + othercopies, newcopies, isancestor, ismerged + ) else: - newcopies.update(othercopies) + _merge_copies_dict( + newcopies, othercopies, isancestor, ismerged + ) all_copies[c] = newcopies - return all_copies[targetrev] + + final_copies = {} + for dest, (tt, source) in all_copies[targetrev].items(): + if source is not None: + final_copies[dest] = source + return final_copies + + +def _merge_copies_dict(minor, major, isancestor, ismerged): + """merge two copies-mapping together, minor and major + + In case of conflict, value from "major" will be picked. + + - `isancestors(low_rev, high_rev)`: callable return True if `low_rev` is an + ancestors of `high_rev`, + + - `ismerged(path)`: callable return True if `path` have been merged in the + current revision, + """ + for dest, value in major.items(): + other = minor.get(dest) + if other is None: + minor[dest] = value + else: + new_tt = value[0] + other_tt = other[0] + if value[1] == other[1]: + continue + # content from "major" wins, unless it is older + # than the branch point or there is a merge + if ( + new_tt == other_tt + or not isancestor(new_tt, other_tt) + or ismerged(dest) + ): + minor[dest] = value def _forwardcopies(a, b, base=None, match=None):