Submitter | Augie Fackler |
---|---|
Date | March 21, 2017, 7:13 p.m. |
Message ID | <a4745fd9219ed5b408bf.1490123580@arthedain.pit.corp.google.com> |
Download | mbox | patch |
Permalink | /patch/19535/ |
State | Accepted |
Headers | show |
Comments
On Tue, 21 Mar 2017 15:13:00 -0400, Augie Fackler wrote: > # HG changeset patch > # User Augie Fackler <augie@google.com> > # Date 1489903420 14400 > # Sun Mar 19 02:03:40 2017 -0400 > # Node ID a4745fd9219ed5b408bfc0403a4a8e6acd41df6c > # Parent 66c3ae6d886cae0e3a3cff6a0058e2d2a866fd9d > similar: avoid sorting and making copies of addedfiles and removedfiles sets > > The process of porting to Python 3 exposed some weirdness in this > code: workingfilectx doesn't define rich comparison operators, so in > Python 2 the default comparison devolved to id(value), which is the > pointer address of the object. Inspection of _findexactmatches and > _findsimilarmatches revealed that they didn't care about the sort > order of the data, so we remove the sort (and potentially make > addremove faster since it's not sorting things). We now have to do one > little extra set dance in order to not mutate the addedfiles set > during its iteration, but that's a small price to pay for the > resulting cleaner nature of the code. I have unsent patch to sort files alphabetically, not by memory address. IIRC, I wrote it because a TortoiseHg user reported that the similarity detection was unstable if there were multiple candidates of the copy source. I'll post my patch soon.
> On Mar 22, 2017, at 09:43, Yuya Nishihara <yuya@tcha.org> wrote: > > On Tue, 21 Mar 2017 15:13:00 -0400, Augie Fackler wrote: >> # HG changeset patch >> # User Augie Fackler <augie@google.com> >> # Date 1489903420 14400 >> # Sun Mar 19 02:03:40 2017 -0400 >> # Node ID a4745fd9219ed5b408bfc0403a4a8e6acd41df6c >> # Parent 66c3ae6d886cae0e3a3cff6a0058e2d2a866fd9d >> similar: avoid sorting and making copies of addedfiles and removedfiles sets >> >> The process of porting to Python 3 exposed some weirdness in this >> code: workingfilectx doesn't define rich comparison operators, so in >> Python 2 the default comparison devolved to id(value), which is the >> pointer address of the object. Inspection of _findexactmatches and >> _findsimilarmatches revealed that they didn't care about the sort >> order of the data, so we remove the sort (and potentially make >> addremove faster since it's not sorting things). We now have to do one >> little extra set dance in order to not mutate the addedfiles set >> during its iteration, but that's a small price to pay for the >> resulting cleaner nature of the code. > > I have unsent patch to sort files alphabetically, not by memory address. IIRC, > I wrote it because a TortoiseHg user reported that the similarity detection > was unstable if there were multiple candidates of the copy source. Ah, I see by inspection now: the issue arises if multiple files are the same amount of similarity (or if there are multiple exact matches), if we don't sort the list, we'll end up with slightly random results. > I'll post my patch soon. Thanks!
Patch
diff --git a/mercurial/similar.py b/mercurial/similar.py --- a/mercurial/similar.py +++ b/mercurial/similar.py @@ -107,13 +107,14 @@ def findrenames(repo, added, removed, th if fp in parentctx and parentctx[fp].size() > 0]) # Find exact matches. - for (a, b) in _findexactmatches(repo, - sorted(addedfiles), sorted(removedfiles)): - addedfiles.remove(b) + addedremove = set() + for (a, b) in _findexactmatches(repo, addedfiles, removedfiles): + addedremove.add(b) yield (a.path(), b.path(), 1.0) + addedfiles -= addedremove # If the user requested similar files to be matched, search for them also. if threshold < 1.0: for (a, b, score) in _findsimilarmatches(repo, - sorted(addedfiles), sorted(removedfiles), threshold): + addedfiles, removedfiles, threshold): yield (a.path(), b.path(), score)