Patchwork [1 of 4 py3] similar: avoid sorting and making copies of addedfiles and removedfiles sets

Submitter Augie Fackler
Date March 21, 2017, 7:13 p.m.
Message ID <a4745fd9219ed5b408bf.1490123580@arthedain.pit.corp.google.com>
Permalink /patch/19535/
State Accepted

Comments

Augie Fackler - March 21, 2017, 7:13 p.m.
# HG changeset patch
# User Augie Fackler <augie@google.com>
# Date 1489903420 14400
#      Sun Mar 19 02:03:40 2017 -0400
# Node ID a4745fd9219ed5b408bfc0403a4a8e6acd41df6c
# Parent  66c3ae6d886cae0e3a3cff6a0058e2d2a866fd9d
similar: avoid sorting and making copies of addedfiles and removedfiles sets

The process of porting to Python 3 exposed some weirdness in this
code: workingfilectx doesn't define rich comparison operators, so in
Python 2 the default comparison devolved to id(value), which is the
pointer address of the object. Inspection of _findexactmatches and
_findsimilarmatches revealed that they didn't care about the sort
order of the data, so we remove the sort (and potentially make
addremove faster since it's not sorting things). We now have to do one
little extra set dance in order to not mutate the addedfiles set
during its iteration, but that's a small price to pay for the
resulting cleaner nature of the code.
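[Editor's note: a minimal standalone sketch of the two points above, in plain Python rather than Mercurial code. Wfctx and findexactmatches here are stand-ins, not the real workingfilectx / _findexactmatches. It shows why sorting objects without rich comparisons breaks on Python 3, and the "collect, then subtract" pattern that lets matched entries be dropped without mutating the set while it is being iterated.]

    # Standalone sketch only: Wfctx and findexactmatches are stand-ins, not
    # the real workingfilectx / _findexactmatches.

    class Wfctx(object):
        """Stand-in for workingfilectx: defines no rich comparison operators."""
        def __init__(self, path):
            self._path = path
        def path(self):
            return self._path

    addedfiles = {Wfctx('a.txt'), Wfctx('b.txt')}
    removedfiles = {Wfctx('old.txt')}

    # Python 2 fell back to comparing such objects by id() (their address),
    # so sorted() "worked"; Python 3 raises TypeError because there is no
    # __lt__ defined.
    try:
        sorted(addedfiles)
    except TypeError as exc:
        print('cannot sort on Python 3:', exc)

    def findexactmatches(added, removed):
        # placeholder: pretend every removed file exactly matches every added one
        for b in added:
            for a in removed:
                yield a, b

    # Removing from addedfiles inside the loop would raise
    # "RuntimeError: Set changed size during iteration" now that the set
    # itself is iterated rather than a sorted() copy, so collect first and
    # subtract once iteration is done.
    matched = set()
    for a, b in findexactmatches(addedfiles, removedfiles):
        matched.add(b)
        print('exact match: %s -> %s' % (a.path(), b.path()))
    addedfiles -= matched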
Yuya Nishihara - March 22, 2017, 1:43 p.m.
On Tue, 21 Mar 2017 15:13:00 -0400, Augie Fackler wrote:
> # HG changeset patch
> # User Augie Fackler <augie@google.com>
> # Date 1489903420 14400
> #      Sun Mar 19 02:03:40 2017 -0400
> # Node ID a4745fd9219ed5b408bfc0403a4a8e6acd41df6c
> # Parent  66c3ae6d886cae0e3a3cff6a0058e2d2a866fd9d
> similar: avoid sorting and making copies of addedfiles and removedfiles sets
> 
> The process of porting to Python 3 exposed some weirdness in this
> code: workingfilectx doesn't define rich comparison operators, so in
> Python 2 the default comparison devolved to id(value), which is the
> pointer address of the object. Inspection of _findexactmatches and
> _findsimilarmatches revealed that they didn't care about the sort
> order of the data, so we remove the sort (and potentially make
> addremove faster since it's not sorting things). We now have to do one
> little extra set dance in order to not mutate the addedfiles set
> during its iteration, but that's a small price to pay for the
> resulting cleaner nature of the code.

I have an unsent patch to sort files alphabetically, not by memory address. IIRC,
I wrote it because a TortoiseHg user reported that the similarity detection
was unstable if there were multiple candidates of the copy source.

I'll post my patch soon.
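[Editor's note: Yuya's patch is not shown in this thread. As a rough idea of the direction only, an alphabetical ordering keyed on the file path, rather than on object identity, might look something like the hypothetical helper below.]

    # Hypothetical helper, not Yuya's actual patch: order file contexts by
    # their repository path so similarity detection sees a stable input.
    def sortedbypath(fctxs):
        return sorted(fctxs, key=lambda fctx: fctx.path())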
Augie Fackler - March 22, 2017, 3 p.m.
> On Mar 22, 2017, at 09:43, Yuya Nishihara <yuya@tcha.org> wrote:
> 
> On Tue, 21 Mar 2017 15:13:00 -0400, Augie Fackler wrote:
>> # HG changeset patch
>> # User Augie Fackler <augie@google.com>
>> # Date 1489903420 14400
>> #      Sun Mar 19 02:03:40 2017 -0400
>> # Node ID a4745fd9219ed5b408bfc0403a4a8e6acd41df6c
>> # Parent  66c3ae6d886cae0e3a3cff6a0058e2d2a866fd9d
>> similar: avoid sorting and making copies of addedfiles and removedfiles sets
>> 
>> The process of porting to Python 3 exposed some weirdness in this
>> code: workingfilectx doesn't define rich comparison operators, so in
>> Python 2 the default comparison devolved to id(value), which is the
>> pointer address of the object. Inspection of _findexactmatches and
>> _findsimilarmatches revealed that they didn't care about the sort
>> order of the data, so we remove the sort (and potentially make
>> addremove faster since it's not sorting things). We now have to do one
>> little extra set dance in order to not mutate the addedfiles set
>> during its iteration, but that's a small price to pay for the
>> resulting cleaner nature of the code.
> 
> I have an unsent patch to sort files alphabetically, not by memory address. IIRC,
> I wrote it because a TortoiseHg user reported that the similarity detection
> was unstable if there were multiple candidates of the copy source.

Ah, I see by inspection now: the issue arises when multiple files have the same amount of similarity (or when there are multiple exact matches); if we don't sort the list, we'll end up with slightly random results.
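
[Editor's note: a contrived sketch of that failure mode, with a placeholder scoring function rather than Mercurial's. When two candidates tie, whichever one the unsorted set happens to yield first becomes the reported copy source, and that order is not stable across runs.]

    # Contrived illustration with a placeholder score function: both removed
    # files are equally similar to the added file, so the "best" match is
    # simply whichever the set yields first -- and set iteration order is
    # not stable across runs.
    def bestsource(addedpath, removedpaths, score):
        best = None
        for r in removedpaths:                # arbitrary set iteration order
            s = score(addedpath, r)
            if best is None or s > best[1]:   # strict '>' keeps the first tie
                best = (r, s)
        return best

    removedpaths = {'copy1.txt', 'copy2.txt'}
    print(bestsource('new.txt', removedpaths, lambda a, r: 1.0))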

> I'll post my patch soon.

Thanks!

Patch

diff --git a/mercurial/similar.py b/mercurial/similar.py
--- a/mercurial/similar.py
+++ b/mercurial/similar.py
@@ -107,13 +107,14 @@  def findrenames(repo, added, removed, th
             if fp in parentctx and parentctx[fp].size() > 0])
 
     # Find exact matches.
-    for (a, b) in _findexactmatches(repo,
-            sorted(addedfiles), sorted(removedfiles)):
-        addedfiles.remove(b)
+    addedremove = set()
+    for (a, b) in _findexactmatches(repo, addedfiles, removedfiles):
+        addedremove.add(b)
         yield (a.path(), b.path(), 1.0)
+    addedfiles -= addedremove
 
     # If the user requested similar files to be matched, search for them also.
     if threshold < 1.0:
         for (a, b, score) in _findsimilarmatches(repo,
-                sorted(addedfiles), sorted(removedfiles), threshold):
+                addedfiles, removedfiles, threshold):
             yield (a.path(), b.path(), score)