Patchwork [RFC] releasenotes: add similarity check function to compare incoming notes

login
register
mail settings
Submitter Rishabh Madan
Date June 23, 2017, 4:37 p.m.
Message ID <a90693382178ca82b291.1498235842@bunty>
Download mbox | patch
Permalink /patch/21631/
State Changes Requested
Headers show

Comments

Rishabh Madan - June 23, 2017, 4:37 p.m.
# HG changeset patch
# User Rishabh Madan <rishabhmadan96@gmail.com>
# Date 1498235213 -7200
#      Fri Jun 23 18:26:53 2017 +0200
# Node ID a90693382178ca82b2918ee4b159dfb490d1bfc8
# Parent  b6e6d8df88beb042f5a37123a0ea6a9b437f7755
releasenotes: add similarity check function to compare incoming notes

It is possible that the incoming note fragments might have some similar content
as the existing release notes. In case of a bug fix, we match for issueNNNN in $
existing notes. For other general cases, it makes use of fuzzywuzzy library to
get a similarity score. If the score is above a certain threshold, we ignore
the fragment otherwise add it. But the score might be misleading for small comm$
messages. So, it uses similarity function if only the length of string (in word$
is above a certain number. The patch also adds tests related to its usage.
But it needs improvement in the sense of combining the incoming notes. We can
use interactive mode for adding the notes. Maybe we can do this if similarity
score is under a certain range.
Yuya Nishihara - June 26, 2017, 1:53 p.m.
On Fri, 23 Jun 2017 18:37:22 +0200, Rishabh Madan wrote:
> # HG changeset patch
> # User Rishabh Madan <rishabhmadan96@gmail.com>
> # Date 1498235213 -7200
> #      Fri Jun 23 18:26:53 2017 +0200
> # Node ID a90693382178ca82b2918ee4b159dfb490d1bfc8
> # Parent  b6e6d8df88beb042f5a37123a0ea6a9b437f7755
> releasenotes: add similarity check function to compare incoming notes
> 
> It is possible that the incoming note fragments might have some similar content
> as the existing release notes. In case of a bug fix, we match for issueNNNN in $
> existing notes. For other general cases, it makes use of fuzzywuzzy library to
> get a similarity score. If the score is above a certain threshold, we ignore
> the fragment otherwise add it. But the score might be misleading for small comm$
> messages. So, it uses similarity function if only the length of string (in word$
> is above a certain number. The patch also adds tests related to its usage.
> But it needs improvement in the sense of combining the incoming notes. We can
> use interactive mode for adding the notes. Maybe we can do this if similarity
> score is under a certain range.
> 
> diff -r b6e6d8df88be -r a90693382178 hgext/releasenotes.py
> --- a/hgext/releasenotes.py	Fri Jun 23 17:15:53 2017 +0200
> +++ b/hgext/releasenotes.py	Fri Jun 23 18:26:53 2017 +0200
> @@ -12,6 +12,7 @@
>  """
>  
>  from __future__ import absolute_import
> +from fuzzywuzzy import fuzz

I have no idea if the releasenotes extension may depend on third-party modules.
Given this extension wouldn't be widely used, it might be okay. But I'm not
sure, literally.

FWIW, have you tried difflib.SequenceMatcher()? Its algorithm would be more
generic, but might be good enough.

And I think this kind of functions will need a unit/doc test.
Augie Fackler - June 26, 2017, 2:23 p.m.
On Mon, Jun 26, 2017 at 10:53:27PM +0900, Yuya Nishihara wrote:
> On Fri, 23 Jun 2017 18:37:22 +0200, Rishabh Madan wrote:
> > # HG changeset patch
> > # User Rishabh Madan <rishabhmadan96@gmail.com>
> > # Date 1498235213 -7200
> > #      Fri Jun 23 18:26:53 2017 +0200
> > # Node ID a90693382178ca82b2918ee4b159dfb490d1bfc8
> > # Parent  b6e6d8df88beb042f5a37123a0ea6a9b437f7755
> > releasenotes: add similarity check function to compare incoming notes
> >
> > It is possible that the incoming note fragments might have some similar content
> > as the existing release notes. In case of a bug fix, we match for issueNNNN in $
> > existing notes. For other general cases, it makes use of fuzzywuzzy library to
> > get a similarity score. If the score is above a certain threshold, we ignore
> > the fragment otherwise add it. But the score might be misleading for small comm$
> > messages. So, it uses similarity function if only the length of string (in word$
> > is above a certain number. The patch also adds tests related to its usage.
> > But it needs improvement in the sense of combining the incoming notes. We can
> > use interactive mode for adding the notes. Maybe we can do this if similarity
> > score is under a certain range.
> >
> > diff -r b6e6d8df88be -r a90693382178 hgext/releasenotes.py
> > --- a/hgext/releasenotes.py	Fri Jun 23 17:15:53 2017 +0200
> > +++ b/hgext/releasenotes.py	Fri Jun 23 18:26:53 2017 +0200
> > @@ -12,6 +12,7 @@
> >  """
> >
> >  from __future__ import absolute_import
> > +from fuzzywuzzy import fuzz
>
> I have no idea if the releasenotes extension may depend on third-party modules.
> Given this extension wouldn't be widely used, it might be okay. But I'm not
> sure, literally.

I'm certainly not opposed to starting with fuzzywuzzy and moving to
difflib later if the results are good enough. I'll let y'all figure
that out though.

(There's some precedent already, like how we depend on pygments for highlight.)

>
> FWIW, have you tried difflib.SequenceMatcher()? Its algorithm would be more
> generic, but might be good enough.
>
> And I think this kind of functions will need a unit/doc test.
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@mercurial-scm.org
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Yuya Nishihara - June 27, 2017, 3:11 p.m.
On Mon, 26 Jun 2017 10:23:50 -0400, Augie Fackler wrote:
> On Mon, Jun 26, 2017 at 10:53:27PM +0900, Yuya Nishihara wrote:
> > On Fri, 23 Jun 2017 18:37:22 +0200, Rishabh Madan wrote:
> > > # HG changeset patch
> > > # User Rishabh Madan <rishabhmadan96@gmail.com>
> > > # Date 1498235213 -7200
> > > #      Fri Jun 23 18:26:53 2017 +0200
> > > # Node ID a90693382178ca82b2918ee4b159dfb490d1bfc8
> > > # Parent  b6e6d8df88beb042f5a37123a0ea6a9b437f7755
> > > releasenotes: add similarity check function to compare incoming notes
> > >
> > > It is possible that the incoming note fragments might have some similar content
> > > as the existing release notes. In case of a bug fix, we match for issueNNNN in $
> > > existing notes. For other general cases, it makes use of fuzzywuzzy library to
> > > get a similarity score. If the score is above a certain threshold, we ignore
> > > the fragment otherwise add it. But the score might be misleading for small comm$
> > > messages. So, it uses similarity function if only the length of string (in word$
> > > is above a certain number. The patch also adds tests related to its usage.
> > > But it needs improvement in the sense of combining the incoming notes. We can
> > > use interactive mode for adding the notes. Maybe we can do this if similarity
> > > score is under a certain range.
> > >
> > > diff -r b6e6d8df88be -r a90693382178 hgext/releasenotes.py
> > > --- a/hgext/releasenotes.py	Fri Jun 23 17:15:53 2017 +0200
> > > +++ b/hgext/releasenotes.py	Fri Jun 23 18:26:53 2017 +0200
> > > @@ -12,6 +12,7 @@
> > >  """
> > >
> > >  from __future__ import absolute_import
> > > +from fuzzywuzzy import fuzz
> >
> > I have no idea if the releasenotes extension may depend on third-party modules.
> > Given this extension wouldn't be widely used, it might be okay. But I'm not
> > sure, literally.
> 
> I'm certainly not opposed to starting with fuzzywuzzy and moving to
> difflib later if the results are good enough. I'll let y'all figure
> that out though.
> 
> (There's some precedent already, like how we depend on pygments for highlight.)

Okay.

Rishabh, can you add '#require fuzzywuzzy' rule to the releasenotes tests?
Perhaps you'll just need to copy-edit has_pygments() in tests/hghave.py.

Anyways, it's probably a good step to think about how to test this new
feature, and where functions will be split so they can be easily tested.
Jun Wu - June 30, 2017, 4:49 a.m.
Excerpts from 's message of 2017-06-27 17:37:08 +0200:
> Sure I'll do that. And about the external dependency issue, actually
> fuzzywuzzy is implemented using difflib and python levenshtien distance,
> so if we want we can import that part of the code or probably even rewrite
> it.  I discussed this with Kevin and he told me that we will have to
> discuss with them about the licensing if we plan to do anything similar to
> this.

By the way, did you notice that there is mercurial/similar.py? That file has
a "_score" which is simple but somehow effective. It is currently based on
line diffs. If we can do word diffs, do you think similar._score could be
somehow be reused to satisfy your need?
Jun Wu - July 5, 2017, 12:37 a.m.
Excerpts from Rishabh Madan's message of 2017-07-01 11:16:12 +0200:
> > By the way, did you notice that there is mercurial/similar.py? That file
> > has a "_score" which is simple but somehow effective. It is currently
> > based on line diffs. If we can do word diffs, do you think
> > similar._score could be somehow be reused to satisfy your need?
> 
> I took a look at the _score function and if I understand correctly by word
> diffs you mean first tokenizing those strings and then matching words,
> right? If we plan to do something like this, then we can go one step
> further with it and simply duplicate the fuzz function that we use from
> fuzzywuzzy. The function basically tokenizes the string and forms three
> different strings, first contains intersecting words from both strings
> (alphabetically sorted), second contains sorted intersection along with
> rest of the words from string 1 (alphabetically sorted) and third one
> contains sorted intersection with rest of the words from string 2
> (alphabetically sorted). It then simply uses SequenceMatcher from
> difflib and gets score for all possible pairs of these three strings and
> finally picks the best match score.

I see. Sounds like it's not a direct match of what you want. Since
fuzzywuzzy is working, and supporting word-diff requires some
not-too-trivial changes. I think it's better to just continue use
fuzzywuzzy now.

Patch

diff -r b6e6d8df88be -r a90693382178 hgext/releasenotes.py
--- a/hgext/releasenotes.py	Fri Jun 23 17:15:53 2017 +0200
+++ b/hgext/releasenotes.py	Fri Jun 23 18:26:53 2017 +0200
@@ -12,6 +12,7 @@ 
 """
 
 from __future__ import absolute_import
+from fuzzywuzzy import fuzz
 
 import errno
 import re
@@ -44,6 +45,7 @@ 
 ]
 
 RE_DIRECTIVE = re.compile('^\.\. ([a-zA-Z0-9_]+)::\s*([^$]+)?$')
+RE_ISSUE = r'\bissue [0-9]{4,6}(?![0-9])\b|\bissue[0-9]{4,6}(?![0-9])\b'
 
 BULLET_SECTION = _('Other Changes')
 
@@ -89,7 +91,20 @@ 
 
         This is used to combine multiple sources of release notes together.
         """
+
+        all_points = []
+
         for section in other:
+            for title, paragraphs in self.titledforsection(section):
+                str = ""
+                str = converttostring(paragraphs)
+                all_points.append(str)
+
+            for paragraphs in self.nontitledforsection(section):
+                str = ""
+                str = converttostring(paragraphs)
+                all_points.append(str)
+
             for title, paragraphs in other.titledforsection(section):
                 if self.hastitledinsection(section, title):
                     # TODO prompt for resolution if different and running in
@@ -97,18 +112,59 @@ 
                     ui.write(_('%s already exists in %s section; ignoring\n') %
                              (title, section))
                     continue
+                str_incoming = converttostring(paragraphs)
+                if section == 'fix':
+                    issues = re.findall(RE_ISSUE, str_incoming, re.IGNORECASE)
+                    if len(issues) > 0:
+                        issuenumber = issues[0]
+                        issuenumber = "".join(issuenumber.split())
+                        if any(issuenumber in s for s in all_points):
+                            ui.write(_("\"%s\" already exists in notes; "
+                                     "ignoring\n") % issuenumber)
+                            continue
+                        else:
+                            self.addtitleditem(section, title, paragraphs)
+                            continue
 
-                # TODO perform similarity comparison and try to match against
-                # existing.
-                self.addtitleditem(section, title, paragraphs)
+                if len(str_incoming.split()) > 10:
+                    merge = similaritycheck(str_incoming, all_points)
+
+                    if not merge:
+                        ui.write(_("\"%s\" already exists in notes file; "
+                                 "ignoring\n") % str_incoming)
+                    else:
+                        self.addtitleditem(section, title, paragraphs)
+                else:
+                    self.addtitleditem(section, title, paragraphs)
 
             for paragraphs in other.nontitledforsection(section):
-                if paragraphs in self.nontitledforsection(section):
-                    continue
+                str_incoming = converttostring(paragraphs)
+                if section == 'fix':
+                    issues = re.findall(RE_ISSUE, str_incoming, re.IGNORECASE)
+                    if len(issues) > 0:
+                        issuenumber = issues[0].lower()
+                        issuenumber = "".join(issuenumber.split())
+                        if any(issuenumber in s for s in all_points):
+                            ui.write(_("\"%s\" already exists in notes; "
+                                     "ignoring\n") % str_incoming)
+                            continue
+                        else:
+                            self.addnontitleditem(section, paragraphs)
+                            continue
 
-                # TODO perform similarily comparison and try to match against
-                # existing.
-                self.addnontitleditem(section, paragraphs)
+                if paragraphs in self.nontitledforsection(section):
+                        continue
+
+                if len(str_incoming.split()) > 10:
+                    merge = similaritycheck(str_incoming, all_points)
+
+                    if not merge:
+                        ui.write(_("\"%s\" already exists in notes; "
+                                 "ignoring\n") % str_incoming)
+                    else:
+                        self.addnontitleditem(section, paragraphs)
+                else:
+                    self.addnontitleditem(section, paragraphs)
 
 class releasenotessections(object):
     def __init__(self, ui):
@@ -128,6 +184,27 @@ 
 
         return None
 
+def converttostring(paragraphs):
+    """
+    Converts paragraph and bullet data to individual strings.
+    """
+    str = ""
+    for para in paragraphs:
+        str += ' '.join(para) + ' '
+    return str
+
+def similaritycheck(incoming_str, existingnotes):
+    """
+    Returns true when note fragment can be merged to existing notes.
+    """
+    merge = True
+    for bullet in existingnotes:
+        score = fuzz.token_set_ratio(incoming_str, bullet)
+        if score > 75:
+            merge = False
+            break
+    return merge
+
 def parsenotesfromrevisions(repo, directives, revs):
     notes = parsedreleasenotes()