Patchwork [2,of,2,v6] releasenotes: add similarity check function to compare incoming notes

Submitter Rishabh Madan
Date July 29, 2017, 10 a.m.
Message ID <d75147e1966090d53f77.1501322446@bunty>
Permalink /patch/22577/
State Superseded

Comments

Rishabh Madan - July 29, 2017, 10 a.m.
# HG changeset patch
# User Rishabh Madan <rishabhmadan96@gmail.com>
# Date 1501320794 -19800
#      Sat Jul 29 15:03:14 2017 +0530
# Node ID d75147e1966090d53f773f327886f0103152b1ca
# Parent  1d79b04c402f3f431ca052b677b1021ddd93a10e
releasenotes: add similarity check function to compare incoming notes

Incoming note fragments may duplicate content that is already in the existing
release notes. For bug fixes, we match issueNNNN against the existing notes.
For other cases, we use the fuzzywuzzy library to compute a similarity score.
If the score is above a certain threshold, the fragment is ignored; otherwise
it is added. Because the score can be misleading for short commit messages,
the similarity function is used only if the length of the string (in words) is
above a certain number. The patch also adds tests for this behavior.
Combining similar incoming notes still needs improvement. We could use
interactive mode for adding the notes, perhaps when the similarity score falls
within a certain range.
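
The decision rule described above can be sketched as follows. This is only an illustration, not the patch's code: it swaps in the standard library's difflib for fuzzywuzzy so that it runs without extra dependencies, and the names THRESHOLD, MIN_WORDS, score and is_duplicate are hypothetical. The values 75 and 10 mirror the ones used in the patch.

```python
import difflib

# Values mirroring the patch; the constant names are hypothetical.
THRESHOLD = 75
MIN_WORDS = 10

def score(a, b):
    """Return a 0-100 similarity score using difflib (not fuzzywuzzy)."""
    ratio = difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return int(round(100 * ratio))

def is_duplicate(incoming, existing):
    """True if the incoming note is long enough and resembles an existing one."""
    if len(incoming.split()) <= MIN_WORDS:
        # Short notes are never compared fuzzily; the score is unreliable.
        return False
    return any(score(incoming, note) > THRESHOLD for note in existing)

existing = ["This is a long release note about fixing the frobnicator "
            "crash on startup today."]
same = is_duplicate(existing[0], existing)
short = is_duplicate("Adds a new feature 1.", existing)
```

difflib scores character sequences, whereas fuzz.token_set_ratio compares token sets, so the two will not produce the same numbers; only the threshold-plus-minimum-length structure carries over.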
Yuya Nishihara - July 30, 2017, 5:35 a.m.
On Sat, 29 Jul 2017 15:30:46 +0530, Rishabh Madan wrote:
> # HG changeset patch
> # User Rishabh Madan <rishabhmadan96@gmail.com>
> # Date 1501320794 -19800
> #      Sat Jul 29 15:03:14 2017 +0530
> # Node ID d75147e1966090d53f773f327886f0103152b1ca
> # Parent  1d79b04c402f3f431ca052b677b1021ddd93a10e
> releasenotes: add similarity check function to compare incoming notes

> +            titlediter = iter(self.titledforsection(section))
> +            nontitlediter = iter(self.nontitledforsection(section))
> +            existingnotes = converttitled(titlediter) + \
> +                convertnontitled(nontitlediter)

You don't need these explicit casts from list to iterator.

>              for title, paragraphs in other.titledforsection(section):
>                  if self.hastitledinsection(section, title):
>                      # TODO prompt for resolution if different and running in
> @@ -100,16 +106,32 @@
>                               (title, section))
>                      continue
>  
> -                # TODO perform similarity comparison and try to match against
> -                # existing.
> +                incoming_str = converttitled(paragraphs)[0]

So converttitled() takes

 a) a list of (title, paragraphs) tuples
 b) a list of paragraphs

Which is the valid use?

> +def converttitled(iterable):
> +    """
> +    Convert titled paragraphs to strings
> +    """
> +    string_list = []
> +    for titledparagraphs in iterable:
> +        str = ""
> +        for paragraphs in titledparagraphs:
> +            if isinstance(paragraphs, basestring):
> +                continue

Maybe this isinstance() is necessary because the type of the given iterable is
unstable?

> +            else:
> +                for para in paragraphs:
> +                    str += ' '.join(para) + ' '

Nit: str.join([str]) is generally cheaper than doing str + str repeatedly.
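
A minimal sketch of the join-based form, assuming paragraphs is a list of paragraphs and each paragraph is a list of lines; the helper name flatten is hypothetical:

```python
def flatten(paragraphs):
    """Join a list of paragraphs (each a list of lines) into one string."""
    # Build every piece lazily and join once, instead of growing a
    # string with repeated concatenation inside the loop.
    return ' '.join(' '.join(para) for para in paragraphs)

paragraphs = [['This is a fix', 'with another line.'],
              ['And it is a big one.']]
flat = flatten(paragraphs)
```

Unlike the loop in the patch, this does not leave a trailing space at the end of the result.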

> +def getissuenum(incoming_str):
> +    """
> +    Returns issue number from the incoming string if it exists
> +    """
> +    issue = re.search(RE_ISSUE, incoming_str, re.IGNORECASE)
> +    if issue:
> +        issue = issue.group()
> +        issue = "".join(issue.split())

I couldn't figure out why it strips whitespace. Perhaps this could be stated
explicitly as follows.

  r'\b(issue) ?([0-9]{4,6}(?![0-9]))\b'
  group(1) + group(2)
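
A sketch of that suggestion, assuming the intended pattern puts "issue" in the first group and the digits in the second; compiling the pattern with re.IGNORECASE is a choice made for this example:

```python
import re

# Group 1 is the literal word, group 2 the 4-6 digit number. The optional
# space between them sits outside both groups, so it is dropped when the
# groups are concatenated.
RE_ISSUE = re.compile(r'\b(issue) ?([0-9]{4,6}(?![0-9]))\b', re.IGNORECASE)

def getissuenum(incoming_str):
    """Return the issue token without internal whitespace, or None."""
    m = RE_ISSUE.search(incoming_str)
    if m is None:
        return None
    return m.group(1) + m.group(2)

compact = getissuenum('Fixed issue4567 related to ABC.')
spaced = getissuenum('Resolved Issue 1234.')
```

Concatenating the groups makes it explicit that only the optional space is removed, so no split()/join() pass is needed.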

> +def findissue(existing, issue, ui):
> +    """
> +    Returns true if issue number already exists in notes.
> +    """
> +    if any(issue in s for s in existing):
> +        ui.write(_("\"%s\" already exists in notes; "
> +                 "ignoring\n") % issue)
> +        return True
> +    else:
> +        return False
> +
> +def similar(existing, incoming_str, ui):
> +    """
> +    Returns true if similar note found in existing notes.
> +    """
> +    if len(incoming_str.split()) > 10:
> +        merge = similaritycheck(incoming_str, existing)
> +        if not merge:
> +            ui.write(_("\"%s\" already exists in notes file; "
> +                     "ignoring\n") % incoming_str)
> +            return True
> +        else:
> +            return False
> +    else:
> +        return False

Nit: I slightly prefer (ui, ...) over (..., ui) because it seems more common. Or
even ui.write() could be moved to the caller.
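
One way the second option might look, as a sketch only: findissue becomes a pure predicate and the caller does the reporting. FakeUI and merge_fix are hypothetical names used for illustration; FakeUI merely stands in for Mercurial's ui object, and the translation wrapper _() is omitted.

```python
class FakeUI(object):
    """Hypothetical stand-in for Mercurial's ui object; records output."""
    def __init__(self):
        self.lines = []

    def write(self, msg):
        self.lines.append(msg)

def findissue(existing, issue):
    """Pure predicate: takes no ui argument and produces no output."""
    return any(issue in s for s in existing)

def merge_fix(ui, existing, issue):
    """Caller decides what to report. Returns True if the note is skipped."""
    if findissue(existing, issue):
        ui.write('"%s" already exists in notes; ignoring\n' % issue)
        return True
    return False

ui = FakeUI()
skipped = merge_fix(ui, ['Fixed issue4567 related to ABC.'], 'issue4567')
```

Keeping the predicate free of output also makes it easier to test in isolation.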
Yuya Nishihara - July 30, 2017, 3:46 p.m.
On Sun, 30 Jul 2017 20:08:33 +0530, Rishabh Madan wrote:
> On Sun, Jul 30, 2017 at 11:05 AM, Yuya Nishihara <yuya@tcha.org> wrote:
> > So converttitled() takes
> >
> >  a) a list of (title, paragraphs) tuples
> >  b) a list of paragraphs
> >
> > Which is the valid use?
> >
> 
> The first one is what it takes as input.
> 
> >
> > > +def converttitled(iterable):
> > > +    """
> > > +    Convert titled paragraphs to strings
> > > +    """
> > > +    string_list = []
> > > +    for titledparagraphs in iterable:
> > > +        str = ""
> > > +        for paragraphs in titledparagraphs:
> > > +            if isinstance(paragraphs, basestring):
> > > +                continue
> >
> > Maybe this isinstance() is necessary because the type of the given
> > iterable is unstable?
> >
> I need to iterate through data like `('Title fix ', [['adds
> fix to notes']])`. So for this case, title is a string but notes is a list,
> hence I'm using isinstance to check if the value is a string or not. In
> case it is a string, that means it's the title, so I just continue iterating.

If titledparagraphs is [(str, [str])], you can simply destructure the type.

  for title, paragraphs in titledparagraphs:
      ...
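
Spelled out as a sketch, assuming each item is a (title, paragraphs) tuple and each paragraph is a list of lines, as in the `('Title fix ', [['adds fix to notes']])` example quoted above:

```python
def converttitled(titledparagraphs):
    """Convert (title, paragraphs) tuples to one string per titled note."""
    strings = []
    # Unpacking the tuple removes the need for an isinstance() check:
    # the title is simply bound to its own name and left unused.
    for title, paragraphs in titledparagraphs:
        strings.append(' '.join(' '.join(para) for para in paragraphs))
    return strings

notes = [('Title fix ', [['adds fix to notes']])]
converted = converttitled(notes)
```

With this shape the function accepts only a list of tuples, which settles the question of which of the two call styles is valid.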

Patch

diff -r 1d79b04c402f -r d75147e19660 hgext/releasenotes.py
--- a/hgext/releasenotes.py	Sat Jul 29 14:06:26 2017 +0530
+++ b/hgext/releasenotes.py	Sat Jul 29 15:03:14 2017 +0530
@@ -14,6 +14,7 @@ 
 from __future__ import absolute_import
 
 import errno
+import fuzzywuzzy.fuzz as fuzz
 import re
 import sys
 import textwrap
@@ -46,6 +47,7 @@ 
 ]
 
 RE_DIRECTIVE = re.compile('^\.\. ([a-zA-Z0-9_]+)::\s*([^$]+)?$')
+RE_ISSUE = r'\bissue ?[0-9]{4,6}(?![0-9])\b'
 
 BULLET_SECTION = _('Other Changes')
 
@@ -92,6 +94,10 @@ 
         This is used to combine multiple sources of release notes together.
         """
         for section in other:
+            titlediter = iter(self.titledforsection(section))
+            nontitlediter = iter(self.nontitledforsection(section))
+            existingnotes = converttitled(titlediter) + \
+                convertnontitled(nontitlediter)
             for title, paragraphs in other.titledforsection(section):
                 if self.hastitledinsection(section, title):
                     # TODO prompt for resolution if different and running in
@@ -100,16 +106,32 @@ 
                              (title, section))
                     continue
 
-                # TODO perform similarity comparison and try to match against
-                # existing.
+                incoming_str = converttitled(paragraphs)[0]
+                if section == 'fix':
+                    issue = getissuenum(incoming_str)
+                    if issue:
+                        if findissue(existingnotes, issue, ui):
+                            continue
+
+                if similar(existingnotes, incoming_str, ui):
+                    continue
+
                 self.addtitleditem(section, title, paragraphs)
 
             for paragraphs in other.nontitledforsection(section):
                 if paragraphs in self.nontitledforsection(section):
                     continue
 
-                # TODO perform similarily comparison and try to match against
-                # existing.
+                incoming_str = convertnontitled(paragraphs)[0]
+                if section == 'fix':
+                    issue = getissuenum(incoming_str)
+                    if issue:
+                        if findissue(existingnotes, issue, ui):
+                            continue
+
+                if similar(existingnotes, incoming_str, ui):
+                    continue
+
                 self.addnontitleditem(section, paragraphs)
 
 class releasenotessections(object):
@@ -136,6 +158,86 @@ 
 
         return None
 
+def converttitled(iterable):
+    """
+    Convert titled paragraphs to strings
+    """
+    string_list = []
+    for titledparagraphs in iterable:
+        str = ""
+        for paragraphs in titledparagraphs:
+            if isinstance(paragraphs, basestring):
+                continue
+            else:
+                for para in paragraphs:
+                    str += ' '.join(para) + ' '
+        string_list.append(str)
+    return string_list
+
+def convertnontitled(iterable):
+    """
+    Convert non-titled bullets to strings
+    """
+    string_list = []
+    for paragraphs in iterable:
+        str = ""
+        if isinstance(paragraphs[0], basestring):
+            str += ' '.join(paragraphs) + ' '
+        else:
+            for para in paragraphs:
+                str += ' '.join(para) + ' '
+        string_list.append(str)
+    return string_list
+
+def getissuenum(incoming_str):
+    """
+    Returns issue number from the incoming string if it exists
+    """
+    issue = re.search(RE_ISSUE, incoming_str, re.IGNORECASE)
+    if issue:
+        issue = issue.group()
+        issue = "".join(issue.split())
+    return issue
+
+
+def findissue(existing, issue, ui):
+    """
+    Returns true if issue number already exists in notes.
+    """
+    if any(issue in s for s in existing):
+        ui.write(_("\"%s\" already exists in notes; "
+                 "ignoring\n") % issue)
+        return True
+    else:
+        return False
+
+def similar(existing, incoming_str, ui):
+    """
+    Returns true if similar note found in existing notes.
+    """
+    if len(incoming_str.split()) > 10:
+        merge = similaritycheck(incoming_str, existing)
+        if not merge:
+            ui.write(_("\"%s\" already exists in notes file; "
+                     "ignoring\n") % incoming_str)
+            return True
+        else:
+            return False
+    else:
+        return False
+
+def similaritycheck(incoming_str, existingnotes):
+    """
+    Returns true when note fragment can be merged to existing notes.
+    """
+    merge = True
+    for bullet in existingnotes:
+        score = fuzz.token_set_ratio(incoming_str, bullet)
+        if score > 75:
+            merge = False
+            break
+    return merge
+
 def getcustomadmonitions(repo):
     ctx = repo['.']
     p = config.config()
diff -r 1d79b04c402f -r d75147e19660 tests/test-releasenotes-formatting.t
--- a/tests/test-releasenotes-formatting.t	Sat Jul 29 14:06:26 2017 +0530
+++ b/tests/test-releasenotes-formatting.t	Sat Jul 29 15:03:14 2017 +0530
@@ -1,3 +1,5 @@ 
+#require fuzzywuzzy
+
   $ cat >> $HGRCPATH << EOF
   > [extensions]
   > releasenotes=
diff -r 1d79b04c402f -r d75147e19660 tests/test-releasenotes-merging.t
--- a/tests/test-releasenotes-merging.t	Sat Jul 29 14:06:26 2017 +0530
+++ b/tests/test-releasenotes-merging.t	Sat Jul 29 15:03:14 2017 +0530
@@ -1,3 +1,5 @@ 
+#require fuzzywuzzy
+
   $ cat >> $HGRCPATH << EOF
   > [extensions]
   > releasenotes=
@@ -158,3 +160,122 @@ 
   
   * this is fix3.
 
+  $ cd ..
+
+Ignores commit messages containing issueNNNN based on issue number.
+
+  $ hg init simple-fuzzrepo
+  $ cd simple-fuzzrepo
+  $ touch fix1
+  $ hg -q commit -A -l - << EOF
+  > commit 1
+  > 
+  > .. fix::
+  > 
+  >    Resolved issue4567.
+  > EOF
+
+  $ cat >> $TESTTMP/issue-number-notes << EOF
+  > Bug Fixes
+  > =========
+  > 
+  > * Fixed issue1234 related to XYZ.
+  > 
+  > * Fixed issue4567 related to ABC.
+  > 
+  > * Fixed issue3986 related to PQR.
+  > EOF
+
+  $ hg releasenotes -r . $TESTTMP/issue-number-notes
+  "issue4567" already exists in notes; ignoring
+
+  $ cat $TESTTMP/issue-number-notes
+  Bug Fixes
+  =========
+  
+  * Fixed issue1234 related to XYZ.
+  
+  * Fixed issue4567 related to ABC.
+  
+  * Fixed issue3986 related to PQR.
+
+  $ cd ..
+
+Adds short commit messages (words < 10) without
+comparison unless there is an exact match.
+
+  $ hg init tempdir
+  $ cd tempdir
+  $ touch feature1
+  $ hg -q commit -A -l - << EOF
+  > commit 1
+  > 
+  > .. feature::
+  > 
+  >    Adds a new feature 1.
+  > EOF
+
+  $ hg releasenotes -r . $TESTTMP/short-sentence-notes
+
+  $ touch feature2
+  $ hg -q commit -A -l - << EOF
+  > commit 2
+  > 
+  > .. feature::
+  > 
+  >    Adds a new feature 2.
+  > EOF
+
+  $ hg releasenotes -r . $TESTTMP/short-sentence-notes
+  $ cat $TESTTMP/short-sentence-notes
+  New Features
+  ============
+  
+  * Adds a new feature 1.
+  
+  * Adds a new feature 2.
+
+  $ cd ..
+
+Ignores commit messages based on fuzzy comparison.
+
+  $ hg init fuzznotes
+  $ cd fuzznotes
+  $ touch fix1
+  $ hg -q commit -A -l - << EOF
+  > commit 1
+  > 
+  > .. fix::
+  > 
+  >    This is a fix with another line.
+  >    And it is a big one.
+  > EOF
+
+  $ cat >> $TESTTMP/fuzz-ignore-notes << EOF
+  > Bug Fixes
+  > =========
+  > 
+  > * Fixed issue4567 by improving X.
+  > 
+  > * This is the first line. This is next line with one newline.
+  > 
+  >   This is another line written after two newlines. This is going to be a big one.
+  > 
+  > * This fixes another problem.
+  > EOF
+
+  $ hg releasenotes -r . $TESTTMP/fuzz-ignore-notes
+  "This is a fix with another line. And it is a big one. " already exists in notes file; ignoring
+
+  $ cat $TESTTMP/fuzz-ignore-notes
+  Bug Fixes
+  =========
+  
+  * Fixed issue4567 by improving X.
+  
+  * This is the first line. This is next line with one newline.
+  
+    This is another line written after two newlines. This is going to be a big
+    one.
+  
+  * This fixes another problem.
diff -r 1d79b04c402f -r d75147e19660 tests/test-releasenotes-parsing.t
--- a/tests/test-releasenotes-parsing.t	Sat Jul 29 14:06:26 2017 +0530
+++ b/tests/test-releasenotes-parsing.t	Sat Jul 29 15:03:14 2017 +0530
@@ -1,3 +1,5 @@ 
+#require fuzzywuzzy
+
   $ cat >> $HGRCPATH << EOF
   > [extensions]
   > releasenotes=