Patchwork [2,of,2,v3] releasenotes: add similarity check function to compare incoming notes

Submitter Rishabh Madan
Date July 9, 2017, 9:45 p.m.
Message ID <c83144b6b48a2374a371.1499636725@bunty>
Permalink /patch/22183/
State Accepted

Comments

Rishabh Madan - July 9, 2017, 9:45 p.m.
# HG changeset patch
# User Rishabh Madan <rishabhmadan96@gmail.com>
# Date 1499636627 -7200
#      Sun Jul 09 23:43:47 2017 +0200
# Node ID c83144b6b48a2374a37114336e58974546e915e3
# Parent  43c97ccdfd39cfa447100b54e923d1d3d753476b
releasenotes: add similarity check function to compare incoming notes

It is possible that the incoming note fragments might have some similar content
as the existing release notes. In case of a bug fix, we match for issueNNNN in
the existing notes. For other general cases, it makes use of the fuzzywuzzy
library to get a similarity score. If the score is above a certain threshold,
we ignore the fragment; otherwise we add it. But the score might be misleading
for small commit messages. So, it uses the similarity function only if the
length of the string (in words) is above a certain number. The patch also adds
tests related to its usage. But it needs improvement in the sense of combining
the incoming notes. We can use interactive mode for adding the notes. Maybe we
can do this if the similarity score falls within a certain range.
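
The decision flow described above can be sketched roughly as follows. This is
only an illustration, not the code in the patch: shouldadd(), similarity(),
MIN_WORDS and THRESHOLD are hypothetical names (the values 10 and 75 are the
ones used in the patch), and the scorer is a stdlib difflib stand-in rather
than fuzzywuzzy's token_set_ratio, so actual scores will differ.

```python
import re
from difflib import SequenceMatcher

# Same pattern as RE_ISSUE in the patch.
RE_ISSUE = r'\bissue [0-9]{4,6}(?![0-9])\b|\bissue[0-9]{4,6}(?![0-9])\b'

MIN_WORDS = 10   # hypothetical name; fragments longer than this are compared
THRESHOLD = 75   # hypothetical name; score above this means "too similar"

def similarity(a, b):
    """Stand-in scorer (0-100) using difflib, NOT fuzz.token_set_ratio."""
    return int(round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()))

def shouldadd(section, incoming, existingnotes):
    """Return True if the incoming fragment should be added to the notes."""
    if section == 'fix':
        m = re.search(RE_ISSUE, incoming, re.IGNORECASE)
        if m:
            # Normalize "issue 4567" to "issue4567" before looking it up.
            issue = ''.join(m.group(0).split())
            return not any(issue in note for note in existingnotes)
    if len(incoming.split()) > MIN_WORDS:
        return not any(similarity(incoming, note) > THRESHOLD
                       for note in existingnotes)
    # Short fragments are added without a fuzzy comparison.
    return True
```

As in the patch, a fix fragment that mentions an issue number is decided by
that number alone, and short fragments skip the fuzzy comparison entirely.
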
Yuya Nishihara - July 10, 2017, 1:52 p.m.
On Sun, 09 Jul 2017 23:45:25 +0200, Rishabh Madan wrote:
> # HG changeset patch
> # User Rishabh Madan <rishabhmadan96@gmail.com>
> # Date 1499636627 -7200
> #      Sun Jul 09 23:43:47 2017 +0200
> # Node ID c83144b6b48a2374a37114336e58974546e915e3
> # Parent  43c97ccdfd39cfa447100b54e923d1d3d753476b
> releasenotes: add similarity check function to compare incoming notes

Generally looks good, but it can't be applied without the custom admonition
patch. Can you send a new version as a series?

>          This is used to combine multiple sources of release notes together.
>          """
> +        existing = self
> +
>          for section in other:
> +            merge = mergereleasenotes(section, existing)
>              for title, paragraphs in other.titledforsection(section):

> -                if self.hastitledinsection(section, title):
> -                    # TODO prompt for resolution if different and running in
> -                    # interactive mode.
> -                    ui.write(_('%s already exists in %s section; ignoring\n') %
> -                             (title, section))
> -                    continue

Can you leave the existing deduplication code untouched? It's probably a
good idea to move it to mergereleasenotes(), but that makes a dependency
from mergereleasenotes to parsedreleasenotes.

>              for paragraphs in other.nontitledforsection(section):
> -                if paragraphs in self.nontitledforsection(section):
> -                    continue

This too.

> +class mergereleasenotes(object):
> +    def __init__(self, section, existing):
> +        self.incoming_points = []
> +        self.existing_points = self.gatherexistingnotes(section, existing)
> +
> +    def gatherexistingnotes(self, section, existing):
> +        existing_points = []
> +        for title, paragraphs in existing.titledforsection(section):
> +            str = ""

Nit: unused assignment.

> +            str = converttostring(paragraphs)
> +            existing_points.append(str)
> +
> +        for paragraphs in existing.nontitledforsection(section):
> +            str = ""
> +            str = converttostring(paragraphs)
> +            existing_points.append(str)
> +        return existing_points

If gatherexistingnotes() takes a simple iterable, mergereleasenotes() doesn't
have to know the parsedreleasenotes class. That isn't important, but seems
slightly better.
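
A minimal sketch of that idea, assuming each note is a list of paragraphs and
each paragraph is a list of lines (which is what converttostring() in the patch
appears to expect). The iterable-based signature and the sample data are
hypothetical, and converttostring() is written here in a join-based form that
does not add the trailing space the patch version produces.

```python
import itertools

def converttostring(paragraphs):
    """Flatten a list of paragraphs (each a list of lines) into one string."""
    words = []
    for para in paragraphs:
        words.extend(para)
    return ' '.join(words)

def gatherexistingnotes(notes):
    """Convert an iterable of paragraph lists into a list of plain strings.

    The function only iterates over ``notes``; it never touches the
    parsedreleasenotes class, so the caller decides where the data comes from.
    """
    return [converttostring(paragraphs) for paragraphs in notes]

# Hypothetical data shaped like the results of titledforsection() and
# nontitledforsection() for one section.
titled = [('Some title', [['first line', 'second line']])]
nontitled = [[['a bullet']]]

points = gatherexistingnotes(
    itertools.chain((paragraphs for _title, paragraphs in titled), nontitled))
```

With this shape, the caller in merge() would be the only place that knows
about titledforsection() and nontitledforsection().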

> +    def check_merge(self, ui, section, paragraphs, existing, title=None):
> +        if title:
> +            if existing.hastitledinsection(section, title):
> +                ui.write(_('%s already exists in %s section; ignoring\n') %
> +                         (title, section))
> +                return False
> +        elif paragraphs in existing.nontitledforsection(section):
> +            return False
> +
> +        incoming_str = converttostring(paragraphs)
> +        if section == 'fix':
> +            issues = re.findall(RE_ISSUE, incoming_str, re.IGNORECASE)
> +            if len(issues) > 0:
> +                issuenumber = issues[0]

re.search() can be used if you only need the first match.
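
For illustration, a small sketch of the re.search() variant, reusing the
RE_ISSUE pattern from the patch. The helper name firstissue() is hypothetical
and is not part of the patch.

```python
import re

# Same pattern as RE_ISSUE in the patch.
RE_ISSUE = r'\bissue [0-9]{4,6}(?![0-9])\b|\bissue[0-9]{4,6}(?![0-9])\b'

def firstissue(text):
    """Return the first issue reference in text with whitespace removed.

    Returns None when text contains no issueNNNN reference.
    """
    m = re.search(RE_ISSUE, text, re.IGNORECASE)
    if m is None:
        return None
    # "issue 4567" and "issue4567" both normalize to "issue4567".
    return ''.join(m.group(0).split())
```

re.search() stops at the first match, so no list of all matches is built and
the len(issues) > 0 check becomes a simple test for None.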

> +                issuenumber = "".join(issuenumber.split())
> +                if any(issuenumber in s for s in self.existing_points):
> +                    ui.write(_("\"%s\" already exists in notes; "
> +                             "ignoring\n") % issuenumber)
> +                    return False
> +                else:
> +                    return True
> +        if len(incoming_str.split()) > 10:
> +            merge = similaritycheck(incoming_str, self.existing_points)
> +            if not merge:
> +                ui.write(_("\"%s\" already exists in notes file; "
> +                         "ignoring\n") % incoming_str)
> +                return False
> +            else:
> +                return True
> +        else:
> +            return True

> +def converttostring(paragraphs):
> +    """
> +    Converts paragraph and bullet data to individual strings.
> +    """
> +    str = ""
> +    for para in paragraphs:
> +        str += ' '.join(para) + ' '
> +    return str

Nit: probably better to build a list of all paragraphs, and join them at once,
which should be generally cheaper in Python.

  words = []
  for para in paragraphs:
      words.extend(para)
  return ' '.join(words)

> +def similaritycheck(incoming_str, existingnotes):
> +    """
> +    Returns true when note fragment can be merged to existing notes.
> +    """
> +    merge = True
> +    for bullet in existingnotes:
> +        score = fuzz.token_set_ratio(incoming_str, bullet)
> +        if score > 75:
> +            merge = False
> +            break
> +    return merge
> +

Patch

diff -r 43c97ccdfd39 -r c83144b6b48a hgext/releasenotes.py
--- a/hgext/releasenotes.py	Sun Jul 09 23:25:02 2017 +0200
+++ b/hgext/releasenotes.py	Sun Jul 09 23:43:47 2017 +0200
@@ -14,6 +14,7 @@ 
 from __future__ import absolute_import
 
 import errno
+import fuzzywuzzy.fuzz as fuzz
 import re
 import sys
 import textwrap
@@ -45,6 +46,7 @@ 
 ]
 
 RE_DIRECTIVE = re.compile('^\.\. ([a-zA-Z0-9_]+)::\s*([^$]+)?$')
+RE_ISSUE = r'\bissue [0-9]{4,6}(?![0-9])\b|\bissue[0-9]{4,6}(?![0-9])\b'
 
 BULLET_SECTION = _('Other Changes')
 
@@ -90,26 +92,17 @@ 
 
         This is used to combine multiple sources of release notes together.
         """
+        existing = self
+
         for section in other:
+            merge = mergereleasenotes(section, existing)
             for title, paragraphs in other.titledforsection(section):
-                if self.hastitledinsection(section, title):
-                    # TODO prompt for resolution if different and running in
-                    # interactive mode.
-                    ui.write(_('%s already exists in %s section; ignoring\n') %
-                             (title, section))
-                    continue
-
-                # TODO perform similarity comparison and try to match against
-                # existing.
-                self.addtitleditem(section, title, paragraphs)
+                if merge.check_merge(ui, section, paragraphs, existing, title):
+                    self.addtitleditem(section, title, paragraphs)
 
             for paragraphs in other.nontitledforsection(section):
-                if paragraphs in self.nontitledforsection(section):
-                    continue
-
-                # TODO perform similarily comparison and try to match against
-                # existing.
-                self.addnontitleditem(section, paragraphs)
+                if merge.check_merge(ui, section, paragraphs, existing):
+                    self.addnontitleditem(section, paragraphs)
 
 class releasenotessections(object):
     def __init__(self, ui, repo=None):
@@ -145,6 +138,77 @@ 
 
         return None
 
+class mergereleasenotes(object):
+    def __init__(self, section, existing):
+        self.incoming_points = []
+        self.existing_points = self.gatherexistingnotes(section, existing)
+
+    def gatherexistingnotes(self, section, existing):
+        existing_points = []
+        for title, paragraphs in existing.titledforsection(section):
+            str = ""
+            str = converttostring(paragraphs)
+            existing_points.append(str)
+
+        for paragraphs in existing.nontitledforsection(section):
+            str = ""
+            str = converttostring(paragraphs)
+            existing_points.append(str)
+        return existing_points
+
+    def check_merge(self, ui, section, paragraphs, existing, title=None):
+        if title:
+            if existing.hastitledinsection(section, title):
+                ui.write(_('%s already exists in %s section; ignoring\n') %
+                         (title, section))
+                return False
+        elif paragraphs in existing.nontitledforsection(section):
+            return False
+
+        incoming_str = converttostring(paragraphs)
+        if section == 'fix':
+            issues = re.findall(RE_ISSUE, incoming_str, re.IGNORECASE)
+            if len(issues) > 0:
+                issuenumber = issues[0]
+                issuenumber = "".join(issuenumber.split())
+                if any(issuenumber in s for s in self.existing_points):
+                    ui.write(_("\"%s\" already exists in notes; "
+                             "ignoring\n") % issuenumber)
+                    return False
+                else:
+                    return True
+        if len(incoming_str.split()) > 10:
+            merge = similaritycheck(incoming_str, self.existing_points)
+            if not merge:
+                ui.write(_("\"%s\" already exists in notes file; "
+                         "ignoring\n") % incoming_str)
+                return False
+            else:
+                return True
+        else:
+            return True
+
+def converttostring(paragraphs):
+    """
+    Converts paragraph and bullet data to individual strings.
+    """
+    str = ""
+    for para in paragraphs:
+        str += ' '.join(para) + ' '
+    return str
+
+def similaritycheck(incoming_str, existingnotes):
+    """
+    Returns true when note fragment can be merged to existing notes.
+    """
+    merge = True
+    for bullet in existingnotes:
+        score = fuzz.token_set_ratio(incoming_str, bullet)
+        if score > 75:
+            merge = False
+            break
+    return merge
+
 def getcustomadmonitions(repo):
     custom_sections = list()
     ctx = repo['.']
diff -r 43c97ccdfd39 -r c83144b6b48a tests/test-releasenotes-formatting.t
--- a/tests/test-releasenotes-formatting.t	Sun Jul 09 23:25:02 2017 +0200
+++ b/tests/test-releasenotes-formatting.t	Sun Jul 09 23:43:47 2017 +0200
@@ -1,3 +1,5 @@ 
+#require fuzzywuzzy
+
   $ cat >> $HGRCPATH << EOF
   > [extensions]
   > releasenotes=
diff -r 43c97ccdfd39 -r c83144b6b48a tests/test-releasenotes-merging.t
--- a/tests/test-releasenotes-merging.t	Sun Jul 09 23:25:02 2017 +0200
+++ b/tests/test-releasenotes-merging.t	Sun Jul 09 23:43:47 2017 +0200
@@ -1,3 +1,5 @@ 
+#require fuzzywuzzy
+
   $ cat >> $HGRCPATH << EOF
   > [extensions]
   > releasenotes=
@@ -158,3 +160,122 @@ 
   
   * this is fix3.
 
+  $ cd ..
+
+Ignores commit messages containing issueNNNN based on issue number.
+
+  $ hg init simple-fuzzrepo
+  $ cd simple-fuzzrepo
+  $ touch fix1
+  $ hg -q commit -A -l - << EOF
+  > commit 1
+  > 
+  > .. fix::
+  > 
+  >    Resolved issue4567.
+  > EOF
+
+  $ cat >> $TESTTMP/issue-number-notes << EOF
+  > Bug Fixes
+  > =========
+  > 
+  > * Fixed issue1234 related to XYZ.
+  > 
+  > * Fixed issue4567 related to ABC.
+  > 
+  > * Fixed issue3986 related to PQR.
+  > EOF
+
+  $ hg releasenotes -r . $TESTTMP/issue-number-notes
+  "issue4567" already exists in notes; ignoring
+
+  $ cat $TESTTMP/issue-number-notes
+  Bug Fixes
+  =========
+  
+  * Fixed issue1234 related to XYZ.
+  
+  * Fixed issue4567 related to ABC.
+  
+  * Fixed issue3986 related to PQR.
+
+  $ cd ..
+
+Adds short commit messages (words < 10) without
+comparison unless there is an exact match.
+
+  $ hg init tempdir
+  $ cd tempdir
+  $ touch feature1
+  $ hg -q commit -A -l - << EOF
+  > commit 1
+  > 
+  > .. feature::
+  > 
+  >    Adds a new feature 1.
+  > EOF
+
+  $ hg releasenotes -r . $TESTTMP/short-sentence-notes
+
+  $ touch feature2
+  $ hg -q commit -A -l - << EOF
+  > commit 2
+  > 
+  > .. feature::
+  > 
+  >    Adds a new feature 2.
+  > EOF
+
+  $ hg releasenotes -r . $TESTTMP/short-sentence-notes
+  $ cat $TESTTMP/short-sentence-notes
+  New Features
+  ============
+  
+  * Adds a new feature 1.
+  
+  * Adds a new feature 2.
+
+  $ cd ..
+
+Ignores commit messages based on fuzzy comparison.
+
+  $ hg init fuzznotes
+  $ cd fuzznotes
+  $ touch fix1
+  $ hg -q commit -A -l - << EOF
+  > commit 1
+  > 
+  > .. fix::
+  > 
+  >    This is a fix with another line.
+  >    And it is a big one.
+  > EOF
+
+  $ cat >> $TESTTMP/fuzz-ignore-notes << EOF
+  > Bug Fixes
+  > =========
+  > 
+  > * Fixed issue4567 by improving X.
+  > 
+  > * This is the first line. This is next line with one newline.
+  > 
+  >   This is another line written after two newlines. This is going to be a big one.
+  > 
+  > * This fixes another problem.
+  > EOF
+
+  $ hg releasenotes -r . $TESTTMP/fuzz-ignore-notes
+  "This is a fix with another line. And it is a big one. " already exists in notes file; ignoring
+
+  $ cat $TESTTMP/fuzz-ignore-notes
+  Bug Fixes
+  =========
+  
+  * Fixed issue4567 by improving X.
+  
+  * This is the first line. This is next line with one newline.
+  
+    This is another line written after two newlines. This is going to be a big
+    one.
+  
+  * This fixes another problem.
diff -r 43c97ccdfd39 -r c83144b6b48a tests/test-releasenotes-parsing.t
--- a/tests/test-releasenotes-parsing.t	Sun Jul 09 23:25:02 2017 +0200
+++ b/tests/test-releasenotes-parsing.t	Sun Jul 09 23:43:47 2017 +0200
@@ -1,3 +1,5 @@ 
+#require fuzzywuzzy
+
   $ cat >> $HGRCPATH << EOF
   > [extensions]
   > releasenotes=