Patchwork [RFC] similar: allow similarity detection to use sha256 for digesting file contents

Submitter Katsunori FUJIWARA
Date March 1, 2017, 3:02 p.m.
Message ID <018d9759cb93f116007d.1488380529@speaknoevil>
Permalink /patch/18853/
State Accepted

Comments

Katsunori FUJIWARA - March 1, 2017, 3:02 p.m.
# HG changeset patch
# User FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
# Date 1488380487 -32400
#      Thu Mar 02 00:01:27 2017 +0900
# Node ID 018d9759cb93f116007d4640341a82db6cf2d45c
# Parent  0bb3089fe73527c64f1afc40b86ecb8dfe7fd7aa
similar: allow similarity detection to use sha256 for digesting file contents

Before this patch, the similarity detection logic (used by addremove and
automv) uses SHA-1 digesting. But this causes incorrect rename
detection if:

  - removing file A and adding file B occur in the same commit, and
  - the SHA-1 hash values of files A and B are the same

This may prevent security experts from managing sample files for the
SHAttered issue in a Mercurial repository, for example.

  https://security.googleblog.com/2017/02/announcing-first-sha1-collision.html
  https://shattered.it/

Hash collisions themselves aren't so serious for Mercurial's core
repository functionality, though, as described by mpm below.

  https://www.mercurial-scm.org/wiki/mpm/SHA1

HOW ABOUT:

  - which algorithm should we use by default: SHA-1, SHA-256, or SHA-512?

    ease of use (handling problematic files safely by default) or
    performance?

  - what name is reasonable for the config knob controlling the digest
    algorithm?

    or should we fully switch to a more secure digest algorithm?

BTW, almost all SHA-1 users in the Mercurial source tree apply it to
something other than file contents (e.g. lists of parent hash IDs, file
paths, URLs, and so on). But the SHA-1 users below apply it to file
contents.

  - patch.trydiff()

    SHA-1 is applied to the "blob SIZE\0" header + file content (only if
    experimental.extendedheader.index is enabled)

  - largefiles

The former should be less serious.

On the other hand, the latter causes unintentional unification of
largefiles that happen to have the same SHA-1 hash value when files are
checked out.
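To make the failure mode concrete, here is a small Python sketch. It is a toy model, not Mercurial's actual `similar._findexactmatches` API: the file names are hypothetical, and the deliberately weak `firstbyte` digest stands in for a real SHA-1 collision such as the SHAttered PDFs.

```python
import hashlib

def findexactmatches(removed, added, digester=hashlib.sha1):
    """Toy model of similar._findexactmatches: pair a removed file with
    an added file whenever their content digests are identical."""
    hashes = {}
    for name, data in removed:
        hashes[digester(data).digest()] = name
    for name, data in added:
        h = digester(data).digest()
        if h in hashes:
            yield (hashes[h], name)

class firstbyte(object):
    """Deliberately weak 'digest' (first byte only), simulating a real
    collision: distinct contents that produce the same digest."""
    def __init__(self, data):
        self._data = data
    def digest(self):
        return self._data[:1]

removed = [('shattered-1.pdf', b'\x01 contents of first PDF')]
added = [('shattered-2.pdf', b'\x01 contents of second PDF')]

# The colliding digest reports a bogus exact rename...
print(list(findexactmatches(removed, added, firstbyte)))
# ...while a stronger digest (sha256) correctly finds none.
print(list(findexactmatches(removed, added, hashlib.sha256)))
```

The first call yields `[('shattered-1.pdf', 'shattered-2.pdf')]` even though the contents differ; the second yields `[]`.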
via Mercurial-devel - March 1, 2017, 5:05 p.m.
On Wed, Mar 1, 2017 at 7:02 AM, FUJIWARA Katsunori
<foozy@lares.dti.ne.jp> wrote:
> # HG changeset patch
> # User FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
> # Date 1488380487 -32400
> #      Thu Mar 02 00:01:27 2017 +0900
> # Node ID 018d9759cb93f116007d4640341a82db6cf2d45c
> # Parent  0bb3089fe73527c64f1afc40b86ecb8dfe7fd7aa
> similar: allow similarity detection to use sha256 for digesting file contents
>
> Before this patch, the similarity detection logic (used by addremove and
> automv) uses SHA-1 digesting. But this causes incorrect rename
> detection if:
>
>   - removing file A and adding file B occur in the same commit, and
>   - the SHA-1 hash values of files A and B are the same

If rename detection in this very special case is all that's lost, I
don't think the extra complexity (code and config) is worth it.

> [...]
>
> HOW ABOUT:
>
>   - which algorithm should we use by default: SHA-1, SHA-256, or SHA-512?
>
>     ease of use (handling problematic files safely by default) or
>     performance?
>
>   - what name is reasonable for the config knob controlling the digest
>     algorithm?
>
>     or should we fully switch to a more secure digest algorithm?

Yes, I'd rather wait until we do that.

> [...]
>
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@mercurial-scm.org
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Gregory Szorc - March 2, 2017, 12:34 a.m.
On Wed, Mar 1, 2017 at 7:02 AM, FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
wrote:

> # HG changeset patch
> # User FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
> # Date 1488380487 -32400
> #      Thu Mar 02 00:01:27 2017 +0900
> # Node ID 018d9759cb93f116007d4640341a82db6cf2d45c
> # Parent  0bb3089fe73527c64f1afc40b86ecb8dfe7fd7aa
> similar: allow similarity detection to use sha256 for digesting file
> contents
>
> [...]
>
> HOW ABOUT:
>
>   - which algorithm should we use by default: SHA-1, SHA-256, or SHA-512?
>

SHA-512 should be faster than SHA-256 on 64-bit hardware. So, there's
likely no good reason to use SHA-256 for simple identity checks.


>
>     ease (handling problematic files safely by default) or
>     performance?
>


On my Skylake at 4.0 GHz, SHA-1 is capable of running at ~975 MB/s and
SHA-512 at ~700 MB/s. Both are fast enough that for simple one-time content
identity checks, hashing shouldn't be a bottleneck, at least not for most
repos.

So I think it is fine to change this function from SHA-1 to SHA-512
assuming the hashes don't "leak" into storage. If they end up being stored
or used for something other than identity checks, then we need to bloat
scope to discuss our general hashing future. And that needs its own thread
;)
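Greg's throughput figures are easy to reproduce approximately. The sketch below is a rough benchmark, not a rigorous one; the numbers depend heavily on the CPU and on the OpenSSL build backing `hashlib`, so treat them as indicative only.

```python
import hashlib
import time

def throughput(name, payload, rounds=20):
    """Approximate digest throughput in MB/s for one hashlib algorithm."""
    start = time.perf_counter()
    for _ in range(rounds):
        hashlib.new(name, payload).digest()
    elapsed = time.perf_counter() - start
    return len(payload) * rounds / elapsed / 1e6

payload = b'\x00' * (8 * 1024 * 1024)  # 8 MiB of zeros
for name in ('sha1', 'sha256', 'sha512'):
    print('%-7s ~%.0f MB/s' % (name, throughput(name, payload)))
```

On typical 64-bit hardware without SHA extensions, sha512 tends to beat sha256 because it processes 128-byte blocks with 64-bit arithmetic, which is the basis for Greg's "no good reason to use SHA-256" remark.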


> [...]
Mike Hommey - March 2, 2017, 12:46 a.m.
On Wed, Mar 01, 2017 at 04:34:43PM -0800, Gregory Szorc wrote:
> On Wed, Mar 1, 2017 at 7:02 AM, FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
> wrote:
> 
> > [...]
> >
> > HOW ABOUT:
> >
> >   - which algorithm should we use by default: SHA-1, SHA-256, or SHA-512?
> >
> 
> SHA-512 should be faster than SHA-256 on 64-bit hardware. So, there's
> likely no good reason to use SHA-256 for simple identity checks.
> 
> 
> >
> >     ease (handling problematic files safely by default) or
> >     performance?
> >
> 
> 
> On my Skylake at 4.0 GHz, SHA-1 is capable of running at ~975 MB/s and
> SHA-512 at ~700 MB/s. Both are fast enough that for simple one-time content
> identity checks, hashing shouldn't be a bottleneck, at least not for most
> repos.
> 
> So I think it is fine to change this function from SHA-1 to SHA-512
> assuming the hashes don't "leak" into storage. If they end up being stored
> or used for something other than identity checks, then we need to bloat
> scope to discuss our general hashing future. And that needs its own thread
> ;)

With hashing, there is *always* a risk of collision. It might be tiny,
but it still exists. Why not just compare the contents when the hashes
match? Then it doesn't really matter what the hash is: the hash is just
a shortcut to avoid comparing full contents in an O(n^2) fashion.

There aren't going to be that many hash matches anyway, so comparing the
contents then should not make a significant difference in speed, but it
would guarantee that the "similarity" is real.

(BTW, interestingly, in terms of similarity detection, while the
SHAttered PDFs are not 100% identical, they are 80%+ similar)

Mike
Gregory Szorc - March 2, 2017, 1:39 a.m.
On Wed, Mar 1, 2017 at 4:46 PM, Mike Hommey <mh@glandium.org> wrote:

> [...]
>
> With hashing, there is *always* a risk of collision. It might be tiny,
> but it still exists. Why not just compare the contents when the hashes
> match? Then it doesn't really matter what the hash is: the hash is just
> a shortcut to avoid comparing full contents in an O(n^2) fashion.
>
> There aren't going to be that many hash matches anyway, so comparing the
> contents then should not make a significant difference in speed, but it
> would guarantee that the "similarity" is real.
>

Yeah, the objects will already be in memory. Under the hood bytes.__eq__
just looks at the buffer length then falls back to memcmp, which should be
on the order of 10 GB/s. So using __eq__ here seems reasonable.


>
> (BTW, interestingly, in terms of similarity detection, while the
> SHAttered PDFs are not 100% identical, they are 80%+ similar)
>
> Mike
>
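Mike's hash-as-shortcut idea, combined with Greg's observation that `bytes.__eq__` is effectively a length check plus memcmp, could be sketched as below. This is a toy model under assumed names, not the actual Mercurial code.

```python
import hashlib

def exactmatches_verified(removed, added):
    """Exact-rename pairing using the digest only as a bucketing shortcut:
    candidates sharing a SHA-1 digest are confirmed with a full content
    comparison (bytes.__eq__, essentially a memcmp), so even a genuine
    hash collision cannot produce a false match."""
    buckets = {}
    for name, data in removed:
        buckets.setdefault(hashlib.sha1(data).digest(), []).append((name, data))
    for name, data in added:
        for rname, rdata in buckets.get(hashlib.sha1(data).digest(), ()):
            if rdata == data:  # collision-proof confirmation
                yield (rname, name)
                break

removed = [('old.txt', b'hello world'), ('gone.txt', b'something else')]
added = [('new.txt', b'hello world'), ('fresh.txt', b'brand new')]
print(list(exactmatches_verified(removed, added)))
# [('old.txt', 'new.txt')]
```

With this design the choice of digest no longer affects correctness, only how evenly the buckets spread, which is the point Mike makes above.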

Patch

diff --git a/mercurial/similar.py b/mercurial/similar.py
--- a/mercurial/similar.py
+++ b/mercurial/similar.py
@@ -12,9 +12,15 @@  import hashlib
 from .i18n import _
 from . import (
     bdiff,
+    error,
     mdiff,
 )
 
+DIGESTERS = {
+    'sha1': hashlib.sha1,
+    'sha256': hashlib.sha256,
+}
+
 def _findexactmatches(repo, added, removed):
     '''find renamed files that have no changes
 
@@ -23,19 +29,25 @@  def _findexactmatches(repo, added, remov
     '''
     numfiles = len(added) + len(removed)
 
+    digest = repo.ui.config('ui', 'similaritydigest', 'sha256')
+    if digest not in DIGESTERS:
+        raise error.ConfigError(_("ui.similaritydigest value is invalid: %s")
+                                % digest)
+    digester = DIGESTERS[digest]
+
     # Get hashes of removed files.
     hashes = {}
     for i, fctx in enumerate(removed):
         repo.ui.progress(_('searching for exact renames'), i, total=numfiles,
                          unit=_('files'))
-        h = hashlib.sha1(fctx.data()).digest()
+        h = digester(fctx.data()).digest()
         hashes[h] = fctx
 
     # For each added file, see if it corresponds to a removed file.
     for i, fctx in enumerate(added):
         repo.ui.progress(_('searching for exact renames'), i + len(removed),
                 total=numfiles, unit=_('files'))
-        h = hashlib.sha1(fctx.data()).digest()
+        h = digester(fctx.data()).digest()
         if h in hashes:
             yield (hashes[h], fctx)