Patchwork [alternative-version] pathencode: for long paths, strip first 5 chars, not first dir

login
register
mail settings
Submitter Martin von Zweigbergk
Date May 8, 2015, 7:59 p.m.
Message ID <baa1fb9f49617b0d554c.1431115145@martinvonz.mtv.corp.google.com>
Download mbox | patch
Permalink /patch/8984/
State Accepted
Commit 297ea0df75d0f7ab552fc8986d04c8196de8d9dd
Delegated to: Matt Mackall
Headers show

Comments

Martin von Zweigbergk - May 8, 2015, 7:59 p.m.
# HG changeset patch
# User Martin von Zweigbergk <martinvonz@google.com>
# Date 1430953094 25200
#      Wed May 06 15:58:14 2015 -0700
# Node ID baa1fb9f49617b0d554c6c47af9838922baf9a35
# Parent  8179af513aebf96c4902ba3e5e3cf710d49501e4
pathencode: for long paths, strip first 5 chars, not first dir

When encoding long paths, the pure Python code strips the first
directory from the path, while the native code currently strips the
first 5 characters. This discrepancy has not been a problem so far,
since we have not stored anything in directories other than
data/. However, we will soon be storing submanifest revlogs in
metadata/, so the discrepancy will have to go [1]. Since file
collisions are avoided by the hashing alone (which is done on the full
unencoded path), it doesn't really matter whether we drop the first
dir, the first 5 characters, or special-case non-data/. To avoid
touching the C code, let's always strip the first 5 characters.

 [1] Or maybe elsewhere, but the discrepancy is ugly either way.
Adrian Buehlmann - May 8, 2015, 9:13 p.m.
On 2015-05-08 21:59, Martin von Zweigbergk wrote:
> # HG changeset patch
> # User Martin von Zweigbergk <martinvonz@google.com>
> # Date 1430953094 25200
> #      Wed May 06 15:58:14 2015 -0700
> # Node ID baa1fb9f49617b0d554c6c47af9838922baf9a35
> # Parent  8179af513aebf96c4902ba3e5e3cf710d49501e4
> pathencode: for long paths, strip first 5 chars, not first dir
> 
> When encoding long paths, the pure Python code strips the first
> directory from the path, while the native code currently strips the
> first 5 characters. This discrepancy has not been a problem so far,
> since we have not stored anything in directories other than
> data/. However, we will soon be storing submanifest revlogs in
> metadata/, so the discrepancy will have to go [1]. Since file
> collisions are avoided by the hashing alone (which is done on the full
> unencoded path), it doesn't really matter whether we drop the first
> dir, the first 5 characters, or special-case non-data/. To avoid
> touching the C code, let's always strip the first 5 characters.
> 
>  [1] Or maybe elsewhere, but the discrepancy is ugly either way.
> 
> diff -r 8179af513aeb -r baa1fb9f4961 mercurial/store.py
> --- a/mercurial/store.py	Thu May 07 16:43:58 2015 -0700
> +++ b/mercurial/store.py	Wed May 06 15:58:14 2015 -0700
> @@ -187,7 +187,7 @@
>  
>  def _hashencode(path, dotencode):
>      digest = _sha(path).hexdigest()
> -    le = lowerencode(path).split('/')[1:]
> +    le = lowerencode(path[5:]).split('/')
>      parts = _auxencode(le, dotencode)
>      basename = parts[-1]
>      _root, ext = os.path.splitext(basename)
> diff -r 8179af513aeb -r baa1fb9f4961 tests/test-hybridencode.py
> --- a/tests/test-hybridencode.py	Thu May 07 16:43:58 2015 -0700
> +++ b/tests/test-hybridencode.py	Wed May 06 15:58:14 2015 -0700
> @@ -460,3 +460,9 @@
>            'VWXYZ-1234567890-xxxxxxxxx-xxxxxxxxx-xxxxxxxx-xxxx'
>            'xxxxx-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwww'
>            'wwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww.i')
> +
> +print "paths outside data/ can be encoded"
> +show('metadata/dir/00manifest.i')
> +show('metadata/12345678/12345678/12345678/12345678/12345678/'
> +          '12345678/12345678/12345678/12345678/12345678/12345678/'
> +          '12345678/12345678/00manifest.i')
> diff -r 8179af513aeb -r baa1fb9f4961 tests/test-hybridencode.py.out
> --- a/tests/test-hybridencode.py.out	Thu May 07 16:43:58 2015 -0700
> +++ b/tests/test-hybridencode.py.out	Wed May 06 15:58:14 2015 -0700
> @@ -491,3 +491,10 @@
>  A = 'data/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345/-xxxxxxxxx-xxxxxxxxx-xxxxxxxxx-123456789-12.3456789-12345-ABCDEFGHIJKLMNOPRSTUVWXYZ-abcdefghjiklmnopqrstuvwxyz-ABCDEFGHIJKLMNOPRSTUVWXYZ-1234567890-xxxxxxxxx-xxxxxxxxx-xxxxxxxx-xxxxxxxxx-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww.i'
>  B = 'dh/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345/-xxxxx93352aa50377751d9e5ebdf52da1e6e69a6887a6.i'
>  
> +paths outside data/ can be encoded
> +A = 'metadata/dir/00manifest.i'
> +B = 'metadata/dir/00manifest.i'
> +
> +A = 'metadata/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/00manifest.i'
> +B = 'dh/ata/12345678/12345678/12345678/12345678/12345678/12345678/12345678/00manife0a4da1f89aa2aa9eb0896eb451288419049781b4.i'
> +

The current layout (using the current, 'fncache', repo format) for
regular revlogs under .hg/store/ is:

  data/      for non-path-hashed revlogs
  dh/        for path-hashed-revlogs

You - as I understand it - want to separately store submanifests under a
new directory named 'metadata', under 'store' (leaving aside for now
that I don't like the name 'metadata', because of its length).

Don't you then also need a separate directory for paths of submanifests
that trigger the path-hash-encoding?

If you store path-hashed submanifest revlogs under 'dh', together with
non-submanifest revlogs, won't that lead to potential path collisions
inside dh?

The trailing 'ata' of 'metadata' doesn't look like a correct
discriminator to me, as that string may derive from the working tree as
well.
Martin von Zweigbergk - May 8, 2015, 9:31 p.m.
On Fri, May 8, 2015 at 2:13 PM Adrian Buehlmann <adrian@cadifra.com> wrote:

> On 2015-05-08 21:59, Martin von Zweigbergk wrote:
> > # HG changeset patch
> > # User Martin von Zweigbergk <martinvonz@google.com>
> > # Date 1430953094 25200
> > #      Wed May 06 15:58:14 2015 -0700
> > # Node ID baa1fb9f49617b0d554c6c47af9838922baf9a35
> > # Parent  8179af513aebf96c4902ba3e5e3cf710d49501e4
> > pathencode: for long paths, strip first 5 chars, not first dir
> >
> > When encoding long paths, the pure Python code strips the first
> > directory from the path, while the native code currently strips the
> > first 5 characters. This discrepancy has not been a problem so far,
> > since we have not stored anything in directories other than
> > data/. However, we will soon be storing submanifest revlogs in
> > metadata/, so the discrepancy will have to go [1]. Since file
> > collisions are avoided by the hashing alone (which is done on the full
> > unencoded path), it doesn't really matter whether we drop the first
> > dir, the first 5 characters, or special-case non-data/. To avoid
> > touching the C code, let's always strip the first 5 characters.
> >
> >  [1] Or maybe elsewhere, but the discrepancy is ugly either way.
> >
> > diff -r 8179af513aeb -r baa1fb9f4961 mercurial/store.py
> > --- a/mercurial/store.py      Thu May 07 16:43:58 2015 -0700
> > +++ b/mercurial/store.py      Wed May 06 15:58:14 2015 -0700
> > @@ -187,7 +187,7 @@
> >
> >  def _hashencode(path, dotencode):
> >      digest = _sha(path).hexdigest()
> > -    le = lowerencode(path).split('/')[1:]
> > +    le = lowerencode(path[5:]).split('/')
> >      parts = _auxencode(le, dotencode)
> >      basename = parts[-1]
> >      _root, ext = os.path.splitext(basename)
> > diff -r 8179af513aeb -r baa1fb9f4961 tests/test-hybridencode.py
> > --- a/tests/test-hybridencode.py      Thu May 07 16:43:58 2015 -0700
> > +++ b/tests/test-hybridencode.py      Wed May 06 15:58:14 2015 -0700
> > @@ -460,3 +460,9 @@
> >            'VWXYZ-1234567890-xxxxxxxxx-xxxxxxxxx-xxxxxxxx-xxxx'
> >            'xxxxx-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwww'
> >            'wwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww.i')
> > +
> > +print "paths outside data/ can be encoded"
> > +show('metadata/dir/00manifest.i')
> > +show('metadata/12345678/12345678/12345678/12345678/12345678/'
> > +          '12345678/12345678/12345678/12345678/12345678/12345678/'
> > +          '12345678/12345678/00manifest.i')
> > diff -r 8179af513aeb -r baa1fb9f4961 tests/test-hybridencode.py.out
> > --- a/tests/test-hybridencode.py.out  Thu May 07 16:43:58 2015 -0700
> > +++ b/tests/test-hybridencode.py.out  Wed May 06 15:58:14 2015 -0700
> > @@ -491,3 +491,10 @@
> >  A =
> 'data/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345/-xxxxxxxxx-xxxxxxxxx-xxxxxxxxx-123456789-12.3456789-12345-ABCDEFGHIJKLMNOPRSTUVWXYZ-abcdefghjiklmnopqrstuvwxyz-ABCDEFGHIJKLMNOPRSTUVWXYZ-1234567890-xxxxxxxxx-xxxxxxxxx-xxxxxxxx-xxxxxxxxx-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww.i'
> >  B =
> 'dh/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345/-xxxxx93352aa50377751d9e5ebdf52da1e6e69a6887a6.i'
> >
> > +paths outside data/ can be encoded
> > +A = 'metadata/dir/00manifest.i'
> > +B = 'metadata/dir/00manifest.i'
> > +
> > +A =
> 'metadata/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/00manifest.i'
> > +B =
> 'dh/ata/12345678/12345678/12345678/12345678/12345678/12345678/12345678/00manife0a4da1f89aa2aa9eb0896eb451288419049781b4.i'
> > +
>
> The current layout (using the current, 'fncache', repo format) for
> regular revlogs under .hg/store/ is:
>
>   data/      for non-path-hashed revlogs
>   dh/        for path-hashed-revlogs
>
> You - as I understand it - want to separately store submanifests under a
> new directory named 'metadata', under 'store' (leaving aside for now
> that I don't like the name 'metadata', because of its length).
>
> Don't you then also need a separate directory for paths of submanifests
> that trigger the path-hash-encoding?
>
> If you store path-hashed submanifest revlogs under 'dh', together with
> non-submanifest revlogs, won't that lead to potential path collisions
> inside dh?


I think not. That's what I tried to explain with the "collisions are
avoided by the hashing alone" bit.


>


> The trailing 'ata' of 'metadata' doesn't look like a correct
> discriminator to me, as that string may derive from the working tree as
> well.
>
Adrian Buehlmann - May 8, 2015, 10:22 p.m.
On 2015-05-08 23:31, Martin von Zweigbergk wrote:
> 
> 
> On Fri, May 8, 2015 at 2:13 PM Adrian Buehlmann <adrian@cadifra.com
> <mailto:adrian@cadifra.com>> wrote:
> 
>     On 2015-05-08 21:59, Martin von Zweigbergk wrote:
>     > # HG changeset patch
>     > # User Martin von Zweigbergk <martinvonz@google.com
>     <mailto:martinvonz@google.com>>
>     > # Date 1430953094 25200
>     > #      Wed May 06 15:58:14 2015 -0700
>     > # Node ID baa1fb9f49617b0d554c6c47af9838922baf9a35
>     > # Parent  8179af513aebf96c4902ba3e5e3cf710d49501e4
>     > pathencode: for long paths, strip first 5 chars, not first dir
>     >
>     > When encoding long paths, the pure Python code strips the first
>     > directory from the path, while the native code currently strips the
>     > first 5 characters. This discrepancy has not been a problem so far,
>     > since we have not stored anything in directories other than
>     > data/. However, we will soon be storing submanifest revlogs in
>     > metadata/, so the discrepancy will have to go [1]. Since file
>     > collisions are avoided by the hashing alone (which is done on the full
>     > unencoded path), it doesn't really matter whether we drop the first
>     > dir, the first 5 characters, or special-case non-data/. To avoid
>     > touching the C code, let's always strip the first 5 characters.
>     >
>     >  [1] Or maybe elsewhere, but the discrepancy is ugly either way.
>     >
>     > diff -r 8179af513aeb -r baa1fb9f4961 mercurial/store.py
>     > --- a/mercurial/store.py      Thu May 07 16:43:58 2015 -0700
>     > +++ b/mercurial/store.py      Wed May 06 15:58:14 2015 -0700
>     > @@ -187,7 +187,7 @@
>     >
>     >  def _hashencode(path, dotencode):
>     >      digest = _sha(path).hexdigest()
>     > -    le = lowerencode(path).split('/')[1:]
>     > +    le = lowerencode(path[5:]).split('/')
>     >      parts = _auxencode(le, dotencode)
>     >      basename = parts[-1]
>     >      _root, ext = os.path.splitext(basename)
>     > diff -r 8179af513aeb -r baa1fb9f4961 tests/test-hybridencode.py
>     > --- a/tests/test-hybridencode.py      Thu May 07 16:43:58 2015 -0700
>     > +++ b/tests/test-hybridencode.py      Wed May 06 15:58:14 2015 -0700
>     > @@ -460,3 +460,9 @@
>     >            'VWXYZ-1234567890-xxxxxxxxx-xxxxxxxxx-xxxxxxxx-xxxx'
>     >            'xxxxx-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwww'
>     >            'wwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww.i')
>     > +
>     > +print "paths outside data/ can be encoded"
>     > +show('metadata/dir/00manifest.i')
>     > +show('metadata/12345678/12345678/12345678/12345678/12345678/'
>     > +          '12345678/12345678/12345678/12345678/12345678/12345678/'
>     > +          '12345678/12345678/00manifest.i')
>     > diff -r 8179af513aeb -r baa1fb9f4961 tests/test-hybridencode.py.out
>     > --- a/tests/test-hybridencode.py.out  Thu May 07 16:43:58 2015 -0700
>     > +++ b/tests/test-hybridencode.py.out  Wed May 06 15:58:14 2015 -0700
>     > @@ -491,3 +491,10 @@
>     >  A =
>     'data/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345/-xxxxxxxxx-xxxxxxxxx-xxxxxxxxx-123456789-12.3456789-12345-ABCDEFGHIJKLMNOPRSTUVWXYZ-abcdefghjiklmnopqrstuvwxyz-ABCDEFGHIJKLMNOPRSTUVWXYZ-1234567890-xxxxxxxxx-xxxxxxxxx-xxxxxxxx-xxxxxxxxx-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww.i'
>     >  B =
>     'dh/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345/-xxxxx93352aa50377751d9e5ebdf52da1e6e69a6887a6.i'
>     >
>     > +paths outside data/ can be encoded
>     > +A = 'metadata/dir/00manifest.i'
>     > +B = 'metadata/dir/00manifest.i'
>     > +
>     > +A =
>     'metadata/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/00manifest.i'
>     > +B =
>     'dh/ata/12345678/12345678/12345678/12345678/12345678/12345678/12345678/00manife0a4da1f89aa2aa9eb0896eb451288419049781b4.i'
>     > +
> 
>     The current layout (using the current, 'fncache', repo format) for
>     regular revlogs under .hg/store/ is:
> 
>       data/      for non-path-hashed revlogs
>       dh/        for path-hashed-revlogs
> 
>     You - as I understand it - want to separately store submanifests under a
>     new directory named 'metadata', under 'store' (leaving aside for now
>     that I don't like the name 'metadata', because of its length).
> 
>     Don't you then also need a separate directory for paths of submanifests
>     that trigger the path-hash-encoding?
> 
>     If you store path-hashed submanifest revlogs under 'dh', together with
>     non-submanifest revlogs, won't that lead to potential path collisions
>     inside dh?
> 
> 
> I think not. That's what I tried to explain with the "collisions are
> avoided by the hashing alone" bit.
>  

I do believe you are wrong. The name "metadata" (or parts of it)
together with 00manifest.i (with the .i encoded) may appear as a part of
the users files in the working tree.

Hashing won't solve that collision.

> 
> 
>     The trailing 'ata' of 'metadata' doesn't look like a correct
>     discriminator to me, as that string may derive from the working tree as
>     well.
>
Martin von Zweigbergk - May 8, 2015, 10:27 p.m.
On Fri, May 8, 2015 at 3:22 PM Adrian Buehlmann <adrian@cadifra.com> wrote:

> On 2015-05-08 23:31, Martin von Zweigbergk wrote:
> >
> >
> > On Fri, May 8, 2015 at 2:13 PM Adrian Buehlmann <adrian@cadifra.com
> > <mailto:adrian@cadifra.com>> wrote:
> >
> >     On 2015-05-08 21:59, Martin von Zweigbergk wrote:
> >     > # HG changeset patch
> >     > # User Martin von Zweigbergk <martinvonz@google.com
> >     <mailto:martinvonz@google.com>>
> >     > # Date 1430953094 25200
> >     > #      Wed May 06 15:58:14 2015 -0700
> >     > # Node ID baa1fb9f49617b0d554c6c47af9838922baf9a35
> >     > # Parent  8179af513aebf96c4902ba3e5e3cf710d49501e4
> >     > pathencode: for long paths, strip first 5 chars, not first dir
> >     >
> >     > When encoding long paths, the pure Python code strips the first
> >     > directory from the path, while the native code currently strips the
> >     > first 5 characters. This discrepancy has not been a problem so far,
> >     > since we have not stored anything in directories other than
> >     > data/. However, we will soon be storing submanifest revlogs in
> >     > metadata/, so the discrepancy will have to go [1]. Since file
> >     > collisions are avoided by the hashing alone (which is done on the
> full
> >     > unencoded path), it doesn't really matter whether we drop the first
> >     > dir, the first 5 characters, or special-case non-data/. To avoid
> >     > touching the C code, let's always strip the first 5 characters.
> >     >
> >     >  [1] Or maybe elsewhere, but the discrepancy is ugly either way.
> >     >
> >     > diff -r 8179af513aeb -r baa1fb9f4961 mercurial/store.py
> >     > --- a/mercurial/store.py      Thu May 07 16:43:58 2015 -0700
> >     > +++ b/mercurial/store.py      Wed May 06 15:58:14 2015 -0700
> >     > @@ -187,7 +187,7 @@
> >     >
> >     >  def _hashencode(path, dotencode):
> >     >      digest = _sha(path).hexdigest()
> >     > -    le = lowerencode(path).split('/')[1:]
> >     > +    le = lowerencode(path[5:]).split('/')
> >     >      parts = _auxencode(le, dotencode)
> >     >      basename = parts[-1]
> >     >      _root, ext = os.path.splitext(basename)
> >     > diff -r 8179af513aeb -r baa1fb9f4961 tests/test-hybridencode.py
> >     > --- a/tests/test-hybridencode.py      Thu May 07 16:43:58 2015
> -0700
> >     > +++ b/tests/test-hybridencode.py      Wed May 06 15:58:14 2015
> -0700
> >     > @@ -460,3 +460,9 @@
> >     >            'VWXYZ-1234567890-xxxxxxxxx-xxxxxxxxx-xxxxxxxx-xxxx'
> >     >            'xxxxx-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwww'
> >     >            'wwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww.i')
> >     > +
> >     > +print "paths outside data/ can be encoded"
> >     > +show('metadata/dir/00manifest.i')
> >     > +show('metadata/12345678/12345678/12345678/12345678/12345678/'
> >     > +          '12345678/12345678/12345678/12345678/12345678/12345678/'
> >     > +          '12345678/12345678/00manifest.i')
> >     > diff -r 8179af513aeb -r baa1fb9f4961 tests/test-hybridencode.py.out
> >     > --- a/tests/test-hybridencode.py.out  Thu May 07 16:43:58 2015
> -0700
> >     > +++ b/tests/test-hybridencode.py.out  Wed May 06 15:58:14 2015
> -0700
> >     > @@ -491,3 +491,10 @@
> >     >  A =
> >
>  'data/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345/-xxxxxxxxx-xxxxxxxxx-xxxxxxxxx-123456789-12.3456789-12345-ABCDEFGHIJKLMNOPRSTUVWXYZ-abcdefghjiklmnopqrstuvwxyz-ABCDEFGHIJKLMNOPRSTUVWXYZ-1234567890-xxxxxxxxx-xxxxxxxxx-xxxxxxxx-xxxxxxxxx-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww.i'
> >     >  B =
> >
>  'dh/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345/-xxxxx93352aa50377751d9e5ebdf52da1e6e69a6887a6.i'
> >     >
> >     > +paths outside data/ can be encoded
> >     > +A = 'metadata/dir/00manifest.i'
> >     > +B = 'metadata/dir/00manifest.i'
> >     > +
> >     > +A =
> >
>  'metadata/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/00manifest.i'
> >     > +B =
> >
>  'dh/ata/12345678/12345678/12345678/12345678/12345678/12345678/12345678/00manife0a4da1f89aa2aa9eb0896eb451288419049781b4.i'
> >     > +
> >
> >     The current layout (using the current, 'fncache', repo format) for
> >     regular revlogs under .hg/store/ is:
> >
> >       data/      for non-path-hashed revlogs
> >       dh/        for path-hashed-revlogs
> >
> >     You - as I understand it - want to separately store submanifests
> under a
> >     new directory named 'metadata', under 'store' (leaving aside for now
> >     that I don't like the name 'metadata', because of its length).
> >
> >     Don't you then also need a separate directory for paths of
> submanifests
> >     that trigger the path-hash-encoding?
> >
> >     If you store path-hashed submanifest revlogs under 'dh', together
> with
> >     non-submanifest revlogs, won't that lead to potential path collisions
> >     inside dh?
> >
> >
> > I think not. That's what I tried to explain with the "collisions are
> > avoided by the hashing alone" bit.
> >
>
> I do believe you are wrong. The name "metadata" (or parts of it)
> together with 00manifest.i (with the .i encoded) may appear as a part of
> the users files in the working tree.
>
> Hashing won't solve that collision.
>

It will (in the vast majority of cases) if what is hashed is different.
That's what I tried to explain with the "which is done on the full
unencoded path" bit. The submanifest will have a path that starts with
"metadata/", while the user's own file in a directory called "metadata"
will start with "data/metadata/". At least from my reading of the code.


>
> >
> >
> >     The trailing 'ata' of 'metadata' doesn't look like a correct
> >     discriminator to me, as that string may derive from the working tree
> as
> >     well.
> >
>
Adrian Buehlmann - May 8, 2015, 11:28 p.m.
On 2015-05-09 00:27, Martin von Zweigbergk wrote:
> 
> 
> On Fri, May 8, 2015 at 3:22 PM Adrian Buehlmann <adrian@cadifra.com
> <mailto:adrian@cadifra.com>> wrote:
> 
>     On 2015-05-08 23:31, Martin von Zweigbergk wrote:
>     >
>     >
>     > On Fri, May 8, 2015 at 2:13 PM Adrian Buehlmann
>     <adrian@cadifra.com <mailto:adrian@cadifra.com>
>     > <mailto:adrian@cadifra.com <mailto:adrian@cadifra.com>>> wrote:
>     >
>     >     On 2015-05-08 21:59, Martin von Zweigbergk wrote:
>     >     > # HG changeset patch
>     >     > # User Martin von Zweigbergk <martinvonz@google.com
>     <mailto:martinvonz@google.com>
>     >     <mailto:martinvonz@google.com <mailto:martinvonz@google.com>>>
>     >     > # Date 1430953094 25200
>     >     > #      Wed May 06 15:58:14 2015 -0700
>     >     > # Node ID baa1fb9f49617b0d554c6c47af9838922baf9a35
>     >     > # Parent  8179af513aebf96c4902ba3e5e3cf710d49501e4
>     >     > pathencode: for long paths, strip first 5 chars, not first dir
>     >     >
>     >     > When encoding long paths, the pure Python code strips the first
>     >     > directory from the path, while the native code currently
>     strips the
>     >     > first 5 characters. This discrepancy has not been a problem
>     so far,
>     >     > since we have not stored anything in directories other than
>     >     > data/. However, we will soon be storing submanifest revlogs in
>     >     > metadata/, so the discrepancy will have to go [1]. Since file
>     >     > collisions are avoided by the hashing alone (which is done
>     on the full
>     >     > unencoded path), it doesn't really matter whether we drop
>     the first
>     >     > dir, the first 5 characters, or special-case non-data/. To avoid
>     >     > touching the C code, let's always strip the first 5 characters.
>     >     >
>     >     >  [1] Or maybe elsewhere, but the discrepancy is ugly either way.
>     >     >
>     >     > diff -r 8179af513aeb -r baa1fb9f4961 mercurial/store.py
>     >     > --- a/mercurial/store.py      Thu May 07 16:43:58 2015 -0700
>     >     > +++ b/mercurial/store.py      Wed May 06 15:58:14 2015 -0700
>     >     > @@ -187,7 +187,7 @@
>     >     >
>     >     >  def _hashencode(path, dotencode):
>     >     >      digest = _sha(path).hexdigest()
>     >     > -    le = lowerencode(path).split('/')[1:]
>     >     > +    le = lowerencode(path[5:]).split('/')
>     >     >      parts = _auxencode(le, dotencode)
>     >     >      basename = parts[-1]
>     >     >      _root, ext = os.path.splitext(basename)
>     >     > diff -r 8179af513aeb -r baa1fb9f4961 tests/test-hybridencode.py
>     >     > --- a/tests/test-hybridencode.py      Thu May 07 16:43:58
>     2015 -0700
>     >     > +++ b/tests/test-hybridencode.py      Wed May 06 15:58:14
>     2015 -0700
>     >     > @@ -460,3 +460,9 @@
>     >     >            'VWXYZ-1234567890-xxxxxxxxx-xxxxxxxxx-xxxxxxxx-xxxx'
>     >     >            'xxxxx-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwww'
>     >     >            'wwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww.i')
>     >     > +
>     >     > +print "paths outside data/ can be encoded"
>     >     > +show('metadata/dir/00manifest.i')
>     >     > +show('metadata/12345678/12345678/12345678/12345678/12345678/'
>     >     > +         
>     '12345678/12345678/12345678/12345678/12345678/12345678/'
>     >     > +          '12345678/12345678/00manifest.i')
>     >     > diff -r 8179af513aeb -r baa1fb9f4961
>     tests/test-hybridencode.py.out
>     >     > --- a/tests/test-hybridencode.py.out  Thu May 07 16:43:58
>     2015 -0700
>     >     > +++ b/tests/test-hybridencode.py.out  Wed May 06 15:58:14
>     2015 -0700
>     >     > @@ -491,3 +491,10 @@
>     >     >  A =
>     >   
>      'data/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345/-xxxxxxxxx-xxxxxxxxx-xxxxxxxxx-123456789-12.3456789-12345-ABCDEFGHIJKLMNOPRSTUVWXYZ-abcdefghjiklmnopqrstuvwxyz-ABCDEFGHIJKLMNOPRSTUVWXYZ-1234567890-xxxxxxxxx-xxxxxxxxx-xxxxxxxx-xxxxxxxxx-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww.i'
>     >     >  B =
>     >   
>      'dh/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345/-xxxxx93352aa50377751d9e5ebdf52da1e6e69a6887a6.i'
>     >     >
>     >     > +paths outside data/ can be encoded
>     >     > +A = 'metadata/dir/00manifest.i'
>     >     > +B = 'metadata/dir/00manifest.i'
>     >     > +
>     >     > +A =
>     >   
>      'metadata/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/00manifest.i'
>     >     > +B =
>     >   
>      'dh/ata/12345678/12345678/12345678/12345678/12345678/12345678/12345678/00manife0a4da1f89aa2aa9eb0896eb451288419049781b4.i'
>     >     > +
>     >
>     >     The current layout (using the current, 'fncache', repo format) for
>     >     regular revlogs under .hg/store/ is:
>     >
>     >       data/      for non-path-hashed revlogs
>     >       dh/        for path-hashed-revlogs
>     >
>     >     You - as I understand it - want to separately store
>     submanifests under a
>     >     new directory named 'metadata', under 'store' (leaving aside
>     for now
>     >     that I don't like the name 'metadata', because of its length).
>     >
>     >     Don't you then also need a separate directory for paths of
>     submanifests
>     >     that trigger the path-hash-encoding?
>     >
>     >     If you store path-hashed submanifest revlogs under 'dh',
>     together with
>     >     non-submanifest revlogs, won't that lead to potential path
>     collisions
>     >     inside dh?
>     >
>     >
>     > I think not. That's what I tried to explain with the "collisions are
>     > avoided by the hashing alone" bit.
>     >
> 
>     I do believe you are wrong. The name "metadata" (or parts of it)
>     together with 00manifest.i (with the .i encoded) may appear as a part of
>     the users files in the working tree.
> 
>     Hashing won't solve that collision.
> 
> 
> It will (in the vast majority of cases) if what is hashed is different.
> That's what I tried to explain with the "which is done on the full
> unencoded path" bit. The submanifest will have a path that starts with
> "metadata/", while the user's own file in a directory called "metadata"
> will start with "data/metadata/". At least from my reading of the code.

You're right. The input into the hashing function includes the 'data'
(or 'metadata', in the case of a submanifest paths) prefix, it is added
before and not stripped for the hashing.

I still don't like the name 'metadata' for the additional directory,
though. I don't see the point of taking a name, which is that long. It
will needlessly trigger hashing earlier, making it more frequent than
with a shorter path prefix.

Of course, I agree with bringing the python code in line with the C code.
Matt Mackall - May 13, 2015, 5:57 p.m.
On Fri, 2015-05-08 at 12:59 -0700, Martin von Zweigbergk wrote:
> # HG changeset patch
> # User Martin von Zweigbergk <martinvonz@google.com>
> # Date 1430953094 25200
> #      Wed May 06 15:58:14 2015 -0700
> # Node ID baa1fb9f49617b0d554c6c47af9838922baf9a35
> # Parent  8179af513aebf96c4902ba3e5e3cf710d49501e4
> pathencode: for long paths, strip first 5 chars, not first dir

Queued for stable, thanks. It's obviously correct that the C and Python
implementations should agree, which is the actual substance of this
patch.

Patch

diff -r 8179af513aeb -r baa1fb9f4961 mercurial/store.py
--- a/mercurial/store.py	Thu May 07 16:43:58 2015 -0700
+++ b/mercurial/store.py	Wed May 06 15:58:14 2015 -0700
@@ -187,7 +187,7 @@ 
 
 def _hashencode(path, dotencode):
     digest = _sha(path).hexdigest()
-    le = lowerencode(path).split('/')[1:]
+    le = lowerencode(path[5:]).split('/')
     parts = _auxencode(le, dotencode)
     basename = parts[-1]
     _root, ext = os.path.splitext(basename)
diff -r 8179af513aeb -r baa1fb9f4961 tests/test-hybridencode.py
--- a/tests/test-hybridencode.py	Thu May 07 16:43:58 2015 -0700
+++ b/tests/test-hybridencode.py	Wed May 06 15:58:14 2015 -0700
@@ -460,3 +460,9 @@ 
           'VWXYZ-1234567890-xxxxxxxxx-xxxxxxxxx-xxxxxxxx-xxxx'
           'xxxxx-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwww'
           'wwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww.i')
+
+print "paths outside data/ can be encoded"
+show('metadata/dir/00manifest.i')
+show('metadata/12345678/12345678/12345678/12345678/12345678/'
+          '12345678/12345678/12345678/12345678/12345678/12345678/'
+          '12345678/12345678/00manifest.i')
diff -r 8179af513aeb -r baa1fb9f4961 tests/test-hybridencode.py.out
--- a/tests/test-hybridencode.py.out	Thu May 07 16:43:58 2015 -0700
+++ b/tests/test-hybridencode.py.out	Wed May 06 15:58:14 2015 -0700
@@ -491,3 +491,10 @@ 
 A = 'data/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345/-xxxxxxxxx-xxxxxxxxx-xxxxxxxxx-123456789-12.3456789-12345-ABCDEFGHIJKLMNOPRSTUVWXYZ-abcdefghjiklmnopqrstuvwxyz-ABCDEFGHIJKLMNOPRSTUVWXYZ-1234567890-xxxxxxxxx-xxxxxxxxx-xxxxxxxx-xxxxxxxxx-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww.i'
 B = 'dh/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345/-xxxxx93352aa50377751d9e5ebdf52da1e6e69a6887a6.i'
 
+paths outside data/ can be encoded
+A = 'metadata/dir/00manifest.i'
+B = 'metadata/dir/00manifest.i'
+
+A = 'metadata/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/00manifest.i'
+B = 'dh/ata/12345678/12345678/12345678/12345678/12345678/12345678/12345678/00manife0a4da1f89aa2aa9eb0896eb451288419049781b4.i'
+