Patchwork [4,of,5] manifestv2: add support for reading new manifest format

Submitter Martin von Zweigbergk
Date April 1, 2015, 5:34 p.m.
Message ID <aca6ee57dddf4b397328.1427909689@martinvonz.mtv.corp.google.com>
Permalink /patch/8419/
State Accepted

Comments

Martin von Zweigbergk - April 1, 2015, 5:34 p.m.
# HG changeset patch
# User Martin von Zweigbergk <martinvonz@google.com>
# Date 1427520401 25200
#      Fri Mar 27 22:26:41 2015 -0700
# Node ID aca6ee57dddf4b39732833a2bb603dcd19148754
# Parent  7530c75651b04d04e0871c84dfda487b4e9e96b4
manifestv2: add support for reading new manifest format

The new manifest format is designed to be smaller, in particular to
produce smaller deltas. It stores hashes in binary and puts the hash
on a new line (for smaller deltas). It also uses stem compression to
save space for long paths. The format has room for metadata, but
that's there only for future-proofing. The parser thus accepts any
metadata and throws it away. For more information, see
http://mercurial.selenic.com/wiki/ManifestV2Plan.

The current manifest format doesn't allow an empty filename, so we use
an empty filename on the first line to tell a manifest of the new
format from the old. Since we still never write manifests in the new
format, the added code is unused, but it is tested by
test-manifest.py.
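
As a rough illustration of the two ideas in the message above (this is not the code from the patch), the sketch below shows how a reader might tell the formats apart and how stem compression saves space on long paths. It assumes v1 entries have the form `<path>\0<hex nodeid><flags>\n`, so a leading NUL byte can only mean an empty first filename. The helper names `looks_like_v2`, `stem_compress` and `stem_decompress` are hypothetical, and the (shared length, suffix) pairs are a generic encoding, not the on-disk v2 layout.

```python
def looks_like_v2(data: bytes) -> bool:
    # Assumption: a v1 manifest never starts with NUL, because that would
    # mean an empty filename, which the old format does not allow.
    return data.startswith(b"\0")


def shared_prefix_len(a: bytes, b: bytes) -> int:
    # Count how many leading bytes the two paths have in common.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def stem_compress(paths):
    # Encode each path as (bytes shared with the previous path, suffix).
    prev = b""
    out = []
    for path in paths:
        shared = shared_prefix_len(prev, path)
        out.append((shared, path[shared:]))
        prev = path
    return out


def stem_decompress(pairs):
    # Rebuild full paths from (shared length, suffix) pairs.
    prev = b""
    out = []
    for shared, suffix in pairs:
        prev = prev[:shared] + suffix
        out.append(prev)
    return out
```

With sorted paths, neighbouring entries tend to share long directory prefixes, which is why only the differing tail needs to be stored.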
Mike Hommey - April 2, 2015, 12:01 a.m.
On Wed, Apr 01, 2015 at 10:34:49AM -0700, Martin von Zweigbergk wrote:
> # HG changeset patch
> # User Martin von Zweigbergk <martinvonz@google.com>
> # Date 1427520401 25200
> #      Fri Mar 27 22:26:41 2015 -0700
> # Node ID aca6ee57dddf4b39732833a2bb603dcd19148754
> # Parent  7530c75651b04d04e0871c84dfda487b4e9e96b4
> manifestv2: add support for reading new manifest format
> 
> The new manifest format is designed to be smaller, in particular to
> produce smaller deltas. It stores hashes in binary and puts the hash
> on a new line (for smaller deltas). It also uses stem compression to
> save space for long paths. The format has room for metadata, but
> that's there only for future-proofing. The parser thus accepts any
> metadata and throws it away. For more information, see
> http://mercurial.selenic.com/wiki/ManifestV2Plan.

I have several questions related to that document:
- Since manifest creation is done when committing, what is the plan wrt
  what should happen when a commit with manifestv2 is pushed (server may
  not support them, or may not want them even if it does)

- What is the metadata expected to be used for in the header and in the
  file entries? How would they be set, and what's the expected interaction
  for merge conflicts on these metadata?

- Why put file entries on two lines? The 20-byte nodeid could be
  preceded with a null character, which would solve the readdelta
  issue mentioned at the end of the document, but maybe the goal is to
  make deltas smaller when only the nodeid changes?

- Relatedly, the manifest sharding plan seems now to essentially be to
  have a manifest per directory, with one revlog each. Is there a plan
  to handle moved/renamed directories like it's done with files?

Mike
Matt Mackall - April 2, 2015, 2:19 a.m.
On Thu, 2015-04-02 at 09:01 +0900, Mike Hommey wrote:
> On Wed, Apr 01, 2015 at 10:34:49AM -0700, Martin von Zweigbergk wrote:
> > # HG changeset patch
> > # User Martin von Zweigbergk <martinvonz@google.com>
> > # Date 1427520401 25200
> > #      Fri Mar 27 22:26:41 2015 -0700
> > # Node ID aca6ee57dddf4b39732833a2bb603dcd19148754
> > # Parent  7530c75651b04d04e0871c84dfda487b4e9e96b4
> > manifestv2: add support for reading new manifest format
> > 
> > The new manifest format is designed to be smaller, in particular to
> > produce smaller deltas. It stores hashes in binary and puts the hash
> > on a new line (for smaller deltas). It also uses stem compression to
> > save space for long paths. The format has room for metadata, but
> > that's there only for future-proofing. The parser thus accepts any
> > metadata and throws it away. For more information, see
> > http://mercurial.selenic.com/wiki/ManifestV2Plan.
> 
> I have several questions related to that document:
> - Since manifest creation is done when committing, what is the plan wrt
>   what should happen when a commit with manifestv2 is pushed (server may
>   not support them, or may not want them even if it does)

Not fully decided. Possibly on-the-fly conversion.

> - What is the metadata expected to be used for in the header and in the
>   file entries? How would they be set, and what's the expected interaction
>   for merge conflicts on these metadata?

No precise plans yet. All metadata is presumed to be optional/ignorable,
so can be skipped by merge.

> - Why put file entries on two lines? The 20-byte nodeid could be
>   preceded with a null character, which would solve the readdelta
>   issue mentioned at the end of the document, but maybe the goal is to
>   make deltas smaller when only the nodeid changes?

http://mercurial.selenic.com/wiki/ImprovingManifestCompressionPlan

> - Relatedly, the manifest sharding plan seems now to essentially be to
>   have a manifest per directory, with one revlog each. Is there a plan
>   to handle moved/renamed directories like it's done with files?

So many questions! Merge already handles directory moves implicitly, so
this should just work without changing existing semantics.
Mike Hommey - April 2, 2015, 2:27 a.m.
On Wed, Apr 01, 2015 at 09:19:03PM -0500, Matt Mackall wrote:
> On Thu, 2015-04-02 at 09:01 +0900, Mike Hommey wrote:
> > On Wed, Apr 01, 2015 at 10:34:49AM -0700, Martin von Zweigbergk wrote:
> > > # HG changeset patch
> > > # User Martin von Zweigbergk <martinvonz@google.com>
> > > # Date 1427520401 25200
> > > #      Fri Mar 27 22:26:41 2015 -0700
> > > # Node ID aca6ee57dddf4b39732833a2bb603dcd19148754
> > > # Parent  7530c75651b04d04e0871c84dfda487b4e9e96b4
> > > manifestv2: add support for reading new manifest format
> > > 
> > > The new manifest format is designed to be smaller, in particular to
> > > produce smaller deltas. It stores hashes in binary and puts the hash
> > > on a new line (for smaller deltas). It also uses stem compression to
> > > save space for long paths. The format has room for metadata, but
> > > that's there only for future-proofing. The parser thus accepts any
> > > metadata and throws it away. For more information, see
> > > http://mercurial.selenic.com/wiki/ManifestV2Plan.
> > 
> > I have several questions related to that document:
> > - Since manifest creation is done when committing, what is the plan wrt
> >   what should happen when a commit with manifestv2 is pushed (server may
> >   not support them, or may not want them even if it does)
> 
> Not fully decided. Possibly on-the-fly conversion.

The sha1 for a manifestv2 would be the same as the corresponding
(flattened) manifestv1? O_o

> 
> > - What is the metadata expected to be used for in the header and in the
> >   file entries? How would they be set, and what's the expected interaction
> >   for merge conflicts on these metadata?
> 
> No precise plans yet. All metadata is presumed to be optional/ignorable,
> so can be skipped by merge.
> 
> > - Why put file entries on two lines? The 20-byte nodeid could be
> >   preceded with a null character, which would solve the readdelta
> >   issue mentioned at the end of the document, but maybe the goal is to
> >   make deltas smaller when only the nodeid changes?
> 
> http://mercurial.selenic.com/wiki/ImprovingManifestCompressionPlan
> 
> > - Relatedly, the manifest sharding plan seems now to essentially be to
> >   have a manifest per directory, with one revlog each. Is there a plan
> >   to handle moved/renamed directories like it's done with files?
> 
> So many questions! Merge already handles directory moves implicitly, so
> this should just work without changing existing semantics.

This question was not related to merges, but to the copyrev thing there
is for files. I guess it's not very important to have that for
manifests.

Mike
Gregory Szorc - April 2, 2015, 2:42 a.m.
On Wed, Apr 1, 2015 at 7:27 PM, Mike Hommey <mh@glandium.org> wrote:

> On Wed, Apr 01, 2015 at 09:19:03PM -0500, Matt Mackall wrote:
> > On Thu, 2015-04-02 at 09:01 +0900, Mike Hommey wrote:
> > > On Wed, Apr 01, 2015 at 10:34:49AM -0700, Martin von Zweigbergk wrote:
> > > > # HG changeset patch
> > > > # User Martin von Zweigbergk <martinvonz@google.com>
> > > > # Date 1427520401 25200
> > > > #      Fri Mar 27 22:26:41 2015 -0700
> > > > # Node ID aca6ee57dddf4b39732833a2bb603dcd19148754
> > > > # Parent  7530c75651b04d04e0871c84dfda487b4e9e96b4
> > > > manifestv2: add support for reading new manifest format
> > > >
> > > > The new manifest format is designed to be smaller, in particular to
> > > > produce smaller deltas. It stores hashes in binary and puts the hash
> > > > on a new line (for smaller deltas). It also uses stem compression to
> > > > save space for long paths. The format has room for metadata, but
> > > > that's there only for future-proofing. The parser thus accepts any
> > > > metadata and throws it away. For more information, see
> > > > http://mercurial.selenic.com/wiki/ManifestV2Plan.
> > >
> > > I have several questions related to that document:
> > > - Since manifest creation is done when committing, what is the plan wrt
> > > >   what should happen when a commit with manifestv2 is pushed (server may
> > > >   not support them, or may not want them even if it does)
> >
> > Not fully decided. Possibly on-the-fly conversion.
>
> The sha1 for a manifestv2 would be the same as the corresponding
> (flattened) manifestv1? O_o
>

I think the point you are trying to make is there would be at least 2
SHA-1s for every changeset, depending on how manifests are computed. That
seems extremely confusing.

Another question: how can an existing repo seamlessly switch to the new
manifest format? Presumably we'll want to "upgrade" the Firefox repo to
both directory manifests and manifestv2 for performance benefits. But since
manifestv2 is a requires-time thing, that would mean rewriting the entire
manifest to v2. And that would change manifest SHA-1's which would
invalidate every existing changeset SHA-1. That's a non-starter for us
unless we can seamlessly handle requests for old changesets (hgweb URLs,
clients updating to old changesets for bisection, etc).
Martin von Zweigbergk - April 2, 2015, 3:47 a.m.
On Wed, Apr 1, 2015 at 7:42 PM Gregory Szorc <gregory.szorc@gmail.com>
wrote:

> On Wed, Apr 1, 2015 at 7:27 PM, Mike Hommey <mh@glandium.org> wrote:
>
>> On Wed, Apr 01, 2015 at 09:19:03PM -0500, Matt Mackall wrote:
>> > On Thu, 2015-04-02 at 09:01 +0900, Mike Hommey wrote:
>> > > On Wed, Apr 01, 2015 at 10:34:49AM -0700, Martin von Zweigbergk wrote:
>> > > > # HG changeset patch
>> > > > # User Martin von Zweigbergk <martinvonz@google.com>
>> > > > # Date 1427520401 25200
>> > > > #      Fri Mar 27 22:26:41 2015 -0700
>> > > > # Node ID aca6ee57dddf4b39732833a2bb603dcd19148754
>> > > > # Parent  7530c75651b04d04e0871c84dfda487b4e9e96b4
>> > > > manifestv2: add support for reading new manifest format
>> > > >
>> > > > The new manifest format is designed to be smaller, in particular to
>> > > > produce smaller deltas. It stores hashes in binary and puts the hash
>> > > > on a new line (for smaller deltas). It also uses stem compression to
>> > > > save space for long paths. The format has room for metadata, but
>> > > > that's there only for future-proofing. The parser thus accepts any
>> > > > metadata and throws it away. For more information, see
>> > > > http://mercurial.selenic.com/wiki/ManifestV2Plan.
>> > >
>> > > I have several questions related to that document:
>> > > - Since manifest creation is done when committing, what is the plan wrt
>> > >   what should happen when a commit with manifestv2 is pushed (server may
>> > >   not support them, or may not want them even if it does)
>> >
>> > Not fully decided. Possibly on-the-fly conversion.
>>
>> The sha1 for a manifestv2 would be the same as the corresponding
>> (flattened) manifestv1? O_o
>>
>
> I think the point you are trying to make is there would be at least 2
> SHA-1s for every changeset, depending on how manifests are computed. That
> seems extremely confusing.
>

Yes, that would be confusing. What this series adds is a new hash. What
Matt refers to is a BC-mode where the manifest is stored in v2 format in
the revlog, but the nodeid is calculated as if the content were v1. Once
that's done, we can convert between the formats on the fly. In this
BC-mode, we would have to produce the full-text manifest in both formats on
commit, and the on-the-fly conversion would be somewhat costly too. The
benefit is of course that the manifest revlog would be smaller (20-40%).
(Note that if we do add a BC-mode, we'd probably have to be careful not to
allow any metadata in the (v2) manifests, since that would not be part of
the hash and could (?) open up for some attack.)
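
To make the BC-mode idea concrete, here is a minimal sketch, not taken from this series. It assumes the usual revlog nodeid computation (SHA-1 over the two sorted parent ids followed by the text); `bc_mode_entry` is a hypothetical helper that hashes the v1 text while storing the v2 text.

```python
import hashlib

NULLID = b"\0" * 20


def revlog_nodeid(text: bytes, p1: bytes = NULLID, p2: bytes = NULLID) -> bytes:
    # Assumed revlog hash: sha1(min(p1, p2) + max(p1, p2) + text).
    a, b = sorted([p1, p2])
    s = hashlib.sha1(a + b)
    s.update(text)
    return s.digest()


def bc_mode_entry(v1_text: bytes, v2_text: bytes,
                  p1: bytes = NULLID, p2: bytes = NULLID):
    # Hypothetical BC-mode: the nodeid covers the v1 rendering only,
    # while the revlog would store the (smaller) v2 rendering.
    return revlog_nodeid(v1_text, p1, p2), v2_text
```

Two different v2 texts paired with the same v1 text get the same nodeid here, which is the reason for the caution about metadata: anything present only in the v2 form is not covered by the hash.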


> Another question: how can an existing repo seamlessly switch to the new
> manifest format?
>

As I wrote in the message for patch 3 in this series, I think that should
be safe, but for now, we're keeping it simple by filling in the requires at
repo creation time. My original assumption was that we would fill in the
requires on the commit with the flag on. There seems to be no precedent for
such behavior, but it seems to make sense to me (largefiles would be close
to that, but not quite).


> Presumably we'll want to "upgrade" the Firefox repo to both directory
> manifests and manifestv2 for performance benefits.
>

I had assumed that Firefox would be using this in ~5 years. I'm curious
what a more accurate number is. How soon can you require your developers to
have upgraded to a certain version of hg?


> But since manifestv2 is a requires-time thing, that would mean rewriting
> the entire manifest to v2. And that would change manifest SHA-1's which
> would invalidate every existing changeset SHA-1. That's a non-starter for
> us unless we can seamlessly handle requests for old changesets (hgweb URLs,
> clients updating to old changesets for bisection, etc).
>

So either the BC-mode with on-the-fly conversion or we could allow
switching to the new formats on an existing repo. For tree manifests, I
don't think there will be a BC-mode, so if that turns out to be useful to
you, you'd probably have to require clients to upgrade at that point anyway.

I'm a little surprised, but happy, that you mention only existing hashes.
You seem to be considering upgrading earlier than I had expected.
Gregory Szorc - April 2, 2015, 4:24 a.m.
On Wed, Apr 1, 2015 at 8:47 PM, Martin von Zweigbergk
<martinvonz@google.com> wrote:

>
>
> On Wed, Apr 1, 2015 at 7:42 PM Gregory Szorc <gregory.szorc@gmail.com>
> wrote:
>
>> On Wed, Apr 1, 2015 at 7:27 PM, Mike Hommey <mh@glandium.org> wrote:
>>
>>> On Wed, Apr 01, 2015 at 09:19:03PM -0500, Matt Mackall wrote:
>>> > On Thu, 2015-04-02 at 09:01 +0900, Mike Hommey wrote:
>>> > > On Wed, Apr 01, 2015 at 10:34:49AM -0700, Martin von Zweigbergk wrote:
>>> > > > # HG changeset patch
>>> > > > # User Martin von Zweigbergk <martinvonz@google.com>
>>> > > > # Date 1427520401 25200
>>> > > > #      Fri Mar 27 22:26:41 2015 -0700
>>> > > > # Node ID aca6ee57dddf4b39732833a2bb603dcd19148754
>>> > > > # Parent  7530c75651b04d04e0871c84dfda487b4e9e96b4
>>> > > > manifestv2: add support for reading new manifest format
>>> > > >
>>> > > > The new manifest format is designed to be smaller, in particular to
>>> > > > produce smaller deltas. It stores hashes in binary and puts the hash
>>> > > > on a new line (for smaller deltas). It also uses stem compression to
>>> > > > save space for long paths. The format has room for metadata, but
>>> > > > that's there only for future-proofing. The parser thus accepts any
>>> > > > metadata and throws it away. For more information, see
>>> > > > http://mercurial.selenic.com/wiki/ManifestV2Plan.
>>> > >
>>> > > I have several questions related to that document:
>>> > > - Since manifest creation is done when committing, what is the plan wrt
>>> > >   what should happen when a commit with manifestv2 is pushed (server may
>>> > >   not support them, or may not want them even if it does)
>>> >
>>> > Not fully decided. Possibly on-the-fly conversion.
>>>
>>> The sha1 for a manifestv2 would be the same as the corresponding
>>> (flattened) manifestv1? O_o
>>>
>>
>> I think the point you are trying to make is there would be at least 2
>> SHA-1s for every changeset, depending on how manifests are computed. That
>> seems extremely confusing.
>>
>
> Yes, that would be confusing. What this series adds is a new hash. What
> Matt refers to is a BC-mode where the manifest is stored in v2 format in
> the revlog, but the nodeid is calculated as if the content were v1. Once
> that's done, we can convert between the formats on the fly. In this
> BC-mode, we would have to produce the full-text manifest in both formats on
> commit, and the on-the-fly conversion would be somewhat costly too. The
> benefit is of course that the manifest revlog would be smaller (20-40%).
> (Note that if we do add a BC-mode, we'd probably have to be careful not to
> allow any metadata in the (v2) manifests, since that would not be part of
> the hash and could (?) open up for some attack.)
>

That's an interesting idea. But as you said, it locks the server into
BC-compatible behavior pretty much indefinitely.

It's almost like you want to upgrade the server then have the server
advertise manifest capabilities to the client. Older clients either
wouldn't be able to push or we could do some server-side rebase magic and
push down the rewritten changeset SHA-1 to the client somehow. Maybe the
client would maintain a SHA-1 map file? hg-git has explored something
similar.

This makes my head hurt.


>
>
>> Another question: how can an existing repo seamlessly switch to the new
>> manifest format?
>>
>
> As I wrote in the message for patch 3 in this series, I think that should
> be safe, but for now, we're keeping it simple by filling in the requires at
> repo creation time. My original assumption was that we would fill in the
> requires on the commit with the flag on. There seems to be no precedent for
> such behavior, but it seems to make sense to me (largefiles would be close
> to that, but not quite).
>
>
>> Presumably we'll want to "upgrade" the Firefox repo to both directory
>> manifests and manifestv2 for performance benefits.
>>
>
> I had assumed that Firefox would be using this in ~5 years. I'm curious
> what a more accurate number is. How soon can you require your developers to
> have upgraded to a certain version of hg?
>
>
>> But since manifestv2 is a requires-time thing, that would mean rewriting
>> the entire manifest to v2. And that would change manifest SHA-1's which
>> would invalidate every existing changeset SHA-1. That's a non-starter for
>> us unless we can seamlessly handle requests for old changesets (hgweb URLs,
>> clients updating to old changesets for bisection, etc).
>>
>
> So either the BC-mode with on-the-fly conversion or we could allow
> switching to the new formats on an existing repo. For tree manifests, I
> don't think there will be a BC-mode, so if that turns out to be useful to
> you, you'd probably have to require clients to upgrade at that point anyway.
>
> I'm a little surprised, but happy, that you mention only existing hashes.
> You seem to be considering upgrading earlier than I had expected.
>

We can force people to use a modern Mercurial if there is a compelling reason
to do so. It's annoying to force people to upgrade, sure. But if it's all
performance and feature wins, I don't think many will complain. We've done
this before in January 2013 by requiring Python 2.7 to build Firefox. We
can do it again if there are good reasons.

What concerns us more than forcing a software upgrade onto people is
dealing with a repo rewrite. References to existing SHA-1s need to work
forever. Tons of automation would need to transition. The cost for a
one-time transition would be significant. A global flag day would be far
more expensive than a gradual transition. Maybe we would start playing new
commits to both repo versions and have downstream consumers start pulling
from the new repo before a push-only flag day? I don't know. We should talk
about this in Montreal.

But it's not just Mozilla. Every Mercurial user is in the same boat.
Although, only large repos would likely have anything significant to gain.
I think we need to consider the implications beyond what Google, Facebook,
and Mozilla are willing to tolerate. Mozilla is probably a decent proxy for
the average company or organization that doesn't have the strong machine
management that Facebook or Google are able to provide.
Martin von Zweigbergk - April 2, 2015, 4:36 a.m.
On Wed, Apr 1, 2015 at 9:24 PM Gregory Szorc <gregory.szorc@gmail.com>
wrote:

> On Wed, Apr 1, 2015 at 8:47 PM, Martin von Zweigbergk
> <martinvonz@google.com> wrote:
>
>>
>>
>> On Wed, Apr 1, 2015 at 7:42 PM Gregory Szorc <gregory.szorc@gmail.com>
>> wrote:
>>
>>> On Wed, Apr 1, 2015 at 7:27 PM, Mike Hommey <mh@glandium.org> wrote:
>>>
>>>> On Wed, Apr 01, 2015 at 09:19:03PM -0500, Matt Mackall wrote:
>>>> > On Thu, 2015-04-02 at 09:01 +0900, Mike Hommey wrote:
>>>> > > On Wed, Apr 01, 2015 at 10:34:49AM -0700, Martin von Zweigbergk wrote:
>>>> > > > # HG changeset patch
>>>> > > > # User Martin von Zweigbergk <martinvonz@google.com>
>>>> > > > # Date 1427520401 25200
>>>> > > > #      Fri Mar 27 22:26:41 2015 -0700
>>>> > > > # Node ID aca6ee57dddf4b39732833a2bb603dcd19148754
>>>> > > > # Parent  7530c75651b04d04e0871c84dfda487b4e9e96b4
>>>> > > > manifestv2: add support for reading new manifest format
>>>> > > >
>>>> > > > The new manifest format is designed to be smaller, in particular to
>>>> > > > produce smaller deltas. It stores hashes in binary and puts the hash
>>>> > > > on a new line (for smaller deltas). It also uses stem compression to
>>>> > > > save space for long paths. The format has room for metadata, but
>>>> > > > that's there only for future-proofing. The parser thus accepts any
>>>> > > > metadata and throws it away. For more information, see
>>>> > > > http://mercurial.selenic.com/wiki/ManifestV2Plan.
>>>> > >
>>>> > > I have several questions related to that document:
>>>> > > - Since manifest creation is done when committing, what is the plan wrt
>>>> > >   what should happen when a commit with manifestv2 is pushed (server may
>>>> > >   not support them, or may not want them even if it does)
>>>> >
>>>> > Not fully decided. Possibly on-the-fly conversion.
>>>>
>>>> The sha1 for a manifestv2 would be the same as the corresponding
>>>> (flattened) manifestv1? O_o
>>>>
>>>
>>> I think the point you are trying to make is there would be at least 2
>>> SHA-1s for every changeset, depending on how manifests are computed. That
>>> seems extremely confusing.
>>>
>>
>> Yes, that would be confusing. What this series adds is a new hash. What
>> Matt refers to is a BC-mode where the manifest is stored in v2 format in
>> the revlog, but the nodeid is calculated as if the content were v1. Once
>> that's done, we can convert between the formats on the fly. In this
>> BC-mode, we would have to produce the full-text manifest in both formats on
>> commit, and the on-the-fly conversion would be somewhat costly too. The
>> benefit is of course that the manifest revlog would be smaller (20-40%).
>> (Note that if we do add a BC-mode, we'd probably have to be careful not to
>> allow any metadata in the (v2) manifests, since that would not be part of
>> the hash and could (?) open up for some attack.)
>>
>
> That's an interesting idea. But as you said, it locks the server into
> BC-compatible behavior pretty much indefinitely.
>
> It's almost like you want to upgrade the server then have the server
> advertise manifest capabilities to the client. Older clients either
> wouldn't be able to push or we could do some server-side rebase magic and
> push down the rewritten changeset SHA-1 to the client somehow. Maybe the
> client would maintain a SHA-1 map file? hg-git has explored something
> similar.
>
> This makes my head hurt.
>

IIUC, you're (not really) suggesting different hashes on the client? Sounds
too complex to me.


>
>
>>
>>
>>> Another question: how can an existing repo seamlessly switch to the new
>>> manifest format?
>>>
>>
>> As I wrote in the message for patch 3 in this series, I think that should
>> be safe, but for now, we're keeping it simple by filling in the requires at
>> repo creation time. My original assumption was that we would fill in the
>> requires on the commit with the flag on. There seems to be no precedent for
>> such behavior, but it seems to make sense to me (largefiles would be close
>> to that, but not quite).
>>
>>
>>> Presumably we'll want to "upgrade" the Firefox repo to both directory
>>> manifests and manifestv2 for performance benefits.
>>>
>>
>> I had assumed that Firefox would be using this in ~5 years. I'm curious
>> what a more accurate number is. How soon can you require your developers to
>> have upgraded to a certain version of hg?
>>
>>
>>> But since manifestv2 is a requires-time thing, that would mean rewriting
>>> the entire manifest to v2. And that would change manifest SHA-1's which
>>> would invalidate every existing changeset SHA-1. That's a non-starter for
>>> us unless we can seamlessly handle requests for old changesets (hgweb URLs,
>>> clients updating to old changesets for bisection, etc).
>>>
>>
>> So either the BC-mode with on-the-fly conversion or we could allow
>> switching to the new formats on an existing repo. For tree manifests, I
>> don't think there will be a BC-mode, so if that turns out to be useful to
>> you, you'd probably have to require clients to upgrade at that point anyway.
>>
>> I'm a little surprised, but happy, that you mention only existing hashes.
>> You seem to be considering upgrading earlier than I had expected.
>>
>
> We can force people to use a modern Mercurial if there is a compelling reason
> to do so. It's annoying to force people to upgrade, sure. But if it's all
> performance and feature wins, I don't think many will complain. We've done
> this before in January 2013 by requiring Python 2.7 to build Firefox. We
> can do it again if there are good reasons.
>
> What concerns us more than forcing a software upgrade onto people is
> dealing with a repo rewrite. References to existing SHA-1s need to work
> forever. Tons of automation would need to transition. The cost for a
> one-time transition would be significant. A global flag day would be far
> more expensive than a gradual transition. Maybe we would start playing new
> commits to both repo versions and have downstream consumers start pulling
> from the new repo before a push-only flag day? I don't know. We should talk
> about this in Montreal.
>

I would also hate for us to force a history rewrite. I've been planning for
mid-history format switch all along, and the reason for setting the
requires value at repo creation is just to keep it simple for now. I think
we should be able to relax that later. If we were not planning for that, we
wouldn't have bothered to find a way of telling old manifests from new (the
initial empty path).

So I really don't think you should have to talk about this in Montreal :-)
I won't be there, btw. I'm sure Matt will chime in if he disagrees with me.


> But it's not just Mozilla. Every Mercurial user is in the same boat.
> Although, only large repos would likely have anything significant to gain.
> I think we need to consider the implications beyond what Google, Facebook,
> and Mozilla are willing to tolerate. Mozilla is probably a decent proxy for
> the average company or organization that doesn't have the strong machine
> management that Facebook or Google are able to provide.
>
Mike Hommey - April 2, 2015, 5:14 a.m.
On Wed, Apr 01, 2015 at 09:24:47PM -0700, Gregory Szorc wrote:
> We can force people to use a modern Mercurial if there is a compelling reason
> to do so. It's annoying to force people to upgrade, sure. But if it's all
> performance and feature wins, I don't think many will complain. We've done
> this before in January 2013 by requiring Python 2.7 to build Firefox. We
> can do it again if there are good reasons.

In January 2013, Python 2.7 had already been out for 2.5 years. Requiring
the very latest version of Mercurial is another matter.

Mike
Mike Hommey - April 3, 2015, 1:38 a.m.
On Wed, Apr 01, 2015 at 09:19:03PM -0500, Matt Mackall wrote:
> On Thu, 2015-04-02 at 09:01 +0900, Mike Hommey wrote:
> > - Why put file entries on two lines? The 20-byte nodeid could be
> >   preceded with a null character, which would solve the readdelta
> >   issue mentioned at the end of the document, but maybe the goal is to
> >   make deltas smaller when only the nodeid changes?
> 
> http://mercurial.selenic.com/wiki/ImprovingManifestCompressionPlan

  "3.1. Delta-friendly line breaks

  The bdiff delta format is based on line breaks."

Well, that's not actually true. The delta format used in revdiffs is
line-break agnostic. The diff algorithm that generates them only splits
at line breaks, but revdiffs could just as well contain offsets
that aren't line breaks.
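
A small sketch of that point, assuming the delta is a sequence of fragments, each a 12-byte big-endian header (start, end, length) followed by `length` bytes of replacement data. `make_fragment` and `apply_delta` are hypothetical helper names, not Mercurial APIs.

```python
import struct


def make_fragment(start: int, end: int, data: bytes) -> bytes:
    # Replace base[start:end] with data; offsets are plain byte positions.
    return struct.pack(">lll", start, end, len(data)) + data


def apply_delta(base: bytes, delta: bytes) -> bytes:
    out = []
    last = 0  # end of the previously replaced region in base
    pos = 0   # read position in delta
    while pos < len(delta):
        start, end, length = struct.unpack(">lll", delta[pos:pos + 12])
        pos += 12
        out.append(base[last:start])           # unchanged bytes before the hunk
        out.append(delta[pos:pos + length])    # replacement bytes
        pos += length
        last = end
    out.append(base[last:])                    # unchanged tail
    return b"".join(out)


# Replace only the 20-byte nodeid in the middle of a single line.
prefix = b"path/to/file\0"
base = prefix + b"A" * 20 + b"\n"
delta = make_fragment(len(prefix), len(prefix) + 20, b"B" * 20)
patched = apply_delta(base, delta)
```

Nothing in the fragment header refers to lines, so a producer other than the line-based diff could emit offsets like the one above.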

Mike
Pierre-Yves David - April 13, 2015, 3:44 p.m.
On 04/01/2015 09:24 PM, Gregory Szorc wrote:
> On Wed, Apr 1, 2015 at 8:47 PM, Martin von Zweigbergk
> <martinvonz@google.com> wrote:
>
>
>
>     On Wed, Apr 1, 2015 at 7:42 PM Gregory Szorc
>     <gregory.szorc@gmail.com> wrote:
>
>         On Wed, Apr 1, 2015 at 7:27 PM, Mike Hommey <mh@glandium.org> wrote:
>
>             On Wed, Apr 01, 2015 at 09:19:03PM -0500, Matt Mackall wrote:
>             > On Thu, 2015-04-02 at 09:01 +0900, Mike Hommey wrote:
>             > > On Wed, Apr 01, 2015 at 10:34:49AM -0700, Martin von Zweigbergk wrote:
>             > > > # HG changeset patch
>             > > > # User Martin von Zweigbergk <martinvonz@google.com>
>             > > > # Date 1427520401 25200
>             > > > #      Fri Mar 27 22:26:41 2015 -0700
>             > > > # Node ID aca6ee57dddf4b39732833a2bb603dcd19148754
>             > > > # Parent  7530c75651b04d04e0871c84dfda487b4e9e96b4
>             > > > manifestv2: add support for reading new manifest format
>             > > >
>             > > > The new manifest format is designed to be smaller, in particular to
>             > > > produce smaller deltas. It stores hashes in binary and puts the hash
>             > > > on a new line (for smaller deltas). It also uses stem compression to
>             > > > save space for long paths. The format has room for metadata, but
>             > > > that's there only for future-proofing. The parser thus accepts any
>             > > > metadata and throws it away. For more information, see
>             > > > http://mercurial.selenic.com/wiki/ManifestV2Plan.
>             > >
>             > > I have several questions related to that document:
>             > > - Since manifest creation is done when committing, what is the plan wrt
>             > >   what should happen when a commit with manifestv2 is pushed (server may
>             > >   not support them, or may not want them even if it does)
>             >
>             > Not fully decided. Possibly on-the-fly conversion.
>
>             The sha1 for a manifestv2 would be the same as the corresponding
>             (flattened) manifestv1? O_o
>
>
>         I think the point you are trying to make is there would be at
>         least 2 SHA-1s for every changeset, depending on how manifests
>         are computed. That seems extremely confusing.
>
>
>     Yes, that would be confusing. What this series adds is a new hash.
>     What Matt refers to is a BC-mode where the manifest is stored in v2
>     format in the revlog, but the nodeid is calculated as if the content
>     were v1. Once that's done, we can convert between the formats on the
>     fly. In this BC-mode, we would have to produce the full-text
>     manifest in both formats on commit, and the on-the-fly conversion
>     would be somewhat costly too. The benefit is of course that the
>     manifest revlog would be smaller (20-40%). (Note that if we do add a
>     BC-mode, we'd probably have to be careful not to allow any metadata
>     in the (v2) manifests, since that would not be part of the hash and
>     could (?) open up for some attack.)
>
>
> That's an interesting idea. But as you said, it locks the server into
> BC-compatible behavior pretty much indefinitely.
>
> It's almost like you want to upgrade the server then have the server
> advertise manifest capabilities to the client. Older clients either
> wouldn't be able to push or we could do some server-side rebase magic
> and push down the rewritten changeset SHA-1 to the client somehow. Maybe
> the client would maintain a SHA-1 map file? hg-git has explored
> something similar.
>
> This makes my head hurt.
>
>         Another question: how can an existing repo seamlessly switch to
>         the new manifest format?
>
>
>     As I wrote in the message for patch 3 in this series, I think that
>     should be safe, but for now, we're keeping it simple by filling in
>     the requires at repo creation time. My original assumption was that
>     we would fill in the requires on the commit with the flag on. There
>     seems to be no precedent for such behavior, but it seems to make
>     sense to me (largefiles would be close to that, but not quite).
>
>         Presumably we'll want to "upgrade" the Firefox repo to both
>         directory manifests and manifestv2 for performance benefits.
>
>
>     I had assumed that Firefox would be using this in ~5 years. I'm
>     curious what a more accurate number is. How soon can you require
>     your developers to have upgraded to a certain version of hg?
>
>         But since manifestv2 is a requires-time thing, that would mean
>         rewriting the entire manifest to v2. And that would change
>         manifest SHA-1's which would invalidate every existing changeset
>         SHA-1. That's a non-starter for us unless we can seamlessly
>         handle requests for old changesets (hgweb URLs, clients updating
>         to old changesets for bisection, etc).
>
>
>     So either the BC-mode with on-the-fly conversion or we could allow
>     switching to the new formats on an existing repo. For tree
>     manifests, I don't think there will be a BC-mode, so if that turns
>     out to be useful to you, you'd probably have to require clients to
>     upgrade at that point anyway.
>
>     I'm a little surprised, but happy, that you mention only existing
>     hashes. You seem to be considering upgrading earlier than I had
>     expected.
>
>
> We can force people to use a modern Mercurial if there is a compelling
> reason to do so. It's annoying to force people to upgrade, sure. But if
> it's all performance and feature wins, I don't think many will complain.
> We've done this before in January 2013 by requiring Python 2.7 to build
> Firefox. We can do it again if there are good reasons.
>
> What concerns us more than forcing a software upgrade onto people is
> dealing with a repo rewrite. References to existing SHA-1 need to work
> forever. Tons of automation would need to transition. The cost for a
> one-time transition would be significant. A global flag day would be far
> more expensive than a gradual transition. Maybe we would start playing
> new commits to both repo versions and have downstream consumers start
> pulling from the new repo before a push-only flag day? I don't know. We
> should talk about this in Montreal.

Transition to a new hash was discussed at the New York Sprint (Nov
2013). Here is what I can remember of it.

- Whoever breaks the hash should also move the hashing algorithm to sha256.

- We should probably change the format to allow multiple hashes per
changeset. This would allow two things:

   - Keeping the old hash for reference,
   - Moving to another hash algorithm (hash side) or another algorithm
(Mercurial side)

   New servers/clients would start computing the new hash alongside the
old hash. At some point a project can decide to stop accepting changesets
that only have the old hash, and eventually drop the old hash entirely.
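
As a rough illustration of computing a new hash alongside the old one, here is a minimal Python sketch. It assumes a node id is derived by hashing the two sorted parent ids followed by the revision text. The function name `node_hashes`, the dict layout, and the `NULLID` placeholder are hypothetical and not taken from any proposal in this thread.

```python
import hashlib

NULLID = b"\0" * 20  # placeholder parent id used only for this illustration


def node_hashes(text, p1=NULLID, p2=NULLID):
    """Return both an old-style and a new-style digest for one revision.

    Assumption: the digest covers the sorted parent ids followed by the
    text. The returned dict layout is hypothetical.
    """
    first, second = sorted([p1, p2])
    result = {}
    for name in ("sha1", "sha256"):
        h = hashlib.new(name)
        h.update(first)
        h.update(second)
        h.update(text)
        result[name] = h.digest()
    return result
```

A project in the transition period would record both values. How parent pointers and revlog indexes would refer to them is left open here, as it is in the discussion.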

However, this feature MUST NOT turn into a
"git-alias/omg-security-flaw-everywhere" feature.

> But it's not just Mozilla. Every Mercurial user is in the same boat.
> Although, only large repos would likely have anything significant to
> gain. I think we need to consider the implications beyond what Google,
> Facebook, and Mozilla are willing to tolerate. Mozilla is probably a
> decent proxy for the average company or organization that doesn't have
> the strong machine management that Facebook or Google are able to provide.

I agree with Greg that we need a smooth transition path for projects
that start small but eventually become big.

Because of that, I think we -should- have the same hashing algorithm for
flat repositories and sharded ones.

A big change here is that the hash of a revision will not be computed
directly from the binary content, but using special logic.

One of the issues is that the directory "hash" will have parent
information in its hash. So the flat version needs to duplicate this
information to be able to produce the same hash.
Martin von Zweigbergk - April 13, 2015, 4:46 p.m.
On Mon, Apr 13, 2015 at 8:45 AM Pierre-Yves David <
pierre-yves.david@ens-lyon.org> wrote:

>
>
> On 04/01/2015 09:24 PM, Gregory Szorc wrote:
> > On Wed, Apr 1, 2015 at 8:47 PM, Martin von Zweigbergk
> > <martinvonz@google.com> wrote:
> >
> >
> >
> >     On Wed, Apr 1, 2015 at 7:42 PM Gregory Szorc
> >     <gregory.szorc@gmail.com> wrote:
> >
> >         On Wed, Apr 1, 2015 at 7:27 PM, Mike Hommey <mh@glandium.org>
> >         wrote:
> >
> >             On Wed, Apr 01, 2015 at 09:19:03PM -0500, Matt Mackall wrote:
> >             > On Thu, 2015-04-02 at 09:01 +0900, Mike Hommey wrote:
> >             > > On Wed, Apr 01, 2015 at 10:34:49AM -0700, Martin von Zweigbergk wrote:
> >             > > > # HG changeset patch
> >             > > > # User Martin von Zweigbergk <martinvonz@google.com>
> >             > > > # Date 1427520401 25200
> >             > > > #      Fri Mar 27 22:26:41 2015 -0700
> >             > > > # Node ID aca6ee57dddf4b39732833a2bb603dcd19148754
> >             > > > # Parent  7530c75651b04d04e0871c84dfda487b4e9e96b4
> >             > > > manifestv2: add support for reading new manifest format
> >             > > >
> >             > > > The new manifest format is designed to be smaller, in particular to
> >             > > > produce smaller deltas. It stores hashes in binary and puts the hash
> >             > > > on a new line (for smaller deltas). It also uses stem compression to
> >             > > > save space for long paths. The format has room for metadata, but
> >             > > > that's there only for future-proofing. The parser thus accepts any
> >             > > > metadata and throws it away. For more information, see
> >             > > > http://mercurial.selenic.com/wiki/ManifestV2Plan.
> >             > >
> >             > > I have several questions related to that document:
> >             > > - Since manifest creation is done when committing, what is the plan wrt
> >             > >   what should happen when a commit with manifestv2 is pushed (server may
> >             > >   not support them, or may not want them even if it does)
> >             >
> >             > Not fully decided. Possibly on-the-fly conversion.
> >
> >             The sha1 for a manifestv2 would be the same as the corresponding
> >             (flattened) manifestv1? O_o
> >
> >
> >         I think the point you are trying to make is there would be at
> >         least 2 SHA-1s for every changeset, depending on how manifests
> >         are computed. That seems extremely confusing.
> >
> >
> >     Yes, that would be confusing. What this series adds is a new hash.
> >     What Matt refers to is a BC-mode where the manifest is stored in v2
> >     format in the revlog, but the nodeid is calculated as if the content
> >     were v1. Once that's done, we can convert between the formats on the
> >     fly. In this BC-mode, we would have to produce the full-text
> >     manifest in both formats on commit, and the on-the-fly conversion
> >     would be somewhat costly too. The benefit is of course that the
> >     manifest revlog would be smaller (20-40%). (Note that if we do add a
> >     BC-mode, we'd probably have to be careful not to allow any metadata
> >     in the (v2) manifests, since that would not be part of the hash and
> >     could (?) open up for some attack.)
> >
> >
> > That's an interesting idea. But as you said, it locks the server into
> > BC-compatible behavior pretty much indefinitely.
> >
> > It's almost like you want to upgrade the server then have the server
> > advertise manifest capabilities to the client. Older clients either
> > wouldn't be able to push or we could do some server-side rebase magic
> > and push down the rewritten changeset SHA-1 to the client somehow. Maybe
> > the client would maintain a SHA-1 map file? hg-git has explored
> > something similar.
> >
> > This makes my head hurt.
> >
> >         Another question: how can an existing repo seamlessly switch to
> >         the new manifest format?
> >
> >
> >     As I wrote in the message for patch 3 in this series, I think that
> >     should be safe, but for now, we're keeping it simple by filling in
> >     the requires at repo creation time. My original assumption was that
> >     we would fill in the requires on the commit with the flag on. There
> >     seems to be no precedent for such behavior, but it seems to make
> >     sense to me (largefiles would be close to that, but not quite).
> >
> >         Presumably we'll want to "upgrade" the Firefox repo to both
> >         directory manifests and manifestv2 for performance benefits.
> >
> >
> >     I had assumed that Firefox would be using this in ~5 years. I'm
> >     curious what a more accurate number is. How soon can you require
> >     your developers to have upgraded to a certain version of hg?
> >
> >         But since manifestv2 is a requires-time thing, that would mean
> >         rewriting the entire manifest to v2. And that would change
> >         manifest SHA-1's which would invalidate every existing changeset
> >         SHA-1. That's a non-starter for us unless we can seamlessly
> >         handle requests for old changesets (hgweb URLs, clients updating
> >         to old changesets for bisection, etc).
> >
> >
> >     So either the BC-mode with on-the-fly conversion or we could allow
> >     switching to the new formats on an existing repo. For tree
> >     manifests, I don't think there will be a BC-mode, so if that turns
> >     out to be useful to you, you'd probably have to require clients to
> >     upgrade at that point anyway.
> >
> >     I'm a little surprised, but happy, that you mention only existing
> >     hashes. You seem to be considering upgrading earlier than I had
> >     expected.
> >
> >
> > We can force people to use a modern Mercurial if there is a compelling
> > reason to do so. It's annoying to force people to upgrade, sure. But if
> > it's all performance and feature wins, I don't think many will complain.
> > We've done this before in January 2013 by requiring Python 2.7 to build
> > Firefox. We can do it again if there are good reasons.
> >
> > What concerns us more than forcing a software upgrade onto people is
> > dealing with a repo rewrite. References to existing SHA-1 need to work
> > forever. Tons of automation would need to transition. The cost for a
> > one-time transition would be significant. A global flag day would be far
> > more expensive than a gradual transition. Maybe we would start playing
> > new commits to both repo versions and have downstream consumers start
> > pulling from the new repo before a push-only flag day? I don't know. We
> > should talk about this in Montreal.
>
> Transition to a new hash was discussed at the New York Sprint (Nov
> 2013). Here is what I can remember of it.
>
> - Whoever breaks the hash should also move the hashing algorithm to
> sha256.
>

40 bytes from sha256 should be enough, right? We're concerned about a
broken hash algorithm, IIUC, not so much about hash collisions.
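
To make the truncation idea concrete, here is a minimal Python sketch. It assumes the "40" above refers to the 40 hex characters (20 bytes) of today's node ids, which the message does not spell out. The function name `truncated_sha256` is hypothetical.

```python
import hashlib


def truncated_sha256(data, nbytes=20):
    """Return the first nbytes of the SHA-256 digest of data.

    Assumption: 20 bytes (40 hex characters) is the intended width,
    matching the size of a SHA-1 digest.
    """
    return hashlib.sha256(data).digest()[:nbytes]
```

Truncating this way keeps the identifier the same width as a SHA-1 digest while moving to a different underlying algorithm.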


>
> - We should probably change the format to allow multiple hashes per
> changeset. This would allow two things:
>
>    - Keeping the old hash for reference,
>    - Moving to another hash algorithm (hash side) or another algorithm
> (Mercurial side)
>
>    New servers/clients would start computing the new hash alongside the
> old hash. At some point a project can decide to stop accepting changesets
> that only have the old hash, and eventually drop the old hash entirely.
>

I don't understand how this would work. Are the parent pointers always (for
both old-style and new-style commits) pointing to the old-style hash until
old-style hashes are no longer allowed? Would file and manifest revlogs be
using the old-style hash until the migration is complete or would you have
duplicate index files and shared data files?



>
> However, this feature MUST NOT turn into a
> "git-alias/omg-security-flaw-everywhere" feature.
>
> > But it's not just Mozilla. Every Mercurial user is in the same boat.
> > Although, only large repos would likely have anything significant to
> > gain. I think we need to consider the implications beyond what Google,
> > Facebook, and Mozilla are willing to tolerate. Mozilla is probably a
> > decent proxy for the average company or organization that doesn't have
> > the strong machine management that Facebook or Google are able to provide.
>
> I agree with Greg that we need a smooth transition path for projects
> that start small but eventually become big.
>

I also agree with Greg. I think we should allow transitioning from one type
of manifest (and changelog) to another mid-history.


>
> Because of that, I think we -should- have the same hashing algorithm for
> flat repositories and sharded ones.
>

I think that's a nice idea, but I don't want to delay our (Google's)
project too much because of it. We will not care about changing from flat
to sharded. As long as we agree on how to calculate the hash, there should
be nothing preventing us from implementing support for that hashing of flat
manifests *after* we have it implemented for sharded manifests (and
possibly even in wide use). That also lets us see how bad it is to let
*all* users use tree manifests. If that turns out not too bad, then there
is little reason to invest in making it possible.


>
> A big change here is that the hash of a revision will not be computed
> directly from the binary content, but using special logic.
>

For flat manifests, yes, I agree.


>
> One of the issues is that the directory "hash" will have parent
> information in its hash. So the flat version needs to duplicate this
> information to be able to produce the same hash.
>

Right, the flat manifests would probably gain an entry for every directory.
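
As a small illustration of what "an entry for every directory" could mean, the Python sketch below lists the directory paths implied by a flat list of file paths. The helper name `with_directory_entries` and the trailing-slash convention for directories are assumptions made for this example only. What such entries would actually store is not decided in this thread.

```python
def with_directory_entries(paths):
    """Return the sorted file paths plus one entry per ancestor directory.

    Directories are written with a trailing '/' purely for illustration.
    """
    entries = set(paths)
    for path in paths:
        parts = path.split('/')[:-1]  # drop the file name
        for depth in range(1, len(parts) + 1):
            entries.add('/'.join(parts[:depth]) + '/')
    return sorted(entries)
```

For example, the files `a/b/c.py`, `a/d.py` and `e.py` would imply the additional entries `a/` and `a/b/`.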


> --
> Pierre-Yves David
>
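
The v2 layout described in the commit message quoted above (an empty filename on the first line, then per file a stem-length byte, the rest of the path, a NUL, the flags, a newline, the 20-byte binary hash and a newline) can be sketched in Python 3 as follows. This is an illustration operating on `bytes`, not the code from this series. In particular, `encode_v2` is hypothetical, since the series only adds reading support, and the sketch writes no metadata.

```python
def encode_v2(entries):
    """Encode sorted (path, node, flags) tuples of bytes (hypothetical)."""
    out = [b'\0\n']  # empty filename on the first line marks v2; no metadata
    prev = b''
    for path, node, flags in entries:
        # Stem length: bytes shared with the previous path, capped at 255
        # so that it fits into a single byte.
        stem = 0
        limit = min(len(prev), len(path), 255)
        while stem < limit and prev[stem] == path[stem]:
            stem += 1
        out.append(bytes([stem]) + path[stem:] + b'\0' + flags + b'\n'
                   + node + b'\n')
        prev = path
    return b''.join(out)


def parse_v2(data):
    """Yield (path, node, flags) tuples from v2 manifest bytes."""
    pos = data.find(b'\n') + 1  # skip the first line (metadata is ignored)
    prev = b''
    while pos < len(data):
        # Search from pos + 1: the stem-length byte itself may be 10 ('\n').
        end = data.find(b'\n', pos + 1)
        if end == -1:
            raise ValueError('Manifest ended with incomplete file entry.')
        stemlen = data[pos]
        items = data[pos + 1:end].split(b'\0')
        path = prev[:stemlen] + items[0]
        flags = items[1]  # items[2:] would be per-file metadata, ignored
        node = data[end + 1:end + 21]  # fixed-width binary hash
        yield path, node, flags
        pos = end + 22  # 20 hash bytes plus two newlines
        prev = path
```

Because the hash has a fixed width, the parser never searches for a newline inside it, so hash bytes that happen to equal `\n` are harmless.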

Patch

diff -r 7530c75651b0 -r aca6ee57dddf mercurial/manifest.py
--- a/mercurial/manifest.py	Tue Mar 31 22:45:45 2015 -0700
+++ b/mercurial/manifest.py	Fri Mar 27 22:26:41 2015 -0700
@@ -11,8 +11,7 @@ 
 
 propertycache = util.propertycache
 
-def _parse(data):
-    """Generates (path, node, flags) tuples from a manifest text"""
+def _parsev1(data):
     # This method does a little bit of excessive-looking
     # precondition checking. This is so that the behavior of this
     # class exactly matches its C counterpart to try and help
@@ -31,6 +30,34 @@ 
         else:
             yield f, revlog.bin(n), ''
 
+def _parsev2(data):
+    metadataend = data.find('\n')
+    # Just ignore metadata for now
+    pos = metadataend + 1
+    prevf = ''
+    while pos < len(data):
+        end = data.find('\n', pos + 1) # +1 to skip stem length byte
+        if end == -1:
+            raise ValueError('Manifest ended with incomplete file entry.')
+        stemlen = ord(data[pos])
+        items = data[pos + 1:end].split('\0')
+        f = prevf[:stemlen] + items[0]
+        if prevf > f:
+            raise ValueError('Manifest entries not in sorted order.')
+        fl = items[1]
+        # Just ignore metadata (items[2:] for now)
+        n = data[end + 1:end + 21]
+        yield f, n, fl
+        pos = end + 22
+        prevf = f
+
+def _parse(data):
+    """Generates (path, node, flags) tuples from a manifest text"""
+    if data.startswith('\0'):
+        return iter(_parsev2(data))
+    else:
+        return iter(_parsev1(data))
+
 def _text(it):
     """Given an iterator over (path, node, flags) tuples, returns a manifest
     text"""
@@ -116,7 +143,13 @@ 
 
 class manifestdict(object):
     def __init__(self, data=''):
-        self._lm = _lazymanifest(data)
+        if data.startswith('\0'):
+            #_lazymanifest can not parse v2
+            self._lm = _lazymanifest('')
+            for f, n, fl in _parsev2(data):
+                self._lm[f] = n, fl
+        else:
+            self._lm = _lazymanifest(data)
 
     def __getitem__(self, key):
         return self._lm[key][0]
diff -r 7530c75651b0 -r aca6ee57dddf tests/test-manifest.py
--- a/tests/test-manifest.py	Tue Mar 31 22:45:45 2015 -0700
+++ b/tests/test-manifest.py	Fri Mar 27 22:26:41 2015 -0700
@@ -8,6 +8,7 @@ 
 from mercurial import match as matchmod
 
 EMTPY_MANIFEST = ''
+EMTPY_MANIFEST_V2 = '\0\n'
 
 HASH_1 = '1' * 40
 BIN_HASH_1 = binascii.unhexlify(HASH_1)
@@ -24,6 +25,42 @@ 
          'flag2': 'l',
          }
 
+# Same data as A_SHORT_MANIFEST
+A_SHORT_MANIFEST_V2 = (
+    '\0\n'
+    '\x00bar/baz/qux.py\0%(flag2)s\n%(hash2)s\n'
+    '\x00foo\0%(flag1)s\n%(hash1)s\n'
+    ) % {'hash1': BIN_HASH_1,
+         'flag1': '',
+         'hash2': BIN_HASH_2,
+         'flag2': 'l',
+         }
+
+# Same data as A_SHORT_MANIFEST
+A_METADATA_MANIFEST = (
+    '\0foo\0bar\n'
+    '\x00bar/baz/qux.py\0%(flag2)s\0foo\0bar\n%(hash2)s\n' # flag and metadata
+    '\x00foo\0%(flag1)s\0foo\n%(hash1)s\n' # no flag, but metadata
+    ) % {'hash1': BIN_HASH_1,
+         'flag1': '',
+         'hash2': BIN_HASH_2,
+         'flag2': 'l',
+         }
+
+A_STEM_COMPRESSED_MANIFEST = (
+    '\0\n'
+    '\x00bar/baz/qux.py\0%(flag2)s\n%(hash2)s\n'
+    '\x04qux/foo.py\0%(flag1)s\n%(hash1)s\n' # simple case of 4 stem chars
+    '\x0az.py\0%(flag1)s\n%(hash1)s\n' # tricky newline = 10 stem characters
+    '\x00%(verylongdir)sx/x\0\n%(hash1)s\n'
+    '\xffx/y\0\n%(hash2)s\n' # more than 255 stem chars
+    ) % {'hash1': BIN_HASH_1,
+         'flag1': '',
+         'hash2': BIN_HASH_2,
+         'flag2': 'l',
+         'verylongdir': 255 * 'x',
+         }
+
 A_DEEPER_MANIFEST = (
     'a/b/c/bar.py\0%(hash3)s%(flag1)s\n'
     'a/b/c/bar.txt\0%(hash1)s%(flag1)s\n'
@@ -77,6 +114,11 @@ 
         self.assertEqual(0, len(m))
         self.assertEqual([], list(m))
 
+    def testEmptyManifestv2(self):
+        m = parsemanifest(EMTPY_MANIFEST_V2)
+        self.assertEqual(0, len(m))
+        self.assertEqual([], list(m))
+
     def testManifest(self):
         m = parsemanifest(A_SHORT_MANIFEST)
         self.assertEqual(['bar/baz/qux.py', 'foo'], list(m))
@@ -86,6 +128,25 @@ 
         self.assertEqual('', m.flags('foo'))
         self.assertRaises(KeyError, lambda : m['wat'])
 
+    def testParseManifestV2(self):
+        m1 = parsemanifest(A_SHORT_MANIFEST)
+        m2 = parsemanifest(A_SHORT_MANIFEST_V2)
+        # Should have same content as A_SHORT_MANIFEST
+        self.assertEqual(m1.text(), m2.text())
+
+    def testParseManifestMetadata(self):
+        # Metadata is for future-proofing and should be accepted but ignored
+        m = parsemanifest(A_METADATA_MANIFEST)
+        self.assertEqual(A_SHORT_MANIFEST, m.text())
+
+    def testParseManifestStemCompression(self):
+        m = parsemanifest(A_STEM_COMPRESSED_MANIFEST)
+        self.assertIn('bar/baz/qux.py', m)
+        self.assertIn('bar/qux/foo.py', m)
+        self.assertIn('bar/qux/foz.py', m)
+        self.assertIn(256 * 'x' + '/x', m)
+        self.assertIn(256 * 'x' + '/y', m)
+
     def testSetItem(self):
         want = BIN_HASH_1