Patchwork changegroup: allow sending snapshot deltas in cg2

login
register
mail settings
Submitter adgar@google.com
Date Nov. 21, 2014, 10:53 p.m.
Message ID <ce1d4cdad3e2a324198a.1416610411@adgar.nyc.corp.google.com>
Download mbox | patch
Permalink /patch/6821/
State Rejected
Headers show

Comments

adgar@google.com - Nov. 21, 2014, 10:53 p.m.
# HG changeset patch
# User Mike Edgar <adgar@google.com>
# Date 1416610038 18000
#      Fri Nov 21 17:47:18 2014 -0500
# Node ID ce1d4cdad3e2a324198a348f5a62f86e9b0e1a73
# Parent  a179db3db9b96b38c10c491e6e7e7ad5f40a7787
changegroup: allow sending snapshot deltas in cg2

The changegroup2 format allows each revision to be sent with
a configurable base. That base is presently restricted to
p1, p2, or the previous revision in the revlog. By allowing
a null base, we can send snapshot deltas when it is efficient
to do so.
Matt Mackall - Nov. 22, 2014, 1:28 a.m.
On Fri, 2014-11-21 at 17:53 -0500, Mike Edgar wrote:
> # HG changeset patch
> # User Mike Edgar <adgar@google.com>
> # Date 1416610038 18000
> #      Fri Nov 21 17:47:18 2014 -0500
> # Node ID ce1d4cdad3e2a324198a348f5a62f86e9b0e1a73
> # Parent  a179db3db9b96b38c10c491e6e7e7ad5f40a7787
> changegroup: allow sending snapshot deltas in cg2
> 
> The changegroup2 format allows each revision to be sent with
> a configurable base. That base is presently restricted to
> p1, p2, or the previous revision in the revlog. By allowing
> a null base, we can send snapshot deltas when it is efficient
> to do so.

Not sure what you mean by efficient here. We generally assume that
bandwidth is more scarce than CPU, so calculating a new delta is
generally preferred to sending a full revision. This seems to prefer the
opposite trade-off?

> diff -r a179db3db9b9 -r ce1d4cdad3e2 mercurial/changegroup.py
> --- a/mercurial/changegroup.py	Wed Nov 19 23:15:07 2014 -0800
> +++ b/mercurial/changegroup.py	Fri Nov 21 17:47:18 2014 -0500
> @@ -455,9 +455,9 @@
>  
>      def deltaparent(self, revlog, rev, p1, p2, prev):
>          dp = revlog.deltaparent(rev)
> -        # avoid storing full revisions; pick prev in those cases
> -        # also pick prev when we can't be sure remote has dp
> -        if dp == nullrev or (dp != p1 and dp != p2 and dp != prev):
> +        # if revlog delta base is null, send a snapshot, no known good base.
> +        # otherwise, pick prev when we can't be sure remote has dp
> +        if dp != nullrev and dp != p1 and dp != p2 and dp != prev:
>              return prev
>          return dp
>  
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@selenic.com
> http://selenic.com/mailman/listinfo/mercurial-devel
Pierre-Yves David - Nov. 25, 2014, 9:17 p.m.
On 11/21/2014 05:28 PM, Matt Mackall wrote:
> On Fri, 2014-11-21 at 17:53 -0500, Mike Edgar wrote:
>> # HG changeset patch
>> # User Mike Edgar <adgar@google.com>
>> # Date 1416610038 18000
>> #      Fri Nov 21 17:47:18 2014 -0500
>> # Node ID ce1d4cdad3e2a324198a348f5a62f86e9b0e1a73
>> # Parent  a179db3db9b96b38c10c491e6e7e7ad5f40a7787
>> changegroup: allow sending snapshot deltas in cg2
>>
>> The changegroup2 format allows each revision to be sent with
>> a configurable base. That base is presently restricted to
>> p1, p2, or the previous revision in the revlog. By allowing
>> a null base, we can send snapshot deltas when it is efficient
>> to do so.
>
> Not sure what you mean by efficient here. We generally assume that
> bandwidth is more scarce than CPU, so calculating a new delta is
> generally preferred to sending a full revision. This seems to prefer the
> opposite trade-off?

This change comes to support the "Censored" nodes effort. The censored 
node cannot be used as delta base (and would likely be inefficient 
anyway) so we have an alternative. It some case it would be an easy but 
suboptinal solution to issue a full delta. In some other it will be the 
only available option.


>
>> diff -r a179db3db9b9 -r ce1d4cdad3e2 mercurial/changegroup.py
>> --- a/mercurial/changegroup.py	Wed Nov 19 23:15:07 2014 -0800
>> +++ b/mercurial/changegroup.py	Fri Nov 21 17:47:18 2014 -0500
>> @@ -455,9 +455,9 @@
>>
>>       def deltaparent(self, revlog, rev, p1, p2, prev):
>>           dp = revlog.deltaparent(rev)
>> -        # avoid storing full revisions; pick prev in those cases
>> -        # also pick prev when we can't be sure remote has dp
>> -        if dp == nullrev or (dp != p1 and dp != p2 and dp != prev):
>> +        # if revlog delta base is null, send a snapshot, no known good base.
>> +        # otherwise, pick prev when we can't be sure remote has dp
>> +        if dp != nullrev and dp != p1 and dp != p2 and dp != prev:

However, this part seems related to the create of bundle, not there 
consumption. Does this mean the unpacker already have the logic to deal 
with base againts nullrev in the middle of the stream?
Matt Mackall - Nov. 25, 2014, 11:58 p.m.
On Tue, 2014-11-25 at 13:17 -0800, Pierre-Yves David wrote:
> 
> On 11/21/2014 05:28 PM, Matt Mackall wrote:
> > On Fri, 2014-11-21 at 17:53 -0500, Mike Edgar wrote:
> >> # HG changeset patch
> >> # User Mike Edgar <adgar@google.com>
> >> # Date 1416610038 18000
> >> #      Fri Nov 21 17:47:18 2014 -0500
> >> # Node ID ce1d4cdad3e2a324198a348f5a62f86e9b0e1a73
> >> # Parent  a179db3db9b96b38c10c491e6e7e7ad5f40a7787
> >> changegroup: allow sending snapshot deltas in cg2
> >>
> >> The changegroup2 format allows each revision to be sent with
> >> a configurable base. That base is presently restricted to
> >> p1, p2, or the previous revision in the revlog. By allowing
> >> a null base, we can send snapshot deltas when it is efficient
> >> to do so.
> >
> > Not sure what you mean by efficient here. We generally assume that
> > bandwidth is more scarce than CPU, so calculating a new delta is
> > generally preferred to sending a full revision. This seems to prefer the
> > opposite trade-off?
> 
> This change comes to support the "Censored" nodes effort. The censored 
> node cannot be used as delta base (and would likely be inefficient 
> anyway) so we have an alternative. It some case it would be an easy but 
> suboptinal solution to issue a full delta. In some other it will be the 
> only available option.

This doesn't get me any closer to answering the above concerns.

> >> diff -r a179db3db9b9 -r ce1d4cdad3e2 mercurial/changegroup.py
> >> --- a/mercurial/changegroup.py	Wed Nov 19 23:15:07 2014 -0800
> >> +++ b/mercurial/changegroup.py	Fri Nov 21 17:47:18 2014 -0500
> >> @@ -455,9 +455,9 @@
> >>
> >>       def deltaparent(self, revlog, rev, p1, p2, prev):
> >>           dp = revlog.deltaparent(rev)
> >> -        # avoid storing full revisions; pick prev in those cases
> >> -        # also pick prev when we can't be sure remote has dp
> >> -        if dp == nullrev or (dp != p1 and dp != p2 and dp != prev):
> >> +        # if revlog delta base is null, send a snapshot, no known good base.
> >> +        # otherwise, pick prev when we can't be sure remote has dp
> >> +        if dp != nullrev and dp != p1 and dp != p2 and dp != prev:
> 
> However, this part seems related to the create of bundle, not there 
> consumption. Does this mean the unpacker already have the logic to deal 
> with base againts nullrev in the middle of the stream?

It's what the general in generaldelta means: delta against any revision.
adgar@google.com - Nov. 27, 2014, 8:37 p.m.
Sorry for taking a while to get back to this thread - I wanted to take
a big step
back before committing to any one direction.

On Tue, Nov 25, 2014 at 6:58 PM, Matt Mackall <mpm@selenic.com> wrote:
>
> On Tue, 2014-11-25 at 13:17 -0800, Pierre-Yves David wrote:
> >
> > On 11/21/2014 05:28 PM, Matt Mackall wrote:
> > > On Fri, 2014-11-21 at 17:53 -0500, Mike Edgar wrote:
> > >> # HG changeset patch
> > >> # User Mike Edgar <adgar@google.com>
> > >> # Date 1416610038 18000
> > >> #      Fri Nov 21 17:47:18 2014 -0500
> > >> # Node ID ce1d4cdad3e2a324198a348f5a62f86e9b0e1a73
> > >> # Parent  a179db3db9b96b38c10c491e6e7e7ad5f40a7787
> > >> changegroup: allow sending snapshot deltas in cg2
> > >>
> > >> The changegroup2 format allows each revision to be sent with
> > >> a configurable base. That base is presently restricted to
> > >> p1, p2, or the previous revision in the revlog. By allowing
> > >> a null base, we can send snapshot deltas when it is efficient
> > >> to do so.
> > >
> > > Not sure what you mean by efficient here. We generally assume that
> > > bandwidth is more scarce than CPU, so calculating a new delta is
> > > generally preferred to sending a full revision. This seems to prefer the
> > > opposite trade-off?
> >
> > This change comes to support the "Censored" nodes effort. The censored
> > node cannot be used as delta base (and would likely be inefficient
> > anyway) so we have an alternative. It some case it would be an easy but
> > suboptinal solution to issue a full delta. In some other it will be the
> > only available option.
>
> This doesn't get me any closer to answering the above concerns.

I agree with Matt, this change is wrong. I came upon the "no snapshot"
limitation in the course of implementing censorship exchange, but too quickly
assumed lifting the "no snapshot" restriction would be appropriate in
non-censor cases too.

The revlog might not have a deltaparent because there is actually no suitable
delta base globally, but there are purely local reasons it might have stored a
snapshot. Recent changes to limit delta chain length are one example (hg:
76effa770ff9). Variation in compression algorithms are another.

This change trusts the revlog whenever it has no delta base, but the revlog
only provides a *hint* as to a suitable delta base, and changegroup/exchange
ignores that hint if it's unhelpful in the exchange context. For example, if
it can't prove the recipient has the delta base, it ignores the hint. I agree
with Matt that the bandwidth/CPU tradeoff is also an important reason to
ignore the hint.

.
As for the censor line of work: I have a new approach that I'm writing up
and will share on a Wiki page early next week. It will include a channel for
signaling that a revision is tombstoned without decoding the revision's
text. This new signal can ultimately be considered in code paths like this
one.
Gregory Szorc - Nov. 29, 2014, 7:22 p.m.
Mike,

I had a random thought the other day: could your work on censored nodes be
"abused" to inject "pre-history" onto existing repositories?

I ask because when Mozilla moved Firefox from CVS to Mercurial, we started
with a clean repository instead of importing CVS history into Mercurial. If
we were doing this from scratch today, we would likely convert the CVS
history to Mercurial.

Anyway, I'm very casually trying to come up with a way for us to transition
to a repository with full CVS history. Before censored nodes work, I always
thought that we'd need to employ some extension that maintained a SHA-1 map
or similar hackery. I see some of the work around censored nodes and adding
exceptions to hash verification and if I squint hard enough, I think I
could see a way for this to be "abused" to allow pre-history injection.
e.g. we could rewrite the current root commit to have an alternate parent
coming from CVS and we could disable hash verification on that 1 node by
pretending it is a censored node or similar. I haven't been paying super
close attention to your censored node work. I'm curious if this is in the
realm of possibility.

Gregory

On Thu, Nov 27, 2014 at 12:37 PM, Michael Edgar <adgar@google.com> wrote:

> Sorry for taking a while to get back to this thread - I wanted to take
> a big step
> back before committing to any one direction.
>
> On Tue, Nov 25, 2014 at 6:58 PM, Matt Mackall <mpm@selenic.com> wrote:
> >
> > On Tue, 2014-11-25 at 13:17 -0800, Pierre-Yves David wrote:
> > >
> > > On 11/21/2014 05:28 PM, Matt Mackall wrote:
> > > > On Fri, 2014-11-21 at 17:53 -0500, Mike Edgar wrote:
> > > >> # HG changeset patch
> > > >> # User Mike Edgar <adgar@google.com>
> > > >> # Date 1416610038 18000
> > > >> #      Fri Nov 21 17:47:18 2014 -0500
> > > >> # Node ID ce1d4cdad3e2a324198a348f5a62f86e9b0e1a73
> > > >> # Parent  a179db3db9b96b38c10c491e6e7e7ad5f40a7787
> > > >> changegroup: allow sending snapshot deltas in cg2
> > > >>
> > > >> The changegroup2 format allows each revision to be sent with
> > > >> a configurable base. That base is presently restricted to
> > > >> p1, p2, or the previous revision in the revlog. By allowing
> > > >> a null base, we can send snapshot deltas when it is efficient
> > > >> to do so.
> > > >
> > > > Not sure what you mean by efficient here. We generally assume that
> > > > bandwidth is more scarce than CPU, so calculating a new delta is
> > > > generally preferred to sending a full revision. This seems to prefer
> the
> > > > opposite trade-off?
> > >
> > > This change comes to support the "Censored" nodes effort. The censored
> > > node cannot be used as delta base (and would likely be inefficient
> > > anyway) so we have an alternative. It some case it would be an easy but
> > > suboptinal solution to issue a full delta. In some other it will be the
> > > only available option.
> >
> > This doesn't get me any closer to answering the above concerns.
>
> I agree with Matt, this change is wrong. I came upon the "no snapshot"
> limitation in the course of implementing censorship exchange, but too
> quickly
> assumed lifting the "no snapshot" restriction would be appropriate in
> non-censor cases too.
>
> The revlog might not have a deltaparent because there is actually no
> suitable
> delta base globally, but there are purely local reasons it might have
> stored a
> snapshot. Recent changes to limit delta chain length are one example (hg:
> 76effa770ff9). Variation in compression algorithms are another.
>
> This change trusts the revlog whenever it has no delta base, but the revlog
> only provides a *hint* as to a suitable delta base, and
> changegroup/exchange
> ignores that hint if it's unhelpful in the exchange context. For example,
> if
> it can't prove the recipient has the delta base, it ignores the hint. I
> agree
> with Matt that the bandwidth/CPU tradeoff is also an important reason to
> ignore the hint.
>
> .
> As for the censor line of work: I have a new approach that I'm writing up
> and will share on a Wiki page early next week. It will include a channel
> for
> signaling that a revision is tombstoned without decoding the revision's
> text. This new signal can ultimately be considered in code paths like this
> one.
>
>
>
>
>
> --
> Michael Edgar | Software Engineer | adgar@google.com | 518-496-6958
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@selenic.com
> http://selenic.com/mailman/listinfo/mercurial-devel
>
adgar@google.com - Dec. 1, 2014, 1:39 a.m.
On Sat, Nov 29, 2014 at 2:22 PM, Gregory Szorc <gregory.szorc@gmail.com> wrote:
>
> Mike,
>
> I had a random thought the other day: could your work on censored nodes be "abused" to inject "pre-history" onto existing repositories?


Hi Gregory,

Thanks for taking interest! Unfortunately, censor support as envisioned
probably won't help as directly as you'd like. Presently I've been limiting
censorship support to filelogs, since the goal was to hide sensitive content
and I'm trying super hard to keep the surface area minimal.

An extension or core addition supporting your use case will definitely be
easier once the censorship work settles into core. The details around exchange
are general and seem the trickiest part so far.

The additional work for changelogs would be to define how tombstone metadata is
stored (presumably a key in "extra"), check it in changelog's #checkhash
method, then and add special casing for censorship whenever changelog entries
get read.

Let me know if you have any more questions! I anticipate sharing a detailed
design on a wiki page early this week.

Cheers!

Mike


>
> On Thu, Nov 27, 2014 at 12:37 PM, Michael Edgar <adgar@google.com> wrote:
>>
>> Sorry for taking a while to get back to this thread - I wanted to take
>> a big step
>> back before committing to any one direction.
>>
>> On Tue, Nov 25, 2014 at 6:58 PM, Matt Mackall <mpm@selenic.com> wrote:
>> >
>> > On Tue, 2014-11-25 at 13:17 -0800, Pierre-Yves David wrote:
>> > >
>> > > On 11/21/2014 05:28 PM, Matt Mackall wrote:
>> > > > On Fri, 2014-11-21 at 17:53 -0500, Mike Edgar wrote:
>> > > >> # HG changeset patch
>> > > >> # User Mike Edgar <adgar@google.com>
>> > > >> # Date 1416610038 18000
>> > > >> #      Fri Nov 21 17:47:18 2014 -0500
>> > > >> # Node ID ce1d4cdad3e2a324198a348f5a62f86e9b0e1a73
>> > > >> # Parent  a179db3db9b96b38c10c491e6e7e7ad5f40a7787
>> > > >> changegroup: allow sending snapshot deltas in cg2
>> > > >>
>> > > >> The changegroup2 format allows each revision to be sent with
>> > > >> a configurable base. That base is presently restricted to
>> > > >> p1, p2, or the previous revision in the revlog. By allowing
>> > > >> a null base, we can send snapshot deltas when it is efficient
>> > > >> to do so.
>> > > >
>> > > > Not sure what you mean by efficient here. We generally assume that
>> > > > bandwidth is more scarce than CPU, so calculating a new delta is
>> > > > generally preferred to sending a full revision. This seems to prefer the
>> > > > opposite trade-off?
>> > >
>> > > This change comes to support the "Censored" nodes effort. The censored
>> > > node cannot be used as delta base (and would likely be inefficient
>> > > anyway) so we have an alternative. It some case it would be an easy but
>> > > suboptinal solution to issue a full delta. In some other it will be the
>> > > only available option.
>> >
>> > This doesn't get me any closer to answering the above concerns.
>>
>> I agree with Matt, this change is wrong. I came upon the "no snapshot"
>> limitation in the course of implementing censorship exchange, but too quickly
>> assumed lifting the "no snapshot" restriction would be appropriate in
>> non-censor cases too.
>>
>> The revlog might not have a deltaparent because there is actually no suitable
>> delta base globally, but there are purely local reasons it might have stored a
>> snapshot. Recent changes to limit delta chain length are one example (hg:
>> 76effa770ff9). Variation in compression algorithms are another.
>>
>> This change trusts the revlog whenever it has no delta base, but the revlog
>> only provides a *hint* as to a suitable delta base, and changegroup/exchange
>> ignores that hint if it's unhelpful in the exchange context. For example, if
>> it can't prove the recipient has the delta base, it ignores the hint. I agree
>> with Matt that the bandwidth/CPU tradeoff is also an important reason to
>> ignore the hint.
>>
>> .
>> As for the censor line of work: I have a new approach that I'm writing up
>> and will share on a Wiki page early next week. It will include a channel for
>> signaling that a revision is tombstoned without decoding the revision's
>> text. This new signal can ultimately be considered in code paths like this
>> one.
>>
>>
>>
>>
>>
>> --
>> Michael Edgar | Software Engineer | adgar@google.com | 518-496-6958
>> _______________________________________________
>> Mercurial-devel mailing list
>> Mercurial-devel@selenic.com
>> http://selenic.com/mailman/listinfo/mercurial-devel
>
>

Patch

diff -r a179db3db9b9 -r ce1d4cdad3e2 mercurial/changegroup.py
--- a/mercurial/changegroup.py	Wed Nov 19 23:15:07 2014 -0800
+++ b/mercurial/changegroup.py	Fri Nov 21 17:47:18 2014 -0500
@@ -455,9 +455,9 @@ 
 
     def deltaparent(self, revlog, rev, p1, p2, prev):
         dp = revlog.deltaparent(rev)
-        # avoid storing full revisions; pick prev in those cases
-        # also pick prev when we can't be sure remote has dp
-        if dp == nullrev or (dp != p1 and dp != p2 and dp != prev):
+        # if revlog delta base is null, send a snapshot, no known good base.
+        # otherwise, pick prev when we can't be sure remote has dp
+        if dp != nullrev and dp != p1 and dp != p2 and dp != prev:
             return prev
         return dp