Patchwork [1,of,5,rfc] tests: explore some bdiff cases

login
register
mail settings
Submitter Mads Kiilerich
Date Nov. 3, 2016, 9:34 p.m.
Message ID <f6408efe0d0f4179fe6c.1478208851@madski>
Download mbox | patch
Permalink /patch/17322/
State Deferred
Headers show

Comments

Mads Kiilerich - Nov. 3, 2016, 9:34 p.m.
# HG changeset patch
# User Mads Kiilerich <madski@unity3d.com>
# Date 1478208837 -3600
#      Thu Nov 03 22:33:57 2016 +0100
# Node ID f6408efe0d0f4179fe6cc2b967164c1b4567f3d6
# Parent  d06c049695e6ad3219e7479c65ce98a2f123e878
tests: explore some bdiff cases
Yuya Nishihara - Nov. 6, 2016, 9:07 a.m.
On Thu, 03 Nov 2016 22:34:11 +0100, Mads Kiilerich wrote:
> # HG changeset patch
> # User Mads Kiilerich <madski@unity3d.com>
> # Date 1478208837 -3600
> #      Thu Nov 03 22:33:57 2016 +0100
> # Node ID f6408efe0d0f4179fe6cc2b967164c1b4567f3d6
> # Parent  d06c049695e6ad3219e7479c65ce98a2f123e878
> tests: explore some bdiff cases
> 
> diff --git a/tests/test-bhalf.t b/tests/test-bhalf.t
> new file mode 100644
> --- /dev/null
> +++ b/tests/test-bhalf.t

'#require no-pure' is necessary since we use difflib in pure.

The other changes in this series look good to me, but it's bdiff.c so I
don't queue them.
Mads Kiilerich - Nov. 6, 2016, 3:56 p.m.
On 11/06/2016 10:07 AM, Yuya Nishihara wrote:
> On Thu, 03 Nov 2016 22:34:11 +0100, Mads Kiilerich wrote:
>> # HG changeset patch
>> # User Mads Kiilerich <madski@unity3d.com>
>> # Date 1478208837 -3600
>> #      Thu Nov 03 22:33:57 2016 +0100
>> # Node ID f6408efe0d0f4179fe6cc2b967164c1b4567f3d6
>> # Parent  d06c049695e6ad3219e7479c65ce98a2f123e878
>> tests: explore some bdiff cases
>>
>> diff --git a/tests/test-bhalf.t b/tests/test-bhalf.t
>> new file mode 100644
>> --- /dev/null
>> +++ b/tests/test-bhalf.t
> '#require no-pure' is necessary since we use difflib in pure.
>
> The other changes in this series look good to me, but it's bdiff.c so I
> don't queue them.

Thanks for reviewing and the positive feedback. I will try to polish it 
for "real" submission.

For the last patch, I wonder if it would be better to add a post 
processing step that - given all the chunks - try to shift/rotate all 
match sequences to be as early as possible (and thus deltas to be as 
late and "appending" as possible). That could give more readable diffs, 
especially when combined with heuristics for preferring chunks starting 
with the lowest amount of indentation.

One lesson from these changes seems to be that it is a problem that we 
use the same low level diff algorithm for revlog delta storage and 
bundles and for readable patch diffs. One idea that got mentioned at the 
latest sprint was to use zstandard for storage and "just" seed it with 
the "a" version of the file as dictionary and let it compress the "b" 
side. That might be a better long term solution.

More short term, I wonder how much we could gain from somehow teaching 
bdiff to consider both parents for each chunk instead of just using 
deltas from one side and store chunks from the other verbatim. I think 
that could make a significant difference for repositories with a lot of 
big merges in files or the manifest.

/Mads
Pierre-Yves David - Nov. 7, 2016, 2:47 p.m.
On 11/06/2016 04:56 PM, Mads Kiilerich wrote:
> On 11/06/2016 10:07 AM, Yuya Nishihara wrote:
>> On Thu, 03 Nov 2016 22:34:11 +0100, Mads Kiilerich wrote:
>>> # HG changeset patch
>>> # User Mads Kiilerich <madski@unity3d.com>
>>> # Date 1478208837 -3600
>>> #      Thu Nov 03 22:33:57 2016 +0100
>>> # Node ID f6408efe0d0f4179fe6cc2b967164c1b4567f3d6
>>> # Parent  d06c049695e6ad3219e7479c65ce98a2f123e878
>>> tests: explore some bdiff cases
>>>
>>> diff --git a/tests/test-bhalf.t b/tests/test-bhalf.t
>>> new file mode 100644
>>> --- /dev/null
>>> +++ b/tests/test-bhalf.t
>> '#require no-pure' is necessary since we use difflib in pure.
>>
>> The other changes in this series look good to me, but it's bdiff.c so I
>> don't queue them.
>
> Thanks for reviewing and the positive feedback. I will try to polish it
> for "real" submission.

Like yuya, it does not seems to be anything wrong in this series (and it 
seems to move in the right direction). That last last statement from you 
is bit confusing to me as the series seems okay for inclusion but you 
seems to imply you'll resent it.

I'm also adding Jun and Greg in CC as they have been looking into bdiff 
recently and I'm curious about what they might think.

Also, it might be a good idea to have some timing data in the descript 
to show that the new code is not making things slower (but we really 
should just have some automated performance tracking for this at this 
point :-/)

> For the last patch, I wonder if it would be better to add a post
> processing step that - given all the chunks - try to shift/rotate all
> match sequences to be as early as possible (and thus deltas to be as
> late and "appending" as possible). That could give more readable diffs,
> especially when combined with heuristics for preferring chunks starting
> with the lowest amount of indentation.
>
> One lesson from these changes seems to be that it is a problem that we
> use the same low level diff algorithm for revlog delta storage and
> bundles and for readable patch diffs. One idea that got mentioned at the
> latest sprint was to use zstandard for storage and "just" seed it with
> the "a" version of the file as dictionary and let it compress the "b"
> side. That might be a better long term solution.

That's a nice and funny solution I wonder what are the performance and 
compression level we gains from that.

> More short term, I wonder how much we could gain from somehow teaching
> bdiff to consider both parents for each chunk instead of just using
> deltas from one side and store chunks from the other verbatim. I think
> that could make a significant difference for repositories with a lot of
> big merges in files or the manifest.
>
> /Mads
>
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@mercurial-scm.org
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Gregory Szorc - Nov. 7, 2016, 5:34 p.m.
On Mon, Nov 7, 2016 at 6:47 AM, Pierre-Yves David <
pierre-yves.david@ens-lyon.org> wrote:

>
>
> On 11/06/2016 04:56 PM, Mads Kiilerich wrote:
>
>> On 11/06/2016 10:07 AM, Yuya Nishihara wrote:
>>
>>> On Thu, 03 Nov 2016 22:34:11 +0100, Mads Kiilerich wrote:
>>>
>>>> # HG changeset patch
>>>> # User Mads Kiilerich <madski@unity3d.com>
>>>> # Date 1478208837 -3600
>>>> #      Thu Nov 03 22:33:57 2016 +0100
>>>> # Node ID f6408efe0d0f4179fe6cc2b967164c1b4567f3d6
>>>> # Parent  d06c049695e6ad3219e7479c65ce98a2f123e878
>>>> tests: explore some bdiff cases
>>>>
>>>> diff --git a/tests/test-bhalf.t b/tests/test-bhalf.t
>>>> new file mode 100644
>>>> --- /dev/null
>>>> +++ b/tests/test-bhalf.t
>>>>
>>> '#require no-pure' is necessary since we use difflib in pure.
>>>
>>> The other changes in this series look good to me, but it's bdiff.c so I
>>> don't queue them.
>>>
>>
>> Thanks for reviewing and the positive feedback. I will try to polish it
>> for "real" submission.
>>
>
> Like yuya, it does not seems to be anything wrong in this series (and it
> seems to move in the right direction). That last last statement from you is
> bit confusing to me as the series seems okay for inclusion but you seems to
> imply you'll resent it.
>
> I'm also adding Jun and Greg in CC as they have been looking into bdiff
> recently and I'm curious about what they might think.
>
> Also, it might be a good idea to have some timing data in the descript to
> show that the new code is not making things slower (but we really should
> just have some automated performance tracking for this at this point :-/)
>
> For the last patch, I wonder if it would be better to add a post
>> processing step that - given all the chunks - try to shift/rotate all
>> match sequences to be as early as possible (and thus deltas to be as
>> late and "appending" as possible). That could give more readable diffs,
>> especially when combined with heuristics for preferring chunks starting
>> with the lowest amount of indentation.
>>
>> One lesson from these changes seems to be that it is a problem that we
>> use the same low level diff algorithm for revlog delta storage and
>> bundles and for readable patch diffs. One idea that got mentioned at the
>> latest sprint was to use zstandard for storage and "just" seed it with
>> the "a" version of the file as dictionary and let it compress the "b"
>> side. That might be a better long term solution.
>>
>
> That's a nice and funny solution I wonder what are the performance and
> compression level we gains from that.


Using zstd dictionary compression for alternate revlog delta storage was
proposed. However, experimenting with dictionary compression is far down my
priority list for zstd because it requires figuring out how to store
dictionary data. It is much simpler to use zstd as essentially a drop-in
replacement for zlib and figure out the dictionary foo later.

My priorities for zstd are roughly:

1. wire protocol for bundle exchange
2. bundle compression
3. revlog storage
4. dictionary compression and anything else related to changing revlog
"schema"


>
>
> More short term, I wonder how much we could gain from somehow teaching
>> bdiff to consider both parents for each chunk instead of just using
>> deltas from one side and store chunks from the other verbatim. I think
>> that could make a significant difference for repositories with a lot of
>> big merges in files or the manifest.
>>
>> /Mads
>>
>> _______________________________________________
>> Mercurial-devel mailing list
>> Mercurial-devel@mercurial-scm.org
>> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
>>
>
> --
> Pierre-Yves David
>

Patch

diff --git a/tests/test-bhalf.t b/tests/test-bhalf.t
new file mode 100644
--- /dev/null
+++ b/tests/test-bhalf.t
@@ -0,0 +1,140 @@ 
+A couple of test cases exploring the bdiff implementation.
+
+Diff of boring files:
+
+  $ hg init repo1
+  $ cd repo1
+  $ (for i in `seq 15`; do echo "once upon a time $i"; echo "The quick brown fox jumps over the lazy dog"; done; echo) > this-is-the-filename
+  $ hg add this-is-the-filename
+  $ hg ci -m "commit message commit message commit message commit message commit message commit message commit message commit message"
+  $ (for i in `seq 15`; do echo "twice upon a time $i"; echo "The quick brown fox jumps over the lazy dog"; done; echo) > this-is-the-filename
+  $ hg diff --git
+  diff --git a/this-is-the-filename b/this-is-the-filename
+  --- a/this-is-the-filename
+  +++ b/this-is-the-filename
+  @@ -1,31 +1,31 @@
+  -once upon a time 1
+  -The quick brown fox jumps over the lazy dog
+  -once upon a time 2
+  -The quick brown fox jumps over the lazy dog
+  -once upon a time 3
+  -The quick brown fox jumps over the lazy dog
+  -once upon a time 4
+  -The quick brown fox jumps over the lazy dog
+  -once upon a time 5
+  -The quick brown fox jumps over the lazy dog
+  -once upon a time 6
+  -The quick brown fox jumps over the lazy dog
+  -once upon a time 7
+  +twice upon a time 1
+   The quick brown fox jumps over the lazy dog
+  -once upon a time 8
+  -The quick brown fox jumps over the lazy dog
+  -once upon a time 9
+  -The quick brown fox jumps over the lazy dog
+  -once upon a time 10
+  +twice upon a time 2
+   The quick brown fox jumps over the lazy dog
+  -once upon a time 11
+  -The quick brown fox jumps over the lazy dog
+  -once upon a time 12
+  +twice upon a time 3
+   The quick brown fox jumps over the lazy dog
+  -once upon a time 13
+  +twice upon a time 4
+   The quick brown fox jumps over the lazy dog
+  -once upon a time 14
+  +twice upon a time 5
+   The quick brown fox jumps over the lazy dog
+  -once upon a time 15
+  +twice upon a time 6
+  +The quick brown fox jumps over the lazy dog
+  +twice upon a time 7
+  +The quick brown fox jumps over the lazy dog
+  +twice upon a time 8
+  +The quick brown fox jumps over the lazy dog
+  +twice upon a time 9
+  +The quick brown fox jumps over the lazy dog
+  +twice upon a time 10
+  +The quick brown fox jumps over the lazy dog
+  +twice upon a time 11
+  +The quick brown fox jumps over the lazy dog
+  +twice upon a time 12
+  +The quick brown fox jumps over the lazy dog
+  +twice upon a time 13
+  +The quick brown fox jumps over the lazy dog
+  +twice upon a time 14
+  +The quick brown fox jumps over the lazy dog
+  +twice upon a time 15
+   The quick brown fox jumps over the lazy dog
+   
+That's an odd diff for a trivial change!
+
+  $ hg ci -m "commit message commit message commit message commit message commit message commit message commit message commit message"
+  $ hg bundle --base null ../bundle.hg
+  2 changesets found
+  $ cd ..
+
+  $ f --size bundle.hg
+  bundle.hg: size=878
+
+Explore some bdiff implementation edge cases:
+
+  $ hg init repo2
+  $ cd repo2
+  $ cat << EOF >> x
+  > a
+  > EOF
+  $ cat << EOF >> y
+  > a
+  > a
+  > a
+  > EOF
+  $ cat << EOF >> z
+  > a
+  > a
+  > a
+  > a
+  > a
+  > EOF
+  $ hg ci -qAm0
+  $ cat << EOF > x
+  > a
+  > a
+  > a
+  > EOF
+  $ cat << EOF > y
+  > a
+  > EOF
+  $ cat << EOF > z
+  > a
+  > EOF
+  $ hg diff --git
+  diff --git a/x b/x
+  --- a/x
+  +++ b/x
+  @@ -1,1 +1,3 @@
+   a
+  +a
+  +a
+  diff --git a/y b/y
+  --- a/y
+  +++ b/y
+  @@ -1,3 +1,1 @@
+   a
+  -a
+  -a
+  diff --git a/z b/z
+  --- a/z
+  +++ b/z
+  @@ -1,5 +1,1 @@
+  -a
+   a
+  -a
+  -a
+  -a
+
+x and y shows the preference for adding / removing at the end of sequences ...
+z just seems weird.
+
+  $ cd ..