Patchwork [1,of,1] bdiff.c: implemented block-delimiting to better deal with long "lines"

login
register
mail settings
Submitter Friedrich Kastner-Masilko
Date May 14, 2014, 9:54 p.m.
Message ID <1b57d1650cd2d5aa8c6c.1400104473@CAT36756.festo.net>
Download mbox | patch
Permalink /patch/4738/
State Rejected
Headers show

Comments

Friedrich Kastner-Masilko - May 14, 2014, 9:54 p.m.
# HG changeset patch
# User Friedrich Kastner-Masilko <kastner-masilko@at.festo.com>
# Date 1400101077 -7200
#      Wed May 14 22:57:57 2014 +0200
# Node ID 1b57d1650cd2d5aa8c6cc103c344ecbd8fbabc39
# Parent  1ae3cd6f836c3c96ee3e9a872c8e966750910c2d
bdiff.c: implemented block-delimiting to better deal with long "lines"

Recent XML-based file formats often resemble human-readable text without a single line-break. This mostly comes from serialization of binary data into the XML format without well-forming the content for viewing. Storing such files with the current revlog implementation results in ineffective storage due to the used bdiff line-based algorithm. Since bdiff creates chunks based on the line-break mark, the whole file content is considered as one chunk, thus creating a delta as big (or even bigger) as the file itself.

This patch is introducing block-limiting of lines. All lines encountered will be split into 4k blocks, thus giving the algorithm a chance to create smaller deltas, especially if the changes are at the end of the file. Especially for growing content where the header of the file is never changed, this patch increases the storage efficiency. However, with changes at the beginning of the file the block-limiting is not changing the results w.r.t. the original algorithm. The same is true for standard usage with text-files: because these usually contain lines shorter than 4k characters, the patch never kicks in.
Siddharth Agarwal - May 14, 2014, 10:25 p.m.
On 05/14/2014 02:54 PM, Friedrich Kastner-Masilko wrote:
> # HG changeset patch
> # User Friedrich Kastner-Masilko <kastner-masilko@at.festo.com>
> # Date 1400101077 -7200
> #      Wed May 14 22:57:57 2014 +0200
> # Node ID 1b57d1650cd2d5aa8c6cc103c344ecbd8fbabc39
> # Parent  1ae3cd6f836c3c96ee3e9a872c8e966750910c2d
> bdiff.c: implemented block-delimiting to better deal with long "lines"

http://mercurial.selenic.com/wiki/ContributingChanges

In particular, note the bit about "all lines less than 80 characters" in 
the patch description.


>
> Recent XML-based file formats often resemble human-readable text without a single line-break. This mostly comes from serialization of binary data into the XML format without well-forming the content for viewing. Storing such files with the current revlog implementation results in ineffective storage due to the used bdiff line-based algorithm. Since bdiff creates chunks based on the line-break mark, the whole file content is considered as one chunk, thus creating a delta as big (or even bigger) as the file itself.
>
> This patch is introducing block-limiting of lines. All lines encountered will be split into 4k blocks, thus giving the algorithm a chance to create smaller deltas, especially if the changes are at the end of the file. Especially for growing content where the header of the file is never changed, this patch increases the storage efficiency. However, with changes at the beginning of the file the block-limiting is not changing the results w.r.t. the original algorithm. The same is true for standard usage with text-files: because these usually contain lines shorter than 4k characters, the patch never kicks in.

On the face of it, I think this is a pretty decent idea.

It isn't clear from the patch description whether diffs produced by 
block-limiting are backwards-compatible. i.e. would a diff produced by 
this tweaked bdiff be applicable by an older Mercurial? It doesn't seem 
like you've changed the code that applies diffs, so I presume the answer 
is yes. Even so, it'll be good to have that explicitly stated in the 
description.

Can you add tests to make sure that this works and produces a split output?

There's a pure-Python implementation at 
http://selenic.com/hg/file/1ae3cd6f836c/mercurial/pure/bdiff.py -- could 
you make the same change to this as well?


>
> diff -r 1ae3cd6f836c -r 1b57d1650cd2 mercurial/bdiff.c
> --- a/mercurial/bdiff.c	Tue May 13 19:29:45 2014 -0500
> +++ b/mercurial/bdiff.c	Wed May 14 22:57:57 2014 +0200
> @@ -36,16 +36,17 @@
>   static int splitlines(const char *a, Py_ssize_t len, struct line **lr)
>   {
>   	unsigned hash;
> -	int i;
> +	int i, j;
>   	const char *p, *b = a;
>   	const char * const plast = a + len - 1;
>   	struct line *l;
>   
>   	/* count the lines */
>   	i = 1; /* extra line for sentinel */
> -	for (p = a; p < a + len; p++)
> -		if (*p == '\n' || p == plast)
> -			i++;
> +	j = 0; /* block counter for fixed line delimiting */
> +	for (p = a; p < a + len; p++, j++)
> +		if (*p == '\n' || p == plast || j >= 4096)
> +			i++, j = 0;
>   
>   	*lr = l = (struct line *)malloc(sizeof(struct line) * i);
>   	if (!l)
> @@ -53,11 +54,11 @@
>   
>   	/* build the line array and calculate hashes */
>   	hash = 0;
> -	for (p = a; p < a + len; p++) {
> +	for (p = a, j = 0; p < a + len; p++, j++) {
>   		/* Leonid Yuriev's hash */
>   		hash = (hash * 1664525) + (unsigned char)*p + 1013904223;
>   
> -		if (*p == '\n' || p == plast) {
> +		if (*p == '\n' || p == plast || j >= 4096) {
>   			l->hash = hash;
>   			hash = 0;
>   			l->len = p - b + 1;
> @@ -65,6 +66,7 @@
>   			l->n = INT_MAX;
>   			l++;
>   			b = p + 1;
> +			j = 0;
>   		}
>   	}
>   
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@selenic.com
> http://selenic.com/mailman/listinfo/mercurial-devel
Adrian Buehlmann - May 14, 2014, 11:03 p.m.
On 2014-05-14 23:54, Friedrich Kastner-Masilko wrote:

> diff -r 1ae3cd6f836c -r 1b57d1650cd2 mercurial/bdiff.c
> --- a/mercurial/bdiff.c	Tue May 13 19:29:45 2014 -0500
> +++ b/mercurial/bdiff.c	Wed May 14 22:57:57 2014 +0200
> @@ -36,16 +36,17 @@
>  static int splitlines(const char *a, Py_ssize_t len, struct line **lr)
>  {
>  	unsigned hash;
> -	int i;
> +	int i, j;
>  	const char *p, *b = a;
>  	const char * const plast = a + len - 1;
>  	struct line *l;
>  
>  	/* count the lines */
>  	i = 1; /* extra line for sentinel */
> -	for (p = a; p < a + len; p++)
> -		if (*p == '\n' || p == plast)
> -			i++;
> +	j = 0; /* block counter for fixed line delimiting */
> +	for (p = a; p < a + len; p++, j++)
> +		if (*p == '\n' || p == plast || j >= 4096)
> +			i++, j = 0;
>  
>  	*lr = l = (struct line *)malloc(sizeof(struct line) * i);
>  	if (!l)
> @@ -53,11 +54,11 @@
>  
>  	/* build the line array and calculate hashes */
>  	hash = 0;
> -	for (p = a; p < a + len; p++) {
> +	for (p = a, j = 0; p < a + len; p++, j++) {
>  		/* Leonid Yuriev's hash */
>  		hash = (hash * 1664525) + (unsigned char)*p + 1013904223;
>  
> -		if (*p == '\n' || p == plast) {
> +		if (*p == '\n' || p == plast || j >= 4096) {
>  			l->hash = hash;
>  			hash = 0;
>  			l->len = p - b + 1;

Is this assignment to l->len still correct for the case where j is 4096?

> @@ -65,6 +66,7 @@
>  			l->n = INT_MAX;
>  			l++;
>  			b = p + 1;

Are you sure that "b = p + 1;" is not an off by one error for the case
where j is 4096?

> +			j = 0;
>  		}
>  	}
Friedrich Kastner-Masilko - May 14, 2014, 11:41 p.m.
> From: Siddharth Agarwal [mailto:sid@less-broken.com]
> 
> http://mercurial.selenic.com/wiki/ContributingChanges
> 
> In particular, note the bit about "all lines less than 80 characters"
> in the patch description.

Sorry for that, it's been a while since I've last tried to contribute.
 
> On the face of it, I think this is a pretty decent idea.

Thanks. The idea is from this thread, BTW: http://www.selenic.com/pipermail/mercurial/2014-May/047081.html
 
> It isn't clear from the patch description whether diffs produced by
> block-limiting are backwards-compatible. i.e. would a diff produced by
> this tweaked bdiff be applicable by an older Mercurial? It doesn't seem
> like you've changed the code that applies diffs, so I presume the
> answer
> is yes. Even so, it'll be good to have that explicitly stated in the
> description.

Yes, being minimally invasive was the point of this. AFAIK, bdiff generates hunks that consists of starting point, end point, bytes to exchange, and the actual content. The algorithm used to generate those hunks should not influence the algorithm consuming them, otherwise there is implicit knowledge involved. AFAIK, mpatch.c is not doing that, thus old Mercurial versions should have no problem reading revlogs generated by this bdiff.

> Can you add tests to make sure that this works and produces a split
> output?

What do you mean with split output? I can make tests that check if the situation shown in the thread linked above is fixed appropriately, i.e. not always storing a snapshot.
 
> There's a pure-Python implementation at
> http://selenic.com/hg/file/1ae3cd6f836c/mercurial/pure/bdiff.py --
> could
> you make the same change to this as well?

I could, but I don't know if it would be as minimally invasive (performance-wise) as the C version is.



> From: Adrian Buehlmann [mailto:adrian@cadifra.com]
>
> > -		if (*p == '\n' || p == plast) {
> > +		if (*p == '\n' || p == plast || j >= 4096) {
> >  			l->hash = hash;
> >  			hash = 0;
> >  			l->len = p - b + 1;
> 
> Is this assignment to l->len still correct for the case where j is
> 4096?

Yes. IMHO, there is no difference between having a character's content or an "outside" event like a counter decide whether a "line" should be enclosed or not. Of course "len" is not depicting the length of an actual line here, but it is correctly storing the length of the chunk. And this is what counts here AFAICS.
 
> > @@ -65,6 +66,7 @@
> >  			l->n = INT_MAX;
> >  			l++;
> >  			b = p + 1;
> 
> Are you sure that "b = p + 1;" is not an off by one error for the case
> where j is 4096?

Yes, I am. Not only for the above reasons, but also for the fact that stored files with the patch in effect restore just fine, even though the patch definitely changed the deltas involved as shown by different sizes. Wouldn't an off-by-one error here completely spoil the file's content on restoring a block-delimited delta-chain?

regards,
Fritz



Development Software Systems
Festo Gesellschaft m.b.H.
Linzer Strasse 227
Austria - 1140 Wien

Firmenbuch Wien
FN 38435y
UID: ATU14650108

Tel: +43(1)91075-198
Fax: 
www.festo.at

Der Inhalt dieser E-Mail und moeglicher Anhaenge sind ausschliesslich fuer den bezeichneten Adressaten bestimmt.
Jede Form der Kenntnisnahme, Veroeffentlichung, Vervielfaeltigung oder Weitergabe des Inhalts dieser E-Mail und
moeglicher Anhaenge durch unberechtigte Dritte ist unzulaessig. Wir bitten Sie, sich mit dem Absender der E-Mail in
Verbindung zu setzen, falls Sie nicht der Adressat dieser E-Mail sind sowie das Material von Ihrem Computer zu loeschen.

This e-mail and any attachments are confidential and intended solely for the addressee. The perusal, publication, copying or
dissemination of the contents of this e-mail by unauthorised third parties is prohibited. If you are not the intended recipient of this
e-mail, please delete it and immediately notify the sender.
Siddharth Agarwal - May 14, 2014, 11:46 p.m.
On 05/14/2014 04:41 PM, Kastner Masilko, Friedrich wrote:
>> From: Siddharth Agarwal [mailto:sid@less-broken.com]
>>
>> http://mercurial.selenic.com/wiki/ContributingChanges
>>
>> In particular, note the bit about "all lines less than 80 characters"
>> in the patch description.
> Sorry for that, it's been a while since I've last tried to contribute.
>   
>> On the face of it, I think this is a pretty decent idea.
> Thanks. The idea is from this thread, BTW: http://www.selenic.com/pipermail/mercurial/2014-May/047081.html
>   
>> It isn't clear from the patch description whether diffs produced by
>> block-limiting are backwards-compatible. i.e. would a diff produced by
>> this tweaked bdiff be applicable by an older Mercurial? It doesn't seem
>> like you've changed the code that applies diffs, so I presume the
>> answer
>> is yes. Even so, it'll be good to have that explicitly stated in the
>> description.
> Yes, being minimally invasive was the point of this. AFAIK, bdiff generates hunks that consists of starting point, end point, bytes to exchange, and the actual content. The algorithm used to generate those hunks should not influence the algorithm consuming them, otherwise there is implicit knowledge involved. AFAIK, mpatch.c is not doing that, thus old Mercurial versions should have no problem reading revlogs generated by this bdiff.
>
>> Can you add tests to make sure that this works and produces a split
>> output?
> What do you mean with split output? I can make tests that check if the situation shown in the thread linked above is fixed appropriately, i.e. not always storing a snapshot.

Yes, that's what I meant. Sorry about not being clear enough.

>   
>> There's a pure-Python implementation at
>> http://selenic.com/hg/file/1ae3cd6f836c/mercurial/pure/bdiff.py --
>> could
>> you make the same change to this as well?
> I could, but I don't know if it would be as minimally invasive (performance-wise) as the C version is.

If you're using the pure-Python version of Mercurial, you almost 
certainly don't care about performance, so that's OK.
Matt Mackall - May 27, 2014, 9:36 p.m.
On Wed, 2014-05-14 at 23:54 +0200, Friedrich Kastner-Masilko wrote:
> # HG changeset patch
> # User Friedrich Kastner-Masilko <kastner-masilko@at.festo.com>
> # Date 1400101077 -7200
> #      Wed May 14 22:57:57 2014 +0200
> # Node ID 1b57d1650cd2d5aa8c6cc103c344ecbd8fbabc39
> # Parent  1ae3cd6f836c3c96ee3e9a872c8e966750910c2d
> bdiff.c: implemented block-delimiting to better deal with long "lines"
> 
> Recent XML-based file formats often resemble human-readable text
> without a single line-break. This mostly comes from serialization of
> binary data into the XML format without well-forming the content for
> viewing. Storing such files with the current revlog implementation
> results in ineffective storage due to the used bdiff line-based
> algorithm. Since bdiff creates chunks based on the line-break mark,
> the whole file content is considered as one chunk, thus creating a
> delta as big (or even bigger) as the file itself.
> 
> This patch is introducing block-limiting of lines. All lines
> encountered will be split into 4k blocks, thus giving the algorithm a
> chance to create smaller deltas, especially if the changes are at the
> end of the file. Especially for growing content where the header of
> the file is never changed, this patch increases the storage
> efficiency. However, with changes at the beginning of the file the
> block-limiting is not changing the results w.r.t. the original
> algorithm. The same is true for standard usage with text-files:
> because these usually contain lines shorter than 4k characters, the
> patch never kicks in.

So this looks fine as far as it goes, but I think we should try to go
one further: getting somewhat repeatable block boundaries even if we
insert in the middle.

For instance, we could have a hierarchy of block break rules:

- break on newline
- break on [?_%] if > 1k  # some other rare characters?
- break if > 4k

This might give the delta engine some opportunities to realign.
Pierre-Yves David - Aug. 21, 2014, 5:07 p.m.
On 05/28/2014 02:29 AM, Kastner Masilko, Friedrich wrote:
>> From: Matt Mackall [mailto:mpm@selenic.com]
>>
>> So this looks fine as far as it goes, but I think we should try to go
>> one further: getting somewhat repeatable block boundaries even if we
>> insert in the middle.
>>
>> For instance, we could have a hierarchy of block break rules:
>>
>> - break on newline
>> - break on [?_%] if > 1k  # some other rare characters?
>> - break if > 4k
>>
>> This might give the delta engine some opportunities to realign.
>
> I think that a fixed heuristic (rare characters) will not do much in terms of improvement for the described use-case of serialized XML files. That said, I've already experimented with a statistical approach:
>
> 1. During the first N bytes of the line, count every possible byte (only costs a bit of memory and one additional indirect addressing + incrementing). This will give a localized distribution of characters within the "line".
> 2. If no break occurred during N bytes, use one of the characters with a distribution closest to a general newline distribution in text-files (estimated at 1.5%) as break-marker _in addition_ to newline.
> 3. As soon as the first newline occurs, break and repeat from start.
>
> This can be improved by means of continuously keeping the distribution up-to-date while running up to the newline, but this is IMHO too costly to do (calculating best marker according to distribution) on every byte. A fixed point of strategy-change is faster to do, especially with N chosen on a 2^x position.
>
> So far I've got the best results with N=256 on various formats, but I worry about how this would influence visualization of diffs for the standard use-case. Your "rare characters" proposal is worth a try, though.

Any news on this front? The general idea for this patch with interesting 
and I would be happy to see the topic come to a conclusion.

Patch

diff -r 1ae3cd6f836c -r 1b57d1650cd2 mercurial/bdiff.c
--- a/mercurial/bdiff.c	Tue May 13 19:29:45 2014 -0500
+++ b/mercurial/bdiff.c	Wed May 14 22:57:57 2014 +0200
@@ -36,16 +36,17 @@ 
 static int splitlines(const char *a, Py_ssize_t len, struct line **lr)
 {
 	unsigned hash;
-	int i;
+	int i, j;
 	const char *p, *b = a;
 	const char * const plast = a + len - 1;
 	struct line *l;
 
 	/* count the lines */
 	i = 1; /* extra line for sentinel */
-	for (p = a; p < a + len; p++)
-		if (*p == '\n' || p == plast)
-			i++;
+	j = 0; /* block counter for fixed line delimiting */
+	for (p = a; p < a + len; p++, j++)
+		if (*p == '\n' || p == plast || j >= 4096)
+			i++, j = 0;
 
 	*lr = l = (struct line *)malloc(sizeof(struct line) * i);
 	if (!l)
@@ -53,11 +54,11 @@ 
 
 	/* build the line array and calculate hashes */
 	hash = 0;
-	for (p = a; p < a + len; p++) {
+	for (p = a, j = 0; p < a + len; p++, j++) {
 		/* Leonid Yuriev's hash */
 		hash = (hash * 1664525) + (unsigned char)*p + 1013904223;
 
-		if (*p == '\n' || p == plast) {
+		if (*p == '\n' || p == plast || j >= 4096) {
 			l->hash = hash;
 			hash = 0;
 			l->len = p - b + 1;
@@ -65,6 +66,7 @@ 
 			l->n = INT_MAX;
 			l++;
 			b = p + 1;
+			j = 0;
 		}
 	}