Patchwork [2,of,8,V7] bookmarks: introduce binary encoding

login
register
mail settings
Submitter Stanislau Hlebik
Date Nov. 11, 2016, 5:02 p.m.
Message ID <0BE55F3B-0890-4423-B952-00D5C27FE552@fb.com>
Download mbox | patch
Permalink /patch/17505/
State Superseded
Headers show

Comments

Stanislau Hlebik - Nov. 11, 2016, 5:02 p.m.
From: Gregory Szorc <gregory.szorc@gmail.com>

Date: Friday, November 11, 2016 at 5:20 AM
To: Stanislau Hlebik <stash@fb.com>
Cc: mercurial-devel <mercurial-devel@mercurial-scm.org>
Subject: Re: [PATCH 2 of 8 V7] bookmarks: introduce binary encoding

On Wed, Nov 2, 2016 at 7:06 AM, Stanislau Hlebik <stash@fb.com<mailto:stash@fb.com>> wrote:
# HG changeset patch
# User Stanislau Hlebik <stash@fb.com<mailto:stash@fb.com>>
# Date 1478016027 25200
#      Tue Nov 01 09:00:27 2016 -0700
# Branch stable
# Node ID f3da841e3d47ccbb0be3892b521607c09fdeab13
# Parent  a56a624a8a42b93ec980d2c284756a38719dffe6
bookmarks: introduce binary encoding

Bookmarks binary encoding will be used for `bookmarks` bundle2 part.
The format is: <4 bytes - bookmark size, big-endian><bookmark>
               <1 byte - 0 if node is empty 1 otherwise><20 bytes node>
BookmarksEncodeError and BookmarksDecodeError maybe thrown if input is
incorrect.


 from .i18n import _
 from .node import (
@@ -23,6 +25,70 @@
     util,
 )

+_NONEMPTYNODE = struct.pack('?', False)
+_EMPTYNODE = struct.pack('?', True)

TIL "?" is a valid format character for struct!

+
+def _unpackbookmarksize(stream):
+    """Returns 0 if stream is empty.
+    """
+
+    expectedsize = struct.calcsize('>i')
+    encodedbookmarksize = stream.read(expectedsize)
+    if len(encodedbookmarksize) == 0:
+        return 0

if not encodedbookmarksize:

+    if len(encodedbookmarksize) != expectedsize:
+        raise ValueError(
+            _('cannot decode bookmark size: '
+              'expected size: %d, actual size: %d') %
+            (expectedsize, len(encodedbookmarksize)))
+    return struct.unpack('>i', encodedbookmarksize)[0]
+
+def encodebookmarks(bookmarks):
+    """Encodes bookmark to node mapping to the byte string.
+
+    Format: <4 bytes - bookmark size><bookmark>
+            <1 byte - 0 if node is empty 1 otherwise><20 bytes node>
+    Node may be 20 byte binary string, 40 byte hex string or empty
+    """

I could make the argument for grouping the data by all sizes+nodes then all the names. This allows nice byte aligned packing and unpacking and some opportunities for indexing. It may also yield slightly better compression as the bookmark names will all be next to each other (although you'd have to measure that because compression window sizes may save us). But I'm having difficulty convincing myself it is worth it. Of course, having this data layout would also prohibit streaming. I think it is worth mentioning the alternative but probably not worth switching.

+
+    for bookmark, node in bookmarks.iteritems():

You'll want to use a sorted() here or eventual tests for multiple bookmarks will be unstable.

+        yield struct.pack('>i', (len(bookmark)))
+        yield encoding.fromlocal(bookmark)
+        if node:
+            if len(node) != 20 and len(node) != 40:
+                raise ValueError(_('node must be 20 or bytes long'))
+            if len(node) == 40:
+                node = bin(node)

I think the API should accept a single node type: the 20 byte binary node.


I’m fine with accepting only 20 byte nodes, but I don’t understand why accepting 40 byte nodes is bad.
Can you please explain?



+            yield _NONEMPTYNODE
+            yield node
+        else:
+            yield _EMPTYNODE

I'm on the fence regarding this single byte flag. On one hand, being explicit is nice. On the other, a sequence of \x00 identifies the null or empty node in so many other places. What's your reasoning here?

I think I replied somewhere but this letter got lost probably.
I’m using empty node to show that bookmark is deleted.
Nullid can be used for the same purpose but then  it would be impossible to push/pull bookmarks that point to null commit
(I know this is a rare case, but it’s possible to do it now).

+
+def decodebookmarks(buf):
+    """Decodes buffer into bookmark to node mapping.
+
+    Node is either 20 bytes or empty.
+    """
+
+    stream = StringIO.StringIO(buf)
+    bookmarks = {}
+    bookmarksize = _unpackbookmarksize(stream)
+    while bookmarksize:
+        bookmark = stream.read(bookmarksize)
+        if len(bookmark) != bookmarksize:
+            raise ValueError(
+                _('cannot decode bookmark: expected size: %d, '
+                'actual size: %d') % (bookmarksize, len(bookmark)))
+        bookmark = encoding.tolocal(bookmark)
+        packedemptynodeflag = stream.read(struct.calcsize('?'))
+        emptynode = struct.unpack('?', packedemptynodeflag)[0]

Yes, `struct` does cache things. But I don't like relying on it inside a loop. Please create a `struct.Struct` instance outside the loop.

+        node = ''
+        if not emptynode:
+            node = stream.read(20)
+        bookmarks[bookmark] = node

Alternatively, we could yield key, value pairs. That would facilitate streaming and allow the part to adapt to storing multiple values for the same bookmark. (YAGNI applies so if multiple nodes per bookmark isn't something you think we would need in the immediate future, don't bother.) New bundle parts aren't that hard to invent.

+        bookmarksize = _unpackbookmarksize(stream)
+    return bookmarks
+
 def _getbkfile(repo):
     """Hook so that extensions that mess with the store can hook bm storage.

_______________________________________________
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org<mailto:Mercurial-devel@mercurial-scm.org>
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel<https://urldefense.proofpoint.com/v2/url?u=https-3A__www.mercurial-2Dscm.org_mailman_listinfo_mercurial-2Ddevel&d=DQMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=1EQ58Dmb5uX1qHujcsT1Mg&m=ScomB9LY6Dj69k_tV10POcPVOn0NOTawu4nKgHZXVcM&s=yCmp6Z-zHlSnEMRrNUI8m71rBxfXOAs3Lz5kehoqut8&e=>
Gregory Szorc - Nov. 11, 2016, 7:04 p.m.
On Fri, Nov 11, 2016 at 9:02 AM, Stanislau Hlebik <stash@fb.com> wrote:

>
>
>
>
> *From: *Gregory Szorc <gregory.szorc@gmail.com>
> *Date: *Friday, November 11, 2016 at 5:20 AM
> *To: *Stanislau Hlebik <stash@fb.com>
> *Cc: *mercurial-devel <mercurial-devel@mercurial-scm.org>
> *Subject: *Re: [PATCH 2 of 8 V7] bookmarks: introduce binary encoding
>
>
>
> On Wed, Nov 2, 2016 at 7:06 AM, Stanislau Hlebik <stash@fb.com> wrote:
>
> # HG changeset patch
> # User Stanislau Hlebik <stash@fb.com>
> # Date 1478016027 25200
> #      Tue Nov 01 09:00:27 2016 -0700
> # Branch stable
> # Node ID f3da841e3d47ccbb0be3892b521607c09fdeab13
> # Parent  a56a624a8a42b93ec980d2c284756a38719dffe6
> bookmarks: introduce binary encoding
>
> Bookmarks binary encoding will be used for `bookmarks` bundle2 part.
> The format is: <4 bytes - bookmark size, big-endian><bookmark>
>                <1 byte - 0 if node is empty 1 otherwise><20 bytes node>
> BookmarksEncodeError and BookmarksDecodeError maybe thrown if input is
> incorrect.
>
> diff --git a/mercurial/bookmarks.py b/mercurial/bookmarks.py
> --- a/mercurial/bookmarks.py
> +++ b/mercurial/bookmarks.py
> @@ -7,8 +7,10 @@
>
>  from __future__ import absolute_import
>
> +import StringIO
>
>
>
> This module isn't available on Python 3. Use util.stringio or (preferably)
> io.BytesIO (since it assumes it is operating on bytes and raises if it sees
> unicode types).
>
>
>
>  import errno
>  import os
> +import struct
>
>  from .i18n import _
>  from .node import (
> @@ -23,6 +25,70 @@
>      util,
>  )
>
> +_NONEMPTYNODE = struct.pack('?', False)
> +_EMPTYNODE = struct.pack('?', True)
>
>
>
> TIL "?" is a valid format character for struct!
>
>
>
> +
> +def _unpackbookmarksize(stream):
> +    """Returns 0 if stream is empty.
> +    """
> +
> +    expectedsize = struct.calcsize('>i')
> +    encodedbookmarksize = stream.read(expectedsize)
> +    if len(encodedbookmarksize) == 0:
> +        return 0
>
>
>
> if not encodedbookmarksize:
>
>
>
> +    if len(encodedbookmarksize) != expectedsize:
> +        raise ValueError(
> +            _('cannot decode bookmark size: '
> +              'expected size: %d, actual size: %d') %
> +            (expectedsize, len(encodedbookmarksize)))
> +    return struct.unpack('>i', encodedbookmarksize)[0]
> +
> +def encodebookmarks(bookmarks):
> +    """Encodes bookmark to node mapping to the byte string.
> +
> +    Format: <4 bytes - bookmark size><bookmark>
> +            <1 byte - 0 if node is empty 1 otherwise><20 bytes node>
> +    Node may be 20 byte binary string, 40 byte hex string or empty
> +    """
>
>
>
> I could make the argument for grouping the data by all sizes+nodes then
> all the names. This allows nice byte aligned packing and unpacking and some
> opportunities for indexing. It may also yield slightly better compression
> as the bookmark names will all be next to each other (although you'd have
> to measure that because compression window sizes may save us). But I'm
> having difficulty convincing myself it is worth it. Of course, having this
> data layout would also prohibit streaming. I think it is worth mentioning
> the alternative but probably not worth switching.
>
>
>
> +
> +    for bookmark, node in bookmarks.iteritems():
>
>
>
> You'll want to use a sorted() here or eventual tests for multiple
> bookmarks will be unstable.
>
>
>
> +        yield struct.pack('>i', (len(bookmark)))
> +        yield encoding.fromlocal(bookmark)
> +        if node:
> +            if len(node) != 20 and len(node) != 40:
> +                raise ValueError(_('node must be 20 or bytes long'))
> +            if len(node) == 40:
> +                node = bin(node)
>
>
>
> I think the API should accept a single node type: the 20 byte binary node.
>
>
>
>
>
> I’m fine with accepting only 20 byte nodes, but I don’t understand why
> accepting 40 byte nodes is bad.
>
> Can you please explain?
>

The 40 byte hex version strictly exists for humans and for serialization to
non-binary safe formats. It is derived from the canonical, 20 byte value.
Many internal APIs only speak in terms of the 20 byte node. Or if you are
interacting with revlogs and/or assuming a consistent store state, an
integer representing the revlog index can be used instead.

Accepting both 20 and 40 byte versions encourages inconsistent internal
usage for representing nodes. This leads to lots of code converting between
formats because APIs aren't consistent. In addition, when we eventually
support something other than SHA-1, we'll have new node lengths we need to
support. It will be easier to limit ourselves to assuming binary nodes are
used everywhere.

The hex version of nodes should only exists "at the edges" and should be
converted to binary nodes (or revlog revision integers) as quickly as
possible.


>
>
>
>
>
> +            yield _NONEMPTYNODE
> +            yield node
> +        else:
> +            yield _EMPTYNODE
>
>
>
> I'm on the fence regarding this single byte flag. On one hand, being
> explicit is nice. On the other, a sequence of \x00 identifies the null or
> empty node in so many other places. What's your reasoning here?
>
>
>
> I think I replied somewhere but this letter got lost probably.
>
> I’m using empty node to show that bookmark is deleted.
>
> Nullid can be used for the same purpose but then  it would be impossible
> to push/pull bookmarks that point to null commit
>
> (I know this is a rare case, but it’s possible to do it now).
>
>
>
> +
>
> +def decodebookmarks(buf):
> +    """Decodes buffer into bookmark to node mapping.
> +
> +    Node is either 20 bytes or empty.
> +    """
> +
> +    stream = StringIO.StringIO(buf)
> +    bookmarks = {}
> +    bookmarksize = _unpackbookmarksize(stream)
> +    while bookmarksize:
> +        bookmark = stream.read(bookmarksize)
> +        if len(bookmark) != bookmarksize:
> +            raise ValueError(
> +                _('cannot decode bookmark: expected size: %d, '
> +                'actual size: %d') % (bookmarksize, len(bookmark)))
> +        bookmark = encoding.tolocal(bookmark)
> +        packedemptynodeflag = stream.read(struct.calcsize('?'))
> +        emptynode = struct.unpack('?', packedemptynodeflag)[0]
>
>
>
> Yes, `struct` does cache things. But I don't like relying on it inside a
> loop. Please create a `struct.Struct` instance outside the loop.
>
>
>
> +        node = ''
> +        if not emptynode:
> +            node = stream.read(20)
> +        bookmarks[bookmark] = node
>
>
>
> Alternatively, we could yield key, value pairs. That would facilitate
> streaming and allow the part to adapt to storing multiple values for the
> same bookmark. (YAGNI applies so if multiple nodes per bookmark isn't
> something you think we would need in the immediate future, don't bother.)
> New bundle parts aren't that hard to invent.
>
>
>
> +        bookmarksize = _unpackbookmarksize(stream)
> +    return bookmarks
> +
>  def _getbkfile(repo):
>      """Hook so that extensions that mess with the store can hook bm
> storage.
>
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@mercurial-scm.org
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__www.mercurial-2Dscm.org_mailman_listinfo_mercurial-2Ddevel&d=DQMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=1EQ58Dmb5uX1qHujcsT1Mg&m=ScomB9LY6Dj69k_tV10POcPVOn0NOTawu4nKgHZXVcM&s=yCmp6Z-zHlSnEMRrNUI8m71rBxfXOAs3Lz5kehoqut8&e=>
>
>
>

Patch

diff --git a/mercurial/bookmarks.py b/mercurial/bookmarks.py

--- a/mercurial/bookmarks.py

+++ b/mercurial/bookmarks.py

@@ -7,8 +7,10 @@ 


 from __future__ import absolute_import

+import StringIO


This module isn't available on Python 3. Use util.stringio or (preferably) io.BytesIO (since it assumes it is operating on bytes and raises if it sees unicode types).

 import errno
 import os
+import struct