Patchwork [7,of,9,RFC] pushkey: support for encoding and decoding raw listkeys dicts

login
register
mail settings
Submitter Gregory Szorc
Date Aug. 14, 2016, 9:10 p.m.
Message ID <eb2bc1ac7869ad255965.1471209006@ubuntu-vm-main>
Download mbox | patch
Permalink /patch/16286/
State Changes Requested
Headers show

Comments

Gregory Szorc - Aug. 14, 2016, 9:10 p.m.
# HG changeset patch
# User Gregory Szorc <gregory.szorc@gmail.com>
# Date 1471207500 25200
#      Sun Aug 14 13:45:00 2016 -0700
# Node ID eb2bc1ac7869ad255965d16004524a95cea83c9d
# Parent  1fe812eb8b9e79d1182c4a6593e7ce8fa2938264
pushkey: support for encoding and decoding raw listkeys dicts

Now that we have support for retrieving raw/binary versions of pushkey
data, the last step before we expose it on the wire protocol is a
method to encode and decode it. This patch implements those functions.

The new listkeys data representation is framed binary data. We simply
have pairs of frames corresponding to keys and values. A 0 length
key signals end of payload.

All binary sequences can be encoded in keys and values. Of course, not
all binary values can be used in existing namespaces because it may
not be encoded properly in the existing wire protocol command. But
going forward we can do whatever we want in new namespaces.
Yuya Nishihara - Aug. 16, 2016, 1:57 p.m.
On Sun, 14 Aug 2016 14:10:06 -0700, Gregory Szorc wrote:
> # HG changeset patch
> # User Gregory Szorc <gregory.szorc@gmail.com>
> # Date 1471207500 25200
> #      Sun Aug 14 13:45:00 2016 -0700
> # Node ID eb2bc1ac7869ad255965d16004524a95cea83c9d
> # Parent  1fe812eb8b9e79d1182c4a6593e7ce8fa2938264
> pushkey: support for encoding and decoding raw listkeys dicts

> +def encodekeysraw(keys):
> +    """Encode pushkey namespace keys using a binary encoding.
> +
> +    The response consists of framed data packets of the form:
> +
> +        <size> <data>
> +
> +    Where the ``size`` is a little endian 32-bit integer.
> +
> +    Data is emitted in pairs of frames where the first frame is the key
> +    name and the second frame is the value.
> +
> +    A frame with size 0 indicates end of stream.
> +    """
> +    s = struct.struct('<I')
> +
> +    chunks = []
> +    for k, v in keys:
> +        assert not isinstance(k, encoding.localstr)
> +        assert not isinstance(v, encoding.localstr)
> +
> +        chunks.append(s.pack(len(k)))
> +        chunks.append(k)
> +        chunks.append(s.pack(len(v)))
> +        chunks.append(v)

I heard we should stick to big endian. The cost of byte-order conversion
should be pretty cheap.

And if we're trying to reduce the payload size, it might be too large to
add 4-byte length field for each key/value pair.
Gregory Szorc - Aug. 16, 2016, 5:15 p.m.
On Tue, Aug 16, 2016 at 6:57 AM, Yuya Nishihara <yuya@tcha.org> wrote:

> On Sun, 14 Aug 2016 14:10:06 -0700, Gregory Szorc wrote:
> > # HG changeset patch
> > # User Gregory Szorc <gregory.szorc@gmail.com>
> > # Date 1471207500 25200
> > #      Sun Aug 14 13:45:00 2016 -0700
> > # Node ID eb2bc1ac7869ad255965d16004524a95cea83c9d
> > # Parent  1fe812eb8b9e79d1182c4a6593e7ce8fa2938264
> > pushkey: support for encoding and decoding raw listkeys dicts
>
> > +def encodekeysraw(keys):
> > +    """Encode pushkey namespace keys using a binary encoding.
> > +
> > +    The response consists of framed data packets of the form:
> > +
> > +        <size> <data>
> > +
> > +    Where the ``size`` is a little endian 32-bit integer.
> > +
> > +    Data is emitted in pairs of frames where the first frame is the key
> > +    name and the second frame is the value.
> > +
> > +    A frame with size 0 indicates end of stream.
> > +    """
> > +    s = struct.struct('<I')
> > +
> > +    chunks = []
> > +    for k, v in keys:
> > +        assert not isinstance(k, encoding.localstr)
> > +        assert not isinstance(v, encoding.localstr)
> > +
> > +        chunks.append(s.pack(len(k)))
> > +        chunks.append(k)
> > +        chunks.append(s.pack(len(v)))
> > +        chunks.append(v)
>
> I heard we should stick to big endian. The cost of byte-order conversion
> should be pretty cheap.
>
> And if we're trying to reduce the payload size, it might be too large to
> add 4-byte length field for each key/value pair.
>

Yeah. Keys can almost certainly be 16-bit. Values I'm less sure of. I could
see us going over 64k for things like obsolescence markers. Although, there
is something to be said about limiting sizes. We shouldn't be having
listkeys transfer large payloads because it is intrinsically a buffered
command/protocol and large payloads should be streamed for performance
reasons.

If we really cared about sizes, we could introduce a "varint." I'm pretty
sure I've heard the Googler's propose this before.
Pierre-Yves David - Aug. 17, 2016, 12:03 a.m.
On 08/16/2016 03:57 PM, Yuya Nishihara wrote:
> On Sun, 14 Aug 2016 14:10:06 -0700, Gregory Szorc wrote:
>> # HG changeset patch
>> # User Gregory Szorc <gregory.szorc@gmail.com>
>> # Date 1471207500 25200
>> #      Sun Aug 14 13:45:00 2016 -0700
>> # Node ID eb2bc1ac7869ad255965d16004524a95cea83c9d
>> # Parent  1fe812eb8b9e79d1182c4a6593e7ce8fa2938264
>> pushkey: support for encoding and decoding raw listkeys dicts
>
>> +def encodekeysraw(keys):
>> +    """Encode pushkey namespace keys using a binary encoding.
>> +
>> +    The response consists of framed data packets of the form:
>> +
>> +        <size> <data>
>> +
>> +    Where the ``size`` is a little endian 32-bit integer.
>> +
>> +    Data is emitted in pairs of frames where the first frame is the key
>> +    name and the second frame is the value.
>> +
>> +    A frame with size 0 indicates end of stream.
>> +    """
>> +    s = struct.struct('<I')
>> +
>> +    chunks = []
>> +    for k, v in keys:
>> +        assert not isinstance(k, encoding.localstr)
>> +        assert not isinstance(v, encoding.localstr)
>> +
>> +        chunks.append(s.pack(len(k)))
>> +        chunks.append(k)
>> +        chunks.append(s.pack(len(v)))
>> +        chunks.append(v)
>
> I heard we should stick to big endian. The cost of byte-order conversion
> should be pretty cheap.
>
> And if we're trying to reduce the payload size, it might be too large to
> add 4-byte length field for each key/value pair.

yeah, all our network protocol is big endian.
(but also see my comment about using bundle2 instead of direct listkey here)

Patch

diff --git a/mercurial/pushkey.py b/mercurial/pushkey.py
--- a/mercurial/pushkey.py
+++ b/mercurial/pushkey.py
@@ -2,16 +2,18 @@ 
 #
 # Copyright 2010 Matt Mackall <mpm@selenic.com>
 #
 # This software may be used and distributed according to the terms of the
 # GNU General Public License version 2 or any later version.
 
 from __future__ import absolute_import
 
+import struct
+
 from . import (
     bookmarks,
     encoding,
     obsolete,
     phases,
 )
 
 def _nslist(repo):
@@ -63,8 +65,62 @@  def encodekeys(keys):
 
 def decodekeys(data):
     """decode the content of a pushkey namespace from exchange over the wire"""
     result = {}
     for l in data.splitlines():
         k, v = l.split('\t')
         result[decode(k)] = decode(v)
     return result
+
+def encodekeysraw(keys):
+    """Encode pushkey namespace keys using a binary encoding.
+
+    The response consists of framed data packets of the form:
+
+        <size> <data>
+
+    Where the ``size`` is a little endian 32-bit integer.
+
+    Data is emitted in pairs of frames where the first frame is the key
+    name and the second frame is the value.
+
+    A frame with size 0 indicates end of stream.
+    """
+    s = struct.struct('<I')
+
+    chunks = []
+    for k, v in keys:
+        assert not isinstance(k, encoding.localstr)
+        assert not isinstance(v, encoding.localstr)
+
+        chunks.append(s.pack(len(k)))
+        chunks.append(k)
+        chunks.append(s.pack(len(v)))
+        chunks.append(v)
+
+    # Size 0 chunk signals end of payload.
+    chunks.append(s.pack(0))
+
+    return ''.join(chunks)
+
+def decodekeysraw(data):
+    """Decode value encoded by ``rawencodekeys``
+
+    Returns a dict with bytes keys and values.
+    """
+    s = struct.struct('<')
+    offset = 0
+    result = {}
+
+    while True:
+        l = s.unpack_from(data, offset)
+        offset += s.size
+        if l == 0:
+            break
+
+        key = data[offset:offset + l]
+        l = s.unpack_from(data, offset)
+        offset += s.size
+        value = data[offset:offset + l]
+        result[key] = value
+
+    return result