Patchwork [evolve-ext-V5] evolve: add new methods _evolvestatewrite, _evolvestateread, _evolvestatedelete

login
register
mail settings
Submitter Shusen LIU
Date Jan. 26, 2016, 11:43 p.m.
Message ID <c82a0edf16d90727e307.1453851787@dev1221.lla1.facebook.com>
Download mbox | patch
Permalink /patch/12891/
State Changes Requested
Delegated to: Pierre-Yves David
Headers show

Comments

Shusen LIU - Jan. 26, 2016, 11:43 p.m.
# HG changeset patch
# User Shusen LIU <liushusen@fb.com>
# Date 1453329104 28800
#      Wed Jan 20 14:31:44 2016 -0800
# Node ID c82a0edf16d90727e307e2c4b17bd1202e414fb7
# Parent  ea7523380efa894213582ca1600dc472baa7b25e
evolve: add new methods _evolvestatewrite, _evolvestateread, _evolvestatedelete

This patch introduces three new methods to read/write/delete data in vfs file
'evolvestate'. Provide the first version of impl of these methods, the
_fm0evolvestateversion = 0.
This enable us to persist evolvestate data, restore evolvestate data, and
delete it after it's no longer needed. And allows us to support a continued
keywork to implement evolve state.
Bryan O'Sullivan - Jan. 27, 2016, 12:01 a.m.
On Tue, Jan 26, 2016 at 3:43 PM, Shusen LIU <liushusen@fb.com> wrote:


> +## Parsing and writing of version "0"
>

We have historically avoided numbered versions in favour of capabilities. I
don't have a strong opinion on whether to do that here, just raising this
as diverging from that practice.


> +#
> +# The file contains two line.
> +# First line is the version number.
> +# Second line is the the data in json format
> +
> +_fm0evolvestateversion = 0
>

What does _fm indicate here?


> +
> +def _fm0evolvestateread(repo):
> +    serialized_data = repo.vfs.read('evolvestate').split('\n')[1]
> +    data = dict([(str(k), str(v))
> +                 for k,v in json.loads(serialized_data).iteritems()])
>

This doesn't make sense to me. Why are you turning everything in the
serialised json into strings? Wouldn't they already be in the desired form
if you wrote them out?


> +    return data
>

This is redundant.


> +# mapping to read/write various evolvestate formats
> +# <version> -> (reader, writer)
> +formats = {_fm0evolvestateversion: (_fm0evolvestateread,
> _fm0evolvestatewrite),}
> +defaultversion = _fm0evolvestateversion
>

I think you should really put these functions in a class and refer to that
class. My feeling is that using a tuple is less elegant.

+
> +def _evolvestatewrite(repo, data, version=defaultversion):
> +    formats[version][1](repo, data)
>

And here's why I don't like the tuple: now we've some indexing into a
nested structure, instead of a reference to a name.

+def _evolvestateread(repo):
> +    diskversion = int(repo.vfs.read('evolvestate').split('\n')[0])
>

If the evolve state is big, then reading the first line is going to become
very expensive because you're reading the entire file.
Yuya Nishihara - Jan. 27, 2016, 3:07 p.m.
On Tue, 26 Jan 2016 15:43:07 -0800, Shusen LIU wrote:
> # HG changeset patch
> # User Shusen LIU <liushusen@fb.com>
> # Date 1453329104 28800
> #      Wed Jan 20 14:31:44 2016 -0800
> # Node ID c82a0edf16d90727e307e2c4b17bd1202e414fb7
> # Parent  ea7523380efa894213582ca1600dc472baa7b25e
> evolve: add new methods _evolvestatewrite, _evolvestateread, _evolvestatedelete

> +def _fm0evolvestateread(repo):
> +    serialized_data = repo.vfs.read('evolvestate').split('\n')[1]
> +    data = dict([(str(k), str(v))
> +                 for k,v in json.loads(serialized_data).iteritems()])
> +    return data
> +
> +def _fm0evolvestatewrite(repo, data):
> +    serialized_data = json.dumps(data)
> +    repo.vfs.write('evolvestate',
> +                   '\n'.join([str(_fm0evolvestateversion), serialized_data]))

I'm not sure this is the case, but JSON is really bad at handling non-UTF-8
string such as a file name. If 'evolvestate' is not intended to be read by
other tools, JSON won't be a good choice.
Matt Mackall - Jan. 27, 2016, 9:56 p.m.
On Thu, 2016-01-28 at 00:07 +0900, Yuya Nishihara wrote:
> On Tue, 26 Jan 2016 15:43:07 -0800, Shusen LIU wrote:
> > # HG changeset patch
> > # User Shusen LIU <liushusen@fb.com>
> > # Date 1453329104 28800
> > #      Wed Jan 20 14:31:44 2016 -0800
> > # Node ID c82a0edf16d90727e307e2c4b17bd1202e414fb7
> > # Parent  ea7523380efa894213582ca1600dc472baa7b25e
> > evolve: add new methods _evolvestatewrite, _evolvestateread,
> > _evolvestatedelete
> 
> > +def _fm0evolvestateread(repo):
> > +    serialized_data = repo.vfs.read('evolvestate').split('\n')[1]
> > +    data = dict([(str(k), str(v))
> > +                 for k,v in json.loads(serialized_data).iteritems()])
> > +    return data
> > +
> > +def _fm0evolvestatewrite(repo, data):
> > +    serialized_data = json.dumps(data)
> > +    repo.vfs.write('evolvestate',
> > +                   '\n'.join([str(_fm0evolvestateversion),
> > serialized_data]))
> 
> I'm not sure this is the case, but JSON is really bad at handling non-UTF-8
> string such as a file name. If 'evolvestate' is not intended to be read by
> other tools, JSON won't be a good choice.

Indeed. Also problematic:

- XML (also can't handle non-Unicode data)[1]
- pickle (executes arbitrary code on read)
- marshal (executes arbitrary code on read)

[1] and frequently executes arbitrary code on read, but not by design
-- 
Mathematics is the supreme nostalgia of our time.
Shusen LIU - Jan. 29, 2016, 1:09 p.m.
What would be the suggested format for storing `evolvestate`?



On 1/27/16, 21:56, "Matt Mackall" <mpm@selenic.com> wrote:

>On Thu, 2016-01-28 at 00:07 +0900, Yuya Nishihara wrote:

>> On Tue, 26 Jan 2016 15:43:07 -0800, Shusen LIU wrote:

>> > # HG changeset patch

>> > # User Shusen LIU <liushusen@fb.com>

>> > # Date 1453329104 28800

>> > #      Wed Jan 20 14:31:44 2016 -0800

>> > # Node ID c82a0edf16d90727e307e2c4b17bd1202e414fb7

>> > # Parent  ea7523380efa894213582ca1600dc472baa7b25e

>> > evolve: add new methods _evolvestatewrite, _evolvestateread,

>> > _evolvestatedelete

>> 

>> > +def _fm0evolvestateread(repo):

>> > +    serialized_data = repo.vfs.read('evolvestate').split('\n')[1]

>> > +    data = dict([(str(k), str(v))

>> > +                 for k,v in json.loads(serialized_data).iteritems()])

>> > +    return data

>> > +

>> > +def _fm0evolvestatewrite(repo, data):

>> > +    serialized_data = json.dumps(data)

>> > +    repo.vfs.write('evolvestate',

>> > +                   '\n'.join([str(_fm0evolvestateversion),

>> > serialized_data]))

>> 

>> I'm not sure this is the case, but JSON is really bad at handling non-UTF-8

>> string such as a file name. If 'evolvestate' is not intended to be read by

>> other tools, JSON won't be a good choice.

>

>Indeed. Also problematic:

>

>- XML (also can't handle non-Unicode data)[1]

>- pickle (executes arbitrary code on read)

>- marshal (executes arbitrary code on read)

>

>[1] and frequently executes arbitrary code on read, but not by design

>-- 

>Mathematics is the supreme nostalgia of our time.

>
Bryan O'Sullivan - Jan. 29, 2016, 4:12 p.m.
On Fri, Jan 29, 2016 at 5:09 AM, Shusen Liu <liushusen@fb.com> wrote:

> What would be the suggested format for storing `evolvestate`?
>

We usually hand-code our own simple formats. This makes switching to a C
version far easier if performance becomes a concern (which it often does).
Pierre-Yves David - Jan. 30, 2016, 3:47 p.m.
On 01/27/2016 01:01 AM, Bryan O'Sullivan wrote:
> On Tue, Jan 26, 2016 at 3:43 PM, Shusen LIU <liushusen@fb.com
> <mailto:liushusen@fb.com>> wrote:
>
>     +## Parsing and writing of version "0"
>
>
> We have historically avoided numbered versions in favour of
> capabilities. I don't have a strong opinion on whether to do that here,
> just raising this as diverging from that practice.

Actually, we have been using version number extensively in the context 
of file format recently. Binary formats are a quite rigid structure and 
the one disk format have limited scope and change enough that having 
version number is good enough. Especially because the version number is 
usually just used to decide which parser to use. Nowaday the content of 
the file itself already account for some extensibility.

(so, yes, we want a version number in this on disk file)

>     +# The file contains two line.
>     +# First line is the version number.
>     +# Second line is the the data in json format
>     +
>     +_fm0evolvestateversion = 0
>
>
> What does _fm indicate here?

_fm0 is used elsewhere in the code to distinct function that handle 
obsstore format-0 and obsstore format-1.

This is not very critical here (since we don't have multiple evolve 
state version yet). But it does not hurt.

>
>     +
>     +def _fm0evolvestateread(repo):
>     +    serialized_data = repo.vfs.read('evolvestate').split('\n')[1]
>     +    data = dict([(str(k), str(v))
>     +                 for k,v in json.loads(serialized_data).iteritems()])
>
>
> This doesn't make sense to me. Why are you turning everything in the
> serialised json into strings? Wouldn't they already be in the desired
> form if you wrote them out?

json is unicode, and Mercurial wants (good old) string. We serialise 
string, but they come back as unicode object. We don't want unicode object.

That said, this approach is probably too naive because we'll want to be 
able to store something else that just string as value (list, other 
dict, etc), so this blind conversion if doing to break.

>     +# mapping to read/write various evolvestate formats
>     +# <version> -> (reader, writer)
>     +formats = {_fm0evolvestateversion: (_fm0evolvestateread,
>     _fm0evolvestatewrite),}
>     +defaultversion = _fm0evolvestateversion
>
>
> I think you should really put these functions in a class and refer to
> that class. My feeling is that using a tuple is less elegant.

Matt have quite strong feeling about class as namespace. In my opinion, 
as we only have two elements, using a tuple is not too problematic.

Patch

diff --git a/hgext/evolve.py b/hgext/evolve.py
--- a/hgext/evolve.py
+++ b/hgext/evolve.py
@@ -67,6 +67,7 @@ 
 import collections
 import socket
 import errno
+import json
 sha1re = re.compile(r'\b[0-9a-f]{6,40}\b')
 
 import mercurial
@@ -3723,3 +3724,40 @@ 
         repo._bookmarks[book] = dest.node()
     if oldbookmarks or destbookmarks:
         repo._bookmarks.recordchange(tr)
+
+## Parsing and writing of version "0"
+#
+# The file contains two line.
+# First line is the version number.
+# Second line is the the data in json format
+
+_fm0evolvestateversion = 0
+
+def _fm0evolvestateread(repo):
+    serialized_data = repo.vfs.read('evolvestate').split('\n')[1]
+    data = dict([(str(k), str(v))
+                 for k,v in json.loads(serialized_data).iteritems()])
+    return data
+
+def _fm0evolvestatewrite(repo, data):
+    serialized_data = json.dumps(data)
+    repo.vfs.write('evolvestate',
+                   '\n'.join([str(_fm0evolvestateversion), serialized_data]))
+
+# mapping to read/write various evolvestate formats
+# <version> -> (reader, writer)
+formats = {_fm0evolvestateversion: (_fm0evolvestateread, _fm0evolvestatewrite),}
+defaultversion = _fm0evolvestateversion
+
+def _evolvestatewrite(repo, data, version=defaultversion):
+    formats[version][1](repo, data)
+
+def _evolvestateread(repo):
+    diskversion = int(repo.vfs.read('evolvestate').split('\n')[0])
+    if diskversion not in formats:
+        raise error.Abort(_('parsing evolvestate: unknown version %r')
+                         % diskversion)
+    return formats[diskversion][0](repo)
+
+def _evolvestatedelete(repo):
+    util.unlinkpath(repo.join('evolvestate'), ignoremissing=True)