Submitter | phabricator |
---|---|
Date | Jan. 11, 2020, 5:05 p.m. |
Message ID | <differential-rev-PHID-DREV-qgolzovj7uoxzv3d4k5e-req@mercurial-scm.org> |
Download | mbox | patch |
Permalink | /patch/44264/ |
State | Superseded |
Headers | show |
Comments
This revision now requires changes to proceed. martinvonz added inline comments. martinvonz requested changes to this revision. INLINE COMMENTS > nodemap.py:47-49 > +def persistent_data(index): > + """return the serialised data of a nodemap for a given index > + """ nit: call it `serialize()` to match the description? or maybe `serialize_index()` and rename `_dump_trie()` to `_serialize_trie()` > nodemap.py:57 > +NO_ENTRY = -1 > +# rev-0 need to be -2 because 0 is used by block, -1 is a special value. > +REV_OFFSET = 2 nit: replace "rev-0" by "rev 0" so it's not interpreted as subtraction > nodemap.py:66 > + > + Note that this is an involution. > + """ nit: maybe "... involution (is its own inverse)" because i don't think most people know what an involution is Also, you could rename the function to `encode_rev()`. You could then do: # encode_rev is its own inverse decode_rev = encode_rev Or not even define that since it's not used > nodemap.py:79 > + > + The nodemap store revision number for each unique prefix. > + s/store/stores/ > nodemap.py:81 > + > + Each block is a dictionnary with key in `[0, 15]`. Value are either > + another block or a revision number. s/nn/n/ s/key/keys/ s/Value/Values/ > nodemap.py:85 > + root = {} > + for rev in range(len(index)): > + hex = nodemod.hex(index[rev][7]) nit: will `rev, entry in enumerate(index)` help? (i know we don't care about speed) > nodemap.py:112-118 > + while current_hex[level] == other_hex[level]: > + new = {} > + block[_to_int(current_hex[level])] = new > + block = new > + level += 1 > + block[_to_int(current_hex[level])] = current_rev > + block[_to_int(other_hex[level])] = other_rev Would it work to replace these lines by the following? new = {} block[_to_int(current_hex[level])] = new _insert_into_block(index, level + 1, new, other_rev, other_hex) _insert_into_block(index, level + 1, new, current_rev, current_hex) > nodemap.py:121 > + > +def _dump_trie(root): > + """serialise a nodemap trie as mentioned above, i think `_serialize_tree()` would be a clearer name (it matches how you described it in the docstring) > nodemap.py:145 > + > +def _dump_block(block_node, block_map): > + """serialise a single block and this could be `_serialize_block()` > nodemap.py:155 > + > +def _to_value(item, block_map): > + """serialize any value as an integer""" `serialize_value()`? REPOSITORY rHG Mercurial CHANGES SINCE LAST ACTION https://phab.mercurial-scm.org/D7834/new/ REVISION DETAIL https://phab.mercurial-scm.org/D7834 To: marmoute, #hg-reviewers, martinvonz Cc: martinvonz, mjpieters, mercurial-devel
marmoute added inline comments. INLINE COMMENTS > martinvonz wrote in nodemap.py:112-118 > Would it work to replace these lines by the following? > > new = {} > block[_to_int(current_hex[level])] = new > _insert_into_block(index, level + 1, new, other_rev, other_hex) > _insert_into_block(index, level + 1, new, current_rev, current_hex) no, we need the value of block to be updated after each iteration otherwise the next iteration would not descend to the level it just created. REPOSITORY rHG Mercurial CHANGES SINCE LAST ACTION https://phab.mercurial-scm.org/D7834/new/ REVISION DETAIL https://phab.mercurial-scm.org/D7834 To: marmoute, #hg-reviewers, martinvonz Cc: martinvonz, mjpieters, mercurial-devel
martinvonz added inline comments. INLINE COMMENTS > marmoute wrote in nodemap.py:112-118 > no, we need the value of block to be updated after each iteration otherwise the next iteration would not descend to the level it just created. The test case still passes with that change, though.... REPOSITORY rHG Mercurial CHANGES SINCE LAST ACTION https://phab.mercurial-scm.org/D7834/new/ REVISION DETAIL https://phab.mercurial-scm.org/D7834 To: marmoute, #hg-reviewers, martinvonz Cc: martinvonz, mjpieters, mercurial-devel
martinvonz added inline comments. INLINE COMMENTS > debugcommands.py:2084 > +) > +def debugnodemap(ui, repo, **args): > + """write and inspect on disk nodemap could you also rename `args` to `opts`? or is there a reason to not follow the same convention we follow everywhere (?) else? REPOSITORY rHG Mercurial CHANGES SINCE LAST ACTION https://phab.mercurial-scm.org/D7834/new/ REVISION DETAIL https://phab.mercurial-scm.org/D7834 To: marmoute, #hg-reviewers, martinvonz Cc: martinvonz, mjpieters, mercurial-devel
marmoute added inline comments. INLINE COMMENTS > martinvonz wrote in debugcommands.py:2084 > could you also rename `args` to `opts`? or is there a reason to not follow the same convention we follow everywhere (?) else? No good reason for this to not be `opts`. I'll rename it. REPOSITORY rHG Mercurial CHANGES SINCE LAST ACTION https://phab.mercurial-scm.org/D7834/new/ REVISION DETAIL https://phab.mercurial-scm.org/D7834 To: marmoute, #hg-reviewers, martinvonz Cc: martinvonz, mjpieters, mercurial-devel
marmoute added a comment. As stated yesterday, I'll have limited screen exposure time available for the next couple of day. I am going to ignore any minor nits during that time. (Am I happy to look at them afterward and follow up). Right now, I have only identified the `s/**args/**opts/` as non nits. I am sending an updated version that update that part. If any remains, please highlight any other major complains about this patch that is worth blocking the rest of the series. REPOSITORY rHG Mercurial CHANGES SINCE LAST ACTION https://phab.mercurial-scm.org/D7834/new/ REVISION DETAIL https://phab.mercurial-scm.org/D7834 To: marmoute, #hg-reviewers, martinvonz Cc: martinvonz, mjpieters, mercurial-devel
marmoute added inline comments. INLINE COMMENTS > martinvonz wrote in nodemap.py:47-49 > nit: call it `serialize()` to match the description? or maybe `serialize_index()` and rename `_dump_trie()` to `_serialize_trie()` I don't really like `serialize` because we won't have a `deserialize` step. The data should be usable directly with a mmap from disk. So if we standardize function name on something I would rather go toward "persist" (and "parse" for the python function that checks it). This can apply to the "dump" function and docstring. > martinvonz wrote in nodemap.py:57 > nit: replace "rev-0" by "rev 0" so it's not interpreted as subtraction subtraction by zero would be odd, but why not. > martinvonz wrote in nodemap.py:66 > nit: maybe "... involution (is its own inverse)" because i don't think most people know what an involution is > > Also, you could rename the function to `encode_rev()`. You could then do: > > # encode_rev is its own inverse > decode_rev = encode_rev > > Or not even define that since it's not used I am not convinced by the encode_rev/decode_rev form. I don't think it very important to define `involution`, people can look for it if they need to. The reste of the docstring already explain the functionworks both way. > martinvonz wrote in nodemap.py:85 > nit: will `rev, entry in enumerate(index)` help? (i know we don't care about speed) We need to index the entry anyway, so that won't change the line much. I find the current form clearer. > martinvonz wrote in nodemap.py:112-118 > The test case still passes with that change, though.... With more eyes, I am seeing what you are doing now. Yeah I think if would work. And it is more consistent with the recursivity used int he previous conditional block. Updating the patch, > martinvonz wrote in nodemap.py:121 > as mentioned above, i think `_serialize_tree()` would be a clearer name (it matches how you described it in the docstring) I would rather move to `_persist_trie`. What do you think? > martinvonz wrote in nodemap.py:145 > and this could be `_serialize_block()` so it would be `_persist_block()` > martinvonz wrote in nodemap.py:155 > `serialize_value()`? and `_persist_value()` REPOSITORY rHG Mercurial CHANGES SINCE LAST ACTION https://phab.mercurial-scm.org/D7834/new/ REVISION DETAIL https://phab.mercurial-scm.org/D7834 To: marmoute, #hg-reviewers, martinvonz Cc: martinvonz, mjpieters, mercurial-devel
martinvonz added inline comments. INLINE COMMENTS > marmoute wrote in nodemap.py:47-49 > I don't really like `serialize` because we won't have a `deserialize` step. The data should be usable directly with a mmap from disk. So if we standardize function name on something I would rather go toward "persist" (and "parse" for the python function that checks it). > > This can apply to the "dump" function and docstring. Unless the Python code is going to mmap the file contents directly into Python objects (which I don't think is going to happen), there will be serialization and deserialization involved. > martinvonz wrote in nodemap.py:66 > nit: maybe "... involution (is its own inverse)" because i don't think most people know what an involution is > > Also, you could rename the function to `encode_rev()`. You could then do: > > # encode_rev is its own inverse > decode_rev = encode_rev > > Or not even define that since it's not used > I don't think it very important to define involution, people can look for it if they need to. But it's helpful to be helpful, no? Is there a good reason *not* to try to help? Maybe I'm in a small minority who didn't know what an involution is? > marmoute wrote in nodemap.py:85 > We need to index the entry anyway, so that won't change the line much. I find the current form clearer. I don't see why you'd be against `enumerate()` here. Are you simply against `enumerate()` in general? REPOSITORY rHG Mercurial CHANGES SINCE LAST ACTION https://phab.mercurial-scm.org/D7834/new/ REVISION DETAIL https://phab.mercurial-scm.org/D7834 To: marmoute, #hg-reviewers, martinvonz Cc: martinvonz, mjpieters, mercurial-devel
marmoute added inline comments. INLINE COMMENTS > martinvonz wrote in nodemap.py:47-49 > Unless the Python code is going to mmap the file contents directly into Python objects (which I don't think is going to happen), there will be serialization and deserialization involved. The Python code is not meant to be use in production, especially not the parsing one (because it bring no advantage). So I want the API to be named after the actual semantic, not after this annecdotical python code. > martinvonz wrote in nodemap.py:85 > I don't see why you'd be against `enumerate()` here. Are you simply against `enumerate()` in general? I am not against enumerate in general, I just feel like direct indexing is clearer here (and the different is not that important here anyway). REPOSITORY rHG Mercurial CHANGES SINCE LAST ACTION https://phab.mercurial-scm.org/D7834/new/ REVISION DETAIL https://phab.mercurial-scm.org/D7834 To: marmoute, #hg-reviewers, martinvonz Cc: martinvonz, mjpieters, mercurial-devel
marmoute added inline comments. INLINE COMMENTS > marmoute wrote in nodemap.py:85 > I am not against enumerate in general, I just feel like direct indexing is clearer here (and the different is not that important here anyway). Also, using `enumerate` in the incremental version of this code would make things sinificantly messier, so it won't. And keeping the two code similar has value. REPOSITORY rHG Mercurial CHANGES SINCE LAST ACTION https://phab.mercurial-scm.org/D7834/new/ REVISION DETAIL https://phab.mercurial-scm.org/D7834 To: marmoute, #hg-reviewers, martinvonz Cc: martinvonz, mjpieters, mercurial-devel
martinvonz added inline comments. INLINE COMMENTS > marmoute wrote in nodemap.py:47-49 > The Python code is not meant to be use in production, especially not the parsing one (because it bring no advantage). > > So I want the API to be named after the actual semantic, not after this annecdotical python code. The actual semantic of this code (the Python code) is to serialize the data. To me, "persist" implies writing it to disk, which is not what this code does. I think it's just misleading to call it that even if the Rust code will actually be writing directly to disk. REPOSITORY rHG Mercurial CHANGES SINCE LAST ACTION https://phab.mercurial-scm.org/D7834/new/ REVISION DETAIL https://phab.mercurial-scm.org/D7834 To: marmoute, #hg-reviewers, martinvonz Cc: martinvonz, mjpieters, mercurial-devel
marmoute added inline comments. INLINE COMMENTS > martinvonz wrote in nodemap.py:47-49 > The actual semantic of this code (the Python code) is to serialize the data. To me, "persist" implies writing it to disk, which is not what this code does. I think it's just misleading to call it that even if the Rust code will actually be writing directly to disk. On my side, I feel like `serialize` is misleading because this is not meant to have a "deserialization" phase taht reconstruct the data in memory. Can we wait for the bulk of the code to be in before we decide on a name and do a (simpler) changeset to align things to the naming scheme? REPOSITORY rHG Mercurial CHANGES SINCE LAST ACTION https://phab.mercurial-scm.org/D7834/new/ REVISION DETAIL https://phab.mercurial-scm.org/D7834 To: marmoute, #hg-reviewers, martinvonz Cc: martinvonz, mjpieters, mercurial-devel
Patch
diff --git a/tests/test-persistent-nodemap.t b/tests/test-persistent-nodemap.t new file mode 100644 --- /dev/null +++ b/tests/test-persistent-nodemap.t @@ -0,0 +1,26 @@ +=================================== +Test the persistent on-disk nodemap +=================================== + + + $ hg init test-repo + $ cd test-repo + $ hg debugbuilddag .+5000 + $ hg debugnodemap --dump | f --sha256 --bytes=256 --hexdump --size + size=245760, sha256=5dbe62ab98a26668b544063d4d674ac4452ba903ee8895c52fd21d9bbd771e09 + 0000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................| + 0010: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................| + 0020: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................| + 0030: ff ff ff ff ff ff fa c6 ff ff ff ff ff ff ff ff |................| + 0040: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................| + 0050: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................| + 0060: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ed b7 |................| + 0070: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................| + 0080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ee 38 |...............8| + 0090: 00 00 00 00 00 00 00 00 ff ff ff ff ff ff ff ff |................| + 00a0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................| + 00b0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................| + 00c0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................| + 00d0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................| + 00e0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................| + 00f0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................| diff --git a/tests/test-help.t b/tests/test-help.t --- a/tests/test-help.t +++ b/tests/test-help.t @@ -1017,6 +1017,7 @@ print merge state debugnamecomplete complete "names" - tags, open branch names, bookmark names + debugnodemap write and inspect on disk nodemap debugobsolete create arbitrary obsolete marker debugoptADV (no help text available) diff --git a/tests/test-completion.t b/tests/test-completion.t --- a/tests/test-completion.t +++ b/tests/test-completion.t @@ -107,6 +107,7 @@ debugmanifestfulltextcache debugmergestate debugnamecomplete + debugnodemap debugobsolete debugp1copies debugp2copies @@ -289,6 +290,7 @@ debugmanifestfulltextcache: clear, add debugmergestate: debugnamecomplete: + debugnodemap: dump debugobsolete: flags, record-parents, rev, exclusive, index, delete, date, user, template debugp1copies: rev debugp2copies: rev diff --git a/mercurial/revlogutils/nodemap.py b/mercurial/revlogutils/nodemap.py --- a/mercurial/revlogutils/nodemap.py +++ b/mercurial/revlogutils/nodemap.py @@ -7,9 +7,156 @@ # GNU General Public License version 2 or any later version. from __future__ import absolute_import -from .. import error + +import struct + +from .. import ( + error, + node as nodemod, +) class NodeMap(dict): def __missing__(self, x): raise error.RevlogError(b'unknown node: %s' % x) + + +### Nodemap Trie +# +# This is a simple reference implementation to compute and serialise a nodemap +# trie. This reference implementation is write only. The python version of this +# is not expected to be actually used, since it wont provide performance +# improvement over existing non-persistent C implementation. +# +# The nodemap is serialized as Trie using 4bits-address/16-entries block. each +# revision can be adressed using its node shortest prefix. +# +# The trie is stored as a sequence of block. Each block contains 16 entries +# (signed 64bit integer, big endian). Each entry can be one of the following: +# +# * value >= 0 -> index of sub-block +# * value == -1 -> no value +# * value < -1 -> a revision value: rev = -(value+10) +# +# The implementation focus on simplicity, not on performance. A Rust +# implementation should provide a efficient version of the same serialization. +# This reference python implementation is never meant to be extensively use in +# production. + + +def persistent_data(index): + """return the serialised data of a nodemap for a given index + """ + trie = _build_trie(index) + return _dump_trie(trie) + + +S_BLOCK = struct.Struct(">" + ("q" * 16)) + +NO_ENTRY = -1 +# rev-0 need to be -2 because 0 is used by block, -1 is a special value. +REV_OFFSET = -2 + + +def _transform_rev(rev): + """Return the number used to represent the rev in the tree. + + (or retrieve a rev number from such representation) + + Note that this is an involution. + """ + return -(rev + REV_OFFSET) + + +def _to_int(hex_digit): + """turn an hexadecimal digit into a proper integer""" + return int(hex_digit, 16) + + +def _build_trie(index): + """build a nodemap trie + + The nodemap store revision number for each unique prefix. + + Each block is a dictionnary with key in `[0, 15]`. Value are either + another block or a revision number. + """ + root = {} + for rev in range(len(index)): + hex = nodemod.hex(index[rev][7]) + _insert_into_block(index, 0, root, rev, hex) + return root + + +def _insert_into_block(index, level, block, current_rev, current_hex): + """insert a new revision in a block + + index: the index we are adding revision for + level: the depth of the current block in the trie + block: the block currently being considered + current_rev: the revision number we are adding + current_hex: the hexadecimal representation of the of that revision + """ + entry = block.get(_to_int(current_hex[level])) + if entry is None: + # no entry, simply store the revision number + block[_to_int(current_hex[level])] = current_rev + elif isinstance(entry, dict): + # need to recurse to an underlying block + _insert_into_block(index, level + 1, entry, current_rev, current_hex) + else: + # collision with a previously unique prefix, inserting new + # vertices to fit both entry. + other_hex = nodemod.hex(index[entry][7]) + other_rev = entry + while current_hex[level] == other_hex[level]: + new = {} + block[_to_int(current_hex[level])] = new + block = new + level += 1 + block[_to_int(current_hex[level])] = current_rev + block[_to_int(other_hex[level])] = other_rev + + +def _dump_trie(root): + """serialise a nodemap trie + + See `_build_trie` for nodemap trie structure""" + block_map = {} + chunks = [] + for tn in _walk_trie(root): + block_map[id(tn)] = len(chunks) + chunks.append(_dump_block(tn, block_map)) + return ''.join(chunks) + + +def _walk_trie(block): + """yield all the block in a trie + + Children blocks are always yield before their parent block. + """ + for (_, item) in sorted(block.items()): + if isinstance(item, dict): + for sub_block in _walk_trie(item): + yield sub_block + yield block + + +def _dump_block(block_node, block_map): + """serialise a single block + + Children block are assumed to be already serialized and present in + block_map. + """ + data = tuple(_to_value(block_node.get(i), block_map) for i in range(16)) + return S_BLOCK.pack(*data) + + +def _to_value(item, block_map): + """serialize any value as an integer""" + if item is None: + return NO_ENTRY + elif isinstance(item, dict): + return block_map[id(item)] + else: + return _transform_rev(item) diff --git a/mercurial/debugcommands.py b/mercurial/debugcommands.py --- a/mercurial/debugcommands.py +++ b/mercurial/debugcommands.py @@ -93,7 +93,10 @@ stringutil, ) -from .revlogutils import deltas as deltautil +from .revlogutils import ( + deltas as deltautil, + nodemap, +) release = lockmod.release @@ -2075,6 +2078,20 @@ @command( + b'debugnodemap', + [('', b'dump', False, _(b'write a (binary) serialised nodemap on stdin'))], +) +def debugnodemap(ui, repo, **args): + """write and inspect on disk nodemap + """ + if args['dump']: + unfi = repo.unfiltered() + cl = unfi.changelog + data = nodemap.persistent_data(cl.index) + ui.write(data) + + +@command( b'debugobsolete', [ (b'', b'flags', 0, _(b'markers flag')),