Patchwork [03 of 10 V2] sparse-revlog: rework the way we enforce chunk size limit

Submitter Boris Feld
Date Nov. 15, 2018, 10:38 a.m.
Message ID <20e272f4f88c761b6155.1542278321@localhost.localdomain>
Permalink /patch/36593/
State Accepted

Comments

Boris Feld - Nov. 15, 2018, 10:38 a.m.
# HG changeset patch
# User Boris Feld <boris.feld@octobus.net>
# Date 1541782717 -3600
#      Fri Nov 09 17:58:37 2018 +0100
# Node ID 20e272f4f88c761b61556b4a4db08890fac21f56
# Parent  2251b59ed38df5764429e9b924eee57c4a7babae
# EXP-Topic sparse-perf
# Available At https://bitbucket.org/octobus/mercurial-devel/
#              hg pull https://bitbucket.org/octobus/mercurial-devel/ -r 20e272f4f88c
sparse-revlog: rework the way we enforce chunk size limit

We move from an O(N) algorithm to an O(log(N)) algorithm.

The previous algorithm traversed the whole delta chain, looking for the
exact point where it became too big. As a result, most of the delta chain
ended up being traversed.

Instead, we now use a "binary" approach, slicing the chain in two until we
obtain chunks of the appropriate size.

We still keep the previous, linear algorithm for the snapshots part. There
are few of them, and they are large pieces of data distant from each other,
so the linear scan should work well in that case.
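
For readers following along, the combined approach looks roughly like the
following minimal sketch (not the patch's exact code: it assumes a
revlog-like object exposing start(), end() and issnapshot(), a non-None
targetsize, and it leaves out the _trimchunk() handling of empty revisions):

    def slice_to_size(revlog, revs, targetsize):
        """Yield contiguous sub-lists of `revs` whose on-disk span fits in
        `targetsize` (an oversized individual revision is yielded alone)."""
        startdata = revlog.start(revs[0])
        if revlog.end(revs[-1]) - startdata <= targetsize:
            yield list(revs)
            return

        # phase 1: linear scan while we are among the (few, large) snapshots
        startidx = 0
        idx = 1
        while idx < len(revs) and revlog.issnapshot(revs[idx]):
            if revlog.end(revs[idx]) - startdata > targetsize:
                yield revs[startidx:idx]
                startidx = idx
                startdata = revlog.start(revs[idx])
            idx += 1
        if idx < len(revs):
            # close the snapshot chunk before switching strategy
            yield revs[startidx:idx]
            startidx = idx
            startdata = revlog.start(revs[idx])

        # phase 2: binary halving for the (many, small) delta revisions
        while startidx < len(revs):
            endidx = len(revs)
            # shrink [startidx, endidx) by half until the span fits,
            # but never below a single revision
            while (revlog.end(revs[endidx - 1]) - startdata > targetsize
                   and endidx - startidx > 1):
                endidx -= (endidx - startidx) // 2
            yield revs[startidx:endidx]
            startidx = endidx
            if startidx < len(revs):
                startdata = revlog.start(revs[startidx])

With the `_testrevlog` helper and the "mixed case" data from the doctests
further down, `list(slice_to_size(revlog, list(range(9)), 5))` gives the same
`[[0, 1], [2], [3, 4, 5], [6, 7, 8]]` split as the patched `_slicechunktosize`.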

As a practical example, restoring manifest revision '59547c40bc4c' in a
reference NetBeans repository (using sparse-revlog): the median time of the
`slice-sparse-chain` step of `perfrevlogrevision` improves from 1.109 ms to
0.660 ms.
Yuya Nishihara - Nov. 15, 2018, 12:12 p.m.
On Thu, 15 Nov 2018 11:38:41 +0100, Boris Feld wrote:
> # HG changeset patch
> # User Boris Feld <boris.feld@octobus.net>
> # Date 1541782717 -3600
> #      Fri Nov 09 17:58:37 2018 +0100
> # Node ID 20e272f4f88c761b61556b4a4db08890fac21f56
> # Parent  2251b59ed38df5764429e9b924eee57c4a7babae
> # EXP-Topic sparse-perf
> # Available At https://bitbucket.org/octobus/mercurial-devel/
> #              hg pull https://bitbucket.org/octobus/mercurial-devel/ -r 20e272f4f88c
> sparse-revlog: rework the way we enforce chunk size limit

Queued up to this, thanks.

I don't wanna read more C today. I'll review the remainder tomorrow or this
weekend.

> +    == mixed case ==
> +    >>> revlog = _testrevlog(data, snapshot=[0, 1, 2])
> +    >>> list(_slicechunktosize(revlog, range(9), 5))

Fixed range(9) to list(range(9)) for Py3.
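
The reason, presumably, is that `_trimchunk` returns slices of `revs`, and in
Python 3 slicing a `range` yields another `range` rather than a list, so the
chunks would not print as plain lists in the doctest:

    >>> range(9)[0:2]        # Python 3
    range(0, 2)
    >>> list(range(9))[0:2]
    [0, 1]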

>      startrevidx = 0
> -    startdata = revlog.start(revs[0])
>      endrevidx = 0
>      iterrevs = enumerate(revs)
>      next(iterrevs) # skip first rev.
> +    # first step: get snapshots out of the way
>      for idx, r in iterrevs:
>          span = revlog.end(r) - startdata
> -        if span <= targetsize:
> +        snapshot = revlog.issnapshot(r)
> +        if span <= targetsize and snapshot:
>              endrevidx = idx
>          else:
>              chunk = _trimchunk(revlog, revs, startrevidx, endrevidx + 1)

Can you update this "startrevidx..endrevidx" to be a half-open range? It's
confusing that "endrevidx" is inclusive here, but isn't later in this function.

> +    # for the others, we use binary slicing to quickly converge toward valid
> +    # chunks (otherwise, we might end up looking for start/end of many
> +    # revisions). This logic is not looking for the perfect slicing point, it
> +    # focuses on quickly converging toward valid chunks.
> +    nbitem = len(revs)
> +    while (enddata - startdata) > targetsize:
> +        endrevidx = nbitem
> +        if nbitem - startrevidx <= 1:
> +            break # protect against individual chunk larger than limit
> +        localenddata = revlog.end(revs[endrevidx - 1])
> +        span = localenddata - startdata
> +        while (localenddata - startdata) > targetsize:
> +            if endrevidx - startrevidx <= 1:
> +                break # protect against individual chunk larger than limit
> +            endrevidx -= (endrevidx - startrevidx) // 2
> +            localenddata = revlog.end(revs[endrevidx -1])
> +            span = localenddata - startdata

'span' isn't used.

Patch

diff --git a/mercurial/revlogutils/deltas.py b/mercurial/revlogutils/deltas.py
--- a/mercurial/revlogutils/deltas.py
+++ b/mercurial/revlogutils/deltas.py
@@ -79,7 +79,7 @@  def slicechunk(revlog, revs, targetsize=
     If individual revisions chunk are larger than this limit, they will still
     be raised individually.
 
-    >>> revlog = _testrevlog([
+    >>> data = [
     ...  5,  #00 (5)
     ...  10, #01 (5)
     ...  12, #02 (2)
@@ -96,7 +96,8 @@  def slicechunk(revlog, revs, targetsize=
     ...  85, #13 (11)
     ...  86, #14 (1)
     ...  91, #15 (5)
-    ... ])
+    ... ]
+    >>> revlog = _testrevlog(data, snapshot=range(16))
 
     >>> list(slicechunk(revlog, list(range(16))))
     [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]]
@@ -133,7 +134,7 @@  def _slicechunktosize(revlog, revs, targ
     happens when "minimal gap size" interrupted the slicing or when chain are
     built in a way that create large blocks next to each other.
 
-    >>> revlog = _testrevlog([
+    >>> data = [
     ...  3,  #0 (3)
     ...  5,  #1 (2)
     ...  6,  #2 (1)
@@ -143,7 +144,10 @@  def _slicechunktosize(revlog, revs, targ
     ...  12, #6 (1)
     ...  13, #7 (1)
     ...  14, #8 (1)
-    ... ])
+    ... ]
+
+    == All snapshots cases ==
+    >>> revlog = _testrevlog(data, snapshot=range(9))
 
     Cases where chunk is already small enough
     >>> list(_slicechunktosize(revlog, [0], 3))
@@ -178,20 +182,66 @@  def _slicechunktosize(revlog, revs, targ
     [[1], [3]]
     >>> list(_slicechunktosize(revlog, [3, 4, 5], 2))
     [[3], [5]]
+
+    == No Snapshot cases ==
+    >>> revlog = _testrevlog(data)
+
+    Cases where chunk is already small enough
+    >>> list(_slicechunktosize(revlog, [0], 3))
+    [[0]]
+    >>> list(_slicechunktosize(revlog, [6, 7], 3))
+    [[6, 7]]
+    >>> list(_slicechunktosize(revlog, [0], None))
+    [[0]]
+    >>> list(_slicechunktosize(revlog, [6, 7], None))
+    [[6, 7]]
+
+    cases where we need actual slicing
+    >>> list(_slicechunktosize(revlog, [0, 1], 3))
+    [[0], [1]]
+    >>> list(_slicechunktosize(revlog, [1, 3], 3))
+    [[1], [3]]
+    >>> list(_slicechunktosize(revlog, [1, 2, 3], 3))
+    [[1], [2, 3]]
+    >>> list(_slicechunktosize(revlog, [3, 5], 3))
+    [[3], [5]]
+    >>> list(_slicechunktosize(revlog, [3, 4, 5], 3))
+    [[3], [4, 5]]
+    >>> list(_slicechunktosize(revlog, [5, 6, 7, 8], 3))
+    [[5], [6, 7, 8]]
+    >>> list(_slicechunktosize(revlog, [0, 1, 2, 3, 4, 5, 6, 7, 8], 3))
+    [[0], [1, 2], [3], [5], [6, 7, 8]]
+
+    Case with too large individual chunk (must return valid chunk)
+    >>> list(_slicechunktosize(revlog, [0, 1], 2))
+    [[0], [1]]
+    >>> list(_slicechunktosize(revlog, [1, 3], 1))
+    [[1], [3]]
+    >>> list(_slicechunktosize(revlog, [3, 4, 5], 2))
+    [[3], [5]]
+
+    == mixed case ==
+    >>> revlog = _testrevlog(data, snapshot=[0, 1, 2])
+    >>> list(_slicechunktosize(revlog, range(9), 5))
+    [[0, 1], [2], [3, 4, 5], [6, 7, 8]]
     """
     assert targetsize is None or 0 <= targetsize
-    if targetsize is None or segmentspan(revlog, revs) <= targetsize:
+    startdata = revlog.start(revs[0])
+    enddata = revlog.end(revs[-1])
+    fullspan = enddata - startdata
+    if targetsize is None or fullspan <= targetsize:
         yield revs
         return
 
     startrevidx = 0
-    startdata = revlog.start(revs[0])
     endrevidx = 0
     iterrevs = enumerate(revs)
     next(iterrevs) # skip first rev.
+    # first step: get snapshots out of the way
     for idx, r in iterrevs:
         span = revlog.end(r) - startdata
-        if span <= targetsize:
+        snapshot = revlog.issnapshot(r)
+        if span <= targetsize and snapshot:
             endrevidx = idx
         else:
             chunk = _trimchunk(revlog, revs, startrevidx, endrevidx + 1)
@@ -200,7 +250,35 @@  def _slicechunktosize(revlog, revs, targ
             startrevidx = idx
             startdata = revlog.start(r)
             endrevidx = idx
-    yield _trimchunk(revlog, revs, startrevidx)
+        if not snapshot:
+            break
+
+    # for the others, we use binary slicing to quickly converge toward valid
+    # chunks (otherwise, we might end up looking for start/end of many
+    # revisions). This logic is not looking for the perfect slicing point, it
+    # focuses on quickly converging toward valid chunks.
+    nbitem = len(revs)
+    while (enddata - startdata) > targetsize:
+        endrevidx = nbitem
+        if nbitem - startrevidx <= 1:
+            break # protect against individual chunk larger than limit
+        localenddata = revlog.end(revs[endrevidx - 1])
+        span = localenddata - startdata
+        while (localenddata - startdata) > targetsize:
+            if endrevidx - startrevidx <= 1:
+                break # protect against individual chunk larger than limit
+            endrevidx -= (endrevidx - startrevidx) // 2
+            localenddata = revlog.end(revs[endrevidx -1])
+            span = localenddata - startdata
+        chunk = _trimchunk(revlog, revs, startrevidx, endrevidx)
+        if chunk:
+            yield chunk
+        startrevidx = endrevidx
+        startdata = revlog.start(revs[startrevidx])
+
+    chunk = _trimchunk(revlog, revs, startrevidx)
+    if chunk:
+        yield chunk
 
 def _slicechunktodensity(revlog, revs, targetdensity=0.5,
                          mingapsize=0):