Patchwork D9090: changing-files: rework the way we store changed files in side-data

Submitter phabricator
Date Sept. 26, 2020, 12:10 p.m.
Message ID <differential-rev-PHID-DREV-p7v3lvxbtseucxervumh-req@mercurial-scm.org>
Permalink /patch/47293/
State Superseded

Comments

phabricator - Sept. 26, 2020, 12:10 p.m.
marmoute created this revision.
Herald added a reviewer: hg-reviewers.
Herald added a subscriber: mercurial-patches.

REVISION SUMMARY
  We need to store new data so this is a good opportunity to rework this fully.
  
  1. We directly store the list of affected files in the side data:
  
  - This avoids having to fetch and parse the `files` list in the revision in addition to the sidedata, making the data more self-sufficient.
  
  - This works around situations where the `files` field contains wrong information, and opens the way to other bug fixes (e.g. issue6219).
  
  - The format (fixed-size initial index, sorted files) allows for fast lookup of a filename within the structure.
  
  - This unifies the storage of affected files and of copy sources and destinations, limiting the number of filenames stored redundantly.
  
  - This prepares for dropping the `files` field as soon as we make any change affecting the revision schema.
  
  - This relies on compression to avoid a significant increase in the size of changelog.d. More testing on this will be done before we freeze the final format.
  
  
  
  2. We can store additional data:
  
  - The new "merged" field,
  
  - A future "salvaged" set recording files that might have been deleted but were still present in the final result.
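
The fast-lookup point above can be illustrated with a minimal sketch. It assumes the layout described in this patch (a 4-byte count, fixed-size 9-byte index entries sorted by filename, then the concatenated filenames). The helper `find_file` is hypothetical and not part of the patch; the sample block is the one recorded for the "copy a to j" commit in tests/test-copies-in-changeset.t.

```python
import struct

# Same struct layouts as in the patch: a count, then <flag><filename-end><copy-source>.
INDEX_HEADER = struct.Struct(">L")
INDEX_ENTRY = struct.Struct(">bLL")


def find_file(raw, name):
    """Return the index position of `name` in the block, or None.

    Hypothetical helper (not part of the patch): because entries have a
    fixed size and are sorted by filename, a binary search can locate a
    file without decoding the whole index.
    """
    total = INDEX_HEADER.unpack_from(raw, 0)[0]
    data_base = INDEX_HEADER.size + INDEX_ENTRY.size * total

    def filename_at(i):
        # The end offset comes from entry i, the start offset from entry i - 1.
        entry_pos = INDEX_HEADER.size + INDEX_ENTRY.size * i
        end = INDEX_ENTRY.unpack_from(raw, entry_pos)[1]
        start = 0
        if i > 0:
            previous_pos = entry_pos - INDEX_ENTRY.size
            start = INDEX_ENTRY.unpack_from(raw, previous_pos)[1]
        return raw[data_base + start:data_base + end]

    lo, hi = 0, total
    while lo < hi:
        mid = (lo + hi) // 2
        current = filename_at(mid)
        if current == name:
            return mid
        if current < name:
            lo = mid + 1
        else:
            hi = mid
    return None


# Block taken from the updated test output ("copy a to j").
sample = (
    b'\x00\x00\x00\x02'                      # header: two entries
    b'\x00\x00\x00\x00\x01\x00\x00\x00\x00'  # entry 0: "a"
    b'\x06\x00\x00\x00\x02\x00\x00\x00\x00'  # entry 1: "j"
    b'aj'                                    # data: concatenated filenames
)
```

With this sample, `find_file(sample, b'j')` returns `1` and `find_file(sample, b'z')` returns `None`. The sketch reads at most two index entries per probe, which is the property the fixed-size, sorted index is meant to provide.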

REPOSITORY
  rHG Mercurial

BRANCH
  default

REVISION DETAIL
  https://phab.mercurial-scm.org/D9090

AFFECTED FILES
  mercurial/helptext/internals/revlogs.txt
  mercurial/metadata.py
  mercurial/revlogutils/sidedata.py
  tests/test-copies-in-changeset.t

CHANGE DETAILS




To: marmoute, #hg-reviewers
Cc: mercurial-patches, mercurial-devel

Patch

diff --git a/tests/test-copies-in-changeset.t b/tests/test-copies-in-changeset.t
--- a/tests/test-copies-in-changeset.t
+++ b/tests/test-copies-in-changeset.t
@@ -79,11 +79,9 @@ 
   2\x00a (esc)
 #else
   $ hg debugsidedata -c -v -- -1
-  2 sidedata entries
-   entry-0010 size 11
-    '0\x00a\n1\x00a\n2\x00a'
-   entry-0012 size 5
-    '0\n1\n2'
+  1 sidedata entries
+   entry-0014 size 44
+    '\x00\x00\x00\x04\x00\x00\x00\x00\x01\x00\x00\x00\x00\x06\x00\x00\x00\x02\x00\x00\x00\x00\x06\x00\x00\x00\x03\x00\x00\x00\x00\x06\x00\x00\x00\x04\x00\x00\x00\x00abcd'
 #endif
 
   $ hg showcopies
@@ -117,13 +115,9 @@ 
 
 #else
   $ hg debugsidedata -c -v -- -1
-  3 sidedata entries
-   entry-0010 size 3
-    '1\x00b'
-   entry-0012 size 1
-    '1'
-   entry-0013 size 1
-    '0'
+  1 sidedata entries
+   entry-0014 size 25
+    '\x00\x00\x00\x02\x0c\x00\x00\x00\x01\x00\x00\x00\x00\x06\x00\x00\x00\x03\x00\x00\x00\x00bb2'
 #endif
 
   $ hg showcopies
@@ -165,8 +159,8 @@ 
 #else
   $ hg debugsidedata -c -v -- -1
   1 sidedata entries
-   entry-0010 size 4
-    '0\x00b2'
+   entry-0014 size 25
+    '\x00\x00\x00\x02\x00\x00\x00\x00\x02\x00\x00\x00\x00\x16\x00\x00\x00\x03\x00\x00\x00\x00b2c'
 #endif
 
   $ hg showcopies
@@ -221,13 +215,9 @@ 
 
 #else
   $ hg debugsidedata -c -v -- -1
-  3 sidedata entries
-   entry-0010 size 7
-    '0\x00a\n2\x00f'
-   entry-0011 size 3
-    '1\x00d'
-   entry-0012 size 5
-    '0\n1\n2'
+  1 sidedata entries
+   entry-0014 size 64
+    '\x00\x00\x00\x06\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x06\x00\x00\x00\x04\x00\x00\x00\x00\x07\x00\x00\x00\x05\x00\x00\x00\x01\x06\x00\x00\x00\x06\x00\x00\x00\x02adfghi'
 #endif
 
   $ hg showcopies
@@ -250,11 +240,9 @@ 
 #else
   $ hg ci -m 'copy a to j'
   $ hg debugsidedata -c -v -- -1
-  2 sidedata entries
-   entry-0010 size 3
-    '0\x00a'
-   entry-0012 size 1
-    '0'
+  1 sidedata entries
+   entry-0014 size 24
+    '\x00\x00\x00\x02\x00\x00\x00\x00\x01\x00\x00\x00\x00\x06\x00\x00\x00\x02\x00\x00\x00\x00aj'
 #endif
   $ hg debugdata j 0
   \x01 (esc)
@@ -281,11 +269,9 @@ 
   $ hg ci --amend -m 'copy a to j, v2'
   saved backup bundle to $TESTTMP/repo/.hg/strip-backup/*-*-amend.hg (glob)
   $ hg debugsidedata -c -v -- -1
-  2 sidedata entries
-   entry-0010 size 3
-    '0\x00a'
-   entry-0012 size 1
-    '0'
+  1 sidedata entries
+   entry-0014 size 24
+    '\x00\x00\x00\x02\x00\x00\x00\x00\x01\x00\x00\x00\x00\x06\x00\x00\x00\x02\x00\x00\x00\x00aj'
 #endif
   $ hg showcopies --config experimental.copies.read-from=filelog-only
   a -> j
@@ -304,6 +290,9 @@ 
 #else
   $ hg ci -m 'modify j'
   $ hg debugsidedata -c -v -- -1
+  1 sidedata entries
+   entry-0014 size 14
+    '\x00\x00\x00\x01\x14\x00\x00\x00\x01\x00\x00\x00\x00j'
 #endif
 
 Test writing only to filelog
@@ -318,11 +307,9 @@ 
 #else
   $ hg ci -m 'copy a to k'
   $ hg debugsidedata -c -v -- -1
-  2 sidedata entries
-   entry-0010 size 3
-    '0\x00a'
-   entry-0012 size 1
-    '0'
+  1 sidedata entries
+   entry-0014 size 24
+    '\x00\x00\x00\x02\x00\x00\x00\x00\x01\x00\x00\x00\x00\x06\x00\x00\x00\x02\x00\x00\x00\x00ak'
 #endif
 
   $ hg debugdata k 0
@@ -439,10 +426,10 @@ 
   compression-level:  default default default
   $ hg debugsidedata -c -- 0
   1 sidedata entries
-   entry-0012 size 1
+   entry-0014 size 14
   $ hg debugsidedata -c -- 1
   1 sidedata entries
-   entry-0013 size 1
+   entry-0014 size 14
   $ hg debugsidedata -m -- 0
   $ cat << EOF > .hg/hgrc
   > [format]
@@ -463,7 +450,11 @@ 
   compression:        zlib   zlib    zlib
   compression-level:  default default default
   $ hg debugsidedata -c -- 0
+  1 sidedata entries
+   entry-0014 size 14
   $ hg debugsidedata -c -- 1
+  1 sidedata entries
+   entry-0014 size 14
   $ hg debugsidedata -m -- 0
 
 upgrading
@@ -487,10 +478,10 @@ 
   compression-level:  default default default
   $ hg debugsidedata -c -- 0
   1 sidedata entries
-   entry-0012 size 1
+   entry-0014 size 14
   $ hg debugsidedata -c -- 1
   1 sidedata entries
-   entry-0013 size 1
+   entry-0014 size 14
   $ hg debugsidedata -m -- 0
 
 #endif
diff --git a/mercurial/revlogutils/sidedata.py b/mercurial/revlogutils/sidedata.py
--- a/mercurial/revlogutils/sidedata.py
+++ b/mercurial/revlogutils/sidedata.py
@@ -53,6 +53,7 @@ 
 SD_P2COPIES = 9
 SD_FILESADDED = 10
 SD_FILESREMOVED = 11
+SD_FILES = 12
 
 # internal format constant
 SIDEDATA_HEADER = struct.Struct('>H')
diff --git a/mercurial/metadata.py b/mercurial/metadata.py
--- a/mercurial/metadata.py
+++ b/mercurial/metadata.py
@@ -8,6 +8,7 @@ 
 from __future__ import absolute_import, print_function
 
 import multiprocessing
+import struct
 
 from . import (
     error,
@@ -361,54 +362,114 @@ 
         return None
 
 
+# see mercurial/helptext/internals/revlogs.txt for details about the format
+
+ACTION_MASK = int("111" "00", 2)
+# note: an untouched file used as a copy source will have `000` for this mask.
+ADDED_FLAG = int("001" "00", 2)
+MERGED_FLAG = int("010" "00", 2)
+REMOVED_FLAG = int("011" "00", 2)
+# `100` is reserved for future use
+TOUCHED_FLAG = int("101" "00", 2)
+
+COPIED_MASK = int("11", 2)
+COPIED_FROM_P1_FLAG = int("10", 2)
+COPIED_FROM_P2_FLAG = int("11", 2)
+
+# structure is <flag><filename-end><copy-source>
+INDEX_HEADER = struct.Struct(">L")
+INDEX_ENTRY = struct.Struct(">bLL")
+
+
 def encode_files_sidedata(files):
-    sortedfiles = sorted(files.touched)
-    sidedata = {}
-    p1copies = files.copied_from_p1
-    if p1copies:
-        p1copies = encodecopies(sortedfiles, p1copies)
-        sidedata[sidedatamod.SD_P1COPIES] = p1copies
-    p2copies = files.copied_from_p2
-    if p2copies:
-        p2copies = encodecopies(sortedfiles, p2copies)
-        sidedata[sidedatamod.SD_P2COPIES] = p2copies
-    filesadded = files.added
-    if filesadded:
-        filesadded = encodefileindices(sortedfiles, filesadded)
-        sidedata[sidedatamod.SD_FILESADDED] = filesadded
-    filesremoved = files.removed
-    if filesremoved:
-        filesremoved = encodefileindices(sortedfiles, filesremoved)
-        sidedata[sidedatamod.SD_FILESREMOVED] = filesremoved
-    if not sidedata:
-        sidedata = None
-    return sidedata
+    touched = files.touched
+    added = files.added
+    removed = files.removed
+    merged = files.merged
+    copied_from_p1 = files.copied_from_p1
+    copied_from_p2 = files.copied_from_p2
+
+    all_files = set(touched)
+    all_files.update(copied_from_p1.values())
+    all_files.update(copied_from_p2.values())
+    all_files = sorted(all_files)
+    file_idx = {f: i for (i, f) in enumerate(all_files)}
+    file_idx[None] = 0
+
+    chunks = [INDEX_HEADER.pack(len(all_files))]
+
+    filename_length = 0
+    for f in all_files:
+        filename_size = len(f)
+        filename_length += filename_size
+        flag = 0
+        if f in added:
+            flag |= ADDED_FLAG
+        elif f in merged:
+            flag |= MERGED_FLAG
+        elif f in removed:
+            flag |= REMOVED_FLAG
+        elif f in touched:
+            flag |= TOUCHED_FLAG
+
+        copy = None
+        if f in copied_from_p1:
+            flag |= COPIED_FROM_P1_FLAG
+            copy = copied_from_p1.get(f)
+        elif f in copied_from_p2:
+            copy = copied_from_p2.get(f)
+            flag |= COPIED_FROM_P2_FLAG
+        copy_idx = file_idx[copy]
+        chunks.append(INDEX_ENTRY.pack(flag, filename_length, copy_idx))
+    chunks.extend(all_files)
+    return {sidedatamod.SD_FILES: b''.join(chunks)}
 
 
 def decode_files_sidedata(changelogrevision, sidedata):
-    """Return a ChangingFiles instance from a changelogrevision using sidata
-    """
-    touched = changelogrevision.files
+    md = ChangingFiles()
+    raw = sidedata.get(sidedatamod.SD_FILES)
+
+    if raw is None:
+        return md
 
-    rawindices = sidedata.get(sidedatamod.SD_FILESADDED)
-    added = decodefileindices(touched, rawindices)
+    copies = []
+    all_files = []
 
-    rawindices = sidedata.get(sidedatamod.SD_FILESREMOVED)
-    removed = decodefileindices(touched, rawindices)
+    total_files = INDEX_HEADER.unpack_from(raw, 0)[0]
+
+    offset = INDEX_HEADER.size
+    file_offset_base = offset + (INDEX_ENTRY.size * total_files)
+    file_offset_last = file_offset_base
 
-    rawcopies = sidedata.get(sidedatamod.SD_P1COPIES)
-    p1_copies = decodecopies(touched, rawcopies)
-
-    rawcopies = sidedata.get(sidedatamod.SD_P2COPIES)
-    p2_copies = decodecopies(touched, rawcopies)
+    for idx in range(total_files):
+        flag, file_end, copy_idx = INDEX_ENTRY.unpack_from(raw, offset)
+        file_end += file_offset_base
+        filename = raw[file_offset_last:file_end]
+        offset += INDEX_ENTRY.size
+        file_offset_last = file_end
+        all_files.append(filename)
+        if flag & ACTION_MASK == ADDED_FLAG:
+            md.mark_added(filename)
+        elif flag & ACTION_MASK == MERGED_FLAG:
+            md.mark_merged(filename)
+        elif flag & ACTION_MASK == REMOVED_FLAG:
+            md.mark_removed(filename)
+        elif flag & ACTION_MASK == TOUCHED_FLAG:
+            md.mark_touched(filename)
 
-    return ChangingFiles(
-        touched=touched,
-        added=added,
-        removed=removed,
-        p1_copies=p1_copies,
-        p2_copies=p2_copies,
-    )
+        copied = None
+        if flag & COPIED_MASK == COPIED_FROM_P1_FLAG:
+            copied = md.mark_copied_from_p1
+        elif flag & COPIED_MASK == COPIED_FROM_P2_FLAG:
+            copied = md.mark_copied_from_p2
+
+        if copied is not None:
+            copies.append((copied, filename, copy_idx))
+
+    for copied, filename, copy_idx in copies:
+        copied(all_files[copy_idx], filename)
+
+    return md
 
 
 def _getsidedata(srcrepo, rev):
@@ -416,23 +477,15 @@ 
     filescopies = computechangesetcopies(ctx)
     filesadded = computechangesetfilesadded(ctx)
     filesremoved = computechangesetfilesremoved(ctx)
-    sidedata = {}
-    if any([filescopies, filesadded, filesremoved]):
-        sortedfiles = sorted(ctx.files())
-        p1copies, p2copies = filescopies
-        p1copies = encodecopies(sortedfiles, p1copies)
-        p2copies = encodecopies(sortedfiles, p2copies)
-        filesadded = encodefileindices(sortedfiles, filesadded)
-        filesremoved = encodefileindices(sortedfiles, filesremoved)
-        if p1copies:
-            sidedata[sidedatamod.SD_P1COPIES] = p1copies
-        if p2copies:
-            sidedata[sidedatamod.SD_P2COPIES] = p2copies
-        if filesadded:
-            sidedata[sidedatamod.SD_FILESADDED] = filesadded
-        if filesremoved:
-            sidedata[sidedatamod.SD_FILESREMOVED] = filesremoved
-    return sidedata
+    filesmerged = computechangesetfilesmerged(ctx)
+    files = ChangingFiles()
+    files.update_touched(ctx.files())
+    files.update_added(filesadded)
+    files.update_removed(filesremoved)
+    files.update_merged(filesmerged)
+    files.update_copies_from_p1(filescopies[0])
+    files.update_copies_from_p2(filescopies[1])
+    return encode_files_sidedata(files)
 
 
 def getsidedataadder(srcrepo, destrepo):
diff --git a/mercurial/helptext/internals/revlogs.txt b/mercurial/helptext/internals/revlogs.txt
--- a/mercurial/helptext/internals/revlogs.txt
+++ b/mercurial/helptext/internals/revlogs.txt
@@ -239,3 +239,75 @@ 
 2. Hash the fulltext of the revision
 
 The 20 byte node ids of the parents are fed into the hasher in ascending order.
+
+Changed Files side-data
+=======================
+
+(This feature is in active development and its behavior is not frozen yet. It
+should not be used in any production repository)
+
+When the `exp-copies-sidedata-changeset` requirement is in use, information
+related to the changed files will be stored as "side-data" for every changeset
+in the changelog.
+
+This data contains the following information:
+
+* set of files actively added by the changeset
+* set of files actively removed by the changeset
+* set of files actively merged by the changeset
+* set of files actively touched by the changeset
+* mapping of copy-source, copy-destination from first parent (p1)
+* mapping of copy-source, copy-destination from second parent (p2)
+
+The block itself is big-endian data, formatted in three sections: header, index,
+and data. See below for details:
+
+Header:
+
+    4 bytes: unsigned integer
+
+        total number of entries in the index
+
+Index:
+
+  The index contains an entry for every involved filename. It is sorted by
+  filename. Each entry uses the following format:
+
+    1 byte:  bit field
+
+        This byte holds two different bit fields:
+
+        The 2 lower bits carry copy information:
+
+            `00`: file has no copy information,
+            `10`: file is copied from a p1 source,
+            `11`: file is copied from a p2 source.
+
+        The next 3 bits carry action information:
+
+            `000`: file was untouched, it exists in the index as a copy source,
+            `001`: file was actively added
+            `010`: file was actively merged
+            `011`: file was actively removed
+            `100`: reserved for future use
+            `101`: file was actively touched in any other way
+
+        (The 3 remaining high bits are unused)
+
+    4 bytes: unsigned integer
+
+        Address (in bytes) of the end of the associated filename in the data
+        block. (This is the address of the first byte not part of the filename)
+
+        The start of the filename can be retrieved by reading that field for the
+        previous index entry. The filename of the first entry starts at zero.
+
+    4 bytes: unsigned integer
+
+        Index (in this very index) of the source of the copy (when a copy is
+        happening). If no copy is happening, the value of this field is irrelevant
+        and could have any value. It is set to zero by convention.
+
+Data:
+
+  raw bytes block containing all filenames concatenated without any separator.
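
As a reading aid for the format documented above, here is a minimal standalone sketch that decodes a block into readable tuples. It is not the patch's `decode_files_sidedata`: the function `describe_block` and the `ACTIONS` and `COPIES` tables are hypothetical names introduced for illustration, and the sample is one of the blocks recorded in tests/test-copies-in-changeset.t.

```python
import struct

INDEX_HEADER = struct.Struct(">L")   # number of index entries
INDEX_ENTRY = struct.Struct(">bLL")  # <flag><filename-end><copy-source>

# Bits 2-4 of the flag byte (action) and bits 0-1 (copy), as documented above.
ACTIONS = {
    0b000: "untouched",
    0b001: "added",
    0b010: "merged",
    0b011: "removed",
    0b101: "touched",
}
COPIES = {0b10: "p1", 0b11: "p2"}


def describe_block(raw):
    """Return a list of (filename, action, copy_parent, copy_source) tuples."""
    total = INDEX_HEADER.unpack_from(raw, 0)[0]
    data_base = INDEX_HEADER.size + INDEX_ENTRY.size * total
    entries = []
    start = 0
    for i in range(total):
        pos = INDEX_HEADER.size + INDEX_ENTRY.size * i
        flag, end, copy_idx = INDEX_ENTRY.unpack_from(raw, pos)
        name = raw[data_base + start:data_base + end]
        start = end  # the next filename starts where this one ends
        action = ACTIONS.get((flag >> 2) & 0b111, "reserved")
        parent = COPIES.get(flag & 0b11)
        entries.append((name, action, parent, copy_idx))

    # Copy sources are indices into the same index, so resolve them last.
    names = [entry[0] for entry in entries]
    result = []
    for name, action, parent, copy_idx in entries:
        source = names[copy_idx] if parent is not None else None
        result.append((name, action, parent, source))
    return result


# 25-byte block from the updated test output: two entries, "b" and "b2".
sample = (
    b'\x00\x00\x00\x02'
    b'\x0c\x00\x00\x00\x01\x00\x00\x00\x00'
    b'\x06\x00\x00\x00\x03\x00\x00\x00\x00'
    b'bb2'
)
```

For this sample, `describe_block(sample)` yields `(b'b', 'removed', None, None)` and `(b'b2', 'added', 'p1', b'b')`: the flag `0x0c` is action `011` with no copy bits, and `0x06` is action `001` with copy bits `10`.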