[06,of,10,lazy-changelog-parse] changelog: lazily parse date/extra field

Submitter Gregory Szorc
Date March 6, 2016, 11:58 p.m.
Message ID <7559d2bcdaeb320212bf.1457308732@ubuntu-vm-main>
Permalink /patch/13646/
State Accepted
Delegated to: Martin von Zweigbergk

Comments

Gregory Szorc - March 6, 2016, 11:58 p.m.
# HG changeset patch
# User Gregory Szorc <gregory.szorc@gmail.com>
# Date 1457303425 28800
#      Sun Mar 06 14:30:25 2016 -0800
# Node ID 7559d2bcdaeb320212bf8d37e0e5e2075dec6d18
# Parent  5850dab8a22608aff069198e4d9e0157bbad6828
changelog: lazily parse date/extra field

This is probably the most complicated patch in the parsing
refactor.

Because the date and extras are encoded in the same field, we
stuff the entire field into a dedicated variable and add
properties for accessing each sub-component. There is some
duplicated code here (both raw accessors split the field), but
the code is relatively simple, so it shouldn't be a big deal.
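
As a rough illustration of the approach (a minimal standalone sketch,
not the code in this patch): the class name LazyDateExtra is
hypothetical, the default extra value is assumed to be
{'branch': 'default'}, and the extras decoding is a simplified
stand-in for decodeextra that ignores escaping.

```python
class LazyDateExtra(object):
    """Hypothetical stand-in for the date/extra part of a revision."""

    __slots__ = ('_rawdateextra',)

    def __init__(self, rawdateextra):
        # Only store the raw "time timezone [extra]" field; parse nothing.
        self._rawdateextra = rawdateextra

    @property
    def date(self):
        fields = self._rawdateextra.split(' ', 2)
        time = float(fields[0])
        try:
            # Some tools wrote garbage into the timezone slot.
            timezone = int(fields[1])
        except ValueError:
            timezone = 0
        return time, timezone

    @property
    def extra(self):
        fields = self._rawdateextra.split(' ', 2)
        if len(fields) != 3:
            # Assumed default when no extras are stored.
            return {'branch': 'default'}
        # Simplified decoding (assumption): NUL-separated key:value
        # pairs, with no escape handling.
        pairs = (item.split(':', 1) for item in fields[2].split('\0') if item)
        return dict(pairs)
```

Each property splits the raw field on access, so a caller that only
reads .date never pays for decoding extras, and vice-versa.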

We see revset performance wins across the board:

author(mpm)
0.896565
0.876713
0.822961

desc(bug)
0.887169
0.895514
0.847054

date(2015)
0.878797
0.820987
0.811613

extra(rebase_source)
0.865446
0.823811
0.797756

author(mpm) or author(greg)
1.801832
1.784160
1.668172

author(mpm) or desc(bug)
1.812438
1.822756
1.677608

date(2015) or branch(default)
0.968276
0.910981
0.896032

author(mpm) or desc(bug) or date(2015) or extra(rebase_source)
3.656193
3.516788
3.265024

We see a speed-up on revsets accessing date and extras because the new
parsing code only parses what you access. Even though they are stored
in the same text field, we avoid parsing dates when accessing extras
and vice-versa.
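
To make "only parses what you access" concrete, here is a small
sketch that counts parser calls instead of measuring time. All names
(EagerRevision, LazyRevision, parsedate, parseextra, count_calls) are
hypothetical and exist only for this illustration; the parsers are
simplified and are not the ones in changelog.py.

```python
calls = {'date': 0, 'extra': 0}

def parsedate(raw):
    # Count every date parse so the two strategies can be compared.
    calls['date'] += 1
    time, timezone = raw.split(' ', 2)[0:2]
    return float(time), int(timezone)

def parseextra(raw):
    # Count every extras parse; returns the raw extras text or None.
    calls['extra'] += 1
    fields = raw.split(' ', 2)
    return fields[2] if len(fields) == 3 else None

class EagerRevision(object):
    """Parses both sub-fields up front, like the old constructor."""
    def __init__(self, raw):
        self.date = parsedate(raw)
        self.extra = parseextra(raw)

class LazyRevision(object):
    """Keeps the raw field and parses each sub-field on access."""
    def __init__(self, raw):
        self._raw = raw

    @property
    def date(self):
        return parsedate(self._raw)

    @property
    def extra(self):
        return parseextra(self._raw)

def count_calls(cls, raws):
    # Simulate a date-only query: touch .date on every revision.
    calls['date'] = calls['extra'] = 0
    for raw in raws:
        cls(raw).date
    return dict(calls)
```

For a date-only scan, the eager class runs the extras parser once per
revision while the lazy class never runs it. The flip side is that a
lazy property re-parses on every access, since nothing is cached.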

But strangely revsets accessing both date and extras appeared to speed
up as well! I'm not sure if this is due to refactoring the parsing
code or due to an optimization in revsets. You can't argue with the
results!

Patch

diff --git a/mercurial/changelog.py b/mercurial/changelog.py
--- a/mercurial/changelog.py
+++ b/mercurial/changelog.py
@@ -146,19 +146,18 @@  class changelogrevision(object):
     """Holds results of a parsed changelog revision.
 
     Changelog revisions consist of multiple pieces of data, including
     the manifest node, user, and date. This object exposes a view into
     the parsed object.
     """
 
     __slots__ = (
-        'date',
+        '_rawdateextra',
         '_rawdesc',
-        'extra',
         'files',
         '_rawmanifest',
         '_rawuser',
     )
 
     def __new__(cls, text):
         if not text:
             return _changelogrevision(
@@ -189,45 +188,65 @@  class changelogrevision(object):
         self._rawdesc = text[doublenl + 2:]
 
         nl1 = text.index('\n')
         self._rawmanifest = text[0:nl1]
 
         nl2 = text.index('\n', nl1 + 1)
         self._rawuser = text[nl1 + 1:nl2]
 
+        nl3 = text.index('\n', nl2 + 1)
+        self._rawdateextra = text[nl2 + 1:nl3]
+
         l = text[:doublenl].split('\n')
-
-        tdata = l[2].split(' ', 2)
-        if len(tdata) != 3:
-            time = float(tdata[0])
-            try:
-                # various tools did silly things with the time zone field.
-                timezone = int(tdata[1])
-            except ValueError:
-                timezone = 0
-            self.extra = _defaultextra
-        else:
-            time, timezone = float(tdata[0]), int(tdata[1])
-            self.extra = decodeextra(tdata[2])
-
-        self.date = (time, timezone)
         self.files = l[3:]
 
         return self
 
     @property
     def manifest(self):
         return bin(self._rawmanifest)
 
     @property
     def user(self):
         return encoding.tolocal(self._rawuser)
 
     @property
+    def _rawdate(self):
+        return self._rawdateextra.split(' ', 2)[0:2]
+
+    @property
+    def _rawextra(self):
+        fields = self._rawdateextra.split(' ', 2)
+        if len(fields) != 3:
+            return None
+
+        return fields[2]
+
+    @property
+    def date(self):
+        raw = self._rawdate
+        time = float(raw[0])
+        # Various tools did silly things with the timezone.
+        try:
+            timezone = int(raw[1])
+        except ValueError:
+            timezone = 0
+
+        return time, timezone
+
+    @property
+    def extra(self):
+        raw = self._rawextra
+        if raw is None:
+            return _defaultextra
+
+        return decodeextra(raw)
+
+    @property
     def description(self):
         return encoding.tolocal(self._rawdesc)
 
 class changelog(revlog.revlog):
     def __init__(self, opener):
         revlog.revlog.__init__(self, opener, "00changelog.i")
         if self._initempty:
             # changelogs don't benefit from generaldelta