Patchwork [09,of,10,lazy-changelog-parse] changelog: avoid slicing raw data until needed

login
register
mail settings
Submitter Gregory Szorc
Date March 6, 2016, 11:58 p.m.
Message ID <d1a19a190175f7fd17a3.1457308735@ubuntu-vm-main>
Download mbox | patch
Permalink /patch/13638/
State Accepted
Delegated to: Martin von Zweigbergk
Headers show

Comments

Gregory Szorc - March 6, 2016, 11:58 p.m.
# HG changeset patch
# User Gregory Szorc <gregory.szorc@gmail.com>
# Date 1457307620 28800
#      Sun Mar 06 15:40:20 2016 -0800
# Node ID d1a19a190175f7fd17a37a2a97e5969e8cd8b9a4
# Parent  6f9345c743c024c50c634f6cb54d2dd892dc2792
changelog: avoid slicing raw data until needed

Before, we were slicing the original raw text and storing individual
variables with values corresponding to each field. This is avoidable
overhead.

With this patch, we store the offsets of the fields at construction
time and perform the slice when a property is accessed.

This appears to show a very marginal performance win on its own and
the gains are so small as to not be worth reporting. However, this
patch marks the end of our parsing refactor, so it is worth reporting
the gains from the entire series:

author(mpm)
0.896565
0.795987  89%

desc(bug)
0.887169
0.803438  90%

date(2015)
0.878797
0.773961  88%

extra(rebase_source)
0.865446
0.761603  88%

author(mpm) or author(greg)
1.801832
1.576025  87%

author(mpm) or desc(bug)
1.812438
1.593335  88%

date(2015) or branch(default)
0.968276
0.875270  90%

author(mpm) or desc(bug) or date(2015) or extra(rebase_source)
3.656193
3.183104  87%

Pretty consistent speed-up across the board for any revset accessing
changelog revision data. Not bad!

It's also worth noting that PyPy appears to experience a similar to
marginally greater speed-up as well!

According to statprof, revsets accessing changelog revision data
are now clearly dominated by zlib decompression (16-17% of execution
time). Surprisingly, it appears the most expensive part of revision
parsing are the various text.index() calls to search for newlines!
These appear to cumulatively add up to 5+% of execution time. I
reckon implementing the parsing in C would make things marginally
faster.

If accessing larger strings (such as the commit message),
encoding.tolocal() is the most expensive procedure outside of
decompression.
Gregory Szorc - March 7, 2016, 12:13 a.m.
Yuya, et al:

You'll note from the revset numbers in this series that revset execution
time increases proportionally to the number of revset functions that access
changelog revision data. e.g. "author(x) or author(y)" is roughly 2x slower
than "author(x)." We're apparently decoding changelog revisions for each
revset function that accesses the data.

There is definitely a revset optimization to avoid the redundant changelog
reading. I'm no expert on the revset code and am not sure what the most
appropriate layer to implement such an optimization would be. Unlike
template keywords, we don't have a cache passed to the revset functions. We
/could/ implement an aggressive changectx cache as a context manager and
have it active during revset execution. We could potentially also work
something into the revset layer itself. I'm capable of implementing the
former pretty easily. But I'm not sure it is the best layer. I'd appreciate
if someone who understood the revset code better could weigh in on whether
it is an appropriate location for more optimal behavior here.

On Sun, Mar 6, 2016 at 3:58 PM, Gregory Szorc <gregory.szorc@gmail.com>
wrote:

> # HG changeset patch
> # User Gregory Szorc <gregory.szorc@gmail.com>
> # Date 1457307620 28800
> #      Sun Mar 06 15:40:20 2016 -0800
> # Node ID d1a19a190175f7fd17a37a2a97e5969e8cd8b9a4
> # Parent  6f9345c743c024c50c634f6cb54d2dd892dc2792
> changelog: avoid slicing raw data until needed
>
> Before, we were slicing the original raw text and storing individual
> variables with values corresponding to each field. This is avoidable
> overhead.
>
> With this patch, we store the offsets of the fields at construction
> time and perform the slice when a property is accessed.
>
> This appears to show a very marginal performance win on its own and
> the gains are so small as to not be worth reporting. However, this
> patch marks the end of our parsing refactor, so it is worth reporting
> the gains from the entire series:
>
> author(mpm)
> 0.896565
> 0.795987  89%
>
> desc(bug)
> 0.887169
> 0.803438  90%
>
> date(2015)
> 0.878797
> 0.773961  88%
>
> extra(rebase_source)
> 0.865446
> 0.761603  88%
>
> author(mpm) or author(greg)
> 1.801832
> 1.576025  87%
>
> author(mpm) or desc(bug)
> 1.812438
> 1.593335  88%
>
> date(2015) or branch(default)
> 0.968276
> 0.875270  90%
>
> author(mpm) or desc(bug) or date(2015) or extra(rebase_source)
> 3.656193
> 3.183104  87%
>
> Pretty consistent speed-up across the board for any revset accessing
> changelog revision data. Not bad!
>
> It's also worth noting that PyPy appears to experience a similar to
> marginally greater speed-up as well!
>
> According to statprof, revsets accessing changelog revision data
> are now clearly dominated by zlib decompression (16-17% of execution
> time). Surprisingly, it appears the most expensive part of revision
> parsing are the various text.index() calls to search for newlines!
> These appear to cumulatively add up to 5+% of execution time. I
> reckon implementing the parsing in C would make things marginally
> faster.
>
> If accessing larger strings (such as the commit message),
> encoding.tolocal() is the most expensive procedure outside of
> decompression.
>
> diff --git a/mercurial/changelog.py b/mercurial/changelog.py
> --- a/mercurial/changelog.py
> +++ b/mercurial/changelog.py
> @@ -146,21 +146,18 @@ class changelogrevision(object):
>      """Holds results of a parsed changelog revision.
>
>      Changelog revisions consist of multiple pieces of data, including
>      the manifest node, user, and date. This object exposes a view into
>      the parsed object.
>      """
>
>      __slots__ = (
> -        '_rawdateextra',
> -        '_rawdesc',
> -        '_rawfiles',
> -        '_rawmanifest',
> -        '_rawuser',
> +        '_offsets',
> +        '_text',
>      )
>
>      def __new__(cls, text):
>          if not text:
>              return _changelogrevision(
>                  manifest=nullid,
>                  user='',
>                  date=(0, 0),
> @@ -180,51 +177,51 @@ class changelogrevision(object):
>          #                 : extra is metadata, encoded and separated by
> '\0'
>          #                 : older versions ignore it
>          # files\n\n       : files modified by the cset, no \n or \r
> allowed
>          # (.*)            : comment (free text, ideally utf-8)
>          #
>          # changelog v0 doesn't use extra
>
>          nl1 = text.index('\n')
> -        self._rawmanifest = text[0:nl1]
> -
>          nl2 = text.index('\n', nl1 + 1)
> -        self._rawuser = text[nl1 + 1:nl2]
> -
>          nl3 = text.index('\n', nl2 + 1)
> -        self._rawdateextra = text[nl2 + 1:nl3]
>
>          # The list of files may be empty. Which means nl3 is the first of
> the
>          # double newline that precedes the description.
>          if text[nl3 + 1] == '\n':
> -            self._rawfiles = None
> -            self._rawdesc = text[nl3 + 2:]
> +            doublenl = nl3
>          else:
>              doublenl = text.index('\n\n', nl3 + 1)
> -            self._rawfiles = text[nl3 + 1:doublenl]
> -            self._rawdesc = text[doublenl + 2:]
> +
> +        self._offsets = (nl1, nl2, nl3, doublenl)
> +        self._text = text
>
>          return self
>
>      @property
>      def manifest(self):
> -        return bin(self._rawmanifest)
> +        return bin(self._text[0:self._offsets[0]])
>
>      @property
>      def user(self):
> -        return encoding.tolocal(self._rawuser)
> +        off = self._offsets
> +        return encoding.tolocal(self._text[off[0] + 1:off[1]])
>
>      @property
>      def _rawdate(self):
> -        return self._rawdateextra.split(' ', 2)[0:2]
> +        off = self._offsets
> +        dateextra = self._text[off[1] + 1:off[2]]
> +        return dateextra.split(' ', 2)[0:2]
>
>      @property
>      def _rawextra(self):
> -        fields = self._rawdateextra.split(' ', 2)
> +        off = self._offsets
> +        dateextra = self._text[off[1] + 1:off[2]]
> +        fields = dateextra.split(' ', 2)
>          if len(fields) != 3:
>              return None
>
>          return fields[2]
>
>      @property
>      def date(self):
>          raw = self._rawdate
> @@ -242,24 +239,25 @@ class changelogrevision(object):
>          raw = self._rawextra
>          if raw is None:
>              return _defaultextra
>
>          return decodeextra(raw)
>
>      @property
>      def files(self):
> -        if self._rawfiles is None:
> +        off = self._offsets
> +        if off[2] == off[3]:
>              return []
>
> -        return self._rawfiles.split('\n')
> +        return self._text[off[2] + 1:off[3]].split('\n')
>
>      @property
>      def description(self):
> -        return encoding.tolocal(self._rawdesc)
> +        return encoding.tolocal(self._text[self._offsets[3] + 2:])
>
>  class changelog(revlog.revlog):
>      def __init__(self, opener):
>          revlog.revlog.__init__(self, opener, "00changelog.i")
>          if self._initempty:
>              # changelogs don't benefit from generaldelta
>              self.version &= ~revlog.REVLOGGENERALDELTA
>              self._generaldelta = False
>
Yuya Nishihara - March 9, 2016, 1:41 p.m.
On Sun, 6 Mar 2016 16:13:07 -0800, Gregory Szorc wrote:
> You'll note from the revset numbers in this series that revset execution
> time increases proportionally to the number of revset functions that access
> changelog revision data. e.g. "author(x) or author(y)" is roughly 2x slower
> than "author(x)." We're apparently decoding changelog revisions for each
> revset function that accesses the data.
> 
> There is definitely a revset optimization to avoid the redundant changelog
> reading. I'm no expert on the revset code and am not sure what the most
> appropriate layer to implement such an optimization would be. Unlike
> template keywords, we don't have a cache passed to the revset functions. We
> /could/ implement an aggressive changectx cache as a context manager and
> have it active during revset execution. We could potentially also work
> something into the revset layer itself. I'm capable of implementing the
> former pretty easily. But I'm not sure it is the best layer. I'd appreciate
> if someone who understood the revset code better could weigh in on whether
> it is an appropriate location for more optimal behavior here.

Perhaps we could make optimize() fold a sequence of ctx-reading functions
into one. For example:

  "author(x) or desc(y)" -> "_matchctx('author:x', 'desc:y')"

It will also eliminate the deduplication process currently done by the addset.

A drawback of this approach is that it can't handle complex cases, e.g.
"author(x) or X or desc(y)".

Patch

diff --git a/mercurial/changelog.py b/mercurial/changelog.py
--- a/mercurial/changelog.py
+++ b/mercurial/changelog.py
@@ -146,21 +146,18 @@  class changelogrevision(object):
     """Holds results of a parsed changelog revision.
 
     Changelog revisions consist of multiple pieces of data, including
     the manifest node, user, and date. This object exposes a view into
     the parsed object.
     """
 
     __slots__ = (
-        '_rawdateextra',
-        '_rawdesc',
-        '_rawfiles',
-        '_rawmanifest',
-        '_rawuser',
+        '_offsets',
+        '_text',
     )
 
     def __new__(cls, text):
         if not text:
             return _changelogrevision(
                 manifest=nullid,
                 user='',
                 date=(0, 0),
@@ -180,51 +177,51 @@  class changelogrevision(object):
         #                 : extra is metadata, encoded and separated by '\0'
         #                 : older versions ignore it
         # files\n\n       : files modified by the cset, no \n or \r allowed
         # (.*)            : comment (free text, ideally utf-8)
         #
         # changelog v0 doesn't use extra
 
         nl1 = text.index('\n')
-        self._rawmanifest = text[0:nl1]
-
         nl2 = text.index('\n', nl1 + 1)
-        self._rawuser = text[nl1 + 1:nl2]
-
         nl3 = text.index('\n', nl2 + 1)
-        self._rawdateextra = text[nl2 + 1:nl3]
 
         # The list of files may be empty. Which means nl3 is the first of the
         # double newline that precedes the description.
         if text[nl3 + 1] == '\n':
-            self._rawfiles = None
-            self._rawdesc = text[nl3 + 2:]
+            doublenl = nl3
         else:
             doublenl = text.index('\n\n', nl3 + 1)
-            self._rawfiles = text[nl3 + 1:doublenl]
-            self._rawdesc = text[doublenl + 2:]
+
+        self._offsets = (nl1, nl2, nl3, doublenl)
+        self._text = text
 
         return self
 
     @property
     def manifest(self):
-        return bin(self._rawmanifest)
+        return bin(self._text[0:self._offsets[0]])
 
     @property
     def user(self):
-        return encoding.tolocal(self._rawuser)
+        off = self._offsets
+        return encoding.tolocal(self._text[off[0] + 1:off[1]])
 
     @property
     def _rawdate(self):
-        return self._rawdateextra.split(' ', 2)[0:2]
+        off = self._offsets
+        dateextra = self._text[off[1] + 1:off[2]]
+        return dateextra.split(' ', 2)[0:2]
 
     @property
     def _rawextra(self):
-        fields = self._rawdateextra.split(' ', 2)
+        off = self._offsets
+        dateextra = self._text[off[1] + 1:off[2]]
+        fields = dateextra.split(' ', 2)
         if len(fields) != 3:
             return None
 
         return fields[2]
 
     @property
     def date(self):
         raw = self._rawdate
@@ -242,24 +239,25 @@  class changelogrevision(object):
         raw = self._rawextra
         if raw is None:
             return _defaultextra
 
         return decodeextra(raw)
 
     @property
     def files(self):
-        if self._rawfiles is None:
+        off = self._offsets
+        if off[2] == off[3]:
             return []
 
-        return self._rawfiles.split('\n')
+        return self._text[off[2] + 1:off[3]].split('\n')
 
     @property
     def description(self):
-        return encoding.tolocal(self._rawdesc)
+        return encoding.tolocal(self._text[self._offsets[3] + 2:])
 
 class changelog(revlog.revlog):
     def __init__(self, opener):
         revlog.revlog.__init__(self, opener, "00changelog.i")
         if self._initempty:
             # changelogs don't benefit from generaldelta
             self.version &= ~revlog.REVLOGGENERALDELTA
             self._generaldelta = False