Patchwork [4,of,4] changelog: lazy decode user (API)

login
register
mail settings
Submitter Gregory Szorc
Date Feb. 28, 2016, 7:27 a.m.
Message ID <ee98b780730118e8a894.1456644453@ubuntu-vm-main>
Download mbox | patch
Permalink /patch/13455/
State Accepted
Headers show

Comments

Gregory Szorc - Feb. 28, 2016, 7:27 a.m.
# HG changeset patch
# User Gregory Szorc <gregory.szorc@gmail.com>
# Date 1456641258 28800
#      Sat Feb 27 22:34:18 2016 -0800
# Node ID ee98b780730118e8a8948396507633a0460c154e
# Parent  8427442ba08dd8dc324ea9e1fd30f65c89b2b753
changelog: lazy decode user (API)

This appears to show a similar speedup as the previous patch.
Martin von Zweigbergk - Feb. 28, 2016, 3:58 p.m.
How about timestamp+extras? I would think most callers don't need those
either. Are they cheaper to decode/parse?

On Sat, Feb 27, 2016, 23:27 Gregory Szorc <gregory.szorc@gmail.com> wrote:

> # HG changeset patch
> # User Gregory Szorc <gregory.szorc@gmail.com>
> # Date 1456641258 28800
> #      Sat Feb 27 22:34:18 2016 -0800
> # Node ID ee98b780730118e8a8948396507633a0460c154e
> # Parent  8427442ba08dd8dc324ea9e1fd30f65c89b2b753
> changelog: lazy decode user (API)
>
> This appears to show a similar speedup as the previous patch.
>
> diff --git a/mercurial/changelog.py b/mercurial/changelog.py
> --- a/mercurial/changelog.py
> +++ b/mercurial/changelog.py
> @@ -332,30 +332,30 @@ class changelog(revlog.revlog):
>                          : older versions ignore it
>          files\n\n       : files modified by the cset, no \n or \r allowed
>          (.*)            : comment (free text, ideally utf-8)
>
>          changelog v0 doesn't use extra
>
>          Returns a 6-tuple consisting of the following:
>            - manifest node (binary)
> -          - user (encoding.localstr)
> +          - user (binary)
>            - (time, timezone) 2-tuple of a float and int offset
>            - list of files modified by the cset
>            - commit message / description (binary)
>            - dict of extra entries
>          """
>          text = self.revision(node)
>          if not text:
>              return nullid, "", (0, 0), [], "", _defaultextra
>          last = text.index("\n\n")
>          desc = text[last + 2:]
>          l = text[:last].split('\n')
>          manifest = bin(l[0])
> -        user = encoding.tolocal(l[1])
> +        user = l[1]
>
>          tdata = l[2].split(' ', 2)
>          if len(tdata) != 3:
>              time = float(tdata[0])
>              try:
>                  # various tools did silly things with the time zone field.
>                  timezone = int(tdata[1])
>              except ValueError:
> diff --git a/mercurial/context.py b/mercurial/context.py
> --- a/mercurial/context.py
> +++ b/mercurial/context.py
> @@ -543,17 +543,17 @@ class changectx(basectx):
>          return [changectx(repo, p1), changectx(repo, p2)]
>
>      def changeset(self):
>          return self._changeset
>      def manifestnode(self):
>          return self._changeset[0]
>
>      def user(self):
> -        return self._changeset[1]
> +        return encoding.tolocal(self._changeset[1])
>      def date(self):
>          return self._changeset[2]
>      def files(self):
>          return self._changeset[3]
>      def description(self):
>          return encoding.tolocal(self._changeset[4])
>      def branch(self):
>          return encoding.tolocal(self._changeset[5].get("branch"))
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@mercurial-scm.org
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
>
Gregory Szorc - Feb. 28, 2016, 5:26 p.m.
They are actually a bit more expensive! I wanted to test the waters around refactoring the parsing logic before I wrote the more complicated patches :)

> On Feb 28, 2016, at 07:58, Martin von Zweigbergk <martinvonz@google.com> wrote:
> 
> How about timestamp+extras? I would think most callers don't need those either. Are they cheaper to decode/parse?
> 
>> On Sat, Feb 27, 2016, 23:27 Gregory Szorc <gregory.szorc@gmail.com> wrote:
>> # HG changeset patch
>> # User Gregory Szorc <gregory.szorc@gmail.com>
>> # Date 1456641258 28800
>> #      Sat Feb 27 22:34:18 2016 -0800
>> # Node ID ee98b780730118e8a8948396507633a0460c154e
>> # Parent  8427442ba08dd8dc324ea9e1fd30f65c89b2b753
>> changelog: lazy decode user (API)
>> 
>> This appears to show a similar speedup as the previous patch.
>> 
>> diff --git a/mercurial/changelog.py b/mercurial/changelog.py
>> --- a/mercurial/changelog.py
>> +++ b/mercurial/changelog.py
>> @@ -332,30 +332,30 @@ class changelog(revlog.revlog):
>>                          : older versions ignore it
>>          files\n\n       : files modified by the cset, no \n or \r allowed
>>          (.*)            : comment (free text, ideally utf-8)
>> 
>>          changelog v0 doesn't use extra
>> 
>>          Returns a 6-tuple consisting of the following:
>>            - manifest node (binary)
>> -          - user (encoding.localstr)
>> +          - user (binary)
>>            - (time, timezone) 2-tuple of a float and int offset
>>            - list of files modified by the cset
>>            - commit message / description (binary)
>>            - dict of extra entries
>>          """
>>          text = self.revision(node)
>>          if not text:
>>              return nullid, "", (0, 0), [], "", _defaultextra
>>          last = text.index("\n\n")
>>          desc = text[last + 2:]
>>          l = text[:last].split('\n')
>>          manifest = bin(l[0])
>> -        user = encoding.tolocal(l[1])
>> +        user = l[1]
>> 
>>          tdata = l[2].split(' ', 2)
>>          if len(tdata) != 3:
>>              time = float(tdata[0])
>>              try:
>>                  # various tools did silly things with the time zone field.
>>                  timezone = int(tdata[1])
>>              except ValueError:
>> diff --git a/mercurial/context.py b/mercurial/context.py
>> --- a/mercurial/context.py
>> +++ b/mercurial/context.py
>> @@ -543,17 +543,17 @@ class changectx(basectx):
>>          return [changectx(repo, p1), changectx(repo, p2)]
>> 
>>      def changeset(self):
>>          return self._changeset
>>      def manifestnode(self):
>>          return self._changeset[0]
>> 
>>      def user(self):
>> -        return self._changeset[1]
>> +        return encoding.tolocal(self._changeset[1])
>>      def date(self):
>>          return self._changeset[2]
>>      def files(self):
>>          return self._changeset[3]
>>      def description(self):
>>          return encoding.tolocal(self._changeset[4])
>>      def branch(self):
>>          return encoding.tolocal(self._changeset[5].get("branch"))
>> _______________________________________________
>> Mercurial-devel mailing list
>> Mercurial-devel@mercurial-scm.org
>> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Martin von Zweigbergk - Feb. 29, 2016, 6 a.m.
On Sun, Feb 28, 2016 at 9:26 AM, Gregory Szorc <gregory.szorc@gmail.com> wrote:
> They are actually a bit more expensive!

I thought so...

> I wanted to test the waters around
> refactoring the parsing logic before I wrote the more complicated patches :)

... but that explains it. The series looks good to me, but seems like
something Matt (on CC) might have more opinion about.

>
> On Feb 28, 2016, at 07:58, Martin von Zweigbergk <martinvonz@google.com>
> wrote:
>
> How about timestamp+extras? I would think most callers don't need those
> either. Are they cheaper to decode/parse?
>
> On Sat, Feb 27, 2016, 23:27 Gregory Szorc <gregory.szorc@gmail.com> wrote:
>>
>> # HG changeset patch
>> # User Gregory Szorc <gregory.szorc@gmail.com>
>> # Date 1456641258 28800
>> #      Sat Feb 27 22:34:18 2016 -0800
>> # Node ID ee98b780730118e8a8948396507633a0460c154e
>> # Parent  8427442ba08dd8dc324ea9e1fd30f65c89b2b753
>> changelog: lazy decode user (API)
>>
>> This appears to show a similar speedup as the previous patch.
>>
>> diff --git a/mercurial/changelog.py b/mercurial/changelog.py
>> --- a/mercurial/changelog.py
>> +++ b/mercurial/changelog.py
>> @@ -332,30 +332,30 @@ class changelog(revlog.revlog):
>>                          : older versions ignore it
>>          files\n\n       : files modified by the cset, no \n or \r allowed
>>          (.*)            : comment (free text, ideally utf-8)
>>
>>          changelog v0 doesn't use extra
>>
>>          Returns a 6-tuple consisting of the following:
>>            - manifest node (binary)
>> -          - user (encoding.localstr)
>> +          - user (binary)
>>            - (time, timezone) 2-tuple of a float and int offset
>>            - list of files modified by the cset
>>            - commit message / description (binary)
>>            - dict of extra entries
>>          """
>>          text = self.revision(node)
>>          if not text:
>>              return nullid, "", (0, 0), [], "", _defaultextra
>>          last = text.index("\n\n")
>>          desc = text[last + 2:]
>>          l = text[:last].split('\n')
>>          manifest = bin(l[0])
>> -        user = encoding.tolocal(l[1])
>> +        user = l[1]
>>
>>          tdata = l[2].split(' ', 2)
>>          if len(tdata) != 3:
>>              time = float(tdata[0])
>>              try:
>>                  # various tools did silly things with the time zone
>> field.
>>                  timezone = int(tdata[1])
>>              except ValueError:
>> diff --git a/mercurial/context.py b/mercurial/context.py
>> --- a/mercurial/context.py
>> +++ b/mercurial/context.py
>> @@ -543,17 +543,17 @@ class changectx(basectx):
>>          return [changectx(repo, p1), changectx(repo, p2)]
>>
>>      def changeset(self):
>>          return self._changeset
>>      def manifestnode(self):
>>          return self._changeset[0]
>>
>>      def user(self):
>> -        return self._changeset[1]
>> +        return encoding.tolocal(self._changeset[1])
>>      def date(self):
>>          return self._changeset[2]
>>      def files(self):
>>          return self._changeset[3]
>>      def description(self):
>>          return encoding.tolocal(self._changeset[4])
>>      def branch(self):
>>          return encoding.tolocal(self._changeset[5].get("branch"))
>> _______________________________________________
>> Mercurial-devel mailing list
>> Mercurial-devel@mercurial-scm.org
>> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Augie Fackler - March 1, 2016, 6:45 p.m.
On Sat, Feb 27, 2016 at 11:27:33PM -0800, Gregory Szorc wrote:
> # HG changeset patch
> # User Gregory Szorc <gregory.szorc@gmail.com>
> # Date 1456641258 28800
> #      Sat Feb 27 22:34:18 2016 -0800
> # Node ID ee98b780730118e8a8948396507633a0460c154e
> # Parent  8427442ba08dd8dc324ea9e1fd30f65c89b2b753
> changelog: lazy decode user (API)

I like where this is going. Please send more like this.

Queued.

> This appears to show a similar speedup as the previous patch.
>
> diff --git a/mercurial/changelog.py b/mercurial/changelog.py
> --- a/mercurial/changelog.py
> +++ b/mercurial/changelog.py
> @@ -332,30 +332,30 @@ class changelog(revlog.revlog):
>                          : older versions ignore it
>          files\n\n       : files modified by the cset, no \n or \r allowed
>          (.*)            : comment (free text, ideally utf-8)
>
>          changelog v0 doesn't use extra
>
>          Returns a 6-tuple consisting of the following:
>            - manifest node (binary)
> -          - user (encoding.localstr)
> +          - user (binary)
>            - (time, timezone) 2-tuple of a float and int offset
>            - list of files modified by the cset
>            - commit message / description (binary)
>            - dict of extra entries
>          """
>          text = self.revision(node)
>          if not text:
>              return nullid, "", (0, 0), [], "", _defaultextra
>          last = text.index("\n\n")
>          desc = text[last + 2:]
>          l = text[:last].split('\n')
>          manifest = bin(l[0])
> -        user = encoding.tolocal(l[1])
> +        user = l[1]
>
>          tdata = l[2].split(' ', 2)
>          if len(tdata) != 3:
>              time = float(tdata[0])
>              try:
>                  # various tools did silly things with the time zone field.
>                  timezone = int(tdata[1])
>              except ValueError:
> diff --git a/mercurial/context.py b/mercurial/context.py
> --- a/mercurial/context.py
> +++ b/mercurial/context.py
> @@ -543,17 +543,17 @@ class changectx(basectx):
>          return [changectx(repo, p1), changectx(repo, p2)]
>
>      def changeset(self):
>          return self._changeset
>      def manifestnode(self):
>          return self._changeset[0]
>
>      def user(self):
> -        return self._changeset[1]
> +        return encoding.tolocal(self._changeset[1])
>      def date(self):
>          return self._changeset[2]
>      def files(self):
>          return self._changeset[3]
>      def description(self):
>          return encoding.tolocal(self._changeset[4])
>      def branch(self):
>          return encoding.tolocal(self._changeset[5].get("branch"))
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@mercurial-scm.org
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Matt Mackall - March 1, 2016, 7:22 p.m.
On Sat, 2016-02-27 at 23:27 -0800, Gregory Szorc wrote:
> # HG changeset patch
> # User Gregory Szorc <gregory.szorc@gmail.com>
> # Date 1456641258 28800
> #      Sat Feb 27 22:34:18 2016 -0800
> # Node ID ee98b780730118e8a8948396507633a0460c154e
> # Parent  8427442ba08dd8dc324ea9e1fd30f65c89b2b753
> changelog: lazy decode user (API)
> 
> This appears to show a similar speedup as the previous patch.

These two scare me (and are against our encoding conventions).

I like the idea of being lazy here and I've definitely seen the hit for this in
profiles, but I worry that this will leak utf-8 data to users that are expecting
local strings and we won't discover the problem until some end user runs it on a
non-utf-8 system months down the road.

Because these sorts of encoding confusions are very hard to keep track of in a
weakly-typed system, our rule has always been: limit the exposure of system to
the secondary types as far as possible. Which is why ALL changelog
encoding/decoding is handled today in just a couple functions in changelog.py
and we mostly don't have to think about it.

If we want to go further down the lazy road, I can imagine shimming in a
lightweight class (still in changelog.py) to replace the return tuple. It can be
initialized with the raw changeset data and have accessor methods or members to
unpack/decode the pieces. Then we'll still be lazy without subtly changing the
types of the legacy API that we're still using all over the place.

Or, we can leave changeset.read() alone and add another entrypoint that contexts
can use for lazy parsing.

-- 
Mathematics is the supreme nostalgia of our time.
Augie Fackler - March 1, 2016, 9:12 p.m.
> On Mar 1, 2016, at 11:22, Matt Mackall <mpm@selenic.com> wrote:
> 
> On Sat, 2016-02-27 at 23:27 -0800, Gregory Szorc wrote:
>> # HG changeset patch
>> # User Gregory Szorc <gregory.szorc@gmail.com>
>> # Date 1456641258 28800
>> #      Sat Feb 27 22:34:18 2016 -0800
>> # Node ID ee98b780730118e8a8948396507633a0460c154e
>> # Parent  8427442ba08dd8dc324ea9e1fd30f65c89b2b753
>> changelog: lazy decode user (API)
>> 
>> This appears to show a similar speedup as the previous patch.
> 
> These two scare me (and are against our encoding conventions).
> 
> I like the idea of being lazy here and I've definitely seen the hit for this in
> profiles, but I worry that this will leak utf-8 data to users that are expecting
> local strings and we won't discover the problem until some end user runs it on a
> non-utf-8 system months down the road.
> 
> Because these sorts of encoding confusions are very hard to keep track of in a
> weakly-typed system, our rule has always been: limit the exposure of system to
> the secondary types as far as possible. Which is why ALL changelog
> encoding/decoding is handled today in just a couple functions in changelog.py
> and we mostly don't have to think about it.
> 
> If we want to go further down the lazy road, I can imagine shimming in a
> lightweight class (still in changelog.py) to replace the return tuple. It can be
> initialized with the raw changeset data and have accessor methods or members to
> unpack/decode the pieces. Then we'll still be lazy without subtly changing the
> types of the legacy API that we're still using all over the place.
> 
> Or, we can leave changeset.read() alone and add another entrypoint that contexts
> can use for lazy parsing.

I like this option (readraw() or something).

> 
> -- 
> Mathematics is the supreme nostalgia of our time.
> 
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@mercurial-scm.org
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Yuya Nishihara - March 2, 2016, 2:55 p.m.
On Tue, 01 Mar 2016 13:22:20 -0600, Matt Mackall wrote:
> On Sat, 2016-02-27 at 23:27 -0800, Gregory Szorc wrote:
> > # HG changeset patch
> > # User Gregory Szorc <gregory.szorc@gmail.com>
> > # Date 1456641258 28800
> > #      Sat Feb 27 22:34:18 2016 -0800
> > # Node ID ee98b780730118e8a8948396507633a0460c154e
> > # Parent  8427442ba08dd8dc324ea9e1fd30f65c89b2b753
> > changelog: lazy decode user (API)
> > 
> > This appears to show a similar speedup as the previous patch.
> 
> These two scare me (and are against our encoding conventions).
> 
> I like the idea of being lazy here and I've definitely seen the hit for this in
> profiles, but I worry that this will leak utf-8 data to users that are expecting
> local strings and we won't discover the problem until some end user runs it on a
> non-utf-8 system months down the road.
> 
> Because these sorts of encoding confusions are very hard to keep track of in a
> weakly-typed system, our rule has always been: limit the exposure of system to
> the secondary types as far as possible. Which is why ALL changelog
> encoding/decoding is handled today in just a couple functions in changelog.py
> and we mostly don't have to think about it.

FWIW, extra is binary. That's unfortunate we can't convert it to localstr.

  $ hg branch À
  $ hg ci -m branch
  $ HGENCODING=latin1 hg log -T '{get(extras, "branch")} {branch}\n'

Patch

diff --git a/mercurial/changelog.py b/mercurial/changelog.py
--- a/mercurial/changelog.py
+++ b/mercurial/changelog.py
@@ -332,30 +332,30 @@  class changelog(revlog.revlog):
                         : older versions ignore it
         files\n\n       : files modified by the cset, no \n or \r allowed
         (.*)            : comment (free text, ideally utf-8)
 
         changelog v0 doesn't use extra
 
         Returns a 6-tuple consisting of the following:
           - manifest node (binary)
-          - user (encoding.localstr)
+          - user (binary)
           - (time, timezone) 2-tuple of a float and int offset
           - list of files modified by the cset
           - commit message / description (binary)
           - dict of extra entries
         """
         text = self.revision(node)
         if not text:
             return nullid, "", (0, 0), [], "", _defaultextra
         last = text.index("\n\n")
         desc = text[last + 2:]
         l = text[:last].split('\n')
         manifest = bin(l[0])
-        user = encoding.tolocal(l[1])
+        user = l[1]
 
         tdata = l[2].split(' ', 2)
         if len(tdata) != 3:
             time = float(tdata[0])
             try:
                 # various tools did silly things with the time zone field.
                 timezone = int(tdata[1])
             except ValueError:
diff --git a/mercurial/context.py b/mercurial/context.py
--- a/mercurial/context.py
+++ b/mercurial/context.py
@@ -543,17 +543,17 @@  class changectx(basectx):
         return [changectx(repo, p1), changectx(repo, p2)]
 
     def changeset(self):
         return self._changeset
     def manifestnode(self):
         return self._changeset[0]
 
     def user(self):
-        return self._changeset[1]
+        return encoding.tolocal(self._changeset[1])
     def date(self):
         return self._changeset[2]
     def files(self):
         return self._changeset[3]
     def description(self):
         return encoding.tolocal(self._changeset[4])
     def branch(self):
         return encoding.tolocal(self._changeset[5].get("branch"))