Patchwork [2,of,6] changelog: branchinfo is a low level function - it should return UTF-8

login
register
mail settings
Submitter Mads Kiilerich
Date Dec. 14, 2014, 6:34 p.m.
Message ID <4f99561d776947cc3911.1418582061@ssl.google-analytics.com>
Download mbox | patch
Permalink /patch/7088/
State Accepted
Delegated to: Pierre-Yves David
Headers show

Comments

Mads Kiilerich - Dec. 14, 2014, 6:34 p.m.
# HG changeset patch
# User Mads Kiilerich <madski@unity3d.com>
# Date 1418581678 -3600
#      Sun Dec 14 19:27:58 2014 +0100
# Node ID 4f99561d776947cc3911b645c485be1bc2cf0c1f
# Parent  d35c8e8e2c7f980853fc4da0f0e251f8ca197d43
changelog: branchinfo is a low level function - it should return UTF-8

This will make it possible to persist the utf8 value in caches, without having
to decode it from the local encoding and risk case folding issues.

Stuff is renamed to more it more obvious that the type is different.

This refactoring seems to expose an existing problem in branchmap with local
encoding being used for computing the branchmap cache.
Pierre-Yves David - Dec. 14, 2014, 10:07 p.m.
On 12/14/2014 10:34 AM, Mads Kiilerich wrote:
> # HG changeset patch
> # User Mads Kiilerich <madski@unity3d.com>
> # Date 1418581678 -3600
> #      Sun Dec 14 19:27:58 2014 +0100
> # Node ID 4f99561d776947cc3911b645c485be1bc2cf0c1f
> # Parent  d35c8e8e2c7f980853fc4da0f0e251f8ca197d43
> changelog: branchinfo is a low level function - it should return UTF-8

Patch 1 seems obviously right and patch 2 seems a good move.

I've pushed both to the clowncopter.
Matt Mackall - Dec. 22, 2014, 7:51 p.m.
On Sun, 2014-12-14 at 19:34 +0100, Mads Kiilerich wrote:
> # HG changeset patch
> # User Mads Kiilerich <madski@unity3d.com>
> # Date 1418581678 -3600
> #      Sun Dec 14 19:27:58 2014 +0100
> # Node ID 4f99561d776947cc3911b645c485be1bc2cf0c1f
> # Parent  d35c8e8e2c7f980853fc4da0f0e251f8ca197d43
> changelog: branchinfo is a low level function - it should return UTF-8
> 
> This will make it possible to persist the utf8 value in caches, without having
> to decode it from the local encoding and risk case folding issues.
> 
> Stuff is renamed to more it more obvious that the type is different.
> 
> This refactoring seems to expose an existing problem in branchmap with local
> encoding being used for computing the branchmap cache.

You know that there is localstr magic here, right?

http://mercurial.selenic.com/wiki/EncodingStrategy#Round-trip_conversion

Do you have an example of an actual issue?
Mads Kiilerich - Jan. 6, 2015, 4:11 p.m.
On 12/22/2014 08:51 PM, Matt Mackall wrote:
> On Sun, 2014-12-14 at 19:34 +0100, Mads Kiilerich wrote:
>> # HG changeset patch
>> # User Mads Kiilerich <madski@unity3d.com>
>> # Date 1418581678 -3600
>> #      Sun Dec 14 19:27:58 2014 +0100
>> # Node ID 4f99561d776947cc3911b645c485be1bc2cf0c1f
>> # Parent  d35c8e8e2c7f980853fc4da0f0e251f8ca197d43
>> changelog: branchinfo is a low level function - it should return UTF-8
>>
>> This will make it possible to persist the utf8 value in caches, without having
>> to decode it from the local encoding and risk case folding issues.
>>
>> Stuff is renamed to make it more obvious that the type is different.
>>
>> This refactoring seems to expose an existing problem in branchmap with local
>> encoding being used for computing the branchmap cache.
> You know that there is localstr magic here, right?
>
> http://mercurial.selenic.com/wiki/EncodingStrategy#Round-trip_conversion

I was aware of the magic but not to which extent it prevented any actual 
problem here. I agree that it seems good enough to avoid problems in the 
current branchmap implementation.

The localstr magic also seems fragile and a bit inconsistent:

 >>> u = encoding.localstr('u', 'l')
 >>> U = encoding.localstr('U', 'l')
 >>> u == U
True
 >>> hash(u) == hash(U)
False

The localstr magic is fine for interaction with the UI layer, but it 
seems like a layering violation and obviously wrong to use it for 
internal processing that use local encoding for both input and output.

Converting the branch name of every single revision to local encoding 
when we don't need it also seems unfortunate when we want to optimize 
everything that is looping over every single revision.

/Mads
Matt Mackall - Jan. 6, 2015, 8:45 p.m.
On Tue, 2015-01-06 at 17:11 +0100, Mads Kiilerich wrote:
> On 12/22/2014 08:51 PM, Matt Mackall wrote:
> > On Sun, 2014-12-14 at 19:34 +0100, Mads Kiilerich wrote:
> >> # HG changeset patch
> >> # User Mads Kiilerich <madski@unity3d.com>
> >> # Date 1418581678 -3600
> >> #      Sun Dec 14 19:27:58 2014 +0100
> >> # Node ID 4f99561d776947cc3911b645c485be1bc2cf0c1f
> >> # Parent  d35c8e8e2c7f980853fc4da0f0e251f8ca197d43
> >> changelog: branchinfo is a low level function - it should return UTF-8
> >>
> >> This will make it possible to persist the utf8 value in caches, without having
> >> to decode it from the local encoding and risk case folding issues.
> >>
> >> Stuff is renamed to make it more obvious that the type is different.
> >>
> >> This refactoring seems to expose an existing problem in branchmap with local
> >> encoding being used for computing the branchmap cache.
> > You know that there is localstr magic here, right?
> >
> > http://mercurial.selenic.com/wiki/EncodingStrategy#Round-trip_conversion
> 
> I was aware of the magic but not to which extent it prevented any actual 
> problem here. I agree that it seems good enough to avoid problems in the 
> current branchmap implementation.
> 
> The localstr magic also seems fragile and a bit inconsistent:
> 
>  >>> u = encoding.localstr('u', 'l')
>  >>> U = encoding.localstr('U', 'l')

The only caller of the constructor is encoding.tolocal(utf8), which
doesn't allow this to happen.

> The localstr magic is fine for interaction with the UI layer, but it 
> seems like a layering violation and obviously wrong to use it for 
> internal processing that use local encoding for both input and output.

We only use localstr for stuff that's coming from UTF-8. Like branch
names, which are always stored on disk in UTF-8, but the user always
specifies in the local encoding.

The evil here is a) not everyone uses UTF-8 yet and b) UTF-8 can't be
round-tripped to other encodings. Localstr gives the illusion that it
can and lets us work in the most convenient encoding with a bare minimum
number of conversion points.

Localstr is such a huge design win that it works without people even
knowing it's there. Compare that to the usual model of encoding issues
that people don't even think about it until their code explodes on the
first encounter with a non-ASCII character in production. If I were
creating a language, some form of the idea would be in the core library.

> Converting the branch name of every single revision to local encoding 
> when we don't need it also seems unfortunate when we want to optimize 
> everything that is looping over every single revision.

Are you proposing to make the conversion lazy? We could totally do
that.</smiley>

Patch

diff --git a/mercurial/branchmap.py b/mercurial/branchmap.py
--- a/mercurial/branchmap.py
+++ b/mercurial/branchmap.py
@@ -233,9 +233,10 @@ 
         cl = repo.changelog
         # collect new branch entries
         newbranches = {}
-        getbranchinfo = cl.branchinfo
+        getbranchinfoutf8 = cl.branchinfoutf8
         for r in revgen:
-            branch, closesbranch = getbranchinfo(r)
+            branchutf8, closesbranch = getbranchinfoutf8(r)
+            branch = encoding.tolocal(branchutf8)
             newbranches.setdefault(branch, []).append(r)
             if closesbranch:
                 self._closednodes.add(cl.node(r))
diff --git a/mercurial/changelog.py b/mercurial/changelog.py
--- a/mercurial/changelog.py
+++ b/mercurial/changelog.py
@@ -371,10 +371,10 @@ 
         text = "\n".join(l)
         return self.addrevision(text, transaction, len(self), p1, p2)
 
-    def branchinfo(self, rev):
-        """return the branch name and open/close state of a revision
+    def branchinfoutf8(self, rev):
+        """return the UTF-8 branch name and open/close state of a revision
 
         This function exists because creating a changectx object
         just to access this is costly."""
         extra = self.read(rev)[5]
-        return encoding.tolocal(extra.get("branch")), 'close' in extra
+        return extra.get("branch"), 'close' in extra
diff --git a/mercurial/localrepo.py b/mercurial/localrepo.py
--- a/mercurial/localrepo.py
+++ b/mercurial/localrepo.py
@@ -1702,7 +1702,7 @@ 
                     rbheads.extend(bheads)
                     for h in bheads:
                         r = self.changelog.rev(h)
-                        b, c = self.changelog.branchinfo(r)
+                        _butf8, c = self.changelog.branchinfoutf8(r)
                         if c:
                             closed.append(h)