Patchwork [02,of,10] localrepo: bytes for errors

login
register
mail settings
Submitter timeless
Date May 12, 2016, 1:23 a.m.
Message ID <0669c0d7de92c4ef5207.1463016190@gcc2-power8.osuosl.org>
Download mbox | patch
Permalink /patch/15050/
State Rejected
Headers show

Comments

timeless - May 12, 2016, 1:23 a.m.
# HG changeset patch
# User timeless <timeless@mozdev.org>
# Date 1461035348 0
#      Tue Apr 19 03:09:08 2016 +0000
# Node ID 0669c0d7de92c4ef5207272ad17d1126338aa985
# Parent  c3476399e7a64ea562a2748d010aff57a884481f
# Available At bb://timeless/mercurial-crew
#              hg pull bb://timeless/mercurial-crew -r 0669c0d7de92
localrepo: bytes for errors
Matt Mackall - May 12, 2016, 4:47 a.m.
On Thu, 2016-05-12 at 01:23 +0000, timeless wrote:
> # HG changeset patch
> # User timeless <timeless@mozdev.org>
> # Date 1461035348 0
> #      Tue Apr 19 03:09:08 2016 +0000
> # Node ID 0669c0d7de92c4ef5207272ad17d1126338aa985
> # Parent  c3476399e7a64ea562a2748d010aff57a884481f
> # Available At bb://timeless/mercurial-crew
> #              hg pull bb://timeless/mercurial-crew -r 0669c0d7de92
> localrepo: bytes for errors
> 
> diff -r c3476399e7a6 -r 0669c0d7de92 mercurial/localrepo.py
> --- a/mercurial/localrepo.py	Tue Apr 19 14:34:11 2016 +0000
> +++ b/mercurial/localrepo.py	Tue Apr 19 03:09:08 2016 +0000
> @@ -294,9 +294,9 @@
>                          b' dummy changelog to prevent using the old repo
> layout'
>                      )
>              else:
> -                raise error.RepoError(_("repository %s not found") % path)
> +                raise error.RepoError(_("repository %s not found") %
> path.encode('utf-8'))

This violates our encoding strategy:

https://www.mercurial-scm.org/wiki/EncodingStrategy

(In particular, very very little outside of encoding.py should use the letters
"utf" outside of comment.)

It's also very, very broken.

First, path comes from the environment. And in the Unix environment, it can be
basically any sequence of bytes not containing null. It can be in any encoding,
multiple encodings[1], or even no encoding[2]. Unix doesn't know or care and
therefore Mercurial doesn't know or care either. Which means it can't
meaningfully "encode a path" to UTF-8. Nevermind that it's going to immediately
print the string in a locale that might not be UTF-8 either.

Second, str.encode(x) is a method that shouldn't exist because encoding is a
transformation from characters to bytes and a str is already bytes. But worse
than that, it's also broken, because it's effectively
str.decode('ascii').encode(x). So it'll usually no-op in naive tests that will
explode horribly when it first encounters the real world. It's basically a trap,
and it's probably the third-worst unicode design mistake in Python[3].


Also, be aware that I'm going to have a very dim view of bulk "" -> b"" patches.
There are 100x more bare strings than xrange calls, and we've already been
there.

[1] You should see my MP3 collection of German industrial music ripped circa
2002.
[2] Back in the days of university computer systems with tiny quotas, you could
often cheat by storing *binaries* in filenames or symlinks
[3] The second is automatic casting between bytes and unicode.
-- 
Mathematics is the supreme nostalgia of our time.
Jun Wu - May 13, 2016, 1:59 p.m.
One possible solution doing the "" -> b"" conversion is to abuse the

   # coding:

comment and use codecs.register to support a dialect that replaces
all "" to b"" at compiling time.

Existing projects doing this:
1. https://github.com/dropbox/pyxl
2. https://github.com/syrusakbary/interpy


On 05/13/2016 04:48 AM, Gregory Szorc wrote:
 > On Wed, May 11, 2016 at 9:47 PM, Matt Mackall <mpm@selenic.com> wrote:
 >> Also, be aware that I'm going to have a very dim view of bulk "" -> b""
 >> patches.
 >> There are 100x more bare strings than xrange calls, and we've already been
 >> there.
 >>
 >
 > Unfortunately, we're going to need to mass convert "" to b"" if we wish to
 > run the same source on Python 2/3 otherwise Python 3 is going to assume
 > string literals are unicode, which is the exact opposite of what we want.
 > In theory, we might be able to implement a custom module loader on Python 3
 > that does source/ast translation when loading .py files. But this scares me
 > for several reasons.
 >
 > As for this series, I think mass converting "" to b"" is preferred to
 > converting whatever strings you happen to trip over when running whatever
 > tests are failing on Python 3. It's probably easiest to start with
 > everything as bytes and then introduce unicode/str type only where
 > necessary instead of going through this intermediate phase where half the
 > variables/types aren't compatible with Python 3.
Jun Wu - May 13, 2016, 2:18 p.m.
On 05/13/2016 04:48 AM, Gregory Szorc wrote:
> In theory, we might be able to implement a custom module loader on Python 3
> that does source/ast translation when loading .py files. But this scares me
> for several reasons.

I realized the "module loader" may include the "# coding: " hack.
Could you explain the reasons?

I guess the reason we don't want massive codemod is because we want to maintain 
blame information. If that's the concern, the blame-skip-revset
idea may be another possible solution.
Gregory Szorc - May 13, 2016, 4:30 p.m.
On Fri, May 13, 2016 at 7:18 AM, Jun Wu <quark@fb.com> wrote:

> On 05/13/2016 04:48 AM, Gregory Szorc wrote:
>
>> In theory, we might be able to implement a custom module loader on Python
>> 3
>> that does source/ast translation when loading .py files. But this scares
>> me
>> for several reasons.
>>
>
> I realized the "module loader" may include the "# coding: " hack.
> Could you explain the reasons?
>

I didn't realize the "# coding" hack is an option. That's very attractive!

I suggested the module loading hack because we already have a custom module
loader handling mercurial.* modules. We could likely extend it rather
easily to do rewriting. But I think I like the "# coding" idea better.
Feels simpler.


>
> I guess the reason we don't want massive codemod is because we want to
> maintain blame information. If that's the concern, the blame-skip-revset
> idea may be another possible solution.
>

I like when Mercurial development realizes the need for a new Mercurial
feature :)
Matt Mackall - May 13, 2016, 8:37 p.m.
On Fri, 2016-05-13 at 09:30 -0700, Gregory Szorc wrote:
> On Fri, May 13, 2016 at 7:18 AM, Jun Wu <quark@fb.com> wrote:
> 
> > 
> > On 05/13/2016 04:48 AM, Gregory Szorc wrote:
> > 
> > > 
> > > In theory, we might be able to implement a custom module loader on Python
> > > 3
> > > that does source/ast translation when loading .py files. But this scares
> > > me
> > > for several reasons.
> > > 
> > I realized the "module loader" may include the "# coding: " hack.
> > Could you explain the reasons?
> > 
> I didn't realize the "# coding" hack is an option. That's very attractive!
> 
> I suggested the module loading hack because we already have a custom module
> loader handling mercurial.* modules. We could likely extend it rather
> easily to do rewriting. But I think I like the "# coding" idea better.
> Feels simpler.

Requirements for whatever solution we pick:

- minimal impact on 2.x performance and maintainability
- protection against rot / active enforcement by check-code
- cannot require devs to do their own unicode/bytes type analysis
- finite amount of churn[1]

Non-requirements:

- elegant
- good 3.x performance (a lost battle before we hit the first line of code)

Another possibility is we subvert unicode objects in Py3:

- force Py3's idea of the system encoding to be Latin1
- because Latin1 has 256 valid codepoints, it's 1:1 with byte strings
- thus unicode objects effectively become fat, slow byte strings
- our ASCII string literals don't need any adjustment
- we continue to do the "real" charset work in tolocal/fromlocal
- vast bulk of code doesn't need touching

[1] The level of churn we've gone through lately just for the print and import
statement work is a bit too high and is impacting throughput of higher-priority
development.

-- 
Mathematics is the supreme nostalgia of our time.
Gregory Szorc - May 15, 2016, 7:34 p.m.
On Fri, May 13, 2016 at 9:30 AM, Gregory Szorc <gregory.szorc@gmail.com>
wrote:

> On Fri, May 13, 2016 at 7:18 AM, Jun Wu <quark@fb.com> wrote:
>
>> On 05/13/2016 04:48 AM, Gregory Szorc wrote:
>>
>>> In theory, we might be able to implement a custom module loader on
>>> Python 3
>>> that does source/ast translation when loading .py files. But this scares
>>> me
>>> for several reasons.
>>>
>>
>> I realized the "module loader" may include the "# coding: " hack.
>> Could you explain the reasons?
>>
>
> I didn't realize the "# coding" hack is an option. That's very attractive!
>
> I suggested the module loading hack because we already have a custom
> module loader handling mercurial.* modules. We could likely extend it
> rather easily to do rewriting. But I think I like the "# coding" idea
> better. Feels simpler.
>

Playing around with a custom codec, I'm not convinced this is easier than
hacking up module import.

When you use a custom "# coding" line, the source file's bytes get passed
to the codec's decode(). To convert string literals to bytes would require
us to identify string literals and rewrite the source bytes to contain the
"b" prefix. At the point you're parsing string literals, you've just
reinvented Python's parser. So it feels to me that the proper layer to
inject the automagical rewriting would be in the parser or ast level and
that would require custom module loading.

Fortunately, Python 3.5 has all the module import bits implemented in
Python (as opposed to C), so we /should/ have the control we need to inject
ourselves into module loading at the right layer.


>
>
>>
>> I guess the reason we don't want massive codemod is because we want to
>> maintain blame information. If that's the concern, the blame-skip-revset
>> idea may be another possible solution.
>>
>
> I like when Mercurial development realizes the need for a new Mercurial
> feature :)
>
Martijn Pieters - May 18, 2016, 5:43 p.m.
On 15 May 2016 at 20:34, Gregory Szorc <gregory.szorc@gmail.com> wrote:
> Playing around with a custom codec, I'm not convinced this is easier than
> hacking up module import.
>
> When you use a custom "# coding" line, the source file's bytes get passed to
> the codec's decode(). To convert string literals to bytes would require us
> to identify string literals and rewrite the source bytes to contain the "b"
> prefix. At the point you're parsing string literals, you've just reinvented
> Python's parser. So it feels to me that the proper layer to inject the
> automagical rewriting would be in the parser or ast level and that would
> require custom module loading.
>
> Fortunately, Python 3.5 has all the module import bits implemented in Python
> (as opposed to C), so we /should/ have the control we need to inject
> ourselves into module loading at the right layer.

At least the codec route has the advantage that it only has to run
once per source file revision; the result is cached as bytecode.

You don't have to re-invent the parser; you could use the tokenize
module (https://docs.python.org/3/library/tokenize.html) would do all
the heavily lifting. Just look out for token.STRING tokens and process
the token string; it'll be the full string literal (including any
prefixes attached).

Patch

diff -r c3476399e7a6 -r 0669c0d7de92 mercurial/localrepo.py
--- a/mercurial/localrepo.py	Tue Apr 19 14:34:11 2016 +0000
+++ b/mercurial/localrepo.py	Tue Apr 19 03:09:08 2016 +0000
@@ -294,9 +294,9 @@ 
                         b' dummy changelog to prevent using the old repo layout'
                     )
             else:
-                raise error.RepoError(_("repository %s not found") % path)
+                raise error.RepoError(_("repository %s not found") % path.encode('utf-8'))
         elif create:
-            raise error.RepoError(_("repository %s already exists") % path)
+            raise error.RepoError(_("repository %s already exists") % path.encode('utf-8'))
         else:
             try:
                 self.requirements = scmutil.readrequires(
@@ -312,7 +312,7 @@ 
             s = vfs.base
             if not vfs.exists():
                 raise error.RepoError(
-                    _('.hg/sharedpath points to nonexistent directory %s') % s)
+                    _('.hg/sharedpath points to nonexistent directory %s') % s.encode('utf-8'))
             self.sharedpath = s
         except IOError as inst:
             if inst.errno != errno.ENOENT:
@@ -819,7 +819,7 @@ 
             return self.branchmap().branchtip(branch)
         except KeyError:
             if not ignoremissing:
-                raise error.RepoLookupError(_("unknown branch '%s'") % branch)
+                raise error.RepoLookupError(_("unknown branch '%s'") % branch.encode('utf-8'))
             else:
                 pass
 
@@ -1290,7 +1290,7 @@ 
             if not wait:
                 raise
             self.ui.warn(_("waiting for lock on %s held by %r\n") %
-                         (desc, inst.locker))
+                         (desc.encode('utf-8'), inst.locker))
             # default to 600 seconds timeout
             l = lockmod.lock(vfs, lockname,
                              int(self.ui.config("ui", "timeout", "600")),
@@ -1325,7 +1325,7 @@ 
             return l
 
         l = self._lock(self.svfs, "lock", wait, None,
-                       self.invalidate, _('repository %s') % self.origroot)
+                       self.invalidate, _('repository %s') % self.origroot.encode(('utf-8')))
         self._lockref = weakref.ref(l)
         return l
 
@@ -1365,7 +1365,7 @@ 
 
         l = self._lock(self.vfs, "wlock", wait, unlock,
                        self.invalidatedirstate, _('working directory of %s') %
-                       self.origroot,
+                       self.origroot.encode('utf-8'),
                        inheritchecker=self._wlockchecktransaction,
                        parentenvvar='HG_WLOCK_LOCKER')
         self._wlockref = weakref.ref(l)
@@ -1552,7 +1552,7 @@ 
                             continue
                         if not force:
                             raise error.Abort(
-                                _("commit with new subrepo %s excluded") % s)
+                                _("commit with new subrepo %s excluded") % s.encode('utf-8'))
                     dirtyreason = wctx.sub(s).dirtyreason(True)
                     if dirtyreason:
                         if not self.ui.configbool('ui', 'commitsubrepos'):
@@ -1624,7 +1624,7 @@ 
                 for s in sorted(commitsubs):
                     sub = wctx.sub(s)
                     self.ui.status(_('committing subrepository %s\n') %
-                        subrepo.subrelpath(sub))
+                        subrepo.subrelpath(sub).encode('utf-8'))
                     sr = sub.commit(cctx._text, user, date)
                     newstate[s] = (newstate[s][0], sr)
                 subrepo.writestate(self, newstate)
@@ -1639,7 +1639,7 @@ 
             except: # re-raises
                 if edited:
                     self.ui.write(
-                        _('note: commit message saved in %s\n') % msgfn)
+                        _('note: commit message saved in %s\n') % msgfn.encode('utf-8'))
                 raise
             # update bookmarks, dirstate and mergestate
             bookmarks.update(self, [p1, p2], ret)
@@ -1697,12 +1697,12 @@ 
                                                     trp, changed)
                             m.setflag(f, fctx.flags())
                     except OSError as inst:
-                        self.ui.warn(_("trouble committing %s!\n") % f)
+                        self.ui.warn(_("trouble committing %s!\n") % f.encode('utf-8'))
                         raise
                     except IOError as inst:
                         errcode = getattr(inst, 'errno', errno.ENOENT)
                         if error or errcode and errcode != errno.ENOENT:
-                            self.ui.warn(_("trouble committing %s!\n") % f)
+                            self.ui.warn(_("trouble committing %s!\n") % f.encode('utf-8'))
                         raise
 
                 # update manifest