Submitter | timeless |
---|---|
Date | May 12, 2016, 1:23 a.m. |
Message ID | <0669c0d7de92c4ef5207.1463016190@gcc2-power8.osuosl.org> |
Download | mbox | patch |
Permalink | /patch/15050/ |
State | Rejected |
Headers | show |
Comments
On Thu, 2016-05-12 at 01:23 +0000, timeless wrote: > # HG changeset patch > # User timeless <timeless@mozdev.org> > # Date 1461035348 0 > # Tue Apr 19 03:09:08 2016 +0000 > # Node ID 0669c0d7de92c4ef5207272ad17d1126338aa985 > # Parent c3476399e7a64ea562a2748d010aff57a884481f > # Available At bb://timeless/mercurial-crew > # hg pull bb://timeless/mercurial-crew -r 0669c0d7de92 > localrepo: bytes for errors > > diff -r c3476399e7a6 -r 0669c0d7de92 mercurial/localrepo.py > --- a/mercurial/localrepo.py Tue Apr 19 14:34:11 2016 +0000 > +++ b/mercurial/localrepo.py Tue Apr 19 03:09:08 2016 +0000 > @@ -294,9 +294,9 @@ > b' dummy changelog to prevent using the old repo > layout' > ) > else: > - raise error.RepoError(_("repository %s not found") % path) > + raise error.RepoError(_("repository %s not found") % > path.encode('utf-8')) This violates our encoding strategy: https://www.mercurial-scm.org/wiki/EncodingStrategy (In particular, very very little outside of encoding.py should use the letters "utf" outside of comment.) It's also very, very broken. First, path comes from the environment. And in the Unix environment, it can be basically any sequence of bytes not containing null. It can be in any encoding, multiple encodings[1], or even no encoding[2]. Unix doesn't know or care and therefore Mercurial doesn't know or care either. Which means it can't meaningfully "encode a path" to UTF-8. Nevermind that it's going to immediately print the string in a locale that might not be UTF-8 either. Second, str.encode(x) is a method that shouldn't exist because encoding is a transformation from characters to bytes and a str is already bytes. But worse than that, it's also broken, because it's effectively str.decode('ascii').encode(x). So it'll usually no-op in naive tests that will explode horribly when it first encounters the real world. It's basically a trap, and it's probably the third-worst unicode design mistake in Python[3]. Also, be aware that I'm going to have a very dim view of bulk "" -> b"" patches. There are 100x more bare strings than xrange calls, and we've already been there. [1] You should see my MP3 collection of German industrial music ripped circa 2002. [2] Back in the days of university computer systems with tiny quotas, you could often cheat by storing *binaries* in filenames or symlinks [3] The second is automatic casting between bytes and unicode. -- Mathematics is the supreme nostalgia of our time.
One possible solution doing the "" -> b"" conversion is to abuse the # coding: comment and use codecs.register to support a dialect that replaces all "" to b"" at compiling time. Existing projects doing this: 1. https://github.com/dropbox/pyxl 2. https://github.com/syrusakbary/interpy On 05/13/2016 04:48 AM, Gregory Szorc wrote: > On Wed, May 11, 2016 at 9:47 PM, Matt Mackall <mpm@selenic.com> wrote: >> Also, be aware that I'm going to have a very dim view of bulk "" -> b"" >> patches. >> There are 100x more bare strings than xrange calls, and we've already been >> there. >> > > Unfortunately, we're going to need to mass convert "" to b"" if we wish to > run the same source on Python 2/3 otherwise Python 3 is going to assume > string literals are unicode, which is the exact opposite of what we want. > In theory, we might be able to implement a custom module loader on Python 3 > that does source/ast translation when loading .py files. But this scares me > for several reasons. > > As for this series, I think mass converting "" to b"" is preferred to > converting whatever strings you happen to trip over when running whatever > tests are failing on Python 3. It's probably easiest to start with > everything as bytes and then introduce unicode/str type only where > necessary instead of going through this intermediate phase where half the > variables/types aren't compatible with Python 3.
On 05/13/2016 04:48 AM, Gregory Szorc wrote: > In theory, we might be able to implement a custom module loader on Python 3 > that does source/ast translation when loading .py files. But this scares me > for several reasons. I realized the "module loader" may include the "# coding: " hack. Could you explain the reasons? I guess the reason we don't want massive codemod is because we want to maintain blame information. If that's the concern, the blame-skip-revset idea may be another possible solution.
On Fri, May 13, 2016 at 7:18 AM, Jun Wu <quark@fb.com> wrote: > On 05/13/2016 04:48 AM, Gregory Szorc wrote: > >> In theory, we might be able to implement a custom module loader on Python >> 3 >> that does source/ast translation when loading .py files. But this scares >> me >> for several reasons. >> > > I realized the "module loader" may include the "# coding: " hack. > Could you explain the reasons? > I didn't realize the "# coding" hack is an option. That's very attractive! I suggested the module loading hack because we already have a custom module loader handling mercurial.* modules. We could likely extend it rather easily to do rewriting. But I think I like the "# coding" idea better. Feels simpler. > > I guess the reason we don't want massive codemod is because we want to > maintain blame information. If that's the concern, the blame-skip-revset > idea may be another possible solution. > I like when Mercurial development realizes the need for a new Mercurial feature :)
On Fri, 2016-05-13 at 09:30 -0700, Gregory Szorc wrote: > On Fri, May 13, 2016 at 7:18 AM, Jun Wu <quark@fb.com> wrote: > > > > > On 05/13/2016 04:48 AM, Gregory Szorc wrote: > > > > > > > > In theory, we might be able to implement a custom module loader on Python > > > 3 > > > that does source/ast translation when loading .py files. But this scares > > > me > > > for several reasons. > > > > > I realized the "module loader" may include the "# coding: " hack. > > Could you explain the reasons? > > > I didn't realize the "# coding" hack is an option. That's very attractive! > > I suggested the module loading hack because we already have a custom module > loader handling mercurial.* modules. We could likely extend it rather > easily to do rewriting. But I think I like the "# coding" idea better. > Feels simpler. Requirements for whatever solution we pick: - minimal impact on 2.x performance and maintainability - protection against rot / active enforcement by check-code - cannot require devs to do their own unicode/bytes type analysis - finite amount of churn[1] Non-requirements: - elegant - good 3.x performance (a lost battle before we hit the first line of code) Another possibility is we subvert unicode objects in Py3: - force Py3's idea of the system encoding to be Latin1 - because Latin1 has 256 valid codepoints, it's 1:1 with byte strings - thus unicode objects effectively become fat, slow byte strings - our ASCII string literals don't need any adjustment - we continue to do the "real" charset work in tolocal/fromlocal - vast bulk of code doesn't need touching [1] The level of churn we've gone through lately just for the print and import statement work is a bit too high and is impacting throughput of higher-priority development. -- Mathematics is the supreme nostalgia of our time.
On Fri, May 13, 2016 at 9:30 AM, Gregory Szorc <gregory.szorc@gmail.com> wrote: > On Fri, May 13, 2016 at 7:18 AM, Jun Wu <quark@fb.com> wrote: > >> On 05/13/2016 04:48 AM, Gregory Szorc wrote: >> >>> In theory, we might be able to implement a custom module loader on >>> Python 3 >>> that does source/ast translation when loading .py files. But this scares >>> me >>> for several reasons. >>> >> >> I realized the "module loader" may include the "# coding: " hack. >> Could you explain the reasons? >> > > I didn't realize the "# coding" hack is an option. That's very attractive! > > I suggested the module loading hack because we already have a custom > module loader handling mercurial.* modules. We could likely extend it > rather easily to do rewriting. But I think I like the "# coding" idea > better. Feels simpler. > Playing around with a custom codec, I'm not convinced this is easier than hacking up module import. When you use a custom "# coding" line, the source file's bytes get passed to the codec's decode(). To convert string literals to bytes would require us to identify string literals and rewrite the source bytes to contain the "b" prefix. At the point you're parsing string literals, you've just reinvented Python's parser. So it feels to me that the proper layer to inject the automagical rewriting would be in the parser or ast level and that would require custom module loading. Fortunately, Python 3.5 has all the module import bits implemented in Python (as opposed to C), so we /should/ have the control we need to inject ourselves into module loading at the right layer. > > >> >> I guess the reason we don't want massive codemod is because we want to >> maintain blame information. If that's the concern, the blame-skip-revset >> idea may be another possible solution. >> > > I like when Mercurial development realizes the need for a new Mercurial > feature :) >
On 15 May 2016 at 20:34, Gregory Szorc <gregory.szorc@gmail.com> wrote: > Playing around with a custom codec, I'm not convinced this is easier than > hacking up module import. > > When you use a custom "# coding" line, the source file's bytes get passed to > the codec's decode(). To convert string literals to bytes would require us > to identify string literals and rewrite the source bytes to contain the "b" > prefix. At the point you're parsing string literals, you've just reinvented > Python's parser. So it feels to me that the proper layer to inject the > automagical rewriting would be in the parser or ast level and that would > require custom module loading. > > Fortunately, Python 3.5 has all the module import bits implemented in Python > (as opposed to C), so we /should/ have the control we need to inject > ourselves into module loading at the right layer. At least the codec route has the advantage that it only has to run once per source file revision; the result is cached as bytecode. You don't have to re-invent the parser; you could use the tokenize module (https://docs.python.org/3/library/tokenize.html) would do all the heavily lifting. Just look out for token.STRING tokens and process the token string; it'll be the full string literal (including any prefixes attached).
Patch
diff -r c3476399e7a6 -r 0669c0d7de92 mercurial/localrepo.py --- a/mercurial/localrepo.py Tue Apr 19 14:34:11 2016 +0000 +++ b/mercurial/localrepo.py Tue Apr 19 03:09:08 2016 +0000 @@ -294,9 +294,9 @@ b' dummy changelog to prevent using the old repo layout' ) else: - raise error.RepoError(_("repository %s not found") % path) + raise error.RepoError(_("repository %s not found") % path.encode('utf-8')) elif create: - raise error.RepoError(_("repository %s already exists") % path) + raise error.RepoError(_("repository %s already exists") % path.encode('utf-8')) else: try: self.requirements = scmutil.readrequires( @@ -312,7 +312,7 @@ s = vfs.base if not vfs.exists(): raise error.RepoError( - _('.hg/sharedpath points to nonexistent directory %s') % s) + _('.hg/sharedpath points to nonexistent directory %s') % s.encode('utf-8')) self.sharedpath = s except IOError as inst: if inst.errno != errno.ENOENT: @@ -819,7 +819,7 @@ return self.branchmap().branchtip(branch) except KeyError: if not ignoremissing: - raise error.RepoLookupError(_("unknown branch '%s'") % branch) + raise error.RepoLookupError(_("unknown branch '%s'") % branch.encode('utf-8')) else: pass @@ -1290,7 +1290,7 @@ if not wait: raise self.ui.warn(_("waiting for lock on %s held by %r\n") % - (desc, inst.locker)) + (desc.encode('utf-8'), inst.locker)) # default to 600 seconds timeout l = lockmod.lock(vfs, lockname, int(self.ui.config("ui", "timeout", "600")), @@ -1325,7 +1325,7 @@ return l l = self._lock(self.svfs, "lock", wait, None, - self.invalidate, _('repository %s') % self.origroot) + self.invalidate, _('repository %s') % self.origroot.encode(('utf-8'))) self._lockref = weakref.ref(l) return l @@ -1365,7 +1365,7 @@ l = self._lock(self.vfs, "wlock", wait, unlock, self.invalidatedirstate, _('working directory of %s') % - self.origroot, + self.origroot.encode('utf-8'), inheritchecker=self._wlockchecktransaction, parentenvvar='HG_WLOCK_LOCKER') self._wlockref = weakref.ref(l) @@ -1552,7 +1552,7 @@ continue if not force: raise error.Abort( - _("commit with new subrepo %s excluded") % s) + _("commit with new subrepo %s excluded") % s.encode('utf-8')) dirtyreason = wctx.sub(s).dirtyreason(True) if dirtyreason: if not self.ui.configbool('ui', 'commitsubrepos'): @@ -1624,7 +1624,7 @@ for s in sorted(commitsubs): sub = wctx.sub(s) self.ui.status(_('committing subrepository %s\n') % - subrepo.subrelpath(sub)) + subrepo.subrelpath(sub).encode('utf-8')) sr = sub.commit(cctx._text, user, date) newstate[s] = (newstate[s][0], sr) subrepo.writestate(self, newstate) @@ -1639,7 +1639,7 @@ except: # re-raises if edited: self.ui.write( - _('note: commit message saved in %s\n') % msgfn) + _('note: commit message saved in %s\n') % msgfn.encode('utf-8')) raise # update bookmarks, dirstate and mergestate bookmarks.update(self, [p1, p2], ret) @@ -1697,12 +1697,12 @@ trp, changed) m.setflag(f, fctx.flags()) except OSError as inst: - self.ui.warn(_("trouble committing %s!\n") % f) + self.ui.warn(_("trouble committing %s!\n") % f.encode('utf-8')) raise except IOError as inst: errcode = getattr(inst, 'errno', errno.ENOENT) if error or errcode and errcode != errno.ENOENT: - self.ui.warn(_("trouble committing %s!\n") % f) + self.ui.warn(_("trouble committing %s!\n") % f.encode('utf-8')) raise # update manifest