Patchwork minirst: make substitutions be unicode, as they are used on unicode data

login
register
mail settings
Submitter Augie Fackler
Date June 7, 2014, 7:52 p.m.
Message ID <efe7c79ca21edbdc2aeb.1402170753@augie-macbookair.laroche.local>
Download mbox | patch
Permalink /patch/4950/
State Changes Requested
Headers show

Comments

Augie Fackler - June 7, 2014, 7:52 p.m.
# HG changeset patch
# User Augie Fackler <raf@durin42.com>
# Date 1402170658 14400
#      Sat Jun 07 15:50:58 2014 -0400
# Node ID efe7c79ca21edbdc2aeb83e60335c6d89294f3db
# Parent  7afe70a5d2ad5b22c21ba9be849451407c1f337f
minirst: make substitutions be unicode, as they are used on unicode data

Caught this while working to try and get 'hg help' working in Python 3.
Pierre-Yves David - June 12, 2014, 3:42 a.m.
On 06/11/2014 08:38 PM, Kevin Bullock wrote:
> On Jun 7, 2014, at 2:52 PM, Augie Fackler <raf@durin42.com> wrote:
>
>> # HG changeset patch
>> # User Augie Fackler <raf@durin42.com>
>> # Date 1402170658 14400
>> #      Sat Jun 07 15:50:58 2014 -0400
>> # Node ID efe7c79ca21edbdc2aeb83e60335c6d89294f3db
>> # Parent  7afe70a5d2ad5b22c21ba9be849451407c1f337f
>> minirst: make substitutions be unicode, as they are used on unicode data
>
> Looks sensible, but I'm not at all familiar with how minirst works.

CCing Martin
Matt Mackall - June 12, 2014, 7:35 p.m.
On Sat, 2014-06-07 at 15:52 -0400, Augie Fackler wrote:
> # HG changeset patch
> # User Augie Fackler <raf@durin42.com>
> # Date 1402170658 14400
> #      Sat Jun 07 15:50:58 2014 -0400
> # Node ID efe7c79ca21edbdc2aeb83e60335c6d89294f3db
> # Parent  7afe70a5d2ad5b22c21ba9be849451407c1f337f
> minirst: make substitutions be unicode, as they are used on unicode data
> 
> Caught this while working to try and get 'hg help' working in Python 3.

Huh. Something changed here then. Every once in a while I add this to
hg:

reload(sys)
sys.setdefaultencoding("undefined")

..and clean up the resulting breakage[1]. At one point, I fixed this
precise thing.. in the other direction:

http://www.selenic.com/hg/rev/0ad0ebe67815

Looks like foozy switched it back, and reintroduced the implicit unicode
coercion:

http://www.selenic.com/hg/rev/87bb6b7644f6

As a rule, we should stick to the rule "everything is a bytestring in
the local encoding" in the global scope and decode/encode things as
needed. Otherwise, we're always going to have confusion about the type
of arguments. So I'd rather do this in this case:

    utext = text.decode(encoding.encoding)
    for f, t in substs:
        utext = utext.replace(f.decode("ascii"), t.decode("ascii"))
    return utext.encode(encoding.encoding)

Also, we might consider encapsulating this encoding-aware-replace in
encoding.py, which will let us be smarter about this in the future: when
the encoding isn't the "bad" multibyte encoding where ASCII isn't a
proper subset (ie Shift-JIS), we can do a normal replace without
transcoding.

[1] Automatic promotion to Unicode is the biggest flaw in Py2
Augie Fackler - June 13, 2014, 11:50 a.m.
On Jun 12, 2014, at 3:35 PM, Matt Mackall <mpm@selenic.com> wrote:

> On Sat, 2014-06-07 at 15:52 -0400, Augie Fackler wrote:
>> # HG changeset patch
>> # User Augie Fackler <raf@durin42.com>
>> # Date 1402170658 14400
>> #      Sat Jun 07 15:50:58 2014 -0400
>> # Node ID efe7c79ca21edbdc2aeb83e60335c6d89294f3db
>> # Parent  7afe70a5d2ad5b22c21ba9be849451407c1f337f
>> minirst: make substitutions be unicode, as they are used on unicode data
>> 
>> Caught this while working to try and get 'hg help' working in Python 3.
> 
> Huh. Something changed here then. Every once in a while I add this to
> hg:
> 
> reload(sys)
> sys.setdefaultencoding("undefined")
> 
> ..and clean up the resulting breakage[1]. At one point, I fixed this
> precise thing.. in the other direction:

Could we add support for an HGUNICODEPEDANTRY environment variable that would do this before importing most of hg? Then we could (optionally) run a buildbot with that on. I'd be happy to write the patch if you're OK with that as an approach.

> 
> http://www.selenic.com/hg/rev/0ad0ebe67815
> 
> Looks like foozy switched it back, and reintroduced the implicit unicode
> coercion:
> 
> http://www.selenic.com/hg/rev/87bb6b7644f6
> 
> As a rule, we should stick to the rule "everything is a bytestring in
> the local encoding" in the global scope and decode/encode things as
> needed. Otherwise, we're always going to have confusion about the type
> of arguments. So I'd rather do this in this case:
> 
>    utext = text.decode(encoding.encoding)
>    for f, t in substs:
>        utext = utext.replace(f.decode("ascii"), t.decode("ascii"))
>    return utext.encode(encoding.encoding)

Ah, okay. I wasn't sure if this was intentional, since it felt like minirst might be doing something strange.

> Also, we might consider encapsulating this encoding-aware-replace in
> encoding.py, which will let us be smarter about this in the future: when
> the encoding isn't the "bad" multibyte encoding where ASCII isn't a
> proper subset (ie Shift-JIS), we can do a normal replace without
> transcoding.
> 
> [1] Automatic promotion to Unicode is the biggest flaw in Py2
> 
> -- 
> Mathematics is the supreme nostalgia of our time.
> 
>
Matt Mackall - June 13, 2014, 7:12 p.m.
On Fri, 2014-06-13 at 07:50 -0400, Augie Fackler wrote:
> On Jun 12, 2014, at 3:35 PM, Matt Mackall <mpm@selenic.com> wrote:
> 
> > On Sat, 2014-06-07 at 15:52 -0400, Augie Fackler wrote:
> >> # HG changeset patch
> >> # User Augie Fackler <raf@durin42.com>
> >> # Date 1402170658 14400
> >> #      Sat Jun 07 15:50:58 2014 -0400
> >> # Node ID efe7c79ca21edbdc2aeb83e60335c6d89294f3db
> >> # Parent  7afe70a5d2ad5b22c21ba9be849451407c1f337f
> >> minirst: make substitutions be unicode, as they are used on unicode data
> >> 
> >> Caught this while working to try and get 'hg help' working in Python 3.
> > 
> > Huh. Something changed here then. Every once in a while I add this to
> > hg:
> > 
> > reload(sys)
> > sys.setdefaultencoding("undefined")
> > 
> > ..and clean up the resulting breakage[1]. At one point, I fixed this
> > precise thing.. in the other direction:
> 
> Could we add support for an HGUNICODEPEDANTRY environment variable that would do this before importing most of hg? Then we could (optionally) run a buildbot with that on. I'd be happy to write the patch if you're OK with that as an approach.

Sure.

> >    utext = text.decode(encoding.encoding)
> >    for f, t in substs:
> >        utext = utext.replace(f.decode("ascii"), t.decode("ascii"))
> >    return utext.encode(encoding.encoding)
> 
> Ah, okay. I wasn't sure if this was intentional, since it felt like minirst might be doing something strange.

Since I have this fix sitting in my working directory, I'll go ahead and
push it.

Patch

diff --git a/mercurial/minirst.py b/mercurial/minirst.py
--- a/mercurial/minirst.py
+++ b/mercurial/minirst.py
@@ -353,14 +353,14 @@ 
     return blocks
 
 def inlineliterals(blocks):
-    substs = [('``', '"')]
+    substs = [(u'``', u'"')]
     for b in blocks:
         if b['type'] in ('paragraph', 'section'):
             b['lines'] = [replace(l, substs) for l in b['lines']]
     return blocks
 
 def hgrole(blocks):
-    substs = [(':hg:`', '"hg '), ('`', '"')]
+    substs = [(u':hg:`', u'"hg '), (u'`', u'"')]
     for b in blocks:
         if b['type'] in ('paragraph', 'section'):
             # Turn :hg:`command` into "hg command". This also works