Patchwork [2,of,3] templater: replace jsonescape in main json templater (issue4926)

login
register
mail settings
Submitter Matt Mackall
Date Jan. 12, 2016, 5:01 p.m.
Message ID <35d049d7e5a2dec87318.1452618066@ruin.waste.org>
Download mbox | patch
Permalink /patch/12695/
State Changes Requested
Delegated to: Yuya Nishihara
Headers show

Comments

Matt Mackall - Jan. 12, 2016, 5:01 p.m.
# HG changeset patch
# User Matt Mackall <mpm@selenic.com>
# Date 1452542432 21600
#      Mon Jan 11 14:00:32 2016 -0600
# Node ID 35d049d7e5a2dec87318ce8042844f56e107cf83
# Parent  544d391bd3b42b96975a3521b73c25223db930b0
templater: replace jsonescape in main json templater (issue4926)

This version differs in a couple ways:

- it skips optional escaping of codepoints > U+007f
- it thus handles emoji correctly (JSON requires UTF-16 surrogates)
- but it may run afoul of silly Unicode linebreaks if exec'd in js
- it uses UTF-8b to round-trip undecodeable bytes
Yuya Nishihara - Jan. 13, 2016, 1:01 p.m.
On Tue, 12 Jan 2016 11:01:06 -0600, Matt Mackall wrote:
> # HG changeset patch
> # User Matt Mackall <mpm@selenic.com>
> # Date 1452542432 21600
> #      Mon Jan 11 14:00:32 2016 -0600
> # Node ID 35d049d7e5a2dec87318ce8042844f56e107cf83
> # Parent  544d391bd3b42b96975a3521b73c25223db930b0
> templater: replace jsonescape in main json templater (issue4926)
> 
> This version differs in a couple ways:
> 
> - it skips optional escaping of codepoints > U+007f
> - it thus handles emoji correctly (JSON requires UTF-16 surrogates)
> - but it may run afoul of silly Unicode linebreaks if exec'd in js
> - it uses UTF-8b to round-trip undecodeable bytes

We can't do that because JSON output can be embedded in non-UTF-8 HTML,
where only 7bit ASCII is allowed, and JSON input (i.e. template string)
is a local-encoding text in general.

I have patch series to fix the issue4926, but I found my patch seems to have
the emoji issue right now.
Matt Mackall - Jan. 13, 2016, 4:51 p.m.
On Wed, 2016-01-13 at 22:01 +0900, Yuya Nishihara wrote:
> On Tue, 12 Jan 2016 11:01:06 -0600, Matt Mackall wrote:
> > # HG changeset patch
> > # User Matt Mackall <mpm@selenic.com>
> > # Date 1452542432 21600
> > #      Mon Jan 11 14:00:32 2016 -0600
> > # Node ID 35d049d7e5a2dec87318ce8042844f56e107cf83
> > # Parent  544d391bd3b42b96975a3521b73c25223db930b0
> > templater: replace jsonescape in main json templater (issue4926)
> > 
> > This version differs in a couple ways:
> > 
> > - it skips optional escaping of codepoints > U+007f
> > - it thus handles emoji correctly (JSON requires UTF-16 surrogates)
> > - but it may run afoul of silly Unicode linebreaks if exec'd in js
> > - it uses UTF-8b to round-trip undecodeable bytes
> 
> We can't do that because JSON output can be embedded in non-UTF-8 HTML,
> where only 7bit ASCII is allowed,

Example scenarios, please. 

There's no configuration of hgweb that won't potentially display non-ASCII if it
exists in files. If you commit Unicode "á" to a file and fire up
"HGENCODING=ascii hg serve", you'll get mojibake in the browser by design (and
the correct bytes verbatim if you select raw mode). So I'm not sure what you
mean by "allowed". I guess we could get into trouble if we expand JSON directly
into some in-page Javascript when the page metadata marks it as non-UTF8. 

>  and JSON input (i.e. template string)
> is a local-encoding text in general.

encoding.jsonescape (indirectly) knows about localstr objects, and thus recovers
the original UTF-8 text to encode if it exists.

> I have patch series to fix the issue4926, but I found my patch seems to have
> the emoji issue right now.

Whatever we do, we need to kill the second implementation of jsonescape in the
templater.
Yuya Nishihara - Jan. 14, 2016, 1:12 p.m.
On Wed, 13 Jan 2016 10:51:06 -0600, Matt Mackall wrote:
> On Wed, 2016-01-13 at 22:01 +0900, Yuya Nishihara wrote:
> > On Tue, 12 Jan 2016 11:01:06 -0600, Matt Mackall wrote:
> > > # HG changeset patch
> > > # User Matt Mackall <mpm@selenic.com>
> > > # Date 1452542432 21600
> > > #      Mon Jan 11 14:00:32 2016 -0600
> > > # Node ID 35d049d7e5a2dec87318ce8042844f56e107cf83
> > > # Parent  544d391bd3b42b96975a3521b73c25223db930b0
> > > templater: replace jsonescape in main json templater (issue4926)
> > > 
> > > This version differs in a couple ways:
> > > 
> > > - it skips optional escaping of codepoints > U+007f
> > > - it thus handles emoji correctly (JSON requires UTF-16 surrogates)
> > > - but it may run afoul of silly Unicode linebreaks if exec'd in js
> > > - it uses UTF-8b to round-trip undecodeable bytes
> > 
> > We can't do that because JSON output can be embedded in non-UTF-8 HTML,
> > where only 7bit ASCII is allowed,
> 
> Example scenarios, please.

HGENCODING=utf-8
export HGENCODING

hg init a
cd a
touch foo
hg ci -Am "$(python -c 'print u"\xc0".encode("utf-8")')"
hg serve --encoding iso-8859-1

Then, access to http://localhost:8000/graph/tip .
(In our real-word example, --encoding Shift_JIS and Japanese characters.)

Before this patch, there was no mojibake because "À" is escaped to "\u00c0".
With this patch, "À" is lost as follows:

  u"À" -> "\xc0" (iso-8859-1) -> "\xed\xb3\x80" (utf8b)
  -> "\xed\xb3\x80" (iso-8859-1)

> There's no configuration of hgweb that won't potentially display non-ASCII if it
> exists in files. If you commit Unicode "á" to a file and fire up
> "HGENCODING=ascii hg serve", you'll get mojibake in the browser by design (and
> the correct bytes verbatim if you select raw mode). So I'm not sure what you
> mean by "allowed". I guess we could get into trouble if we expand JSON directly
> into some in-page Javascript when the page metadata marks it as non-UTF8.

JSON data can be embedded in non-UTF8 page so long as it is represented in ASCII
and the page encoding is compatible with ASCII.

> >  and JSON input (i.e. template string)
> > is a local-encoding text in general.
> 
> encoding.jsonescape (indirectly) knows about localstr objects, and thus recovers
> the original UTF-8 text to encode if it exists.

Yes, but localstr is mostly lost in templater, and toutf8b() takes it as bytes,
not as local-encoding text.

> > I have patch series to fix the issue4926, but I found my patch seems to have
> > the emoji issue right now.
> 
> Whatever we do, we need to kill the second implementation of jsonescape in the
> templater.

Sure. My series will do:

 1. add option to escape all non-ASCII characters by encoding.jsonescape()
 2. add "|utf8" template filter to explicitly convert localstr|str to utf-8
 3. change "|json" to take input as utf8b bytes (BC)
Matt Mackall - Jan. 14, 2016, 6:49 p.m.
On Thu, 2016-01-14 at 22:12 +0900, Yuya Nishihara wrote:
> On Wed, 13 Jan 2016 10:51:06 -0600, Matt Mackall wrote:
> > On Wed, 2016-01-13 at 22:01 +0900, Yuya Nishihara wrote:
> > > On Tue, 12 Jan 2016 11:01:06 -0600, Matt Mackall wrote:
> > > > # HG changeset patch
> > > > # User Matt Mackall <mpm@selenic.com>
> > > > # Date 1452542432 21600
> > > > #      Mon Jan 11 14:00:32 2016 -0600
> > > > # Node ID 35d049d7e5a2dec87318ce8042844f56e107cf83
> > > > # Parent  544d391bd3b42b96975a3521b73c25223db930b0
> > > > templater: replace jsonescape in main json templater (issue4926)
> > > > 
> > > > This version differs in a couple ways:
> > > > 
> > > > - it skips optional escaping of codepoints > U+007f
> > > > - it thus handles emoji correctly (JSON requires UTF-16 surrogates)
> > > > - but it may run afoul of silly Unicode linebreaks if exec'd in js
> > > > - it uses UTF-8b to round-trip undecodeable bytes
> > > 
> > > We can't do that because JSON output can be embedded in non-UTF-8 HTML,
> > > where only 7bit ASCII is allowed,
> > 
> > Example scenarios, please.
> 
> HGENCODING=utf-8
> export HGENCODING
> 
> hg init a
> cd a
> touch foo
> hg ci -Am "$(python -c 'print u"\xc0".encode("utf-8")')"
> hg serve --encoding iso-8859-1
> 
> Then, access to http://localhost:8000/graph/tip .
> (In our real-word example, --encoding Shift_JIS and Japanese characters.)
> 
> Before this patch, there was no mojibake because "À" is escaped to "\u00c0".
> With this patch, "À" is lost as follows:
> 
>   u"À" -> "\xc0" (iso-8859-1) -> "\xed\xb3\x80" (utf8b)
>   -> "\xed\xb3\x80" (iso-8859-1)
> 
> > There's no configuration of hgweb that won't potentially display non-ASCII
> > if it
> > exists in files. If you commit Unicode "á" to a file and fire up
> > "HGENCODING=ascii hg serve", you'll get mojibake in the browser by design
> > (and
> > the correct bytes verbatim if you select raw mode). So I'm not sure what you
> > mean by "allowed". I guess we could get into trouble if we expand JSON
> > directly
> > into some in-page Javascript when the page metadata marks it as non-UTF8.
> 
> JSON data can be embedded in non-UTF8 page so long as it is represented in
> ASCII
> and the page encoding is compatible with ASCII.

Ok.

> > >  and JSON input (i.e. template string)
> > > is a local-encoding text in general.
> > 
> > encoding.jsonescape (indirectly) knows about localstr objects, and thus
> > recovers
> > the original UTF-8 text to encode if it exists.
> 
> Yes, but localstr is mostly lost in templater,

Oh? Please elaborate.

>  and toutf8b() takes it as bytes,
> not as local-encoding text.

You may be right, but according to the docstring, that means you've found a bug,

> > > I have patch series to fix the issue4926, but I found my patch seems to
> > > have
> > > the emoji issue right now.
> > 
> > Whatever we do, we need to kill the second implementation of jsonescape in
> > the
> > templater.
> 
> Sure. My series will do:
> 
>  1. add option to escape all non-ASCII characters by encoding.jsonescape()

Ok.

>  2. add "|utf8" template filter to explicitly convert localstr|str to utf-8

Sounds suspicious. What's it going to do with filenames?

>  3. change "|json" to take input as utf8b bytes (BC)

Sounds like a really big break from our encoding philosophy.
Yuya Nishihara - Jan. 15, 2016, 2:18 p.m.
On Thu, 14 Jan 2016 12:49:27 -0600, Matt Mackall wrote:
> On Thu, 2016-01-14 at 22:12 +0900, Yuya Nishihara wrote:
> > > >  and JSON input (i.e. template string)
> > > > is a local-encoding text in general.
> > > 
> > > encoding.jsonescape (indirectly) knows about localstr objects, and thus
> > > recovers
> > > the original UTF-8 text to encode if it exists.
> > 
> > Yes, but localstr is mostly lost in templater,
> 
> Oh? Please elaborate.

I meant a localstr object would soon be converted to a plain str, loosing
original _utf8 bytes. The templater does bunch of string manipulation.

> >  and toutf8b() takes it as bytes,
> > not as local-encoding text.
> 
> You may be right, but according to the docstring, that means you've found
> a bug,

I thought it was when utf8b helper was introduced, but I don't think it is
a bug now. toutf8b() must keep the original bytes unmodified, so it can't
take input as local-encoding text. The only thing it can do is to escape
non-UTF8 part to U+DCxx.

> >  1. add option to escape all non-ASCII characters by encoding.jsonescape()
> 
> Ok.
> 
> >  2. add "|utf8" template filter to explicitly convert localstr|str to utf-8
> 
> Sounds suspicious. What's it going to do with filenames?

Filenames are the reason I'm going to introduce ugly |utf8 filter.

Because filenames are bytes in Mercurial world, {filename|json} should preserve
the original byte sequence, which means

  {x|json} -> '"' toutf8b(x) '"'

On the other hand, most template strings are in local encoding. Because |json
filter have to be byte-transparent to filenames, we need something to annotate
an input as a local string.

  {x|utf8|json} -> '"' toutf8b(fromlocal(x)) '"'

> >  3. change "|json" to take input as utf8b bytes (BC)
> 
> Sounds like a really big break from our encoding philosophy.

Yep, but |json was special in that it converts an input string to unicode.

Good news is "json" and "jsonescape" are undocumented, so they are considered
as internal filters.

Patch

diff -r 544d391bd3b4 -r 35d049d7e5a2 mercurial/templatefilters.py
--- a/mercurial/templatefilters.py	Mon Jan 11 13:43:43 2016 -0600
+++ b/mercurial/templatefilters.py	Mon Jan 11 14:00:32 2016 -0600
@@ -197,15 +197,8 @@ 
         return {None: 'null', False: 'false', True: 'true'}[obj]
     elif isinstance(obj, int) or isinstance(obj, float):
         return str(obj)
-    elif isinstance(obj, encoding.localstr):
-        u = encoding.fromlocal(obj).decode('utf-8')  # can round-trip
-        return '"%s"' % jsonescape(u)
     elif isinstance(obj, str):
-        # no encoding.fromlocal() because it may abort if obj can't be decoded
-        u = unicode(obj, encoding.encoding, 'replace')
-        return '"%s"' % jsonescape(u)
-    elif isinstance(obj, unicode):
-        return '"%s"' % jsonescape(obj)
+        return '"%s"' % encoding.jsonescape(obj)
     elif util.safehasattr(obj, 'keys'):
         out = []
         for k, v in sorted(obj.iteritems()):
diff -r 544d391bd3b4 -r 35d049d7e5a2 tests/test-command-template.t
--- a/tests/test-command-template.t	Mon Jan 11 13:43:43 2016 -0600
+++ b/tests/test-command-template.t	Mon Jan 11 14:00:32 2016 -0600
@@ -3493,12 +3493,12 @@ 
 json filter should try round-trip conversion to utf-8:
 
   $ HGENCODING=ascii hg log -T "{branch|json}\n" -r0
-  "\u00e9"
+  "\xc3\xa9" (esc)
 
 json filter should not abort if it can't decode bytes:
 (not sure the current behavior is right; we might want to use utf-8b encoding?)
 
   $ HGENCODING=ascii hg log -T "{'`cat utf-8`'|json}\n" -l1
-  "\ufffd\ufffd"
+  "\xc3\xa9" (esc)
 
   $ cd ..
diff -r 544d391bd3b4 -r 35d049d7e5a2 tests/test-hgweb-commands.t
--- a/tests/test-hgweb-commands.t	Mon Jan 11 13:43:43 2016 -0600
+++ b/tests/test-hgweb-commands.t	Mon Jan 11 14:00:32 2016 -0600
@@ -2099,7 +2099,7 @@ 
   >>> for line in open("out"):
   ...     if line.startswith("var data ="):
   ...         print line,
-  var data = [["061dd13ba3c3", [0, 1], [[0, 0, 1, -1, ""]], "\u80fd", "test", "1970-01-01", ["unstable", true], ["tip"], ["something"]], ["cad8025a2e87", [0, 1], [[0, 0, 1, 3, "FF0000"]], "branch commit with null character: \u0000", "test", "1970-01-01", ["unstable", false], [], []], ["1d22e65f027e", [0, 1], [[0, 0, 1, 3, ""]], "branch", "test", "1970-01-01", ["stable", true], [], []], ["a4f92ed23982", [0, 1], [[0, 0, 1, 3, ""]], "Added tag 1.0 for changeset 2ef0ac749a14", "test", "1970-01-01", ["default", true], [], []], ["2ef0ac749a14", [0, 1], [], "base", "test", "1970-01-01", ["default", false], ["1.0"], ["anotherthing"]]];
+  var data = [["061dd13ba3c3", [0, 1], [[0, 0, 1, -1, ""]], "\xed\xb2\x94\\\\", "test", "1970-01-01", ["unstable", true], ["tip"], ["something"]], ["cad8025a2e87", [0, 1], [[0, 0, 1, 3, "FF0000"]], "branch commit with null character: \\u0000", "test", "1970-01-01", ["unstable", false], [], []], ["1d22e65f027e", [0, 1], [[0, 0, 1, 3, ""]], "branch", "test", "1970-01-01", ["stable", true], [], []], ["a4f92ed23982", [0, 1], [[0, 0, 1, 3, ""]], "Added tag 1.0 for changeset 2ef0ac749a14", "test", "1970-01-01", ["default", true], [], []], ["2ef0ac749a14", [0, 1], [], "base", "test", "1970-01-01", ["default", false], ["1.0"], ["anotherthing"]]]; (esc)
 
 capabilities