Submitter | Matt Mackall |
---|---|
Date | Jan. 12, 2016, 5:01 p.m. |
Message ID | <35d049d7e5a2dec87318.1452618066@ruin.waste.org> |
Download | mbox | patch |
Permalink | /patch/12695/ |
State | Changes Requested |
Delegated to: | Yuya Nishihara |
Headers | show |
Comments
On Tue, 12 Jan 2016 11:01:06 -0600, Matt Mackall wrote: > # HG changeset patch > # User Matt Mackall <mpm@selenic.com> > # Date 1452542432 21600 > # Mon Jan 11 14:00:32 2016 -0600 > # Node ID 35d049d7e5a2dec87318ce8042844f56e107cf83 > # Parent 544d391bd3b42b96975a3521b73c25223db930b0 > templater: replace jsonescape in main json templater (issue4926) > > This version differs in a couple ways: > > - it skips optional escaping of codepoints > U+007f > - it thus handles emoji correctly (JSON requires UTF-16 surrogates) > - but it may run afoul of silly Unicode linebreaks if exec'd in js > - it uses UTF-8b to round-trip undecodeable bytes We can't do that because JSON output can be embedded in non-UTF-8 HTML, where only 7bit ASCII is allowed, and JSON input (i.e. template string) is a local-encoding text in general. I have patch series to fix the issue4926, but I found my patch seems to have the emoji issue right now.
On Wed, 2016-01-13 at 22:01 +0900, Yuya Nishihara wrote: > On Tue, 12 Jan 2016 11:01:06 -0600, Matt Mackall wrote: > > # HG changeset patch > > # User Matt Mackall <mpm@selenic.com> > > # Date 1452542432 21600 > > # Mon Jan 11 14:00:32 2016 -0600 > > # Node ID 35d049d7e5a2dec87318ce8042844f56e107cf83 > > # Parent 544d391bd3b42b96975a3521b73c25223db930b0 > > templater: replace jsonescape in main json templater (issue4926) > > > > This version differs in a couple ways: > > > > - it skips optional escaping of codepoints > U+007f > > - it thus handles emoji correctly (JSON requires UTF-16 surrogates) > > - but it may run afoul of silly Unicode linebreaks if exec'd in js > > - it uses UTF-8b to round-trip undecodeable bytes > > We can't do that because JSON output can be embedded in non-UTF-8 HTML, > where only 7bit ASCII is allowed, Example scenarios, please. There's no configuration of hgweb that won't potentially display non-ASCII if it exists in files. If you commit Unicode "á" to a file and fire up "HGENCODING=ascii hg serve", you'll get mojibake in the browser by design (and the correct bytes verbatim if you select raw mode). So I'm not sure what you mean by "allowed". I guess we could get into trouble if we expand JSON directly into some in-page Javascript when the page metadata marks it as non-UTF8. > and JSON input (i.e. template string) > is a local-encoding text in general. encoding.jsonescape (indirectly) knows about localstr objects, and thus recovers the original UTF-8 text to encode if it exists. > I have patch series to fix the issue4926, but I found my patch seems to have > the emoji issue right now. Whatever we do, we need to kill the second implementation of jsonescape in the templater.
On Wed, 13 Jan 2016 10:51:06 -0600, Matt Mackall wrote: > On Wed, 2016-01-13 at 22:01 +0900, Yuya Nishihara wrote: > > On Tue, 12 Jan 2016 11:01:06 -0600, Matt Mackall wrote: > > > # HG changeset patch > > > # User Matt Mackall <mpm@selenic.com> > > > # Date 1452542432 21600 > > > # Mon Jan 11 14:00:32 2016 -0600 > > > # Node ID 35d049d7e5a2dec87318ce8042844f56e107cf83 > > > # Parent 544d391bd3b42b96975a3521b73c25223db930b0 > > > templater: replace jsonescape in main json templater (issue4926) > > > > > > This version differs in a couple ways: > > > > > > - it skips optional escaping of codepoints > U+007f > > > - it thus handles emoji correctly (JSON requires UTF-16 surrogates) > > > - but it may run afoul of silly Unicode linebreaks if exec'd in js > > > - it uses UTF-8b to round-trip undecodeable bytes > > > > We can't do that because JSON output can be embedded in non-UTF-8 HTML, > > where only 7bit ASCII is allowed, > > Example scenarios, please. HGENCODING=utf-8 export HGENCODING hg init a cd a touch foo hg ci -Am "$(python -c 'print u"\xc0".encode("utf-8")')" hg serve --encoding iso-8859-1 Then, access to http://localhost:8000/graph/tip . (In our real-word example, --encoding Shift_JIS and Japanese characters.) Before this patch, there was no mojibake because "À" is escaped to "\u00c0". With this patch, "À" is lost as follows: u"À" -> "\xc0" (iso-8859-1) -> "\xed\xb3\x80" (utf8b) -> "\xed\xb3\x80" (iso-8859-1) > There's no configuration of hgweb that won't potentially display non-ASCII if it > exists in files. If you commit Unicode "á" to a file and fire up > "HGENCODING=ascii hg serve", you'll get mojibake in the browser by design (and > the correct bytes verbatim if you select raw mode). So I'm not sure what you > mean by "allowed". I guess we could get into trouble if we expand JSON directly > into some in-page Javascript when the page metadata marks it as non-UTF8. JSON data can be embedded in non-UTF8 page so long as it is represented in ASCII and the page encoding is compatible with ASCII. > > and JSON input (i.e. template string) > > is a local-encoding text in general. > > encoding.jsonescape (indirectly) knows about localstr objects, and thus recovers > the original UTF-8 text to encode if it exists. Yes, but localstr is mostly lost in templater, and toutf8b() takes it as bytes, not as local-encoding text. > > I have patch series to fix the issue4926, but I found my patch seems to have > > the emoji issue right now. > > Whatever we do, we need to kill the second implementation of jsonescape in the > templater. Sure. My series will do: 1. add option to escape all non-ASCII characters by encoding.jsonescape() 2. add "|utf8" template filter to explicitly convert localstr|str to utf-8 3. change "|json" to take input as utf8b bytes (BC)
On Thu, 2016-01-14 at 22:12 +0900, Yuya Nishihara wrote: > On Wed, 13 Jan 2016 10:51:06 -0600, Matt Mackall wrote: > > On Wed, 2016-01-13 at 22:01 +0900, Yuya Nishihara wrote: > > > On Tue, 12 Jan 2016 11:01:06 -0600, Matt Mackall wrote: > > > > # HG changeset patch > > > > # User Matt Mackall <mpm@selenic.com> > > > > # Date 1452542432 21600 > > > > # Mon Jan 11 14:00:32 2016 -0600 > > > > # Node ID 35d049d7e5a2dec87318ce8042844f56e107cf83 > > > > # Parent 544d391bd3b42b96975a3521b73c25223db930b0 > > > > templater: replace jsonescape in main json templater (issue4926) > > > > > > > > This version differs in a couple ways: > > > > > > > > - it skips optional escaping of codepoints > U+007f > > > > - it thus handles emoji correctly (JSON requires UTF-16 surrogates) > > > > - but it may run afoul of silly Unicode linebreaks if exec'd in js > > > > - it uses UTF-8b to round-trip undecodeable bytes > > > > > > We can't do that because JSON output can be embedded in non-UTF-8 HTML, > > > where only 7bit ASCII is allowed, > > > > Example scenarios, please. > > HGENCODING=utf-8 > export HGENCODING > > hg init a > cd a > touch foo > hg ci -Am "$(python -c 'print u"\xc0".encode("utf-8")')" > hg serve --encoding iso-8859-1 > > Then, access to http://localhost:8000/graph/tip . > (In our real-word example, --encoding Shift_JIS and Japanese characters.) > > Before this patch, there was no mojibake because "À" is escaped to "\u00c0". > With this patch, "À" is lost as follows: > > u"À" -> "\xc0" (iso-8859-1) -> "\xed\xb3\x80" (utf8b) > -> "\xed\xb3\x80" (iso-8859-1) > > > There's no configuration of hgweb that won't potentially display non-ASCII > > if it > > exists in files. If you commit Unicode "á" to a file and fire up > > "HGENCODING=ascii hg serve", you'll get mojibake in the browser by design > > (and > > the correct bytes verbatim if you select raw mode). So I'm not sure what you > > mean by "allowed". I guess we could get into trouble if we expand JSON > > directly > > into some in-page Javascript when the page metadata marks it as non-UTF8. > > JSON data can be embedded in non-UTF8 page so long as it is represented in > ASCII > and the page encoding is compatible with ASCII. Ok. > > > and JSON input (i.e. template string) > > > is a local-encoding text in general. > > > > encoding.jsonescape (indirectly) knows about localstr objects, and thus > > recovers > > the original UTF-8 text to encode if it exists. > > Yes, but localstr is mostly lost in templater, Oh? Please elaborate. > and toutf8b() takes it as bytes, > not as local-encoding text. You may be right, but according to the docstring, that means you've found a bug, > > > I have patch series to fix the issue4926, but I found my patch seems to > > > have > > > the emoji issue right now. > > > > Whatever we do, we need to kill the second implementation of jsonescape in > > the > > templater. > > Sure. My series will do: > > 1. add option to escape all non-ASCII characters by encoding.jsonescape() Ok. > 2. add "|utf8" template filter to explicitly convert localstr|str to utf-8 Sounds suspicious. What's it going to do with filenames? > 3. change "|json" to take input as utf8b bytes (BC) Sounds like a really big break from our encoding philosophy.
On Thu, 14 Jan 2016 12:49:27 -0600, Matt Mackall wrote: > On Thu, 2016-01-14 at 22:12 +0900, Yuya Nishihara wrote: > > > > and JSON input (i.e. template string) > > > > is a local-encoding text in general. > > > > > > encoding.jsonescape (indirectly) knows about localstr objects, and thus > > > recovers > > > the original UTF-8 text to encode if it exists. > > > > Yes, but localstr is mostly lost in templater, > > Oh? Please elaborate. I meant a localstr object would soon be converted to a plain str, loosing original _utf8 bytes. The templater does bunch of string manipulation. > > and toutf8b() takes it as bytes, > > not as local-encoding text. > > You may be right, but according to the docstring, that means you've found > a bug, I thought it was when utf8b helper was introduced, but I don't think it is a bug now. toutf8b() must keep the original bytes unmodified, so it can't take input as local-encoding text. The only thing it can do is to escape non-UTF8 part to U+DCxx. > > 1. add option to escape all non-ASCII characters by encoding.jsonescape() > > Ok. > > > 2. add "|utf8" template filter to explicitly convert localstr|str to utf-8 > > Sounds suspicious. What's it going to do with filenames? Filenames are the reason I'm going to introduce ugly |utf8 filter. Because filenames are bytes in Mercurial world, {filename|json} should preserve the original byte sequence, which means {x|json} -> '"' toutf8b(x) '"' On the other hand, most template strings are in local encoding. Because |json filter have to be byte-transparent to filenames, we need something to annotate an input as a local string. {x|utf8|json} -> '"' toutf8b(fromlocal(x)) '"' > > 3. change "|json" to take input as utf8b bytes (BC) > > Sounds like a really big break from our encoding philosophy. Yep, but |json was special in that it converts an input string to unicode. Good news is "json" and "jsonescape" are undocumented, so they are considered as internal filters.
Patch
diff -r 544d391bd3b4 -r 35d049d7e5a2 mercurial/templatefilters.py --- a/mercurial/templatefilters.py Mon Jan 11 13:43:43 2016 -0600 +++ b/mercurial/templatefilters.py Mon Jan 11 14:00:32 2016 -0600 @@ -197,15 +197,8 @@ return {None: 'null', False: 'false', True: 'true'}[obj] elif isinstance(obj, int) or isinstance(obj, float): return str(obj) - elif isinstance(obj, encoding.localstr): - u = encoding.fromlocal(obj).decode('utf-8') # can round-trip - return '"%s"' % jsonescape(u) elif isinstance(obj, str): - # no encoding.fromlocal() because it may abort if obj can't be decoded - u = unicode(obj, encoding.encoding, 'replace') - return '"%s"' % jsonescape(u) - elif isinstance(obj, unicode): - return '"%s"' % jsonescape(obj) + return '"%s"' % encoding.jsonescape(obj) elif util.safehasattr(obj, 'keys'): out = [] for k, v in sorted(obj.iteritems()): diff -r 544d391bd3b4 -r 35d049d7e5a2 tests/test-command-template.t --- a/tests/test-command-template.t Mon Jan 11 13:43:43 2016 -0600 +++ b/tests/test-command-template.t Mon Jan 11 14:00:32 2016 -0600 @@ -3493,12 +3493,12 @@ json filter should try round-trip conversion to utf-8: $ HGENCODING=ascii hg log -T "{branch|json}\n" -r0 - "\u00e9" + "\xc3\xa9" (esc) json filter should not abort if it can't decode bytes: (not sure the current behavior is right; we might want to use utf-8b encoding?) $ HGENCODING=ascii hg log -T "{'`cat utf-8`'|json}\n" -l1 - "\ufffd\ufffd" + "\xc3\xa9" (esc) $ cd .. diff -r 544d391bd3b4 -r 35d049d7e5a2 tests/test-hgweb-commands.t --- a/tests/test-hgweb-commands.t Mon Jan 11 13:43:43 2016 -0600 +++ b/tests/test-hgweb-commands.t Mon Jan 11 14:00:32 2016 -0600 @@ -2099,7 +2099,7 @@ >>> for line in open("out"): ... if line.startswith("var data ="): ... print line, - var data = [["061dd13ba3c3", [0, 1], [[0, 0, 1, -1, ""]], "\u80fd", "test", "1970-01-01", ["unstable", true], ["tip"], ["something"]], ["cad8025a2e87", [0, 1], [[0, 0, 1, 3, "FF0000"]], "branch commit with null character: \u0000", "test", "1970-01-01", ["unstable", false], [], []], ["1d22e65f027e", [0, 1], [[0, 0, 1, 3, ""]], "branch", "test", "1970-01-01", ["stable", true], [], []], ["a4f92ed23982", [0, 1], [[0, 0, 1, 3, ""]], "Added tag 1.0 for changeset 2ef0ac749a14", "test", "1970-01-01", ["default", true], [], []], ["2ef0ac749a14", [0, 1], [], "base", "test", "1970-01-01", ["default", false], ["1.0"], ["anotherthing"]]]; + var data = [["061dd13ba3c3", [0, 1], [[0, 0, 1, -1, ""]], "\xed\xb2\x94\\\\", "test", "1970-01-01", ["unstable", true], ["tip"], ["something"]], ["cad8025a2e87", [0, 1], [[0, 0, 1, 3, "FF0000"]], "branch commit with null character: \\u0000", "test", "1970-01-01", ["unstable", false], [], []], ["1d22e65f027e", [0, 1], [[0, 0, 1, 3, ""]], "branch", "test", "1970-01-01", ["stable", true], [], []], ["a4f92ed23982", [0, 1], [[0, 0, 1, 3, ""]], "Added tag 1.0 for changeset 2ef0ac749a14", "test", "1970-01-01", ["default", true], [], []], ["2ef0ac749a14", [0, 1], [], "base", "test", "1970-01-01", ["default", false], ["1.0"], ["anotherthing"]]]; (esc) capabilities