Patchwork encoding: handle UTF-16 internal limit with fromutf8b (issue5033)

Submitter Matt Mackall
Date Jan. 7, 2016, 9:01 p.m.
Message ID <7aa1dbfbd7a0966ae0e2.1452200516@ruin.waste.org>
Permalink /patch/12590/
State Changes Requested
Delegated to: Yuya Nishihara

Comments

Matt Mackall - Jan. 7, 2016, 9:01 p.m.
# HG changeset patch
# User Matt Mackall <mpm@selenic.com>
# Date 1452200277 21600
#      Thu Jan 07 14:57:57 2016 -0600
# Node ID 7aa1dbfbd7a0966ae0e241b16b72fde6df2cb94a
# Parent  b8405d739149cdd6d8d9bd5e3dd2ad8487b1f09a
encoding: handle UTF-16 internal limit with fromutf8b (issue5033)

Default (narrow) builds of Python 2 store Unicode strings as UTF-16
rather than as full code points, so iterating over a string can yield
surrogate halves instead of characters. Our UTF-8b escaping hack uses
the U+DCxx range, which overlaps the low-surrogate range that UTF-16
uses for surrogate pairs, so this gets extra complicated. This changes
the code to work on a list of integer code points rather than
"characters", and adds a path to unpack full Unicode code points in
the UTF-16 case.
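
To illustrate the overlap described above, here is a small Python 3 sketch. It is not part of the patch, and the helper names surrogate_pair and looks_like_utf8b_escape are made up for illustration. It splits U+40000 (the code point encoded by the "\xf1\x80\x80\x80" prefix of the new doctest) into its UTF-16 surrogate pair and applies the same mask test the patch uses.

```python
def surrogate_pair(cp):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000
    high = 0xD800 + (v >> 10)    # top 10 bits go into the high surrogate
    low = 0xDC00 + (v & 0x3FF)   # bottom 10 bits go into the low surrogate
    return high, low


def looks_like_utf8b_escape(unit):
    """Same mask test as in the patch: true for U+DC00..U+DCFF."""
    return unit & 0xFFFF00 == 0xDC00


high, low = surrogate_pair(0x40000)
print(hex(high), hex(low))            # 0xd8c0 0xdc00
print(looks_like_utf8b_escape(low))   # True
```

On a build that iterates over UTF-16 units, the low half of this pair would therefore be mistaken for an escaped byte, which is why the loop needs whole code points.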
Yuya Nishihara - Jan. 9, 2016, 7:26 a.m.
On Thu, 07 Jan 2016 15:01:56 -0600, Matt Mackall wrote:
> # HG changeset patch
> # User Matt Mackall <mpm@selenic.com>
> # Date 1452200277 21600
> #      Thu Jan 07 14:57:57 2016 -0600
> # Node ID 7aa1dbfbd7a0966ae0e241b16b72fde6df2cb94a
> # Parent  b8405d739149cdd6d8d9bd5e3dd2ad8487b1f09a
> encoding: handle UTF-16 internal limit with fromutf8b (issue5033)
> 
> Default (narrow) builds of Python 2 store Unicode strings as UTF-16
> rather than as full code points, so iterating over a string can yield
> surrogate halves instead of characters. Our UTF-8b escaping hack uses
> the U+DCxx range, which overlaps the low-surrogate range that UTF-16
> uses for surrogate pairs, so this gets extra complicated. This changes
> the code to work on a list of integer code points rather than
> "characters", and adds a path to unpack full Unicode code points in
> the UTF-16 case.
> 
> diff -r b8405d739149 -r 7aa1dbfbd7a0 mercurial/encoding.py
> --- a/mercurial/encoding.py	Sat Jan 02 02:13:56 2016 +0100
> +++ b/mercurial/encoding.py	Thu Jan 07 14:57:57 2016 -0600
> @@ -9,6 +9,8 @@
>  
>  import locale
>  import os
> +import struct
> +import sys
>  import unicodedata
>  
>  from . import (
> @@ -516,6 +518,8 @@
>      True
>      >>> roundtrip("\\xef\\xef\\xbf\\xbd")
>      True
> +    >>> roundtrip("\\xf1\\x80\\x80\\x80\\x80")
> +    True
>      '''
>  
>      # fast path - look for uDxxx prefixes in s
> @@ -523,10 +527,23 @@
>          return s
>  
>      u = s.decode("utf-8")
> +    if sys.maxunicode > 65535:
> +        # Our Python build is sane and stores UTF-32 internally, will
> +        # return full Unicode characters when iterating
> +        cpl = [ord(c) for c in u]
> +    else:
> +        # Our Python stores UTF-16 internally (default build) and will
> +        # return surrogate pairs for characters > U+FFFF, thus
> +        # defeating the point of having a Unicode string type.
> +        # We need to unpack as UCS-4.
> +        a = u.encode("utf-32-be")
> +        cpl = struct.unpack('>%dL' % (len(a) / 4), a)
> +
>      r = ""
> -    for c in u:
> -        if ord(c) & 0xffff00 == 0xdc00:
> -            r += chr(ord(c) & 0xff)
> +
> +    for cp in cpl:
> +        if cp & 0xffff00 == 0xdc00:
> +            r += chr(cp & 0xff)
>          else:
> -            r += c.encode("utf-8")
> +            r += unichr(cp).encode("utf-8")

Sadly unichr(cp) also doesn't work on narrow Python if cp >= 0x10000.
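
One possible way around this, sketched here as an assumption and not as what any follow-up patch actually does, is to skip unichr() entirely and let the codecs do the work: pack the integer as a single UTF-32-BE code unit, decode it, and re-encode as UTF-8. The function name codepoint_to_utf8 is hypothetical. The sketch is written for Python 3; whether the same round trip behaves as hoped on a narrow Python 2 build is not verified here.

```python
import struct


def codepoint_to_utf8(cp):
    """Encode one integer code point as UTF-8 without calling unichr()/chr().

    Surrogate code points (U+D800..U+DFFF) are rejected by the strict
    UTF-32 decoder, so callers must handle the U+DCxx escapes first,
    as the loop in the patch does.
    """
    # A 4-byte big-endian unsigned integer is exactly one UTF-32-BE code unit.
    return struct.pack(">L", cp).decode("utf-32-be").encode("utf-8")


print(codepoint_to_utf8(0x41))      # b'A'
print(codepoint_to_utf8(0x40000))   # b'\xf1\x80\x80\x80'
```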
Matt Mackall - Jan. 11, 2016, 5:48 p.m.
On Sat, 2016-01-09 at 16:26 +0900, Yuya Nishihara wrote:
> On Thu, 07 Jan 2016 15:01:56 -0600, Matt Mackall wrote:
> > +            r += unichr(cp).encode("utf-8")
> 
> Sadly unichr(cp) also doesn't work on narrow Python if cp >= 0x10000.

Ok, that's just ridiculous. Alternate patch sent.

Patch

diff -r b8405d739149 -r 7aa1dbfbd7a0 mercurial/encoding.py
--- a/mercurial/encoding.py	Sat Jan 02 02:13:56 2016 +0100
+++ b/mercurial/encoding.py	Thu Jan 07 14:57:57 2016 -0600
@@ -9,6 +9,8 @@ 
 
 import locale
 import os
+import struct
+import sys
 import unicodedata
 
 from . import (
@@ -516,6 +518,8 @@ 
     True
     >>> roundtrip("\\xef\\xef\\xbf\\xbd")
     True
+    >>> roundtrip("\\xf1\\x80\\x80\\x80\\x80")
+    True
     '''
 
     # fast path - look for uDxxx prefixes in s
@@ -523,10 +527,23 @@ 
         return s
 
     u = s.decode("utf-8")
+    if sys.maxunicode > 65535:
+        # Our Python build is sane and stores UTF-32 internally, will
+        # return full Unicode characters when iterating
+        cpl = [ord(c) for c in u]
+    else:
+        # Our Python stores UTF-16 internally (default build) and will
+        # return surrogate pairs for characters > U+FFFF, thus
+        # defeating the point of having a Unicode string type.
+        # We need to unpack as UCS-4.
+        a = u.encode("utf-32-be")
+        cpl = struct.unpack('>%dL' % (len(a) / 4), a)
+
     r = ""
-    for c in u:
-        if ord(c) & 0xffff00 == 0xdc00:
-            r += chr(ord(c) & 0xff)
+
+    for cp in cpl:
+        if cp & 0xffff00 == 0xdc00:
+            r += chr(cp & 0xff)
         else:
-            r += c.encode("utf-8")
+            r += unichr(cp).encode("utf-8")
     return r
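
For readers who want to experiment with the technique in this patch, the following is a rough Python 3 sketch of the same idea: unpack the decoded string into integer code points via UTF-32, then turn U+DCxx escapes back into raw bytes. It is not Mercurial's code. The names code_points and fromutf8b_sketch are hypothetical, and the "surrogatepass" error handler is an addition needed only because Python 3 otherwise refuses to decode or encode lone surrogates.

```python
import struct


def code_points(u):
    """Return the integer code points of str u via a UTF-32 round trip."""
    a = u.encode("utf-32-be", "surrogatepass")
    return struct.unpack(">%dL" % (len(a) // 4), a)


def fromutf8b_sketch(s):
    """Turn UTF-8 bytes with U+DCxx escapes back into the original bytes."""
    u = s.decode("utf-8", "surrogatepass")
    out = bytearray()
    for cp in code_points(u):
        if cp & 0xFFFF00 == 0xDC00:
            # Escaped raw byte: keep only the low 8 bits.
            out.append(cp & 0xFF)
        else:
            # Ordinary code point: re-encode as UTF-8.
            out += chr(cp).encode("utf-8", "surrogatepass")
    return bytes(out)


# U+40000 followed by an escaped stray 0x80 byte (U+DC80 is ed b2 80 in UTF-8)
print(fromutf8b_sketch(b"\xf1\x80\x80\x80\xed\xb2\x80"))  # b'\xf1\x80\x80\x80\x80'
```

Python 3 strings always iterate by code point, so the UTF-32 step is redundant there. It is kept only to mirror the narrow-build path in the diff.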