Patchwork [1,of,6,foldmap-in-C] encoding: define an enum that specifies what normcase does to ASCII strings

login
register
mail settings
Submitter Siddharth Agarwal
Date April 2, 2015, 2:48 a.m.
Message ID <42a1040af0c362b38ce4.1427942886@devbig136.prn2.facebook.com>
Download mbox | patch
Permalink /patch/8441/
State Superseded
Commit f473a1fe5c7c467a9cb2ad9e6dd3ff7edef64de5
Headers show

Comments

Siddharth Agarwal - April 2, 2015, 2:48 a.m.
# HG changeset patch
# User Siddharth Agarwal <sid0@fb.com>
# Date 1427872870 25200
#      Wed Apr 01 00:21:10 2015 -0700
# Node ID 42a1040af0c362b38ce45fc71e065d1769902c79
# Parent  37a2b446985f2ef77b9690a0548c8630828b7412
encoding: define an enum that specifies what normcase does to ASCII strings

For C code we don't want to pay the cost of calling into a Python function for
the common case of ASCII filenames. However, while on most POSIX platforms we
prefer to normalize filenames by lowercasing them, on Windows we uppercase
them. We define an enum here indicating the direction that filenames should be
normalized as. Some platforms (notably Cygwin) have more complicated
normalization behavior -- we add a case for that too.

In upcoming patches we'll also define a fallback function that is called if the
string has non-ASCII bytes.

This enum will be replicated in the C code to make foldmaps. There's
unfortunately no nice way to avoid that -- we can't have encoding import
parsers because of import cycles. One way might be to have parsers import
encoding, but accessing Python modules from C code is just awkward.

The name 'normcaseasciispecs' was chosen to indicate that this is merely
an integer that specifies a behavior, not a function. The name was pluralized
since in upcoming patches we'll introduce 'normcaseasciispec' which will be one
of these values.
Adrian Buehlmann - April 2, 2015, 8:14 a.m.
On 2015-04-02 04:48, Siddharth Agarwal wrote:
> # HG changeset patch
> # User Siddharth Agarwal <sid0@fb.com>
> # Date 1427872870 25200
> #      Wed Apr 01 00:21:10 2015 -0700
> # Node ID 42a1040af0c362b38ce45fc71e065d1769902c79
> # Parent  37a2b446985f2ef77b9690a0548c8630828b7412
> encoding: define an enum that specifies what normcase does to ASCII strings
> 
> For C code we don't want to pay the cost of calling into a Python function for
> the common case of ASCII filenames. However, while on most POSIX platforms we
> prefer to normalize filenames by lowercasing them, on Windows we uppercase
> them. We define an enum here indicating the direction that filenames should be
> normalized as. Some platforms (notably Cygwin) have more complicated
> normalization behavior -- we add a case for that too.
> 
> In upcoming patches we'll also define a fallback function that is called if the
> string has non-ASCII bytes.
> 
> This enum will be replicated in the C code to make foldmaps. There's
> unfortunately no nice way to avoid that -- we can't have encoding import
> parsers because of import cycles. One way might be to have parsers import
> encoding, but accessing Python modules from C code is just awkward.
> 
> The name 'normcaseasciispecs' was chosen to indicate that this is merely
> an integer that specifies a behavior, not a function. The name was pluralized
> since in upcoming patches we'll introduce 'normcaseasciispec' which will be one
> of these values.
> 
> diff --git a/mercurial/encoding.py b/mercurial/encoding.py
> --- a/mercurial/encoding.py
> +++ b/mercurial/encoding.py
> @@ -354,6 +354,19 @@ def upper(s):
>      except LookupError, k:
>          raise error.Abort(k, hint="please check your locale settings")
>  
> +class normcaseasciispecs(object):
> +    '''what a platform's normcase does to ASCII strings
> +
> +    This is specified per platform, and should be consistent with what normcase
> +    on that platform actually does.
> +
> +    lower: normcase lowercases ASCII strings
> +    upper: normcase uppercases ASCII strings
> +    other: the fallback function should always be called'''
> +    lower = -1
> +    upper = 1
> +    other = 0
> +
>  _jsonmap = {}
>  
>  def jsonescape(s):

Ugh, this sounds ugly.

I guess there is not much chance this surprising difference between
Mercurial's util.normcase function doing uppercase when run on Windows
and lowercase when run on other platforms could be eliminated.

We originally used os.path.normcase on Windows, which happens to
lowercase the string there, but this was changed to uppercase by foozy
on 2011-12-16 with 3c5e818ac679 ("windows: use upper() instead of
lower() or os.path.normcase()").

I guess there is no chance to turn back 3c5e818ac679, given the problems
foozy had to deal with there.

(adding foozy to cc)
Siddharth Agarwal - April 2, 2015, 4:40 p.m.
On 04/02/2015 01:14 AM, Adrian Buehlmann wrote:
> On 2015-04-02 04:48, Siddharth Agarwal wrote:
>> # HG changeset patch
>> # User Siddharth Agarwal <sid0@fb.com>
>> # Date 1427872870 25200
>> #      Wed Apr 01 00:21:10 2015 -0700
>> # Node ID 42a1040af0c362b38ce45fc71e065d1769902c79
>> # Parent  37a2b446985f2ef77b9690a0548c8630828b7412
>> encoding: define an enum that specifies what normcase does to ASCII strings
>>
>> For C code we don't want to pay the cost of calling into a Python function for
>> the common case of ASCII filenames. However, while on most POSIX platforms we
>> prefer to normalize filenames by lowercasing them, on Windows we uppercase
>> them. We define an enum here indicating the direction that filenames should be
>> normalized as. Some platforms (notably Cygwin) have more complicated
>> normalization behavior -- we add a case for that too.
>>
>> In upcoming patches we'll also define a fallback function that is called if the
>> string has non-ASCII bytes.
>>
>> This enum will be replicated in the C code to make foldmaps. There's
>> unfortunately no nice way to avoid that -- we can't have encoding import
>> parsers because of import cycles. One way might be to have parsers import
>> encoding, but accessing Python modules from C code is just awkward.
>>
>> The name 'normcaseasciispecs' was chosen to indicate that this is merely
>> an integer that specifies a behavior, not a function. The name was pluralized
>> since in upcoming patches we'll introduce 'normcaseasciispec' which will be one
>> of these values.
>>
>> diff --git a/mercurial/encoding.py b/mercurial/encoding.py
>> --- a/mercurial/encoding.py
>> +++ b/mercurial/encoding.py
>> @@ -354,6 +354,19 @@ def upper(s):
>>      except LookupError, k:
>>          raise error.Abort(k, hint="please check your locale settings")
>>  
>> +class normcaseasciispecs(object):
>> +    '''what a platform's normcase does to ASCII strings
>> +
>> +    This is specified per platform, and should be consistent with what normcase
>> +    on that platform actually does.
>> +
>> +    lower: normcase lowercases ASCII strings
>> +    upper: normcase uppercases ASCII strings
>> +    other: the fallback function should always be called'''
>> +    lower = -1
>> +    upper = 1
>> +    other = 0
>> +
>>  _jsonmap = {}
>>  
>>  def jsonescape(s):
> Ugh, this sounds ugly.

I don't disagree.

> I guess there is not much chance this surprising difference between
> Mercurial's util.normcase function doing uppercase when run on Windows
> and lowercase when run on other platforms could be eliminated.
>
> We originally used os.path.normcase on Windows, which happens to
> lowercase the string there, but this was changed to uppercase by foozy
> on 2011-12-16 with 3c5e818ac679 ("windows: use upper() instead of
> lower() or os.path.normcase()").
>
> I guess there is no chance to turn back 3c5e818ac679, given the problems
> foozy had to deal with there.

Not that I'm aware of. For correctness we really need to uppercase on
NTFS (per Michael Kaplan's blog post) and lowercase on HFS+ (per Apple
tech note 1150). There's just no way around that.

- Siddharth
Matt Mackall - April 2, 2015, 6:22 p.m.
On Thu, 2015-04-02 at 10:14 +0200, Adrian Buehlmann wrote:
> On 2015-04-02 04:48, Siddharth Agarwal wrote:
> > # HG changeset patch
> > # User Siddharth Agarwal <sid0@fb.com>
> > # Date 1427872870 25200
> > #      Wed Apr 01 00:21:10 2015 -0700
> > # Node ID 42a1040af0c362b38ce45fc71e065d1769902c79
> > # Parent  37a2b446985f2ef77b9690a0548c8630828b7412
> > encoding: define an enum that specifies what normcase does to ASCII strings
> > 
> > For C code we don't want to pay the cost of calling into a Python function for
> > the common case of ASCII filenames. However, while on most POSIX platforms we
> > prefer to normalize filenames by lowercasing them, on Windows we uppercase
> > them. We define an enum here indicating the direction that filenames should be
> > normalized as. Some platforms (notably Cygwin) have more complicated
> > normalization behavior -- we add a case for that too.
> > 
> > In upcoming patches we'll also define a fallback function that is called if the
> > string has non-ASCII bytes.
> > 
> > This enum will be replicated in the C code to make foldmaps. There's
> > unfortunately no nice way to avoid that -- we can't have encoding import
> > parsers because of import cycles. One way might be to have parsers import
> > encoding, but accessing Python modules from C code is just awkward.
> > 
> > The name 'normcaseasciispecs' was chosen to indicate that this is merely
> > an integer that specifies a behavior, not a function. The name was pluralized
> > since in upcoming patches we'll introduce 'normcaseasciispec' which will be one
> > of these values.
> > 
> > diff --git a/mercurial/encoding.py b/mercurial/encoding.py
> > --- a/mercurial/encoding.py
> > +++ b/mercurial/encoding.py
> > @@ -354,6 +354,19 @@ def upper(s):
> >      except LookupError, k:
> >          raise error.Abort(k, hint="please check your locale settings")
> >  
> > +class normcaseasciispecs(object):
> > +    '''what a platform's normcase does to ASCII strings
> > +
> > +    This is specified per platform, and should be consistent with what normcase
> > +    on that platform actually does.
> > +
> > +    lower: normcase lowercases ASCII strings
> > +    upper: normcase uppercases ASCII strings
> > +    other: the fallback function should always be called'''
> > +    lower = -1
> > +    upper = 1
> > +    other = 0
> > +
> >  _jsonmap = {}
> >  
> >  def jsonescape(s):
> 
> Ugh, this sounds ugly.
> 
> I guess there is not much chance this surprising difference between
> Mercurial's util.normcase function doing uppercase when run on Windows
> and lowercase when run on other platforms could be eliminated.

No, it's a difference present in the underlying filesystems (NTFS
compares via upper(), HFS+ via lower()). And given that there are a
number of scripts that aren't quite 1:1 upper:lower, there are
differences that appear. There's even a script that has one uppercase
and two lowercases, so it's fully 1:2.
Matt Mackall - April 2, 2015, 8:50 p.m.
On Wed, 2015-04-01 at 19:48 -0700, Siddharth Agarwal wrote:
> # HG changeset patch
> # User Siddharth Agarwal <sid0@fb.com>
> # Date 1427872870 25200
> #      Wed Apr 01 00:21:10 2015 -0700
> # Node ID 42a1040af0c362b38ce45fc71e065d1769902c79
> # Parent  37a2b446985f2ef77b9690a0548c8630828b7412
> encoding: define an enum that specifies what normcase does to ASCII strings
> 
> For C code we don't want to pay the cost of calling into a Python function for
> the common case of ASCII filenames. However, while on most POSIX platforms we
> prefer to normalize filenames by lowercasing them, on Windows we uppercase
> them. We define an enum here indicating the direction that filenames should be
> normalized as. Some platforms (notably Cygwin) have more complicated
> normalization behavior -- we add a case for that too.
> 
> In upcoming patches we'll also define a fallback function that is called if the
> string has non-ASCII bytes.
> 
> This enum will be replicated in the C code to make foldmaps. There's
> unfortunately no nice way to avoid that -- we can't have encoding import
> parsers because of import cycles. One way might be to have parsers import
> encoding, but accessing Python modules from C code is just awkward.
> 
> The name 'normcaseasciispecs' was chosen to indicate that this is merely
> an integer that specifies a behavior, not a function. The name was pluralized
> since in upcoming patches we'll introduce 'normcaseasciispec' which will be one
> of these values.
> 
> diff --git a/mercurial/encoding.py b/mercurial/encoding.py
> --- a/mercurial/encoding.py
> +++ b/mercurial/encoding.py
> @@ -354,6 +354,19 @@ def upper(s):
>      except LookupError, k:
>          raise error.Abort(k, hint="please check your locale settings")
>  
> +class normcaseasciispecs(object):
> +    '''what a platform's normcase does to ASCII strings
> +
> +    This is specified per platform, and should be consistent with what normcase
> +    on that platform actually does.
> +
> +    lower: normcase lowercases ASCII strings
> +    upper: normcase uppercases ASCII strings
> +    other: the fallback function should always be called'''
> +    lower = -1
> +    upper = 1
> +    other = 0

This looks ok, but I think you've gotten hung up on 7-bit strings when
naming this. The direction here applies to what normcase does with
Unicode strings too, no? I think this should simply be normcasespec(s)?

Also, as long as you're poking around here, I'll mention that fast
isascii and isutf8 functions would be handy.

Patch

diff --git a/mercurial/encoding.py b/mercurial/encoding.py
--- a/mercurial/encoding.py
+++ b/mercurial/encoding.py
@@ -354,6 +354,19 @@  def upper(s):
     except LookupError, k:
         raise error.Abort(k, hint="please check your locale settings")
 
+class normcaseasciispecs(object):
+    '''what a platform's normcase does to ASCII strings
+
+    This is specified per platform, and should be consistent with what normcase
+    on that platform actually does.
+
+    lower: normcase lowercases ASCII strings
+    upper: normcase uppercases ASCII strings
+    other: the fallback function should always be called'''
+    lower = -1
+    upper = 1
+    other = 0
+
 _jsonmap = {}
 
 def jsonescape(s):