Patchwork D8039: chg: force-set LC_CTYPE on server start to actual value from the environment

Submitter phabricator
Date Jan. 29, 2020, 10:24 p.m.
Message ID <differential-rev-PHID-DREV-fy4mjjcodrh42a5mpt7y-req@mercurial-scm.org>
Permalink /patch/44736/
State Superseded

Comments

phabricator - Jan. 29, 2020, 10:24 p.m.
spectral created this revision.
Herald added subscribers: mercurial-devel, mjpieters.
Herald added a reviewer: hg-reviewers.

REVISION SUMMARY
  Python 3.7+ will "coerce" the LC_CTYPE variable in many instances, and this can
  prevent chg from starting up. D7550 <https://phab.mercurial-scm.org/D7550> attempted to fix this, but a
  combination of a misreading of the way that Python 3.7 does the coercion and an
  untested state (LC_CTYPE being set to an invalid value) meant that this was
  still not quite working.
  
  This change will cause differences between chg and hg: hg will have the LC_CTYPE
  environment variable coerced, while chg will not. This is unlikely to cause any
  detectable behavior differences in what Mercurial itself outputs, but it does
  have two known effects:
  
  - When using hg, the coerced LC_CTYPE will be passed to subprocesses, even non-python ones. Using chg will remove the coercion, and this will not happen. This is arguably more correct behavior on chg's part.
  - On macOS, if you set your region to Brazil but your language to English, this isn't representable in locale strings, so macOS sets LC_CTYPE=UTF-8. If this value is passed along when ssh'ing to a non-macOS machine, some functions (such as locale.setlocale()) may raise an exception due to an unsupported locale setting. This is most easily encountered when doing an interactive commit/split/etc. when using ui.interface=curses.
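
The second effect can be reproduced in isolation. The following is a minimal sketch that assumes only Python's standard `locale` module; the helper name `ctype_supported` is made up for illustration and is not part of Mercurial. Which values are accepted depends on the C library and the locales installed on the machine.

```python
import locale


def ctype_supported(value):
    """Return True if the C library accepts `value` as an LC_CTYPE locale."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, value)
        return True
    except locale.Error:
        # This is the "unsupported locale setting" failure described above.
        return False
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)  # restore the previous locale


if __name__ == "__main__":
    # Results for the last two candidates vary between platforms.
    for candidate in ("C", "UTF-8", "unsupported_value"):
        print(candidate, ctype_supported(candidate))
```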

REPOSITORY
  rHG Mercurial

BRANCH
  default

REVISION DETAIL
  https://phab.mercurial-scm.org/D8039

AFFECTED FILES
  contrib/chg/chg.c
  hg
  mercurial/chgserver.py
  tests/test-chg.t

CHANGE DETAILS




To: spectral, #hg-reviewers
Cc: mjpieters, mercurial-devel
Yuya Nishihara - Jan. 30, 2020, 2:52 p.m.
> -        # Python3 has some logic to "coerce" the C locale to a UTF-8 capable
> -        # one, and it sets LC_CTYPE in the environment to C.UTF-8 if none of
> -        # 'LC_CTYPE', 'LC_ALL' or 'LANG' are set (to any value). This can be
> -        # disabled with PYTHONCOERCECLOCALE=0 in the environment.
> -        #
> -        # When fromui is called via _inithashstate, python has already set
> -        # this, so that's in the environment right when we start up the hg
> -        # process. Then chg will call us and tell us to set the environment to
> -        # the one it has; this might NOT have LC_CTYPE, so we'll need to
> -        # carry-forward the LC_CTYPE that was coerced in these situations.
> -        #
> -        # If this is not handled, we will fail config+env validation and fail
> -        # to start chg. If this is just ignored instead of carried forward, we
> -        # may have different behavior between chg and non-chg.

Can you move and rephrase this comment?

> @@ -730,6 +696,11 @@
>      # environ cleaner.
>      if b'CHGINTERNALMARK' in encoding.environ:
>          del encoding.environ[b'CHGINTERNALMARK']
> +    if b'CHGORIG_LC_CTYPE' in encoding.environ:
> +        encoding.environ[b'LC_CTYPE'] = encoding.environ[b'CHGORIG_LC_CTYPE']
> +        del encoding.environ[b'CHGORIG_LC_CTYPE']
> +    elif b'CHG_CLEAR_LC_CTYPE' in encoding.environ:
> +        del encoding.environ[b'LC_CTYPE']

This would crash with a KeyError if `LC_CTYPE` wasn't set, and it probably
also needs to delete `CHG_CLEAR_LC_CTYPE`.
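
One way to address both points is sketched below, with a plain dict standing in for `encoding.environ`. The function name `restore_lc_ctype` is hypothetical, and this is not necessarily how the patch was revised.

```python
def restore_lc_ctype(environ):
    """Undo Python's LC_CTYPE coercion using the markers set by the chg client."""
    # pop() with a default never raises, unlike `del` on a missing key.
    cleared = environ.pop(b'CHG_CLEAR_LC_CTYPE', None) is not None
    original = environ.pop(b'CHGORIG_LC_CTYPE', None)
    if original is not None:
        # The client had LC_CTYPE set (possibly to an empty string): put it back.
        environ[b'LC_CTYPE'] = original
    elif cleared:
        # The client had no LC_CTYPE at all: drop whatever Python put there.
        environ.pop(b'LC_CTYPE', None)
    return environ
```

Both marker variables are removed regardless of which branch runs, so neither leaks into the environment that later commands see.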

> diff --git a/hg b/hg
> --- a/hg
> +++ b/hg
> @@ -1,4 +1,4 @@
> -#!/usr/bin/env python
> +#!/usr/bin/env python3

Unrelated change.
phabricator - Jan. 30, 2020, 3:50 p.m.
quark added a comment.


  What do you think about this approach:
  
  1. The server detects that LC_CTYPE is coerced.
  2. When handling the "validate" command, the server sends back an "invalidate this server, and fall back to original hg" response.
  
  This makes chg/non-chg behave consistently, at the cost of some startup overhead in a misconfigured environment. The chg client can potentially print a warning to remind the user to fix their environment.

REPOSITORY
  rHG Mercurial

CHANGES SINCE LAST ACTION
  https://phab.mercurial-scm.org/D8039/new/

REVISION DETAIL
  https://phab.mercurial-scm.org/D8039

To: spectral, #hg-reviewers
Cc: quark, yuja, mjpieters, mercurial-devel
Yuya Nishihara - Jan. 30, 2020, 4:51 p.m.
>   What do you think about this approach:
>   
>   1. The server detects that LC_CTYPE is coerced.
>   2. When handling the "validate" command, the server sends back "invalidate this server, and fallback to original hg" response.
>   
>   This makes chg/non-chg behave consistently with some startup overhead in mis-configured environment. The chg client can potentially print a warning to remind the user to fix their environment.

That could be, but if we do want to make chg/hg behavior consistent, maybe we
can adjust the hash computation logic?

 1. client sends CHG_ORIG_LC_CTYPE or CHG_UNSET_LC_CTYPE when spawning server
 2. they're kept in environ dict, but override LC_CTYPE while computing hash,
    and excluded from the hash
 3. client does not send these variables over setenv command, but passes the
    validation because `{CHG_ORIG_LC_CTYPE: x, LC_CTYPE: y} == {LC_CTYPE: x}`.

If we had a sha1 logic in C, we could compute the env hash at client side.
Python 3 can't be trusted.
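
The equality in step 3 can be sketched as follows. This is an illustration only: `effective_env` and `env_hash` are hypothetical helpers, the variable names follow the proposal above rather than the patch, and the real chgserver hash covers more than the environment.

```python
import hashlib

ORIG = b'CHG_ORIG_LC_CTYPE'
UNSET = b'CHG_UNSET_LC_CTYPE'


def effective_env(env):
    """Return the environment as the user had it, without the marker variables."""
    result = {k: v for k, v in env.items() if k not in (ORIG, UNSET)}
    if ORIG in env:
        # The marker overrides whatever LC_CTYPE Python may have coerced.
        result[b'LC_CTYPE'] = env[ORIG]
    elif UNSET in env:
        result.pop(b'LC_CTYPE', None)
    return result


def env_hash(env):
    """Hash the effective environment; markers never reach the digest."""
    digest = hashlib.sha1()
    for key, value in sorted(effective_env(env).items()):
        digest.update(key + b'=' + value + b'\0')
    return digest.hexdigest()
```

With this normalization, a server spawned with `{CHG_ORIG_LC_CTYPE: x, LC_CTYPE: y}` and a client sending `{LC_CTYPE: x}` produce the same digest, so validation passes without a restart.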
phabricator - Feb. 4, 2020, 9:48 p.m.
spectral added a comment.


  In D8039#118719 <https://phab.mercurial-scm.org/D8039#118719>, @yuja wrote:
  
  >>   What do you think about this approach:
  >>   1. The server detects that LC_CTYPE is coerced.
  >>   2. When handling the "validate" command, the server sends back "invalidate this server, and fallback to original hg" response.
  >>   This makes chg/non-chg behave consistently with some startup overhead in mis-configured environment. The chg client can potentially print a warning to remind the user to fix their environment.
  >
  > That could be, but if we do want to make chg/hg behavior consistent, maybe we
  > can adjust the hash computation logic?
  >
  > 1. client sends CHG_ORIG_LC_CTYPE or CHG_UNSET_LC_CTYPE when spawning server
  > 2. they're kept in environ dict, but override LC_CTYPE while computing hash, and excluded from the hash
  > 3. client does not send these variables over setenv command, but passes the validation because `{CHG_ORIG_LC_CTYPE: x, LC_CTYPE: y} == {LC_CTYPE: x}`.
  
  I think that's almost what I was doing in the original stack of three commits ending in D8023 <https://phab.mercurial-scm.org/D8023>, though I used a different encoding than two separate environment variables (one for clear, one for set). I have no strong preference between the two encoding methods. There's still a bit of an issue with the setenv command: it'll clear the environment and replace it with the one that came from chg. So we need to somehow keep it from clearing/modifying CHG_ORIG_LC_CTYPE and LC_CTYPE, which isn't difficult (D8023 <https://phab.mercurial-scm.org/D8023> already does this, just with a more complicated mechanism). With a small change to D8023 <https://phab.mercurial-scm.org/D8023> I think D8021 <https://phab.mercurial-scm.org/D8021> could be dropped, and D8022 <https://phab.mercurial-scm.org/D8022> could be changed to only do this in execcmdserver. Done.
  
  > If we had a sha1 logic in C, we could compute the env hash at client side.
  
  I think sha1dc has been added to the repo, so we do have a sha1 implementation in C. :)  Computing the hash on the client avoids the restart loop, but doesn't avoid the behavior difference - chg will still call setenv and replace the modified version with the original one.
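
The remaining difference comes from how setenv is applied. Below is a simplified sketch, with plain dicts standing in for `encoding.environ` and for the environment sent by the client. `apply_setenv` is a hypothetical name that mirrors the clear-then-update step in chgserver.py, and the example values are invented.

```python
def apply_setenv(server_env, client_env):
    """Replace the server's environment wholesale with the client's."""
    server_env.clear()
    server_env.update(client_env)
    return server_env


# Server process after Python coerced the locale at startup.
server_env = {b'HOME': b'/home/user', b'LC_CTYPE': b'C.UTF-8'}
# The user's shell never had LC_CTYPE set.
client_env = {b'HOME': b'/home/user'}

apply_setenv(server_env, client_env)
# The coerced LC_CTYPE does not survive, so chg and plain hg now differ.
print(b'LC_CTYPE' in server_env)
```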
  
  > Python 3 can't be trusted.
  
  Yeah :(
  
  Sorry about the delay responding here, I was sick with the flu.

Yuya Nishihara - Feb. 6, 2020, 2:20 p.m.
I like the simplicity of this patch. Queued, thanks.

> Computing the hash on the client avoids the restart loop, but doesn't avoid the behavior difference - chg will still call setenv and replace the modified version with the original one.

Indeed.

Patch

diff --git a/tests/test-chg.t b/tests/test-chg.t
--- a/tests/test-chg.t
+++ b/tests/test-chg.t
@@ -332,8 +332,8 @@ 
   YYYY/MM/DD HH:MM:SS (PID)> log -R cached
   YYYY/MM/DD HH:MM:SS (PID)> loaded repo into cache: $TESTTMP/cached (in  ...s)
 
-Test that chg works even when python "coerces" the locale (py3.7+, which is done
-by default if none of LC_ALL, LC_CTYPE, or LANG are set in the environment)
+Test that chg works (sets to the user's actual LC_CTYPE) even when python
+"coerces" the locale (py3.7+)
 
   $ cat > $TESTTMP/debugenv.py <<EOF
   > from mercurial import encoding
@@ -347,9 +347,22 @@ 
   >         if v is not None:
   >             ui.write(b'%s=%s\n' % (k, encoding.environ[k]))
   > EOF
+(hg keeps python's modified LC_CTYPE, chg doesn't)
+  $ (unset LC_ALL; unset LANG; LC_CTYPE= "$CHGHG" \
+  >    --config extensions.debugenv=$TESTTMP/debugenv.py debugenv)
+  LC_CTYPE=C.UTF-8 (py37 !)
+  LC_CTYPE= (no-py37 !)
+  $ (unset LC_ALL; unset LANG; LC_CTYPE= chg \
+  >    --config extensions.debugenv=$TESTTMP/debugenv.py debugenv)
+  LC_CTYPE=
+  $ (unset LC_ALL; unset LANG; LC_CTYPE=unsupported_value chg \
+  >    --config extensions.debugenv=$TESTTMP/debugenv.py debugenv)
+  LC_CTYPE=unsupported_value
+  $ (unset LC_ALL; unset LANG; LC_CTYPE= chg \
+  >    --config extensions.debugenv=$TESTTMP/debugenv.py debugenv)
+  LC_CTYPE=
   $ LANG= LC_ALL= LC_CTYPE= chg \
   >    --config extensions.debugenv=$TESTTMP/debugenv.py debugenv
   LC_ALL=
-  LC_CTYPE=C.UTF-8 (py37 !)
-  LC_CTYPE= (no-py37 !)
+  LC_CTYPE=
   LANG=
diff --git a/mercurial/chgserver.py b/mercurial/chgserver.py
--- a/mercurial/chgserver.py
+++ b/mercurial/chgserver.py
@@ -550,40 +550,6 @@ 
             raise ValueError(b'unexpected value in setenv request')
         self.ui.log(b'chgserver', b'setenv: %r\n', sorted(newenv.keys()))
 
-        # Python3 has some logic to "coerce" the C locale to a UTF-8 capable
-        # one, and it sets LC_CTYPE in the environment to C.UTF-8 if none of
-        # 'LC_CTYPE', 'LC_ALL' or 'LANG' are set (to any value). This can be
-        # disabled with PYTHONCOERCECLOCALE=0 in the environment.
-        #
-        # When fromui is called via _inithashstate, python has already set
-        # this, so that's in the environment right when we start up the hg
-        # process. Then chg will call us and tell us to set the environment to
-        # the one it has; this might NOT have LC_CTYPE, so we'll need to
-        # carry-forward the LC_CTYPE that was coerced in these situations.
-        #
-        # If this is not handled, we will fail config+env validation and fail
-        # to start chg. If this is just ignored instead of carried forward, we
-        # may have different behavior between chg and non-chg.
-        if pycompat.ispy3:
-            # Rename for wordwrapping purposes
-            oldenv = encoding.environ
-            if not any(
-                e.get(b'PYTHONCOERCECLOCALE') == b'0' for e in [oldenv, newenv]
-            ):
-                keys = [b'LC_CTYPE', b'LC_ALL', b'LANG']
-                old_keys = [k for k, v in oldenv.items() if k in keys and v]
-                new_keys = [k for k, v in newenv.items() if k in keys and v]
-                # If the user's environment (from chg) doesn't have ANY of the
-                # keys that python looks for, and the environment (from
-                # initialization) has ONLY LC_CTYPE and it's set to C.UTF-8,
-                # carry it forward.
-                if (
-                    not new_keys
-                    and old_keys == [b'LC_CTYPE']
-                    and oldenv[b'LC_CTYPE'] == b'C.UTF-8'
-                ):
-                    newenv[b'LC_CTYPE'] = oldenv[b'LC_CTYPE']
-
         encoding.environ.clear()
         encoding.environ.update(newenv)
 
@@ -730,6 +696,11 @@ 
     # environ cleaner.
     if b'CHGINTERNALMARK' in encoding.environ:
         del encoding.environ[b'CHGINTERNALMARK']
+    if b'CHGORIG_LC_CTYPE' in encoding.environ:
+        encoding.environ[b'LC_CTYPE'] = encoding.environ[b'CHGORIG_LC_CTYPE']
+        del encoding.environ[b'CHGORIG_LC_CTYPE']
+    elif b'CHG_CLEAR_LC_CTYPE' in encoding.environ:
+        del encoding.environ[b'LC_CTYPE']
 
     if repo:
         # one chgserver can serve multiple repos. drop repo information
diff --git a/hg b/hg
--- a/hg
+++ b/hg
@@ -1,4 +1,4 @@ 
-#!/usr/bin/env python
+#!/usr/bin/env python3
 #
 # mercurial - scalable distributed SCM
 #
diff --git a/contrib/chg/chg.c b/contrib/chg/chg.c
--- a/contrib/chg/chg.c
+++ b/contrib/chg/chg.c
@@ -226,6 +226,16 @@ 
 	}
 	argv[argsize - 1] = NULL;
 
+	const char *lc_ctype_env = getenv("LC_CTYPE");
+	if (lc_ctype_env == NULL) {
+		if (putenv("CHG_CLEAR_LC_CTYPE=") != 0)
+			abortmsgerrno("failed to putenv CHG_CLEAR_LC_CTYPE");
+	} else {
+		if (setenv("CHGORIG_LC_CTYPE", lc_ctype_env, 1) != 0) {
+			abortmsgerrno("failed to setenv CHGORIG_LC_CTYPE");
+		}
+	}
+
 	if (putenv("CHGINTERNALMARK=") != 0)
 		abortmsgerrno("failed to putenv");
 	if (execvp(hgcmd, (char **)argv) < 0)
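
For readers who do not read C, the client-side hunk above corresponds roughly to the following Python. This is a paraphrase for illustration, not code from the patch: `mark_lc_ctype` is a hypothetical name, and `env` is a plain dict standing in for the process environment.

```python
def mark_lc_ctype(env):
    """Record the user's real LC_CTYPE before the server process is exec'd."""
    marked = dict(env)
    if 'LC_CTYPE' not in env:
        # getenv() returned NULL: tell the server to drop any coerced value.
        marked['CHG_CLEAR_LC_CTYPE'] = ''
    else:
        # Preserve the exact value, even if it is empty or unsupported.
        marked['CHGORIG_LC_CTYPE'] = env['LC_CTYPE']
    return marked
```

Note that an empty `LC_CTYPE=` counts as set here, which matches the test cases in test-chg.t where chg reports `LC_CTYPE=` rather than removing the variable.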