Patchwork D7550: chg: fix chg to work with py3.7+ "coercing" the locale

login
register
mail settings
Submitter phabricator
Date Dec. 5, 2019, 11:46 p.m.
Message ID <differential-rev-PHID-DREV-veqrlcrt2bf4vtpui5qz-req@mercurial-scm.org>
Download mbox | patch
Permalink /patch/43587/
State Superseded
Headers show

Comments

phabricator - Dec. 5, 2019, 11:46 p.m.
spectral created this revision.
Herald added subscribers: mercurial-devel, mjpieters.
Herald added a reviewer: hg-reviewers.

REVISION SUMMARY
  When the environment is empty (specifically: it doesn't contain LC_ALL,
  LC_CTYPE, or LANG), Python will "coerce" the locale environment variables to be
  a UTF-8 capable one. It sets LC_CTYPE in the environment, and this breaks chg,
  since chg operates by:
  
  - start hg, using whatever environment the user has when chg starts
  - hg stores a hash of this "original" environment, but python has already set LC_CTYPE even though the user doesn't have it in their environment
  - chg calls setenv over the commandserver. This clears the environment inside of hg and sets it to be exactly what the environment in chg is (without LC_CTYPE).
  - chg calls validate to ensure that the environment hg is using (after the setenv call) is the one that the chg process has - if not, it is assumed the user changed their environment and we should use a different server. This will *never* be true in this situation because LC_CTYPE was removed.

REPOSITORY
  rHG Mercurial

BRANCH
  default

REVISION DETAIL
  https://phab.mercurial-scm.org/D7550

AFFECTED FILES
  mercurial/chgserver.py
  tests/test-chg.t

CHANGE DETAILS




To: spectral, #hg-reviewers
Cc: mjpieters, mercurial-devel
Yuya Nishihara - Dec. 7, 2019, 3:55 a.m.
>   When the environment is empty (specifically: it doesn't contain LC_ALL,
>   LC_CTYPE, or LANG), Python will "coerce" the locale environment variables to be
>   a UTF-8 capable one. It sets LC_CTYPE in the environment, and this breaks chg,
>   since chg operates by:
>   
>   - start hg, using whatever environment the user has when chg starts
>   - hg stores a hash of this "original" environment, but python has already set LC_CTYPE even though the user doesn't have it in their environment
>   - chg calls setenv over the commandserver. This clears the environment inside of hg and sets it to be exactly what the environment in chg is (without LC_CTYPE).
>   - chg calls validate to ensure that the environment hg is using (after the setenv call) is the one that the chg process has - if not, it is assumed the user changed their environment and we should use a different server. This will *never* be true in this situation because LC_CTYPE was removed.

Sigh. Can we work around this weird behavior by making chg do
`putenv("PYTHONCOERCECLOCALE=0")`? I think it's simple and more desired
behavior than the default of Python 3.
phabricator - Dec. 7, 2019, 3:57 a.m.
yuja added a comment.


  >   When the environment is empty (specifically: it doesn't contain LC_ALL,
  >   LC_CTYPE, or LANG), Python will "coerce" the locale environment variables to be
  >   a UTF-8 capable one. It sets LC_CTYPE in the environment, and this breaks chg,
  >   since chg operates by:
  >   - start hg, using whatever environment the user has when chg starts
  >   - hg stores a hash of this "original" environment, but python has already set LC_CTYPE even though the user doesn't have it in their environment
  >   - chg calls setenv over the commandserver. This clears the environment inside of hg and sets it to be exactly what the environment in chg is (without LC_CTYPE).
  >   - chg calls validate to ensure that the environment hg is using (after the setenv call) is the one that the chg process has - if not, it is assumed the user changed their environment and we should use a different server. This will *never* be true in this situation because LC_CTYPE was removed.
  
  Sigh. Can we work around this weird behavior by making chg do
  `putenv("PYTHONCOERCECLOCALE=0")`? I think it's simple and more desired
  behavior than the default of Python 3.

REPOSITORY
  rHG Mercurial

CHANGES SINCE LAST ACTION
  https://phab.mercurial-scm.org/D7550/new/

REVISION DETAIL
  https://phab.mercurial-scm.org/D7550

To: spectral, #hg-reviewers
Cc: yuja, mjpieters, mercurial-devel
phabricator - Dec. 11, 2019, 11:14 p.m.
spectral added a comment.


  In D7550#111235 <https://phab.mercurial-scm.org/D7550#111235>, @yuja wrote:
  
  >>   When the environment is empty (specifically: it doesn't contain LC_ALL,
  >>   LC_CTYPE, or LANG), Python will "coerce" the locale environment variables to be
  >>   a UTF-8 capable one. It sets LC_CTYPE in the environment, and this breaks chg,
  >>   since chg operates by:
  >>   - start hg, using whatever environment the user has when chg starts
  >>   - hg stores a hash of this "original" environment, but python has already set LC_CTYPE even though the user doesn't have it in their environment
  >>   - chg calls setenv over the commandserver. This clears the environment inside of hg and sets it to be exactly what the environment in chg is (without LC_CTYPE).
  >>   - chg calls validate to ensure that the environment hg is using (after the setenv call) is the one that the chg process has - if not, it is assumed the user changed their environment and we should use a different server. This will *never* be true in this situation because LC_CTYPE was removed.
  >
  > Sigh. Can we work around this weird behavior by making chg do
  > `putenv("PYTHONCOERCECLOCALE=0")`? I think it's simple and more desired
  > behavior than the default of Python 3.
  
  I had considered that and was concerned it would create an observable, surprising/confusing difference between chg and non-chg: if chg sets PYTHONCOERCECLOCALE=0, hg won't have LC_CTYPE in the environment, and it WILL have PYTHONCOERCECLOCALE in the environment. When it starts external tools (like merge tools), this may change behavior in some observable fashion, and if the user stops using chg and uses just plain hg, it will have LC_CTYPE in the environment. This would probably be difficult to debug - users (at least the ones I interact with) often don't tell us they're using chg, if they even know that they are. (Sometimes users don't even know they're using chg, such as via their IDE's Mercurial integration, but that's probably not actually a problem here - the IDE would be responsible for making this work, not end users).
  
  I don't know the reason why Python is doing this at all, so maybe my concern is purely hypothetical and not really a problem?

REPOSITORY
  rHG Mercurial

CHANGES SINCE LAST ACTION
  https://phab.mercurial-scm.org/D7550/new/

REVISION DETAIL
  https://phab.mercurial-scm.org/D7550

To: spectral, #hg-reviewers
Cc: yuja, mjpieters, mercurial-devel
Yuya Nishihara - Dec. 12, 2019, 2:03 p.m.
>   > Sigh. Can we work around this weird behavior by making chg do
>   > `putenv("PYTHONCOERCECLOCALE=0")`? I think it's simple and more desired
>   > behavior than the default of Python 3.
>   
>   I had considered that and was concerned it would create an observable, surprising/confusing difference between chg and non-chg: if chg sets PYTHONCOERCECLOCALE=0, hg won't have LC_CTYPE in the environment, and it WILL have PYTHONCOERCECLOCALE in the environment. When it starts external tools (like merge tools), this may change behavior in some observable fashion, and if the user stops using chg and uses just plain hg, it will have LC_CTYPE in the environment.

Yeah, that could happen. I checked the CPython code, but there's no easy way
to disable PYTHONCOERCECLOCALE at all without writing a C wrapper or rebuilding
Python itself with --without-c-locale-coercion.

I don't care much about the pollution of subprocess environments since Python
does pollute LC_CTYPE by default, which is IMHO worse, but I agree it isn't
nice to introduce behavior change between pure hg and chg. So we'll have to
take this patch, sigh.

Queued, thanks.
phabricator - Dec. 12, 2019, 2:07 p.m.
yuja added a comment.


  >   > Sigh. Can we work around this weird behavior by making chg do
  >   > `putenv("PYTHONCOERCECLOCALE=0")`? I think it's simple and more desired
  >   > behavior than the default of Python 3.
  >   I had considered that and was concerned it would create an observable, surprising/confusing difference between chg and non-chg: if chg sets PYTHONCOERCECLOCALE=0, hg won't have LC_CTYPE in the environment, and it WILL have PYTHONCOERCECLOCALE in the environment. When it starts external tools (like merge tools), this may change behavior in some observable fashion, and if the user stops using chg and uses just plain hg, it will have LC_CTYPE in the environment.
  
  Yeah, that could happen. I checked the CPython code, but there's no easy way
  to disable PYTHONCOERCECLOCALE at all without writing a C wrapper or rebuilding
  Python itself with --without-c-locale-coercion.
  
  I don't care much about the pollution of subprocess environments since Python
  does pollute LC_CTYPE by default, which is IMHO worse, but I agree it isn't
  nice to introduce behavior change between pure hg and chg. So we'll have to
  take this patch, sigh.
  
  Queued, thanks.

REPOSITORY
  rHG Mercurial

CHANGES SINCE LAST ACTION
  https://phab.mercurial-scm.org/D7550/new/

REVISION DETAIL
  https://phab.mercurial-scm.org/D7550

To: spectral, #hg-reviewers
Cc: yuja, mjpieters, mercurial-devel

Patch

diff --git a/tests/test-chg.t b/tests/test-chg.t
--- a/tests/test-chg.t
+++ b/tests/test-chg.t
@@ -329,3 +329,25 @@ 
   YYYY/MM/DD HH:MM:SS (PID)> loaded repo into cache: $TESTTMP/cached2 (in  ...s)
   YYYY/MM/DD HH:MM:SS (PID)> log -R cached
   YYYY/MM/DD HH:MM:SS (PID)> loaded repo into cache: $TESTTMP/cached (in  ...s)
+
+Test that chg works even when python "coerces" the locale (py3.7+, which is done
+by default if none of LC_ALL, LC_CTYPE, or LANG are set in the environment)
+
+  $ cat > $TESTTMP/debugenv.py <<EOF
+  > from mercurial import encoding
+  > from mercurial import registrar
+  > cmdtable = {}
+  > command = registrar.command(cmdtable)
+  > @command(b'debugenv', [], b'', norepo=True)
+  > def debugenv(ui):
+  >     for k in [b'LC_ALL', b'LC_CTYPE', b'LANG']:
+  >         v = encoding.environ.get(k)
+  >         if v is not None:
+  >             ui.write(b'%s=%s\n' % (k, encoding.environ[k]))
+  > EOF
+  $ LANG= LC_ALL= LC_CTYPE= chg \
+  >    --config extensions.debugenv=$TESTTMP/debugenv.py debugenv
+  LC_ALL=
+  LC_CTYPE=C.UTF-8 (py37 !)
+  LC_CTYPE= (no-py37 !)
+  LANG=
diff --git a/mercurial/chgserver.py b/mercurial/chgserver.py
--- a/mercurial/chgserver.py
+++ b/mercurial/chgserver.py
@@ -549,6 +549,38 @@ 
         except ValueError:
             raise ValueError(b'unexpected value in setenv request')
         self.ui.log(b'chgserver', b'setenv: %r\n', sorted(newenv.keys()))
+
+        # Python3 has some logic to "coerce" the C locale to a UTF-8 capable
+        # one, and it sets LC_CTYPE in the environment to C.UTF-8 if none of
+        # 'LC_CTYPE', 'LC_ALL' or 'LANG' are set (to any value). This can be
+        # disabled with PYTHONCOERCECLOCALE=0 in the environment.
+        #
+        # When fromui is called via _inithashstate, python has already set
+        # this, so that's in the environment right when we start up the hg
+        # process. Then chg will call us and tell us to set the environment to
+        # the one it has; this might NOT have LC_CTYPE, so we'll need to
+        # carry-forward the LC_CTYPE that was coerced in these situations.
+        #
+        # If this is not handled, we will fail config+env validation and fail
+        # to start chg. If this is just ignored instead of carried forward, we
+        # may have different behavior between chg and non-chg.
+        if pycompat.ispy3:
+            # Rename for wordwrapping purposes
+            oldenv = encoding.environ
+            if not any(e.get(b'PYTHONCOERCECLOCALE') == b'0' for e in [oldenv,
+                                                                       newenv]):
+                keys = [b'LC_CTYPE', b'LC_ALL', b'LANG']
+                old_keys = [k for k, v in oldenv.items() if k in keys and v]
+                new_keys = [k for k, v in newenv.items() if k in keys and v]
+                # If the user's environment (from chg) doesn't have ANY of the
+                # keys that python looks for, and the environment (from
+                # initialization) has ONLY LC_CTYPE and it's set to C.UTF-8,
+                # carry it forward.
+                if (not new_keys and
+                    old_keys == [b'LC_CTYPE'] and
+                    oldenv[b'LC_CTYPE'] == b'C.UTF-8'):
+                    newenv[b'LC_CTYPE'] = oldenv[b'LC_CTYPE']
+
         encoding.environ.clear()
         encoding.environ.update(newenv)