Submitter | Manuel Jacob |
---|---|
Date | June 24, 2020, 12:57 p.m. |
Message ID | <0bc2e245a6f89bd5797e.1593003451@tmp> |
Download | mbox | patch |
Permalink | /patch/46556/ |
State | Superseded |
Headers | show |
Comments
This breaks tests (https://foss.heptapod.net/octobus/mercurial-devel/pipelines/7240). I’ll investigate. On 2020-06-24 14:57, Manuel Jacob wrote: > # HG changeset patch > # User Manuel Jacob <me@manueljacob.de> > # Date 1593002661 -7200 > # Wed Jun 24 14:44:21 2020 +0200 > # Node ID 0bc2e245a6f89bd5797e234874733f31778aea67 > # Parent c2df0bca0dfa30cdf177d37b852841002d670ee5 > # EXP-Topic use_os_fsencode > pycompat: use os.fsencode() to re-encode sys.argv > > Historically, the previous code made sense, as Py_EncodeLocale() and > fs.fsencode() could possibly use different encodings. However, this is > not the > case anymore for Python 3.2, which uses the locale encoding as the > filesystem > encoding (this is not true for later Python versions, but see below). > See > https://vstinner.github.io/painful-history-python-filesystem-encoding.html > for > a source and more background information. > > Using os.fsencode() is safer, as the documentation for sys.argv says > that it can > be used to get the original bytes. When doing further changes, the > Python > developers will take care that this continues to work. > > One concrete case where os.fsencode() is more correct is when enabling > Python's > UTF-8 mode. Py_DecodeLocale() will use UTF-8 in this case. Our previous > code > would have encoded it using the locale encoding (which might be > different), > whereas os.fsencode() will encode it with UTF-8. > > Since we don’t claim to support the UTF-8 mode, this is not really a > bug and the > patch can go to the default branch. It might be a good idea to not > commit this > to the stable branch, as it could in theory introduce regressions. > > diff --git a/mercurial/pycompat.py b/mercurial/pycompat.py > --- a/mercurial/pycompat.py > +++ b/mercurial/pycompat.py > @@ -156,16 +156,10 @@ > > if getattr(sys, 'argv', None) is not None: > # On POSIX, the char** argv array is converted to Python str > using > - # Py_DecodeLocale(). The inverse of this is > Py_EncodeLocale(), which isn't > - # directly callable from Python code. So, we need to emulate > it. > - # Py_DecodeLocale() calls mbstowcs() and falls back to > mbrtowc() with > - # surrogateescape error handling on failure. These functions > take the > - # current system locale into account. So, the inverse > operation is to > - # .encode() using the system locale's encoding and using the > - # surrogateescape error handler. The only tricky part here is > getting > - # the system encoding correct, since `locale.getlocale()` can > return > - # None. We fall back to the filesystem encoding if lookups via > `locale` > - # fail, as this seems like a reasonable thing to do. > + # Py_DecodeLocale(). The inverse of this is Py_EncodeLocale(), > which > + # isn't directly callable from Python code. In practice, > os.fsencode() > + # can be used instead (this is recommended by Python's > documentation > + # for sys.argv). > # > # On Windows, the wchar_t **argv is passed into the > interpreter as-is. > # Like POSIX, we need to emulate what Py_EncodeLocale() would > do. But > @@ -178,12 +172,7 @@ > if os.name == r'nt': > sysargv = [a.encode("mbcs", "ignore") for a in sys.argv] > else: > - encoding = ( > - locale.getlocale()[1] > - or locale.getdefaultlocale()[1] > - or sys.getfilesystemencoding() > - ) > - sysargv = [a.encode(encoding, "surrogateescape") for a in > sys.argv] > + sysargv = [fsencode(a) for a in sys.argv] > > bytechr = struct.Struct('>B').pack > byterepr = b'%r'.__mod__ > _______________________________________________ > Mercurial-devel mailing list > Mercurial-devel@mercurial-scm.org > https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
It turns out that the test breakage is not at all because of the changes in the patch. It’s because of non-ASCII characters in the changeset description. See https://foss.heptapod.net/octobus/mercurial-devel/-/commit/e5930264da81a3a73b424efc46f114dbfb9bbe6d, which also fails. GitLab puts the changeset description in the environment variables (maybe after today’s update?), which hgweb tries to encode using latin-1. The solution is... more fsdecode()s. I hope this doesn’t sound too dubious to be true. :) On 2020-06-24 16:14, Manuel Jacob wrote: > This breaks tests > (https://foss.heptapod.net/octobus/mercurial-devel/pipelines/7240). > I’ll investigate. > > On 2020-06-24 14:57, Manuel Jacob wrote: >> # HG changeset patch >> # User Manuel Jacob <me@manueljacob.de> >> # Date 1593002661 -7200 >> # Wed Jun 24 14:44:21 2020 +0200 >> # Node ID 0bc2e245a6f89bd5797e234874733f31778aea67 >> # Parent c2df0bca0dfa30cdf177d37b852841002d670ee5 >> # EXP-Topic use_os_fsencode >> pycompat: use os.fsencode() to re-encode sys.argv >> >> Historically, the previous code made sense, as Py_EncodeLocale() and >> fs.fsencode() could possibly use different encodings. However, this is >> not the >> case anymore for Python 3.2, which uses the locale encoding as the >> filesystem >> encoding (this is not true for later Python versions, but see below). >> See >> https://vstinner.github.io/painful-history-python-filesystem-encoding.html >> for >> a source and more background information. >> >> Using os.fsencode() is safer, as the documentation for sys.argv says >> that it can >> be used to get the original bytes. When doing further changes, the >> Python >> developers will take care that this continues to work. >> >> One concrete case where os.fsencode() is more correct is when enabling >> Python's >> UTF-8 mode. Py_DecodeLocale() will use UTF-8 in this case. Our >> previous code >> would have encoded it using the locale encoding (which might be >> different), >> whereas os.fsencode() will encode it with UTF-8. >> >> Since we don’t claim to support the UTF-8 mode, this is not really a >> bug and the >> patch can go to the default branch. It might be a good idea to not >> commit this >> to the stable branch, as it could in theory introduce regressions. >> >> diff --git a/mercurial/pycompat.py b/mercurial/pycompat.py >> --- a/mercurial/pycompat.py >> +++ b/mercurial/pycompat.py >> @@ -156,16 +156,10 @@ >> >> if getattr(sys, 'argv', None) is not None: >> # On POSIX, the char** argv array is converted to Python str >> using >> - # Py_DecodeLocale(). The inverse of this is >> Py_EncodeLocale(), which isn't >> - # directly callable from Python code. So, we need to emulate >> it. >> - # Py_DecodeLocale() calls mbstowcs() and falls back to >> mbrtowc() with >> - # surrogateescape error handling on failure. These functions >> take the >> - # current system locale into account. So, the inverse >> operation is to >> - # .encode() using the system locale's encoding and using the >> - # surrogateescape error handler. The only tricky part here is >> getting >> - # the system encoding correct, since `locale.getlocale()` can >> return >> - # None. We fall back to the filesystem encoding if lookups >> via `locale` >> - # fail, as this seems like a reasonable thing to do. >> + # Py_DecodeLocale(). The inverse of this is >> Py_EncodeLocale(), which >> + # isn't directly callable from Python code. In practice, >> os.fsencode() >> + # can be used instead (this is recommended by Python's >> documentation >> + # for sys.argv). >> # >> # On Windows, the wchar_t **argv is passed into the >> interpreter as-is. >> # Like POSIX, we need to emulate what Py_EncodeLocale() would >> do. But >> @@ -178,12 +172,7 @@ >> if os.name == r'nt': >> sysargv = [a.encode("mbcs", "ignore") for a in sys.argv] >> else: >> - encoding = ( >> - locale.getlocale()[1] >> - or locale.getdefaultlocale()[1] >> - or sys.getfilesystemencoding() >> - ) >> - sysargv = [a.encode(encoding, "surrogateescape") for a in >> sys.argv] >> + sysargv = [fsencode(a) for a in sys.argv] >> >> bytechr = struct.Struct('>B').pack >> byterepr = b'%r'.__mod__ >> _______________________________________________ >> Mercurial-devel mailing list >> Mercurial-devel@mercurial-scm.org >> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel > _______________________________________________ > Mercurial-devel mailing list > Mercurial-devel@mercurial-scm.org > https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Rebased on the WSGI environment patches it passes test-py3 heptapod CI: https://foss.heptapod.net/octobus/mercurial-devel/-/jobs/81591 When run directly, it passed before and after the other patch. I’ve sent an otherwise unmodified version of this patch rebased on a merge from stable. On 2020-06-25 02:05, Manuel Jacob wrote: > It turns out that the test breakage is not at all because of the > changes in the patch. It’s because of non-ASCII characters in the > changeset description. See > https://foss.heptapod.net/octobus/mercurial-devel/-/commit/e5930264da81a3a73b424efc46f114dbfb9bbe6d, > which also fails. > > GitLab puts the changeset description in the environment variables > (maybe after today’s update?), which hgweb tries to encode using > latin-1. The solution is... more fsdecode()s. I hope this doesn’t > sound too dubious to be true. :) > > On 2020-06-24 16:14, Manuel Jacob wrote: >> This breaks tests >> (https://foss.heptapod.net/octobus/mercurial-devel/pipelines/7240). >> I’ll investigate. >> >> On 2020-06-24 14:57, Manuel Jacob wrote: >>> # HG changeset patch >>> # User Manuel Jacob <me@manueljacob.de> >>> # Date 1593002661 -7200 >>> # Wed Jun 24 14:44:21 2020 +0200 >>> # Node ID 0bc2e245a6f89bd5797e234874733f31778aea67 >>> # Parent c2df0bca0dfa30cdf177d37b852841002d670ee5 >>> # EXP-Topic use_os_fsencode >>> pycompat: use os.fsencode() to re-encode sys.argv >>> >>> Historically, the previous code made sense, as Py_EncodeLocale() and >>> fs.fsencode() could possibly use different encodings. However, this >>> is not the >>> case anymore for Python 3.2, which uses the locale encoding as the >>> filesystem >>> encoding (this is not true for later Python versions, but see below). >>> See >>> https://vstinner.github.io/painful-history-python-filesystem-encoding.html >>> for >>> a source and more background information. >>> >>> Using os.fsencode() is safer, as the documentation for sys.argv says >>> that it can >>> be used to get the original bytes. When doing further changes, the >>> Python >>> developers will take care that this continues to work. >>> >>> One concrete case where os.fsencode() is more correct is when >>> enabling Python's >>> UTF-8 mode. Py_DecodeLocale() will use UTF-8 in this case. Our >>> previous code >>> would have encoded it using the locale encoding (which might be >>> different), >>> whereas os.fsencode() will encode it with UTF-8. >>> >>> Since we don’t claim to support the UTF-8 mode, this is not really a >>> bug and the >>> patch can go to the default branch. It might be a good idea to not >>> commit this >>> to the stable branch, as it could in theory introduce regressions. >>> >>> diff --git a/mercurial/pycompat.py b/mercurial/pycompat.py >>> --- a/mercurial/pycompat.py >>> +++ b/mercurial/pycompat.py >>> @@ -156,16 +156,10 @@ >>> >>> if getattr(sys, 'argv', None) is not None: >>> # On POSIX, the char** argv array is converted to Python str >>> using >>> - # Py_DecodeLocale(). The inverse of this is >>> Py_EncodeLocale(), which isn't >>> - # directly callable from Python code. So, we need to emulate >>> it. >>> - # Py_DecodeLocale() calls mbstowcs() and falls back to >>> mbrtowc() with >>> - # surrogateescape error handling on failure. These functions >>> take the >>> - # current system locale into account. So, the inverse >>> operation is to >>> - # .encode() using the system locale's encoding and using the >>> - # surrogateescape error handler. The only tricky part here >>> is getting >>> - # the system encoding correct, since `locale.getlocale()` >>> can return >>> - # None. We fall back to the filesystem encoding if lookups >>> via `locale` >>> - # fail, as this seems like a reasonable thing to do. >>> + # Py_DecodeLocale(). The inverse of this is >>> Py_EncodeLocale(), which >>> + # isn't directly callable from Python code. In practice, >>> os.fsencode() >>> + # can be used instead (this is recommended by Python's >>> documentation >>> + # for sys.argv). >>> # >>> # On Windows, the wchar_t **argv is passed into the >>> interpreter as-is. >>> # Like POSIX, we need to emulate what Py_EncodeLocale() >>> would do. But >>> @@ -178,12 +172,7 @@ >>> if os.name == r'nt': >>> sysargv = [a.encode("mbcs", "ignore") for a in sys.argv] >>> else: >>> - encoding = ( >>> - locale.getlocale()[1] >>> - or locale.getdefaultlocale()[1] >>> - or sys.getfilesystemencoding() >>> - ) >>> - sysargv = [a.encode(encoding, "surrogateescape") for a >>> in sys.argv] >>> + sysargv = [fsencode(a) for a in sys.argv] >>> >>> bytechr = struct.Struct('>B').pack >>> byterepr = b'%r'.__mod__ >>> _______________________________________________ >>> Mercurial-devel mailing list >>> Mercurial-devel@mercurial-scm.org >>> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel >> _______________________________________________ >> Mercurial-devel mailing list >> Mercurial-devel@mercurial-scm.org >> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel > _______________________________________________ > Mercurial-devel mailing list > Mercurial-devel@mercurial-scm.org > https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Patch
diff --git a/mercurial/pycompat.py b/mercurial/pycompat.py --- a/mercurial/pycompat.py +++ b/mercurial/pycompat.py @@ -156,16 +156,10 @@ if getattr(sys, 'argv', None) is not None: # On POSIX, the char** argv array is converted to Python str using - # Py_DecodeLocale(). The inverse of this is Py_EncodeLocale(), which isn't - # directly callable from Python code. So, we need to emulate it. - # Py_DecodeLocale() calls mbstowcs() and falls back to mbrtowc() with - # surrogateescape error handling on failure. These functions take the - # current system locale into account. So, the inverse operation is to - # .encode() using the system locale's encoding and using the - # surrogateescape error handler. The only tricky part here is getting - # the system encoding correct, since `locale.getlocale()` can return - # None. We fall back to the filesystem encoding if lookups via `locale` - # fail, as this seems like a reasonable thing to do. + # Py_DecodeLocale(). The inverse of this is Py_EncodeLocale(), which + # isn't directly callable from Python code. In practice, os.fsencode() + # can be used instead (this is recommended by Python's documentation + # for sys.argv). # # On Windows, the wchar_t **argv is passed into the interpreter as-is. # Like POSIX, we need to emulate what Py_EncodeLocale() would do. But @@ -178,12 +172,7 @@ if os.name == r'nt': sysargv = [a.encode("mbcs", "ignore") for a in sys.argv] else: - encoding = ( - locale.getlocale()[1] - or locale.getdefaultlocale()[1] - or sys.getfilesystemencoding() - ) - sysargv = [a.encode(encoding, "surrogateescape") for a in sys.argv] + sysargv = [fsencode(a) for a in sys.argv] bytechr = struct.Struct('>B').pack byterepr = b'%r'.__mod__