Submitter | Pulkit Goyal |
---|---|
Date | Sept. 14, 2016, 5:15 p.m. |
Message ID | <ec133d50af780e84a6a2.1473873327@pulkit-goyal> |
Permalink | /patch/16629/ |
State | Changes Requested |
Comments
On Wed, 14 Sep 2016 22:45:27 +0530, Pulkit Goyal wrote:
> # HG changeset patch
> # User Pulkit Goyal <7895pulkit@gmail.com>
> # Date 1473787789 -19800
> # Tue Sep 13 22:59:49 2016 +0530
> # Node ID ec133d50af780e84a6a24825b52d433c10f9cd55
> # Parent 85bd31515225e7fdf9bd88edde054db2c74a33f8
> py3: have an utility function to return string
>
> There are cases when we need strings and can't use bytes in python 3.
> We need an utility function for these cases. I agree that this may not
> be the best possible way out. I will be happy if anybody else can suggest
> a better approach. We need this functions for os.path.join(),

We should stick to bytes for filesystem API, and translate bytes to unicode
at VFS layer as necessary.

https://www.mercurial-scm.org/wiki/WindowsUTF8Plan

(Also, we'll have to disable PEP 528 and 529 on Python 3.6, which will break
existing repositories.)

https://docs.python.org/3.6/whatsnew/3.6.html

> __slots__

__slots__ can be considered private data, so just use u''.

> and few more things.

for instance?

> +# This function converts its arguments to strings
> +# on the basis of python version. Strings in python 3
> +# are unicodes and our transformer converts everything to bytes
> +# in python 3. So we need to decode it to unicodes in
> +# py3.
> +
> +def coverttostr(word):
> +    if sys.version_info[0] < 3:
> +        assert isinstance(word, str), "Not a string in Python 2"
> +        return word
> +    # Checking word is bytes because we have the transformer, else
> +    # raising error
> +    assert isinstance(word, bytes), "Should be bytes because of transformer"
> +    return word.decode(sys.getfilesystemencoding())

Can we assume 'word' was encoded in file-system codec?
* Pulkit Goyal on Wednesday, September 14, 2016 at 22:45:27 +0530
> # HG changeset patch
> # User Pulkit Goyal <7895pulkit@gmail.com>
> # Date 1473787789 -19800
> # Tue Sep 13 22:59:49 2016 +0530
> # Node ID ec133d50af780e84a6a24825b52d433c10f9cd55
> # Parent 85bd31515225e7fdf9bd88edde054db2c74a33f8
> py3: have an utility function to return string
>
> There are cases when we need strings and can't use bytes in python 3.
> We need an utility function for these cases. I agree that this may not
> be the best possible way out. I will be happy if anybody else can suggest
> a better approach. We need this functions for os.path.join(), __slots__
> and few more things. Added the function in pycompat.py as it is not too big
> to import.
>
> diff -r 85bd31515225 -r ec133d50af78 mercurial/pycompat.py
> --- a/mercurial/pycompat.py Sun Aug 21 13:16:21 2016 +0900
> +++ b/mercurial/pycompat.py Tue Sep 13 22:59:49 2016 +0530
> @@ -164,3 +164,18 @@
>      "SimpleHTTPRequestHandler",
>      "CGIHTTPRequestHandler",
>  ))
> +
> +# This function converts its arguments to strings
> +# on the basis of python version. Strings in python 3
> +# are unicodes and our transformer converts everything to bytes
> +# in python 3. So we need to decode it to unicodes in
> +# py3.
> +
> +def coverttostr(word):

converttostr I presume?
On Thu, Sep 15, 2016 at 7:06 PM, Yuya Nishihara <yuya@tcha.org> wrote:
> On Wed, 14 Sep 2016 22:45:27 +0530, Pulkit Goyal wrote:
>> # HG changeset patch
>> # User Pulkit Goyal <7895pulkit@gmail.com>
>> # Date 1473787789 -19800
>> # Tue Sep 13 22:59:49 2016 +0530
>> # Node ID ec133d50af780e84a6a24825b52d433c10f9cd55
>> # Parent 85bd31515225e7fdf9bd88edde054db2c74a33f8
>> py3: have an utility function to return string
>>
>> There are cases when we need strings and can't use bytes in python 3.
>> We need an utility function for these cases. I agree that this may not
>> be the best possible way out. I will be happy if anybody else can suggest
>> a better approach. We need this functions for os.path.join(),
>
> We should stick to bytes for filesystem API, and translate bytes to unicode
> at VFS layer as necessary.
>
> https://www.mercurial-scm.org/wiki/WindowsUTF8Plan
>
> (Also, we'll have to disable PEP 528 and 529 on Python 3.6, which will break
> existing repositories.)
>
> https://docs.python.org/3.6/whatsnew/3.6.html
>
>> __slots__
>
> __slots__ can be considered private data, so just use u''.
>
>> and few more things.
>
> for instance?

This function was motivated from Gregory's reply to
https://www.mercurial-scm.org/pipermail/mercurial-devel/2016-August/086704.html
, unfortunately I see that he replied to me only so I pasted it here
https://bpaste.net/show/ab0d3ea39749

I am going through python documentation and there are things like
__slots__, is_frozen() which accepts str in both py2 and py3. Since
they are not same, I made this function to get help in such cases. If
we can use unicodes in __slots__ in py2, than thats good.

>
>> +# This function converts its arguments to strings
>> +# on the basis of python version. Strings in python 3
>> +# are unicodes and our transformer converts everything to bytes
>> +# in python 3. So we need to decode it to unicodes in
>> +# py3.
>> +
>> +def coverttostr(word):
>> +    if sys.version_info[0] < 3:
>> +        assert isinstance(word, str), "Not a string in Python 2"
>> +        return word
>> +    # Checking word is bytes because we have the transformer, else
>> +    # raising error
>> +    assert isinstance(word, bytes), "Should be bytes because of transformer"
>> +    return word.decode(sys.getfilesystemencoding())
>
> Can we assume 'word' was encoded in file-system codec?

Yeah because of the tranformer, we added b'' everywhere.
On Thu, Sep 15, 2016 at 7:07 PM, Christian Ebert <blacktrash@gmx.net> wrote:
> * Pulkit Goyal on Wednesday, September 14, 2016 at 22:45:27 +0530
>> # HG changeset patch
>> # User Pulkit Goyal <7895pulkit@gmail.com>
>> # Date 1473787789 -19800
>> # Tue Sep 13 22:59:49 2016 +0530
>> # Node ID ec133d50af780e84a6a24825b52d433c10f9cd55
>> # Parent 85bd31515225e7fdf9bd88edde054db2c74a33f8
>> py3: have an utility function to return string
>>
>> There are cases when we need strings and can't use bytes in python 3.
>> We need an utility function for these cases. I agree that this may not
>> be the best possible way out. I will be happy if anybody else can suggest
>> a better approach. We need this functions for os.path.join(), __slots__
>> and few more things. Added the function in pycompat.py as it is not too big
>> to import.
>>
>> diff -r 85bd31515225 -r ec133d50af78 mercurial/pycompat.py
>> --- a/mercurial/pycompat.py Sun Aug 21 13:16:21 2016 +0900
>> +++ b/mercurial/pycompat.py Tue Sep 13 22:59:49 2016 +0530
>> @@ -164,3 +164,18 @@
>>      "SimpleHTTPRequestHandler",
>>      "CGIHTTPRequestHandler",
>>  ))
>> +
>> +# This function converts its arguments to strings
>> +# on the basis of python version. Strings in python 3
>> +# are unicodes and our transformer converts everything to bytes
>> +# in python 3. So we need to decode it to unicodes in
>> +# py3.
>> +
>> +def coverttostr(word):
>
> converttostr I presume?

Yeah a typo by mistake, it should be converttostr
On 09/15/2016 03:36 PM, Yuya Nishihara wrote:
> On Wed, 14 Sep 2016 22:45:27 +0530, Pulkit Goyal wrote:
>> # HG changeset patch
>> # User Pulkit Goyal <7895pulkit@gmail.com>
>> # Date 1473787789 -19800
>> # Tue Sep 13 22:59:49 2016 +0530
>> # Node ID ec133d50af780e84a6a24825b52d433c10f9cd55
>> # Parent 85bd31515225e7fdf9bd88edde054db2c74a33f8
>> py3: have an utility function to return string
>>
>> There are cases when we need strings and can't use bytes in python 3.
>> We need an utility function for these cases. I agree that this may not
>> be the best possible way out. I will be happy if anybody else can suggest
>> a better approach. We need this functions for os.path.join(),
>
> We should stick to bytes for filesystem API, and translate bytes to unicode
> at VFS layer as necessary.
>
> https://www.mercurial-scm.org/wiki/WindowsUTF8Plan
>
> (Also, we'll have to disable PEP 528 and 529 on Python 3.6, which will break
> existing repositories.)
>
> https://docs.python.org/3.6/whatsnew/3.6.html
>
>> __slots__
>
> __slots__ can be considered private data, so just use u''.
>
>> and few more things.
>
> for instance?
>
>> +# This function converts its arguments to strings
>> +# on the basis of python version. Strings in python 3
>> +# are unicodes and our transformer converts everything to bytes
>> +# in python 3. So we need to decode it to unicodes in
>> +# py3.
>> +
>> +def coverttostr(word):

Any reason, this comment is not the python docstring?

>> +    if sys.version_info[0] < 3:
>> +        assert isinstance(word, str), "Not a string in Python 2"
>> +        return word
>> +    # Checking word is bytes because we have the transformer, else
>> +    # raising error
>> +    assert isinstance(word, bytes), "Should be bytes because of transformer"
>> +    return word.decode(sys.getfilesystemencoding())
>
> Can we assume 'word' was encoded in file-system codec?

On what kind of string is this going to be used. If we intend to us this on
Mercurial internal identifier only, we can probably assume (and actually,
enforce) ascii to keep things simple.

Cheers,
On 16 September 2016 at 11:09, Pierre-Yves David
<pierre-yves.david@ens-lyon.org> wrote:
>>> +    return word.decode(sys.getfilesystemencoding())
>>
>>
>> Can we assume 'word' was encoded in file-system codec?

No, this is being used for *source code literals*, so
getfilesystemencoding is the wrong codec here. Probably the function
should be given an encoding='utf8' default instead, so you can specify
a different codec.

>
> On what kind of string is this going to be used. If we intend to us this on
> Mercurial internal identifier only, we can probably assume (and actually,
> enforce) ascii to keep things simple.

If this is only going to be used for Python identifiers in strings
(e.g. the string(s) __slots__ accepts) then ASCII is fine, especially
because we need to keep the code working in both Python 2 and 3 and 2
only accepts ASCII for identifiers.
And having properly read Gregory's email, I see he intended his patch
to be used for *paths* in Python 3, and Pulkit is re-using this for
*Python identifiers in __slots__*. This at least explains why
getfilesystemencoding was used; it is the right choice for the first
use case, not the second.

On 16 September 2016 at 11:27, Martijn Pieters <mj@zopatista.com> wrote:
> On 16 September 2016 at 11:09, Pierre-Yves David
> <pierre-yves.david@ens-lyon.org> wrote:
>>>> +    return word.decode(sys.getfilesystemencoding())
>>>
>>>
>>> Can we assume 'word' was encoded in file-system codec?
>
> No, this is being used for *source code literals*, so
> getfilesystemencoding is the wrong codec here. Probably the function
> should be given an encoding='utf8' default instead, so you can specify
> a different codec.
>
>>
>> On what kind of string is this going to be used. If we intend to us this on
>> Mercurial internal identifier only, we can probably assume (and actually,
>> enforce) ascii to keep things simple.
>
> If this is only going to be used for Python identifiers in strings
> (e.g. the string(s) __slots__ accepts) then ASCII is fine, especially
> because we need to keep the code working in both Python 2 and 3 and 2
> only accepts ASCII for identifiers.
>
>
> --
> Martijn Pieters
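To make the two use cases in this exchange concrete, here is a minimal sketch rather than the function that was eventually committed. It assumes the review suggestions are combined: the comment becomes a docstring, the typo in the name is fixed, the codec becomes an `encoding` parameter, and the default is strict ASCII for identifiers. `pathtostr` is a hypothetical name used only for illustration; `os.fsdecode()` is the standard-library call for the path case on Python 3.

```python
import os
import sys


def converttostr(word, encoding='ascii'):
    """Return ``word`` as the native str type of the running Python.

    Intended for source-code literals such as the names in __slots__.
    Strict ASCII is the default; pass ``encoding`` to use another codec.
    """
    if sys.version_info[0] < 3:
        # Python 2: the native str type is already a byte string.
        if not isinstance(word, str):
            raise TypeError("expected str on Python 2")
        return word
    # Python 3: the source transformer is assumed to have turned every
    # literal into bytes, so any other type is treated as a caller bug.
    if not isinstance(word, bytes):
        raise TypeError("expected bytes on Python 3")
    return word.decode(encoding)


def pathtostr(path):
    """Hypothetical helper for the *path* use case (Python 3 only)."""
    # os.fsdecode() applies the filesystem encoding and error handler,
    # which is the codec the original patch reached for.
    return os.fsdecode(path)
```

With the ASCII default, a non-ASCII literal fails loudly with `UnicodeDecodeError` instead of being decoded with whatever codec the filesystem happens to use.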
On Thu, 15 Sep 2016 23:59:59 +0530, Pulkit Goyal wrote:
> On Thu, Sep 15, 2016 at 7:06 PM, Yuya Nishihara <yuya@tcha.org> wrote:
> > On Wed, 14 Sep 2016 22:45:27 +0530, Pulkit Goyal wrote:
> >> # HG changeset patch
> >> # User Pulkit Goyal <7895pulkit@gmail.com>
> >> # Date 1473787789 -19800
> >> # Tue Sep 13 22:59:49 2016 +0530
> >> # Node ID ec133d50af780e84a6a24825b52d433c10f9cd55
> >> # Parent 85bd31515225e7fdf9bd88edde054db2c74a33f8
> >> py3: have an utility function to return string
> >>
> >> There are cases when we need strings and can't use bytes in python 3.
> >> We need an utility function for these cases. I agree that this may not
> >> be the best possible way out. I will be happy if anybody else can suggest
> >> a better approach. We need this functions for os.path.join(),
> >
> > We should stick to bytes for filesystem API, and translate bytes to unicode
> > at VFS layer as necessary.
> >
> > https://www.mercurial-scm.org/wiki/WindowsUTF8Plan
> >
> > (Also, we'll have to disable PEP 528 and 529 on Python 3.6, which will break
> > existing repositories.)
> >
> > https://docs.python.org/3.6/whatsnew/3.6.html
> >
> >> __slots__
> >
> > __slots__ can be considered private data, so just use u''.
> >
> >> and few more things.
> >
> > for instance?
> This function was motivated from Gregory's reply to
> https://www.mercurial-scm.org/pipermail/mercurial-devel/2016-August/086704.html
> , unfortunately I see that he replied to me only so I pasted it here
> https://bpaste.net/show/ab0d3ea39749
>
> I am going through python documentation and there are things like
> __slots__, is_frozen() which accepts str in both py2 and py3. Since
> they are not same, I made this function to get help in such cases. If
> we can use unicodes in __slots__ in py2, than thats good.

Python 2.6-2.7 accepts both str and unicode in general, but mixing them is
disaster so we've never used unicode whenever possible. Unfortunately, Python 3
solved that problem by forcing us to use unicode (named str) everywhere, which
doesn't work in Mercurial because we need to process binary data (including
unix paths) transparently. All inputs and outputs (except for future Windows
file API) should be bytes.

So, if is_frozen() of Py3 doesn't take bytes and Py2 doesn't take unicode,
we'll need a compatibility function like you proposed.

> >> +# This function converts its arguments to strings
> >> +# on the basis of python version. Strings in python 3
> >> +# are unicodes and our transformer converts everything to bytes
> >> +# in python 3. So we need to decode it to unicodes in
> >> +# py3.
> >> +
> >> +def coverttostr(word):
> >> +    if sys.version_info[0] < 3:
> >> +        assert isinstance(word, str), "Not a string in Python 2"
> >> +        return word
> >> +    # Checking word is bytes because we have the transformer, else
> >> +    # raising error
> >> +    assert isinstance(word, bytes), "Should be bytes because of transformer"
> >> +    return word.decode(sys.getfilesystemencoding())
> >
> > Can we assume 'word' was encoded in file-system codec?
>
> Yeah because of the tranformer, we added b'' everywhere.

As Martijn said, that varies on how 'word' was encoded. Python sources would
be latin1 or utf-8 in most cases, but a string read from external world is
different. We assume it as encoding.encoding.
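The "bytes everywhere, translate at the VFS layer" policy described in this reply can be sketched as follows. This is only an illustration for Python 3 on a POSIX-like system: `bytesvfs` is a hypothetical class, not Mercurial's actual vfs, and the Windows file API that the thread treats as a separate case is not handled.

```python
import os


class bytesvfs(object):
    """Hypothetical minimal VFS whose callers only ever pass bytes paths."""

    def __init__(self, base):
        if not isinstance(base, bytes):
            raise TypeError("base path must be bytes")
        self.base = base

    def join(self, *parts):
        # os.path.join() accepts bytes as long as every component is
        # bytes, so no decoding is needed to build the path.
        return os.path.join(self.base, *parts)

    def open(self, path, mode='rb'):
        # The single place where a bytes path is turned into text
        # before it is handed to the operating system.
        return open(os.fsdecode(self.join(path)), mode)
```

Keeping the conversion inside one method means the rest of the code never has to know which string type the platform's file API prefers.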
On Fri, Sep 16, 2016 at 7:16 PM, Yuya Nishihara <yuya@tcha.org> wrote:
> On Thu, 15 Sep 2016 23:59:59 +0530, Pulkit Goyal wrote:
>> On Thu, Sep 15, 2016 at 7:06 PM, Yuya Nishihara <yuya@tcha.org> wrote:
>> > On Wed, 14 Sep 2016 22:45:27 +0530, Pulkit Goyal wrote:
>> >> # HG changeset patch
>> >> # User Pulkit Goyal <7895pulkit@gmail.com>
>> >> # Date 1473787789 -19800
>> >> # Tue Sep 13 22:59:49 2016 +0530
>> >> # Node ID ec133d50af780e84a6a24825b52d433c10f9cd55
>> >> # Parent 85bd31515225e7fdf9bd88edde054db2c74a33f8
>> >> py3: have an utility function to return string
>> >>
>> >> There are cases when we need strings and can't use bytes in python 3.
>> >> We need an utility function for these cases. I agree that this may not
>> >> be the best possible way out. I will be happy if anybody else can suggest
>> >> a better approach. We need this functions for os.path.join(),
>> >
>> > We should stick to bytes for filesystem API, and translate bytes to unicode
>> > at VFS layer as necessary.
>> >
>> > https://www.mercurial-scm.org/wiki/WindowsUTF8Plan
>> >
>> > (Also, we'll have to disable PEP 528 and 529 on Python 3.6, which will break
>> > existing repositories.)
>> >
>> > https://docs.python.org/3.6/whatsnew/3.6.html
>> >
>> >> __slots__
>> >
>> > __slots__ can be considered private data, so just use u''.
>> >
>> >> and few more things.
>> >
>> > for instance?
>> This function was motivated from Gregory's reply to
>> https://www.mercurial-scm.org/pipermail/mercurial-devel/2016-August/086704.html
>> , unfortunately I see that he replied to me only so I pasted it here
>> https://bpaste.net/show/ab0d3ea39749
>>
>> I am going through python documentation and there are things like
>> __slots__, is_frozen() which accepts str in both py2 and py3. Since
>> they are not same, I made this function to get help in such cases. If
>> we can use unicodes in __slots__ in py2, than thats good.
>
> Python 2.6-2.7 accepts both str and unicode in general, but mixing them is
> disaster so we've never used unicode whenever possible. Unfortunately, Python 3
> solved that problem by forcing us to use unicode (named str) everywhere, which
> doesn't work in Mercurial because we need to process binary data (including
> unix paths) transparently. All inputs and outputs (except for future Windows
> file API) should be bytes.
>
> So, if is_frozen() of Py3 doesn't take bytes and Py2 doesn't take unicode,
> we'll need a compatibility function like you proposed.
>
>> >> +# This function converts its arguments to strings
>> >> +# on the basis of python version. Strings in python 3
>> >> +# are unicodes and our transformer converts everything to bytes
>> >> +# in python 3. So we need to decode it to unicodes in
>> >> +# py3.
>> >> +
>> >> +def coverttostr(word):
>> >> +    if sys.version_info[0] < 3:
>> >> +        assert isinstance(word, str), "Not a string in Python 2"
>> >> +        return word
>> >> +    # Checking word is bytes because we have the transformer, else
>> >> +    # raising error
>> >> +    assert isinstance(word, bytes), "Should be bytes because of transformer"
>> >> +    return word.decode(sys.getfilesystemencoding())
>> >
>> > Can we assume 'word' was encoded in file-system codec?
>>
>> Yeah because of the tranformer, we added b'' everywhere.
>
> As Martijn said, that varies on how 'word' was encoded. Python sources would
> be latin1 or utf-8 in most cases, but a string read from external world is
> different. We assume it as encoding.encoding.

Is encoding.encoding public or private. Can I convert it to unicode?
On Sun, 2 Oct 2016 06:36:35 +0530, Pulkit Goyal wrote:
> Is encoding.encoding public or private. Can I convert it to unicode?
No. It's read/written freely. We could cache a unicode variant internally if
that matters, but we would need a setter function to invalidate the cache.
% grep encoding.encoding **/*.py
hgext/convert/convcmd.py: # tolocal() because the encoding.encoding convert()
hgext/convert/convcmd.py: orig_encoding = encoding.encoding
hgext/convert/convcmd.py: encoding.encoding = 'UTF-8'
hgext/convert/cvs.py: self.encoding = encoding.encoding
hgext/convert/gnuarch.py: self.encoding = encoding.encoding
hgext/highlight/__init__.py: mt = ''.join(tmpl('mimetype', encoding=encoding.encoding))
hgext/highlight/__init__.py: mt = ''.join(tmpl('mimetype', encoding=encoding.encoding))
hgext/highlight/highlight.py: text = text.decode(encoding.encoding, 'replace')
hgext/highlight/highlight.py: coloriter = (s.encode(encoding.encoding, 'replace')
hgext/win32mbcs.py:By default, win32mbcs uses encoding.encoding decided by Mercurial.
hgext/win32mbcs.py: _encoding = ui.config('win32mbcs', 'encoding', encoding.encoding)
hgext/zeroconf/__init__.py: return name.encode(encoding.encoding)
mercurial/commands.py: ('', 'encoding', encoding.encoding, _('set the charset encoding'),
mercurial/commands.py: ('', 'encodingmode', encoding.encodingmode,
mercurial/commands.py: fm.write('encoding', _("checking encoding (%s)...\n"), encoding.encoding)
mercurial/commandserver.py: self.cresult.write(encoding.encoding)
mercurial/commandserver.py: hellomsg += 'encoding: ' + encoding.encoding
mercurial/dispatch.py: reason = reason.encode(encoding.encoding, 'replace')
mercurial/dispatch.py: encoding.encoding = options["encoding"]
mercurial/dispatch.py: encoding.encodingmode = options["encodingmode"]
mercurial/encoding.py: >>> encoding.encoding = 'utf-8'
mercurial/encoding.py: >>> t = u.encode(encoding.encoding)
mercurial/hgweb/hgweb_mod.py: 'encoding': encoding.encoding,
mercurial/hgweb/hgweb_mod.py: encoding.encoding = rctx.config('web', 'encoding', encoding.encoding)
mercurial/hgweb/hgweb_mod.py: ctype = tmpl('mimetype', encoding=encoding.encoding)
mercurial/hgweb/hgwebdir_mod.py: encoding.encoding = self.ui.config('web', 'encoding',
mercurial/hgweb/hgwebdir_mod.py: encoding.encoding)
mercurial/hgweb/hgwebdir_mod.py: ctype = tmpl('mimetype', encoding=encoding.encoding)
mercurial/hgweb/hgwebdir_mod.py: "encoding": encoding.encoding,
mercurial/hgweb/webcommands.py: mt += '; charset="%s"' % encoding.encoding
mercurial/i18n.py: _msgcache[message] = u.encode(encoding.encoding, "replace")
mercurial/mail.py: encoding.encoding.lower(), 'utf-8']
mercurial/mail.py: for ics in (encoding.encoding, encoding.fallbackencoding):
mercurial/mail.py: dom = dom.decode(encoding.encoding).encode('idna')
mercurial/minirst.py: >>> encoding.encoding = 'latin1'
mercurial/minirst.py: >>> encoding.encoding = 'shiftjis'
mercurial/minirst.py: utext = text.decode(encoding.encoding)
mercurial/minirst.py: return utext.encode(encoding.encoding)
mercurial/templatefilters.py: uctext = unicode(text[start:], encoding.encoding)
mercurial/templatefilters.py: yield (uctext[:w].encode(encoding.encoding),
mercurial/templatefilters.py: uctext[w:].encode(encoding.encoding))
mercurial/templatefilters.py: text = unicode(text, encoding.encoding, 'replace')
mercurial/util.py: line = line.decode(encoding.encoding, encoding.encodingmode)
mercurial/util.py: initindent = initindent.decode(encoding.encoding, encoding.encodingmode)
mercurial/util.py: hangindent = hangindent.decode(encoding.encoding, encoding.encodingmode)
mercurial/util.py: return wrapper.fill(line).encode(encoding.encoding)
tests/test-context.py: encoding.encoding = enc
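The idea mentioned in this reply, caching a unicode variant of the encoding name behind a setter that invalidates the cache, could look roughly like the sketch below. `encodingstate` and its method names are hypothetical and not part of Mercurial; since the grep output shows many call sites assigning `encoding.encoding` directly, those would all have to go through such a setter for the cache to stay correct.

```python
class encodingstate(object):
    """Hypothetical holder for a bytes encoding name plus a cached text copy."""

    def __init__(self, name=b'ascii'):
        self._name = name
        self._uname = None  # filled lazily by getunicode()

    def get(self):
        # The bytes name, as the existing callers expect.
        return self._name

    def set(self, name):
        # Every write goes through here so the cached text copy
        # can never go stale.
        self._name = name
        self._uname = None

    def getunicode(self):
        # Codec names are ASCII, so a strict ASCII decode is enough.
        if self._uname is None:
            self._uname = self._name.decode('ascii')
        return self._uname
```

A caller on Python 3 would then write something like `text.decode(state.getunicode(), 'replace')` instead of decoding the encoding name at each call site.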
Patch
diff -r 85bd31515225 -r ec133d50af78 mercurial/pycompat.py
--- a/mercurial/pycompat.py Sun Aug 21 13:16:21 2016 +0900
+++ b/mercurial/pycompat.py Tue Sep 13 22:59:49 2016 +0530
@@ -164,3 +164,18 @@
     "SimpleHTTPRequestHandler",
     "CGIHTTPRequestHandler",
 ))
+
+# This function converts its arguments to strings
+# on the basis of python version. Strings in python 3
+# are unicodes and our transformer converts everything to bytes
+# in python 3. So we need to decode it to unicodes in
+# py3.
+
+def coverttostr(word):
+    if sys.version_info[0] < 3:
+        assert isinstance(word, str), "Not a string in Python 2"
+        return word
+    # Checking word is bytes because we have the transformer, else
+    # raising error
+    assert isinstance(word, bytes), "Should be bytes because of transformer"
+    return word.decode(sys.getfilesystemencoding())