Submitter | Eugene Baranov |
---|---|
Date | July 22, 2015, 3:59 p.m. |
Message ID | <67e43f16613874f2c578.1437580788@ADNADTX6400256.eng.citrite.net> |
Permalink | /patch/10054/ |
State | Superseded |
Comments
On Wed, 2015-07-22 at 16:59 +0100, Eugene Baranov wrote:
> # HG changeset patch
> # User Eugene Baranov <eug.baranov@gmail.com>
> # Date 1437580631 -3600
> # Wed Jul 22 16:57:11 2015 +0100
> # Node ID 67e43f16613874f2c5782ddf11e6f3ee17f1b586
> # Parent bcb96f98d9cfade5098a9a12b49714d218441f83
> convert: use original local encoding when converting from Perforce

Since encoding is tricky, I'd like to see a little more rationale here.
Please take a look at:

https://mercurial.selenic.com/wiki/EncodingStrategy

..and tell us briefly what p4's strategy is and why this is a good fit.
Or add a comment.

Also, Frank presumably had a reason for using Latin1 in the first place.
My guess is that he has a utf-8 locale and thus this would break his
setup, which isn't great. So whether something like this can go into
stable isn't clear.
In my environment p4 seems to use the default system locale (windows-1252) for its output, which results in some fancy characters such as "smart quotes" (“ U+201C and ” U+201D) being lost if Latin1 is used.

CC'ed Frank, I wonder if he can have a look.
On Wed, 2015-07-22 at 18:49 +0100, Eugene Baranov wrote:
> In my environment p4 seems to use default system locale (windows-1252)

It generates 1252-encoded bytes on the command line? Is your command
line environment in cp437 or similar?

This is Mercurial's strategy on Windows (ignore the OEM code page, only
pay attention to the ANSI code page) so I like it, but optimally we'd
use UTF-8 for transferring metadata across and a configurable encoding
for filenames.

> for the output which results in some fancy character like "smart
> quotes" (“ 201C and ” 201D) to be lost if Latin1 is used.

When you say 'lost', do you mean they were replaced with '?' or mojibaked?
My 'active code page' is 850, but p4 indeed generates 1252-encoded text.

I've tried to 'convince' p4 to output in UTF-8, but so far I haven't figured out how.

By lost I meant getting replaced with '?'
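The loss described here can be reproduced in isolation. The sketch below is Python 3 and assumes p4 emitted the two smart quotes as the windows-1252 bytes 0x93 and 0x94; the sample string and the variable names are made up for illustration and are not taken from the converter.

```python
# Bytes as p4 would emit them under windows-1252 (assumed sample data).
raw = b"\x93quoted\x94"

# Decoding with the right code page yields the intended smart quotes.
as_cp1252 = raw.decode("cp1252")

# Latin-1 maps 0x93/0x94 to the C1 control characters U+0093/U+0094 instead.
as_latin1 = raw.decode("latin-1")

# Those control characters have no windows-1252 representation, so writing
# them back out with the "replace" error handler turns them into '?'.
displayed = as_latin1.encode("cp1252", "replace")

# The correctly decoded text round-trips to the original bytes.
round_trip = as_cp1252.encode("cp1252")

print(repr(as_cp1252), repr(as_latin1), displayed, round_trip)
```

Latin-1 never fails to decode, which is probably why the mismatch shows up only at display time rather than as an error during conversion.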
Also I checked Frank's Perfarce extension and haven't noticed it relying on Latin1...
On Wed, 2015-07-22 at 23:18 +0100, Eugene Baranov wrote:
> My 'active code page' is 850, but p4 indeed generates 1252-encoded text.
>
> I've tried to 'convince' p4 to output in UTF-8, but so far I haven't
> figured out how.

This suggests the magic is to set P4CHARSET=utf8. And there also appears
to be a -C switch to force the encoding:

http://www.perforce.com/perforce/doc.current/user/i18nnotes.txt

From what I gather, the encoding switch affects:

a) metadata
b) filenames
c) contents of files marked as type 'unicode'

(b) doesn't agree with the Mercurial approach, which treats filenames
themselves as data to be byte-preserved. And if we get UTF8 filenames
out of p4, we're going to have a problem on Windows until this thing is
finished:

https://mercurial.selenic.com/wiki/WindowsUTF8Plan

So we actually might want a "split" approach here: use -C utf8 to
extract metadata and -C <some configured encoding> to extract filenames.

(c) is a bit of a problem: if we have a file named café containing
"café" marked as Unicode, it's not clear that there's a way to ask for
it by name in Latin1/1252 and get its contents back in UTF8.
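A rough sketch of the "split" idea, in Python 3. The helper `p4_argv`, the changelist number, the depot path and the `iso8859-1` charset value are all hypothetical choices for illustration; nothing here runs p4, and whether the server accepts `-C` at all depends on its configuration.

```python
def p4_argv(charset, *args):
    """Hypothetical helper: build a p4 command line that forces a charset.

    -C is the charset switch mentioned above; -G asks p4 for marshalled
    Python output, which hgext/convert/p4.py reads via the marshal module.
    """
    return ["p4", "-C", charset, "-G", *args]


# Metadata (descriptions, user names) would be requested as UTF-8 ...
metadata_cmd = p4_argv("utf8", "describe", "-s", "12345")

# ... while filenames would use a separately configured encoding, so that
# the bytes Mercurial stores match the bytes on disk.
filename_cmd = p4_argv("iso8859-1", "files", "//depot/...")

# Why the filename encoding matters: the same name is a different byte
# string in each encoding, and Mercurial preserves filename bytes as-is.
name = "café"
latin1_bytes = name.encode("latin-1")
utf8_bytes = name.encode("utf-8")

print(metadata_cmd, filename_cmd, latin1_bytes, utf8_bytes)
```

The two byte strings at the end are the crux of point (c): a single p4 invocation uses one charset, so the filename and the file contents cannot be requested in different encodings in the same call.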
On 23/07/15 16:14, Matt Mackall wrote:
> This suggest the magic is to set P4CHARSET=utf8. And there also appears
> to be a -C switch to force the encoding:
>
> http://www.perforce.com/perforce/doc.current/user/i18nnotes.txt

Not a lot of time to think about this today, sorry.

I think setting P4CHARSET to any value at all fails unless the p4 server is
running in "unicode mode", so that may not be useful. The help has this
unhelpful paragraph:

    If P4CHARSET is not set explicitly when connecting to a Unicode mode
    server, a default charset will be chosen based on the client's
    platform and/or code page.

Frank
I indeed tried setting P4CHARSET to utf8 and it fails since my server isn't running in "unicode mode".

Also tried setting P4COMMANDCHARSET (http://www.perforce.com/perforce/doc.current/manuals/cmdref/P4COMMANDCHARSET.html) but that didn't do anything at all.
Patch
diff -r bcb96f98d9cf -r 67e43f166138 hgext/convert/p4.py
--- a/hgext/convert/p4.py	Thu Jul 16 17:57:38 2015 +0100
+++ b/hgext/convert/p4.py	Wed Jul 22 16:57:11 2015 +0100
@@ -9,6 +9,7 @@
 from mercurial.i18n import _
 
 from common import commit, converter_source, checktool, NoRepo
+import convcmd
 import marshal
 import re
 
@@ -139,7 +140,7 @@
         self.tags = {}
         self.lastbranch = {}
         self.parent = {}
-        self.encoding = "latin_1"
+        self.encoding = convcmd.orig_encoding
         self.depotname = {} # mapping from local name to depot name
         self.localname = {} # mapping from depot name to local name
         self.re_type = re.compile(
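To make the effect of the one-line change concrete, here is a simplified Python 3 stand-in for a recoding step that turns p4 output into UTF-8. It is not Mercurial's code: the function `recode`, its fallback behaviour and the sample bytes are assumptions for illustration. It also assumes that `convcmd.orig_encoding` holds the local encoding Mercurial was started with, as the patch subject suggests, and that this is cp1252 in the submitter's environment.

```python
def recode(raw, source_encoding):
    """Illustrative stand-in: decode p4 bytes, then re-encode as UTF-8."""
    try:
        text = raw.decode(source_encoding)
    except UnicodeDecodeError:
        # Fall back to a lossy decode rather than aborting the conversion.
        text = raw.decode(source_encoding, "replace")
    return text.encode("utf-8")


p4_output = b"Fix \x93smart quotes\x94"  # assumed cp1252 bytes from p4

before = recode(p4_output, "latin_1")  # hard-coded encoding, pre-patch
after = recode(p4_output, "cp1252")    # local encoding, post-patch

# The concern raised in review: with a UTF-8 locale, Latin-1 bytes from p4
# are not valid UTF-8, so the accented character degrades to U+FFFD.
utf8_locale = recode(b"caf\xe9", "utf-8")

print(before, after, utf8_locale)
```

Under these assumptions the patch fixes the windows-1252 case but changes behaviour for anyone whose local encoding differs from what p4 actually emits, which is the stable-branch concern raised in the comments.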