Patchwork [stable,v2] convert: use original local encoding when converting from Perforce

Submitter Eugene Baranov
Date July 22, 2015, 3:59 p.m.
Message ID <67e43f16613874f2c578.1437580788@ADNADTX6400256.eng.citrite.net>
Permalink /patch/10054/
State Superseded

Comments

Eugene Baranov - July 22, 2015, 3:59 p.m.
# HG changeset patch
# User Eugene Baranov <eug.baranov@gmail.com>
# Date 1437580631 -3600
#      Wed Jul 22 16:57:11 2015 +0100
# Node ID 67e43f16613874f2c5782ddf11e6f3ee17f1b586
# Parent  bcb96f98d9cfade5098a9a12b49714d218441f83
convert: use original local encoding when converting from Perforce
Matt Mackall - July 22, 2015, 5:15 p.m.
On Wed, 2015-07-22 at 16:59 +0100, Eugene Baranov wrote:
> # HG changeset patch
> # User Eugene Baranov <eug.baranov@gmail.com>
> # Date 1437580631 -3600
> #      Wed Jul 22 16:57:11 2015 +0100
> # Node ID 67e43f16613874f2c5782ddf11e6f3ee17f1b586
> # Parent  bcb96f98d9cfade5098a9a12b49714d218441f83
> convert: use original local encoding when converting from Perforce

Since encoding is tricky, I'd like to see a little more rationale here.
Please take a look at:

https://mercurial.selenic.com/wiki/EncodingStrategy

..and tell us briefly what p4's strategy is and why this is a good fit.
Or add a comment.

Also, Frank presumably had a reason for using Latin1 in the first place.
My guess is that he has a utf-8 locale and thus this would break his
setup, which isn't great. So whether something like this can go into
stable isn't clear.
Eugene Baranov - July 22, 2015, 5:49 p.m.
In my environment p4 seems to use the default system locale
(windows-1252) for its output, which causes some fancy characters such
as "smart quotes" (“ U+201C and ” U+201D) to be lost if Latin1 is used.

CC'ed Frank, I wonder if he can have a look.

Matt Mackall - July 22, 2015, 7:13 p.m.
On Wed, 2015-07-22 at 18:49 +0100, Eugene Baranov wrote:
> In my environment p4 seems to use default system locale (windows-1252)

It generates 1252-encoded bytes on the command line? Is your command
line environment in cp437 or similar? This is Mercurial's strategy on
Windows (ignore the OEM code page, only pay attention to the ANSI code
page) so I like it, but optimally we'd use UTF-8 for transferring
metadata across and a configurable encoding for filenames.

> for the output which results in some fancy character like "smart
> quotes" (“ 201C and ” 201D) to be lost if Latin1 is used.

When you say 'lost', do you mean they were replaced with '?' or mojibaked?
Eugene Baranov - July 22, 2015, 10:18 p.m.
My 'active code page' is 850, but p4 indeed generates 1252-encoded text.

I've tried to 'convince' p4 to output in UTF-8, but so far I haven't
figured out how.

By lost I meant getting replaced with '?'
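
The two failure modes discussed above ('?' replacement versus mojibake)
can be reproduced outside Mercurial. This is only a sketch in Python 3:
the sample string is made up, and it is an assumption that the convert
extension loses the characters in exactly this way.

```python
# Bytes as p4 is reported to emit them in this thread: cp1252-encoded text.
raw = "\u201cquoted\u201d".encode("cp1252")  # 0x93 ... 0x94

# Decoding with the matching codec recovers the smart quotes.
as_cp1252 = raw.decode("cp1252")

# Latin-1 never raises, but maps 0x93/0x94 to C1 control characters
# (U+0093/U+0094) instead of U+201C/U+201D.
as_latin1 = raw.decode("latin-1")

# cp1252 has no slot for those control characters, so writing the text
# back out with a lossy error handler substitutes '?'.
lost = as_latin1.encode("cp1252", errors="replace")

# Decoding with the OEM code page 850 instead would give mojibake.
as_cp850 = raw.decode("cp850")

print(repr(as_cp1252), repr(as_latin1), lost, repr(as_cp850))
```

Latin-1 assigns a character to every byte value, so the wrong decode
never raises an error and the damage only shows up later.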

Eugene Baranov - July 22, 2015, 10:23 p.m.
Also I checked Frank's Perfarce extension and haven't noticed it
relying on Latin1...

Matt Mackall - July 23, 2015, 3:14 p.m.
On Wed, 2015-07-22 at 23:18 +0100, Eugene Baranov wrote:
> My 'active code page' is 850, but p4 indeed generates 1252-encoded text.
> 
> I've tried to 'convince' p4 to output in UTF-8, but so far I haven't
> figured out how.

This suggests the magic is to set P4CHARSET=utf8. And there also appears
to be a -C switch to force the encoding:

http://www.perforce.com/perforce/doc.current/user/i18nnotes.txt

From what I gather, the encoding switch affects:

a) metadata
b) filenames
c) contents of files marked as type 'unicode'

(b) doesn't agree with the Mercurial approach, which treats filenames
themselves as data to be byte-preserved. And if we get UTF8 filenames
out of p4, we're going to have a problem on Windows until this thing is
finished:

https://mercurial.selenic.com/wiki/WindowsUTF8Plan

So we actually might want a "split" approach here: use -C utf8 to
extract metadata and -C <some configured encoding> to extract filenames.

(c) is a bit of a problem: if we have a file named café containing
"café" marked as Unicode, it's not clear that there's a way to ask for
it by name in Latin1/1252 and get its contents back in UTF8.
Frank Kingswood - July 23, 2015, 3:49 p.m.
On 23/07/15 16:14, Matt Mackall wrote:
> On Wed, 2015-07-22 at 23:18 +0100, Eugene Baranov wrote:
>> My 'active code page' is 850, but p4 indeed generates 1252-encoded text.
>>
>> I've tried to 'convince' p4 to output in UTF-8, but so far I haven't
>> figured out how.
> This suggests the magic is to set P4CHARSET=utf8. And there also appears
> to be a -C switch to force the encoding:
>
> http://www.perforce.com/perforce/doc.current/user/i18nnotes.txt
Not a lot of time to think about this today, sorry.

I think setting P4CHARSET to any value at all fails unless the p4 server 
is running in "unicode mode", so that may not be useful. The help has 
this unhelpful paragraph:

     If P4CHARSET is not set explicitly when connecting to a Unicode mode
     server, a default charset will be chosen based on the client's
     platform and/or code page.

Frank
Eugene Baranov - July 23, 2015, 5:13 p.m.
I indeed tried setting P4CHARSET to utf8 and it fails since my server
isn't running in "unicode mode".

Also tried setting P4COMMANDCHARSET
(http://www.perforce.com/perforce/doc.current/manuals/cmdref/P4COMMANDCHARSET.html)
but that didn't do anything at all.


Patch

diff -r bcb96f98d9cf -r 67e43f166138 hgext/convert/p4.py
--- a/hgext/convert/p4.py	Thu Jul 16 17:57:38 2015 +0100
+++ b/hgext/convert/p4.py	Wed Jul 22 16:57:11 2015 +0100
@@ -9,6 +9,7 @@ 
 from mercurial.i18n import _
 
 from common import commit, converter_source, checktool, NoRepo
+import convcmd
 import marshal
 import re
 
@@ -139,7 +140,7 @@ 
         self.tags = {}
         self.lastbranch = {}
         self.parent = {}
-        self.encoding = "latin_1"
+        self.encoding = convcmd.orig_encoding
         self.depotname = {}           # mapping from local name to depot name
         self.localname = {} # mapping from depot name to local name
         self.re_type = re.compile(