Patchwork patch: decode e-mail headers

Submitter funman@videolan.org
Date Oct. 23, 2013, 12:25 a.m.
Message ID <52671784.1030103@videolan.org>
Permalink /patch/2805/
State Superseded, archived

Comments

funman@videolan.org - Oct. 23, 2013, 12:25 a.m.
On 22/10/2013 20:32, Augie Fackler wrote:
> On Tue, Oct 22, 2013 at 02:18:00PM +0200, funman@videolan.org wrote:
>> # HG changeset patch
>> # User Rafaël Carré <funman@videolan.org>
>> # Date 1382444275 -7200
>> #      Tue Oct 22 14:17:55 2013 +0200
>> # Branch stable
>> # Node ID e8c0f97e42ca9e09b4000245bd713f03e5d72038
>> # Parent  2c886dedd9021598b6290d95ea0f068731ea4e2b
>> patch: decode e-mail headers
>>
>>     Change commits from:
>> user:        =?UTF-8?q?Rafa=C3=ABl=20Carr=C3=A9?= <funman@videolan.org>
>>     to:
>> user:        Rafaël Carré <funman@videolan.org>
>>
>> diff -r 2c886dedd902 -r e8c0f97e42ca mercurial/patch.py
>> --- a/mercurial/patch.py	Mon Oct 21 10:50:58 2013 -0700
>> +++ b/mercurial/patch.py	Tue Oct 22 14:17:55 2013 +0200
>> @@ -12,6 +12,7 @@
>>  # load. This was not a problem on Python 2.7.
>>  import email.Generator
>>  import email.Parser
>> +from email.header import decode_header
>>
>>  from i18n import _
>>  from node import hex, short
>> @@ -162,6 +163,25 @@
>>      Any item in the returned tuple can be None. If filename is None,
>>      fileobj did not contain a patch. Caller must unlink filename when done.'''
>>
>> +    def header_decode(h):
>> +        '''Decode ?=UTF-8? from e-mail headers.'''
>> +        if h is None:
>> +            return None
>> +        res = ''
>> +        pairs = decode_header(h)
>> +        if pairs is None:
>> +            return None
>> +        n = len(pairs)
>> +        pair = 0
>> +        for p in pairs:
>> +            pair += 1
>> +            if p[1] == 'utf-8' or p[1] is None:
>> +                res += p[0]
>> +                if pair < n:
>> +                    res += ' '
>> +
>> +        return res
>> +
>>      # attempt to detect the start of a patch
>>      # (this heuristic is borrowed from quilt)
>>      diffre = re.compile(r'^(?:Index:[ \t]|diff[ \t]|RCS file: |'
>> @@ -174,8 +194,8 @@
>>      try:
>>          msg = email.Parser.Parser().parse(fileobj)
>>
>> -        subject = msg['Subject']
>> -        user = msg['From']
>> +        subject = header_decode(msg['Subject'])
>> +        user = header_decode(msg['From'])
>>          if not subject and not user:
>>              # Not an email, restore parsed headers if any
>>              subject = '\n'.join(': '.join(h) for h in msg.items()) + '\n'
> 
> Can I get you to add a simple test to one of the existing 'hg import'
> test cases so we don't break this in the future?
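
For readers following the patch above, here is a minimal Python 3 sketch of RFC 2047 header decoding with `email.header.decode_header`. It is not the patch's code: the patch targets Python 2 and keeps only UTF-8 or untagged parts, while this sketch assumes every part should be decoded with its declared charset. The name `decode_mail_header` and the `fallback` parameter are hypothetical.

```python
from email.header import decode_header


def decode_mail_header(raw, fallback="utf-8"):
    """Return `raw` with RFC 2047 encoded words decoded to text.

    Hypothetical helper for illustration; `fallback` is the charset
    assumed for parts that declare none or declare an unknown one.
    """
    if raw is None:
        return None
    parts = []
    for chunk, charset in decode_header(raw):
        if isinstance(chunk, bytes):
            # Headers containing encoded words come back as bytes chunks.
            try:
                chunk = chunk.decode(charset or fallback, "replace")
            except LookupError:
                # Unknown charset name: fall back rather than drop the part.
                chunk = chunk.decode(fallback, "replace")
        # Headers without encoded words come back as a single str chunk.
        parts.append(chunk)
    return "".join(parts)


print(decode_mail_header("=?UTF-8?q?=C3=AB?="))  # ë
print(decode_mail_header(
    "=?UTF-8?q?Rafa=C3=ABl=20Carr=C3=A9?= <funman@videolan.org>"))
```

Decoding with the declared charset, rather than skipping non-UTF-8 parts, avoids silently losing text from headers written in other encodings.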

Hi, here's the test:

  $ cat > utf8.patch <<EOF
  > From: =?UTF-8?q?=C3=AB?=
  > Subject: patch
  > diff --git /dev/null b/a
  > --- /dev/null
  > +++ b/a
  > @@ -0,0 +1,1 @@
  > +a
  > EOF
  $ hg init utf
  $ cd utf
  $ hg import ../utf8.patch
  $ hg log | grep ^user -
  user:        ë

It currently fails with:

-  user:        ë
+  [1]


Something goes bad in hg import with LANG=C and I'm not sure why.

Why is the backtrace hidden here?

Augie Fackler - Oct. 23, 2013, 12:26 a.m.
On Oct 22, 2013, at 8:25 PM, Rafaël Carré <funman@videolan.org> wrote:

>  $ hg init utf
>  $ cd utf
>  $ hg import ../utf8.patch
>  $ hg log | grep ^user -
>  user:        ë
> 
> It currently fails with:
> 
> --- /media/dev/hg/tests/test-import.t
> +++ /media/dev/hg/tests/test-import.t.err
> @@ -1169,5 +1169,10 @@
>   $ hg init utf
>   $ cd utf
>   $ hg import ../utf8.patch
> +  applying ../utf8.patch
> +  transaction abort!
> +  rollback completed
> +  abort: decoding near '\xc3\xab': 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)! (esc)
> +  [255]
>   $ hg log | grep ^user -
> -  user:        ë
> +  [1]
> 
> 
> Something goes bad in hg import with LANG=C and I'm not sure why.
> 
> Why is the backtrace hidden here?

Pass --debug --traceback to the hg import call and you'll get all the gory details.

funman@videolan.org - Oct. 23, 2013, 12:38 a.m.
On 23/10/2013 02:26, Augie Fackler wrote:
> 
> On Oct 22, 2013, at 8:25 PM, Rafaël Carré <funman@videolan.org>
> wrote:
> 
>>  $ hg init utf
>>  $ cd utf
>>  $ hg import ../utf8.patch
>>  $ hg log | grep ^user -
>>  user:        ë
>> 
>> It currently fails with:
>> 
>> --- /media/dev/hg/tests/test-import.t
>> +++ /media/dev/hg/tests/test-import.t.err
>> @@ -1169,5 +1169,10 @@
>>   $ hg init utf
>>   $ cd utf
>>   $ hg import ../utf8.patch
>> +  applying ../utf8.patch
>> +  transaction abort!
>> +  rollback completed
>> +  abort: decoding near '\xc3\xab': 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)! (esc)
>> +  [255]
>>   $ hg log | grep ^user -
>> -  user:        ë
>> +  [1]
>> 
>> 
>> Something goes bad in hg import with LANG=C and I'm not sure
>> why.

hg attempts to convert a string that is already UTF-8 from the user
locale (ASCII) to UTF-8.

HGENCODING=UTF-8 fixes this, of course, just as it worked when UTF-8
was guessed from my LANG=fr_FR.UTF-8.

The problem is that the encoding is specified by the patch itself and
not by the user environment.

What should be done here?

If I use encoding.tolocal() there is no crash anymore, but I see '?'
instead of 'ë'.
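
To make the two symptoms above concrete, here is a minimal Python 3 sketch. It does not use Mercurial's code: `to_local` is a hypothetical stand-in that assumes the conversion is "decode as UTF-8, then re-encode for the locale with replacement", which is only an approximation of what encoding.tolocal() does.

```python
# UTF-8 bytes for 'ë', as found in the decoded From header.
raw = b"\xc3\xab"

# With LANG=C the locale encoding is ASCII, so reading the bytes as
# locale-encoded text fails on the very first byte.
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    print(exc)


def to_local(utf8_bytes, local_encoding):
    """Hypothetical helper: re-encode UTF-8 bytes for the local encoding,
    substituting '?' for characters the locale cannot represent."""
    return utf8_bytes.decode("utf-8").encode(local_encoding, "replace")


print(to_local(raw, "ascii"))   # b'?'
print(to_local(raw, "utf-8"))   # b'\xc3\xab'
```

Under these assumptions the '?' is expected: an ASCII locale has no representation for 'ë', so the choice is between failing and substituting.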

>> Why is the backtrace hidden here?
> 
> pass --debug --traceback to the hg import call and you'll get all
> the gory details.

Thanks

Patch

--- /media/dev/hg/tests/test-import.t
+++ /media/dev/hg/tests/test-import.t.err
@@ -1169,5 +1169,10 @@ 
   $ hg init utf
   $ cd utf
   $ hg import ../utf8.patch
+  applying ../utf8.patch
+  transaction abort!
+  rollback completed
+  abort: decoding near '\xc3\xab': 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)! (esc)
+  [255]
   $ hg log | grep ^user -