Patchwork [v2] patch: when importing from email, RFC2047-decode From/Subject headers

Submitter Julien Cristau
Date March 3, 2016, 8:44 p.m.
Message ID <20160303204426.GL6200@betterave.cristau.org>
Permalink /patch/13585/
State Accepted
Delegated to: Yuya Nishihara

Comments

Julien Cristau - March 3, 2016, 8:44 p.m.
On Thu, Mar  3, 2016 at 12:49:22 -0600, Matt Mackall wrote:

> On Thu, 2016-03-03 at 18:55 +0100, Julien Cristau wrote:
> > # HG changeset patch
> > # User Julien Cristau <julien.cristau@logilab.fr>
> > # Date 1457026459 -3600
> > #      Thu Mar 03 18:34:19 2016 +0100
> > # Node ID 6c153cbad4a032861417dbba9d1d90332964ab5f
> > # Parent  549ff28a345f595cad7e06fb08c2ac6973e2f030
> > patch: when importing from email, RFC2047-decode From/Subject headers
> > 
> > I'm not too sure about the Subject part: it should be possible to use
> > the charset information from the email (RFC2047 encoding and the
> > Content-Type header), but mercurial seems to use its own encoding
> > instead (in the test, that means the commit message ends up as "????"
> > if the import is done without --encoding utf-8).  Advice welcome.
> > 
> > Reported at https://bugs.debian.org/737498
> 
> You should probably immediately relay such reports upstream.
> 
Indeed.  I spent some time tidying https://bugs.debian.org/src:mercurial
today, and out of the remaining bugs (other than this one), one is a
packaging issue, three are 6-year-old zeroconf extension issues (I know
nothing about that extension), another is a 6-year-old demandimport
performance issue which should probably just be closed at this point,
and the rest are either already forwarded to hg bz or marked wontfix.

New attempt at a fix below, which should address your comments.  Changes
in v2:
- moved decoding to a new mercurial.mail.headdecode function
- fall back to utf-8 and then latin1 instead of ascii
- renamed the parts variable to uparts, as it contains unicode objects

Thanks,
Julien

# HG changeset patch
# User Julien Cristau <julien.cristau@logilab.fr>
# Date 1457026459 -3600
#      Thu Mar 03 18:34:19 2016 +0100
# Node ID 981e5fd56a9973e0069173b5f6c03639d9e176aa
# Parent  e00e57d836535aadcb13337613d2f891492d8e04
patch: when importing from email, RFC2047-decode From/Subject headers

Reported at https://bugs.debian.org/737498
timeless - March 3, 2016, 10:32 p.m.
Julien Cristau <jcristau@debian.org> wrote:
>
>> > Reported at https://bugs.debian.org/737498
>>
>> You should probably immediately relay such reports upstream.

You should actually file a bug in bz.mercurial-scm.org for this issue;
you should then be able to change the bts bug to link to it.
Once you do that, you should use the issue number in the commit
description (see check-commit).
Yuya Nishihara - March 5, 2016, 9:29 a.m.
On Thu, 3 Mar 2016 21:44:26 +0100, Julien Cristau wrote:
> # HG changeset patch
> # User Julien Cristau <julien.cristau@logilab.fr>
> # Date 1457026459 -3600
> #      Thu Mar 03 18:34:19 2016 +0100
> # Node ID 981e5fd56a9973e0069173b5f6c03639d9e176aa
> # Parent  e00e57d836535aadcb13337613d2f891492d8e04
> patch: when importing from email, RFC2047-decode From/Subject headers

Looks good. Pushed to the clowncopter, thanks.

> +def headdecode(s):
> +    '''Decodes RFC-2047 header'''
> +    uparts = []
> +    for part, charset in email.Header.decode_header(s):
> +        if charset is not None:
> +            try:
> +                uparts.append(part.decode(charset))
> +                continue
> +            except UnicodeDecodeError:
> +                pass
> +        try:
> +            uparts.append(part.decode('UTF-8'))
> +            continue
> +        except UnicodeDecodeError:
> +            pass
> +        uparts.append(part.decode('ISO-8859-1'))

FWIW, the email.charsets config option might be useful as a source of
fallback charsets.

https://www.selenic.com/mercurial/hgrc.5.html#email

Patch

diff --git a/mercurial/mail.py b/mercurial/mail.py
--- a/mercurial/mail.py
+++ b/mercurial/mail.py
@@ -332,3 +332,21 @@  def mimeencode(ui, s, charsets=None, dis
     if not display:
         s, cs = _encode(ui, s, charsets)
     return mimetextqp(s, 'plain', cs)
+
+def headdecode(s):
+    '''Decodes RFC-2047 header'''
+    uparts = []
+    for part, charset in email.Header.decode_header(s):
+        if charset is not None:
+            try:
+                uparts.append(part.decode(charset))
+                continue
+            except UnicodeDecodeError:
+                pass
+        try:
+            uparts.append(part.decode('UTF-8'))
+            continue
+        except UnicodeDecodeError:
+            pass
+        uparts.append(part.decode('ISO-8859-1'))
+    return encoding.tolocal(u' '.join(uparts).encode('UTF-8'))
diff --git a/mercurial/patch.py b/mercurial/patch.py
--- a/mercurial/patch.py
+++ b/mercurial/patch.py
@@ -31,6 +31,7 @@  from . import (
     diffhelpers,
     encoding,
     error,
+    mail,
     mdiff,
     pathutil,
     scmutil,
@@ -210,8 +211,8 @@  def extract(ui, fileobj):
     try:
         msg = email.Parser.Parser().parse(fileobj)
 
-        subject = msg['Subject']
-        data['user'] = msg['From']
+        subject = msg['Subject'] and mail.headdecode(msg['Subject'])
+        data['user'] = msg['From'] and mail.headdecode(msg['From'])
         if not subject and not data['user']:
             # Not an email, restore parsed headers if any
             subject = '\n'.join(': '.join(h) for h in msg.items()) + '\n'
diff --git a/tests/test-import-git.t b/tests/test-import-git.t
--- a/tests/test-import-git.t
+++ b/tests/test-import-git.t
@@ -822,4 +822,27 @@  Test corner case involving copies and mu
   > EOF
   applying patch from stdin
 
+Test email metadata
+
+  $ hg revert -qa
+  $ hg --encoding utf-8 import - <<EOF
+  > From: =?UTF-8?q?Rapha=C3=ABl=20Hertzog?= <hertzog@debian.org>
+  > Subject: [PATCH] =?UTF-8?q?=C5=A7=E2=82=AC=C3=9F=E1=B9=AA?=
+  > 
+  > diff --git a/a b/a
+  > --- a/a
+  > +++ b/a
+  > @@ -1,1 +1,2 @@
+  >  a
+  > +a
+  > EOF
+  applying patch from stdin
+  $ hg --encoding utf-8 log -r .
+  changeset:   2:* (glob)
+  tag:         tip
+  user:        Rapha\xc3\xabl Hertzog <hertzog@debian.org> (esc)
+  date:        * (glob)
+  summary:     \xc5\xa7\xe2\x82\xac\xc3\x9f\xe1\xb9\xaa (esc)
+  
+
   $ cd ..
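
For readers who want to try the decoding strategy outside Mercurial, here
is a minimal standalone sketch of the same fallback chain (declared
charset, then UTF-8, then ISO-8859-1). It is not the committed code: it
assumes Python 3.3 or later rather than the Python 2 used in the patch,
returns a str instead of calling encoding.tolocal(), and joins the parts
without a separator because Python 3's decode_header() keeps the
whitespace around unencoded text. The fallbacks parameter is hypothetical;
it only illustrates where a list such as the one from email.charsets could
be plugged in. Catching LookupError for unknown charset names is also an
addition that the patch does not have.

```python
from email.header import decode_header


def headdecode(s, fallbacks=('utf-8',)):
    """Decode an RFC 2047 header value to str (illustrative sketch)."""
    uparts = []
    for part, charset in decode_header(s):
        if isinstance(part, str):
            # With no encoded words in the header, decode_header()
            # hands the input back unchanged as str.
            uparts.append(part)
            continue
        # Try the charset declared in the encoded word first, then the
        # (hypothetical) caller-supplied fallbacks.
        candidates = ([charset] if charset else []) + list(fallbacks)
        for cs in candidates:
            try:
                uparts.append(part.decode(cs))
                break
            except (UnicodeDecodeError, LookupError):
                continue
        else:
            # ISO-8859-1 maps every byte value, so this cannot fail.
            uparts.append(part.decode('iso-8859-1'))
    return ''.join(uparts)


# The headers used in the new test case.
sender = headdecode(
    '=?UTF-8?q?Rapha=C3=ABl=20Hertzog?= <hertzog@debian.org>')
subject = headdecode('[PATCH] =?UTF-8?q?=C5=A7=E2=82=AC=C3=9F=E1=B9=AA?=')
```

The final ISO-8859-1 step is what guarantees that a header with a wrong or
missing charset label still yields some text instead of raising an error,
at the cost of possibly producing mojibake.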