Patchwork patch: decode e-mail headers

login
register
mail settings
Submitter funman@videolan.org
Date Oct. 22, 2013, 12:18 p.m.
Message ID <e8c0f97e42ca9e09b400.1382444280@fat>
Download mbox | patch
Permalink /patch/2801/
State Superseded, archived
Headers show

Comments

funman@videolan.org - Oct. 22, 2013, 12:18 p.m.
# HG changeset patch
# User Rafaël Carré <funman@videolan.org>
# Date 1382444275 -7200
#      Tue Oct 22 14:17:55 2013 +0200
# Branch stable
# Node ID e8c0f97e42ca9e09b4000245bd713f03e5d72038
# Parent  2c886dedd9021598b6290d95ea0f068731ea4e2b
patch: decode e-mail headers

    Change commits from:
user:        =?UTF-8?q?Rafa=C3=ABl=20Carr=C3=A9?= <funman@videolan.org>
    to:
user:        Rafaël Carré <funman@videolan.org>
Augie Fackler - Oct. 22, 2013, 6:32 p.m.
On Tue, Oct 22, 2013 at 02:18:00PM +0200, funman@videolan.org wrote:
> # HG changeset patch
> # User Rafaël Carré <funman@videolan.org>
> # Date 1382444275 -7200
> #      Tue Oct 22 14:17:55 2013 +0200
> # Branch stable
> # Node ID e8c0f97e42ca9e09b4000245bd713f03e5d72038
> # Parent  2c886dedd9021598b6290d95ea0f068731ea4e2b
> patch: decode e-mail headers
>
>     Change commits from:
> user:        =?UTF-8?q?Rafa=C3=ABl=20Carr=C3=A9?= <funman@videolan.org>
>     to:
> user:        Rafaël Carré <funman@videolan.org>
>
> diff -r 2c886dedd902 -r e8c0f97e42ca mercurial/patch.py
> --- a/mercurial/patch.py	Mon Oct 21 10:50:58 2013 -0700
> +++ b/mercurial/patch.py	Tue Oct 22 14:17:55 2013 +0200
> @@ -12,6 +12,7 @@
>  # load. This was not a problem on Python 2.7.
>  import email.Generator
>  import email.Parser
> +from email.header import decode_header
>
>  from i18n import _
>  from node import hex, short
> @@ -162,6 +163,25 @@
>      Any item in the returned tuple can be None. If filename is None,
>      fileobj did not contain a patch. Caller must unlink filename when done.'''
>
> +    def header_decode(h):
> +        '''Decode ?=UTF-8? from e-mail headers.'''
> +        if h is None:
> +            return None
> +        res = ''
> +        pairs = decode_header(h)
> +        if pairs is None:
> +            return None
> +        n = len(pairs)
> +        pair = 0
> +        for p in pairs:
> +            pair += 1
> +            if p[1] == 'utf-8' or p[1] is None:
> +                res += p[0]
> +                if pair < n:
> +                    res += ' '
> +
> +        return res
> +
>      # attempt to detect the start of a patch
>      # (this heuristic is borrowed from quilt)
>      diffre = re.compile(r'^(?:Index:[ \t]|diff[ \t]|RCS file: |'
> @@ -174,8 +194,8 @@
>      try:
>          msg = email.Parser.Parser().parse(fileobj)
>
> -        subject = msg['Subject']
> -        user = msg['From']
> +        subject = header_decode(msg['Subject'])
> +        user = header_decode(msg['From'])
>          if not subject and not user:
>              # Not an email, restore parsed headers if any
>              subject = '\n'.join(': '.join(h) for h in msg.items()) + '\n'

Can I get you to add a simple test to one of the existing 'hg import'
test cases so we don't break this in the future?

> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@selenic.com
> http://selenic.com/mailman/listinfo/mercurial-devel
Martin Geisler - Oct. 23, 2013, 7:01 a.m.
funman@videolan.org writes:

> # HG changeset patch
> # User Rafaël Carré <funman@videolan.org>
> # Date 1382444275 -7200
> #      Tue Oct 22 14:17:55 2013 +0200
> # Branch stable
> # Node ID e8c0f97e42ca9e09b4000245bd713f03e5d72038
> # Parent  2c886dedd9021598b6290d95ea0f068731ea4e2b
> patch: decode e-mail headers
>
>     Change commits from:
> user:        =?UTF-8?q?Rafa=C3=ABl=20Carr=C3=A9?= <funman@videolan.org>
>     to:
> user:        Rafaël Carré <funman@videolan.org>
>
> diff -r 2c886dedd902 -r e8c0f97e42ca mercurial/patch.py
> --- a/mercurial/patch.py	Mon Oct 21 10:50:58 2013 -0700
> +++ b/mercurial/patch.py	Tue Oct 22 14:17:55 2013 +0200
> @@ -12,6 +12,7 @@
>  # load. This was not a problem on Python 2.7.
>  import email.Generator
>  import email.Parser
> +from email.header import decode_header

We normally only import modules and reference functions with

  module.function

This is because we have a demand-import system in place that can delay
loading the module until it is really needed. This makes Mercurial start
faster.

>  from i18n import _
>  from node import hex, short
> @@ -162,6 +163,25 @@
>      Any item in the returned tuple can be None. If filename is None,
>      fileobj did not contain a patch. Caller must unlink filename when done.'''
>  
> +    def header_decode(h):
> +        '''Decode ?=UTF-8? from e-mail headers.'''
> +        if h is None:
> +            return None
> +        res = ''
> +        pairs = decode_header(h)

You define a header_decode function and there is apparently already a
decode_header function in scope -- that looks confusing to me :-)

The names are so similar that I wouldn't know what the difference is
between the two functions. Maybe you can come up with a better name for
the new function.

> +        if pairs is None:
> +            return None
> +        n = len(pairs)
> +        pair = 0
> +        for p in pairs:
> +            pair += 1
> +            if p[1] == 'utf-8' or p[1] is None:

Does this mean that you only handle UTF-8 encoded headers? What about
ISO-8859-* encoded headers?
funman@videolan.org - Oct. 23, 2013, 1:23 p.m.
Le 23/10/2013 09:01, Martin Geisler a écrit :
> funman@videolan.org writes:
> 
>> # HG changeset patch
>> # User Rafaël Carré <funman@videolan.org>
>> # Date 1382444275 -7200
>> #      Tue Oct 22 14:17:55 2013 +0200
>> # Branch stable
>> # Node ID e8c0f97e42ca9e09b4000245bd713f03e5d72038
>> # Parent  2c886dedd9021598b6290d95ea0f068731ea4e2b
>> patch: decode e-mail headers
>>
>>     Change commits from:
>> user:        =?UTF-8?q?Rafa=C3=ABl=20Carr=C3=A9?= <funman@videolan.org>
>>     to:
>> user:        Rafaël Carré <funman@videolan.org>
>>
>> diff -r 2c886dedd902 -r e8c0f97e42ca mercurial/patch.py
>> --- a/mercurial/patch.py	Mon Oct 21 10:50:58 2013 -0700
>> +++ b/mercurial/patch.py	Tue Oct 22 14:17:55 2013 +0200
>> @@ -12,6 +12,7 @@
>>  # load. This was not a problem on Python 2.7.
>>  import email.Generator
>>  import email.Parser
>> +from email.header import decode_header
> 
> We normally only import modules and reference functions with
> 
>   module.function
> 
> This is because we have a demand-import system in place that can delay
> loading the module until it is really needed. This makes Mercurial start
> faster.

Alright.

>>  from i18n import _
>>  from node import hex, short

That file has counter examples though :)

>> @@ -162,6 +163,25 @@
>>      Any item in the returned tuple can be None. If filename is None,
>>      fileobj did not contain a patch. Caller must unlink filename when done.'''
>>  
>> +    def header_decode(h):
>> +        '''Decode ?=UTF-8? from e-mail headers.'''
>> +        if h is None:
>> +            return None
>> +        res = ''
>> +        pairs = decode_header(h)
> 
> You define a header_decode function and there is apparently already a
> decode_header function in scope -- that looks confusing to me :-)
> 
> The names are so similar that I wouldn't know what the difference is
> between the two functions. Maybe you can come up with a better name for
> the new function.

Yup, will try to come up with something better indeed.

>> +        if pairs is None:
>> +            return None
>> +        n = len(pairs)
>> +        pair = 0
>> +        for p in pairs:
>> +            pair += 1
>> +            if p[1] == 'utf-8' or p[1] is None:
> 
> Does this mean that you only handle UTF-8 encoded headers? What about
> ISO-8859-* encoded headers?

I can look at other encodings but first I need to solve the problem
mentioned in
http://www.selenic.com/pipermail/mercurial-devel/2013-October/054402.html

That is user and subject are expected to use local encoding and not
utf-8, which causes character loss when local encoding is ASCII (e.g. in
the testsuite), and perhaps in other encodings too.

It seems a bit hard problem for a newcomer to mercurial like me so any
advice is very appreciated here.
Mads Kiilerich - Oct. 23, 2013, 4:44 p.m.
On 10/23/2013 09:23 PM, Rafaël Carré wrote:
> I can look at other encodings but first I need to solve the problem
> mentioned in
> http://www.selenic.com/pipermail/mercurial-devel/2013-October/054402.html
>
> That is user and subject are expected to use local encoding and not
> utf-8, which causes character loss when local encoding is ASCII (e.g. in
> the testsuite), and perhaps in other encodings too.
>
> It seems a bit hard problem for a newcomer to mercurial like me so any
> advice is very appreciated here.

http://mercurial.selenic.com/wiki/EncodingStrategy will give some advice 
in this area.

/Mads
funman@videolan.org - Nov. 12, 2013, 10:13 a.m.
This new patch adds a test which is currently failing.

http://mercurial.selenic.com/wiki/EncodingStrategy#The_encoding_tracking_problem
says user names are 'owned and managed' by Mercurial and should be encoded as UTF-8.

This seems to not be the case, as HGENCODING=ascii (as used by the testsuite) will 
render the user name 'ë' (0xc3 0xab) as '?'
Is encoding.fromlocal being used? I couldn't figure it out.

Having email_decode return an UTF-8 string instead of using tolocal will make the test 
fail with:

transaction abort!
rollback completed
abort: decoding near '\xc3\xab': 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)! (esc)

So it seems that user names are being rendered as ascii instead of UTF-8.

Remarks welcome!

Thanks,
Matt Mackall - Nov. 13, 2013, 5:46 p.m.
On Tue, 2013-11-12 at 11:13 +0100, funman@videolan.org wrote:
> This new patch adds a test which is currently failing.
> 
> http://mercurial.selenic.com/wiki/EncodingStrategy#The_encoding_tracking_problem
> says user names are 'owned and managed' by Mercurial and should be encoded as UTF-8.
> 
> This seems to not be the case, as HGENCODING=ascii (as used by the testsuite) will 
> render the user name 'ë' (0xc3 0xab) as '?'

This is correct: when you tell Mercurial that your terminal only does
ASCII, it can't show you an 'ë'.

User names are stored -on disk- as UTF-8, but manipulated and displayed
in the local character set, whatever that may be. 

> Is encoding.fromlocal being used? I couldn't figure it out.

Yes. It's here:

http://www.selenic.com/hg/file/c38c3fdc8b93/mercurial/changelog.py#l308

> Having email_decode return an UTF-8 string instead of using tolocal will make the test 
> fail with:

You must use tolocal. It is magic and can be made to work _even if the
local encoding is ASCII_.

http://www.selenic.com/hg/file/c38c3fdc8b93/mercurial/encoding.py#l61
http://mercurial.selenic.com/wiki/EncodingStrategy#Round-trip_conversion
funman@videolan.org - Nov. 13, 2013, 9:27 p.m.
Le 13/11/2013 18:46, Matt Mackall a écrit :
> On Tue, 2013-11-12 at 11:13 +0100, funman@videolan.org wrote:
>> This new patch adds a test which is currently failing.
>>
>> http://mercurial.selenic.com/wiki/EncodingStrategy#The_encoding_tracking_problem
>> says user names are 'owned and managed' by Mercurial and should be encoded as UTF-8.
>>
>> This seems to not be the case, as HGENCODING=ascii (as used by the testsuite) will 
>> render the user name 'ë' (0xc3 0xab) as '?'
> 
> This is correct: when you tell Mercurial that your terminal only does
> ASCII, it can't show you an 'ë'.
> 
> User names are stored -on disk- as UTF-8, but manipulated and displayed
> in the local character set, whatever that may be. 
> 
>> Is encoding.fromlocal being used? I couldn't figure it out.
> 
> Yes. It's here:
> 
> http://www.selenic.com/hg/file/c38c3fdc8b93/mercurial/changelog.py#l308
> 
>> Having email_decode return an UTF-8 string instead of using tolocal will make the test 
>> fail with:
> 
> You must use tolocal. It is magic and can be made to work _even if the
> local encoding is ASCII_.
> 
> http://www.selenic.com/hg/file/c38c3fdc8b93/mercurial/encoding.py#l61
> http://mercurial.selenic.com/wiki/EncodingStrategy#Round-trip_conversion

Thanks, I didn't think that terminal output could be affected.

In this case how can I write a test for this feature, if the tests only
output ASCII to the terminal?
Matt Mackall - Dec. 18, 2013, 9:07 p.m.
On Wed, 2013-11-13 at 22:27 +0100, Rafaël Carré wrote:
> Le 13/11/2013 18:46, Matt Mackall a écrit :
> > On Tue, 2013-11-12 at 11:13 +0100, funman@videolan.org wrote:
> >> This new patch adds a test which is currently failing.
> >>
> >> http://mercurial.selenic.com/wiki/EncodingStrategy#The_encoding_tracking_problem
> >> says user names are 'owned and managed' by Mercurial and should be encoded as UTF-8.
> >>
> >> This seems to not be the case, as HGENCODING=ascii (as used by the testsuite) will 
> >> render the user name 'ë' (0xc3 0xab) as '?'
> > 
> > This is correct: when you tell Mercurial that your terminal only does
> > ASCII, it can't show you an 'ë'.
> > 
> > User names are stored -on disk- as UTF-8, but manipulated and displayed
> > in the local character set, whatever that may be. 
> > 
> >> Is encoding.fromlocal being used? I couldn't figure it out.
> > 
> > Yes. It's here:
> > 
> > http://www.selenic.com/hg/file/c38c3fdc8b93/mercurial/changelog.py#l308
> > 
> >> Having email_decode return an UTF-8 string instead of using tolocal will make the test 
> >> fail with:
> > 
> > You must use tolocal. It is magic and can be made to work _even if the
> > local encoding is ASCII_.
> > 
> > http://www.selenic.com/hg/file/c38c3fdc8b93/mercurial/encoding.py#l61
> > http://mercurial.selenic.com/wiki/EncodingStrategy#Round-trip_conversion
> 
> Thanks, I didn't think that terminal output could be affected.
> 
> In this case how can I write a test for this feature, if the tests only
> output ASCII to the terminal?

Please look at tests/test-encoding.t. Not only does it show how you can
use arbitrary encodings in tests, it also gives a good idea of what the
behavior looks like when you switch encodings.

Patch

diff -r 2c886dedd902 -r e8c0f97e42ca mercurial/patch.py
--- a/mercurial/patch.py	Mon Oct 21 10:50:58 2013 -0700
+++ b/mercurial/patch.py	Tue Oct 22 14:17:55 2013 +0200
@@ -12,6 +12,7 @@ 
 # load. This was not a problem on Python 2.7.
 import email.Generator
 import email.Parser
+from email.header import decode_header
 
 from i18n import _
 from node import hex, short
@@ -162,6 +163,25 @@ 
     Any item in the returned tuple can be None. If filename is None,
     fileobj did not contain a patch. Caller must unlink filename when done.'''
 
+    def header_decode(h):
+        '''Decode ?=UTF-8? from e-mail headers.'''
+        if h is None:
+            return None
+        res = ''
+        pairs = decode_header(h)
+        if pairs is None:
+            return None
+        n = len(pairs)
+        pair = 0
+        for p in pairs:
+            pair += 1
+            if p[1] == 'utf-8' or p[1] is None:
+                res += p[0]
+                if pair < n:
+                    res += ' '
+
+        return res
+
     # attempt to detect the start of a patch
     # (this heuristic is borrowed from quilt)
     diffre = re.compile(r'^(?:Index:[ \t]|diff[ \t]|RCS file: |'
@@ -174,8 +194,8 @@ 
     try:
         msg = email.Parser.Parser().parse(fileobj)
 
-        subject = msg['Subject']
-        user = msg['From']
+        subject = header_decode(msg['Subject'])
+        user = header_decode(msg['From'])
         if not subject and not user:
             # Not an email, restore parsed headers if any
             subject = '\n'.join(': '.join(h) for h in msg.items()) + '\n'