Patchwork mdiff: split on unicode character boundaries when shortening function name

Submitter Josef 'Jeff' Sipek
Date Feb. 21, 2018, 10:56 p.m.
Message ID <20180221225647.GC6374@meili>
Permalink /patch/28219/
State New


Josef 'Jeff' Sipek - Feb. 22, 2018, 4:59 p.m.
On Fri, Feb 23, 2018 at 01:06:28 +0900, Yuya Nishihara wrote:
> On Thu, 22 Feb 2018 10:01:00 -0500, Josef 'Jeff' Sipek wrote:
> > Yeah... I thought that might be an issue.  The code in the 'except' is meant
> > as best-effort -
> Ok, I didn't notice that. It's indeed better to catch the UnicodeError.
> That said, UTF-8 is a well-designed encoding; we can easily find the nearest
> multi-byte character boundary by looking back a couple of bytes.

Right, but isn't this code required to handle any-to-any situation?  That
is, the versioned data can be in any encoding, and the terminal can be in
any encoding.  Currently, the code "handles" it by just copying bytes.  This
obviously breaks down the moment multi-byte characters show up.

UTF-8 being resilient is a good thing, but IMO that justifies leaving the
code alone.

I don't know if there is some weird variable-length encoding (other than
UTF-8) out there that hg needs to handle.
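Yuya's boundary-seeking idea above can be sketched as follows. This is a
hypothetical helper (`truncate_utf8` is not part of the patch), shown in
Python 3 for clarity, and it assumes the input is valid UTF-8: continuation
bytes always match 0b10xxxxxx, so stepping back over them lands on a
character boundary.

```python
def truncate_utf8(data, limit):
    """Truncate UTF-8 bytes to at most `limit` bytes without
    splitting a multi-byte character.  Assumes valid UTF-8 input."""
    if len(data) <= limit:
        return data
    end = limit
    # data[end] is the first byte past the cut; if it is a
    # continuation byte (0x80-0xBF), the character straddles the
    # boundary, so back up to the character's lead byte.
    while end > 0 and 0x80 <= data[end] <= 0xbf:
        end -= 1
    return data[:end]
```

Because a UTF-8 character is at most four bytes, the loop only ever looks
back a few bytes, as Yuya notes.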

> > if there is any UTF-8 issue decoding/encoding, just fall
> > back to previous method.  That of course wouldn't help if the input happened
> > to be valid UTF-8 but wasn't actually UTF-8.
> > 
> > I had to do the encode step, otherwise I got a giant stack trace saying that
> > unicode strings cannot be <something I don't remember> using ascii encoder.
> > (Leaving it un-encoded would also mean that this for loop would output
> > either a unicode string or a raw string - which seems unclean.)
> > 
> > I'm not really sure how to proceed.  Most UTF-8 decoders should handle the
> > illegal byte sequence ok, but it still feels wrong to let it make a mess of
> > valid data.  The answer might be to just ignore this issue.  :|
> As an old Linux user, I would say yeah, don't bother about non-ASCII
> characters; it's just bytes. Alternatively, maybe we could treat it as a
> UTF-8 sequence and find a possible boundary, but I'm not sure if it's a
> good idea.

As in: implement a UTF-8 decoder to "seek" to the right place?  Eh.

I'm looking forward to the day when everything is only Unicode, but that'll
be a while...
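To illustrate the problem the patch below addresses: a naive byte slice can
cut through a multi-byte character, leaving an invalid UTF-8 suffix, while
slicing in code points and re-encoding keeps characters whole. A minimal
sketch (Python 3 shown for clarity; the sample string is illustrative, and
the fallback mirrors the patch's best-effort behavior):

```python
line = '日本語 function_name'.encode('utf-8')

# Naive byte slicing: the cut lands mid-character, so the result
# no longer decodes as UTF-8.
raw = line[:4]
try:
    raw.decode('utf-8')
    valid = True
except UnicodeError:
    valid = False

# The patch's approach: decode, slice in code points, re-encode.
# If the data is not valid UTF-8, fall back to byte slicing.
try:
    safe = line.decode('utf-8')[:4].encode('utf-8')
except UnicodeError:
    safe = line[:4]
```

Note that `safe` may be longer than 4 bytes here, since the limit is applied
in characters rather than bytes, which matches what the patch does with its
40-character cap.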



diff --git a/mercurial/ b/mercurial/
--- a/mercurial/
+++ b/mercurial/
@@ -348,7 +348,12 @@  def _unidiff(t1, t2, opts=defaultopts):
             # alphanumeric char.
             for i in xrange(astart - 1, lastpos - 1, -1):
                 if l1[i][0:1].isalnum():
-                    func = ' ' + l1[i].rstrip()[:40]
+                    func = l1[i].rstrip()
+                    try:
+                        func = func.decode("utf-8")[:40].encode("utf-8")
+                        except UnicodeError:
+                        func = func[:40]
+                    func = ' ' + func
                     lastfunc[1] = func
             # by recording this hunk's starting point as the next place to