Patchwork [1,of,8,STABLE] encoding: add 'trim' to trim multi-byte characters at most specified columns

Submitter Katsunori FUJIWARA
Date June 13, 2014, 4:22 p.m.
Message ID <b0986d208bd4dc9a5043.1402676549@feefifofum>
Permalink /patch/4989/
State Superseded

Comments

Katsunori FUJIWARA - June 13, 2014, 4:22 p.m.
# HG changeset patch
# User FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
# Date 1402675901 -32400
#      Sat Jun 14 01:11:41 2014 +0900
# Branch stable
# Node ID b0986d208bd4dc9a5043326457fbb6dd5c3e25d6
# Parent  14560418856dbd2b1b5d0bf1b4ae3bceffc4eef0
encoding: add 'trim' to trim multi-byte characters at most specified columns

The newly added 'trim' correctly trims a string containing multi-byte
characters to at most the specified number of columns. Direct slicing
of a byte sequence should be replaced with 'encoding.trim', because
slicing may split the string in the middle of a multi-byte sequence.

Slicing the unicode sequence ('uslice') and concatenating the ellipsis
('concat') are defined as functions, to make the enhancement in a
subsequent patch easier.
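
For readers who want to try the idea outside Mercurial, here is a
minimal Python 3 sketch of column-aware trimming. It is not the patch
itself: 'ucolwidth' below is a simplified stand-in (an assumption) that
counts East Asian wide and fullwidth characters as 2 columns, and the
'encoding' parameter replaces the module-level setting used by the
patch. The 'u[:-i]' slice and the '+ ellipsis' step correspond to the
'uslice' and 'concat' helpers described above.

```python
import unicodedata


def ucolwidth(u):
    """Columns used by str 'u' (assumption: W and F characters take 2)."""
    return sum(2 if unicodedata.east_asian_width(c) in "WF" else 1 for c in u)


def trim(s, width, ellipsis=b"", encoding="utf-8"):
    """Trim bytes 's' to at most 'width' columns, including 'ellipsis'."""
    try:
        u = s.decode(encoding)
    except UnicodeDecodeError:
        # Undecodable input: fall back to one column per byte.
        if len(s) <= width:
            return s
        room = width - len(ellipsis)
        if room <= 0:  # not enough room even for the ellipsis
            return ellipsis[:width]
        return s[:room] + ellipsis

    if ucolwidth(u) <= width:  # trimming is not needed
        return s
    room = width - len(ellipsis)
    if room <= 0:  # not enough room even for the ellipsis
        return ellipsis[:width]
    # Drop whole characters from the right until the remainder fits.
    for i in range(1, len(u)):
        usub = u[:-i]
        if ucolwidth(usub) <= room:
            return usub.encode(encoding) + ellipsis
    return ellipsis  # not enough room for a multi-column character


wide = "\u3042\u3044\u3046\u3048\u304a".encode("utf-8")  # 2 x 5 = 10 columns
trimmed = trim(wide, 8, ellipsis=b"+++")
```

Because whole characters are removed before re-encoding, the result
always decodes cleanly, which a plain byte slice cannot guarantee.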
Greg Ward - June 15, 2014, 1:27 a.m.
On 14 June 2014, FUJIWARA Katsunori said:
> # HG changeset patch
> # User FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
> # Date 1402675901 -32400
> #      Sat Jun 14 01:11:41 2014 +0900
> # Branch stable
> # Node ID b0986d208bd4dc9a5043326457fbb6dd5c3e25d6
> # Parent  14560418856dbd2b1b5d0bf1b4ae3bceffc4eef0
> encoding: add 'trim' to trim multi-byte characters at most specified columns

Does this really belong on stable? Even if it culminates in fixing a
bug, is that bug important enough to risk the stability of the stable
branch?

       Greg
Katsunori FUJIWARA - June 16, 2014, 6:34 a.m.
At Sat, 14 Jun 2014 21:27:06 -0400,
Greg Ward wrote:
> 
> On 14 June 2014, FUJIWARA Katsunori said:
> > # HG changeset patch
> > # User FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
> > # Date 1402675901 -32400
> > #      Sat Jun 14 01:11:41 2014 +0900
> > # Branch stable
> > # Node ID b0986d208bd4dc9a5043326457fbb6dd5c3e25d6
> > # Parent  14560418856dbd2b1b5d0bf1b4ae3bceffc4eef0
> > encoding: add 'trim' to trim multi-byte characters at most specified columns
> 
> Does this really belong on stable? Even if it culminates in fixing a
> bug, is that bug important enough to risk the stability of the stable
> branch?

Before this series, 'progress' and 'histedit' could split a string in
the middle of a multi-byte sequence (that is, show a broken multi-byte
sequence).
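
As a small illustration of the problem (a Python 3 sketch with made-up
sample text, not code taken from 'progress' or 'histedit'), slicing
UTF-8 bytes at a fixed length can cut a character in half:

```python
text = "\u3042\u3044\u3046"    # three hiragana characters
raw = text.encode("utf-8")     # 9 bytes, 3 bytes per character

cut = raw[:4]                  # naive slice ends inside the 2nd character

try:
    cut.decode("utf-8")
    broken = False
except UnicodeDecodeError:
    broken = True              # the slice is no longer valid UTF-8

# Decoding with replacement marks the damaged tail with U+FFFD.
shown = cut.decode("utf-8", errors="replace")
```

Here 'broken' ends up True, and 'shown' holds the first character
followed by a replacement character.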

I posted this series for stable because this problem is easily noticed
by end users whose text contains multi-byte sequences.

But I can also understand that this is not so serious from the point
of view of the 'core functionality of Mercurial', as you say.

I am fine with treating this series as one for non-stable.


> 
>        Greg
> -- 
> Greg Ward                            http://www.gerg.ca
> <greg@gerg.ca>                       @gergdotca
> 

----------------------------------------------------------------------
[FUJIWARA Katsunori]                             foozy@lares.dti.ne.jp

Patch

diff --git a/mercurial/encoding.py b/mercurial/encoding.py
--- a/mercurial/encoding.py
+++ b/mercurial/encoding.py
@@ -165,6 +165,76 @@ 
         if colwidth(t) == c:
             return t
 
+def trim(s, width, ellipsis=''):
+    """Trim string 's' to at most 'width' columns (including 'ellipsis').
+
+    >>> ellipsis = '+++'
+    >>> from mercurial import encoding
+    >>> encoding.encoding = 'utf-8'
+    >>> t = '1234567890'
+    >>> print trim(t, 12, ellipsis=ellipsis)
+    1234567890
+    >>> print trim(t, 10, ellipsis=ellipsis)
+    1234567890
+    >>> print trim(t, 8, ellipsis=ellipsis)
+    12345+++
+    >>> print trim(t, 8)
+    12345678
+    >>> print trim(t, 3, ellipsis=ellipsis)
+    +++
+    >>> print trim(t, 1, ellipsis=ellipsis)
+    +
+    >>> u = u'\u3042\u3044\u3046\u3048\u304a' # 2 x 5 = 10 columns
+    >>> t = u.encode(encoding.encoding)
+    >>> print trim(t, 12, ellipsis=ellipsis)
+    \xe3\x81\x82\xe3\x81\x84\xe3\x81\x86\xe3\x81\x88\xe3\x81\x8a
+    >>> print trim(t, 10, ellipsis=ellipsis)
+    \xe3\x81\x82\xe3\x81\x84\xe3\x81\x86\xe3\x81\x88\xe3\x81\x8a
+    >>> print trim(t, 8, ellipsis=ellipsis)
+    \xe3\x81\x82\xe3\x81\x84+++
+    >>> print trim(t, 5)
+    \xe3\x81\x82\xe3\x81\x84
+    >>> print trim(t, 4, ellipsis=ellipsis)
+    +++
+    >>> t = '\x11\x22\x33\x44\x55\x66\x77\x88\x99\xaa' # invalid byte sequence
+    >>> print trim(t, 12, ellipsis=ellipsis)
+    \x11\x22\x33\x44\x55\x66\x77\x88\x99\xaa
+    >>> print trim(t, 10, ellipsis=ellipsis)
+    \x11\x22\x33\x44\x55\x66\x77\x88\x99\xaa
+    >>> print trim(t, 8, ellipsis=ellipsis)
+    \x11\x22\x33\x44\x55+++
+    >>> print trim(t, 8)
+    \x11\x22\x33\x44\x55\x66\x77\x88
+    >>> print trim(t, 3, ellipsis=ellipsis)
+    +++
+    >>> print trim(t, 1, ellipsis=ellipsis)
+    +
+    """
+    try:
+        u = s.decode(encoding)
+    except UnicodeDecodeError:
+        if len(s) <= width: # trimming is not needed
+            return s
+        width -= len(ellipsis)
+        if width <= 0: # not enough room even for ellipsis
+            return ellipsis[:width + len(ellipsis)]
+        return s[:width] + ellipsis
+
+    if ucolwidth(u) <= width: # trimming is not needed
+        return s
+
+    width -= len(ellipsis)
+    if width <= 0: # not enough room even for ellipsis
+        return ellipsis[:width + len(ellipsis)]
+
+    uslice = lambda i: u[:-i]
+    concat = lambda s: s + ellipsis
+    for i in xrange(1, len(u)):
+        usub = uslice(i)
+        if ucolwidth(usub) <= width:
+            return concat(usub.encode(encoding))
+    return ellipsis # not enough room for multi-column characters
+
 def lower(s):
     "best-effort encoding-aware case-folding of local string s"
     try: