Patchwork [5,of,6,V2] match: avoid translating glob to matcher multiple times for large sets

login
register
mail settings
Submitter Boris Feld
Date Nov. 22, 2018, 10:17 p.m.
Message ID <018578f3ab597d5ea573.1542925042@localhost.localdomain>
Download mbox | patch
Permalink /patch/36727/
State Superseded
Headers show

Comments

Boris Feld - Nov. 22, 2018, 10:17 p.m.
# HG changeset patch
# User Boris Feld <boris.feld@octobus.net>
# Date 1542916922 -3600
#      Thu Nov 22 21:02:02 2018 +0100
# Node ID 018578f3ab597d5ea573107e7310470de76a3907
# Parent  4628c3cf1fc1052ca25296c8c1a42c4502b59dc9
# EXP-Topic perf-ignore-2
# Available At https://bitbucket.org/octobus/mercurial-devel/
#              hg pull https://bitbucket.org/octobus/mercurial-devel/ -r 018578f3ab59
match: avoid translating glob to matcher multiple times for large sets

For hgignore with many globs, the resulting regexp might not fit under the 20K
length limit. So the patterns need to be broken up in smaller pieces.

Before this change, the logic was re-starting the full process from scratch
for each smaller pieces, including the translation of globs into regexp.
Effectively doing the work over and over.

If the 20K limit is reached, we are likely in a case where there is many such
glob, so exporting them is especially expensive and we should be careful not
to do that work more than once.

To work around this, we now translate glob to regexp once and for all. Then,
we assemble the resulting individual regexp into valid blocks.

This raises a very significant performance win for large `.hgignore file`:

Before: ! wall 0.153153 comb 0.150000 user 0.150000 sys 0.000000 (median of 66)
After:  ! wall 0.059793 comb 0.060000 user 0.060000 sys 0.000000 (median of 100)
via Mercurial-devel - Nov. 23, 2018, 7:44 a.m.
On Thu, Nov 22, 2018 at 2:26 PM Boris Feld <boris.feld@octobus.net> wrote:

> # HG changeset patch
> # User Boris Feld <boris.feld@octobus.net>
> # Date 1542916922 -3600
> #      Thu Nov 22 21:02:02 2018 +0100
> # Node ID 018578f3ab597d5ea573107e7310470de76a3907
> # Parent  4628c3cf1fc1052ca25296c8c1a42c4502b59dc9
> # EXP-Topic perf-ignore-2
> # Available At https://bitbucket.org/octobus/mercurial-devel/
> #              hg pull https://bitbucket.org/octobus/mercurial-devel/ -r
> 018578f3ab59
> match: avoid translating glob to matcher multiple times for large sets
>
> For hgignore with many globs, the resulting regexp might not fit under the
> 20K
> length limit. So the patterns need to be broken up in smaller pieces.
>

Did you see 0f6a1bdf89fb (match: handle large regexes, 2007-08-19)
and 59a9dc9562e2 (ignore: split up huge patterns, 2008-02-11)? It might be
worth trying to figure out what Python versions those commits are talking
about. Maybe we've dropped support for those versions and we can simplify
this code.
via Mercurial-devel - Nov. 23, 2018, 8 a.m.
On Thu, Nov 22, 2018 at 11:44 PM Martin von Zweigbergk <
martinvonz@google.com> wrote:

>
>
> On Thu, Nov 22, 2018 at 2:26 PM Boris Feld <boris.feld@octobus.net> wrote:
>
>> # HG changeset patch
>> # User Boris Feld <boris.feld@octobus.net>
>> # Date 1542916922 -3600
>> #      Thu Nov 22 21:02:02 2018 +0100
>> # Node ID 018578f3ab597d5ea573107e7310470de76a3907
>> # Parent  4628c3cf1fc1052ca25296c8c1a42c4502b59dc9
>> # EXP-Topic perf-ignore-2
>> # Available At https://bitbucket.org/octobus/mercurial-devel/
>> #              hg pull https://bitbucket.org/octobus/mercurial-devel/ -r
>> 018578f3ab59
>> match: avoid translating glob to matcher multiple times for large sets
>>
>> For hgignore with many globs, the resulting regexp might not fit under
>> the 20K
>> length limit. So the patterns need to be broken up in smaller pieces.
>>
>
> Did you see 0f6a1bdf89fb (match: handle large regexes, 2007-08-19)
> and 59a9dc9562e2 (ignore: split up huge patterns, 2008-02-11)? It might be
> worth trying to figure out what Python versions those commits are talking
> about. Maybe we've dropped support for those versions and we can simplify
> this code.
>

Oh, and what made me do the archaeology there was that you seem to have
lost the handling of OverlowError from the regex engine. As I said above, I
suspect that's fine because we no longer support some very old Python
versions (but please try to figure out what version that refers to). Still,
if we decide to drop that OverflowError handling, I'd prefer to see that in
an explicit commit early in this series.
Yuya Nishihara - Nov. 23, 2018, 9 a.m.
On Fri, 23 Nov 2018 00:00:36 -0800, Martin von Zweigbergk via Mercurial-devel wrote:
> On Thu, Nov 22, 2018 at 11:44 PM Martin von Zweigbergk <
> martinvonz@google.com> wrote:
> > On Thu, Nov 22, 2018 at 2:26 PM Boris Feld <boris.feld@octobus.net> wrote:
> >
> >> # HG changeset patch
> >> # User Boris Feld <boris.feld@octobus.net>
> >> # Date 1542916922 -3600
> >> #      Thu Nov 22 21:02:02 2018 +0100
> >> # Node ID 018578f3ab597d5ea573107e7310470de76a3907
> >> # Parent  4628c3cf1fc1052ca25296c8c1a42c4502b59dc9
> >> # EXP-Topic perf-ignore-2
> >> # Available At https://bitbucket.org/octobus/mercurial-devel/
> >> #              hg pull https://bitbucket.org/octobus/mercurial-devel/ -r
> >> 018578f3ab59
> >> match: avoid translating glob to matcher multiple times for large sets
> >>
> >> For hgignore with many globs, the resulting regexp might not fit under
> >> the 20K
> >> length limit. So the patterns need to be broken up in smaller pieces.
> >>
> >
> > Did you see 0f6a1bdf89fb (match: handle large regexes, 2007-08-19)
> > and 59a9dc9562e2 (ignore: split up huge patterns, 2008-02-11)? It might be
> > worth trying to figure out what Python versions those commits are talking
> > about. Maybe we've dropped support for those versions and we can simplify
> > this code.
> >
> 
> Oh, and what made me do the archaeology there was that you seem to have
> lost the handling of OverlowError from the regex engine. As I said above, I
> suspect that's fine because we no longer support some very old Python
> versions (but please try to figure out what version that refers to). Still,
> if we decide to drop that OverflowError handling, I'd prefer to see that in
> an explicit commit early in this series.

Perhaps it's been fixed since 2.7.4. The regexp code width is extended
from 16bit to 32bit (or Py_UCS4) integer. That should be large enough to
handle practical patterns.

https://bugs.python.org/issue1160
Yuya Nishihara - Nov. 23, 2018, 9:24 a.m.
On Fri, 23 Nov 2018 18:00:36 +0900, Yuya Nishihara wrote:
> On Fri, 23 Nov 2018 00:00:36 -0800, Martin von Zweigbergk via Mercurial-devel wrote:
> > On Thu, Nov 22, 2018 at 11:44 PM Martin von Zweigbergk <
> > martinvonz@google.com> wrote:
> > > On Thu, Nov 22, 2018 at 2:26 PM Boris Feld <boris.feld@octobus.net> wrote:
> > >
> > >> # HG changeset patch
> > >> # User Boris Feld <boris.feld@octobus.net>
> > >> # Date 1542916922 -3600
> > >> #      Thu Nov 22 21:02:02 2018 +0100
> > >> # Node ID 018578f3ab597d5ea573107e7310470de76a3907
> > >> # Parent  4628c3cf1fc1052ca25296c8c1a42c4502b59dc9
> > >> # EXP-Topic perf-ignore-2
> > >> # Available At https://bitbucket.org/octobus/mercurial-devel/
> > >> #              hg pull https://bitbucket.org/octobus/mercurial-devel/ -r
> > >> 018578f3ab59
> > >> match: avoid translating glob to matcher multiple times for large sets
> > >>
> > >> For hgignore with many globs, the resulting regexp might not fit under
> > >> the 20K
> > >> length limit. So the patterns need to be broken up in smaller pieces.
> > >>
> > >
> > > Did you see 0f6a1bdf89fb (match: handle large regexes, 2007-08-19)
> > > and 59a9dc9562e2 (ignore: split up huge patterns, 2008-02-11)? It might be
> > > worth trying to figure out what Python versions those commits are talking
> > > about. Maybe we've dropped support for those versions and we can simplify
> > > this code.
> > >
> > 
> > Oh, and what made me do the archaeology there was that you seem to have
> > lost the handling of OverlowError from the regex engine. As I said above, I
> > suspect that's fine because we no longer support some very old Python
> > versions (but please try to figure out what version that refers to). Still,
> > if we decide to drop that OverflowError handling, I'd prefer to see that in
> > an explicit commit early in this series.
> 
> Perhaps it's been fixed since 2.7.4. The regexp code width is extended
> from 16bit to 32bit (or Py_UCS4) integer. That should be large enough to
> handle practical patterns.
> 
> https://bugs.python.org/issue1160

That said, combining more chunks of regex patterns might be likely to
lead to another funny problem.

% python -c 'import re; re.compile("(a)" * 100)'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python2.7/re.py", line 194, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python2.7/re.py", line 249, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python2.7/sre_compile.py", line 583, in compile
    "sorry, but this version only supports 100 named groups"
AssertionError: sorry, but this version only supports 100 named groups

It's unrelated to the OverflowError issue, but splitting patterns could
help avoiding the 100-named-group problem.
via Mercurial-devel - Nov. 23, 2018, 3:58 p.m.
On Fri, Nov 23, 2018, 01:24 Yuya Nishihara <yuya@tcha.org wrote:

> On Fri, 23 Nov 2018 18:00:36 +0900, Yuya Nishihara wrote:
> > On Fri, 23 Nov 2018 00:00:36 -0800, Martin von Zweigbergk via
> Mercurial-devel wrote:
> > > On Thu, Nov 22, 2018 at 11:44 PM Martin von Zweigbergk <
> > > martinvonz@google.com> wrote:
> > > > On Thu, Nov 22, 2018 at 2:26 PM Boris Feld <boris.feld@octobus.net>
> wrote:
> > > >
> > > >> # HG changeset patch
> > > >> # User Boris Feld <boris.feld@octobus.net>
> > > >> # Date 1542916922 -3600
> > > >> #      Thu Nov 22 21:02:02 2018 +0100
> > > >> # Node ID 018578f3ab597d5ea573107e7310470de76a3907
> > > >> # Parent  4628c3cf1fc1052ca25296c8c1a42c4502b59dc9
> > > >> # EXP-Topic perf-ignore-2
> > > >> # Available At https://bitbucket.org/octobus/mercurial-devel/
> > > >> #              hg pull
> https://bitbucket.org/octobus/mercurial-devel/ -r
> > > >> 018578f3ab59
> > > >> match: avoid translating glob to matcher multiple times for large
> sets
> > > >>
> > > >> For hgignore with many globs, the resulting regexp might not fit
> under
> > > >> the 20K
> > > >> length limit. So the patterns need to be broken up in smaller
> pieces.
> > > >>
> > > >
> > > > Did you see 0f6a1bdf89fb (match: handle large regexes, 2007-08-19)
> > > > and 59a9dc9562e2 (ignore: split up huge patterns, 2008-02-11)? It
> might be
> > > > worth trying to figure out what Python versions those commits are
> talking
> > > > about. Maybe we've dropped support for those versions and we can
> simplify
> > > > this code.
> > > >
> > >
> > > Oh, and what made me do the archaeology there was that you seem to have
> > > lost the handling of OverlowError from the regex engine. As I said
> above, I
> > > suspect that's fine because we no longer support some very old Python
> > > versions (but please try to figure out what version that refers to).
> Still,
> > > if we decide to drop that OverflowError handling, I'd prefer to see
> that in
> > > an explicit commit early in this series.
> >
> > Perhaps it's been fixed since 2.7.4. The regexp code width is extended
> > from 16bit to 32bit (or Py_UCS4) integer. That should be large enough to
> > handle practical patterns.
> >
> > https://bugs.python.org/issue1160
>
> That said, combining more chunks of regex patterns might be likely to
> lead to another funny problem.
>
> % python -c 'import re; re.compile("(a)" * 100)'
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
>   File "/usr/lib/python2.7/re.py", line 194, in compile
>     return _compile(pattern, flags)
>   File "/usr/lib/python2.7/re.py", line 249, in _compile
>     p = sre_compile.compile(pattern, flags)
>   File "/usr/lib/python2.7/sre_compile.py", line 583, in compile
>     "sorry, but this version only supports 100 named groups"
> AssertionError: sorry, but this version only supports 100 named groups
>
> It's unrelated to the OverflowError issue, but splitting patterns could
> help avoiding the 100-named-group problem.
>

Another solution to that problem seems to be to use unnamed groups, of
course. I'd expect using a single regex to be faster at least with re2 (I
know too little about how Python's regex engine works).

>
Boris Feld - Nov. 23, 2018, 5:20 p.m.
On 23/11/2018 10:24, Yuya Nishihara wrote:
> On Fri, 23 Nov 2018 18:00:36 +0900, Yuya Nishihara wrote:
>> On Fri, 23 Nov 2018 00:00:36 -0800, Martin von Zweigbergk via Mercurial-devel wrote:
>>> On Thu, Nov 22, 2018 at 11:44 PM Martin von Zweigbergk <
>>> martinvonz@google.com> wrote:
>>>> On Thu, Nov 22, 2018 at 2:26 PM Boris Feld <boris.feld@octobus.net> wrote:
>>>>
>>>>> # HG changeset patch
>>>>> # User Boris Feld <boris.feld@octobus.net>
>>>>> # Date 1542916922 -3600
>>>>> #      Thu Nov 22 21:02:02 2018 +0100
>>>>> # Node ID 018578f3ab597d5ea573107e7310470de76a3907
>>>>> # Parent  4628c3cf1fc1052ca25296c8c1a42c4502b59dc9
>>>>> # EXP-Topic perf-ignore-2
>>>>> # Available At https://bitbucket.org/octobus/mercurial-devel/
>>>>> #              hg pull https://bitbucket.org/octobus/mercurial-devel/ -r
>>>>> 018578f3ab59
>>>>> match: avoid translating glob to matcher multiple times for large sets
>>>>>
>>>>> For hgignore with many globs, the resulting regexp might not fit under
>>>>> the 20K
>>>>> length limit. So the patterns need to be broken up in smaller pieces.
>>>>>
>>>> Did you see 0f6a1bdf89fb (match: handle large regexes, 2007-08-19)
>>>> and 59a9dc9562e2 (ignore: split up huge patterns, 2008-02-11)? It might be
>>>> worth trying to figure out what Python versions those commits are talking
>>>> about. Maybe we've dropped support for those versions and we can simplify
>>>> this code.
>>>>
>>> Oh, and what made me do the archaeology there was that you seem to have
>>> lost the handling of OverlowError from the regex engine. As I said above, I
>>> suspect that's fine because we no longer support some very old Python
>>> versions (but please try to figure out what version that refers to). Still,
>>> if we decide to drop that OverflowError handling, I'd prefer to see that in
>>> an explicit commit early in this series.
To me, 0f6a1bdf89fb (catching error from engine) is superseded by
59a9dc9562e2 (cannot trust the engine, preemptively raise our own error).

So I feel like it is fine to just rely on the size limit.
>> Perhaps it's been fixed since 2.7.4. The regexp code width is extended
>> from 16bit to 32bit (or Py_UCS4) integer. That should be large enough to
>> handle practical patterns.
>>
>> https://bugs.python.org/issue1160

Thanks for digging this out. It looks like we may be able to drop this
limit altogether. However, I would like to make it a change distinct
from this series.

The current code is very problematic for some people (to the point where
the majority of `hg status` time is spent in that function). I would
like to get fast code for the same semantic first. Then look into
changing the semantic.

> That said, combining more chunks of regex patterns might be likely to
> lead to another funny problem.
>
> % python -c 'import re; re.compile("(a)" * 100)'
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
>   File "/usr/lib/python2.7/re.py", line 194, in compile
>     return _compile(pattern, flags)
>   File "/usr/lib/python2.7/re.py", line 249, in _compile
>     p = sre_compile.compile(pattern, flags)
>   File "/usr/lib/python2.7/sre_compile.py", line 583, in compile
>     "sorry, but this version only supports 100 named groups"
> AssertionError: sorry, but this version only supports 100 named groups
>
> It's unrelated to the OverflowError issue, but splitting patterns could
> help avoiding the 100-named-group problem.

By chance, my current gigantic use case does not involve named groups.

Catching AssertionError, will be fun. I wish there were some clean API
to expose and check engine limitation.

> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@mercurial-scm.org
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel

Patch

diff --git a/mercurial/match.py b/mercurial/match.py
--- a/mercurial/match.py
+++ b/mercurial/match.py
@@ -1185,6 +1185,7 @@  def _buildmatch(kindpats, globsuffix, li
         return regex, lambda f: any(mf(f) for mf in matchfuncs)
 
 MAXRESIZE = 20000
+_BASESIZE = len('(?:)') - 1
 
 def _groupregexps(regexps):
     """gather multiple regexps into a single one"""
@@ -1203,21 +1204,31 @@  def _buildregexmatch(kindpats, globsuffi
     OverflowError
     """
     try:
-        regex = _groupregexps([_regex(k, p, globsuffix)
-                               for (k, p, s) in kindpats])
-        if len(regex) > MAXRESIZE:
-            raise OverflowError
-        return regex, _rematcher(regex)
-    except OverflowError:
-        # We're using a Python with a tiny regex engine and we
-        # made it explode, so we'll divide the pattern list in two
-        # until it works
-        l = len(kindpats)
-        if l < 2:
-            raise
-        regexa, a = _buildregexmatch(kindpats[:l//2], globsuffix)
-        regexb, b = _buildregexmatch(kindpats[l//2:], globsuffix)
-        return regex, lambda s: a(s) or b(s)
+        allgroups = []
+        regexps = [_regex(k, p, globsuffix) for (k, p, s) in kindpats]
+        fullregexp = _groupregexps(regexps)
+
+        startidx = 0
+        groupsize = _BASESIZE
+        for idx, r in enumerate(regexps):
+            piecesize = len(r)
+            if (piecesize + 4) > MAXRESIZE:
+                raise OverflowError
+            elif (groupsize + 1 + piecesize) > MAXRESIZE:
+                group = regexps[startidx:idx]
+                allgroups.append(_groupregexps(group))
+                startidx = idx
+                groupsize = _BASESIZE
+            groupsize += piecesize + 1
+
+        if startidx == 0:
+            func = _rematcher(fullregexp)
+        else:
+            group = regexps[startidx:]
+            allgroups.append(_groupregexps(group))
+            allmatchers = [_rematcher(g) for g in allgroups]
+            func = lambda s: any(m(s) for m in allmatchers)
+        return fullregexp, func
     except re.error:
         for k, p, s in kindpats:
             try: