Submitter | Boris Feld |
---|---|
Date | Nov. 22, 2018, 10:17 p.m. |
Message ID | <018578f3ab597d5ea573.1542925042@localhost.localdomain> |
Download | mbox | patch |
Permalink | /patch/36727/ |
State | Superseded |
Headers | show |
Comments
On Thu, Nov 22, 2018 at 2:26 PM Boris Feld <boris.feld@octobus.net> wrote: > # HG changeset patch > # User Boris Feld <boris.feld@octobus.net> > # Date 1542916922 -3600 > # Thu Nov 22 21:02:02 2018 +0100 > # Node ID 018578f3ab597d5ea573107e7310470de76a3907 > # Parent 4628c3cf1fc1052ca25296c8c1a42c4502b59dc9 > # EXP-Topic perf-ignore-2 > # Available At https://bitbucket.org/octobus/mercurial-devel/ > # hg pull https://bitbucket.org/octobus/mercurial-devel/ -r > 018578f3ab59 > match: avoid translating glob to matcher multiple times for large sets > > For hgignore with many globs, the resulting regexp might not fit under the > 20K > length limit. So the patterns need to be broken up in smaller pieces. > Did you see 0f6a1bdf89fb (match: handle large regexes, 2007-08-19) and 59a9dc9562e2 (ignore: split up huge patterns, 2008-02-11)? It might be worth trying to figure out what Python versions those commits are talking about. Maybe we've dropped support for those versions and we can simplify this code.
On Thu, Nov 22, 2018 at 11:44 PM Martin von Zweigbergk < martinvonz@google.com> wrote: > > > On Thu, Nov 22, 2018 at 2:26 PM Boris Feld <boris.feld@octobus.net> wrote: > >> # HG changeset patch >> # User Boris Feld <boris.feld@octobus.net> >> # Date 1542916922 -3600 >> # Thu Nov 22 21:02:02 2018 +0100 >> # Node ID 018578f3ab597d5ea573107e7310470de76a3907 >> # Parent 4628c3cf1fc1052ca25296c8c1a42c4502b59dc9 >> # EXP-Topic perf-ignore-2 >> # Available At https://bitbucket.org/octobus/mercurial-devel/ >> # hg pull https://bitbucket.org/octobus/mercurial-devel/ -r >> 018578f3ab59 >> match: avoid translating glob to matcher multiple times for large sets >> >> For hgignore with many globs, the resulting regexp might not fit under >> the 20K >> length limit. So the patterns need to be broken up in smaller pieces. >> > > Did you see 0f6a1bdf89fb (match: handle large regexes, 2007-08-19) > and 59a9dc9562e2 (ignore: split up huge patterns, 2008-02-11)? It might be > worth trying to figure out what Python versions those commits are talking > about. Maybe we've dropped support for those versions and we can simplify > this code. > Oh, and what made me do the archaeology there was that you seem to have lost the handling of OverlowError from the regex engine. As I said above, I suspect that's fine because we no longer support some very old Python versions (but please try to figure out what version that refers to). Still, if we decide to drop that OverflowError handling, I'd prefer to see that in an explicit commit early in this series.
On Fri, 23 Nov 2018 00:00:36 -0800, Martin von Zweigbergk via Mercurial-devel wrote: > On Thu, Nov 22, 2018 at 11:44 PM Martin von Zweigbergk < > martinvonz@google.com> wrote: > > On Thu, Nov 22, 2018 at 2:26 PM Boris Feld <boris.feld@octobus.net> wrote: > > > >> # HG changeset patch > >> # User Boris Feld <boris.feld@octobus.net> > >> # Date 1542916922 -3600 > >> # Thu Nov 22 21:02:02 2018 +0100 > >> # Node ID 018578f3ab597d5ea573107e7310470de76a3907 > >> # Parent 4628c3cf1fc1052ca25296c8c1a42c4502b59dc9 > >> # EXP-Topic perf-ignore-2 > >> # Available At https://bitbucket.org/octobus/mercurial-devel/ > >> # hg pull https://bitbucket.org/octobus/mercurial-devel/ -r > >> 018578f3ab59 > >> match: avoid translating glob to matcher multiple times for large sets > >> > >> For hgignore with many globs, the resulting regexp might not fit under > >> the 20K > >> length limit. So the patterns need to be broken up in smaller pieces. > >> > > > > Did you see 0f6a1bdf89fb (match: handle large regexes, 2007-08-19) > > and 59a9dc9562e2 (ignore: split up huge patterns, 2008-02-11)? It might be > > worth trying to figure out what Python versions those commits are talking > > about. Maybe we've dropped support for those versions and we can simplify > > this code. > > > > Oh, and what made me do the archaeology there was that you seem to have > lost the handling of OverlowError from the regex engine. As I said above, I > suspect that's fine because we no longer support some very old Python > versions (but please try to figure out what version that refers to). Still, > if we decide to drop that OverflowError handling, I'd prefer to see that in > an explicit commit early in this series. Perhaps it's been fixed since 2.7.4. The regexp code width is extended from 16bit to 32bit (or Py_UCS4) integer. That should be large enough to handle practical patterns. https://bugs.python.org/issue1160
On Fri, 23 Nov 2018 18:00:36 +0900, Yuya Nishihara wrote: > On Fri, 23 Nov 2018 00:00:36 -0800, Martin von Zweigbergk via Mercurial-devel wrote: > > On Thu, Nov 22, 2018 at 11:44 PM Martin von Zweigbergk < > > martinvonz@google.com> wrote: > > > On Thu, Nov 22, 2018 at 2:26 PM Boris Feld <boris.feld@octobus.net> wrote: > > > > > >> # HG changeset patch > > >> # User Boris Feld <boris.feld@octobus.net> > > >> # Date 1542916922 -3600 > > >> # Thu Nov 22 21:02:02 2018 +0100 > > >> # Node ID 018578f3ab597d5ea573107e7310470de76a3907 > > >> # Parent 4628c3cf1fc1052ca25296c8c1a42c4502b59dc9 > > >> # EXP-Topic perf-ignore-2 > > >> # Available At https://bitbucket.org/octobus/mercurial-devel/ > > >> # hg pull https://bitbucket.org/octobus/mercurial-devel/ -r > > >> 018578f3ab59 > > >> match: avoid translating glob to matcher multiple times for large sets > > >> > > >> For hgignore with many globs, the resulting regexp might not fit under > > >> the 20K > > >> length limit. So the patterns need to be broken up in smaller pieces. > > >> > > > > > > Did you see 0f6a1bdf89fb (match: handle large regexes, 2007-08-19) > > > and 59a9dc9562e2 (ignore: split up huge patterns, 2008-02-11)? It might be > > > worth trying to figure out what Python versions those commits are talking > > > about. Maybe we've dropped support for those versions and we can simplify > > > this code. > > > > > > > Oh, and what made me do the archaeology there was that you seem to have > > lost the handling of OverlowError from the regex engine. As I said above, I > > suspect that's fine because we no longer support some very old Python > > versions (but please try to figure out what version that refers to). Still, > > if we decide to drop that OverflowError handling, I'd prefer to see that in > > an explicit commit early in this series. > > Perhaps it's been fixed since 2.7.4. The regexp code width is extended > from 16bit to 32bit (or Py_UCS4) integer. That should be large enough to > handle practical patterns. > > https://bugs.python.org/issue1160 That said, combining more chunks of regex patterns might be likely to lead to another funny problem. % python -c 'import re; re.compile("(a)" * 100)' Traceback (most recent call last): File "<string>", line 1, in <module> File "/usr/lib/python2.7/re.py", line 194, in compile return _compile(pattern, flags) File "/usr/lib/python2.7/re.py", line 249, in _compile p = sre_compile.compile(pattern, flags) File "/usr/lib/python2.7/sre_compile.py", line 583, in compile "sorry, but this version only supports 100 named groups" AssertionError: sorry, but this version only supports 100 named groups It's unrelated to the OverflowError issue, but splitting patterns could help avoiding the 100-named-group problem.
On Fri, Nov 23, 2018, 01:24 Yuya Nishihara <yuya@tcha.org wrote: > On Fri, 23 Nov 2018 18:00:36 +0900, Yuya Nishihara wrote: > > On Fri, 23 Nov 2018 00:00:36 -0800, Martin von Zweigbergk via > Mercurial-devel wrote: > > > On Thu, Nov 22, 2018 at 11:44 PM Martin von Zweigbergk < > > > martinvonz@google.com> wrote: > > > > On Thu, Nov 22, 2018 at 2:26 PM Boris Feld <boris.feld@octobus.net> > wrote: > > > > > > > >> # HG changeset patch > > > >> # User Boris Feld <boris.feld@octobus.net> > > > >> # Date 1542916922 -3600 > > > >> # Thu Nov 22 21:02:02 2018 +0100 > > > >> # Node ID 018578f3ab597d5ea573107e7310470de76a3907 > > > >> # Parent 4628c3cf1fc1052ca25296c8c1a42c4502b59dc9 > > > >> # EXP-Topic perf-ignore-2 > > > >> # Available At https://bitbucket.org/octobus/mercurial-devel/ > > > >> # hg pull > https://bitbucket.org/octobus/mercurial-devel/ -r > > > >> 018578f3ab59 > > > >> match: avoid translating glob to matcher multiple times for large > sets > > > >> > > > >> For hgignore with many globs, the resulting regexp might not fit > under > > > >> the 20K > > > >> length limit. So the patterns need to be broken up in smaller > pieces. > > > >> > > > > > > > > Did you see 0f6a1bdf89fb (match: handle large regexes, 2007-08-19) > > > > and 59a9dc9562e2 (ignore: split up huge patterns, 2008-02-11)? It > might be > > > > worth trying to figure out what Python versions those commits are > talking > > > > about. Maybe we've dropped support for those versions and we can > simplify > > > > this code. > > > > > > > > > > Oh, and what made me do the archaeology there was that you seem to have > > > lost the handling of OverlowError from the regex engine. As I said > above, I > > > suspect that's fine because we no longer support some very old Python > > > versions (but please try to figure out what version that refers to). > Still, > > > if we decide to drop that OverflowError handling, I'd prefer to see > that in > > > an explicit commit early in this series. > > > > Perhaps it's been fixed since 2.7.4. The regexp code width is extended > > from 16bit to 32bit (or Py_UCS4) integer. That should be large enough to > > handle practical patterns. > > > > https://bugs.python.org/issue1160 > > That said, combining more chunks of regex patterns might be likely to > lead to another funny problem. > > % python -c 'import re; re.compile("(a)" * 100)' > Traceback (most recent call last): > File "<string>", line 1, in <module> > File "/usr/lib/python2.7/re.py", line 194, in compile > return _compile(pattern, flags) > File "/usr/lib/python2.7/re.py", line 249, in _compile > p = sre_compile.compile(pattern, flags) > File "/usr/lib/python2.7/sre_compile.py", line 583, in compile > "sorry, but this version only supports 100 named groups" > AssertionError: sorry, but this version only supports 100 named groups > > It's unrelated to the OverflowError issue, but splitting patterns could > help avoiding the 100-named-group problem. > Another solution to that problem seems to be to use unnamed groups, of course. I'd expect using a single regex to be faster at least with re2 (I know too little about how Python's regex engine works). >
On 23/11/2018 10:24, Yuya Nishihara wrote: > On Fri, 23 Nov 2018 18:00:36 +0900, Yuya Nishihara wrote: >> On Fri, 23 Nov 2018 00:00:36 -0800, Martin von Zweigbergk via Mercurial-devel wrote: >>> On Thu, Nov 22, 2018 at 11:44 PM Martin von Zweigbergk < >>> martinvonz@google.com> wrote: >>>> On Thu, Nov 22, 2018 at 2:26 PM Boris Feld <boris.feld@octobus.net> wrote: >>>> >>>>> # HG changeset patch >>>>> # User Boris Feld <boris.feld@octobus.net> >>>>> # Date 1542916922 -3600 >>>>> # Thu Nov 22 21:02:02 2018 +0100 >>>>> # Node ID 018578f3ab597d5ea573107e7310470de76a3907 >>>>> # Parent 4628c3cf1fc1052ca25296c8c1a42c4502b59dc9 >>>>> # EXP-Topic perf-ignore-2 >>>>> # Available At https://bitbucket.org/octobus/mercurial-devel/ >>>>> # hg pull https://bitbucket.org/octobus/mercurial-devel/ -r >>>>> 018578f3ab59 >>>>> match: avoid translating glob to matcher multiple times for large sets >>>>> >>>>> For hgignore with many globs, the resulting regexp might not fit under >>>>> the 20K >>>>> length limit. So the patterns need to be broken up in smaller pieces. >>>>> >>>> Did you see 0f6a1bdf89fb (match: handle large regexes, 2007-08-19) >>>> and 59a9dc9562e2 (ignore: split up huge patterns, 2008-02-11)? It might be >>>> worth trying to figure out what Python versions those commits are talking >>>> about. Maybe we've dropped support for those versions and we can simplify >>>> this code. >>>> >>> Oh, and what made me do the archaeology there was that you seem to have >>> lost the handling of OverlowError from the regex engine. As I said above, I >>> suspect that's fine because we no longer support some very old Python >>> versions (but please try to figure out what version that refers to). Still, >>> if we decide to drop that OverflowError handling, I'd prefer to see that in >>> an explicit commit early in this series. To me, 0f6a1bdf89fb (catching error from engine) is superseded by 59a9dc9562e2 (cannot trust the engine, preemptively raise our own error). So I feel like it is fine to just rely on the size limit. >> Perhaps it's been fixed since 2.7.4. The regexp code width is extended >> from 16bit to 32bit (or Py_UCS4) integer. That should be large enough to >> handle practical patterns. >> >> https://bugs.python.org/issue1160 Thanks for digging this out. It looks like we may be able to drop this limit altogether. However, I would like to make it a change distinct from this series. The current code is very problematic for some people (to the point where the majority of `hg status` time is spent in that function). I would like to get fast code for the same semantic first. Then look into changing the semantic. > That said, combining more chunks of regex patterns might be likely to > lead to another funny problem. > > % python -c 'import re; re.compile("(a)" * 100)' > Traceback (most recent call last): > File "<string>", line 1, in <module> > File "/usr/lib/python2.7/re.py", line 194, in compile > return _compile(pattern, flags) > File "/usr/lib/python2.7/re.py", line 249, in _compile > p = sre_compile.compile(pattern, flags) > File "/usr/lib/python2.7/sre_compile.py", line 583, in compile > "sorry, but this version only supports 100 named groups" > AssertionError: sorry, but this version only supports 100 named groups > > It's unrelated to the OverflowError issue, but splitting patterns could > help avoiding the 100-named-group problem. By chance, my current gigantic use case does not involve named groups. Catching AssertionError, will be fun. I wish there were some clean API to expose and check engine limitation. > _______________________________________________ > Mercurial-devel mailing list > Mercurial-devel@mercurial-scm.org > https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Patch
diff --git a/mercurial/match.py b/mercurial/match.py --- a/mercurial/match.py +++ b/mercurial/match.py @@ -1185,6 +1185,7 @@ def _buildmatch(kindpats, globsuffix, li return regex, lambda f: any(mf(f) for mf in matchfuncs) MAXRESIZE = 20000 +_BASESIZE = len('(?:)') - 1 def _groupregexps(regexps): """gather multiple regexps into a single one""" @@ -1203,21 +1204,31 @@ def _buildregexmatch(kindpats, globsuffi OverflowError """ try: - regex = _groupregexps([_regex(k, p, globsuffix) - for (k, p, s) in kindpats]) - if len(regex) > MAXRESIZE: - raise OverflowError - return regex, _rematcher(regex) - except OverflowError: - # We're using a Python with a tiny regex engine and we - # made it explode, so we'll divide the pattern list in two - # until it works - l = len(kindpats) - if l < 2: - raise - regexa, a = _buildregexmatch(kindpats[:l//2], globsuffix) - regexb, b = _buildregexmatch(kindpats[l//2:], globsuffix) - return regex, lambda s: a(s) or b(s) + allgroups = [] + regexps = [_regex(k, p, globsuffix) for (k, p, s) in kindpats] + fullregexp = _groupregexps(regexps) + + startidx = 0 + groupsize = _BASESIZE + for idx, r in enumerate(regexps): + piecesize = len(r) + if (piecesize + 4) > MAXRESIZE: + raise OverflowError + elif (groupsize + 1 + piecesize) > MAXRESIZE: + group = regexps[startidx:idx] + allgroups.append(_groupregexps(group)) + startidx = idx + groupsize = _BASESIZE + groupsize += piecesize + 1 + + if startidx == 0: + func = _rematcher(fullregexp) + else: + group = regexps[startidx:] + allgroups.append(_groupregexps(group)) + allmatchers = [_rematcher(g) for g in allgroups] + func = lambda s: any(m(s) for m in allmatchers) + return fullregexp, func except re.error: for k, p, s in kindpats: try: