Patchwork [5,of,6,V2] match: avoid translating glob to matcher multiple times for large sets

Submitter: via Mercurial-devel
Date: Nov. 23, 2018, 11:51 p.m.
Message ID: <CAESOdVBibRPTb_gUfWy7DPh8fqg9AaDZMx9sQd6c6AfaWr28Jw@mail.gmail.com>
Permalink: /patch/36755/
State: New

Comments

via Mercurial-devel - Nov. 23, 2018, 11:51 p.m.
On Fri, Nov 23, 2018 at 9:20 AM Boris FELD <boris.feld@octobus.net> wrote:

>
> On 23/11/2018 10:24, Yuya Nishihara wrote:
> > On Fri, 23 Nov 2018 18:00:36 +0900, Yuya Nishihara wrote:
> >> On Fri, 23 Nov 2018 00:00:36 -0800, Martin von Zweigbergk via
> Mercurial-devel wrote:
> >>> On Thu, Nov 22, 2018 at 11:44 PM Martin von Zweigbergk <
> >>> martinvonz@google.com> wrote:
> >>>> On Thu, Nov 22, 2018 at 2:26 PM Boris Feld <boris.feld@octobus.net>
> wrote:
> >>>>
> >>>>> # HG changeset patch
> >>>>> # User Boris Feld <boris.feld@octobus.net>
> >>>>> # Date 1542916922 -3600
> >>>>> #      Thu Nov 22 21:02:02 2018 +0100
> >>>>> # Node ID 018578f3ab597d5ea573107e7310470de76a3907
> >>>>> # Parent  4628c3cf1fc1052ca25296c8c1a42c4502b59dc9
> >>>>> # EXP-Topic perf-ignore-2
> >>>>> # Available At https://bitbucket.org/octobus/mercurial-devel/
> >>>>> #              hg pull
> https://bitbucket.org/octobus/mercurial-devel/ -r
> >>>>> 018578f3ab59
> >>>>> match: avoid translating glob to matcher multiple times for large
> sets
> >>>>>
> >>>>> For hgignore with many globs, the resulting regexp might not fit
> under
> >>>>> the 20K
> >>>>> length limit. So the patterns need to be broken up in smaller pieces.
> >>>>>
> >>>> Did you see 0f6a1bdf89fb (match: handle large regexes, 2007-08-19)
> >>>> and 59a9dc9562e2 (ignore: split up huge patterns, 2008-02-11)? It
> might be
> >>>> worth trying to figure out what Python versions those commits are
> talking
> >>>> about. Maybe we've dropped support for those versions and we can
> simplify
> >>>> this code.
> >>>>
> >>> Oh, and what made me do the archaeology there was that you seem to have
> >>> lost the handling of OverflowError from the regex engine. As I said
> above, I
> >>> suspect that's fine because we no longer support some very old Python
> >>> versions (but please try to figure out what version that refers to).
> Still,
> >>> if we decide to drop that OverflowError handling, I'd prefer to see
> that in
> >>> an explicit commit early in this series.
> To me, 0f6a1bdf89fb (catching error from engine) is superseded by
> 59a9dc9562e2 (cannot trust the engine, preemptively raise our own error).
>

Yes, perhaps (if it was only expressions longer than 20k that raised
OverflowError). My point was that if that was the case, we should rewrite
to avoid using an internal exception for flow control, i.e. change from
this:

    try:
        regex = # create regex
        if len(regex) > MAX_RE_SIZE:
            raise OverflowError
        return regex, _rematcher(regex)
    except OverflowError:
        # break up into smaller

to this:

    regex = # create regex
    if len(regex) < MAX_RE_SIZE:
        return regex, _rematcher(regex)
    # break up into smaller
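
That guard-clause shape can be made concrete in a small runnable sketch. This is a hypothetical stand-in, not the actual match.py code: `_rematcher` here just wraps `re.compile`, and the oversize fallback simply ORs per-piece matchers instead of reproducing the series' grouping logic.

```python
import re

MAX_RE_SIZE = 20000  # same cap the series uses

def _rematcher(regex):
    # stand-in for match.py's _rematcher: compile and return the match test
    return re.compile(regex).match

def buildregexmatch(regexps):
    """Guard-clause version: no OverflowError used for flow control."""
    regex = '(?:%s)' % '|'.join(regexps)
    if len(regex) < MAX_RE_SIZE:
        return regex, _rematcher(regex)
    # break up into smaller pieces: compile each one and OR the matchers
    matchers = [_rematcher(r) for r in regexps]
    return regex, lambda f: any(m(f) for m in matchers)
```

With a handful of short patterns the single compiled regex is returned; past the cap, the same API silently degrades to a list of per-piece matchers.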



>
> So I feel like it is fine to just rely on the size limit.
> >> Perhaps it's been fixed since 2.7.4. The regexp code width is extended
> >> from 16bit to 32bit (or Py_UCS4) integer. That should be large enough to
> >> handle practical patterns.
> >>
> >> https://bugs.python.org/issue1160
>
> Thanks for digging this out. It looks like we may be able to drop this
> limit altogether. However, I would like to make it a change distinct
> from this series.
>
> The current code is very problematic for some people (to the point where
> the majority of `hg status` time is spent in that function). I would
> like to get fast code for the same semantics first. Then look into
> changing the semantics.
>

Is your concern that you might regress in performance of something by
changing how large the groups are? Or that it would be more work?

I tried creating a regex for *every* pattern and that actually seemed
faster (to my surprise), both when creating the matcher and when evaluating
it. I tried it on the mozilla-unified repo both with 1k files and with 10k
files in the hgignores. I used the following patch on top of your series.

         for k, p, s in kindpats:




>
> > That said, combining more chunks of regex patterns might be likely to
> > lead to another funny problem.
> >
> > % python -c 'import re; re.compile("(a)" * 100)'
> > Traceback (most recent call last):
> >   File "<string>", line 1, in <module>
> >   File "/usr/lib/python2.7/re.py", line 194, in compile
> >     return _compile(pattern, flags)
> >   File "/usr/lib/python2.7/re.py", line 249, in _compile
> >     p = sre_compile.compile(pattern, flags)
> >   File "/usr/lib/python2.7/sre_compile.py", line 583, in compile
> >     "sorry, but this version only supports 100 named groups"
> > AssertionError: sorry, but this version only supports 100 named groups
> >
> > It's unrelated to the OverflowError issue, but splitting patterns could
> > help avoid the 100-named-group problem.
>
> By chance, my current gigantic use case does not involve named groups.
>
> Catching AssertionError will be fun. I wish there were some clean API
> to expose and check engine limitations.
>
> > _______________________________________________
> > Mercurial-devel mailing list
> > Mercurial-devel@mercurial-scm.org
> > https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
>
Yuya Nishihara - Nov. 24, 2018, 2:13 a.m.
On Fri, 23 Nov 2018 15:51:58 -0800, Martin von Zweigbergk wrote:
> On Fri, Nov 23, 2018 at 9:20 AM Boris FELD <boris.feld@octobus.net> wrote:
> > So I feel like it is fine to just rely on the size limit.
> > >> Perhaps it's been fixed since 2.7.4. The regexp code width is extended
> > >> from 16bit to 32bit (or Py_UCS4) integer. That should be large enough to
> > >> handle practical patterns.
> > >>
> > >> https://bugs.python.org/issue1160
> >
> > Thanks for digging this out. It looks like we may be able to drop this
> > limit altogether. However, I would like to make it a change distinct
> > from this series.
> >
> > The current code is very problematic for some people (to the point where
> > the majority of `hg status` time is spent in that function). I would
> > like to get fast code for the same semantics first. Then look into
> > changing the semantics.
> >
> 
> Is your concern that you might regress in performance of something by
> changing how large the groups are? Or that it would be more work?
> 
> I tried creating a regex for *every* pattern and that actually seemed
> faster (to my surprise), both when creating the matcher and when evaluating
> it. I tried it on the mozilla-unified repo both with 1k files and with 10k
> files in the hgignores. I used the following patch on top of your series.

Wow. If we don't need to combine patterns into one, numbered groups should
just work.
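
Yuya's point about numbered groups can be seen in a tiny, hypothetical example (the patterns below are made up, not from the series): joining patterns into one alternation renumbers the capture groups across patterns, and on old Pythons the joined regex also ran into the 100-group cap quoted above, while compiling per pattern keeps each pattern's own numbering.

```python
import re

# Two made-up patterns, each with its own capture groups.
patterns = [r'(\d+)-(\d+)', r'([a-z]+)\.py']

# Combined into one alternation: groups are renumbered across patterns,
# so the second pattern's capture becomes group 3.
combined = re.compile('(?:%s)' % '|'.join(patterns))
assert combined.match('abc.py').group(3) == 'abc'

# Compiled separately: each pattern keeps group 1 as its own first capture.
matchers = [re.compile(p) for p in patterns]
assert matchers[1].match('abc.py').group(1) == 'abc'
```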
Boris Feld - Nov. 28, 2018, 3:41 p.m.
On 24/11/2018 00:51, Martin von Zweigbergk via Mercurial-devel wrote:
>
>
> On Fri, Nov 23, 2018 at 9:20 AM Boris FELD <boris.feld@octobus.net
> <mailto:boris.feld@octobus.net>> wrote:
>
>
>     On 23/11/2018 10:24, Yuya Nishihara wrote:
>     > On Fri, 23 Nov 2018 18:00:36 +0900, Yuya Nishihara wrote:
>     >> On Fri, 23 Nov 2018 00:00:36 -0800, Martin von Zweigbergk via
>     Mercurial-devel wrote:
>     >>> On Thu, Nov 22, 2018 at 11:44 PM Martin von Zweigbergk <
>     >>> martinvonz@google.com <mailto:martinvonz@google.com>> wrote:
>     >>>> On Thu, Nov 22, 2018 at 2:26 PM Boris Feld
>     <boris.feld@octobus.net <mailto:boris.feld@octobus.net>> wrote:
>     >>>>
>     >>>>> # HG changeset patch
>     >>>>> # User Boris Feld <boris.feld@octobus.net
>     <mailto:boris.feld@octobus.net>>
>     >>>>> # Date 1542916922 -3600
>     >>>>> #      Thu Nov 22 21:02:02 2018 +0100
>     >>>>> # Node ID 018578f3ab597d5ea573107e7310470de76a3907
>     >>>>> # Parent  4628c3cf1fc1052ca25296c8c1a42c4502b59dc9
>     >>>>> # EXP-Topic perf-ignore-2
>     >>>>> # Available At https://bitbucket.org/octobus/mercurial-devel/
>     >>>>> #              hg pull
>     https://bitbucket.org/octobus/mercurial-devel/ -r
>     >>>>> 018578f3ab59
>     >>>>> match: avoid translating glob to matcher multiple times for
>     large sets
>     >>>>>
>     >>>>> For hgignore with many globs, the resulting regexp might not
>     fit under
>     >>>>> the 20K
>     >>>>> length limit. So the patterns need to be broken up in
>     smaller pieces.
>     >>>>>
>     >>>> Did you see 0f6a1bdf89fb (match: handle large regexes,
>     2007-08-19)
>     >>>> and 59a9dc9562e2 (ignore: split up huge patterns,
>     2008-02-11)? It might be
>     >>>> worth trying to figure out what Python versions those commits
>     are talking
>     >>>> about. Maybe we've dropped support for those versions and we
>     can simplify
>     >>>> this code.
>     >>>>
>     >>> Oh, and what made me do the archaeology there was that you
>     seem to have
>     >>> lost the handling of OverflowError from the regex engine. As I
>     said above, I
>     >>> suspect that's fine because we no longer support some very old
>     Python
>     >>> versions (but please try to figure out what version that
>     refers to). Still,
>     >>> if we decide to drop that OverflowError handling, I'd prefer
>     to see that in
>     >>> an explicit commit early in this series.
>     To me, 0f6a1bdf89fb (catching error from engine) is superseded by
>     59a9dc9562e2 (cannot trust the engine, preemptively raise our own
>     error).
>
>
> Yes, perhaps (if it was only expressions longer than 20k that raised
> OverflowError). My point was that if that was the case, we should
> rewrite to avoid using an internal exception for flow control, i.e.
> change from this:
>
>     try:
>         regex = # create regex
>         if len(regex) > MAX_RE_SIZE:
>             raise OverflowError
>         return regex, _rematcher(regex)
>     except OverflowError:
>         # break up into smaller
>
> to this:
>
>     regex = # create regex
>     if len(regex) < MAX_RE_SIZE:
>         return regex, _rematcher(regex)
>     # break up into smaller
I don't disagree with the point. However, I do not know why it is
relevant to the current discussion. The series in review no longer
uses exceptions for flow control. What am I missing?
>
>  
>
>
>     So I feel like it is fine to just rely on the size limit.
>     >> Perhaps it's been fixed since 2.7.4. The regexp code width is
>     extended
>     >> from 16bit to 32bit (or Py_UCS4) integer. That should be large
>     enough to
>     >> handle practical patterns.
>     >>
>     >> https://bugs.python.org/issue1160
>
>     Thanks for digging this out. It looks like we may be able to drop this
>     limit altogether. However, I would like to make it a change distinct
>     from this series.
>
>     The current code is very problematic for some people (to the point
>     where
>     the majority of `hg status` time is spent in that function). I would
>     like to get fast code for the same semantic first. Then look into
>     changing the semantic.
>
>
> Is your concern that you might regress in performance of something by
> changing how large the groups are? Or that it would be more work?

My concern is that we are currently delaying a simple code change
yielding a huge improvement because there is the option of a larger,
riskier one.

I'm not dismissing your finding about lifting the limit and avoiding the
regexps joining. They look very promising and we are planning to explore
them. However, they are a different change.

The current series preserves all the assumptions we have been using
for the past 10 years, implementing faster code for the same result.
This is a very low-risk change that solves the most immediate problem
of a user group we support.

Dropping the limit is a different matter. It is probably harmless to do
so, but Mercurial runs on a variety of systems and regexp engines. We'll
have to look at all the cases to make sure our new assumption is valid.
The same goes for stopping the grouping: we'll have to check the
behavior on various real-life use cases to make sure it is valid.

We are happy to explore these two ideas ourselves, but I would prefer to
see the faster grouping in first. This is a good intermediate step that
improves the current situation.
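
For reference, the grouping the series implements can be sketched standalone. This is a simplified sketch of the chunking shown in the `_buildregexmatch` hunk quoted earlier in the thread; the real code also aborts with an error on a single pattern that is too long by itself, which is elided here.

```python
import re

MAX_RE_SIZE = 20000               # cap on any single compiled regex
_BASE_SIZE = len('(?:)') - 1      # overhead of the '(?:...)' wrapper

def _joinregexes(regexps):
    """gather multiple regular expressions into a single one"""
    return '(?:%s)' % '|'.join(regexps)

def group_regexps(regexps, maxsize=MAX_RE_SIZE):
    """Pack consecutive regexps into joined chunks, each under maxsize."""
    groups = []
    startidx = 0
    groupsize = _BASE_SIZE
    for idx, r in enumerate(regexps):
        piecesize = len(r)
        # start a new chunk when adding this piece would overflow the cap
        # (the idx > startidx guard avoids emitting an empty chunk)
        if groupsize + 1 + piecesize > maxsize and idx > startidx:
            groups.append(_joinregexes(regexps[startidx:idx]))
            startidx = idx
            groupsize = _BASE_SIZE
        groupsize += piecesize + 1
    groups.append(_joinregexes(regexps[startidx:]))
    return groups
```

Each chunk then gets its own compiled matcher, and a path matches if any chunk's matcher accepts it.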

In addition, we won't have direct access to the real-life repository
for about two months, so it will be hard for us to check the timing of
a significantly new approach on real-life data.

>
> I tried creating a regex for *every* pattern and that actually seemed
> faster (to my surprise), both when creating the matcher and when
> evaluating it. I tried it on the mozilla-unified repo both with 1k
> files and with 10k files in the hgignores. I used the following patch
> on top of your series.
Could you share more details on the patterns you generated, the
operation you performed and how you gathered the timings?
>
> diff --git a/mercurial/match.py b/mercurial/match.py
> --- a/mercurial/match.py
> +++ b/mercurial/match.py
> @@ -1184,51 +1184,15 @@ def _buildmatch(kindpats, globsuffix, li
>      else:
>          return regex, lambda f: any(mf(f) for mf in matchfuncs)
>
> -MAX_RE_SIZE = 20000
> -_BASE_SIZE = len('(?:)') - 1
> -
> -def _joinregexes(regexps):
> -    """gather multiple regular expressions into a single one"""
> -    return '(?:%s)' % '|'.join(regexps)
> -
>  def _buildregexmatch(kindpats, globsuffix):
>      """Build a match function from a list of kinds and kindpats,
>      return regexp string and a matcher function.
> -
> -    Test too large input
> -    >>> _buildregexmatch([
> -    ...     ('relglob', '?' * MAX_RE_SIZE, '')
> -    ... ], '$')
> -    Traceback (most recent call last):
> -    ...
> -    Abort: matcher pattern is too long (20009 bytes)
>      """
>      try:
> -        allgroups = []
>          regexps = [_regex(k, p, globsuffix) for (k, p, s) in kindpats]
> -        fullregexp = _joinregexes(regexps)
> -
> -        startidx = 0
> -        groupsize = _BASE_SIZE
> -        for idx, r in enumerate(regexps):
> -            piecesize = len(r)
> -            if (piecesize + 4) > MAX_RE_SIZE:
> -                msg = _("matcher pattern is too long (%d bytes)") % piecesize
> -                raise error.Abort(msg)
> -            elif (groupsize + 1 + piecesize) > MAX_RE_SIZE:
> -                group = regexps[startidx:idx]
> -                allgroups.append(_joinregexes(group))
> -                startidx = idx
> -                groupsize = _BASE_SIZE
> -            groupsize += piecesize + 1
> -
> -        if startidx == 0:
> -            func = _rematcher(fullregexp)
> -        else:
> -            group = regexps[startidx:]
> -            allgroups.append(_joinregexes(group))
> -            allmatchers = [_rematcher(g) for g in allgroups]
> -            func = lambda s: any(m(s) for m in allmatchers)
> +        fullregexp = '(?:%s)' % '|'.join(regexps)
> +        allmatchers = [_rematcher(r) for r in regexps]
> +        func = lambda s: any(m(s) for m in allmatchers)
>          return fullregexp, func
>      except re.error:
>          for k, p, s in kindpats:
>
>
>  
>
>
>     > That said, combining more chunks of regex patterns might be
>     likely to
>     > lead to another funny problem.
>     >
>     > % python -c 'import re; re.compile("(a)" * 100)'
>     > Traceback (most recent call last):
>     >   File "<string>", line 1, in <module>
>     >   File "/usr/lib/python2.7/re.py", line 194, in compile
>     >     return _compile(pattern, flags)
>     >   File "/usr/lib/python2.7/re.py", line 249, in _compile
>     >     p = sre_compile.compile(pattern, flags)
>     >   File "/usr/lib/python2.7/sre_compile.py", line 583, in compile
>     >     "sorry, but this version only supports 100 named groups"
>     > AssertionError: sorry, but this version only supports 100 named
>     groups
>     >
>     > It's unrelated to the OverflowError issue, but splitting
>     patterns could
>     > help avoiding the 100-named-group problem.
>
>     By chance, my current gigantic use case does not involve named groups.
>
>     Catching AssertionError will be fun. I wish there were some clean API
>     to expose and check engine limitations.
>
>
>
via Mercurial-devel - Nov. 28, 2018, 6:30 p.m.
On Wed, Nov 28, 2018 at 7:41 AM Boris FELD <boris.feld@octobus.net> wrote:

>
> On 24/11/2018 00:51, Martin von Zweigbergk via Mercurial-devel wrote:
>
>
>
> On Fri, Nov 23, 2018 at 9:20 AM Boris FELD <boris.feld@octobus.net> wrote:
>
>>
>> On 23/11/2018 10:24, Yuya Nishihara wrote:
>> > On Fri, 23 Nov 2018 18:00:36 +0900, Yuya Nishihara wrote:
>> >> On Fri, 23 Nov 2018 00:00:36 -0800, Martin von Zweigbergk via
>> Mercurial-devel wrote:
>> >>> On Thu, Nov 22, 2018 at 11:44 PM Martin von Zweigbergk <
>> >>> martinvonz@google.com> wrote:
>> >>>> On Thu, Nov 22, 2018 at 2:26 PM Boris Feld <boris.feld@octobus.net>
>> wrote:
>> >>>>
>> >>>>> # HG changeset patch
>> >>>>> # User Boris Feld <boris.feld@octobus.net>
>> >>>>> # Date 1542916922 -3600
>> >>>>> #      Thu Nov 22 21:02:02 2018 +0100
>> >>>>> # Node ID 018578f3ab597d5ea573107e7310470de76a3907
>> >>>>> # Parent  4628c3cf1fc1052ca25296c8c1a42c4502b59dc9
>> >>>>> # EXP-Topic perf-ignore-2
>> >>>>> # Available At https://bitbucket.org/octobus/mercurial-devel/
>> >>>>> #              hg pull
>> https://bitbucket.org/octobus/mercurial-devel/ -r
>> >>>>> 018578f3ab59
>> >>>>> match: avoid translating glob to matcher multiple times for large
>> sets
>> >>>>>
>> >>>>> For hgignore with many globs, the resulting regexp might not fit
>> under
>> >>>>> the 20K
>> >>>>> length limit. So the patterns need to be broken up in smaller
>> pieces.
>> >>>>>
>> >>>> Did you see 0f6a1bdf89fb (match: handle large regexes, 2007-08-19)
>> >>>> and 59a9dc9562e2 (ignore: split up huge patterns, 2008-02-11)? It
>> might be
>> >>>> worth trying to figure out what Python versions those commits are
>> talking
>> >>>> about. Maybe we've dropped support for those versions and we can
>> simplify
>> >>>> this code.
>> >>>>
>> >>> Oh, and what made me do the archaeology there was that you seem to
>> have
>> >>> lost the handling of OverflowError from the regex engine. As I said
>> above, I
>> >>> suspect that's fine because we no longer support some very old Python
>> >>> versions (but please try to figure out what version that refers to).
>> Still,
>> >>> if we decide to drop that OverflowError handling, I'd prefer to see
>> that in
>> >>> an explicit commit early in this series.
>> To me, 0f6a1bdf89fb (catching error from engine) is superseded by
>> 59a9dc9562e2 (cannot trust the engine, preemptively raise our own error).
>>
>
> Yes, perhaps (if it was only expressions longer than 20k that raised
> OverflowError). My point was that if that was the case, we should rewrite
> to avoid using an internal exception for flow control, i.e. change from
> this:
>
>     try:
>         regex = # create regex
>         if len(regex) > MAX_RE_SIZE:
>             raise OverflowError
>         return regex, _rematcher(regex)
>     except OverflowError:
>         # break up into smaller
>
> to this:
>
>     regex = # create regex
>     if len(regex) < MAX_RE_SIZE:
>         return regex, _rematcher(regex)
>     # break up into smaller
>
> I don't disagree with the point. However, I do not know why it is relevant
> to the current discussion. The series in review no longer uses
> exceptions for flow control. What am I missing?
>

My request was to make the removal of the support for OverflowError
explicit. As I said above, it seems it was just lost in this patch without
a mention. I've sent D5309 to show what I mean. Your patch 6/6 would go
well right after that. You could rebase your series on top, or maybe I'm
just being nitpicky and I should drop the patch.

>
>
>
>>
>> So I feel like it is fine to just rely on the size limit.
>> >> Perhaps it's been fixed since 2.7.4. The regexp code width is extended
>> >> from 16bit to 32bit (or Py_UCS4) integer. That should be large enough
>> to
>> >> handle practical patterns.
>> >>
>> >> https://bugs.python.org/issue1160
>>
>> Thanks for digging this out. It looks like we may be able to drop this
>> limit altogether. However, I would like to make it a change distinct
>> from this series.
>>
>> The current code is very problematic for some people (to the point where
>> the majority of `hg status` time is spent in that function). I would
>> like to get fast code for the same semantic first. Then look into
>> changing the semantic.
>>
>
> Is your concern that you might regress in performance of something by
> changing how large the groups are? Or that it would be more work?
>
>
> My concern is that we are currently delaying a simple code change yielding
> a huge improvement because there is the option of a larger, riskier
> one.
>
> I'm not dismissing your finding about lifting the limit and avoiding the
> regexps joining. They look very promising and we are planning to explore
> them. However, they are a different change.
>
> The current series preserves all the assumptions we have been using for
> the past 10 years, implementing faster code for the same result. This is
> a very low-risk change that solves the most immediate problem of a user
> group we support.
>
> Dropping the limit is a different matter. It is probably harmless to do
> so, but Mercurial runs on a variety of systems and regexp engines. We'll
> have to look at all the cases to make sure our new assumption is valid.
> The same goes for stopping the grouping: we'll have to check the behavior
> on various real-life use cases to make sure it is valid.
>

I suspect it will be best (depending on regex engine) to either not do any
grouping, or to put all patterns in one group. I'd like (for someone) to
see how re2 behaves here.
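
A rough way to compare the two extremes is sketched below, on synthetic data (the patterns and paths are made up for illustration; absolute numbers depend heavily on the regex engine, which is exactly why re2 would be worth measuring the same way).

```python
import re
import timeit

# Synthetic hgignore-like patterns and candidate paths (made up).
patterns = [r'dir%d/.*\.tmp' % i for i in range(500)]
paths = ['dir%d/file.%s' % (i % 500, 'tmp' if i % 2 else 'txt')
         for i in range(1000)]

# Extreme 1: everything in one big alternation.
combined = re.compile('(?:%s)' % '|'.join(patterns)).match
# Extreme 2: no grouping at all, one compiled regex per pattern.
separate = [re.compile(p).match for p in patterns]

def run_combined():
    return sum(1 for p in paths if combined(p))

def run_separate():
    return sum(1 for p in paths if any(m(p) for m in separate))

# Sanity check: both strategies must classify paths identically.
assert run_combined() == run_separate()

t_combined = timeit.timeit(run_combined, number=3)
t_separate = timeit.timeit(run_separate, number=3)
```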


> We are happy to explore these two ideas ourselves,
>

Thanks!


> but I would prefer to see the faster grouping in first. This is a good
> intermediate step that improves the current situation.
>
> In addition, we won't have direct access to the real-life repository for
> about two months, so it will be hard for us to check the timing of a
> significantly new approach on real-life data.
>
>
> I tried creating a regex for *every* pattern and that actually seemed
> faster (to my surprise), both when creating the matcher and when evaluating
> it. I tried it on the mozilla-unified repo both with 1k files and with 10k
> files in the hgignores. I used the following patch on top of your series.
>
> Could you share more details on the patterns you generated, the operation
> you performed and how you gathered the timings?
>

Have to run to a meeting. Will get back to you later.

>
> diff --git a/mercurial/match.py b/mercurial/match.py
> --- a/mercurial/match.py
> +++ b/mercurial/match.py
> @@ -1184,51 +1184,15 @@ def _buildmatch(kindpats, globsuffix, li
>      else:
>          return regex, lambda f: any(mf(f) for mf in matchfuncs)
>
> -MAX_RE_SIZE = 20000
> -_BASE_SIZE = len('(?:)') - 1
> -
> -def _joinregexes(regexps):
> -    """gather multiple regular expressions into a single one"""
> -    return '(?:%s)' % '|'.join(regexps)
> -
>  def _buildregexmatch(kindpats, globsuffix):
>      """Build a match function from a list of kinds and kindpats,
>      return regexp string and a matcher function.
> -
> -    Test too large input
> -    >>> _buildregexmatch([
> -    ...     ('relglob', '?' * MAX_RE_SIZE, '')
> -    ... ], '$')
> -    Traceback (most recent call last):
> -    ...
> -    Abort: matcher pattern is too long (20009 bytes)
>      """
>      try:
> -        allgroups = []
>          regexps = [_regex(k, p, globsuffix) for (k, p, s) in kindpats]
> -        fullregexp = _joinregexes(regexps)
> -
> -        startidx = 0
> -        groupsize = _BASE_SIZE
> -        for idx, r in enumerate(regexps):
> -            piecesize = len(r)
> -            if (piecesize + 4) > MAX_RE_SIZE:
> -                msg = _("matcher pattern is too long (%d bytes)") % piecesize
> -                raise error.Abort(msg)
> -            elif (groupsize + 1 + piecesize) > MAX_RE_SIZE:
> -                group = regexps[startidx:idx]
> -                allgroups.append(_joinregexes(group))
> -                startidx = idx
> -                groupsize = _BASE_SIZE
> -            groupsize += piecesize + 1
> -
> -        if startidx == 0:
> -            func = _rematcher(fullregexp)
> -        else:
> -            group = regexps[startidx:]
> -            allgroups.append(_joinregexes(group))
> -            allmatchers = [_rematcher(g) for g in allgroups]
> -            func = lambda s: any(m(s) for m in allmatchers)
> +        fullregexp = '(?:%s)' % '|'.join(regexps)
> +        allmatchers = [_rematcher(r) for r in regexps]
> +        func = lambda s: any(m(s) for m in allmatchers)
>          return fullregexp, func
>      except re.error:
>          for k, p, s in kindpats:
>
>
>
>
>>
>> > That said, combining more chunks of regex patterns might be likely to
>> > lead to another funny problem.
>> >
>> > % python -c 'import re; re.compile("(a)" * 100)'
>> > Traceback (most recent call last):
>> >   File "<string>", line 1, in <module>
>> >   File "/usr/lib/python2.7/re.py", line 194, in compile
>> >     return _compile(pattern, flags)
>> >   File "/usr/lib/python2.7/re.py", line 249, in _compile
>> >     p = sre_compile.compile(pattern, flags)
>> >   File "/usr/lib/python2.7/sre_compile.py", line 583, in compile
>> >     "sorry, but this version only supports 100 named groups"
>> > AssertionError: sorry, but this version only supports 100 named groups
>> >
>> > It's unrelated to the OverflowError issue, but splitting patterns could
>> > help avoiding the 100-named-group problem.
>>
>> By chance, my current gigantic use case does not involve named groups.
>>
>> Catching AssertionError, will be fun. I wish there were some clean API
>> to expose and check engine limitation.
>>
>>
>
>
>
Boris Feld - Dec. 1, 2018, 12:05 p.m.
On 28/11/2018 19:30, Martin von Zweigbergk via Mercurial-devel wrote:
>
>
> On Wed, Nov 28, 2018 at 7:41 AM Boris FELD <boris.feld@octobus.net
> <mailto:boris.feld@octobus.net>> wrote:
>
>
>     On 24/11/2018 00:51, Martin von Zweigbergk via Mercurial-devel wrote:
>>
>>
>>     On Fri, Nov 23, 2018 at 9:20 AM Boris FELD
>>     <boris.feld@octobus.net <mailto:boris.feld@octobus.net>> wrote:
>>
>>
>>         On 23/11/2018 10:24, Yuya Nishihara wrote:
>>         > On Fri, 23 Nov 2018 18:00:36 +0900, Yuya Nishihara wrote:
>>         >> On Fri, 23 Nov 2018 00:00:36 -0800, Martin von Zweigbergk
>>         via Mercurial-devel wrote:
>>         >>> On Thu, Nov 22, 2018 at 11:44 PM Martin von Zweigbergk <
>>         >>> martinvonz@google.com <mailto:martinvonz@google.com>> wrote:
>>         >>>> On Thu, Nov 22, 2018 at 2:26 PM Boris Feld
>>         <boris.feld@octobus.net <mailto:boris.feld@octobus.net>> wrote:
>>         >>>>
>>         >>>>> # HG changeset patch
>>         >>>>> # User Boris Feld <boris.feld@octobus.net
>>         <mailto:boris.feld@octobus.net>>
>>         >>>>> # Date 1542916922 -3600
>>         >>>>> #      Thu Nov 22 21:02:02 2018 +0100
>>         >>>>> # Node ID 018578f3ab597d5ea573107e7310470de76a3907
>>         >>>>> # Parent  4628c3cf1fc1052ca25296c8c1a42c4502b59dc9
>>         >>>>> # EXP-Topic perf-ignore-2
>>         >>>>> # Available At
>>         https://bitbucket.org/octobus/mercurial-devel/
>>         >>>>> #              hg pull
>>         https://bitbucket.org/octobus/mercurial-devel/ -r
>>         >>>>> 018578f3ab59
>>         >>>>> match: avoid translating glob to matcher multiple times
>>         for large sets
>>         >>>>>
>>         >>>>> For hgignore with many globs, the resulting regexp
>>         might not fit under
>>         >>>>> the 20K
>>         >>>>> length limit. So the patterns need to be broken up in
>>         smaller pieces.
>>         >>>>>
>>         >>>> Did you see 0f6a1bdf89fb (match: handle large regexes,
>>         2007-08-19)
>>         >>>> and 59a9dc9562e2 (ignore: split up huge patterns,
>>         2008-02-11)? It might be
>>         >>>> worth trying to figure out what Python versions those
>>         commits are talking
>>         >>>> about. Maybe we've dropped support for those versions
>>         and we can simplify
>>         >>>> this code.
>>         >>>>
>>         >>> Oh, and what made me do the archaeology there was that
>>         you seem to have
>>         >>> lost the handling of OverflowError from the regex engine.
>>         As I said above, I
>>         >>> suspect that's fine because we no longer support some
>>         very old Python
>>         >>> versions (but please try to figure out what version that
>>         refers to). Still,
>>         >>> if we decide to drop that OverflowError handling, I'd
>>         prefer to see that in
>>         >>> an explicit commit early in this series.
>>         To me, 0f6a1bdf89fb (catching error from engine) is superseded by
>>         59a9dc9562e2 (cannot trust the engine, preemptively raise our
>>         own error).
>>
>>
>>     Yes, perhaps (if it was only expressions longer than 20k that
>>     raised OverflowError). My point was that if that was the case, we
>>     should rewrite to avoid using an internal exception for flow
>>     control, i.e. change from this:
>>
>>         try:
>>             regex = # create regex
>>             if len(regex) > MAX_RE_SIZE:
>>                 raise OverflowError
>>             return regex, _rematcher(regex)
>>         except OverflowError:
>>             # break up into smaller
>>
>>     to this:
>>
>>         regex = # create regex
>>         if len(regex) < MAX_RE_SIZE:
>>             return regex, _rematcher(regex)
>>         # break up into smaller
>     I don't disagree with the point. However, I do not know why it is
>     relevant to the current discussion. The series in review no longer
>     uses exceptions for flow control. What am I missing?
>
>
> My request was to make the removal of the support for OverflowError
> explicit. As I said above, it seems it was just lost in this patch
> without a mention. I've sent D5309 to show what I mean. Your patch 6/6
> would go well right after that. You could rebase your series on top,
> or maybe I'm just being nitpicky and I should drop the patch.
Oh! I see now, it makes sense. It has been integrated into the series.
>
>>
>>      
>>
>>
>>         So I feel like it is fine to just rely on the size limit.
>>         >> Perhaps it's been fixed since 2.7.4. The regexp code width
>>         is extended
>>         >> from 16bit to 32bit (or Py_UCS4) integer. That should be
>>         large enough to
>>         >> handle practical patterns.
>>         >>
>>         >> https://bugs.python.org/issue1160
>>
>>         Thanks for digging this out. It looks like we may be able to
>>         drop this limit altogether. However, I would like to make that
>>         a change distinct from this series.
>>
>>         The current code is very problematic for some people (to the
>>         point where the majority of `hg status` time is spent in that
>>         function). I would like to get fast code for the same semantics
>>         first, then look into changing the semantics.
>>
>>
>>     Is your concern that you might regress in performance of
>>     something by changing how large the groups are? Or that it would
>>     be more work?
>
>     My concern is that we are currently delaying a simple code change
>     that yields a huge improvement because there is the option of a
>     larger, riskier one.
>
>     I'm not dismissing your findings about lifting the limit and
>     avoiding the regexp joining. They look very promising and we are
>     planning to explore them. However, they are a different change.
>
>     The current series preserves all the assumptions we have been
>     using for the past 10 years, implementing faster code for the
>     same result. This is very low risk and solves the most immediate
>     problem of a user group we support.
>
>     Dropping the limit is a different matter. It is probably harmless
>     to do so, but Mercurial runs on a variety of systems and regexp
>     engines. We'll have to look at all the cases to make sure our new
>     assumptions are valid.
>     The same goes for stopping the grouping: we'll have to check the
>     behavior on various real-life use cases to make sure it is valid.
>
>
> I suspect it will be best (depending on the regex engine) to either
> not do any grouping, or to put all patterns in one group. I'd like
> someone to look at how re2 behaves here.
We'll look into it. Can we get the current changes landed in the meantime?
>
>
>     We are happy to explore these two ideas ourselves,
>
>
> Thanks!
>
>     but I would prefer to see the faster grouping land first. This is
>     a good intermediate step that improves the current situation.
>
>     In addition, we won't have direct access to the real-life
>     repository for about two months, so it will be hard for us to
>     check the timing of a significantly new approach on real-life data.
>
>>
>>     I tried creating a regex for *every* pattern and that actually
>>     seemed faster (to my surprise), both when creating the matcher
>>     and when evaluating it. I tried it on the mozilla-unified repo
>>     both with 1k files and with 10k files in the hgignores. I used
>>     the following patch on top of your series.
>     Could you share more details on the patterns you generated, the
>     operation you performed and how you gathered the timings?
>
>
> Have to run to a meeting. Will get back to you later.
>
>>
>>     diff --git a/mercurial/match.py b/mercurial/match.py
>>     --- a/mercurial/match.py
>>     +++ b/mercurial/match.py
>>     @@ -1184,51 +1184,15 @@ def _buildmatch(kindpats, globsuffix, li
>>          else:
>>              return regex, lambda f: any(mf(f) for mf in matchfuncs)
>>
>>     -MAX_RE_SIZE = 20000
>>     -_BASE_SIZE = len('(?:)') - 1
>>     -
>>     -def _joinregexes(regexps):
>>     -    """gather multiple regular expressions into a single one"""
>>     -    return '(?:%s)' % '|'.join(regexps)
>>     -
>>      def _buildregexmatch(kindpats, globsuffix):
>>          """Build a match function from a list of kinds and kindpats,
>>          return regexp string and a matcher function.
>>     -
>>     -    Test too large input
>>     -    >>> _buildregexmatch([
>>     -    ...     ('relglob', '?' * MAX_RE_SIZE, '')
>>     -    ... ], '$')
>>     -    Traceback (most recent call last):
>>     -    ...
>>     -    Abort: matcher pattern is too long (20009 bytes)
>>          """
>>          try:
>>     -        allgroups = []
>>              regexps = [_regex(k, p, globsuffix) for (k, p, s) in kindpats]
>>     -        fullregexp = _joinregexes(regexps)
>>     -
>>     -        startidx = 0
>>     -        groupsize = _BASE_SIZE
>>     -        for idx, r in enumerate(regexps):
>>     -            piecesize = len(r)
>>     -            if (piecesize + 4) > MAX_RE_SIZE:
>>     -                msg = _("matcher pattern is too long (%d bytes)") % piecesize
>>     -                raise error.Abort(msg)
>>     -            elif (groupsize + 1 + piecesize) > MAX_RE_SIZE:
>>     -                group = regexps[startidx:idx]
>>     -                allgroups.append(_joinregexes(group))
>>     -                startidx = idx
>>     -                groupsize = _BASE_SIZE
>>     -            groupsize += piecesize + 1
>>     -
>>     -        if startidx == 0:
>>     -            func = _rematcher(fullregexp)
>>     -        else:
>>     -            group = regexps[startidx:]
>>     -            allgroups.append(_joinregexes(group))
>>     -            allmatchers = [_rematcher(g) for g in allgroups]
>>     -            func = lambda s: any(m(s) for m in allmatchers)
>>     +        fullregexp = '(?:%s)' % '|'.join(regexps)
>>     +        allmatchers = [_rematcher(r) for r in regexps]
>>     +        func = lambda s: any(m(s) for m in allmatchers)
>>              return fullregexp, func
>>          except re.error:
>>              for k, p, s in kindpats:
>>
>>
>>
>>
>>         > That said, combining more chunks of regex patterns might
>>         > be likely to lead to another funny problem.
>>         >
>>         > % python -c 'import re; re.compile("(a)" * 100)'
>>         > Traceback (most recent call last):
>>         >   File "<string>", line 1, in <module>
>>         >   File "/usr/lib/python2.7/re.py", line 194, in compile
>>         >     return _compile(pattern, flags)
>>         >   File "/usr/lib/python2.7/re.py", line 249, in _compile
>>         >     p = sre_compile.compile(pattern, flags)
>>         >   File "/usr/lib/python2.7/sre_compile.py", line 583, in compile
>>         >     "sorry, but this version only supports 100 named groups"
>>         > AssertionError: sorry, but this version only supports 100 named groups
>>         >
>>         > It's unrelated to the OverflowError issue, but splitting
>>         > patterns could help avoid the 100-named-group problem.
>>
>>         By chance, my current gigantic use case does not involve
>>         named groups.
>>
>>         Catching AssertionError will be fun. I wish there were some
>>         clean API to expose and check engine limitations.
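[Editor's note] One way splitting patterns can sidestep a per-regex group-count limit is to pack patterns into chunks by capture-group count before joining. This is a hypothetical sketch, not a `match.py` function; the old sre limit of 100 groups per compiled pattern is assumed, and each pattern is compiled once up front just to count its groups (wasteful, but simple for illustration):

```python
import re


def chunk_patterns(patterns, max_groups=99):
    """Pack regex patterns into chunks so that each joined alternation
    stays at or under max_groups capture groups (old sre versions
    refused to compile more than 100 groups in one pattern)."""
    chunks = []
    current, ngroups = [], 0
    for p in patterns:
        n = re.compile(p).groups  # number of capture groups in this pattern
        if current and ngroups + n > max_groups:
            chunks.append(current)
            current, ngroups = [], 0
        current.append(p)
        ngroups += n
    if current:
        chunks.append(current)
    # join each chunk into a single non-capturing alternation
    return ['(?:%s)' % '|'.join(c) for c in chunks]
```

For 150 patterns of one group each this yields two joined regexes (99 + 51 groups), each individually compilable by an engine with the 100-group limit; a string is then matched by trying each chunk in turn.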
>>
>>         > _______________________________________________
>>         > Mercurial-devel mailing list
>>         > Mercurial-devel@mercurial-scm.org
>>         > https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
>>
>>
>
>

Patch

diff --git a/mercurial/match.py b/mercurial/match.py
--- a/mercurial/match.py
+++ b/mercurial/match.py
@@ -1184,51 +1184,15 @@  def _buildmatch(kindpats, globsuffix, li
     else:
         return regex, lambda f: any(mf(f) for mf in matchfuncs)

-MAX_RE_SIZE = 20000
-_BASE_SIZE = len('(?:)') - 1
-
-def _joinregexes(regexps):
-    """gather multiple regular expressions into a single one"""
-    return '(?:%s)' % '|'.join(regexps)
-
 def _buildregexmatch(kindpats, globsuffix):
     """Build a match function from a list of kinds and kindpats,
     return regexp string and a matcher function.
-
-    Test too large input
-    >>> _buildregexmatch([
-    ...     ('relglob', '?' * MAX_RE_SIZE, '')
-    ... ], '$')
-    Traceback (most recent call last):
-    ...
-    Abort: matcher pattern is too long (20009 bytes)
     """
     try:
-        allgroups = []
         regexps = [_regex(k, p, globsuffix) for (k, p, s) in kindpats]
-        fullregexp = _joinregexes(regexps)
-
-        startidx = 0
-        groupsize = _BASE_SIZE
-        for idx, r in enumerate(regexps):
-            piecesize = len(r)
-            if (piecesize + 4) > MAX_RE_SIZE:
-            msg = _("matcher pattern is too long (%d bytes)") % piecesize
-                raise error.Abort(msg)
-            elif (groupsize + 1 + piecesize) > MAX_RE_SIZE:
-                group = regexps[startidx:idx]
-                allgroups.append(_joinregexes(group))
-                startidx = idx
-                groupsize = _BASE_SIZE
-            groupsize += piecesize + 1
-
-        if startidx == 0:
-            func = _rematcher(fullregexp)
-        else:
-            group = regexps[startidx:]
-            allgroups.append(_joinregexes(group))
-            allmatchers = [_rematcher(g) for g in allgroups]
-            func = lambda s: any(m(s) for m in allmatchers)
+        fullregexp = '(?:%s)' % '|'.join(regexps)
+        allmatchers = [_rematcher(r) for r in regexps]
+        func = lambda s: any(m(s) for m in allmatchers)
         return fullregexp, func
     except re.error:
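[Editor's note] For reference, the greedy grouping this experimental patch removes (and which the series under review keeps) can be sketched as a standalone function. `group_regexps` is a hypothetical name, and error handling is simplified to `ValueError` instead of `error.Abort`:

```python
import re

MAX_RE_SIZE = 20000
_BASE_SIZE = len('(?:)') - 1  # wrapper overhead, minus one '|' separator


def _joinregexes(regexps):
    """gather multiple regular expressions into a single one"""
    return '(?:%s)' % '|'.join(regexps)


def group_regexps(regexps, maxsize=MAX_RE_SIZE):
    """Greedily pack patterns, left to right, into joined groups whose
    length stays under maxsize. A single pattern that cannot fit even
    in its own group is rejected."""
    groups = []
    startidx, groupsize = 0, _BASE_SIZE
    for idx, r in enumerate(regexps):
        piecesize = len(r)
        if piecesize + 4 > maxsize:
            raise ValueError('pattern too long (%d bytes)' % piecesize)
        elif groupsize + 1 + piecesize > maxsize:
            # flush the current group and start a new one at idx
            groups.append(_joinregexes(regexps[startidx:idx]))
            startidx, groupsize = idx, _BASE_SIZE
        groupsize += piecesize + 1
    groups.append(_joinregexes(regexps[startidx:]))
    return groups
```

Each returned group is a valid regex no longer than `maxsize`; a matcher then compiles each group and accepts a string if any group matches, which is the behavior the doctest removed above was exercising at the limit.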