Patchwork [4,of,5,V4] match: avoid translating glob to matcher multiple times for large sets

login
register
mail settings
Submitter Boris Feld
Date Dec. 1, 2018, 12:07 p.m.
Message ID <0a28ff17100a67c21a99.1543666054@pc62.home>
Download mbox | patch
Permalink /patch/36895/
State Accepted
Headers show

Comments

Boris Feld - Dec. 1, 2018, 12:07 p.m.
# HG changeset patch
# User Boris Feld <boris.feld@octobus.net>
# Date 1542916922 -3600
#      Thu Nov 22 21:02:02 2018 +0100
# Node ID 0a28ff17100a67c21a99ef363f15aef09e4dfa8b
# Parent  066912081df7b43ed271bea4475e35661e16c1cf
# EXP-Topic perf-ignore
# Available At https://bitbucket.org/octobus/mercurial-devel/
#              hg pull https://bitbucket.org/octobus/mercurial-devel/ -r 0a28ff17100a
match: avoid translating glob to matcher multiple times for large sets

For hgignore with many globs, the resulting regexp might not fit under the 20K
length limit. So the patterns need to be broken up in smaller pieces.

Before this change, the logic was re-starting the full process from scratch
for each smaller pieces, including the translation of globs into regexp.
Effectively doing the work over and over.

If the 20K limit is reached, we are likely in a case where there is many such
glob, so exporting them is especially expensive and we should be careful not
to do that work more than once.

To work around this, we now translate glob to regexp once and for all. Then,
we assemble the resulting individual regexp into valid blocks.

This raises a very significant performance win for large `.hgignore file`:

Before: ! wall 0.153153 comb 0.150000 user 0.150000 sys 0.000000 (median of 66)
After:  ! wall 0.059793 comb 0.060000 user 0.060000 sys 0.000000 (median of 100)
via Mercurial-devel - Dec. 2, 2018, 12:38 a.m.
I have a few more questions. Sorry I didn't notice last time (because I
only considered the higher-level issues).

On Sat, Dec 1, 2018 at 4:07 AM Boris Feld <boris.feld@octobus.net> wrote:

> # HG changeset patch
> # User Boris Feld <boris.feld@octobus.net>
> # Date 1542916922 -3600
> #      Thu Nov 22 21:02:02 2018 +0100
> # Node ID 0a28ff17100a67c21a99ef363f15aef09e4dfa8b
> # Parent  066912081df7b43ed271bea4475e35661e16c1cf
> # EXP-Topic perf-ignore
> # Available At https://bitbucket.org/octobus/mercurial-devel/
> #              hg pull https://bitbucket.org/octobus/mercurial-devel/ -r
> 0a28ff17100a
> match: avoid translating glob to matcher multiple times for large sets
>
> For hgignore with many globs, the resulting regexp might not fit under the
> 20K
> length limit. So the patterns need to be broken up in smaller pieces.
>
> Before this change, the logic was re-starting the full process from scratch
> for each smaller pieces, including the translation of globs into regexp.
> Effectively doing the work over and over.
>
> If the 20K limit is reached, we are likely in a case where there is many
> such
> glob, so exporting them is especially expensive and we should be careful
> not
> to do that work more than once.
>
> To work around this, we now translate glob to regexp once and for all.
> Then,
> we assemble the resulting individual regexp into valid blocks.
>
> This raises a very significant performance win for large `.hgignore file`:
>
> Before: ! wall 0.153153 comb 0.150000 user 0.150000 sys 0.000000 (median
> of 66)
> After:  ! wall 0.059793 comb 0.060000 user 0.060000 sys 0.000000 (median
> of 100)
>
> diff --git a/mercurial/match.py b/mercurial/match.py
> --- a/mercurial/match.py
> +++ b/mercurial/match.py
> @@ -1185,6 +1185,7 @@ def _buildmatch(kindpats, globsuffix, li
>          return regex, lambda f: any(mf(f) for mf in matchfuncs)
>
>  MAX_RE_SIZE = 20000
> +_BASE_SIZE = len('(?:)') - 1
>

I think I asked before why MAX_RE_SIZE is "public" and _BASE_SIZE is
"private".

Also, why do you subtract 1 from BASE_SIZE?


>
>  def _joinregexes(regexps):
>      """gather multiple regular expressions into a single one"""
> @@ -1203,20 +1204,31 @@ def _buildregexmatch(kindpats, globsuffi
>      OverflowError
>      """
>      try:
> -        regex = _joinregexes([_regex(k, p, globsuffix)
> -                                     for (k, p, s) in kindpats])
> -        if len(regex) <= MAX_RE_SIZE:
> -            return regex, _rematcher(regex)
> -        # We're using a Python with a tiny regex engine and we
> -        # made it explode, so we'll divide the pattern list in two
> -        # until it works
> -        l = len(kindpats)
> -        if l < 2:
> -            # TODO: raise error.Abort here
> -            raise OverflowError
> -        regexa, a = _buildregexmatch(kindpats[:l//2], globsuffix)
> -        regexb, b = _buildregexmatch(kindpats[l//2:], globsuffix)
> -        return regex, lambda s: a(s) or b(s)
> +        allgroups = []
> +        regexps = [_regex(k, p, globsuffix) for (k, p, s) in kindpats]
> +        fullregexp = _joinregexes(regexps)
> +
> +        startidx = 0
> +        groupsize = _BASE_SIZE
> +        for idx, r in enumerate(regexps):
> +            piecesize = len(r)
>

This size is excluding the unnamed group wrapping, right? (I'm not asking
for a change here, just as context for later questions.)

+            if (piecesize + 4) > MAX_RE_SIZE:
>

What's this 4? I think it's for the unnamed group wrapping, i.e.
(BASE_SIZE + 1). Use that?


> +                raise OverflowError
> +            elif (groupsize + 1 + piecesize) > MAX_RE_SIZE:
> +                group = regexps[startidx:idx]
> +                allgroups.append(_joinregexes(group))
> +                startidx = idx
> +                groupsize = _BASE_SIZE
> +            groupsize += piecesize + 1
>

I suppose the "1" is for the "|"-separator. Shouldn't we add (BASE_SIZE +
1) here too (and maybe redefine BASE_SIZE to include the additional 1)?
Perhaps the

Would it be correct to initialize groupsize to 0 and add

+
> +        if startidx == 0:
> +            func = _rematcher(fullregexp)
> +        else:
> +            group = regexps[startidx:]
> +            allgroups.append(_joinregexes(group))
> +            allmatchers = [_rematcher(g) for g in allgroups]
> +            func = lambda s: any(m(s) for m in allmatchers)
> +        return fullregexp, func
>      except re.error:
>          for k, p, s in kindpats:
>              try:
>
Boris Feld - Dec. 2, 2018, 3:26 p.m.
On 02/12/2018 01:38, Martin von Zweigbergk via Mercurial-devel wrote:
> I have a few more questions. Sorry I didn't notice last time (because
> I only considered the higher-level issues).
>
> On Sat, Dec 1, 2018 at 4:07 AM Boris Feld <boris.feld@octobus.net
> <mailto:boris.feld@octobus.net>> wrote:
>
>     # HG changeset patch
>     # User Boris Feld <boris.feld@octobus.net
>     <mailto:boris.feld@octobus.net>>
>     # Date 1542916922 -3600
>     #      Thu Nov 22 21:02:02 2018 +0100
>     # Node ID 0a28ff17100a67c21a99ef363f15aef09e4dfa8b
>     # Parent  066912081df7b43ed271bea4475e35661e16c1cf
>     # EXP-Topic perf-ignore
>     # Available At https://bitbucket.org/octobus/mercurial-devel/
>     #              hg pull
>     https://bitbucket.org/octobus/mercurial-devel/ -r 0a28ff17100a
>     match: avoid translating glob to matcher multiple times for large sets
>
>     For hgignore with many globs, the resulting regexp might not fit
>     under the 20K
>     length limit. So the patterns need to be broken up in smaller pieces.
>
>     Before this change, the logic was re-starting the full process
>     from scratch
>     for each smaller pieces, including the translation of globs into
>     regexp.
>     Effectively doing the work over and over.
>
>     If the 20K limit is reached, we are likely in a case where there
>     is many such
>     glob, so exporting them is especially expensive and we should be
>     careful not
>     to do that work more than once.
>
>     To work around this, we now translate glob to regexp once and for
>     all. Then,
>     we assemble the resulting individual regexp into valid blocks.
>
>     This raises a very significant performance win for large
>     `.hgignore file`:
>
>     Before: ! wall 0.153153 comb 0.150000 user 0.150000 sys 0.000000
>     (median of 66)
>     After:  ! wall 0.059793 comb 0.060000 user 0.060000 sys 0.000000
>     (median of 100)
>
>     diff --git a/mercurial/match.py b/mercurial/match.py
>     --- a/mercurial/match.py
>     +++ b/mercurial/match.py
>     @@ -1185,6 +1185,7 @@ def _buildmatch(kindpats, globsuffix, li
>              return regex, lambda f: any(mf(f) for mf in matchfuncs)
>
>      MAX_RE_SIZE = 20000
>     +_BASE_SIZE = len('(?:)') - 1
>
>
> I think I asked before why MAX_RE_SIZE is "public" and _BASE_SIZE is
> "private".

MAX_RE_SIZE is a limit that impacts the external world (since RE larger
than 20 000 are rejected). The other one is a pure implementation detail.

I don't feel strongly about it and I'm fine aligning them one way or
another if you feel like it.
>
> Also, why do you subtract 1 from BASE_SIZE?
Because for N joined regexp there will be N-1 '|'. See lower for more
details
>  
>
>
>      def _joinregexes(regexps):
>          """gather multiple regular expressions into a single one"""
>     @@ -1203,20 +1204,31 @@ def _buildregexmatch(kindpats, globsuffi
>          OverflowError
>          """
>          try:
>     -        regex = _joinregexes([_regex(k, p, globsuffix)
>     -                                     for (k, p, s) in kindpats])
>     -        if len(regex) <= MAX_RE_SIZE:
>     -            return regex, _rematcher(regex)
>     -        # We're using a Python with a tiny regex engine and we
>     -        # made it explode, so we'll divide the pattern list in two
>     -        # until it works
>     -        l = len(kindpats)
>     -        if l < 2:
>     -            # TODO: raise error.Abort here
>     -            raise OverflowError
>     -        regexa, a = _buildregexmatch(kindpats[:l//2], globsuffix)
>     -        regexb, b = _buildregexmatch(kindpats[l//2:], globsuffix)
>     -        return regex, lambda s: a(s) or b(s)
>     +        allgroups = []
>     +        regexps = [_regex(k, p, globsuffix) for (k, p, s) in
>     kindpats]
>     +        fullregexp = _joinregexes(regexps)
>     +
>     +        startidx = 0
>     +        groupsize = _BASE_SIZE
>     +        for idx, r in enumerate(regexps):
>     +            piecesize = len(r)
>
>
> This size is excluding the unnamed group wrapping, right? (I'm not
> asking for a change here, just as context for later questions.)
Yes, this is the pure size of the regexps produced by `_regex`.
>
>     +            if (piecesize + 4) > MAX_RE_SIZE:
>
>
> What's this 4? I think it's for the unnamed group wrapping, i.e.
> (BASE_SIZE + 1). Use that?
Yes, another way to do this would be to avoid the `-1` on _BASE_SIZE and
applying it explicitly in the `groupsize = _BASE_SIZE` lines (`groupsize
= _BASE_SIZE + 1`)
>  
>
>     +                raise OverflowError
>     +            elif (groupsize + 1 + piecesize) > MAX_RE_SIZE:
>     +                group = regexps[startidx:idx]
>     +                allgroups.append(_joinregexes(group))
>     +                startidx = idx
>     +                groupsize = _BASE_SIZE
>     +            groupsize += piecesize + 1
>
>
> I suppose the "1" is for the "|"-separator. Shouldn't we add
> (BASE_SIZE + 1) here too
The 1 is for the |. we must not add _BASE_SIZE since there is only a
single external grouping all joined regexp. _BASE_SIZE added once for
each group.
> (and maybe redefine BASE_SIZE to include the additional 1)? Perhaps the
Sentence seems to have been cut.
>
> Would it be correct to initialize groupsize to 0 and add
This sentence seems to have been cut too.
>
>     +
>     +        if startidx == 0:
>     +            func = _rematcher(fullregexp)
>     +        else:
>     +            group = regexps[startidx:]
>     +            allgroups.append(_joinregexes(group))
>     +            allmatchers = [_rematcher(g) for g in allgroups]
>     +            func = lambda s: any(m(s) for m in allmatchers)
>     +        return fullregexp, func
>          except re.error:
>              for k, p, s in kindpats:
>                  try:
>
>
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@mercurial-scm.org
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
via Mercurial-devel - Dec. 2, 2018, 8:51 p.m.
On Sun, Dec 2, 2018 at 7:26 AM Boris FELD <boris.feld@octobus.net> wrote:

>
> On 02/12/2018 01:38, Martin von Zweigbergk via Mercurial-devel wrote:
>
> I have a few more questions. Sorry I didn't notice last time (because I
> only considered the higher-level issues).
>
> On Sat, Dec 1, 2018 at 4:07 AM Boris Feld <boris.feld@octobus.net> wrote:
>
>> # HG changeset patch
>> # User Boris Feld <boris.feld@octobus.net>
>> # Date 1542916922 -3600
>> #      Thu Nov 22 21:02:02 2018 +0100
>> # Node ID 0a28ff17100a67c21a99ef363f15aef09e4dfa8b
>> # Parent  066912081df7b43ed271bea4475e35661e16c1cf
>> # EXP-Topic perf-ignore
>> # Available At https://bitbucket.org/octobus/mercurial-devel/
>> #              hg pull https://bitbucket.org/octobus/mercurial-devel/ -r
>> 0a28ff17100a
>> match: avoid translating glob to matcher multiple times for large sets
>>
>> For hgignore with many globs, the resulting regexp might not fit under
>> the 20K
>> length limit. So the patterns need to be broken up in smaller pieces.
>>
>> Before this change, the logic was re-starting the full process from
>> scratch
>> for each smaller pieces, including the translation of globs into regexp.
>> Effectively doing the work over and over.
>>
>> If the 20K limit is reached, we are likely in a case where there is many
>> such
>> glob, so exporting them is especially expensive and we should be careful
>> not
>> to do that work more than once.
>>
>> To work around this, we now translate glob to regexp once and for all.
>> Then,
>> we assemble the resulting individual regexp into valid blocks.
>>
>> This raises a very significant performance win for large `.hgignore file`:
>>
>> Before: ! wall 0.153153 comb 0.150000 user 0.150000 sys 0.000000 (median
>> of 66)
>> After:  ! wall 0.059793 comb 0.060000 user 0.060000 sys 0.000000 (median
>> of 100)
>>
>> diff --git a/mercurial/match.py b/mercurial/match.py
>> --- a/mercurial/match.py
>> +++ b/mercurial/match.py
>> @@ -1185,6 +1185,7 @@ def _buildmatch(kindpats, globsuffix, li
>>          return regex, lambda f: any(mf(f) for mf in matchfuncs)
>>
>>  MAX_RE_SIZE = 20000
>> +_BASE_SIZE = len('(?:)') - 1
>>
>
> I think I asked before why MAX_RE_SIZE is "public" and _BASE_SIZE is
> "private".
>
>
> MAX_RE_SIZE is a limit that impacts the external world (since RE larger
> than 20 000 are rejected). The other one is a pure implementation detail.
>
> I don't feel strongly about it and I'm fine aligning them one way or
> another if you feel like it.
>
>
> Also, why do you subtract 1 from BASE_SIZE?
>
> Because for N joined regexp there will be N-1 '|'. See lower for more
> details
>
>
>
>>
>>  def _joinregexes(regexps):
>>      """gather multiple regular expressions into a single one"""
>> @@ -1203,20 +1204,31 @@ def _buildregexmatch(kindpats, globsuffi
>>      OverflowError
>>      """
>>      try:
>> -        regex = _joinregexes([_regex(k, p, globsuffix)
>> -                                     for (k, p, s) in kindpats])
>> -        if len(regex) <= MAX_RE_SIZE:
>> -            return regex, _rematcher(regex)
>> -        # We're using a Python with a tiny regex engine and we
>> -        # made it explode, so we'll divide the pattern list in two
>> -        # until it works
>> -        l = len(kindpats)
>> -        if l < 2:
>> -            # TODO: raise error.Abort here
>> -            raise OverflowError
>> -        regexa, a = _buildregexmatch(kindpats[:l//2], globsuffix)
>> -        regexb, b = _buildregexmatch(kindpats[l//2:], globsuffix)
>> -        return regex, lambda s: a(s) or b(s)
>> +        allgroups = []
>> +        regexps = [_regex(k, p, globsuffix) for (k, p, s) in kindpats]
>> +        fullregexp = _joinregexes(regexps)
>> +
>> +        startidx = 0
>> +        groupsize = _BASE_SIZE
>> +        for idx, r in enumerate(regexps):
>> +            piecesize = len(r)
>>
>
> This size is excluding the unnamed group wrapping, right? (I'm not asking
> for a change here, just as context for later questions.)
>
> Yes, this is the pure size of the regexps produced by `_regex`.
>
>
> +            if (piecesize + 4) > MAX_RE_SIZE:
>>
>
> What's this 4? I think it's for the unnamed group wrapping, i.e.
> (BASE_SIZE + 1). Use that?
>
> Yes, another way to do this would be to avoid the `-1` on _BASE_SIZE and
> applying it explicitly in the `groupsize = _BASE_SIZE` lines (`groupsize =
> _BASE_SIZE + 1`)
>

Okay, as long as it's not wrong in this version of the patch, we can just
do that in a follow-up if we decide to do it at all. Queued, thanks.

>
>
>> +                raise OverflowError
>> +            elif (groupsize + 1 + piecesize) > MAX_RE_SIZE:
>> +                group = regexps[startidx:idx]
>> +                allgroups.append(_joinregexes(group))
>> +                startidx = idx
>> +                groupsize = _BASE_SIZE
>> +            groupsize += piecesize + 1
>>
>
> I suppose the "1" is for the "|"-separator. Shouldn't we add (BASE_SIZE +
> 1) here too
>
> The 1 is for the |. we must not add _BASE_SIZE since there is only a
> single external grouping all joined regexp. _BASE_SIZE added once for each
> group.
>
> (and maybe redefine BASE_SIZE to include the additional 1)? Perhaps the
>
> Sentence seems to have been cut.
>

Oops, yes. (I got interrupted and set it in rush.)


>
> Would it be correct to initialize groupsize to 0 and add
>
> This sentence seems to have been cut too.
>
>
> +
>> +        if startidx == 0:
>> +            func = _rematcher(fullregexp)
>> +        else:
>> +            group = regexps[startidx:]
>> +            allgroups.append(_joinregexes(group))
>> +            allmatchers = [_rematcher(g) for g in allgroups]
>> +            func = lambda s: any(m(s) for m in allmatchers)
>> +        return fullregexp, func
>>      except re.error:
>>          for k, p, s in kindpats:
>>              try:
>>
>
> _______________________________________________
> Mercurial-devel mailing listMercurial-devel@mercurial-scm.orghttps://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
>
>

Patch

diff --git a/mercurial/match.py b/mercurial/match.py
--- a/mercurial/match.py
+++ b/mercurial/match.py
@@ -1185,6 +1185,7 @@  def _buildmatch(kindpats, globsuffix, li
         return regex, lambda f: any(mf(f) for mf in matchfuncs)
 
 MAX_RE_SIZE = 20000
+_BASE_SIZE = len('(?:)') - 1
 
 def _joinregexes(regexps):
     """gather multiple regular expressions into a single one"""
@@ -1203,20 +1204,31 @@  def _buildregexmatch(kindpats, globsuffi
     OverflowError
     """
     try:
-        regex = _joinregexes([_regex(k, p, globsuffix)
-                                     for (k, p, s) in kindpats])
-        if len(regex) <= MAX_RE_SIZE:
-            return regex, _rematcher(regex)
-        # We're using a Python with a tiny regex engine and we
-        # made it explode, so we'll divide the pattern list in two
-        # until it works
-        l = len(kindpats)
-        if l < 2:
-            # TODO: raise error.Abort here
-            raise OverflowError
-        regexa, a = _buildregexmatch(kindpats[:l//2], globsuffix)
-        regexb, b = _buildregexmatch(kindpats[l//2:], globsuffix)
-        return regex, lambda s: a(s) or b(s)
+        allgroups = []
+        regexps = [_regex(k, p, globsuffix) for (k, p, s) in kindpats]
+        fullregexp = _joinregexes(regexps)
+
+        startidx = 0
+        groupsize = _BASE_SIZE
+        for idx, r in enumerate(regexps):
+            piecesize = len(r)
+            if (piecesize + 4) > MAX_RE_SIZE:
+                raise OverflowError
+            elif (groupsize + 1 + piecesize) > MAX_RE_SIZE:
+                group = regexps[startidx:idx]
+                allgroups.append(_joinregexes(group))
+                startidx = idx
+                groupsize = _BASE_SIZE
+            groupsize += piecesize + 1
+
+        if startidx == 0:
+            func = _rematcher(fullregexp)
+        else:
+            group = regexps[startidx:]
+            allgroups.append(_joinregexes(group))
+            allmatchers = [_rematcher(g) for g in allgroups]
+            func = lambda s: any(m(s) for m in allmatchers)
+        return fullregexp, func
     except re.error:
         for k, p, s in kindpats:
             try: