Submitter | Boris Feld |
---|---|
Date | Dec. 1, 2018, 12:07 p.m. |
Message ID | <0a28ff17100a67c21a99.1543666054@pc62.home> |
Download | mbox | patch |
Permalink | /patch/36895/ |
State | Accepted |
Headers | show |
Comments
I have a few more questions. Sorry I didn't notice last time (because I only considered the higher-level issues). On Sat, Dec 1, 2018 at 4:07 AM Boris Feld <boris.feld@octobus.net> wrote: > # HG changeset patch > # User Boris Feld <boris.feld@octobus.net> > # Date 1542916922 -3600 > # Thu Nov 22 21:02:02 2018 +0100 > # Node ID 0a28ff17100a67c21a99ef363f15aef09e4dfa8b > # Parent 066912081df7b43ed271bea4475e35661e16c1cf > # EXP-Topic perf-ignore > # Available At https://bitbucket.org/octobus/mercurial-devel/ > # hg pull https://bitbucket.org/octobus/mercurial-devel/ -r > 0a28ff17100a > match: avoid translating glob to matcher multiple times for large sets > > For hgignore with many globs, the resulting regexp might not fit under the > 20K > length limit. So the patterns need to be broken up in smaller pieces. > > Before this change, the logic was re-starting the full process from scratch > for each smaller pieces, including the translation of globs into regexp. > Effectively doing the work over and over. > > If the 20K limit is reached, we are likely in a case where there is many > such > glob, so exporting them is especially expensive and we should be careful > not > to do that work more than once. > > To work around this, we now translate glob to regexp once and for all. > Then, > we assemble the resulting individual regexp into valid blocks. > > This raises a very significant performance win for large `.hgignore file`: > > Before: ! wall 0.153153 comb 0.150000 user 0.150000 sys 0.000000 (median > of 66) > After: ! wall 0.059793 comb 0.060000 user 0.060000 sys 0.000000 (median > of 100) > > diff --git a/mercurial/match.py b/mercurial/match.py > --- a/mercurial/match.py > +++ b/mercurial/match.py > @@ -1185,6 +1185,7 @@ def _buildmatch(kindpats, globsuffix, li > return regex, lambda f: any(mf(f) for mf in matchfuncs) > > MAX_RE_SIZE = 20000 > +_BASE_SIZE = len('(?:)') - 1 > I think I asked before why MAX_RE_SIZE is "public" and _BASE_SIZE is "private". Also, why do you subtract 1 from BASE_SIZE? > > def _joinregexes(regexps): > """gather multiple regular expressions into a single one""" > @@ -1203,20 +1204,31 @@ def _buildregexmatch(kindpats, globsuffi > OverflowError > """ > try: > - regex = _joinregexes([_regex(k, p, globsuffix) > - for (k, p, s) in kindpats]) > - if len(regex) <= MAX_RE_SIZE: > - return regex, _rematcher(regex) > - # We're using a Python with a tiny regex engine and we > - # made it explode, so we'll divide the pattern list in two > - # until it works > - l = len(kindpats) > - if l < 2: > - # TODO: raise error.Abort here > - raise OverflowError > - regexa, a = _buildregexmatch(kindpats[:l//2], globsuffix) > - regexb, b = _buildregexmatch(kindpats[l//2:], globsuffix) > - return regex, lambda s: a(s) or b(s) > + allgroups = [] > + regexps = [_regex(k, p, globsuffix) for (k, p, s) in kindpats] > + fullregexp = _joinregexes(regexps) > + > + startidx = 0 > + groupsize = _BASE_SIZE > + for idx, r in enumerate(regexps): > + piecesize = len(r) > This size is excluding the unnamed group wrapping, right? (I'm not asking for a change here, just as context for later questions.) + if (piecesize + 4) > MAX_RE_SIZE: > What's this 4? I think it's for the unnamed group wrapping, i.e. (BASE_SIZE + 1). Use that? > + raise OverflowError > + elif (groupsize + 1 + piecesize) > MAX_RE_SIZE: > + group = regexps[startidx:idx] > + allgroups.append(_joinregexes(group)) > + startidx = idx > + groupsize = _BASE_SIZE > + groupsize += piecesize + 1 > I suppose the "1" is for the "|"-separator. Shouldn't we add (BASE_SIZE + 1) here too (and maybe redefine BASE_SIZE to include the additional 1)? Perhaps the Would it be correct to initialize groupsize to 0 and add + > + if startidx == 0: > + func = _rematcher(fullregexp) > + else: > + group = regexps[startidx:] > + allgroups.append(_joinregexes(group)) > + allmatchers = [_rematcher(g) for g in allgroups] > + func = lambda s: any(m(s) for m in allmatchers) > + return fullregexp, func > except re.error: > for k, p, s in kindpats: > try: >
On 02/12/2018 01:38, Martin von Zweigbergk via Mercurial-devel wrote: > I have a few more questions. Sorry I didn't notice last time (because > I only considered the higher-level issues). > > On Sat, Dec 1, 2018 at 4:07 AM Boris Feld <boris.feld@octobus.net > <mailto:boris.feld@octobus.net>> wrote: > > # HG changeset patch > # User Boris Feld <boris.feld@octobus.net > <mailto:boris.feld@octobus.net>> > # Date 1542916922 -3600 > # Thu Nov 22 21:02:02 2018 +0100 > # Node ID 0a28ff17100a67c21a99ef363f15aef09e4dfa8b > # Parent 066912081df7b43ed271bea4475e35661e16c1cf > # EXP-Topic perf-ignore > # Available At https://bitbucket.org/octobus/mercurial-devel/ > # hg pull > https://bitbucket.org/octobus/mercurial-devel/ -r 0a28ff17100a > match: avoid translating glob to matcher multiple times for large sets > > For hgignore with many globs, the resulting regexp might not fit > under the 20K > length limit. So the patterns need to be broken up in smaller pieces. > > Before this change, the logic was re-starting the full process > from scratch > for each smaller pieces, including the translation of globs into > regexp. > Effectively doing the work over and over. > > If the 20K limit is reached, we are likely in a case where there > is many such > glob, so exporting them is especially expensive and we should be > careful not > to do that work more than once. > > To work around this, we now translate glob to regexp once and for > all. Then, > we assemble the resulting individual regexp into valid blocks. > > This raises a very significant performance win for large > `.hgignore file`: > > Before: ! wall 0.153153 comb 0.150000 user 0.150000 sys 0.000000 > (median of 66) > After: ! wall 0.059793 comb 0.060000 user 0.060000 sys 0.000000 > (median of 100) > > diff --git a/mercurial/match.py b/mercurial/match.py > --- a/mercurial/match.py > +++ b/mercurial/match.py > @@ -1185,6 +1185,7 @@ def _buildmatch(kindpats, globsuffix, li > return regex, lambda f: any(mf(f) for mf in matchfuncs) > > MAX_RE_SIZE = 20000 > +_BASE_SIZE = len('(?:)') - 1 > > > I think I asked before why MAX_RE_SIZE is "public" and _BASE_SIZE is > "private". MAX_RE_SIZE is a limit that impacts the external world (since RE larger than 20 000 are rejected). The other one is a pure implementation detail. I don't feel strongly about it and I'm fine aligning them one way or another if you feel like it. > > Also, why do you subtract 1 from BASE_SIZE? Because for N joined regexp there will be N-1 '|'. See lower for more details > > > > def _joinregexes(regexps): > """gather multiple regular expressions into a single one""" > @@ -1203,20 +1204,31 @@ def _buildregexmatch(kindpats, globsuffi > OverflowError > """ > try: > - regex = _joinregexes([_regex(k, p, globsuffix) > - for (k, p, s) in kindpats]) > - if len(regex) <= MAX_RE_SIZE: > - return regex, _rematcher(regex) > - # We're using a Python with a tiny regex engine and we > - # made it explode, so we'll divide the pattern list in two > - # until it works > - l = len(kindpats) > - if l < 2: > - # TODO: raise error.Abort here > - raise OverflowError > - regexa, a = _buildregexmatch(kindpats[:l//2], globsuffix) > - regexb, b = _buildregexmatch(kindpats[l//2:], globsuffix) > - return regex, lambda s: a(s) or b(s) > + allgroups = [] > + regexps = [_regex(k, p, globsuffix) for (k, p, s) in > kindpats] > + fullregexp = _joinregexes(regexps) > + > + startidx = 0 > + groupsize = _BASE_SIZE > + for idx, r in enumerate(regexps): > + piecesize = len(r) > > > This size is excluding the unnamed group wrapping, right? (I'm not > asking for a change here, just as context for later questions.) Yes, this is the pure size of the regexps produced by `_regex`. > > + if (piecesize + 4) > MAX_RE_SIZE: > > > What's this 4? I think it's for the unnamed group wrapping, i.e. > (BASE_SIZE + 1). Use that? Yes, another way to do this would be to avoid the `-1` on _BASE_SIZE and applying it explicitly in the `groupsize = _BASE_SIZE` lines (`groupsize = _BASE_SIZE + 1`) > > > + raise OverflowError > + elif (groupsize + 1 + piecesize) > MAX_RE_SIZE: > + group = regexps[startidx:idx] > + allgroups.append(_joinregexes(group)) > + startidx = idx > + groupsize = _BASE_SIZE > + groupsize += piecesize + 1 > > > I suppose the "1" is for the "|"-separator. Shouldn't we add > (BASE_SIZE + 1) here too The 1 is for the |. we must not add _BASE_SIZE since there is only a single external grouping all joined regexp. _BASE_SIZE added once for each group. > (and maybe redefine BASE_SIZE to include the additional 1)? Perhaps the Sentence seems to have been cut. > > Would it be correct to initialize groupsize to 0 and add This sentence seems to have been cut too. > > + > + if startidx == 0: > + func = _rematcher(fullregexp) > + else: > + group = regexps[startidx:] > + allgroups.append(_joinregexes(group)) > + allmatchers = [_rematcher(g) for g in allgroups] > + func = lambda s: any(m(s) for m in allmatchers) > + return fullregexp, func > except re.error: > for k, p, s in kindpats: > try: > > > _______________________________________________ > Mercurial-devel mailing list > Mercurial-devel@mercurial-scm.org > https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
On Sun, Dec 2, 2018 at 7:26 AM Boris FELD <boris.feld@octobus.net> wrote: > > On 02/12/2018 01:38, Martin von Zweigbergk via Mercurial-devel wrote: > > I have a few more questions. Sorry I didn't notice last time (because I > only considered the higher-level issues). > > On Sat, Dec 1, 2018 at 4:07 AM Boris Feld <boris.feld@octobus.net> wrote: > >> # HG changeset patch >> # User Boris Feld <boris.feld@octobus.net> >> # Date 1542916922 -3600 >> # Thu Nov 22 21:02:02 2018 +0100 >> # Node ID 0a28ff17100a67c21a99ef363f15aef09e4dfa8b >> # Parent 066912081df7b43ed271bea4475e35661e16c1cf >> # EXP-Topic perf-ignore >> # Available At https://bitbucket.org/octobus/mercurial-devel/ >> # hg pull https://bitbucket.org/octobus/mercurial-devel/ -r >> 0a28ff17100a >> match: avoid translating glob to matcher multiple times for large sets >> >> For hgignore with many globs, the resulting regexp might not fit under >> the 20K >> length limit. So the patterns need to be broken up in smaller pieces. >> >> Before this change, the logic was re-starting the full process from >> scratch >> for each smaller pieces, including the translation of globs into regexp. >> Effectively doing the work over and over. >> >> If the 20K limit is reached, we are likely in a case where there is many >> such >> glob, so exporting them is especially expensive and we should be careful >> not >> to do that work more than once. >> >> To work around this, we now translate glob to regexp once and for all. >> Then, >> we assemble the resulting individual regexp into valid blocks. >> >> This raises a very significant performance win for large `.hgignore file`: >> >> Before: ! wall 0.153153 comb 0.150000 user 0.150000 sys 0.000000 (median >> of 66) >> After: ! wall 0.059793 comb 0.060000 user 0.060000 sys 0.000000 (median >> of 100) >> >> diff --git a/mercurial/match.py b/mercurial/match.py >> --- a/mercurial/match.py >> +++ b/mercurial/match.py >> @@ -1185,6 +1185,7 @@ def _buildmatch(kindpats, globsuffix, li >> return regex, lambda f: any(mf(f) for mf in matchfuncs) >> >> MAX_RE_SIZE = 20000 >> +_BASE_SIZE = len('(?:)') - 1 >> > > I think I asked before why MAX_RE_SIZE is "public" and _BASE_SIZE is > "private". > > > MAX_RE_SIZE is a limit that impacts the external world (since RE larger > than 20 000 are rejected). The other one is a pure implementation detail. > > I don't feel strongly about it and I'm fine aligning them one way or > another if you feel like it. > > > Also, why do you subtract 1 from BASE_SIZE? > > Because for N joined regexp there will be N-1 '|'. See lower for more > details > > > >> >> def _joinregexes(regexps): >> """gather multiple regular expressions into a single one""" >> @@ -1203,20 +1204,31 @@ def _buildregexmatch(kindpats, globsuffi >> OverflowError >> """ >> try: >> - regex = _joinregexes([_regex(k, p, globsuffix) >> - for (k, p, s) in kindpats]) >> - if len(regex) <= MAX_RE_SIZE: >> - return regex, _rematcher(regex) >> - # We're using a Python with a tiny regex engine and we >> - # made it explode, so we'll divide the pattern list in two >> - # until it works >> - l = len(kindpats) >> - if l < 2: >> - # TODO: raise error.Abort here >> - raise OverflowError >> - regexa, a = _buildregexmatch(kindpats[:l//2], globsuffix) >> - regexb, b = _buildregexmatch(kindpats[l//2:], globsuffix) >> - return regex, lambda s: a(s) or b(s) >> + allgroups = [] >> + regexps = [_regex(k, p, globsuffix) for (k, p, s) in kindpats] >> + fullregexp = _joinregexes(regexps) >> + >> + startidx = 0 >> + groupsize = _BASE_SIZE >> + for idx, r in enumerate(regexps): >> + piecesize = len(r) >> > > This size is excluding the unnamed group wrapping, right? (I'm not asking > for a change here, just as context for later questions.) > > Yes, this is the pure size of the regexps produced by `_regex`. > > > + if (piecesize + 4) > MAX_RE_SIZE: >> > > What's this 4? I think it's for the unnamed group wrapping, i.e. > (BASE_SIZE + 1). Use that? > > Yes, another way to do this would be to avoid the `-1` on _BASE_SIZE and > applying it explicitly in the `groupsize = _BASE_SIZE` lines (`groupsize = > _BASE_SIZE + 1`) > Okay, as long as it's not wrong in this version of the patch, we can just do that in a follow-up if we decide to do it at all. Queued, thanks. > > >> + raise OverflowError >> + elif (groupsize + 1 + piecesize) > MAX_RE_SIZE: >> + group = regexps[startidx:idx] >> + allgroups.append(_joinregexes(group)) >> + startidx = idx >> + groupsize = _BASE_SIZE >> + groupsize += piecesize + 1 >> > > I suppose the "1" is for the "|"-separator. Shouldn't we add (BASE_SIZE + > 1) here too > > The 1 is for the |. we must not add _BASE_SIZE since there is only a > single external grouping all joined regexp. _BASE_SIZE added once for each > group. > > (and maybe redefine BASE_SIZE to include the additional 1)? Perhaps the > > Sentence seems to have been cut. > Oops, yes. (I got interrupted and set it in rush.) > > Would it be correct to initialize groupsize to 0 and add > > This sentence seems to have been cut too. > > > + >> + if startidx == 0: >> + func = _rematcher(fullregexp) >> + else: >> + group = regexps[startidx:] >> + allgroups.append(_joinregexes(group)) >> + allmatchers = [_rematcher(g) for g in allgroups] >> + func = lambda s: any(m(s) for m in allmatchers) >> + return fullregexp, func >> except re.error: >> for k, p, s in kindpats: >> try: >> > > _______________________________________________ > Mercurial-devel mailing listMercurial-devel@mercurial-scm.orghttps://www.mercurial-scm.org/mailman/listinfo/mercurial-devel > >
Patch
diff --git a/mercurial/match.py b/mercurial/match.py --- a/mercurial/match.py +++ b/mercurial/match.py @@ -1185,6 +1185,7 @@ def _buildmatch(kindpats, globsuffix, li return regex, lambda f: any(mf(f) for mf in matchfuncs) MAX_RE_SIZE = 20000 +_BASE_SIZE = len('(?:)') - 1 def _joinregexes(regexps): """gather multiple regular expressions into a single one""" @@ -1203,20 +1204,31 @@ def _buildregexmatch(kindpats, globsuffi OverflowError """ try: - regex = _joinregexes([_regex(k, p, globsuffix) - for (k, p, s) in kindpats]) - if len(regex) <= MAX_RE_SIZE: - return regex, _rematcher(regex) - # We're using a Python with a tiny regex engine and we - # made it explode, so we'll divide the pattern list in two - # until it works - l = len(kindpats) - if l < 2: - # TODO: raise error.Abort here - raise OverflowError - regexa, a = _buildregexmatch(kindpats[:l//2], globsuffix) - regexb, b = _buildregexmatch(kindpats[l//2:], globsuffix) - return regex, lambda s: a(s) or b(s) + allgroups = [] + regexps = [_regex(k, p, globsuffix) for (k, p, s) in kindpats] + fullregexp = _joinregexes(regexps) + + startidx = 0 + groupsize = _BASE_SIZE + for idx, r in enumerate(regexps): + piecesize = len(r) + if (piecesize + 4) > MAX_RE_SIZE: + raise OverflowError + elif (groupsize + 1 + piecesize) > MAX_RE_SIZE: + group = regexps[startidx:idx] + allgroups.append(_joinregexes(group)) + startidx = idx + groupsize = _BASE_SIZE + groupsize += piecesize + 1 + + if startidx == 0: + func = _rematcher(fullregexp) + else: + group = regexps[startidx:] + allgroups.append(_joinregexes(group)) + allmatchers = [_rematcher(g) for g in allgroups] + func = lambda s: any(m(s) for m in allmatchers) + return fullregexp, func except re.error: for k, p, s in kindpats: try: