Patchwork [11,of,11,sparse] dirstate: integrate sparse matcher with _ignore (API)

login
register
mail settings
Submitter Gregory Szorc
Date July 8, 2017, 11:29 p.m.
Message ID <94f98bc84936defadb95.1499556546@ubuntu-vm-main>
Download mbox | patch
Permalink /patch/22135/
State Accepted
Headers show

Comments

Gregory Szorc - July 8, 2017, 11:29 p.m.
# HG changeset patch
# User Gregory Szorc <gregory.szorc@gmail.com>
# Date 1499555309 25200
#      Sat Jul 08 16:08:29 2017 -0700
# Node ID 94f98bc84936defadb959e31012555dba170d8cd
# Parent  a2867557f9c2314aeea19a946dfb8e167def4fb8
dirstate: integrate sparse matcher with _ignore (API)

Previously, the sparse extension monkeypatched dirstate._ignore
(using exotic means because dirstate._ignore was a file cache
property).

In this commit, we move the logic from dirstate_ignore to a new method,
_hgignorematch(). We replace dirstate._ignore with a regular property.
The implementation of that property obtains a sparse matcher. If one
is defined, it returns a unionmatcher.

The implementation is funcionally equivalent to the old code in
the sparse extension. However, the property caching of dirstate._ignore
has been dropped. This means that every attribute lookup will:

* resolve a sparse matcher by calling sparse.matcher()
* call matcher.always()
* possibly construct a new unionmatcher if the sparse matcher is
  defined

The overhead is unfortunate. However, it is simple. sparse.matcher()
should be relatively fast. And I have plans for caching the sparse
matcher more intelligently to avoid all these lookups. Furthermore,
I don't see any obvious code paths where dirstate._ignore is resolved
in tight loops, so the perf impact may be mostly irrelevant.

After this commit, the sparse extension now only contains modifications
to commands: all code directly supporting sparse checkouts is now in
core!
via Mercurial-devel - July 10, 2017, 5:01 p.m.
(For Durham)

On Sat, Jul 8, 2017 at 4:29 PM, Gregory Szorc <gregory.szorc@gmail.com> wrote:
> # HG changeset patch
> # User Gregory Szorc <gregory.szorc@gmail.com>
> # Date 1499555309 25200
> #      Sat Jul 08 16:08:29 2017 -0700
> # Node ID 94f98bc84936defadb959e31012555dba170d8cd
> # Parent  a2867557f9c2314aeea19a946dfb8e167def4fb8
> dirstate: integrate sparse matcher with _ignore (API)

Why does sparse do it this way instead of intersecting the sparse
matcher with the user's matcher?
Durham Goode - July 10, 2017, 6:45 p.m.
On 7/10/17 10:01 AM, Martin von Zweigbergk wrote:
> (For Durham)
>
> On Sat, Jul 8, 2017 at 4:29 PM, Gregory Szorc <gregory.szorc@gmail.com> wrote:
>> # HG changeset patch
>> # User Gregory Szorc <gregory.szorc@gmail.com>
>> # Date 1499555309 25200
>> #      Sat Jul 08 16:08:29 2017 -0700
>> # Node ID 94f98bc84936defadb959e31012555dba170d8cd
>> # Parent  a2867557f9c2314aeea19a946dfb8e167def4fb8
>> dirstate: integrate sparse matcher with _ignore (API)
>
> Why does sparse do it this way instead of intersecting the sparse
> matcher with the user's matcher?

I'm not sure I understand the question.  What is the "user's matcher" 
here? The ignore matcher?

This code produces a matcher that returns true for any file that should 
be ignored.  Since both hgignore files and sparse-ignored files should 
be ignored, I'm not sure how that could be expressed with intersection 
of those two matchers?
via Mercurial-devel - July 10, 2017, 6:55 p.m.
On Mon, Jul 10, 2017 at 11:45 AM, Durham Goode <durham@fb.com> wrote:
> On 7/10/17 10:01 AM, Martin von Zweigbergk wrote:
>>
>> (For Durham)
>>
>> On Sat, Jul 8, 2017 at 4:29 PM, Gregory Szorc <gregory.szorc@gmail.com>
>> wrote:
>>>
>>> # HG changeset patch
>>> # User Gregory Szorc <gregory.szorc@gmail.com>
>>> # Date 1499555309 25200
>>> #      Sat Jul 08 16:08:29 2017 -0700
>>> # Node ID 94f98bc84936defadb959e31012555dba170d8cd
>>> # Parent  a2867557f9c2314aeea19a946dfb8e167def4fb8
>>> dirstate: integrate sparse matcher with _ignore (API)
>>
>>
>> Why does sparse do it this way instead of intersecting the sparse
>> matcher with the user's matcher?
>
>
> I'm not sure I understand the question.  What is the "user's matcher" here?
> The ignore matcher?

I mean the matcher the user may have provided on the command line (or
match.always() by default), as in "hg status dir/" (where the matcher
would be "relpath:dir").

>
> This code produces a matcher that returns true for any file that should be
> ignored.  Since both hgignore files and sparse-ignored files should be
> ignored, I'm not sure how that could be expressed with intersection of those
> two matchers?

For narrowhg, we did it the other way around: filtering in instead of
filtering out. So if the narrowspec (like sparse config, IIUC) says to
include foo/ and bar/ and the user provides 'glob:*c', we'd intersect
that and list *.c files in those two directories (recursively).
Durham Goode - July 10, 2017, 6:58 p.m.
On 7/10/17 11:55 AM, Martin von Zweigbergk wrote:
> On Mon, Jul 10, 2017 at 11:45 AM, Durham Goode <durham@fb.com> wrote:
>> On 7/10/17 10:01 AM, Martin von Zweigbergk wrote:
>>>
>>> (For Durham)
>>>
>>> On Sat, Jul 8, 2017 at 4:29 PM, Gregory Szorc <gregory.szorc@gmail.com>
>>> wrote:
>>>>
>>>> # HG changeset patch
>>>> # User Gregory Szorc <gregory.szorc@gmail.com>
>>>> # Date 1499555309 25200
>>>> #      Sat Jul 08 16:08:29 2017 -0700
>>>> # Node ID 94f98bc84936defadb959e31012555dba170d8cd
>>>> # Parent  a2867557f9c2314aeea19a946dfb8e167def4fb8
>>>> dirstate: integrate sparse matcher with _ignore (API)
>>>
>>>
>>> Why does sparse do it this way instead of intersecting the sparse
>>> matcher with the user's matcher?
>>
>>
>> I'm not sure I understand the question.  What is the "user's matcher" here?
>> The ignore matcher?
>
> I mean the matcher the user may have provided on the command line (or
> match.always() by default), as in "hg status dir/" (where the matcher
> would be "relpath:dir").
>
>>
>> This code produces a matcher that returns true for any file that should be
>> ignored.  Since both hgignore files and sparse-ignored files should be
>> ignored, I'm not sure how that could be expressed with intersection of those
>> two matchers?
>
> For narrowhg, we did it the other way around: filtering in instead of
> filtering out. So if the narrowspec (like sparse config, IIUC) says to
> include foo/ and bar/ and the user provides 'glob:*c', we'd intersect
> that and list *.c files in those two directories (recursively).

I'd have to look at the code to be specific, but I think the dirstate 
ignore logic covers more cases than the user provided matcher logic. I'd 
be surprised if all commands that hit dirstate.ignore also happened to 
take patterns at the command level.  It just seemed cleaner to have a 
unified matcher for ignored files at the repo level.  The user specific 
matcher can always be added on top of it later for commands that take 
patterns.
via Mercurial-devel - July 10, 2017, 8:04 p.m.
On Mon, Jul 10, 2017 at 11:58 AM, Durham Goode <durham@fb.com> wrote:
>
>
> On 7/10/17 11:55 AM, Martin von Zweigbergk wrote:
>>
>> On Mon, Jul 10, 2017 at 11:45 AM, Durham Goode <durham@fb.com> wrote:
>>>
>>> On 7/10/17 10:01 AM, Martin von Zweigbergk wrote:
>>>>
>>>>
>>>> (For Durham)
>>>>
>>>> On Sat, Jul 8, 2017 at 4:29 PM, Gregory Szorc <gregory.szorc@gmail.com>
>>>> wrote:
>>>>>
>>>>>
>>>>> # HG changeset patch
>>>>> # User Gregory Szorc <gregory.szorc@gmail.com>
>>>>> # Date 1499555309 25200
>>>>> #      Sat Jul 08 16:08:29 2017 -0700
>>>>> # Node ID 94f98bc84936defadb959e31012555dba170d8cd
>>>>> # Parent  a2867557f9c2314aeea19a946dfb8e167def4fb8
>>>>> dirstate: integrate sparse matcher with _ignore (API)
>>>>
>>>>
>>>>
>>>> Why does sparse do it this way instead of intersecting the sparse
>>>> matcher with the user's matcher?
>>>
>>>
>>>
>>> I'm not sure I understand the question.  What is the "user's matcher"
>>> here?
>>> The ignore matcher?
>>
>>
>> I mean the matcher the user may have provided on the command line (or
>> match.always() by default), as in "hg status dir/" (where the matcher
>> would be "relpath:dir").
>>
>>>
>>> This code produces a matcher that returns true for any file that should
>>> be
>>> ignored.  Since both hgignore files and sparse-ignored files should be
>>> ignored, I'm not sure how that could be expressed with intersection of
>>> those
>>> two matchers?
>>
>>
>> For narrowhg, we did it the other way around: filtering in instead of
>> filtering out. So if the narrowspec (like sparse config, IIUC) says to
>> include foo/ and bar/ and the user provides 'glob:*c', we'd intersect
>> that and list *.c files in those two directories (recursively).
>
>
> I'd have to look at the code to be specific, but I think the dirstate ignore
> logic covers more cases than the user provided matcher logic. I'd be
> surprised if all commands that hit dirstate.ignore also happened to take
> patterns at the command level.

If they don't, then the sparse matcher can be passed as is.

>  It just seemed cleaner to have a unified
> matcher for ignored files at the repo level.  The user specific matcher can
> always be added on top of it later for commands that take patterns.

For narrow, we have to apply the matcher when walking the manifest
too. The user can pass a matcher to e.g. "hg status -c ." or "hg files
-r ." and in those cases we need to intersect the narrow matcher with
the user-supplied one. It seemed more natural to do the same for
dirstate walks.

It also seems simpler to control which directories are visited if
using a positive matcher than a negative one. For example, let's say
the narrow matcher is path:dir/. The narrowhg code will then restrict
the walk to visit only the root directory, dir/, and subdirectories of
dir/ (both for manifest walks and dirstate walks). I think we can
simply make negatematcher's visitdir return False iff the
narrow/sparse matcher returns 'all', so it's probably easy to get it
to work. It still seems more natural to me to match what should be
included.
Gregory Szorc - July 10, 2017, 8:18 p.m.
On Mon, Jul 10, 2017 at 1:04 PM, Martin von Zweigbergk <
martinvonz@google.com> wrote:

> On Mon, Jul 10, 2017 at 11:58 AM, Durham Goode <durham@fb.com> wrote:
> >
> >
> > On 7/10/17 11:55 AM, Martin von Zweigbergk wrote:
> >>
> >> On Mon, Jul 10, 2017 at 11:45 AM, Durham Goode <durham@fb.com> wrote:
> >>>
> >>> On 7/10/17 10:01 AM, Martin von Zweigbergk wrote:
> >>>>
> >>>>
> >>>> (For Durham)
> >>>>
> >>>> On Sat, Jul 8, 2017 at 4:29 PM, Gregory Szorc <
> gregory.szorc@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>>
> >>>>> # HG changeset patch
> >>>>> # User Gregory Szorc <gregory.szorc@gmail.com>
> >>>>> # Date 1499555309 25200
> >>>>> #      Sat Jul 08 16:08:29 2017 -0700
> >>>>> # Node ID 94f98bc84936defadb959e31012555dba170d8cd
> >>>>> # Parent  a2867557f9c2314aeea19a946dfb8e167def4fb8
> >>>>> dirstate: integrate sparse matcher with _ignore (API)
> >>>>
> >>>>
> >>>>
> >>>> Why does sparse do it this way instead of intersecting the sparse
> >>>> matcher with the user's matcher?
> >>>
> >>>
> >>>
> >>> I'm not sure I understand the question.  What is the "user's matcher"
> >>> here?
> >>> The ignore matcher?
> >>
> >>
> >> I mean the matcher the user may have provided on the command line (or
> >> match.always() by default), as in "hg status dir/" (where the matcher
> >> would be "relpath:dir").
> >>
> >>>
> >>> This code produces a matcher that returns true for any file that should
> >>> be
> >>> ignored.  Since both hgignore files and sparse-ignored files should be
> >>> ignored, I'm not sure how that could be expressed with intersection of
> >>> those
> >>> two matchers?
> >>
> >>
> >> For narrowhg, we did it the other way around: filtering in instead of
> >> filtering out. So if the narrowspec (like sparse config, IIUC) says to
> >> include foo/ and bar/ and the user provides 'glob:*c', we'd intersect
> >> that and list *.c files in those two directories (recursively).
> >
> >
> > I'd have to look at the code to be specific, but I think the dirstate
> ignore
> > logic covers more cases than the user provided matcher logic. I'd be
> > surprised if all commands that hit dirstate.ignore also happened to take
> > patterns at the command level.
>
> If they don't, then the sparse matcher can be passed as is.
>
> >  It just seemed cleaner to have a unified
> > matcher for ignored files at the repo level.  The user specific matcher
> can
> > always be added on top of it later for commands that take patterns.
>
> For narrow, we have to apply the matcher when walking the manifest
> too. The user can pass a matcher to e.g. "hg status -c ." or "hg files
> -r ." and in those cases we need to intersect the narrow matcher with
> the user-supplied one. It seemed more natural to do the same for
> dirstate walks.
>
> It also seems simpler to control which directories are visited if
> using a positive matcher than a negative one. For example, let's say
> the narrow matcher is path:dir/. The narrowhg code will then restrict
> the walk to visit only the root directory, dir/, and subdirectories of
> dir/ (both for manifest walks and dirstate walks). I think we can
> simply make negatematcher's visitdir return False iff the
> narrow/sparse matcher returns 'all', so it's probably easy to get it
> to work. It still seems more natural to me to match what should be
> included.
>

I think there's tons of potential to improve how the filtering/matching is
performed. At the very least, we'll likely attach a matcher to the repo
instance instead of dynamically obtaining one for every call site. I think
a lot of the design of the current sparse code is heavily influenced by a)
state of the code 3+ years ago b) what's easiest for an extension to do.
Now that things are in core, we can and should think about better ways to
implement everything.
Durham Goode - July 10, 2017, 9:48 p.m.
On 7/10/17 1:04 PM, Martin von Zweigbergk wrote:
> On Mon, Jul 10, 2017 at 11:58 AM, Durham Goode <durham@fb.com> wrote:
>>
>>
>> On 7/10/17 11:55 AM, Martin von Zweigbergk wrote:
>>>
>>> On Mon, Jul 10, 2017 at 11:45 AM, Durham Goode <durham@fb.com> wrote:
>>>>
>>>> On 7/10/17 10:01 AM, Martin von Zweigbergk wrote:
>>>>>
>>>>>
>>>>> (For Durham)
>>>>>
>>>>> On Sat, Jul 8, 2017 at 4:29 PM, Gregory Szorc <gregory.szorc@gmail.com>
>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> # HG changeset patch
>>>>>> # User Gregory Szorc <gregory.szorc@gmail.com>
>>>>>> # Date 1499555309 25200
>>>>>> #      Sat Jul 08 16:08:29 2017 -0700
>>>>>> # Node ID 94f98bc84936defadb959e31012555dba170d8cd
>>>>>> # Parent  a2867557f9c2314aeea19a946dfb8e167def4fb8
>>>>>> dirstate: integrate sparse matcher with _ignore (API)
>>>>>
>>>>>
>>>>>
>>>>> Why does sparse do it this way instead of intersecting the sparse
>>>>> matcher with the user's matcher?
>>>>
>>>>
>>>>
>>>> I'm not sure I understand the question.  What is the "user's matcher"
>>>> here?
>>>> The ignore matcher?
>>>
>>>
>>> I mean the matcher the user may have provided on the command line (or
>>> match.always() by default), as in "hg status dir/" (where the matcher
>>> would be "relpath:dir").
>>>
>>>>
>>>> This code produces a matcher that returns true for any file that should
>>>> be
>>>> ignored.  Since both hgignore files and sparse-ignored files should be
>>>> ignored, I'm not sure how that could be expressed with intersection of
>>>> those
>>>> two matchers?
>>>
>>>
>>> For narrowhg, we did it the other way around: filtering in instead of
>>> filtering out. So if the narrowspec (like sparse config, IIUC) says to
>>> include foo/ and bar/ and the user provides 'glob:*c', we'd intersect
>>> that and list *.c files in those two directories (recursively).
>>
>>
>> I'd have to look at the code to be specific, but I think the dirstate ignore
>> logic covers more cases than the user provided matcher logic. I'd be
>> surprised if all commands that hit dirstate.ignore also happened to take
>> patterns at the command level.
>
> If they don't, then the sparse matcher can be passed as is.
>
>>  It just seemed cleaner to have a unified
>> matcher for ignored files at the repo level.  The user specific matcher can
>> always be added on top of it later for commands that take patterns.
>
> For narrow, we have to apply the matcher when walking the manifest
> too. The user can pass a matcher to e.g. "hg status -c ." or "hg files
> -r ." and in those cases we need to intersect the narrow matcher with
> the user-supplied one. It seemed more natural to do the same for
> dirstate walks.
>
> It also seems simpler to control which directories are visited if
> using a positive matcher than a negative one. For example, let's say
> the narrow matcher is path:dir/. The narrowhg code will then restrict
> the walk to visit only the root directory, dir/, and subdirectories of
> dir/ (both for manifest walks and dirstate walks). I think we can
> simply make negatematcher's visitdir return False iff the
> narrow/sparse matcher returns 'all', so it's probably easy to get it
> to work. It still seems more natural to me to match what should be
> included.
>

I don't have a strong opinion either way. When I made sparse, it was 
specific to the working copy, so it mapped to the ignore matcher very 
tightly. If that needs to change, that's fine.

I just want to avoid duplicating repetitive matcher logic amongst 
individual commands.
via Mercurial-devel - July 11, 2017, 5:16 p.m.
On Mon, Jul 10, 2017 at 2:48 PM, Durham Goode <durham@fb.com> wrote:
>
>
> On 7/10/17 1:04 PM, Martin von Zweigbergk wrote:
>>
>> On Mon, Jul 10, 2017 at 11:58 AM, Durham Goode <durham@fb.com> wrote:
>>>
>>>
>>>
>>> On 7/10/17 11:55 AM, Martin von Zweigbergk wrote:
>>>>
>>>>
>>>> On Mon, Jul 10, 2017 at 11:45 AM, Durham Goode <durham@fb.com> wrote:
>>>>>
>>>>>
>>>>> On 7/10/17 10:01 AM, Martin von Zweigbergk wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> (For Durham)
>>>>>>
>>>>>> On Sat, Jul 8, 2017 at 4:29 PM, Gregory Szorc
>>>>>> <gregory.szorc@gmail.com>
>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> # HG changeset patch
>>>>>>> # User Gregory Szorc <gregory.szorc@gmail.com>
>>>>>>> # Date 1499555309 25200
>>>>>>> #      Sat Jul 08 16:08:29 2017 -0700
>>>>>>> # Node ID 94f98bc84936defadb959e31012555dba170d8cd
>>>>>>> # Parent  a2867557f9c2314aeea19a946dfb8e167def4fb8
>>>>>>> dirstate: integrate sparse matcher with _ignore (API)
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Why does sparse do it this way instead of intersecting the sparse
>>>>>> matcher with the user's matcher?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> I'm not sure I understand the question.  What is the "user's matcher"
>>>>> here?
>>>>> The ignore matcher?
>>>>
>>>>
>>>>
>>>> I mean the matcher the user may have provided on the command line (or
>>>> match.always() by default), as in "hg status dir/" (where the matcher
>>>> would be "relpath:dir").
>>>>
>>>>>
>>>>> This code produces a matcher that returns true for any file that should
>>>>> be
>>>>> ignored.  Since both hgignore files and sparse-ignored files should be
>>>>> ignored, I'm not sure how that could be expressed with intersection of
>>>>> those
>>>>> two matchers?
>>>>
>>>>
>>>>
>>>> For narrowhg, we did it the other way around: filtering in instead of
>>>> filtering out. So if the narrowspec (like sparse config, IIUC) says to
>>>> include foo/ and bar/ and the user provides 'glob:*c', we'd intersect
>>>> that and list *.c files in those two directories (recursively).
>>>
>>>
>>>
>>> I'd have to look at the code to be specific, but I think the dirstate
>>> ignore
>>> logic covers more cases than the user provided matcher logic. I'd be
>>> surprised if all commands that hit dirstate.ignore also happened to take
>>> patterns at the command level.
>>
>>
>> If they don't, then the sparse matcher can be passed as is.
>>
>>>  It just seemed cleaner to have a unified
>>> matcher for ignored files at the repo level.  The user specific matcher
>>> can
>>> always be added on top of it later for commands that take patterns.
>>
>>
>> For narrow, we have to apply the matcher when walking the manifest
>> too. The user can pass a matcher to e.g. "hg status -c ." or "hg files
>> -r ." and in those cases we need to intersect the narrow matcher with
>> the user-supplied one. It seemed more natural to do the same for
>> dirstate walks.
>>
>> It also seems simpler to control which directories are visited if
>> using a positive matcher than a negative one. For example, let's say
>> the narrow matcher is path:dir/. The narrowhg code will then restrict
>> the walk to visit only the root directory, dir/, and subdirectories of
>> dir/ (both for manifest walks and dirstate walks). I think we can
>> simply make negatematcher's visitdir return False iff the
>> narrow/sparse matcher returns 'all', so it's probably easy to get it
>> to work. It still seems more natural to me to match what should be
>> included.
>>
>
> I don't have a strong opinion either way. When I made sparse, it was
> specific to the working copy, so it mapped to the ignore matcher very
> tightly. If that needs to change, that's fine.

I just tried this. I think the result is cleaner, but there's a
functional change that perhaps suggests the reason you modified the
ignores instead of modifying the walk: untracked files outside the
sparse config are reported as ignored with your version and are not
reported at all with my version. That's an interesting aspect that I
had not thought about.

I had previously considered adding an option to status for narrowhg to
make it also list files outside the narrow config. One option would be
to not include them by default and list their status as something
special (maybe O for outside?) since neither untracked nor ignored is
really accurate. They may (kind of) be tracked, just not in the
working copy, and with narrowhg with treemanifests we can't even tell
if they're in the manifest or not.

So, would you be okay with changing the behavior to not report these
files as ignored by default and instead adding an option to status
(maybe --no-sparse)?

For narrow, even committed revisions will be filtered by the
narrowspec by default. One can imagine passing --no-sparse (or
--no-narrow?) for committed revisions as well. We could then
potentially report things as outside the narrowspec and added,
removed, modified, or clean, so there would be those three statuses
with an extra bit saying that it's also outside the narrowspec (since
they're all letters, we could use lowercase 'a', 'r', 'm', 'c', but
I'm not sure that's a good idea). We can and should the discuss the
specifics later.

(When running hg on top of our overlay file system at Google, we
definitely do not want to walk the entire working copy (because it has
millions of files outside the sparse/narrow config). But that's not a
normal hg setup, so it would be our problem to set up an override for
that if we decided to report these files as ignored in hg core.)

>
> I just want to avoid duplicating repetitive matcher logic amongst individual
> commands.
Durham Goode - July 11, 2017, 5:23 p.m.
On 7/11/17 10:16 AM, Martin von Zweigbergk wrote:
> On Mon, Jul 10, 2017 at 2:48 PM, Durham Goode <durham@fb.com> wrote:
>>
>>
>> On 7/10/17 1:04 PM, Martin von Zweigbergk wrote:
>>>
>>> On Mon, Jul 10, 2017 at 11:58 AM, Durham Goode <durham@fb.com> wrote:
>>>>
>>>>
>>>>
>>>> On 7/10/17 11:55 AM, Martin von Zweigbergk wrote:
>>>>>
>>>>>
>>>>> On Mon, Jul 10, 2017 at 11:45 AM, Durham Goode <durham@fb.com> wrote:
>>>>>>
>>>>>>
>>>>>> On 7/10/17 10:01 AM, Martin von Zweigbergk wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> (For Durham)
>>>>>>>
>>>>>>> On Sat, Jul 8, 2017 at 4:29 PM, Gregory Szorc
>>>>>>> <gregory.szorc@gmail.com>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> # HG changeset patch
>>>>>>>> # User Gregory Szorc <gregory.szorc@gmail.com>
>>>>>>>> # Date 1499555309 25200
>>>>>>>> #      Sat Jul 08 16:08:29 2017 -0700
>>>>>>>> # Node ID 94f98bc84936defadb959e31012555dba170d8cd
>>>>>>>> # Parent  a2867557f9c2314aeea19a946dfb8e167def4fb8
>>>>>>>> dirstate: integrate sparse matcher with _ignore (API)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Why does sparse do it this way instead of intersecting the sparse
>>>>>>> matcher with the user's matcher?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> I'm not sure I understand the question.  What is the "user's matcher"
>>>>>> here?
>>>>>> The ignore matcher?
>>>>>
>>>>>
>>>>>
>>>>> I mean the matcher the user may have provided on the command line (or
>>>>> match.always() by default), as in "hg status dir/" (where the matcher
>>>>> would be "relpath:dir").
>>>>>
>>>>>>
>>>>>> This code produces a matcher that returns true for any file that should
>>>>>> be
>>>>>> ignored.  Since both hgignore files and sparse-ignored files should be
>>>>>> ignored, I'm not sure how that could be expressed with intersection of
>>>>>> those
>>>>>> two matchers?
>>>>>
>>>>>
>>>>>
>>>>> For narrowhg, we did it the other way around: filtering in instead of
>>>>> filtering out. So if the narrowspec (like sparse config, IIUC) says to
>>>>> include foo/ and bar/ and the user provides 'glob:*c', we'd intersect
>>>>> that and list *.c files in those two directories (recursively).
>>>>
>>>>
>>>>
>>>> I'd have to look at the code to be specific, but I think the dirstate
>>>> ignore
>>>> logic covers more cases than the user provided matcher logic. I'd be
>>>> surprised if all commands that hit dirstate.ignore also happened to take
>>>> patterns at the command level.
>>>
>>>
>>> If they don't, then the sparse matcher can be passed as is.
>>>
>>>>  It just seemed cleaner to have a unified
>>>> matcher for ignored files at the repo level.  The user specific matcher
>>>> can
>>>> always be added on top of it later for commands that take patterns.
>>>
>>>
>>> For narrow, we have to apply the matcher when walking the manifest
>>> too. The user can pass a matcher to e.g. "hg status -c ." or "hg files
>>> -r ." and in those cases we need to intersect the narrow matcher with
>>> the user-supplied one. It seemed more natural to do the same for
>>> dirstate walks.
>>>
>>> It also seems simpler to control which directories are visited if
>>> using a positive matcher than a negative one. For example, let's say
>>> the narrow matcher is path:dir/. The narrowhg code will then restrict
>>> the walk to visit only the root directory, dir/, and subdirectories of
>>> dir/ (both for manifest walks and dirstate walks). I think we can
>>> simply make negatematcher's visitdir return False iff the
>>> narrow/sparse matcher returns 'all', so it's probably easy to get it
>>> to work. It still seems more natural to me to match what should be
>>> included.
>>>
>>
>> I don't have a strong opinion either way. When I made sparse, it was
>> specific to the working copy, so it mapped to the ignore matcher very
>> tightly. If that needs to change, that's fine.
>
> I just tried this. I think the result is cleaner, but there's a
> functional change that perhaps suggests the reason you modified the
> ignores instead of modifying the walk: untracked files outside the
> sparse config are reported as ignored with your version and are not
> reported at all with my version. That's an interesting aspect that I
> had not thought about.
>
> I had previously considered adding an option to status for narrowhg to
> make it also list files outside the narrow config. One option would be
> to not include them by default and list their status as something
> special (maybe O for outside?) since neither untracked nor ignored is
> really accurate. They may (kind of) be tracked, just not in the
> working copy, and with narrowhg with treemanifests we can't even tell
> if they're in the manifest or not.
>
> So, would you be okay with changing the behavior to not report these
> files as ignored by default and instead adding an option to status
> (maybe --no-sparse)?

If that's the only effect, I'm fine with that change.  The only time we 
care about the files outside the sparse checkout is when widening the 
sparse checkout (it checks if files already exist on disk and prevents 
the sparse-widening from overwriting them if the content differs), but 
that might be done via other means.
via Mercurial-devel - July 11, 2017, 5:29 p.m.
On Tue, Jul 11, 2017 at 10:23 AM, Durham Goode <durham@fb.com> wrote:
>
>
> On 7/11/17 10:16 AM, Martin von Zweigbergk wrote:
>>
>> On Mon, Jul 10, 2017 at 2:48 PM, Durham Goode <durham@fb.com> wrote:
>>>
>>>
>>>
>>> On 7/10/17 1:04 PM, Martin von Zweigbergk wrote:
>>>>
>>>>
>>>> On Mon, Jul 10, 2017 at 11:58 AM, Durham Goode <durham@fb.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 7/10/17 11:55 AM, Martin von Zweigbergk wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Jul 10, 2017 at 11:45 AM, Durham Goode <durham@fb.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 7/10/17 10:01 AM, Martin von Zweigbergk wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> (For Durham)
>>>>>>>>
>>>>>>>> On Sat, Jul 8, 2017 at 4:29 PM, Gregory Szorc
>>>>>>>> <gregory.szorc@gmail.com>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> # HG changeset patch
>>>>>>>>> # User Gregory Szorc <gregory.szorc@gmail.com>
>>>>>>>>> # Date 1499555309 25200
>>>>>>>>> #      Sat Jul 08 16:08:29 2017 -0700
>>>>>>>>> # Node ID 94f98bc84936defadb959e31012555dba170d8cd
>>>>>>>>> # Parent  a2867557f9c2314aeea19a946dfb8e167def4fb8
>>>>>>>>> dirstate: integrate sparse matcher with _ignore (API)
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Why does sparse do it this way instead of intersecting the sparse
>>>>>>>> matcher with the user's matcher?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I'm not sure I understand the question.  What is the "user's matcher"
>>>>>>> here?
>>>>>>> The ignore matcher?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> I mean the matcher the user may have provided on the command line (or
>>>>>> match.always() by default), as in "hg status dir/" (where the matcher
>>>>>> would be "relpath:dir").
>>>>>>
>>>>>>>
>>>>>>> This code produces a matcher that returns true for any file that
>>>>>>> should
>>>>>>> be
>>>>>>> ignored.  Since both hgignore files and sparse-ignored files should
>>>>>>> be
>>>>>>> ignored, I'm not sure how that could be expressed with intersection
>>>>>>> of
>>>>>>> those
>>>>>>> two matchers?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> For narrowhg, we did it the other way around: filtering in instead of
>>>>>> filtering out. So if the narrowspec (like sparse config, IIUC) says to
>>>>>> include foo/ and bar/ and the user provides 'glob:*c', we'd intersect
>>>>>> that and list *.c files in those two directories (recursively).
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> I'd have to look at the code to be specific, but I think the dirstate
>>>>> ignore
>>>>> logic covers more cases than the user provided matcher logic. I'd be
>>>>> surprised if all commands that hit dirstate.ignore also happened to
>>>>> take
>>>>> patterns at the command level.
>>>>
>>>>
>>>>
>>>> If they don't, then the sparse matcher can be passed as is.
>>>>
>>>>>  It just seemed cleaner to have a unified
>>>>> matcher for ignored files at the repo level.  The user specific matcher
>>>>> can
>>>>> always be added on top of it later for commands that take patterns.
>>>>
>>>>
>>>>
>>>> For narrow, we have to apply the matcher when walking the manifest
>>>> too. The user can pass a matcher to e.g. "hg status -c ." or "hg files
>>>> -r ." and in those cases we need to intersect the narrow matcher with
>>>> the user-supplied one. It seemed more natural to do the same for
>>>> dirstate walks.
>>>>
>>>> It also seems simpler to control which directories are visited if
>>>> using a positive matcher than a negative one. For example, let's say
>>>> the narrow matcher is path:dir/. The narrowhg code will then restrict
>>>> the walk to visit only the root directory, dir/, and subdirectories of
>>>> dir/ (both for manifest walks and dirstate walks). I think we can
>>>> simply make negatematcher's visitdir return False iff the
>>>> narrow/sparse matcher returns 'all', so it's probably easy to get it
>>>> to work. It still seems more natural to me to match what should be
>>>> included.
>>>>
>>>
>>> I don't have a strong opinion either way. When I made sparse, it was
>>> specific to the working copy, so it mapped to the ignore matcher very
>>> tightly. If that needs to change, that's fine.
>>
>>
>> I just tried this. I think the result is cleaner, but there's a
>> functional change that perhaps suggests the reason you modified the
>> ignores instead of modifying the walk: untracked files outside the
>> sparse config are reported as ignored with your version and are not
>> reported at all with my version. That's an interesting aspect that I
>> had not thought about.
>>
>> I had previously considered adding an option to status for narrowhg to
>> make it also list files outside the narrow config. One option would be
>> to not include them by default and list their status as something
>> special (maybe O for outside?) since neither untracked nor ignored is
>> really accurate. They may (kind of) be tracked, just not in the
>> working copy, and with narrowhg with treemanifests we can't even tell
>> if they're in the manifest or not.
>>
>> So, would you be okay with changing the behavior to not report these
>> files as ignored by default and instead adding an option to status
>> (maybe --no-sparse)?
>
>
> If that's the only effect, I'm fine with that change.  The only time we care
> about the files outside the sparse checkout is when widening the sparse
> checkout (it checks if files already exist on disk and prevents the
> sparse-widening from overwriting them if the content differs), but that
> might be done via other means.

Great. I'll send patches (first a few patches to fix the matchers that
are currently broken).

Patch

diff --git a/hgext/sparse.py b/hgext/sparse.py
--- a/hgext/sparse.py
+++ b/hgext/sparse.py
@@ -78,11 +78,9 @@  from mercurial.i18n import _
 from mercurial import (
     cmdutil,
     commands,
-    dirstate,
     error,
     extensions,
     hg,
-    match as matchmod,
     registrar,
     sparse,
     util,
@@ -103,23 +101,6 @@  def extsetup(ui):
     _setupclone(ui)
     _setuplog(ui)
     _setupadd(ui)
-    _setupdirstate(ui)
-
-def replacefilecache(cls, propname, replacement):
-    """Replace a filecache property with a new class. This allows changing the
-    cache invalidation condition."""
-    origcls = cls
-    assert callable(replacement)
-    while cls is not object:
-        if propname in cls.__dict__:
-            orig = cls.__dict__[propname]
-            setattr(cls, propname, replacement(orig))
-            break
-        cls = cls.__bases__[0]
-
-    if cls is object:
-        raise AttributeError(_("type '%s' has no property '%s'") % (origcls,
-                             propname))
 
 def _setuplog(ui):
     entry = commands.table['^log|history']
@@ -187,42 +168,6 @@  def _setupadd(ui):
 
     extensions.wrapcommand(commands.table, 'add', _add)
 
-def _setupdirstate(ui):
-    """Modify the dirstate to prevent stat'ing excluded files,
-    and to prevent modifications to files outside the checkout.
-    """
-
-    # The atrocity below is needed to wrap dirstate._ignore. It is a cached
-    # property, which means normal function wrapping doesn't work.
-    class ignorewrapper(object):
-        def __init__(self, orig):
-            self.orig = orig
-            self.origignore = None
-            self.func = None
-            self.sparsematch = None
-
-        def __get__(self, obj, type=None):
-            origignore = self.orig.__get__(obj)
-
-            sparsematch = obj._sparsematcher
-            if sparsematch.always():
-                return origignore
-
-            if self.sparsematch != sparsematch or self.origignore != origignore:
-                self.func = matchmod.unionmatcher([
-                    origignore, matchmod.negatematcher(sparsematch)])
-                self.sparsematch = sparsematch
-                self.origignore = origignore
-            return self.func
-
-        def __set__(self, obj, value):
-            return self.orig.__set__(obj, value)
-
-        def __delete__(self, obj):
-            return self.orig.__delete__(obj)
-
-    replacefilecache(dirstate.dirstate, '_ignore', ignorewrapper)
-
 @command('^debugsparse', [
     ('I', 'include', False, _('include files in the sparse checkout')),
     ('X', 'exclude', False, _('exclude files in the sparse checkout')),
diff --git a/mercurial/dirstate.py b/mercurial/dirstate.py
--- a/mercurial/dirstate.py
+++ b/mercurial/dirstate.py
@@ -248,7 +248,7 @@  class dirstate(object):
         return self._dirs
 
     @rootcache('.hgignore')
-    def _ignore(self):
+    def _hgignorematch(self):
         files = self._ignorefiles()
         if not files:
             return matchmod.never(self._root, '')
@@ -256,6 +256,15 @@  class dirstate(object):
         pats = ['include:%s' % f for f in files]
         return matchmod.match(self._root, '', [], pats, warn=self._ui.warn)
 
+    @property
+    def _ignore(self):
+        sparsematch = self._sparsematcher
+        if sparsematch.always():
+            return self._hgignorematch
+
+        return matchmod.unionmatcher([self._hgignorematch,
+                                      matchmod.negatematcher(sparsematch)])
+
     @propertycache
     def _slash(self):
         return self._ui.configbool('ui', 'slash') and pycompat.ossep != '/'
@@ -516,7 +525,7 @@  class dirstate(object):
 
         for a in ("_map", "_copymap", "_identity",
                   "_filefoldmap", "_dirfoldmap", "_branch",
-                  "_pl", "_dirs", "_ignore", "_nonnormalset",
+                  "_pl", "_dirs", "_hgignorematcher", "_nonnormalset",
                   "_otherparentset"):
             if a in self.__dict__:
                 delattr(self, a)