Patchwork [1,of,6,RFC] localrepo: establish a base class for an immutable local repository

login
register
mail settings
Submitter Gregory Szorc
Date June 9, 2017, 6:36 a.m.
Message ID <dfe0db942bbf860968b1.1496990165@ubuntu-vm-main>
Download mbox | patch
Permalink /patch/21273/
State Deferred
Headers show

Comments

Gregory Szorc - June 9, 2017, 6:36 a.m.
# HG changeset patch
# User Gregory Szorc <gregory.szorc@gmail.com>
# Date 1496977416 25200
#      Thu Jun 08 20:03:36 2017 -0700
# Node ID dfe0db942bbf860968b19fd8579865790d78d5e8
# Parent  e583b786ffba99cb775cf9d3a126cf50db74f85a
localrepo: establish a base class for an immutable local repository

Currently, localrepository has a number of properties that are cached
based on things changing. For attributes behind @repofilecache or
@storecache, we perform a stat() on every attribute access. If done
inside a tight loop, you have a perf killer. In the case of "changelog,"
the repoview layer maintains its own cache of which changesets are
visible. And this has to be consulted on every attribute lookup,
which can lead to performance loss, even if it is relatively fast.

Currently, localrepository model is "open the repo and constantly check
if things have changed." I strongly disagree with this model. I believe
an opened repository "handle" should either be an immutable and
globally consistent view of the repo at the time it was opened *or* be
mutable and represent evolving state changed by that process. The
current model of constantly checking if things changed leads to poor
performance by both introducing overhead and preventing aggressive
caching (if something is immutable, caching is easy). It also leads
to complicated and hard-to-reason-about APIs and caching.

This commit starts the process of breaking the localrepository class
into immutable and mutable classes. The end goal is for an
immutable repository instance to be obtained by default and for all
mutation operations to be performed on a mutable repo type. This
will naturally lead to optimization of read-only operations and
commands by removing cache checking.
Gregory Szorc - June 9, 2017, 6:49 a.m.
On Thu, Jun 8, 2017 at 11:36 PM, Gregory Szorc <gregory.szorc@gmail.com>
wrote:

> # HG changeset patch
> # User Gregory Szorc <gregory.szorc@gmail.com>
> # Date 1496977416 25200
> #      Thu Jun 08 20:03:36 2017 -0700
> # Node ID dfe0db942bbf860968b19fd8579865790d78d5e8
> # Parent  e583b786ffba99cb775cf9d3a126cf50db74f85a
> localrepo: establish a base class for an immutable local repository
>
> Currently, localrepository has a number of properties that are cached
> based on things changing. For attributes behind @repofilecache or
> @storecache, we perform a stat() on every attribute access. If done
> inside a tight loop, you have a perf killer. In the case of "changelog,"
> the repoview layer maintains its own cache of which changesets are
> visible. And this has to be consulted on every attribute lookup,
> which can lead to performance loss, even if it is relatively fast.
>
> Currently, localrepository model is "open the repo and constantly check
> if things have changed." I strongly disagree with this model. I believe
> an opened repository "handle" should either be an immutable and
> globally consistent view of the repo at the time it was opened *or* be
> mutable and represent evolving state changed by that process. The
> current model of constantly checking if things changed leads to poor
> performance by both introducing overhead and preventing aggressive
> caching (if something is immutable, caching is easy). It also leads
> to complicated and hard-to-reason-about APIs and caching.
>
> This commit starts the process of breaking the localrepository class
> into immutable and mutable classes. The end goal is for an
> immutable repository instance to be obtained by default and for all
> mutation operations to be performed on a mutable repo type. This
> will naturally lead to optimization of read-only operations and
> commands by removing cache checking.
>

Full series available at
https://hg.mozilla.org/users/gszorc_mozilla.com/hg/rev/immutablerepo. It
gets really hacky towards the end. But I do have most parts of `hg log` and
revsets working with an immutable repo. There are some nice revset
speedups. I reckon most of that is from converting repo.changelog to an
actual property. But the real reason I want to blow up localrepository is
to improve code comprehension and maintainability. Caching is hard. And
making assumptions that "nothing can change on this type" make that much,
much easier. The performance benefits naturally derive from a more pure
design.


>
> diff --git a/mercurial/localrepo.py b/mercurial/localrepo.py
> --- a/mercurial/localrepo.py
> +++ b/mercurial/localrepo.py
> @@ -248,7 +248,10 @@ class locallegacypeer(localpeer):
>  # clients.
>  REVLOGV2_REQUIREMENT = 'exp-revlogv2.0'
>
> -class localrepository(object):
> +class immutablelocalrepository(object):
> +    """An immutable repository on local disk."""
> +
> +class localrepository(immutablelocalrepository):
>
>      supportedformats = {
>          'revlogv1',
>
Pierre-Yves David - June 9, 2017, 9:59 a.m.
On 06/09/2017 07:36 AM, Gregory Szorc wrote:
> # HG changeset patch
> # User Gregory Szorc <gregory.szorc@gmail.com>
> # Date 1496977416 25200
> #      Thu Jun 08 20:03:36 2017 -0700
> # Node ID dfe0db942bbf860968b19fd8579865790d78d5e8
> # Parent  e583b786ffba99cb775cf9d3a126cf50db74f85a
> localrepo: establish a base class for an immutable local repository

I like the general idea and direction of this series. Moving toward more 
guarantee of having a consistent view of the repository is great.

> Currently, localrepository has a number of properties that are cached
> based on things changing. For attributes behind @repofilecache or
> @storecache, we perform a stat() on every attribute access.

Actually, according to the code, we do not do performa that "stat()" 
check for every attribute access.

 
https://www.mercurial-scm.org/repo/hg/file/326c0e2c1a1d/mercurial/scmutil.py#l855

What the filecache does it reusing a previously loaded object if the 
repository caches has been invalidated (so the attribute is missing) but 
underlying file did not changed. It is convoluted, but the "stat()" is 
rarely issued and  we should not see the alarming performance impact 
mentioned here.

That said, none of the above has impact on Greg code and we should still 
move toward Greg idea.

(I did not had time to check the rest of the series or the code more 
through-fully yet.)

Cheers
Gregory Szorc - June 9, 2017, 6:31 p.m.
On Fri, Jun 9, 2017 at 2:59 AM, Pierre-Yves David <
pierre-yves.david@ens-lyon.org> wrote:

>
>
> On 06/09/2017 07:36 AM, Gregory Szorc wrote:
>
>> # HG changeset patch
>> # User Gregory Szorc <gregory.szorc@gmail.com>
>> # Date 1496977416 25200
>> #      Thu Jun 08 20:03:36 2017 -0700
>> # Node ID dfe0db942bbf860968b19fd8579865790d78d5e8
>> # Parent  e583b786ffba99cb775cf9d3a126cf50db74f85a
>> localrepo: establish a base class for an immutable local repository
>>
>
> I like the general idea and direction of this series. Moving toward more
> guarantee of having a consistent view of the repository is great.
>

Obviously the current repo store design can't be fully consistent because
there are race conditions between process A opening files and process B
mutating them. However, my utopia end state is we devise a storage model
(either based completely on append-only data structures or on
file/directory based "generations" for data that isn't append-only) and
readers can read a single file denoting current offsets/generations to get
an atomic and consistent view of a repo.


>
> Currently, localrepository has a number of properties that are cached
>> based on things changing. For attributes behind @repofilecache or
>> @storecache, we perform a stat() on every attribute access.
>>
>
> Actually, according to the code, we do not do performa that "stat()" check
> for every attribute access.
>

>
> https://www.mercurial-scm.org/repo/hg/file/326c0e2c1a1d/merc
> urial/scmutil.py#l855
>
> What the filecache does it reusing a previously loaded object if the
> repository caches has been invalidated (so the attribute is missing) but
> underlying file did not changed. It is convoluted, but the "stat()" is
> rarely issued and  we should not see the alarming performance impact
> mentioned here.
>

Ugh. Can't believe I missed that. But it does prove my point that caching
is hard and is best avoided :)


>
> That said, none of the above has impact on Greg code and we should still
> move toward Greg idea.
>
> (I did not had time to check the rest of the series or the code more
> through-fully yet.)
>
> Cheers
>
> --
> Pierre-Yves David
>
Yuya Nishihara - June 11, 2017, 9:01 a.m.
On Fri, 9 Jun 2017 10:59:32 +0100, Pierre-Yves David wrote:
> On 06/09/2017 07:36 AM, Gregory Szorc wrote:
> > # HG changeset patch
> > # User Gregory Szorc <gregory.szorc@gmail.com>
> > # Date 1496977416 25200
> > #      Thu Jun 08 20:03:36 2017 -0700
> > # Node ID dfe0db942bbf860968b19fd8579865790d78d5e8
> > # Parent  e583b786ffba99cb775cf9d3a126cf50db74f85a
> > localrepo: establish a base class for an immutable local repository
> 
> I like the general idea and direction of this series. Moving toward more 
> guarantee of having a consistent view of the repository is great.

Yeah, the idea sounds great. I'm not sure if inheritance is the best option,
but that's probably the easiest choice to move things forward. Perhaps a
better (and harder) alternative is to split storage layer from localrepository
and switch it by immutable or not.

FWIW, maybe immutablerepo could be used to implement 'hg serve --readonly' ?
Pierre-Yves David - June 11, 2017, 12:12 p.m.
On 06/09/2017 07:31 PM, Gregory Szorc wrote:
> On Fri, Jun 9, 2017 at 2:59 AM, Pierre-Yves David <
> pierre-yves.david@ens-lyon.org> wrote:
>
>>
>>
>> On 06/09/2017 07:36 AM, Gregory Szorc wrote:
>>
>>> # HG changeset patch
>>> # User Gregory Szorc <gregory.szorc@gmail.com>
>>> # Date 1496977416 25200
>>> #      Thu Jun 08 20:03:36 2017 -0700
>>> # Node ID dfe0db942bbf860968b19fd8579865790d78d5e8
>>> # Parent  e583b786ffba99cb775cf9d3a126cf50db74f85a
>>> localrepo: establish a base class for an immutable local repository
>>>
>>
>> I like the general idea and direction of this series. Moving toward more
>> guarantee of having a consistent view of the repository is great.
>>
>
> Obviously the current repo store design can't be fully consistent because
> there are race conditions between process A opening files and process B
> mutating them. However, my utopia end state is we devise a storage model
> (either based completely on append-only data structures or on
> file/directory based "generations" for data that isn't append-only) and
> readers can read a single file denoting current offsets/generations to get
> an atomic and consistent view of a repo.

We are on the same page here.
I'm not sure how much code you have written about the above yet. I've a 
series that will allow to introduce more vfs to distinct the various 
types of files we have (wc related, append only, etc). I made good 
progress on it last September but it needed some rework and I did not 
had time to back to it yet. There are founding for it so this should 
happen soonish.

Cheers,
Pierre-Yves David - June 11, 2017, 12:15 p.m.
On 06/11/2017 10:01 AM, Yuya Nishihara wrote:
> On Fri, 9 Jun 2017 10:59:32 +0100, Pierre-Yves David wrote:
>> On 06/09/2017 07:36 AM, Gregory Szorc wrote:
>>> # HG changeset patch
>>> # User Gregory Szorc <gregory.szorc@gmail.com>
>>> # Date 1496977416 25200
>>> #      Thu Jun 08 20:03:36 2017 -0700
>>> # Node ID dfe0db942bbf860968b19fd8579865790d78d5e8
>>> # Parent  e583b786ffba99cb775cf9d3a126cf50db74f85a
>>> localrepo: establish a base class for an immutable local repository
>>
>> I like the general idea and direction of this series. Moving toward more
>> guarantee of having a consistent view of the repository is great.
>
> Yeah, the idea sounds great. I'm not sure if inheritance is the best option,
> but that's probably the easiest choice to move things forward. Perhaps a
> better (and harder) alternative is to split storage layer from localrepository
> and switch it by immutable or not.

I agree with the above. Using inheritance is good first step. It allow 
to introduce all the semantic changes we wants without breaking the API. 
We can look into changing the API are a second step. But having two 
steps will makes this much simpler.

Cheers,
Sean Farley - June 11, 2017, 10:30 p.m.
Pierre-Yves David <pierre-yves.david@ens-lyon.org> writes:

> On 06/11/2017 10:01 AM, Yuya Nishihara wrote:
>> On Fri, 9 Jun 2017 10:59:32 +0100, Pierre-Yves David wrote:
>>> On 06/09/2017 07:36 AM, Gregory Szorc wrote:
>>>> # HG changeset patch
>>>> # User Gregory Szorc <gregory.szorc@gmail.com>
>>>> # Date 1496977416 25200
>>>> #      Thu Jun 08 20:03:36 2017 -0700
>>>> # Node ID dfe0db942bbf860968b19fd8579865790d78d5e8
>>>> # Parent  e583b786ffba99cb775cf9d3a126cf50db74f85a
>>>> localrepo: establish a base class for an immutable local repository
>>>
>>> I like the general idea and direction of this series. Moving toward more
>>> guarantee of having a consistent view of the repository is great.
>>
>> Yeah, the idea sounds great. I'm not sure if inheritance is the best option,
>> but that's probably the easiest choice to move things forward. Perhaps a
>> better (and harder) alternative is to split storage layer from localrepository
>> and switch it by immutable or not.
>
> I agree with the above. Using inheritance is good first step. It allow 
> to introduce all the semantic changes we wants without breaking the API. 
> We can look into changing the API are a second step. But having two 
> steps will makes this much simpler.

I have a slightly different approach (and maybe Greg and I should sync
up) that moves most of the access pattern to context objects. My end
goal is for (e.g. extension authors) most operations you want will be
through a context. This would make localrepo mostly (I haven't finished
so can't say yet) read-only.

More concretely, my first series is about moving localrepo.dirstate ->
workingctx. Then moving localrepo.wvfs -> workingctx; etc.
Augie Fackler - June 13, 2017, 2:13 p.m.
On Sun, Jun 11, 2017 at 03:30:26PM -0700, Sean Farley wrote:
> Pierre-Yves David <pierre-yves.david@ens-lyon.org> writes:
>
> > On 06/11/2017 10:01 AM, Yuya Nishihara wrote:
> >> On Fri, 9 Jun 2017 10:59:32 +0100, Pierre-Yves David wrote:
> >>> On 06/09/2017 07:36 AM, Gregory Szorc wrote:
> >>>> # HG changeset patch
> >>>> # User Gregory Szorc <gregory.szorc@gmail.com>
> >>>> # Date 1496977416 25200
> >>>> #      Thu Jun 08 20:03:36 2017 -0700
> >>>> # Node ID dfe0db942bbf860968b19fd8579865790d78d5e8
> >>>> # Parent  e583b786ffba99cb775cf9d3a126cf50db74f85a
> >>>> localrepo: establish a base class for an immutable local repository
> >>>
> >>> I like the general idea and direction of this series. Moving toward more
> >>> guarantee of having a consistent view of the repository is great.
> >>
> >> Yeah, the idea sounds great. I'm not sure if inheritance is the best option,
> >> but that's probably the easiest choice to move things forward. Perhaps a
> >> better (and harder) alternative is to split storage layer from localrepository
> >> and switch it by immutable or not.
> >
> > I agree with the above. Using inheritance is good first step. It allow
> > to introduce all the semantic changes we wants without breaking the API.
> > We can look into changing the API are a second step. But having two
> > steps will makes this much simpler.
>
> I have a slightly different approach (and maybe Greg and I should sync
> up) that moves most of the access pattern to context objects.

That sounds promising. Could you and Greg sync at some point and let
us know how that works out?

> My end
> goal is for (e.g. extension authors) most operations you want will be
> through a context. This would make localrepo mostly (I haven't finished
> so can't say yet) read-only.
>
> More concretely, my first series is about moving localrepo.dirstate ->
> workingctx. Then moving localrepo.wvfs -> workingctx; etc.



> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@mercurial-scm.org
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Gregory Szorc - June 14, 2017, 4:18 a.m.
On Sun, Jun 11, 2017 at 3:30 PM, Sean Farley <sean@farley.io> wrote:

> Pierre-Yves David <pierre-yves.david@ens-lyon.org> writes:
>
> > On 06/11/2017 10:01 AM, Yuya Nishihara wrote:
> >> On Fri, 9 Jun 2017 10:59:32 +0100, Pierre-Yves David wrote:
> >>> On 06/09/2017 07:36 AM, Gregory Szorc wrote:
> >>>> # HG changeset patch
> >>>> # User Gregory Szorc <gregory.szorc@gmail.com>
> >>>> # Date 1496977416 25200
> >>>> #      Thu Jun 08 20:03:36 2017 -0700
> >>>> # Node ID dfe0db942bbf860968b19fd8579865790d78d5e8
> >>>> # Parent  e583b786ffba99cb775cf9d3a126cf50db74f85a
> >>>> localrepo: establish a base class for an immutable local repository
> >>>
> >>> I like the general idea and direction of this series. Moving toward
> more
> >>> guarantee of having a consistent view of the repository is great.
> >>
> >> Yeah, the idea sounds great. I'm not sure if inheritance is the best
> option,
> >> but that's probably the easiest choice to move things forward. Perhaps a
> >> better (and harder) alternative is to split storage layer from
> localrepository
> >> and switch it by immutable or not.
> >
> > I agree with the above. Using inheritance is good first step. It allow
> > to introduce all the semantic changes we wants without breaking the API.
> > We can look into changing the API are a second step. But having two
> > steps will makes this much simpler.
>
> I have a slightly different approach (and maybe Greg and I should sync
> up) that moves most of the access pattern to context objects. My end
> goal is for (e.g. extension authors) most operations you want will be
> through a context. This would make localrepo mostly (I haven't finished
> so can't say yet) read-only.
>
> More concretely, my first series is about moving localrepo.dirstate ->
> workingctx. Then moving localrepo.wvfs -> workingctx; etc.
>

This sounds similar to what I want to do! I want to have a read-only base
class then have context managers that return either a derived class or a
special type that exposes operations to mutate the repo. I didn't get
around to it in this series because that work involves a lot of BC breakage
and I specifically wanted to avoid major BC breakage as part of the initial
refactor in order to maximize the chances of the idea being accepted.

I was also toying with the idea of splitting up functionality into multiple
classes and using multiple inheritance on the repo type. Alternatively, we
could obtain instances of these domain-specific types via context managers.
I haven't thought as much about this aspect because it was too BC.
Sean Farley - June 14, 2017, 4:40 a.m.
Gregory Szorc <gregory.szorc@gmail.com> writes:

> On Sun, Jun 11, 2017 at 3:30 PM, Sean Farley <sean@farley.io> wrote:
>
>> Pierre-Yves David <pierre-yves.david@ens-lyon.org> writes:
>>
>> > On 06/11/2017 10:01 AM, Yuya Nishihara wrote:
>> >> On Fri, 9 Jun 2017 10:59:32 +0100, Pierre-Yves David wrote:
>> >>> On 06/09/2017 07:36 AM, Gregory Szorc wrote:
>> >>>> # HG changeset patch
>> >>>> # User Gregory Szorc <gregory.szorc@gmail.com>
>> >>>> # Date 1496977416 25200
>> >>>> #      Thu Jun 08 20:03:36 2017 -0700
>> >>>> # Node ID dfe0db942bbf860968b19fd8579865790d78d5e8
>> >>>> # Parent  e583b786ffba99cb775cf9d3a126cf50db74f85a
>> >>>> localrepo: establish a base class for an immutable local repository
>> >>>
>> >>> I like the general idea and direction of this series. Moving toward
>> more
>> >>> guarantee of having a consistent view of the repository is great.
>> >>
>> >> Yeah, the idea sounds great. I'm not sure if inheritance is the best
>> option,
>> >> but that's probably the easiest choice to move things forward. Perhaps a
>> >> better (and harder) alternative is to split storage layer from
>> localrepository
>> >> and switch it by immutable or not.
>> >
>> > I agree with the above. Using inheritance is good first step. It allow
>> > to introduce all the semantic changes we wants without breaking the API.
>> > We can look into changing the API are a second step. But having two
>> > steps will makes this much simpler.
>>
>> I have a slightly different approach (and maybe Greg and I should sync
>> up) that moves most of the access pattern to context objects. My end
>> goal is for (e.g. extension authors) most operations you want will be
>> through a context. This would make localrepo mostly (I haven't finished
>> so can't say yet) read-only.
>>
>> More concretely, my first series is about moving localrepo.dirstate ->
>> workingctx. Then moving localrepo.wvfs -> workingctx; etc.
>>
>
> This sounds similar to what I want to do! I want to have a read-only base
> class then have context managers that return either a derived class or a
> special type that exposes operations to mutate the repo. I didn't get
> around to it in this series because that work involves a lot of BC breakage
> and I specifically wanted to avoid major BC breakage as part of the initial
> refactor in order to maximize the chances of the idea being accepted.

Ah, perhaps there is a naming clash here. I meant mercurial/context.py
not python's context manager. It's an idea I've been circling since I
started memctx (and that Jun brought up recently due to his LFS work)
merge years ago; not for read-only stuff (which you are working on) but
for data access. Let me use a code example to help illustrate this:

# make a new commit
m = memctx(...)
m['file_foo'] = my_new_data() # returns a read-only object memfilectx
m.commit()

# write something to disk
ctx = repo[None] # workingctx
ctx['file_foo'] = some_data() # returns a read-only workingfilectx object
# nothing to commit here

There are a lot of questions to answer still but I hope this is a more
clear picture of what I'm thinking.

P.S. we both live < 2 miles from each other; perhaps I should schedule
another weekend hackday?

Patch

diff --git a/mercurial/localrepo.py b/mercurial/localrepo.py
--- a/mercurial/localrepo.py
+++ b/mercurial/localrepo.py
@@ -248,7 +248,10 @@  class locallegacypeer(localpeer):
 # clients.
 REVLOGV2_REQUIREMENT = 'exp-revlogv2.0'
 
-class localrepository(object):
+class immutablelocalrepository(object):
+    """An immutable repository on local disk."""
+
+class localrepository(immutablelocalrepository):
 
     supportedformats = {
         'revlogv1',