Patchwork [2,of,8] transaction: track newly introduced revisions

login
register
mail settings
Submitter Pierre-Yves David
Date May 2, 2017, 11:43 p.m.
Message ID <76a035851a620dcf36a6.1493768619@nodosa.octopoid.net>
Download mbox | patch
Permalink /patch/20364/
State Accepted
Headers show

Comments

Pierre-Yves David - May 2, 2017, 11:43 p.m.
# HG changeset patch
# User Pierre-Yves David <pierre-yves.david@ens-lyon.org>
# Date 1493743551 -7200
#      Tue May 02 18:45:51 2017 +0200
# Branch stable
# Node ID 76a035851a620dcf36a6dd18a37409dff43c6d42
# Parent  6697da7c4eab3fbe3588a2f91fa3f99b16f808ac
# EXP-Topic obscache
# Available At https://www.mercurial-scm.org/repo/users/marmoute/mercurial/
#              hg pull https://www.mercurial-scm.org/repo/users/marmoute/mercurial/ -r 76a035851a62
transaction: track newly introduced revisions

Tracking revisions is not the data that will unlock the most new capability.
However, they are the simplest thing to track and still unlock some nice
improvements in regard with caching.

We plug ourself at the changelog level to make sure we do not miss any revision
additions.

The 'revs' set is configured at the repository level because the transaction
itself does not needs to know that much about the business logic.
Jun Wu - May 3, 2017, 8:09 a.m.
Excerpts from Pierre-Yves David's message of 2017-05-03 01:43:39 +0200:
> # HG changeset patch
> # User Pierre-Yves David <pierre-yves.david@ens-lyon.org>
> # Date 1493743551 -7200
> #      Tue May 02 18:45:51 2017 +0200
> # Branch stable
> # Node ID 76a035851a620dcf36a6dd18a37409dff43c6d42
> # Parent  6697da7c4eab3fbe3588a2f91fa3f99b16f808ac
> # EXP-Topic obscache
> # Available At https://www.mercurial-scm.org/repo/users/marmoute/mercurial/ 
> #              hg pull https://www.mercurial-scm.org/repo/users/marmoute/mercurial/  -r 76a035851a62
> transaction: track newly introduced revisions
> 
> Tracking revisions is not the data that will unlock the most new capability.
> However, they are the simplest thing to track and still unlock some nice
> improvements in regard with caching.
> 
> We plug ourself at the changelog level to make sure we do not miss any revision
> additions.
> 
> The 'revs' set is configured at the repository level because the transaction
> itself does not needs to know that much about the business logic.
> 
> diff --git a/mercurial/changelog.py b/mercurial/changelog.py
> --- a/mercurial/changelog.py
> +++ b/mercurial/changelog.py
> @@ -535,3 +535,14 @@ class changelog(revlog.revlog):
>          just to access this is costly."""
>          extra = self.read(rev)[5]
>          return encoding.tolocal(extra.get("branch")), 'close' in extra
> +
> +    def _addrevision(self, node, rawtext, transaction, *args, **kwargs):
> +        # overlay over the standard revlog._addrevision to track the new
> +        # revision on the transaction.
> +        rev = len(self)
> +        node = super(changelog, self)._addrevision(node, rawtext, transaction,
> +                                                   *args, **kwargs)
> +        revs = transaction.changes.get('revs')
> +        if revs is not None:
> +            revs.add(rev)

Since changelog is append-only, and strip won't happen during a transaction.
Why not just record len(changelog) during transaction start, and test
len(changelog) at the end of transaction? That's O(1), instead of O(N log N)
here.

> +        return node
> diff --git a/mercurial/localrepo.py b/mercurial/localrepo.py
> --- a/mercurial/localrepo.py
> +++ b/mercurial/localrepo.py
> @@ -1100,6 +1100,7 @@ class localrepository(object):
>                                       self.store.createmode,
>                                       validator=validate,
>                                       releasefn=releasefn)
> +        tr.changes['revs'] = set()
>  
>          tr.hookargs['txnid'] = txnid
>          # note: writing the fncache only during finalize mean that the file is
Pierre-Yves David - May 3, 2017, 12:49 p.m.
On 05/03/2017 10:09 AM, Jun Wu wrote:
> Excerpts from Pierre-Yves David's message of 2017-05-03 01:43:39 +0200:
>> # HG changeset patch
>> # User Pierre-Yves David <pierre-yves.david@ens-lyon.org>
>> # Date 1493743551 -7200
>> #      Tue May 02 18:45:51 2017 +0200
>> # Branch stable
>> # Node ID 76a035851a620dcf36a6dd18a37409dff43c6d42
>> # Parent  6697da7c4eab3fbe3588a2f91fa3f99b16f808ac
>> # EXP-Topic obscache
>> # Available At https://www.mercurial-scm.org/repo/users/marmoute/mercurial/
>> #              hg pull https://www.mercurial-scm.org/repo/users/marmoute/mercurial/  -r 76a035851a62
>> transaction: track newly introduced revisions
>>
>> Tracking revisions is not the data that will unlock the most new capability.
>> However, they are the simplest thing to track and still unlock some nice
>> improvements in regard with caching.
>>
>> We plug ourself at the changelog level to make sure we do not miss any revision
>> additions.
>>
>> The 'revs' set is configured at the repository level because the transaction
>> itself does not needs to know that much about the business logic.
>>
>> diff --git a/mercurial/changelog.py b/mercurial/changelog.py
>> --- a/mercurial/changelog.py
>> +++ b/mercurial/changelog.py
>> @@ -535,3 +535,14 @@ class changelog(revlog.revlog):
>>          just to access this is costly."""
>>          extra = self.read(rev)[5]
>>          return encoding.tolocal(extra.get("branch")), 'close' in extra
>> +
>> +    def _addrevision(self, node, rawtext, transaction, *args, **kwargs):
>> +        # overlay over the standard revlog._addrevision to track the new
>> +        # revision on the transaction.
>> +        rev = len(self)
>> +        node = super(changelog, self)._addrevision(node, rawtext, transaction,
>> +                                                   *args, **kwargs)
>> +        revs = transaction.changes.get('revs')
>> +        if revs is not None:
>> +            revs.add(rev)
>
> Since changelog is append-only, and strip won't happen during a transaction.
> Why not just record len(changelog) during transaction start, and test
> len(changelog) at the end of transaction?

Mostly for simplicity, having the pure raw information on the 
transaction is simple and cheap. And it gives us access to all kind of 
native operation directly (len, membership, iteration) without the needs 
to perform the same (simple) operation all over the place. It also allow 
user code to operation without having a changelog object, another small win.

User-code in this series are fairly simple, but we'll have smarter and 
smarter usage of these data as more code uses it.

I'm not worried about the memory usage since many other place assumes we 
can keep all repository revs in memory.

> That's O(1), instead of O(N log N)

(note: I'm not sure what type of complexity you refer to, nor what is 
the operation you are considering to get theses number)

Cheers,
Jun Wu - May 3, 2017, 5:24 p.m.
Excerpts from Pierre-Yves David's message of 2017-05-03 14:49:53 +0200:
> > Since changelog is append-only, and strip won't happen during a transaction.
> > Why not just record len(changelog) during transaction start, and test
> > len(changelog) at the end of transaction?
> 
> Mostly for simplicity, having the pure raw information on the 
> transaction is simple and cheap. And it gives us access to all kind of 
> native operation directly (len, membership, iteration) without the needs 
> to perform the same (simple) operation all over the place. It also allow 
> user code to operation without having a changelog object, another small win.
> 
> User-code in this series are fairly simple, but we'll have smarter and 
> smarter usage of these data as more code uses it.

The above are still abstract to me. Could you give some examples about the
future usage of the new information (in a short way) ?
 
> I'm not worried about the memory usage since many other place assumes we 
> can keep all repository revs in memory.

Yes. But I don't think "this is already O(N)" is a good reason to add
another "O(N)" thing given the fact we know there is an easy "O(1)"
alternative.

> > That's O(1), instead of O(N log N)
> 
> (note: I'm not sure what type of complexity you refer to, nor what is 
> the operation you are considering to get theses number)

Both time and memory. If I pulled N revisions, this series is building a set
with N elements by calling set.add N times. The set is only used to detect
new revisions added to changelog, which could be represented as an O(1)
"xrange" by recording changlog length at the start and end of transaction
(which has an O(1) time complexity).
Pierre-Yves David - May 4, 2017, 12:03 a.m.
On 05/03/2017 07:24 PM, Jun Wu wrote:
> Excerpts from Pierre-Yves David's message of 2017-05-03 14:49:53 +0200:
>>> Since changelog is append-only, and strip won't happen during a transaction.
>>> Why not just record len(changelog) during transaction start, and test
>>> len(changelog) at the end of transaction?
>>
>> Mostly for simplicity, having the pure raw information on the
>> transaction is simple and cheap. And it gives us access to all kind of
>> native operation directly (len, membership, iteration) without the needs
>> to perform the same (simple) operation all over the place. It also allow
>> user code to operation without having a changelog object, another small win.
>>
>> User-code in this series are fairly simple, but we'll have smarter and
>> smarter usage of these data as more code uses it.
>
> The above are still abstract to me. Could you give some examples about the
> future usage of the new information (in a short way) ?

Current planned foreseen usages are:

* detecting that new revs have been added, (__nonzero__),
* pre-sizing various cache update with the number of changeset (__len__)
* feeding new revision to cache update function and hooks (__iter__),
* issuing message about new changesets (__len__),
* checking is special property node (head) are been newly added 
(__contains__)
(There will probably be more)

Sure, We could derive all that data from a min/max pair. However this 
means every single user of that data will have to redo the same 
arithmetic all over again. This is error prone, tedious and ineffective
(the code-base is already suffering from this in multiple places).

In addition, 'tr.changes' will contains other types of data "in-full" 
(bookmark movements, tag movements, phase changes, new obs-markers, 
etc…). So having the 'revs' information 'in-full' too is more consistent.

>> I'm not worried about the memory usage since many other place assumes we
>> can keep all repository revs in memory.
>
> Yes. But I don't think "this is already O(N)" is a good reason to add
> another "O(N)" thing given the fact we know there is an easy "O(1)"
> alternative.

If the basic approach (native set) approach turns out to be an issue, we 
can turn it into something smarter (eg: something based on 
smartset.spanset), but for now I see no point in blocking the current 
series and making the approach more complex. Especially when they are no 
clear evidence it will matters..

>>> That's O(1), instead of O(N log N)
>>
>> (note: I'm not sure what type of complexity you refer to, nor what is
>> the operation you are considering to get theses number)
>
> Both time and memory.

(not: here memory in O(N))

> If I pulled N revisions, this series is building a set
> with N elements by calling set.add N times.

If you pull N revisions, You are adding at least N new revlog entries + 
over implied computation on these N new revisions (eg: cache update) .
The time spend doing 'set.add' will be negligible in regard with the above.

> The set is only used to detect
> new revisions added to changelog, which could be represented as an O(1)
> "xrange" by recording changlog length at the start and end of transaction
> (which has an O(1) time complexity).

You seems to be missing the goal of 'tr.changes'. Its goal it to 
contains an accurate record of the changes made by a transaction. The 
current user is simple, because we start simple.

Cheers,
Yuya Nishihara - May 12, 2017, 2:42 p.m.
On Thu, 4 May 2017 02:03:30 +0200, Pierre-Yves David wrote:
> On 05/03/2017 07:24 PM, Jun Wu wrote:
> > Excerpts from Pierre-Yves David's message of 2017-05-03 14:49:53 +0200:
> >>> Since changelog is append-only, and strip won't happen during a transaction.
> >>> Why not just record len(changelog) during transaction start, and test
> >>> len(changelog) at the end of transaction?
> >>
> >> Mostly for simplicity, having the pure raw information on the
> >> transaction is simple and cheap. And it gives us access to all kind of
> >> native operation directly (len, membership, iteration) without the needs
> >> to perform the same (simple) operation all over the place. It also allow
> >> user code to operation without having a changelog object, another small win.
> >>
> >> User-code in this series are fairly simple, but we'll have smarter and
> >> smarter usage of these data as more code uses it.
> >
> > The above are still abstract to me. Could you give some examples about the
> > future usage of the new information (in a short way) ?
> 
> Current planned foreseen usages are:
> 
> * detecting that new revs have been added, (__nonzero__),
> * pre-sizing various cache update with the number of changeset (__len__)
> * feeding new revision to cache update function and hooks (__iter__),
> * issuing message about new changesets (__len__),
> * checking is special property node (head) are been newly added 
> (__contains__)
> (There will probably be more)
> 
> Sure, We could derive all that data from a min/max pair. However this 
> means every single user of that data will have to redo the same 
> arithmetic all over again. This is error prone, tedious and ineffective
> (the code-base is already suffering from this in multiple places).
> 
> In addition, 'tr.changes' will contains other types of data "in-full" 
> (bookmark movements, tag movements, phase changes, new obs-markers, 
> etc…). So having the 'revs' information 'in-full' too is more consistent.

Jun, do you have any other thoughts on this?

Since this series is floating for over a week, and this seems a net win,
I'll queue it in a couple of days.

> >> I'm not worried about the memory usage since many other place assumes we
> >> can keep all repository revs in memory.
> >
> > Yes. But I don't think "this is already O(N)" is a good reason to add
> > another "O(N)" thing given the fact we know there is an easy "O(1)"
> > alternative.
> 
> If the basic approach (native set) approach turns out to be an issue, we 
> can turn it into something smarter (eg: something based on 
> smartset.spanset), but for now I see no point in blocking the current 
> series and making the approach more complex. Especially when they are no 
> clear evidence it will matters..
> 
> >>> That's O(1), instead of O(N log N)
> >>
> >> (note: I'm not sure what type of complexity you refer to, nor what is
> >> the operation you are considering to get theses number)
> >
> > Both time and memory.
> 
> (not: here memory in O(N))
> 
> > If I pulled N revisions, this series is building a set
> > with N elements by calling set.add N times.
> 
> If you pull N revisions, You are adding at least N new revlog entries + 
> over implied computation on these N new revisions (eg: cache update) .
> The time spend doing 'set.add' will be negligible in regard with the above.
> 
> > The set is only used to detect
> > new revisions added to changelog, which could be represented as an O(1)
> > "xrange" by recording changlog length at the start and end of transaction
> > (which has an O(1) time complexity).
> 
> You seems to be missing the goal of 'tr.changes'. Its goal it to 
> contains an accurate record of the changes made by a transaction. The 
> current user is simple, because we start simple.

Do you mean we could have non-contiguous tr.changes['revs'] ? I don't care
much about the O(N) memory use (as the set should be smaller than the other
data we currently have), but I think keeping (startrev, endrev) will be
simpler if the range is contiguous.
Jun Wu - May 12, 2017, 3:38 p.m.
Excerpts from Yuya Nishihara's message of 2017-05-12 23:42:42 +0900:
> Jun, do you have any other thoughts on this?

Nope. Feel free to queue it.
Yuya Nishihara - May 13, 2017, 10:13 a.m.
On Fri, 12 May 2017 08:38:05 -0700, Jun Wu wrote:
> Excerpts from Yuya Nishihara's message of 2017-05-12 23:42:42 +0900:
> > Jun, do you have any other thoughts on this?
> 
> Nope. Feel free to queue it.

Okay, queued these. Many thanks.

> > >>> That's O(1), instead of O(N log N)
> > >>
> > >> (note: I'm not sure what type of complexity you refer to, nor what is
> > >> the operation you are considering to get theses number)
> > >
> > > Both time and memory.
> > 
> > (not: here memory in O(N))
> > 
> > > If I pulled N revisions, this series is building a set
> > > with N elements by calling set.add N times.
> > 
> > If you pull N revisions, You are adding at least N new revlog entries + 
> > over implied computation on these N new revisions (eg: cache update) .
> > The time spend doing 'set.add' will be negligible in regard with the above.
> > 
> > > The set is only used to detect
> > > new revisions added to changelog, which could be represented as an O(1)
> > > "xrange" by recording changlog length at the start and end of transaction
> > > (which has an O(1) time complexity).
> > 
> > You seems to be missing the goal of 'tr.changes'. Its goal it to 
> > contains an accurate record of the changes made by a transaction. The 
> > current user is simple, because we start simple.
> 
> Do you mean we could have non-contiguous tr.changes['revs'] ? I don't care
> much about the O(N) memory use (as the set should be smaller than the other
> data we currently have), but I think keeping (startrev, endrev) will be
> simpler if the range is contiguous.

I hope this will be addressed as a follow up if that's reasonable.
Pierre-Yves David - May 16, 2017, 9:05 a.m.
On 05/13/2017 12:13 PM, Yuya Nishihara wrote:
> On Fri, 12 May 2017 08:38:05 -0700, Jun Wu wrote:
>> Excerpts from Yuya Nishihara's message of 2017-05-12 23:42:42 +0900:
 >
> […]
 >
>>>>>> That's O(1), instead of O(N log N)
>>>>>
>>>>> (note: I'm not sure what type of complexity you refer to, nor what is
>>>>> the operation you are considering to get theses number)
>>>>
>>>> Both time and memory.
>>>
>>> (not: here memory in O(N))
>>>
>>>> If I pulled N revisions, this series is building a set
>>>> with N elements by calling set.add N times.
>>>
>>> If you pull N revisions, You are adding at least N new revlog entries +
>>> over implied computation on these N new revisions (eg: cache update) .
>>> The time spend doing 'set.add' will be negligible in regard with the above.
>>>
>>>> The set is only used to detect
>>>> new revisions added to changelog, which could be represented as an O(1)
>>>> "xrange" by recording changlog length at the start and end of transaction
>>>> (which has an O(1) time complexity).
>>>
>>> You seems to be missing the goal of 'tr.changes'. Its goal it to
>>> contains an accurate record of the changes made by a transaction. The
>>> current user is simple, because we start simple.
>>
>> Do you mean we could have non-contiguous tr.changes['revs'] ? I don't care
>> much about the O(N) memory use (as the set should be smaller than the other
>> data we currently have), but I think keeping (startrev, endrev) will be
>> simpler if the range is contiguous.
>
> I hope this will be addressed as a follow up if that's reasonable.

My plan is to revisit this once we have more data in the dictionary (eg: 
phase movement), this will having a more concrete grasp on consistency 
there.

Cheers,
Yuya Nishihara - May 16, 2017, 3:12 p.m.
On Tue, 16 May 2017 11:05:29 +0200, Pierre-Yves David wrote:
> On 05/13/2017 12:13 PM, Yuya Nishihara wrote:
> > On Fri, 12 May 2017 08:38:05 -0700, Jun Wu wrote:
> >> Excerpts from Yuya Nishihara's message of 2017-05-12 23:42:42 +0900:
>  >
> > […]
>  >
> >>>>>> That's O(1), instead of O(N log N)
> >>>>>
> >>>>> (note: I'm not sure what type of complexity you refer to, nor what is
> >>>>> the operation you are considering to get theses number)
> >>>>
> >>>> Both time and memory.
> >>>
> >>> (not: here memory in O(N))
> >>>
> >>>> If I pulled N revisions, this series is building a set
> >>>> with N elements by calling set.add N times.
> >>>
> >>> If you pull N revisions, You are adding at least N new revlog entries +
> >>> over implied computation on these N new revisions (eg: cache update) .
> >>> The time spend doing 'set.add' will be negligible in regard with the above.
> >>>
> >>>> The set is only used to detect
> >>>> new revisions added to changelog, which could be represented as an O(1)
> >>>> "xrange" by recording changlog length at the start and end of transaction
> >>>> (which has an O(1) time complexity).
> >>>
> >>> You seems to be missing the goal of 'tr.changes'. Its goal it to
> >>> contains an accurate record of the changes made by a transaction. The
> >>> current user is simple, because we start simple.
> >>
> >> Do you mean we could have non-contiguous tr.changes['revs'] ? I don't care
> >> much about the O(N) memory use (as the set should be smaller than the other
> >> data we currently have), but I think keeping (startrev, endrev) will be
> >> simpler if the range is contiguous.
> >
> > I hope this will be addressed as a follow up if that's reasonable.
> 
> My plan is to revisit this once we have more data in the dictionary (eg: 
> phase movement), this will having a more concrete grasp on consistency 
> there.

Fair enough. More usecases should help designing a good API.

My gut feeling is that we'll need a transaction subclass or wrapper which
knows more details about localrepo, instead of externalizing all repo-related
logic.

Patch

diff --git a/mercurial/changelog.py b/mercurial/changelog.py
--- a/mercurial/changelog.py
+++ b/mercurial/changelog.py
@@ -535,3 +535,14 @@  class changelog(revlog.revlog):
         just to access this is costly."""
         extra = self.read(rev)[5]
         return encoding.tolocal(extra.get("branch")), 'close' in extra
+
+    def _addrevision(self, node, rawtext, transaction, *args, **kwargs):
+        # overlay over the standard revlog._addrevision to track the new
+        # revision on the transaction.
+        rev = len(self)
+        node = super(changelog, self)._addrevision(node, rawtext, transaction,
+                                                   *args, **kwargs)
+        revs = transaction.changes.get('revs')
+        if revs is not None:
+            revs.add(rev)
+        return node
diff --git a/mercurial/localrepo.py b/mercurial/localrepo.py
--- a/mercurial/localrepo.py
+++ b/mercurial/localrepo.py
@@ -1100,6 +1100,7 @@  class localrepository(object):
                                      self.store.createmode,
                                      validator=validate,
                                      releasefn=releasefn)
+        tr.changes['revs'] = set()
 
         tr.hookargs['txnid'] = txnid
         # note: writing the fncache only during finalize mean that the file is