Patchwork D1606: phases: drop the list with phase of each rev, always comput phase sets

login
register
mail settings
Submitter phabricator
Date Dec. 6, 2017, 2:48 p.m.
Message ID <differential-rev-PHID-DREV-tibfbh364arde4dcflju-req@phab.mercurial-scm.org>
Download mbox | patch
Permalink /patch/25969/
State Superseded
Headers show

Comments

phabricator - Dec. 6, 2017, 2:48 p.m.
joerg.sonnenberger created this revision.
Herald added a subscriber: mercurial-devel.
Herald added a reviewer: hg-reviewers.

REVISION SUMMARY
  Change the C implementation of phasecache.loadphaserevs to provide only
  the sets for draft and secret phase as well as the number of revisions
  seen.
  
  Change the pure Python implementation of the same functino to compute
  the sets instead of the list of phases for each revision.
  
  Change phasecache.phase to check the phase sets and assume public if the
  revision is in neither draft nor secret set. This is computationally
  slightly more expensive.
  
  Change phasecache.getrevset for public() based queries to compute the
  set of non-matching revisions and return the result as filtered
  fullreposet. A shortcut is taken when no draft or secret revision
  exists.
  
  Bump the module version for the changed interface contract.
  
  Overall, this saves around 16 Bytes per revision whenever the phasecache
  is used, for the test case in issue5691 it is around 3MB. getrevset()
  for a large repository is around 13% slower here, that seems an
  acceptable trade off. Performance impact for phase() should be similar.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D1606

AFFECTED FILES
  mercurial/cext/parsers.c
  mercurial/cext/revlog.c
  mercurial/localrepo.py
  mercurial/phases.py
  mercurial/policy.py

CHANGE DETAILS




To: joerg.sonnenberger, #hg-reviewers
Cc: mercurial-devel
phabricator - Dec. 7, 2017, 2:02 a.m.
quark requested changes to this revision.
quark added a comment.
This revision now requires changes to proceed.


  I have been also wanting to remove this O(changelog) space usage for long. I noticed phasecache was the biggest memory consumer among all `repo.` components (dirstate, changelog, etc) when prototyping stateful chg early this year.
  
  Requesting changes mainly because the `if public in phases` branch in `getrevset` has issues. Otherwise it looks good to me.

INLINE COMMENTS

> revlog.c:627
>  
>  static PyObject *compute_phases_map_sets(indexObject *self, PyObject *args)
>  {

I guess the `map` here means a `rev` -> `phase` map, which was the list. Since we now return len(size). Maybe rename the function to `compute_phases_len_sets`.

> revlog.c:1965
>  	 "get an index entry"},
>  	{"computephasesmapsets", (PyCFunction)compute_phases_map_sets,
>  			METH_VARARGS, "compute phases"},

Also here, `computephaseslensets`

> phases.py:226
>          else:
> -            # slow path - enumerate all revisions
> -            phase = self.phase
> -            revs = (r for r in repo if phase(repo, r) in phases)
> -            return smartset.generatorset(revs, iterasc=True)
> +            assert not repo.changelog.filteredrevs
> +            phases = set(allphases).difference(phases)

The `assert` is incorrect - repo could be "filtered". The old code support filtered repo implicitly by using `for r in repo`, the iterator of `repo` will remove filtered revisions.

> phases.py:226-237
> +            assert not repo.changelog.filteredrevs
> +            phases = set(allphases).difference(phases)
> +            if not phases:
> +                return smartset.fullreposet()
> +            if len(phases) == 1:
> +                [p] = phases
> +                revs = self._phasesets[p]

Sorry - I made a mistake using `repo.revs('public()')` to test the logic here. Actually that exercises a different code path <https://www.mercurial-scm.org/repo/hg/file/aa905f9cdcda/mercurial/revset.py#l1616>.

There are practically no users exercising this `public in phases` code path. So issues like `fullreposet` requires a `repo` argument is not exposed.

I think we can either:

- Pass an optional `subset` here then replace the revset implementation to use it:

  def getrevset(self, repo, phases, subset=None):
      if subset is None:
          subset = smartset.fullreposet(repo)
      if ...
          # fast path
          ...
          return subset & smartset.baseset(revs)
      else:
          ...



- Remove support for calculating `public()` here by just `raise error.ProgrammingError('public() is unsupported')`

I guess the former is preferred eventually. But if you want to land this faster, the latter is easier. I can help writing a follow-up.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D1606

To: joerg.sonnenberger, #hg-reviewers, quark
Cc: quark, mercurial-devel
phabricator - Dec. 7, 2017, 3:38 p.m.
joerg.sonnenberger added inline comments.

INLINE COMMENTS

> quark wrote in phases.py:226-237
> Sorry - I made a mistake using `repo.revs('public()')` to test the logic here. Actually that exercises a different code path <https://www.mercurial-scm.org/repo/hg/file/aa905f9cdcda/mercurial/revset.py#l1616>.
> 
> There are practically no users exercising this `public in phases` code path. So issues like `fullreposet` requires a `repo` argument is not exposed.
> 
> I think we can either:
> 
> - Pass an optional `subset` here then replace the revset implementation to use it:
> 
>   def getrevset(self, repo, phases, subset=None):
>       if subset is None:
>           subset = smartset.fullreposet(repo)
>       if ...
>           # fast path
>           ...
>           return subset & smartset.baseset(revs)
>       else:
>           ...
> 
> 
> 
> - Remove support for calculating `public()` here by just `raise error.ProgrammingError('public() is unsupported')`
> 
> I guess the former is preferred eventually. But if you want to land this faster, the latter is easier. I can help writing a follow-up.

I think this can just follow the logic of the other branch, i.e. short cut if `len(phases) == 1 and repo.changelog.filteredrevs`, otherwise, add `repo.changelog.filteredrevs` to the union as well.
I added the assert to make sure that the check wasn't necessary, missed the implicit dependency on repo here.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D1606

To: joerg.sonnenberger, #hg-reviewers, quark
Cc: durin42, quark, mercurial-devel
phabricator - Dec. 7, 2017, 8:43 p.m.
quark added inline comments.

INLINE COMMENTS

> joerg.sonnenberger wrote in phases.py:226-237
> I think this can just follow the logic of the other branch, i.e. short cut if `len(phases) == 1 and repo.changelog.filteredrevs`, otherwise, add `repo.changelog.filteredrevs` to the union as well.
> I added the assert to make sure that the check wasn't necessary, missed the implicit dependency on repo here.

Agree. The only real difference with `revset.py:public` is the `subset` argument.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D1606

To: joerg.sonnenberger, #hg-reviewers, quark
Cc: durin42, quark, mercurial-devel
phabricator - Dec. 8, 2017, 3 a.m.
quark accepted this revision.
quark added a comment.


  I think this is good enough. We can migrate `revset.py` `public()` implementation later.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D1606

To: joerg.sonnenberger, #hg-reviewers, quark
Cc: durin42, quark, mercurial-devel
phabricator - Dec. 8, 2017, 2:54 p.m.
yuja accepted this revision.
yuja added a comment.
This revision is now accepted and ready to land.


  Queued, thanks. Can you send follow-up patches to address trivial issues?

INLINE COMMENTS

> revlog.c:692
>  	for (i = 0; i < len; i++) {
>  		PyObject *phaseval;
>  

Removed unused variable in flight.

> phases.py:205
>              self.phaseroots, self.dirty = _readroots(repo, phasedefaults)
> -            self._phaserevs = None
> +            self._phasemaxrev = nullrev
>              self._phasesets = None

Nit: this isn't the "max" revision, but the size. Perhaps the initial
value should be 0.

> phases.py:236
> +                return smartset.fullreposet(repo)
> +            return smartset.fullreposet(repo).filter(lambda r: r not in revs)
>  

Nit: could be written as `fullrepoest(repo) - baseset(revs)`.

> phases.py:264
>          repo = repo.unfiltered()
> -        revs = [public] * len(repo.changelog)
> -        self._phaserevs = revs
> -        self._populatephaseroots(repo)
> -        for phase in trackedphases:
> -            roots = list(map(repo.changelog.rev, self.phaseroots[phase]))
> -            if roots:
> -                for rev in roots:
> -                    revs[rev] = phase
> -                for rev in repo.changelog.descendants(roots):
> -                    revs[rev] = phase
> +        cl = repo.changelog
> +        self._phasesets = [set() for phase in allphases]

Nice, no dependency on repo. As a follow-up, maybe we
can extract a pure function as truly drop-in replacement for
the native function.

> phases.py:266
> +        self._phasesets = [set() for phase in allphases]
> +        roots = map(cl.rev, self.phaseroots[secret])
> +        if roots:

s/map/pycompat.maplist/ for Python 3 compatibility. Fixed in flight.

> phases.py:295
>      def phase(self, repo, rev):
>          # We need a repo argument here to be able to build _phaserevs
>          # if necessary. The repository instance is not stored in

s/_phaserevs/_phasesets/. Fixed in flight.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D1606

To: joerg.sonnenberger, #hg-reviewers, quark, yuja
Cc: yuja, durin42, quark, mercurial-devel
phabricator - Dec. 8, 2017, 10:58 p.m.
quark added inline comments.

INLINE COMMENTS

> yuja wrote in phases.py:236
> Nit: could be written as `fullrepoest(repo) - baseset(revs)`.

`baseset.__contains__` is slower than `set.__contains__` so the current code is faster.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D1606

To: joerg.sonnenberger, #hg-reviewers, quark, yuja
Cc: yuja, durin42, quark, mercurial-devel
phabricator - Dec. 9, 2017, 12:58 a.m.
yuja added inline comments.

INLINE COMMENTS

> quark wrote in phases.py:236
> `baseset.__contains__` is slower than `set.__contains__` so the current code is faster.

Really? It's aliased to `self._set.__contains__`.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D1606

To: joerg.sonnenberger, #hg-reviewers, quark, yuja
Cc: yuja, durin42, quark, mercurial-devel

Patch

diff --git a/mercurial/policy.py b/mercurial/policy.py
--- a/mercurial/policy.py
+++ b/mercurial/policy.py
@@ -75,7 +75,7 @@ 
     (r'cext', r'diffhelpers'): 1,
     (r'cext', r'mpatch'): 1,
     (r'cext', r'osutil'): 1,
-    (r'cext', r'parsers'): 3,
+    (r'cext', r'parsers'): 4,
 }
 
 # map import request to other package or module
diff --git a/mercurial/phases.py b/mercurial/phases.py
--- a/mercurial/phases.py
+++ b/mercurial/phases.py
@@ -202,46 +202,55 @@ 
         if _load:
             # Cheap trick to allow shallow-copy without copy module
             self.phaseroots, self.dirty = _readroots(repo, phasedefaults)
-            self._phaserevs = None
+            self._phasemaxrev = nullrev
             self._phasesets = None
             self.filterunknown(repo)
             self.opener = repo.svfs
 
     def getrevset(self, repo, phases):
         """return a smartset for the given phases"""
         self.loadphaserevs(repo) # ensure phase's sets are loaded
-
-        if self._phasesets and all(self._phasesets[p] is not None
-                                   for p in phases):
-            # fast path - use _phasesets
-            revs = self._phasesets[phases[0]]
-            if len(phases) > 1:
-                revs = revs.copy() # only copy when needed
-                for p in phases[1:]:
-                    revs.update(self._phasesets[p])
+        phases = set(phases)
+        if public not in phases:
+            # fast path: _phasesets contains the interesting sets,
+            # might only need a union and post-filtering.
+            if len(phases) == 1:
+                [p] = phases
+                revs = self._phasesets[p]
+            else:
+                revs = set.union(*[self._phasesets[p] for p in phases])
             if repo.changelog.filteredrevs:
                 revs = revs - repo.changelog.filteredrevs
             return smartset.baseset(revs)
         else:
-            # slow path - enumerate all revisions
-            phase = self.phase
-            revs = (r for r in repo if phase(repo, r) in phases)
-            return smartset.generatorset(revs, iterasc=True)
+            assert not repo.changelog.filteredrevs
+            phases = set(allphases).difference(phases)
+            if not phases:
+                return smartset.fullreposet()
+            if len(phases) == 1:
+                [p] = phases
+                revs = self._phasesets[p]
+            else:
+                revs = set.union(*[self._phasesets[p] for p in phases])
+            if not revs:
+                return smartset.fullreposet()
+            return smartset.fullreposet().filter(lambda r: r not in revs)
 
     def copy(self):
         # Shallow copy meant to ensure isolation in
         # advance/retractboundary(), nothing more.
         ph = self.__class__(None, None, _load=False)
         ph.phaseroots = self.phaseroots[:]
         ph.dirty = self.dirty
         ph.opener = self.opener
-        ph._phaserevs = self._phaserevs
+        ph._phasemaxrev = self._phasemaxrev
         ph._phasesets = self._phasesets
         return ph
 
     def replace(self, phcache):
         """replace all values in 'self' with content of phcache"""
-        for a in ('phaseroots', 'dirty', 'opener', '_phaserevs', '_phasesets'):
+        for a in ('phaseroots', 'dirty', 'opener', '_phasemaxrev',
+                  '_phasesets'):
             setattr(self, a, getattr(phcache, a))
 
     def _getphaserevsnative(self, repo):
@@ -253,40 +262,36 @@ 
 
     def _computephaserevspure(self, repo):
         repo = repo.unfiltered()
-        revs = [public] * len(repo.changelog)
-        self._phaserevs = revs
-        self._populatephaseroots(repo)
-        for phase in trackedphases:
-            roots = list(map(repo.changelog.rev, self.phaseroots[phase]))
-            if roots:
-                for rev in roots:
-                    revs[rev] = phase
-                for rev in repo.changelog.descendants(roots):
-                    revs[rev] = phase
+        cl = repo.changelog
+        self._phasesets = [set() for phase in allphases]
+        roots = map(cl.rev, self.phaseroots[secret])
+        if roots:
+            ps = set(cl.descendants(roots))
+            for root in roots:
+                ps.add(root)
+            self._phasesets[secret] = ps
+        roots = map(cl.rev, self.phaseroots[draft])
+        if roots:
+            ps = set(cl.descendants(roots))
+            for root in roots:
+                ps.add(root)
+            ps.difference_update(self._phasesets[secret])
+            self._phasesets[draft] = ps
+        self._phasemaxrev = len(cl)
 
     def loadphaserevs(self, repo):
         """ensure phase information is loaded in the object"""
-        if self._phaserevs is None:
+        if self._phasesets is None:
             try:
                 res = self._getphaserevsnative(repo)
-                self._phaserevs, self._phasesets = res
+                self._phasemaxrev, self._phasesets = res
             except AttributeError:
                 self._computephaserevspure(repo)
 
     def invalidate(self):
-        self._phaserevs = None
+        self._phasemaxrev = nullrev
         self._phasesets = None
 
-    def _populatephaseroots(self, repo):
-        """Fills the _phaserevs cache with phases for the roots.
-        """
-        cl = repo.changelog
-        phaserevs = self._phaserevs
-        for phase in trackedphases:
-            roots = map(cl.rev, self.phaseroots[phase])
-            for root in roots:
-                phaserevs[root] = phase
-
     def phase(self, repo, rev):
         # We need a repo argument here to be able to build _phaserevs
         # if necessary. The repository instance is not stored in
@@ -297,10 +302,13 @@ 
             return public
         if rev < nullrev:
             raise ValueError(_('cannot lookup negative revision'))
-        if self._phaserevs is None or rev >= len(self._phaserevs):
+        if rev >= self._phasemaxrev:
             self.invalidate()
             self.loadphaserevs(repo)
-        return self._phaserevs[rev]
+        for phase in trackedphases:
+            if rev in self._phasesets[phase]:
+                return phase
+        return public
 
     def write(self):
         if not self.dirty:
@@ -455,10 +463,10 @@ 
         if filtered:
             self.dirty = True
         # filterunknown is called by repo.destroyed, we may have no changes in
-        # root but phaserevs contents is certainly invalid (or at least we
+        # root but _phasesets contents is certainly invalid (or at least we
         # have not proper way to check that). related to issue 3858.
         #
-        # The other caller is __init__ that have no _phaserevs initialized
+        # The other caller is __init__ that have no _phasesets initialized
         # anyway. If this change we should consider adding a dedicated
         # "destroyed" function to phasecache or a proper cache key mechanism
         # (see branchmap one)
diff --git a/mercurial/localrepo.py b/mercurial/localrepo.py
--- a/mercurial/localrepo.py
+++ b/mercurial/localrepo.py
@@ -704,8 +704,8 @@ 
     def _activebookmark(self):
         return self._bookmarks.active
 
-    # _phaserevs and _phasesets depend on changelog. what we need is to
-    # call _phasecache.invalidate() if '00changelog.i' was changed, but it
+    # _phasesets depend on changelog. what we need is to call 
+    # _phasecache.invalidate() if '00changelog.i' was changed, but it
     # can't be easily expressed in filecache mechanism.
     @storecache('phaseroots', '00changelog.i')
     def _phasecache(self):
diff --git a/mercurial/cext/revlog.c b/mercurial/cext/revlog.c
--- a/mercurial/cext/revlog.c
+++ b/mercurial/cext/revlog.c
@@ -628,7 +628,7 @@ 
 {
 	PyObject *roots = Py_None;
 	PyObject *ret = NULL;
-	PyObject *phaseslist = NULL;
+	PyObject *phasessize = NULL;
 	PyObject *phaseroots = NULL;
 	PyObject *phaseset = NULL;
 	PyObject *phasessetlist = NULL;
@@ -685,8 +685,8 @@ 
 		}
 	}
 	/* Transform phase list to a python list */
-	phaseslist = PyList_New(len);
-	if (phaseslist == NULL)
+	phasessize = PyInt_FromLong(len);
+	if (phasessize == NULL)
 		goto release;
 	for (i = 0; i < len; i++) {
 		PyObject *phaseval;
@@ -702,15 +702,11 @@ 
 			PySet_Add(phaseset, rev);
 			Py_XDECREF(rev);
 		}
-		phaseval = PyInt_FromLong(phase);
-		if (phaseval == NULL)
-			goto release;
-		PyList_SET_ITEM(phaseslist, i, phaseval);
 	}
-	ret = PyTuple_Pack(2, phaseslist, phasessetlist);
+	ret = PyTuple_Pack(2, phasessize, phasessetlist);
 
 release:
-	Py_XDECREF(phaseslist);
+	Py_XDECREF(phasessize);
 	Py_XDECREF(phasessetlist);
 done:
 	free(phases);
diff --git a/mercurial/cext/parsers.c b/mercurial/cext/parsers.c
--- a/mercurial/cext/parsers.c
+++ b/mercurial/cext/parsers.c
@@ -710,7 +710,7 @@ 
 void manifest_module_init(PyObject *mod);
 void revlog_module_init(PyObject *mod);
 
-static const int version = 3;
+static const int version = 4;
 
 static void module_init(PyObject *mod)
 {