Patchwork subrepo: do not push "clean" subrepos when the parent repo is pushed

login
register
mail settings
Submitter Angel Ezquerra
Date Feb. 14, 2013, 12:06 a.m.
Message ID <26276460d54aecdeb107.1360800378@Angel-PC.localdomain>
Download mbox | patch
Permalink /patch/987/
State Superseded, archived
Headers show

Comments

Angel Ezquerra - Feb. 14, 2013, 12:06 a.m.
# HG changeset patch
# User Angel Ezquerra <angel.ezquerra@gmail.com>
# Date 1360795816 -3600
# Node ID 26276460d54aecdeb107c82c4e3f2ca7c0c6a8b3
# Parent  55b9b294b7544a6a144f627f71f4b770907d5a98
subrepo: do not push "clean" subrepos when the parent repo is pushed

A clean subrepo is defined as one that has not had its dirstate, bookmarks or
phases modified.

This patch works by adding a "clean" method to subrepos. In the case of
mercurial subrepos, this method calculates a "stamp" (i.e. a set of file hashes)
of the repository state at the time of push with a similar "stamp" that was
stored on a file when the subrepo was cloned or pushed to a given remote target.
If the stamps match the subrepo has no changes that must be pushed to the target
repository and thus the push can be skipped.

Note that we calculate the stamp file by calculating hashes for several key
repository files, such as the dirstate, the bookmarks file and the phaseroots
file. This means that our "clean" detection is not perfect, in the sense that
if the working directory has been updated to a different revision we will
assume that the subrepo is not clean. However, if we update to another revision
and back to the original revision the clean() method will correctly detec the
subrepo as being clean.

Also note that a subrepo being "clean" is not the opposite of it being "dirty".
A subrepo is dirty if it updated to a different revision that the one that is
pointed to by the subrepo parent or if its working directory is not clean. This
is a different concept.
Angel Ezquerra - Feb. 14, 2013, 12:13 a.m.
On Thu, Feb 14, 2013 at 1:06 AM, Angel Ezquerra
<angel.ezquerra@gmail.com> wrote:
> # HG changeset patch
> # User Angel Ezquerra <angel.ezquerra@gmail.com>
> # Date 1360795816 -3600
> # Node ID 26276460d54aecdeb107c82c4e3f2ca7c0c6a8b3
> # Parent  55b9b294b7544a6a144f627f71f4b770907d5a98
> subrepo: do not push "clean" subrepos when the parent repo is pushed
>
> A clean subrepo is defined as one that has not had its dirstate, bookmarks or
> phases modified.
>
> This patch works by adding a "clean" method to subrepos. In the case of
> mercurial subrepos, this method calculates a "stamp" (i.e. a set of file hashes)
> of the repository state at the time of push with a similar "stamp" that was
> stored on a file when the subrepo was cloned or pushed to a given remote target.
> If the stamps match the subrepo has no changes that must be pushed to the target
> repository and thus the push can be skipped.
>
> Note that we calculate the stamp file by calculating hashes for several key
> repository files, such as the dirstate, the bookmarks file and the phaseroots
> file. This means that our "clean" detection is not perfect, in the sense that
> if the working directory has been updated to a different revision we will
> assume that the subrepo is not clean. However, if we update to another revision
> and back to the original revision the clean() method will correctly detec the
> subrepo as being clean.
>
> Also note that a subrepo being "clean" is not the opposite of it being "dirty".
> A subrepo is dirty if it updated to a different revision that the one that is
> pointed to by the subrepo parent or if its working directory is not clean. This
> is a different concept.
>
> diff --git a/mercurial/subrepo.py b/mercurial/subrepo.py
> --- a/mercurial/subrepo.py
> +++ b/mercurial/subrepo.py
> @@ -300,6 +300,16 @@
>
>  class abstractsubrepo(object):
>
> +    def clean(self):
> +        """
> +        returns true if the repository has not changed since it was last
> +        cloned or pulled.
> +        Note that this is very different and definitely not the opposite
> +        of the repository being "dirty", which is related to having changes
> +        on the working directory or the current revision.
> +        """
> +        return False
> +
>      def dirty(self, ignoreupdate=False):
>          """returns true if the dirstate of the subrepo is dirty or does not
>          match current stored state. If ignoreupdate is true, only check
> @@ -426,6 +436,73 @@
>          self._repo.ui.setconfig('ui', '_usedassubrepo', 'True')
>          self._initrepo(r, state[0], create)
>
> +    def clean(self, path):
> +        """
> +        returns true if the repository has not changed since it was last
> +        cloned or pulled.
> +        Note that this is very different and definitely not the opposite
> +        of the repository being "dirty", which is related to having changes
> +        on the working directory or the current revision.
> +        """
> +        return self._calcrepostamp(path) == self._readrepostamp(path)
> +
> +    def _getfilestamp(self, filename):
> +        data = ''
> +        if os.path.exists(filename):
> +            fd = open(filename)
> +            data = fd.read()
> +            fd.close()
> +        return util.sha1(data).hexdigest()
> +
> +    def _calcrepostamp(self, remotepath):
> +        '''calculate a unique "stamp" for the current repository state
> +
> +        This method is used to to detect when there are changes that may
> +        require a push to a given remote path.'''
> +        filelist = ('dirstate', 'bookmarks', 'store/phaseroots')
> +        stamp = ['# %s\n' % remotepath]
> +        lock = self._repo.lock()
> +        try:
> +            for relname in filelist:
> +                absname = os.path.normpath(self._repo.join(relname))
> +                stamp.append('%s = %s\n' % (absname, self._getfilestamp(absname)))
> +        finally:
> +            lock.release()
> +        return stamp
> +
> +    def _getstampfilename(self, remotepath):
> +        '''get a unique filename for the remote repo stamp'''
> +        fname = util.sha1(remotepath).hexdigest()
> +        return self._repo.join(os.path.join('stamps', fname))
> +
> +    def _readrepostamp(self, remotepath):
> +        '''read an existing remote repository stamp'''
> +        stampfile = self._getstampfilename(remotepath)
> +        if not os.path.exists(stampfile):
> +            return ''
> +        fd = open(stampfile, 'r')
> +        stamp = fd.readlines()
> +        fd.close()
> +        return stamp
> +
> +    def _updaterepostamp(self, remotepath):
> +        '''
> +        Calc the current repo stamp saving it into a remote repo stamp file
> +        Each remote repo requires its own stamp file, because a subrepo may
> +        be clean versus a given remote repo, but not versus another.
> +        '''
> +        # save it to the clean file
> +        # We should lock the repo
> +        stampfile = self._getstampfilename(remotepath)
> +        # [FIXME] should lock the repo? it is already locked by _calcrepostamp
> +        stamp = self._calcrepostamp(remotepath)
> +        stampdir = self._repo.join('stamps')
> +        if not os.path.exists(stampdir):
> +            util.makedir(stampdir, True)
> +        fd = open(stampfile, 'w')
> +        fd.writelines(stamp)
> +        fd.close()
> +
>      @annotatesubrepoerror
>      def _initrepo(self, parentrepo, source, create):
>          self._repo._subparent = parentrepo
> @@ -544,12 +621,17 @@
>                                           update=False)
>                  self._repo = cloned.local()
>                  self._initrepo(parentrepo, source, create=True)
> +                self._updaterepostamp(srcurl)
>              else:
>                  self._repo.ui.status(_('pulling subrepo %s from %s\n')
>                                       % (subrelpath(self), srcurl))
> +                cleansub = self.clean(srcurl)
>                  self._repo.pull(other)
>                  bookmarks.updatefromremote(self._repo.ui, self._repo, other,
>                                             srcurl)
> +                if cleansub:
> +                    # keep the repo clean after pull
> +                    self._updaterepostamp(srcurl)
>
>      @annotatesubrepoerror
>      def get(self, state, overwrite=False):
> @@ -557,6 +639,9 @@
>          source, revision, kind = state
>          self._repo.ui.debug("getting subrepo %s\n" % self._path)
>          hg.updaterepo(self._repo, revision, overwrite)
> +        srcurl = _abssource(self._repo)
> +        if self.clean(srcurl):
> +            self._updaterepostamp(srcurl)
>
>      @annotatesubrepoerror
>      def merge(self, state):
> @@ -599,10 +684,20 @@
>                  return False
>
>          dsturl = _abssource(self._repo, True)
> +        if not force:
> +            if self.clean(dsturl):
> +                self._repo.ui.status(
> +                    _('no changes made to subrepo %s since last push to %s\n')
> +                    % (subrelpath(self), dsturl))
> +                return None
>          self._repo.ui.status(_('pushing subrepo %s to %s\n') %
>              (subrelpath(self), dsturl))
>          other = hg.peer(self._repo, {'ssh': ssh}, dsturl)
> -        return self._repo.push(other, force, newbranch=newbranch)
> +        res = self._repo.push(other, force, newbranch=newbranch)
> +
> +        # the repo is now clean
> +        self._updaterepostamp(dsturl)
> +        return res
>
>      @annotatesubrepoerror
>      def outgoing(self, ui, dest, opts):
>
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@selenic.com
> http://selenic.com/mailman/listinfo/mercurial-devel
>

This is step one in the plan that Matt, Martin and I discussed to
improve subrepos during the London sprint.

The "subrepo clean" detection proposed by this patch is not perfect,
but it should work in many cases and those should only be false
negatives (i.e. there should not be any false positives). It fails to
detect a clean subrepo if you update to a different revision than the
one pointed to by the parent repository. However it will work fine if
you update to a different revision and you update back to the original
revision.

One insight that I gained while making this patch is that we need a
repo hash per remote path, because a subrepo may be clean with respect
a remote repo but not with respect to another. Perhaps this was
obvious to others but it was not to me at first.

Now I'm thinking that perhaps I should have put this info on the
commit message :-)

I thought about splitting this patch a bit, but I did not find a very
good way to do so. Perhaps I could first create a patch that adds the
clean flag to the AbstractSubrepo and then one that adds the flag to
the hg subrepo and another one that uses the flag. If you think that
is what I should do I will resend of course.

I also am working on step two of the plan, which was to add a
"--subrepos" flag to hg push.

Cheers,

Angel
Matt Mackall - Feb. 14, 2013, 1:31 a.m.
On Thu, 2013-02-14 at 01:06 +0100, Angel Ezquerra wrote:
> # HG changeset patch
> # User Angel Ezquerra <angel.ezquerra@gmail.com>
> # Date 1360795816 -3600
> # Node ID 26276460d54aecdeb107c82c4e3f2ca7c0c6a8b3
> # Parent  55b9b294b7544a6a144f627f71f4b770907d5a98
> subrepo: do not push "clean" subrepos when the parent repo is pushed
> 
> A clean subrepo is defined as one that has not had its dirstate, bookmarks or
> phases modified.
> 
> This patch works by adding a "clean" method to subrepos. In the case of
> mercurial subrepos, this method calculates a "stamp" (i.e. a set of file hashes)
> of the repository state at the time of push with a similar "stamp" that was
> stored on a file when the subrepo was cloned or pushed to a given remote target.
> If the stamps match the subrepo has no changes that must be pushed to the target
> repository and thus the push can be skipped.
> 
> Note that we calculate the stamp file by calculating hashes for several key
> repository files, such as the dirstate, the bookmarks file and the phaseroots
> file. This means that our "clean" detection is not perfect, in the sense that
> if the working directory has been updated to a different revision we will
> assume that the subrepo is not clean. However, if we update to another revision
> and back to the original revision the clean() method will correctly detec the
> subrepo as being clean.

Why is the dirstate interesting? I would posit that we're only
interested in things that we'd push or pull. Dirstate being modified
should not force us to push.

> Also note that a subrepo being "clean" is not the opposite of it being "dirty".
> A subrepo is dirty if it updated to a different revision that the one that is
> pointed to by the subrepo parent or if its working directory is not clean. This
> is a different concept.

Ok, let's give it a different name. How about storeclean() so it's
explicit it's about the store?

I think this needs to be (at least) a few different patches:

1) introduce the helper functions
2) record clean state on clone/pull
3) check clean state on push

> diff --git a/mercurial/subrepo.py b/mercurial/subrepo.py
> --- a/mercurial/subrepo.py
> +++ b/mercurial/subrepo.py
> @@ -300,6 +300,16 @@
>  
>  class abstractsubrepo(object):
>  
> +    def clean(self):
> +        """
> +        returns true if the repository has not changed since it was last
> +        cloned or pulled.
> +        Note that this is very different and definitely not the opposite
> +        of the repository being "dirty", which is related to having changes
> +        on the working directory or the current revision.
> +        """
> +        return False
> +
>      def dirty(self, ignoreupdate=False):
>          """returns true if the dirstate of the subrepo is dirty or does not
>          match current stored state. If ignoreupdate is true, only check
> @@ -426,6 +436,73 @@
>          self._repo.ui.setconfig('ui', '_usedassubrepo', 'True')
>          self._initrepo(r, state[0], create)
>  
> +    def clean(self, path):
> +        """
> +        returns true if the repository has not changed since it was last
> +        cloned or pulled.
> +        Note that this is very different and definitely not the opposite
> +        of the repository being "dirty", which is related to having changes
> +        on the working directory or the current revision.
> +        """

Duplicate docs.

> +        return self._calcrepostamp(path) == self._readrepostamp(path)

This is suboptimal, especially if any of the files are large. It'd be
better to be able to break after we find the first changed file.

> +    def _getfilestamp(self, filename):
> +        data = ''
> +        if os.path.exists(filename):
> +            fd = open(filename)
> +            data = fd.read()
> +            fd.close()
> +        return util.sha1(data).hexdigest()

Most of this doesn't want to be member functions. We'll need it for git.

> +    def _calcrepostamp(self, remotepath):
> +        '''calculate a unique "stamp" for the current repository state
> +
> +        This method is used to to detect when there are changes that may
> +        require a push to a given remote path.'''
> +        filelist = ('dirstate', 'bookmarks', 'store/phaseroots')
> +        stamp = ['# %s\n' % remotepath]
> +        lock = self._repo.lock()
> +        try:
> +            for relname in filelist:
> +                absname = os.path.normpath(self._repo.join(relname))
> +                stamp.append('%s = %s\n' % (absname, self._getfilestamp(absname)))
> +        finally:
> +            lock.release()
> +        return stamp
> +
> +    def _getstampfilename(self, remotepath):
> +        '''get a unique filename for the remote repo stamp'''
> +        fname = util.sha1(remotepath).hexdigest()
> +        return self._repo.join(os.path.join('stamps', fname))

Probably don't want a 40-character name here. This should go
in .hg/cache/ on hg repos and have a less generic name than stamp.

> +    def _readrepostamp(self, remotepath):
> +        '''read an existing remote repository stamp'''
> +        stampfile = self._getstampfilename(remotepath)
> +        if not os.path.exists(stampfile):
> +            return ''
> +        fd = open(stampfile, 'r')
> +        stamp = fd.readlines()
> +        fd.close()
> +        return stamp
> +
> +    def _updaterepostamp(self, remotepath):
> +        '''
> +        Calc the current repo stamp saving it into a remote repo stamp file
> +        Each remote repo requires its own stamp file, because a subrepo may
> +        be clean versus a given remote repo, but not versus another.
> +        '''
> +        # save it to the clean file
> +        # We should lock the repo
> +        stampfile = self._getstampfilename(remotepath)
> +        # [FIXME] should lock the repo? it is already locked by _calcrepostamp

No, the lock should be in the callers of _calcrepostamp.

> +        stamp = self._calcrepostamp(remotepath)
> +        stampdir = self._repo.join('stamps')
> +        if not os.path.exists(stampdir):
> +            util.makedir(stampdir, True)
> +        fd = open(stampfile, 'w')
> +        fd.writelines(stamp)
> +        fd.close()
> +
>      @annotatesubrepoerror
>      def _initrepo(self, parentrepo, source, create):
>          self._repo._subparent = parentrepo
> @@ -544,12 +621,17 @@
>                                           update=False)
>                  self._repo = cloned.local()
>                  self._initrepo(parentrepo, source, create=True)
> +                self._updaterepostamp(srcurl)
>              else:
>                  self._repo.ui.status(_('pulling subrepo %s from %s\n')
>                                       % (subrelpath(self), srcurl))
> +                cleansub = self.clean(srcurl)
>                  self._repo.pull(other)
>                  bookmarks.updatefromremote(self._repo.ui, self._repo, other,
>                                             srcurl)
> +                if cleansub:
> +                    # keep the repo clean after pull
> +                    self._updaterepostamp(srcurl)
>  
>      @annotatesubrepoerror
>      def get(self, state, overwrite=False):
> @@ -557,6 +639,9 @@
>          source, revision, kind = state
>          self._repo.ui.debug("getting subrepo %s\n" % self._path)
>          hg.updaterepo(self._repo, revision, overwrite)
> +        srcurl = _abssource(self._repo)
> +        if self.clean(srcurl):
> +            self._updaterepostamp(srcurl)
>  
>      @annotatesubrepoerror
>      def merge(self, state):
> @@ -599,10 +684,20 @@
>                  return False
>  
>          dsturl = _abssource(self._repo, True)
> +        if not force:
> +            if self.clean(dsturl):
> +                self._repo.ui.status(
> +                    _('no changes made to subrepo %s since last push to %s\n')
> +                    % (subrelpath(self), dsturl))
> +                return None
>          self._repo.ui.status(_('pushing subrepo %s to %s\n') %
>              (subrelpath(self), dsturl))
>          other = hg.peer(self._repo, {'ssh': ssh}, dsturl)
> -        return self._repo.push(other, force, newbranch=newbranch)
> +        res = self._repo.push(other, force, newbranch=newbranch)
> +
> +        # the repo is now clean
> +        self._updaterepostamp(dsturl)
> +        return res
>  
>      @annotatesubrepoerror
>      def outgoing(self, ui, dest, opts):
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@selenic.com
> http://selenic.com/mailman/listinfo/mercurial-devel
Angel Ezquerra - Feb. 14, 2013, 8:39 a.m.
On Thu, Feb 14, 2013 at 2:31 AM, Matt Mackall <mpm@selenic.com> wrote:
> On Thu, 2013-02-14 at 01:06 +0100, Angel Ezquerra wrote:
>> # HG changeset patch
>> # User Angel Ezquerra <angel.ezquerra@gmail.com>
>> # Date 1360795816 -3600
>> # Node ID 26276460d54aecdeb107c82c4e3f2ca7c0c6a8b3
>> # Parent  55b9b294b7544a6a144f627f71f4b770907d5a98
>> subrepo: do not push "clean" subrepos when the parent repo is pushed
>>
>> A clean subrepo is defined as one that has not had its dirstate, bookmarks or
>> phases modified.
>>
>> This patch works by adding a "clean" method to subrepos. In the case of
>> mercurial subrepos, this method calculates a "stamp" (i.e. a set of file hashes)
>> of the repository state at the time of push with a similar "stamp" that was
>> stored on a file when the subrepo was cloned or pushed to a given remote target.
>> If the stamps match the subrepo has no changes that must be pushed to the target
>> repository and thus the push can be skipped.
>>
>> Note that we calculate the stamp file by calculating hashes for several key
>> repository files, such as the dirstate, the bookmarks file and the phaseroots
>> file. This means that our "clean" detection is not perfect, in the sense that
>> if the working directory has been updated to a different revision we will
>> assume that the subrepo is not clean. However, if we update to another revision
>> and back to the original revision the clean() method will correctly detec the
>> subrepo as being clean.

> Why is the dirstate interesting? I would posit that we're only
> interested in things that we'd push or pull. Dirstate being modified
> should not force us to push.

I think I may have misunderstood you when we discussed this during the
sprint. I thought you mentioned that we should check the direstate,
the bookmarks and the phaseroots.

I don't know the mercurial store format well enough to tell which
files we need to hash. Would hashing "store/00changelog.i" and
"store/00manifest.i" be enough? An alternative would be to hash the
list of repository heads and consider the repo unclean if that
changed?

>> Also note that a subrepo being "clean" is not the opposite of it being "dirty".
>> A subrepo is dirty if it updated to a different revision that the one that is
>> pointed to by the subrepo parent or if its working directory is not clean. This
>> is a different concept.
>
> Ok, let's give it a different name. How about storeclean() so it's
> explicit it's about the store?

OK, as long as bookmarks and phases can be considered part of the store?

> I think this needs to be (at least) a few different patches:
>
> 1) introduce the helper functions
> 2) record clean state on clone/pull
> 3) check clean state on push

Will do.

>> diff --git a/mercurial/subrepo.py b/mercurial/subrepo.py
>> --- a/mercurial/subrepo.py
>> +++ b/mercurial/subrepo.py
>> @@ -300,6 +300,16 @@
>>
>>  class abstractsubrepo(object):
>>
>> +    def clean(self):
>> +        """
>> +        returns true if the repository has not changed since it was last
>> +        cloned or pulled.
>> +        Note that this is very different and definitely not the opposite
>> +        of the repository being "dirty", which is related to having changes
>> +        on the working directory or the current revision.
>> +        """
>> +        return False
>> +
>>      def dirty(self, ignoreupdate=False):
>>          """returns true if the dirstate of the subrepo is dirty or does not
>>          match current stored state. If ignoreupdate is true, only check
>> @@ -426,6 +436,73 @@
>>          self._repo.ui.setconfig('ui', '_usedassubrepo', 'True')
>>          self._initrepo(r, state[0], create)
>>
>> +    def clean(self, path):
>> +        """
>> +        returns true if the repository has not changed since it was last
>> +        cloned or pulled.
>> +        Note that this is very different and definitely not the opposite
>> +        of the repository being "dirty", which is related to having changes
>> +        on the working directory or the current revision.
>> +        """
>
> Duplicate docs.
>
>> +        return self._calcrepostamp(path) == self._readrepostamp(path)
>
> This is suboptimal, especially if any of the files are large. It'd be
> better to be able to break after we find the first changed file.

Perhaps _calcrepostamp could be a generator. clean() (i.e. cleanstore)
could then just check each individual line.

>> +    def _getfilestamp(self, filename):
>> +        data = ''
>> +        if os.path.exists(filename):
>> +            fd = open(filename)
>> +            data = fd.read()
>> +            fd.close()
>> +        return util.sha1(data).hexdigest()
>
> Most of this doesn't want to be member functions. We'll need it for git.

Sure, although I know even less about git's internals than mercurial's
to be able to know what should be hashed inside the .git folder.

Should these be kept in subrepos.py or perhaps moved to
mercurial.util? Also _calcrepostamp could be moved out as well by
making the "file list" a parameter.

>> +    def _calcrepostamp(self, remotepath):
>> +        '''calculate a unique "stamp" for the current repository state
>> +
>> +        This method is used to to detect when there are changes that may
>> +        require a push to a given remote path.'''
>> +        filelist = ('dirstate', 'bookmarks', 'store/phaseroots')
>> +        stamp = ['# %s\n' % remotepath]
>> +        lock = self._repo.lock()
>> +        try:
>> +            for relname in filelist:
>> +                absname = os.path.normpath(self._repo.join(relname))
>> +                stamp.append('%s = %s\n' % (absname, self._getfilestamp(absname)))
>> +        finally:
>> +            lock.release()
>> +        return stamp
>> +
>> +    def _getstampfilename(self, remotepath):
>> +        '''get a unique filename for the remote repo stamp'''
>> +        fname = util.sha1(remotepath).hexdigest()
>> +        return self._repo.join(os.path.join('stamps', fname))
>
> Probably don't want a 40-character name here. This should go
> in .hg/cache/ on hg repos and have a less generic name than stamp.

OK, I'll just keep the first 12 hash characters.

>> +    def _readrepostamp(self, remotepath):
>> +        '''read an existing remote repository stamp'''
>> +        stampfile = self._getstampfilename(remotepath)
>> +        if not os.path.exists(stampfile):
>> +            return ''
>> +        fd = open(stampfile, 'r')
>> +        stamp = fd.readlines()
>> +        fd.close()
>> +        return stamp
>> +
>> +    def _updaterepostamp(self, remotepath):
>> +        '''
>> +        Calc the current repo stamp saving it into a remote repo stamp file
>> +        Each remote repo requires its own stamp file, because a subrepo may
>> +        be clean versus a given remote repo, but not versus another.
>> +        '''
>> +        # save it to the clean file
>> +        # We should lock the repo
>> +        stampfile = self._getstampfilename(remotepath)
>> +        # [FIXME] should lock the repo? it is already locked by _calcrepostamp
>
> No, the lock should be in the callers of _calcrepostamp.

OK. How would that work on the case of git repos? Is it possible to
"lock" a git repo?

Matt, thanks for the detailed review!

Angel
Isaac Jurado - Feb. 14, 2013, 9:54 a.m.
On Thu, Feb 14, 2013 at 1:06 AM, Angel Ezquerra
<angel.ezquerra@gmail.com> wrote:
> # HG changeset patch
> # User Angel Ezquerra <angel.ezquerra@gmail.com>
> # Date 1360795816 -3600
> # Node ID 26276460d54aecdeb107c82c4e3f2ca7c0c6a8b3
> # Parent  55b9b294b7544a6a144f627f71f4b770907d5a98
> subrepo: do not push "clean" subrepos when the parent repo is pushed
>
> A clean subrepo is defined as one that has not had its dirstate, bookmarks or
> phases modified.
>
> This patch works by adding a "clean" method to subrepos. In the case of
> mercurial subrepos, this method calculates a "stamp" (i.e. a set of file hashes)
> of the repository state at the time of push with a similar "stamp" that was
> stored on a file when the subrepo was cloned or pushed to a given remote target.
> If the stamps match the subrepo has no changes that must be pushed to the target
> repository and thus the push can be skipped.
>
> Note that we calculate the stamp file by calculating hashes for several key
> repository files, such as the dirstate, the bookmarks file and the phaseroots
> file. This means that our "clean" detection is not perfect, in the sense that
> if the working directory has been updated to a different revision we will
> assume that the subrepo is not clean. However, if we update to another revision
> and back to the original revision the clean() method will correctly detec the
> subrepo as being clean.
>
> Also note that a subrepo being "clean" is not the opposite of it being "dirty".
> A subrepo is dirty if it updated to a different revision that the one that is
> pointed to by the subrepo parent or if its working directory is not clean. This
> is a different concept.

Have you considered the following situations?

  1. The .hgsub file changes
  2. The [subpaths] section redirects pushes to a different location
from last push.

I couldn't deduce it from the code so I had to ask.

Cheers.
Angel Ezquerra - Feb. 14, 2013, 1:05 p.m.
On Thu, Feb 14, 2013 at 10:54 AM, Isaac Jurado <diptongo@gmail.com> wrote:
> On Thu, Feb 14, 2013 at 1:06 AM, Angel Ezquerra
> <angel.ezquerra@gmail.com> wrote:
>> # HG changeset patch
>> # User Angel Ezquerra <angel.ezquerra@gmail.com>
>> # Date 1360795816 -3600
>> # Node ID 26276460d54aecdeb107c82c4e3f2ca7c0c6a8b3
>> # Parent  55b9b294b7544a6a144f627f71f4b770907d5a98
>> subrepo: do not push "clean" subrepos when the parent repo is pushed
>>
>> A clean subrepo is defined as one that has not had its dirstate, bookmarks or
>> phases modified.
>>
>> This patch works by adding a "clean" method to subrepos. In the case of
>> mercurial subrepos, this method calculates a "stamp" (i.e. a set of file hashes)
>> of the repository state at the time of push with a similar "stamp" that was
>> stored on a file when the subrepo was cloned or pushed to a given remote target.
>> If the stamps match the subrepo has no changes that must be pushed to the target
>> repository and thus the push can be skipped.
>>
>> Note that we calculate the stamp file by calculating hashes for several key
>> repository files, such as the dirstate, the bookmarks file and the phaseroots
>> file. This means that our "clean" detection is not perfect, in the sense that
>> if the working directory has been updated to a different revision we will
>> assume that the subrepo is not clean. However, if we update to another revision
>> and back to the original revision the clean() method will correctly detec the
>> subrepo as being clean.
>>
>> Also note that a subrepo being "clean" is not the opposite of it being "dirty".
>> A subrepo is dirty if it updated to a different revision that the one that is
>> pointed to by the subrepo parent or if its working directory is not clean. This
>> is a different concept.
>
> Have you considered the following situations?
>
>   1. The .hgsub file changes
>   2. The [subpaths] section redirects pushes to a different location
> from last push.
>
> I couldn't deduce it from the code so I had to ask.
>
> Cheers.

The patch certainly covers situation #1 because we keep track of the
clean status for each remote path. Changing the .hgsub file will
simply change the remote path (it could also remove a subrepo, in
which case this does not apply; or add it in which case the repo will
just be cloned).

Regarding situation #2 I have not tried it. It is great that you bring
this up. I believe it should probably work because the patch checks
the current remote path, and I believe that is obtained after the
subpath substitution has been done. That being said I'll have to check
that is the case.

Thanks for your comments,

Angel
Matt Harbison - Feb. 15, 2013, 3:49 a.m.
On Thu, 14 Feb 2013 01:13:44 +0100, Angel Ezquerra wrote:

> On Thu, Feb 14, 2013 at 1:06 AM, Angel Ezquerra
> <angel.ezquerra@gmail.com> wrote:

...

> This is step one in the plan that Matt, Martin and I discussed to
> improve subrepos during the London sprint.

Is there a brief overview of the plan somewhere?
 
> I also am working on step two of the plan, which was to add a
> "--subrepos" flag to hg push.

Is this your idea about passing (some?) parameters to subrepos [1]?  If 
so, does 'outgoing' need the same method of filtering the option dict [2] 
for consistency?  (I was a bit surprised that outgoing -S passes along the 
--rev option, which causes it to abort in the subrepo with a (parent) 
hash, or lie or abort if given a rev.)  There's also a couple bugs written 
about --addremove not being passed along, so what to pass or not seems 
like a wider (general?) problem.

FWIW, I've got code that seems to be able to honor the -r flag for 
'outgoing' and 'push' (issue2314) [3], including the case where the parent 
commits an older version of the subrepo after a newer version.  I wasn't 
happy with how I had to hack the option dict after translating --rev, but 
it sounds like you are working in this area anyway.  So I'll try to submit 
this as a (rough) RFC this weekend, and assuming the concept is OK, we 
will probably need to coordinate how options are handed off to subrepos 
for 'push' and 'outgoing'.

--Matt


[1] http://www.selenic.com/pipermail/mercurial-devel/2011-
September/034435.html

[2] http://www.selenic.com/pipermail/mercurial-devel/2011-
September/034453.html

[3] http://bz.selenic.com/show_bug.cgi?id=2314
Angel Ezquerra - Feb. 15, 2013, 11:18 a.m.
On Fri, Feb 15, 2013 at 4:49 AM, Matt Harbison <matt_harbison@yahoo.com> wrote:
> On Thu, 14 Feb 2013 01:13:44 +0100, Angel Ezquerra wrote:
>
>> On Thu, Feb 14, 2013 at 1:06 AM, Angel Ezquerra
>> <angel.ezquerra@gmail.com> wrote:
>
> ...
>
>> This is step one in the plan that Matt, Martin and I discussed to
>> improve subrepos during the London sprint.
>
> Is there a brief overview of the plan somewhere?

I wrote a summary on the titanpad that we used during the sprint:

http://titanpad.com/mercurial26

I've taken that and updated it a little with what has been discussed
in this thread and my own thinking since them:

* Discuss ways to improve some of the subrepo pain points - angel, mg
  - add way to pull subrepos with hg pull
  - just push changed subrepos
  - subrepos are eternal
  - We had a discussion with mpm on this and we believe we have a good plan:
      - Add -S/--subrepos flag to hg pull
      - Add cleanstore() method to subrepos which can tell pull/push
if a subrepo store has changes. If not we can ignore it during push
          - The cleanstore method would check a timestamp or sha1 of
the bookmarks, phaseroots and the changelog, but not the
dirstatedirstate
dirstate
        - Updated on clone and push; but also on pull if the store is
already clean
      - Only push not cleanstore() subrepos
      - subrepo.deletable(): if self.clean() and dirstate working dir is clean()
      - fix Merge bug which makes subrepos stay on the working
directory on update even if there are files on the parent that should
go on the folder occupied by the subrepo
          - using deletable() we could remove subrepos from the working dir...
              - however this could cause data loss on update on a
"central" repository containing relative subrepos. On non central
subrepos it would require recloning the subrepo when updating back to
a revision referring to that subrepo.
       - Introduce subrepo caching (this would fix the problem above
and improve pull time considerably when subrepos are deleted)
            - Cloning would need to be smart enough to look on the
subrepo cache. Otherwise it would not be possible to pull from central
repositories that were are at revision -1 (since all subrepos would be
on the cache, and none on the workind directory).

   - Document subpaths patterns that can make relative subrepos work
with bitbucket and google code

Perhaps it would be worth adding this to the wiki?

>> I also am working on step two of the plan, which was to add a
>> "--subrepos" flag to hg push.
>
> Is this your idea about passing (some?) parameters to subrepos [1]?  If
> so, does 'outgoing' need the same method of filtering the option dict [2]
> for consistency?  (I was a bit surprised that outgoing -S passes along the
> --rev option, which causes it to abort in the subrepo with a (parent)
> hash, or lie or abort if given a rev.)  There's also a couple bugs written
> about --addremove not being passed along, so what to pass or not seems
> like a wider (general?) problem.

The way I've implemented it is very similar to how the current
subrepo.get() works. Basically I want hg pull --subrepos to behave as
if you first did "hg pull" and then you did "hg update -r" for every
new revision that "hg pull" brought into the repository. The idea is
to make sure that you are able to update to any new revision without
needing to have any network access (i.e. that the repository is self
contained after doing hg pull --subrepos, as long as it already was
self contained before).

Martin Gesiler helped me with this patch during the sprint and he has
been helping me since then. I'll send a patch soon.

That being said, out of the options that pull takes there are a lot
that do not make sense when pulling from subrepos, particularly --rev,
--bookmark, --branch and specially --rebase. I think that if you need
to do something special while pulling a subrepo perhaps it would be
best to get into the subrepo and do a regular pull, or perhaps use the
onsub extension (which I wish we integrated into mercurial). One thing
that would help there would be to have a way to refer to the parent
paths when pulling from within a subrepo. This would be particularly
handy when using relative subrepo definitions, as you should.

> FWIW, I've got code that seems to be able to honor the -r flag for
> 'outgoing' and 'push' (issue2314) [3], including the case where the parent
> commits an older version of the subrepo after a newer version.  I wasn't
> happy with how I had to hack the option dict after translating --rev, but
> it sounds like you are working in this area anyway.  So I'll try to submit
> this as a (rough) RFC this weekend, and assuming the concept is OK, we
> will probably need to coordinate how options are handed off to subrepos
> for 'push' and 'outgoing'.

I'll try to have a look at your patches when I get some time.

Angel


> [2] http://www.selenic.com/pipermail/mercurial-devel/2011-
> September/034453.html
>
> [3] http://bz.selenic.com/show_bug.cgi?id=2314
>
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@selenic.com
> http://selenic.com/mailman/listinfo/mercurial-devel
Matt Mackall - Feb. 15, 2013, 8:40 p.m.
On Thu, 2013-02-14 at 09:39 +0100, Angel Ezquerra wrote:
> On Thu, Feb 14, 2013 at 2:31 AM, Matt Mackall <mpm@selenic.com> wrote:
> > On Thu, 2013-02-14 at 01:06 +0100, Angel Ezquerra wrote:
> >> # HG changeset patch
> >> # User Angel Ezquerra <angel.ezquerra@gmail.com>
> >> # Date 1360795816 -3600
> >> # Node ID 26276460d54aecdeb107c82c4e3f2ca7c0c6a8b3
> >> # Parent  55b9b294b7544a6a144f627f71f4b770907d5a98
> >> subrepo: do not push "clean" subrepos when the parent repo is pushed
> >>
> >> A clean subrepo is defined as one that has not had its dirstate, bookmarks or
> >> phases modified.
> >>
> >> This patch works by adding a "clean" method to subrepos. In the case of
> >> mercurial subrepos, this method calculates a "stamp" (i.e. a set of file hashes)
> >> of the repository state at the time of push with a similar "stamp" that was
> >> stored on a file when the subrepo was cloned or pushed to a given remote target.
> >> If the stamps match the subrepo has no changes that must be pushed to the target
> >> repository and thus the push can be skipped.
> >>
> >> Note that we calculate the stamp file by calculating hashes for several key
> >> repository files, such as the dirstate, the bookmarks file and the phaseroots
> >> file. This means that our "clean" detection is not perfect, in the sense that
> >> if the working directory has been updated to a different revision we will
> >> assume that the subrepo is not clean. However, if we update to another revision
> >> and back to the original revision the clean() method will correctly detec the
> >> subrepo as being clean.
> 
> > Why is the dirstate interesting? I would posit that we're only
> > interested in things that we'd push or pull. Dirstate being modified
> > should not force us to push.
> 
> I think I may have misunderstood you when we discussed this during the
> sprint. I thought you mentioned that we should check the direstate,
> the bookmarks and the phaseroots.

I may have misspoke. Pretty sure I talked primarily about the changelog.

> I don't know the mercurial store format well enough to tell which
> files we need to hash. Would hashing "store/00changelog.i" and
> "store/00manifest.i" be enough?

Hashing 00changelog.i is sufficient.

> >> Also note that a subrepo being "clean" is not the opposite of it being "dirty".
> >> A subrepo is dirty if it updated to a different revision that the one that is
> >> pointed to by the subrepo parent or if its working directory is not clean. This
> >> is a different concept.
> >
> > Ok, let's give it a different name. How about storeclean() so it's
> > explicit it's about the store?
> 
> OK, as long as bookmarks and phases can be considered part of the store?

Phases definitely are. Bookmarks are a bit on the edge.

> > I think this needs to be (at least) a few different patches:
> >
> > 1) introduce the helper functions
> > 2) record clean state on clone/pull
> > 3) check clean state on push
> 
> Will do.
> 
> >> diff --git a/mercurial/subrepo.py b/mercurial/subrepo.py
> >> --- a/mercurial/subrepo.py
> >> +++ b/mercurial/subrepo.py
> >> @@ -300,6 +300,16 @@
> >>
> >>  class abstractsubrepo(object):
> >>
> >> +    def clean(self):
> >> +        """
> >> +        returns true if the repository has not changed since it was last
> >> +        cloned or pulled.
> >> +        Note that this is very different and definitely not the opposite
> >> +        of the repository being "dirty", which is related to having changes
> >> +        on the working directory or the current revision.
> >> +        """
> >> +        return False
> >> +
> >>      def dirty(self, ignoreupdate=False):
> >>          """returns true if the dirstate of the subrepo is dirty or does not
> >>          match current stored state. If ignoreupdate is true, only check
> >> @@ -426,6 +436,73 @@
> >>          self._repo.ui.setconfig('ui', '_usedassubrepo', 'True')
> >>          self._initrepo(r, state[0], create)
> >>
> >> +    def clean(self, path):
> >> +        """
> >> +        returns true if the repository has not changed since it was last
> >> +        cloned or pulled.
> >> +        Note that this is very different and definitely not the opposite
> >> +        of the repository being "dirty", which is related to having changes
> >> +        on the working directory or the current revision.
> >> +        """
> >
> > Duplicate docs.
> >
> >> +        return self._calcrepostamp(path) == self._readrepostamp(path)
> >
> > This is suboptimal, especially if any of the files are large. It'd be
> > better to be able to break after we find the first changed file.
> 
> Perhaps _calcrepostamp could be a generator. clean() (i.e. cleanstore)
> could then just check each individual line.

Sure.

> >> +    def _getfilestamp(self, filename):
> >> +        data = ''
> >> +        if os.path.exists(filename):
> >> +            fd = open(filename)
> >> +            data = fd.read()
> >> +            fd.close()
> >> +        return util.sha1(data).hexdigest()
> >
> > Most of this doesn't want to be member functions. We'll need it for git.
> 
> Sure, although I know even less about git's internals than mercurial's
> to be able to know what should be hashed inside the .git folder.

The point is that we know today we're going to eventually want to handle
git and that's probably going to involve a similar approach. Ergo, these
should not be methods on the hg subrepo class.

> Should these be kept in subrepos.py or perhaps moved to
> mercurial.util? Also _calcrepostamp could be moved out as well by
> making the "file list" a parameter.

Save that for later, I think.

> >> +    def _calcrepostamp(self, remotepath):
> >> +        '''calculate a unique "stamp" for the current repository state
> >> +
> >> +        This method is used to to detect when there are changes that may
> >> +        require a push to a given remote path.'''
> >> +        filelist = ('dirstate', 'bookmarks', 'store/phaseroots')
> >> +        stamp = ['# %s\n' % remotepath]
> >> +        lock = self._repo.lock()
> >> +        try:
> >> +            for relname in filelist:
> >> +                absname = os.path.normpath(self._repo.join(relname))
> >> +                stamp.append('%s = %s\n' % (absname, self._getfilestamp(absname)))
> >> +        finally:
> >> +            lock.release()
> >> +        return stamp
> >> +
> >> +    def _getstampfilename(self, remotepath):
> >> +        '''get a unique filename for the remote repo stamp'''
> >> +        fname = util.sha1(remotepath).hexdigest()
> >> +        return self._repo.join(os.path.join('stamps', fname))
> >
> > Probably don't want a 40-character name here. This should go
> > in .hg/cache/ on hg repos and have a less generic name than stamp.
> 
> OK, I'll just keep the first 12 hash characters.
> 
> >> +    def _readrepostamp(self, remotepath):
> >> +        '''read an existing remote repository stamp'''
> >> +        stampfile = self._getstampfilename(remotepath)
> >> +        if not os.path.exists(stampfile):
> >> +            return ''
> >> +        fd = open(stampfile, 'r')
> >> +        stamp = fd.readlines()
> >> +        fd.close()
> >> +        return stamp
> >> +
> >> +    def _updaterepostamp(self, remotepath):
> >> +        '''
> >> +        Calc the current repo stamp saving it into a remote repo stamp file
> >> +        Each remote repo requires its own stamp file, because a subrepo may
> >> +        be clean versus a given remote repo, but not versus another.
> >> +        '''
> >> +        # save it to the clean file
> >> +        # We should lock the repo
> >> +        stampfile = self._getstampfilename(remotepath)
> >> +        # [FIXME] should lock the repo? it is already locked by _calcrepostamp
> >
> > No, the lock should be in the callers of _calcrepostamp.
> 
> OK. How would that work on the case of git repos? Is it possible to
> "lock" a git repo?

Defer that question.
Matt Harbison - Feb. 16, 2013, 5:03 a.m.
Angel Ezquerra wrote:
> On Fri, Feb 15, 2013 at 4:49 AM, Matt Harbison<matt_harbison@yahoo.com>  wrote:
>> On Thu, 14 Feb 2013 01:13:44 +0100, Angel Ezquerra wrote:
>>
>>> On Thu, Feb 14, 2013 at 1:06 AM, Angel Ezquerra
>>> <angel.ezquerra@gmail.com>  wrote:
>> ...
>>
>>> This is step one in the plan that Matt, Martin and I discussed to
>>> improve subrepos during the London sprint.
>> Is there a brief overview of the plan somewhere?
>
> I wrote a summary on the titanpad that we used during the sprint:
>
> http://titanpad.com/mercurial26
>
> I've taken that and updated it a little with what has been discussed
> in this thread and my own thinking since them:

Thanks for the writeup.  I think I understand most of this, up to the 
caching and deletable parts.  But that seems a bit further into the 
future, and maybe it will become more clear as this evolves.

I'm not sure if deletable helps this, and I haven't looked into it yet, 
but any plans on being able to update between revs where a file exists 
but in the other rev, a subrepo exists instead in the same directory? 
(I think this is what issue3131 is about.)

> * Discuss ways to improve some of the subrepo pain points - angel, mg
>    - add way to pull subrepos with hg pull
>    - just push changed subrepos
>    - subrepos are eternal
>    - We had a discussion with mpm on this and we believe we have a good plan:
>        - Add -S/--subrepos flag to hg pull
>        - Add cleanstore() method to subrepos which can tell pull/push
> if a subrepo store has changes. If not we can ignore it during push
>            - The cleanstore method would check a timestamp or sha1 of
> the bookmarks, phaseroots and the changelog, but not the
> dirstatedirstate
> dirstate
>          - Updated on clone and push; but also on pull if the store is
> already clean
>        - Only push not cleanstore() subrepos
>        - subrepo.deletable(): if self.clean() and dirstate working dir is clean()
>        - fix Merge bug which makes subrepos stay on the working
> directory on update even if there are files on the parent that should
> go on the folder occupied by the subrepo
>            - using deletable() we could remove subrepos from the working dir...
>                - however this could cause data loss on update on a
> "central" repository containing relative subrepos. On non central
> subrepos it would require recloning the subrepo when updating back to
> a revision referring to that subrepo.
>         - Introduce subrepo caching (this would fix the problem above
> and improve pull time considerably when subrepos are deleted)
>              - Cloning would need to be smart enough to look on the
> subrepo cache. Otherwise it would not be possible to pull from central
> repositories that were are at revision -1 (since all subrepos would be
> on the cache, and none on the workind directory).
>
>     - Document subpaths patterns that can make relative subrepos work
> with bitbucket and google code
>
> Perhaps it would be worth adding this to the wiki?
>
>>> I also am working on step two of the plan, which was to add a
>>> "--subrepos" flag to hg push.
>> Is this your idea about passing (some?) parameters to subrepos [1]?  If
>> so, does 'outgoing' need the same method of filtering the option dict [2]
>> for consistency?  (I was a bit surprised that outgoing -S passes along the
>> --rev option, which causes it to abort in the subrepo with a (parent)
>> hash, or lie or abort if given a rev.)  There's also a couple bugs written
>> about --addremove not being passed along, so what to pass or not seems
>> like a wider (general?) problem.
>
> The way I've implemented it is very similar to how the current
> subrepo.get() works. Basically I want hg pull --subrepos to behave as
> if you first did "hg pull" and then you did "hg update -r" for every
> new revision that "hg pull" brought into the repository. The idea is
> to make sure that you are able to update to any new revision without
> needing to have any network access (i.e. that the repository is self
> contained after doing hg pull --subrepos, as long as it already was
> self contained before).

I really like this capability.

> Martin Gesiler helped me with this patch during the sprint and he has
> been helping me since then. I'll send a patch soon.
>
> That being said, out of the options that pull takes there are a lot
> that do not make sense when pulling from subrepos, particularly --rev,
> --bookmark, --branch and specially --rebase.

Aren't --bookmark and --branch simply opts that predate the bookmark() 
and branch() revsets, without any special semantics?  (I'm wondering 
specifically about bookmarks because I don't use them much, and IDK if 
--bookmark + --update does anything special that --rev + --update 
doesn't.)

One of the things that seemed like it would be useful when trying to fix 
push -r is to have the methods in commands.py translate these options to 
a list of revs there, and pass that along.  That way, every repo (even 
without subrepos) gets a list of revs instead of these various opts. 
The subrepo layer would have to translate to child revs before passing 
them on, but that doesn't seem terribly difficult.

I don't use --rebase, but I can see how that would be bad.

> I think that if you need
> to do something special while pulling a subrepo perhaps it would be
> best to get into the subrepo and do a regular pull, or perhaps use the
> onsub extension (which I wish we integrated into mercurial). One thing
> that would help there would be to have a way to refer to the parent
> paths when pulling from within a subrepo. This would be particularly
> handy when using relative subrepo definitions, as you should.

I'm not sure what you mean about refering to parent paths.  Do you have 
a use case in mind?

>> FWIW, I've got code that seems to be able to honor the -r flag for
>> 'outgoing' and 'push' (issue2314) [3], including the case where the parent
>> commits an older version of the subrepo after a newer version.  I wasn't
>> happy with how I had to hack the option dict after translating --rev, but
>> it sounds like you are working in this area anyway.  So I'll try to submit
>> this as a (rough) RFC this weekend, and assuming the concept is OK, we
>> will probably need to coordinate how options are handed off to subrepos
>> for 'push' and 'outgoing'.
>
> I'll try to have a look at your patches when I get some time.
>
> Angel
>
>
>> [2] http://www.selenic.com/pipermail/mercurial-devel/2011-
>> September/034453.html
>>
>> [3] http://bz.selenic.com/show_bug.cgi?id=2314
>>
>> _______________________________________________
>> Mercurial-devel mailing list
>> Mercurial-devel@selenic.com
>> http://selenic.com/mailman/listinfo/mercurial-devel
>
Angel Ezquerra - Feb. 16, 2013, 10:34 a.m.
On Sat, Feb 16, 2013 at 6:03 AM, Matt Harbison <matt_harbison@yahoo.com> wrote:
> Angel Ezquerra wrote:
>>
>> On Fri, Feb 15, 2013 at 4:49 AM, Matt Harbison<matt_harbison@yahoo.com>
>> wrote:
>>>
>>> On Thu, 14 Feb 2013 01:13:44 +0100, Angel Ezquerra wrote:
>>>
>>>> On Thu, Feb 14, 2013 at 1:06 AM, Angel Ezquerra
>>>> <angel.ezquerra@gmail.com>  wrote:
>>>
>>> ...
>>>
>>>> This is step one in the plan that Matt, Martin and I discussed to
>>>> improve subrepos during the London sprint.
>>>
>>> Is there a brief overview of the plan somewhere?
>>
>>
>> I wrote a summary on the titanpad that we used during the sprint:
>>
>> http://titanpad.com/mercurial26
>>
>> I've taken that and updated it a little with what has been discussed
>> in this thread and my own thinking since them:
>
>
> Thanks for the writeup.  I think I understand most of this, up to the
> caching and deletable parts.  But that seems a bit further into the future,
> and maybe it will become more clear as this evolves.
>
> I'm not sure if deletable helps this, and I haven't looked into it yet, but
> any plans on being able to update between revs where a file exists but in
> the other rev, a subrepo exists instead in the same directory? (I think this
> is what issue3131 is about.)

Yes, that is one of the issues that this is trying to address.

Let me try to explain the "caching and deletable" part a little better:

The idea is that updating to a revision that does not refer to a subrepo should
remove the subrepo form the working directory. That would resolve issue3131
among other things.

This requires fixing several issues:

1. Mercurial should not delete a subrepo that has untracked changes.
2. Mercurial should not delete a subrepo that has "unsynchronized" changes
3. Mercurial should not really delete a subrepo, but "cache it". Otherwise
running hg update 000 on your central repo would delete all your subrepos
as well!
4. Mercurial should be able to pull from a cached subrepo when the actual
subrepo is not found on the remote working directory.

This is what the "deletable" and "caching" will try to fix.

>> * Discuss ways to improve some of the subrepo pain points - angel, mg
>>    - add way to pull subrepos with hg pull
>>    - just push changed subrepos
>>    - subrepos are eternal
>>    - We had a discussion with mpm on this and we believe we have a good
>> plan:
>>        - Add -S/--subrepos flag to hg pull
>>        - Add cleanstore() method to subrepos which can tell pull/push
>> if a subrepo store has changes. If not we can ignore it during push
>>            - The cleanstore method would check a timestamp or sha1 of
>> the bookmarks, phaseroots and the changelog, but not the
>> dirstatedirstate
>> dirstate
>>          - Updated on clone and push; but also on pull if the store is
>> already clean
>>        - Only push not cleanstore() subrepos
>>        - subrepo.deletable(): if self.clean() and dirstate working dir is
>> clean()
>>        - fix Merge bug which makes subrepos stay on the working
>> directory on update even if there are files on the parent that should
>> go on the folder occupied by the subrepo
>>            - using deletable() we could remove subrepos from the working
>> dir...
>>                - however this could cause data loss on update on a
>> "central" repository containing relative subrepos. On non central
>> subrepos it would require recloning the subrepo when updating back to
>> a revision referring to that subrepo.
>>         - Introduce subrepo caching (this would fix the problem above
>> and improve pull time considerably when subrepos are deleted)
>>              - Cloning would need to be smart enough to look on the
>> subrepo cache. Otherwise it would not be possible to pull from central
>> repositories that were are at revision -1 (since all subrepos would be
>> on the cache, and none on the workind directory).
>>
>>     - Document subpaths patterns that can make relative subrepos work
>> with bitbucket and google code
>>
>> Perhaps it would be worth adding this to the wiki?
>>
>>>> I also am working on step two of the plan, which was to add a
>>>> "--subrepos" flag to hg push.
>>>
>>> Is this your idea about passing (some?) parameters to subrepos [1]?  If
>>> so, does 'outgoing' need the same method of filtering the option dict [2]
>>> for consistency?  (I was a bit surprised that outgoing -S passes along
>>> the
>>> --rev option, which causes it to abort in the subrepo with a (parent)
>>> hash, or lie or abort if given a rev.)  There's also a couple bugs
>>> written
>>> about --addremove not being passed along, so what to pass or not seems
>>> like a wider (general?) problem.
>>
>>
>> The way I've implemented it is very similar to how the current
>> subrepo.get() works. Basically I want hg pull --subrepos to behave as
>> if you first did "hg pull" and then you did "hg update -r" for every
>> new revision that "hg pull" brought into the repository. The idea is
>> to make sure that you are able to update to any new revision without
>> needing to have any network access (i.e. that the repository is self
>> contained after doing hg pull --subrepos, as long as it already was
>> self contained before).
>
>
> I really like this capability.

Glad you like it. I think it will be very handy for heavy subrepo users.


>> Martin Gesiler helped me with this patch during the sprint and he has
>> been helping me since then. I'll send a patch soon.
>>
>> That being said, out of the options that pull takes there are a lot
>> that do not make sense when pulling from subrepos, particularly --rev,
>> --bookmark, --branch and specially --rebase.
>
>
> Aren't --bookmark and --branch simply opts that predate the bookmark() and
> branch() revsets, without any special semantics?  (I'm wondering
> specifically about bookmarks because I don't use them much, and IDK if
> --bookmark + --update does anything special that --rev + --update doesn't.)
>
> One of the things that seemed like it would be useful when trying to fix
> push -r is to have the methods in commands.py translate these options to a
> list of revs there, and pass that along.  That way, every repo (even without
> subrepos) gets a list of revs instead of these various opts. The subrepo
> layer would have to translate to child revs before passing them on, but that
> doesn't seem terribly difficult.

I think that the subrepo update should be linked to the revisions that are
pointed to on the parent repository revision). If you want to update your
subrepos independently of the parent repo the best solution IMHO is to
use the onsub extension. Then you can pass a revset as you suggest, which
will be interpreted by every individual command that is executed on each
subrepo.

> I don't use --rebase, but I can see how that would be bad.
>
>
>> I think that if you need
>> to do something special while pulling a subrepo perhaps it would be
>> best to get into the subrepo and do a regular pull, or perhaps use the
>> onsub extension (which I wish we integrated into mercurial). One thing
>> that would help there would be to have a way to refer to the parent
>> paths when pulling from within a subrepo. This would be particularly
>> handy when using relative subrepo definitions, as you should.
>
> I'm not sure what you mean about refering to parent paths.  Do you have a
> use case in mind?

This is a bit unrelated to the rest of the thread. I guess you need to know
how the onsub extension works to understand what I meant. The onsub
extension simply executes a command on every subrepo.

For example:

    hg onsub "hg pull"

This will run hg pull on every subrepo, using the default path of every subrepo.

When a subrepo is first created, its hgrc file is created automatically, and
a default path (and possibly a default-push path) is added to it. This default
path is the one the subrepo was cloned from, as specified on the parent
repository .hgsub file. The example command above would use that default
path when running hg pull on a given subrepo.

The problem is that if the subrepo is defined with a relative sync URL (as
it should be), and then you change the default path on the parent repo,
the pull source when you pull a subrepo from its parent repo, and the pull
source when you pull from within the subrepo itself (as the onsub extension
does) are no longer the same.

That is why I would like the onsub extension to get a way to "access" the
parent repository sync paths. The extension already provides a
$HG_SUBURL environment variable, but it does not provide a way to get
the parent repo default (or other) path.

Cheers,

Angel

Patch

# HG changeset patch
# User Angel Ezquerra <angel.ezquerra@gmail.com>
# Date 1360795816 -3600
# Node ID 26276460d54aecdeb107c82c4e3f2ca7c0c6a8b3
# Parent  55b9b294b7544a6a144f627f71f4b770907d5a98
subrepo: do not push "clean" subrepos when the parent repo is pushed

A clean subrepo is defined as one that has not had its dirstate, bookmarks or
phases modified.

This patch works by adding a "clean" method to subrepos. In the case of
mercurial subrepos, this method calculates a "stamp" (i.e. a set of file hashes)
of the repository state at the time of push with a similar "stamp" that was
stored on a file when the subrepo was cloned or pushed to a given remote target.
If the stamps match the subrepo has no changes that must be pushed to the target
repository and thus the push can be skipped.

Note that we calculate the stamp file by calculating hashes for several key
repository files, such as the dirstate, the bookmarks file and the phaseroots
file. This means that our "clean" detection is not perfect, in the sense that
if the working directory has been updated to a different revision we will
assume that the subrepo is not clean. However, if we update to another revision
and back to the original revision the clean() method will correctly detec the
subrepo as being clean.

Also note that a subrepo being "clean" is not the opposite of it being "dirty".
A subrepo is dirty if it updated to a different revision that the one that is
pointed to by the subrepo parent or if its working directory is not clean. This
is a different concept.

diff --git a/mercurial/subrepo.py b/mercurial/subrepo.py
--- a/mercurial/subrepo.py
+++ b/mercurial/subrepo.py
@@ -300,6 +300,16 @@ 
 
 class abstractsubrepo(object):
 
+    def clean(self):
+        """
+        returns true if the repository has not changed since it was last
+        cloned or pulled.
+        Note that this is very different and definitely not the opposite
+        of the repository being "dirty", which is related to having changes
+        on the working directory or the current revision.
+        """
+        return False
+
     def dirty(self, ignoreupdate=False):
         """returns true if the dirstate of the subrepo is dirty or does not
         match current stored state. If ignoreupdate is true, only check
@@ -426,6 +436,73 @@ 
         self._repo.ui.setconfig('ui', '_usedassubrepo', 'True')
         self._initrepo(r, state[0], create)
 
+    def clean(self, path):
+        """
+        returns true if the repository has not changed since it was last
+        cloned or pulled.
+        Note that this is very different and definitely not the opposite
+        of the repository being "dirty", which is related to having changes
+        on the working directory or the current revision.
+        """
+        return self._calcrepostamp(path) == self._readrepostamp(path)
+
+    def _getfilestamp(self, filename):
+        data = ''
+        if os.path.exists(filename):
+            fd = open(filename)
+            data = fd.read()
+            fd.close()
+        return util.sha1(data).hexdigest()
+
+    def _calcrepostamp(self, remotepath):
+        '''calculate a unique "stamp" for the current repository state
+
+        This method is used to to detect when there are changes that may
+        require a push to a given remote path.'''
+        filelist = ('dirstate', 'bookmarks', 'store/phaseroots')
+        stamp = ['# %s\n' % remotepath]
+        lock = self._repo.lock()
+        try:
+            for relname in filelist:
+                absname = os.path.normpath(self._repo.join(relname))
+                stamp.append('%s = %s\n' % (absname, self._getfilestamp(absname)))
+        finally:
+            lock.release()
+        return stamp
+
+    def _getstampfilename(self, remotepath):
+        '''get a unique filename for the remote repo stamp'''
+        fname = util.sha1(remotepath).hexdigest()
+        return self._repo.join(os.path.join('stamps', fname))
+
+    def _readrepostamp(self, remotepath):
+        '''read an existing remote repository stamp'''
+        stampfile = self._getstampfilename(remotepath)
+        if not os.path.exists(stampfile):
+            return ''
+        fd = open(stampfile, 'r')
+        stamp = fd.readlines()
+        fd.close()
+        return stamp
+
+    def _updaterepostamp(self, remotepath):
+        '''
+        Calc the current repo stamp saving it into a remote repo stamp file
+        Each remote repo requires its own stamp file, because a subrepo may
+        be clean versus a given remote repo, but not versus another.
+        '''
+        # save it to the clean file
+        # We should lock the repo
+        stampfile = self._getstampfilename(remotepath)
+        # [FIXME] should lock the repo? it is already locked by _calcrepostamp
+        stamp = self._calcrepostamp(remotepath)
+        stampdir = self._repo.join('stamps')
+        if not os.path.exists(stampdir):
+            util.makedir(stampdir, True)
+        fd = open(stampfile, 'w')
+        fd.writelines(stamp)
+        fd.close()
+
     @annotatesubrepoerror
     def _initrepo(self, parentrepo, source, create):
         self._repo._subparent = parentrepo
@@ -544,12 +621,17 @@ 
                                          update=False)
                 self._repo = cloned.local()
                 self._initrepo(parentrepo, source, create=True)
+                self._updaterepostamp(srcurl)
             else:
                 self._repo.ui.status(_('pulling subrepo %s from %s\n')
                                      % (subrelpath(self), srcurl))
+                cleansub = self.clean(srcurl)
                 self._repo.pull(other)
                 bookmarks.updatefromremote(self._repo.ui, self._repo, other,
                                            srcurl)
+                if cleansub:
+                    # keep the repo clean after pull
+                    self._updaterepostamp(srcurl)
 
     @annotatesubrepoerror
     def get(self, state, overwrite=False):
@@ -557,6 +639,9 @@ 
         source, revision, kind = state
         self._repo.ui.debug("getting subrepo %s\n" % self._path)
         hg.updaterepo(self._repo, revision, overwrite)
+        srcurl = _abssource(self._repo)
+        if self.clean(srcurl):
+            self._updaterepostamp(srcurl)
 
     @annotatesubrepoerror
     def merge(self, state):
@@ -599,10 +684,20 @@ 
                 return False
 
         dsturl = _abssource(self._repo, True)
+        if not force:
+            if self.clean(dsturl):
+                self._repo.ui.status(
+                    _('no changes made to subrepo %s since last push to %s\n')
+                    % (subrelpath(self), dsturl))
+                return None
         self._repo.ui.status(_('pushing subrepo %s to %s\n') %
             (subrelpath(self), dsturl))
         other = hg.peer(self._repo, {'ssh': ssh}, dsturl)
-        return self._repo.push(other, force, newbranch=newbranch)
+        res = self._repo.push(other, force, newbranch=newbranch)
+
+        # the repo is now clean
+        self._updaterepostamp(dsturl)
+        return res
 
     @annotatesubrepoerror
     def outgoing(self, ui, dest, opts):