Patchwork D6183: copies: add config option for writing copy metadata to file and/or changset

login
register
mail settings
Submitter phabricator
Date April 2, 2019, 7:31 p.m.
Message ID <differential-rev-PHID-DREV-z5ukrljwvjnta3cpmcke-req@phab.mercurial-scm.org>
Download mbox | patch
Permalink /patch/39448/
State Superseded
Headers show

Comments

phabricator - April 2, 2019, 7:31 p.m.
martinvonz created this revision.
Herald added a subscriber: mercurial-devel.
Herald added a reviewer: hg-reviewers.

REVISION SUMMARY
  This introduces a config option that lets you choose to write copy
  metadata to the changeset extras instead of to filelog. There's also
  an option to write it to both places. I imagine that may possibly be
  useful when transitioning an existing repo.
  
  The copy metadata is stored as two fields in extras: one for copies
  since p1 and one for copies since p2.
  
  I may need to add more information later in order to make copy tracing
  faster. Specifically, I'm thinking out recording which files were
  added or removed so that copies._chaincopies() doesn't have to look at
  the manifest for that. But that would just be an optimization and that
  can be added once we know if it's necessary.
  
  I have also considered saving space by using replacing the destination
  file path by an index into the "files" list, but that can also be
  changed later (but before the feature is ready to release).

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D6183

AFFECTED FILES
  mercurial/changelog.py
  mercurial/configitems.py
  mercurial/localrepo.py
  tests/test-annotate.t
  tests/test-copies-in-changeset.t
  tests/test-fastannotate-hg.t

CHANGE DETAILS




To: martinvonz, #hg-reviewers
Cc: mercurial-devel
phabricator - April 5, 2019, 8:36 a.m.
marmoute added a comment.


  I am quite enthousiastic for a non-filelog based copy tracing using commit level information. However, I am very unenthousiastic at the idea of storing more copy data in the changeset itself. The "files" field in the changelog is already quite problematic (about 95% for changelog.d  file, taking possibily hundred of megabytes). In addition, stoing copy in the changeset is kind of a "schema breakage" making its adoption slower.
  
  Instead I would advertise for keeping the copy data inside the filelog, using a changeset centric cache summing up the information. The entries from this changeset centric cache can be exchanged over the wire alongside their associated changesets, solving your remote-filelog usecase.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D6183

To: martinvonz, #hg-reviewers
Cc: marmoute, mercurial-devel
phabricator - April 5, 2019, 1:01 p.m.
martinvonz added a comment.


  In https://phab.mercurial-scm.org/D6183#90300, @marmoute wrote:
  
  > I am quite enthousiastic for a non-filelog based copy tracing using commit level information. However, I am very unenthousiastic at the idea of storing more copy data in the changeset itself. The "files" field in the changelog is already quite problematic (about 95% for changelog.d  file, taking possibily hundred of megabytes).
  
  
  It seems like it should save a similar amount of data from the filelog, so what's your concern? That things that need to scan the changelog will need to read more data from disk? Most commits don't have copy information, so I'm not too worried about this. Feel free to experiment on a large repo and insert copy information in changesets there and see how much larger changelog.d becomes. (I was planning to do that later for performance testing.)
  
  > In addition, stoing copy in the changeset is kind of a "schema breakage" making its adoption slower.
  > 
  > Instead I would advertise for keeping the copy data inside the filelog, using a changeset centric cache summing up the information. The entries from this changeset centric cache can be exchanged over the wire alongside their associated changesets, solving your remote-filelog usecase.
  
  That sounds more complicated for unclear benefit.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D6183

To: martinvonz, #hg-reviewers
Cc: marmoute, mercurial-devel
phabricator - April 8, 2019, 3:20 p.m.
gracinet added a comment.


  Hi Martin,
  
  Thanks for taking on copy tracing, it's been on our mind for a while, too.
  
  Some of our users would be very interested in the expected speedups of the copy tracing system, however the impact of putting that data in the changeset itself would not be acceptable to them in practice. For instance, if I understand correctly, it would affect all existing hashes and could lead to subtle problems while exchanging data. It's possible that some of that could be worked around over time, but from our perspective, it looks as if a cache-based system would avoid them entirely. This would make the benefits of your work available to all users in the short term.
  
  We understand it looks to be more complex to you, but we're willing to help. I'm pretty confident that, working together, we can nail this before the freeze in a way that would lift all concerns. This matter is important enough to us that we're ready to make working with you on that a priority in our schedule.
  
  If that suits you, we could have a video chat this week to get over the details - of which we could post a summary here to keep the whole community in the loop.
  
  What do you think?

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D6183

To: martinvonz, #hg-reviewers
Cc: gracinet, marmoute, mercurial-devel
phabricator - April 8, 2019, 3:43 p.m.
martinvonz added a comment.


  In https://phab.mercurial-scm.org/D6183#90327, @martinvonz wrote:
  
  > In https://phab.mercurial-scm.org/D6183#90300, @marmoute wrote:
  >
  > > I am quite enthousiastic for a non-filelog based copy tracing using commit level information. However, I am very unenthousiastic at the idea of storing more copy data in the changeset itself. The "files" field in the changelog is already quite problematic (about 95% for changelog.d  file, taking possibily hundred of megabytes).
  >
  >
  > It seems like it should save a similar amount of data from the filelog, so what's your concern? That things that need to scan the changelog will need to read more data from disk? Most commits don't have copy information, so I'm not too worried about this. Feel free to experiment on a large repo and insert copy information in changesets there and see how much larger changelog.d becomes. (I was planning to do that later for performance testing.)
  
  
  It took a few days to convert (an old version of) the mozilla-unified repo. I converted it once with copies in changeset and once with copies in filelogs (to remove any influence from different delta base selection in new versions of hg). Here's the result:
  
              size           |   in filelog | in changeset | increase |
    .hg/store/00changelog.d  |    127067298 |    128173208 |    0.87% |
    .hg/                     |   2866813806 |   2804688010 |   -2.17% |
  
  The performance impact is terrible, however. `hg st --rev last-mozilla-central --rev GECKO_2_1_BASE` (~30k commits apart) went from about 5 seconds to about 6 minutes. That's because the current code reads manifests. We should be able to remove that.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D6183

To: martinvonz, #hg-reviewers
Cc: gracinet, marmoute, mercurial-devel
phabricator - April 8, 2019, 3:59 p.m.
martinvonz added a comment.


  In https://phab.mercurial-scm.org/D6183#90470, @gracinet wrote:
  
  > Hi Martin,
  >
  > Thanks for taking on copy tracing, it's been on our mind for a while, too.
  >
  > Some of our users would be very interested in the expected speedups of the copy tracing system, however the impact of putting that data in the changeset itself would not be acceptable to them in practice. For instance, if I understand correctly, it would affect all existing hashes and could lead to subtle problems while exchanging data. It's possible that some of that could be worked around over time, but from our perspective, it looks as if a cache-based system would avoid them entirely. This would make the benefits of your work available to all users in the short term.
  >
  > We understand it looks to be more complex to you, but we're willing to help. I'm pretty confident that, working together, we can nail this before the freeze in a way that would lift all concerns. This matter is important enough to us that we're ready to make working with you on that a priority in our schedule.
  >
  > If that suits you, we could have a video chat this week to get over the details - of which we could post a summary here to keep the whole community in the loop.
  >
  > What do you think?
  
  
  I see benefits of both solutions. As you said, the cache solution's primary benefit is that it works on existing repos. For new repos (in environments where everyone using the repo has a new Mercurial version), it seems simpler to store and exchange the copy information in the changeset instead of writing it to the filelog and also to a cache.
  
  I think the way I've written these patches, it shouldn't be too hard for you guys to extend it (in core) by adding another option for `experimental.copies.read-from` to make it read from the cache. I'm happy to talk about that. I'm honestly not happy to talk about doing only the cache solution. Okay with you?

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D6183

To: martinvonz, #hg-reviewers
Cc: gracinet, marmoute, mercurial-devel
phabricator - April 8, 2019, 4:06 p.m.
pulkit added a comment.


  In https://phab.mercurial-scm.org/D6183#90481, @martinvonz wrote:
  
  > In https://phab.mercurial-scm.org/D6183#90327, @martinvonz wrote:
  >
  > > In https://phab.mercurial-scm.org/D6183#90300, @marmoute wrote:
  > >
  > > > I am quite enthousiastic for a non-filelog based copy tracing using commit level information. However, I am very unenthousiastic at the idea of storing more copy data in the changeset itself. The "files" field in the changelog is already quite problematic (about 95% for changelog.d  file, taking possibily hundred of megabytes).
  > >
  > >
  > > It seems like it should save a similar amount of data from the filelog, so what's your concern? That things that need to scan the changelog will need to read more data from disk? Most commits don't have copy information, so I'm not too worried about this. Feel free to experiment on a large repo and insert copy information in changesets there and see how much larger changelog.d becomes. (I was planning to do that later for performance testing.)
  >
  >
  > It took a few days to convert (an old version of) the mozilla-unified repo. I converted it once with copies in changeset and once with copies in filelogs (to remove any influence from different delta base selection in new versions of hg). Here's the result:
  >
  >             size           |   in filelog | in changeset | increase |
  >   .hg/store/00changelog.d  |    127067298 |    128173208 |    0.87% |
  >   .hg/                     |   2866813806 |   2804688010 |   -2.17% |
  >
  
  
  I am following this discussion closely because copytracing is very painful for us too. The above numbers looks nice. How can I try this myself on some internal repo?

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D6183

To: martinvonz, #hg-reviewers
Cc: pulkit, gracinet, marmoute, mercurial-devel
phabricator - April 8, 2019, 4:12 p.m.
martinvonz added a comment.


  In https://phab.mercurial-scm.org/D6183#90486, @pulkit wrote:
  
  > In https://phab.mercurial-scm.org/D6183#90481, @martinvonz wrote:
  >
  > > In https://phab.mercurial-scm.org/D6183#90327, @martinvonz wrote:
  > >
  > > > In https://phab.mercurial-scm.org/D6183#90300, @marmoute wrote:
  > > >
  > > > > I am quite enthousiastic for a non-filelog based copy tracing using commit level information. However, I am very unenthousiastic at the idea of storing more copy data in the changeset itself. The "files" field in the changelog is already quite problematic (about 95% for changelog.d  file, taking possibily hundred of megabytes).
  > > >
  > > >
  > > > It seems like it should save a similar amount of data from the filelog, so what's your concern? That things that need to scan the changelog will need to read more data from disk? Most commits don't have copy information, so I'm not too worried about this. Feel free to experiment on a large repo and insert copy information in changesets there and see how much larger changelog.d becomes. (I was planning to do that later for performance testing.)
  > >
  > >
  > > It took a few days to convert (an old version of) the mozilla-unified repo. I converted it once with copies in changeset and once with copies in filelogs (to remove any influence from different delta base selection in new versions of hg). Here's the result:
  > >
  > >             size           |   in filelog | in changeset | increase |
  > >   .hg/store/00changelog.d  |    127067298 |    128173208 |    0.87% |
  > >   .hg/                     |   2866813806 |   2804688010 |   -2.17% |
  > >
  >
  >
  > I am following this discussion closely because copytracing is very painful for us too. The above numbers looks nice. How can I try this myself on some internal repo?
  
  
  You'll need to patch in this series and also https://phab.mercurial-scm.org/D6219. Then run `hg convert --config experimental.copies.write-to=changeset-only`. However, note that the performance is most likely going to be a lot *worse* for now, so there's not much reason to try it IMO (except to verify that my numbers above are valid, or to see that it won't be much worse in your repo). It took over 30 hours to convert the mozilla-unified repo for me.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D6183

To: martinvonz, #hg-reviewers
Cc: pulkit, gracinet, marmoute, mercurial-devel
phabricator - April 8, 2019, 4:45 p.m.
pulkit added a comment.


  In https://phab.mercurial-scm.org/D6183#90498, @martinvonz wrote:
  
  > In https://phab.mercurial-scm.org/D6183#90486, @pulkit wrote:
  >
  > > In https://phab.mercurial-scm.org/D6183#90481, @martinvonz wrote:
  > >
  > > > In https://phab.mercurial-scm.org/D6183#90327, @martinvonz wrote:
  > > >
  > > > > In https://phab.mercurial-scm.org/D6183#90300, @marmoute wrote:
  > > > >
  > > > > > I am quite enthousiastic for a non-filelog based copy tracing using commit level information. However, I am very unenthousiastic at the idea of storing more copy data in the changeset itself. The "files" field in the changelog is already quite problematic (about 95% for changelog.d  file, taking possibily hundred of megabytes).
  > > > >
  > > > >
  > > > > It seems like it should save a similar amount of data from the filelog, so what's your concern? That things that need to scan the changelog will need to read more data from disk? Most commits don't have copy information, so I'm not too worried about this. Feel free to experiment on a large repo and insert copy information in changesets there and see how much larger changelog.d becomes. (I was planning to do that later for performance testing.)
  > > >
  > > >
  > > > It took a few days to convert (an old version of) the mozilla-unified repo. I converted it once with copies in changeset and once with copies in filelogs (to remove any influence from different delta base selection in new versions of hg). Here's the result:
  > > >
  > > >             size           |   in filelog | in changeset | increase |
  > > >   .hg/store/00changelog.d  |    127067298 |    128173208 |    0.87% |
  > > >   .hg/                     |   2866813806 |   2804688010 |   -2.17% |
  > > >
  > >
  > >
  > > I am following this discussion closely because copytracing is very painful for us too. The above numbers looks nice. How can I try this myself on some internal repo?
  >
  >
  > You'll need to patch in this series and also https://phab.mercurial-scm.org/D6219. Then run `hg convert --config experimental.copies.write-to=changeset-only`. However, note that the performance is most likely going to be a lot *worse* for now, so there's not much reason to try it IMO (except to verify that my numbers above are valid, or to see that it won't be much worse in your repo). It took over 30 hours to convert the mozilla-unified repo for me.
  
  
  Thanks, I will check the change in size of .hg/ and changelog size.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D6183

To: martinvonz, #hg-reviewers
Cc: pulkit, gracinet, marmoute, mercurial-devel
phabricator - April 11, 2019, 9:22 a.m.
gracinet added a comment.


  We had half an hour of direct chat about this yesterday, with @martinvonz and @marmoute. Here's a summary
  
  - The architecture of the proposed change makes the storage and conveying of the information orthogonal to its usage. Therefore, it can be built upon to introduce other storage strategies, such as the caching system suggested by @marmoute.
  - The documentation for the new config option should clearly state that `changeset-only` is applicable only in cases where all the peers enable it and prior changesets hashes don't matter. I'm not 100% sure of the exact technical condition at this point, but I suppose someone can find a simple sentence that would be correct enough.
  - Octobus hopes that the caching strategy could eventually become useful for the general public.
  - @martinvonz helped us understand his implementation by answering further technical questions by @marmoute.
  
  Overall, my feeling is that it's been a very productive talk and that we have a clear path forward.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D6183

To: martinvonz, #hg-reviewers
Cc: pulkit, gracinet, marmoute, mercurial-devel
phabricator - April 12, 2019, 4:26 a.m.
martinvonz added a comment.


  In https://phab.mercurial-scm.org/D6183#90573, @gracinet wrote:
  
  > We had half an hour of direct chat about this yesterday, with @martinvonz and @marmoute. Here's a summary
  >
  > - The architecture of the proposed change makes the storage and conveying of the information orthogonal to its usage. Therefore, it can be built upon to introduce other storage strategies, such as the caching system suggested by @marmoute.
  > - The documentation for the new config option should clearly state that `changeset-only` is applicable only in cases where all the peers enable it and prior changesets hashes don't matter. I'm not 100% sure of the exact technical condition at this point, but I suppose someone can find a simple sentence that would be correct enough.
  > - Octobus hopes that the caching strategy could eventually become useful for the general public.
  > - @martinvonz helped us understand his implementation by answering further technical questions by @marmoute.
  >
  >   Overall, my feeling is that it's been a very productive talk and that we have a clear path forward.
  
  
  Thanks for the summary, Georges! Reviewers, feel free to queue this if you think it looks good.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D6183

To: martinvonz, #hg-reviewers
Cc: pulkit, gracinet, marmoute, mercurial-devel
Yuya Nishihara - April 13, 2019, 11:43 a.m.
> +def encodecopies(copies):
> +    items = [
> +        '%s\0%s' % (_string_escape(k), _string_escape(copies[k]))
> +        for k in sorted(copies)
> +    ]
> +    return "\n".join(items)

It might be nitpicky, but I think it's better to not embed `\0` into the
extras field. Almost all extras data are texts, and IIRC we regret that
transplant sources are stored in binary form.
phabricator - April 13, 2019, 11:45 a.m.
yuja added a comment.


  > +def encodecopies(copies):
  >  +    items = [
  >  +        '%s\0%s' % (_string_escape(k), _string_escape(copies[k]))
  >  +        for k in sorted(copies)
  >  +    ]
  >  +    return "\n".join(items)
  
  It might be nitpicky, but I think it's better to not embed `\0` into the
  extras field. Almost all extras data are texts, and IIRC we regret that
  transplant sources are stored in binary form.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D6183

To: martinvonz, #hg-reviewers
Cc: yuja, pulkit, gracinet, marmoute, mercurial-devel
phabricator - April 13, 2019, 4:32 p.m.
martinvonz added a comment.


  In https://phab.mercurial-scm.org/D6183#90698, @yuja wrote:
  
  > > +def encodecopies(copies):
  > >  +    items = [
  > >  +        '%s\0%s' % (_string_escape(k), _string_escape(copies[k]))
  > >  +        for k in sorted(copies)
  > >  +    ]
  > >  +    return "\n".join(items)
  >
  > It might be nitpicky, but I think it's better to not embed `\0` into the
  >  extras field. Almost all extras data are texts, and IIRC we regret that
  >  transplant sources are stored in binary form.
  
  
  Why not? I picked \0 and \n because they won't appear in filenames, so it's convenient in that way.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D6183

To: martinvonz, #hg-reviewers
Cc: yuja, pulkit, gracinet, marmoute, mercurial-devel
Yuya Nishihara - April 14, 2019, 12:25 a.m.
>   > It might be nitpicky, but I think it's better to not embed `\0` into the
>   >  extras field. Almost all extras data are texts, and IIRC we regret that
>   >  transplant sources are stored in binary form.
>   
>   Why not? I picked \0 and \n because they won't appear in filenames, so it's convenient in that way.

I don't remember, but we do store even boolean value as text, not in binary
`\0`/`\1` form. `transplant_source` is the solo exception.

https://www.mercurial-scm.org/wiki/ChangesetExtra

And if we pick \0/\n separators, _string_escape() wouldn't be needed
at the encodecopies() layer.
phabricator - April 14, 2019, 12:26 a.m.
yuja added a comment.


  >   > It might be nitpicky, but I think it's better to not embed `\0` into the
  >   >  extras field. Almost all extras data are texts, and IIRC we regret that
  >   >  transplant sources are stored in binary form.
  >   
  >   Why not? I picked \0 and \n because they won't appear in filenames, so it's convenient in that way.
  
  I don't remember, but we do store even boolean value as text, not in binary
  `\0`/`\1` form. `transplant_source` is the solo exception.
  
  https://www.mercurial-scm.org/wiki/ChangesetExtra
  
  And if we pick \0/\n separators, _string_escape() wouldn't be needed
  at the encodecopies() layer.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D6183

To: martinvonz, #hg-reviewers
Cc: yuja, pulkit, gracinet, marmoute, mercurial-devel
phabricator - April 14, 2019, 5:22 a.m.
martinvonz added a comment.


  In https://phab.mercurial-scm.org/D6183#90722, @yuja wrote:
  
  > >   > It might be nitpicky, but I think it's better to not embed `\0` into the
  > >   >  extras field. Almost all extras data are texts, and IIRC we regret that
  > >   >  transplant sources are stored in binary form.
  > >   
  > >   Why not? I picked \0 and \n because they won't appear in filenames, so it's convenient in that way.
  >
  > I don't remember, but we do store even boolean value as text, not in binary
  >  `\0`/`\1` form. `transplant_source` is the solo exception.
  
  
  Perhaps it's just so `{extras}` doesn't print ANSI escape codes and such? (I assume that can still happen if put escape characters in your filenames, for example.)
  
  > https://www.mercurial-scm.org/wiki/ChangesetExtra
  > 
  > And if we pick \0/\n separators, _string_escape() wouldn't be needed
  >  at the encodecopies() layer.
  
  Oh, now I see what you're saying! That's embarrassing. So maybe we should `_string_escape()` the whole thing? I'll do that.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D6183

To: martinvonz, #hg-reviewers
Cc: yuja, pulkit, gracinet, marmoute, mercurial-devel
Yuya Nishihara - April 14, 2019, 11:01 p.m.
>   > >   >  extras field. Almost all extras data are texts, and IIRC we regret that
>   > >   >  transplant sources are stored in binary form.
>   > >   
>   > >   Why not? I picked \0 and \n because they won't appear in filenames, so it's convenient in that way.
>   >
>   > I don't remember, but we do store even boolean value as text, not in binary
>   >  `\0`/`\1` form. `transplant_source` is the solo exception.
>   
>   
>   Perhaps it's just so `{extras}` doesn't print ANSI escape codes and such? (I assume that can still happen if put escape characters in your filenames, for example.)

Ok. That might be the reason, and I'm fine with the `\0` separator.

>   > https://www.mercurial-scm.org/wiki/ChangesetExtra
>   > 
>   > And if we pick \0/\n separators, _string_escape() wouldn't be needed
>   >  at the encodecopies() layer.
>   
>   Oh, now I see what you're saying! That's embarrassing. So maybe we should `_string_escape()` the whole thing? I'll do that.

Not really. I meant `_string_escape()` could be removed entirely if we store
copies in binary (valid_filename + invalid_filename_separator) form. The extra
dict will be encoded later.
phabricator - April 14, 2019, 11:02 p.m.
yuja added a comment.


  >   > >   >  extras field. Almost all extras data are texts, and IIRC we regret that
  >   > >   >  transplant sources are stored in binary form.
  >   > >   
  >   > >   Why not? I picked \0 and \n because they won't appear in filenames, so it's convenient in that way.
  >   >
  >   > I don't remember, but we do store even boolean value as text, not in binary
  >   >  `\0`/`\1` form. `transplant_source` is the solo exception.
  >   
  >   
  >   Perhaps it's just so `{extras}` doesn't print ANSI escape codes and such? (I assume that can still happen if put escape characters in your filenames, for example.)
  
  Ok. That might be the reason, and I'm fine with the `\0` separator.
  
  >   > https://www.mercurial-scm.org/wiki/ChangesetExtra
  >   > 
  >   > And if we pick \0/\n separators, _string_escape() wouldn't be needed
  >   >  at the encodecopies() layer.
  >   
  >   Oh, now I see what you're saying! That's embarrassing. So maybe we should `_string_escape()` the whole thing? I'll do that.
  
  Not really. I meant `_string_escape()` could be removed entirely if we store
  copies in binary (valid_filename + invalid_filename_separator) form. The extra
  dict will be encoded later.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D6183

To: martinvonz, #hg-reviewers
Cc: yuja, pulkit, gracinet, marmoute, mercurial-devel
phabricator - April 16, 2019, 4:10 p.m.
martinvonz added a comment.


  In https://phab.mercurial-scm.org/D6183#90738, @yuja wrote:
  
  > >   > >   >  extras field. Almost all extras data are texts, and IIRC we regret that
  > >   > >   >  transplant sources are stored in binary form.
  > >   > >   
  > >   > >   Why not? I picked \0 and \n because they won't appear in filenames, so it's convenient in that way.
  > >   >
  > >   > I don't remember, but we do store even boolean value as text, not in binary
  > >   >  `\0`/`\1` form. `transplant_source` is the solo exception.
  > >   
  > >   
  > >   Perhaps it's just so `{extras}` doesn't print ANSI escape codes and such? (I assume that can still happen if put escape characters in your filenames, for example.)
  >
  > Ok. That might be the reason, and I'm fine with the `\0` separator.
  >
  > >   > https://www.mercurial-scm.org/wiki/ChangesetExtra
  > >   > 
  > >   > And if we pick \0/\n separators, _string_escape() wouldn't be needed
  > >   >  at the encodecopies() layer.
  > >   
  > >   Oh, now I see what you're saying! That's embarrassing. So maybe we should `_string_escape()` the whole thing? I'll do that.
  >
  > Not really. I meant `_string_escape()` could be removed entirely if we store
  >  copies in binary (valid_filename + invalid_filename_separator) form. The extra
  >  dict will be encoded later.
  
  
  Sure, if we're okay with the `\0` and `\n` separators being printed to the terminal when the user uses the `{extras}` template, then we can just drop the encoding. Sounds like you're okay with that, and I also don't care too much, so I'll drop the encoding.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D6183

To: martinvonz, #hg-reviewers
Cc: yuja, pulkit, gracinet, marmoute, mercurial-devel
phabricator - April 17, 2019, 10:38 a.m.
indygreg added a comment.


  I support experimenting with putting copy metadata in the changelog. And the patches before this one did a lot of work to allow copy metadata to be read from alternate sources, which is great, since it can allow flexibility in the future (think copy caches, copy modifications outside of a commit, etc).
  
  I haven't looked at all these patches in detail, but it seems to me there should be a repo requirement in the case(s) where (all) copy metadata is not in the filelogs. Without a repo requirement, an old client may attempt to open a repo and not be able to find the copy metadata. It is OK to duplicate copy metadata in the changelog and have newer clients use copy metadata from the changelog if it is available. But if all copy metadata isn't available in the filelogs, there needs to be a requirement to lock out old clients.
  
  That being said, we may want to be aggressive than this! If a new client is writing copy metadata to filelogs and the changelog, an old client may commit to the repo with the copy metadata just in the filelogs. I'm not sure about the code behavior, but presumably a new client configured to use changelog copy metadata would forego reading the filelog metadata since it is expecting to read it from the changelog. This could result in a new client missing copy metadata written by an old client. So we would need a repo requirement to lock out old clients from writing to the repo.
  
  Then there's the wire protocol aspect. How does the copy metadata writing setting get propagated to the client? If it fails to get propagated, it is a similar situation to the local repo situation. Again, there needs to be some kind of requirement/capability detection here and the server setting needs to find its way to the client or else bad things can happen.
  
  Anyway, this is exciting work! It is still an experimental feature, so the implementation doesn't have to be perfect. But we will need to cross the repo requirements/capabilities bridge at some point. Can't wait to see the benefits of this work!

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D6183

To: martinvonz, #hg-reviewers
Cc: indygreg, yuja, pulkit, gracinet, marmoute, mercurial-devel
phabricator - April 17, 2019, 10:43 a.m.
indygreg added a comment.


  An idea to consider (which may have been proposed already) is to write a *no copy metadata* entry into extras when writing copy metadata to the changelog. If we did things this way, a new client could know definitively that no copy metadata is available and to not fall back to reading from the filelogs. I haven't fully thought this through, but that should provide better compatibility between older and newer clients. Obviously the tradeoff is you could have a mixed repo (some changesets wouldn't have copy metadata in changelog) and you would need to duplicate copy metadata across changelog and filelogs to maintain compatibility. Something to contemplate...

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D6183

To: martinvonz, #hg-reviewers
Cc: indygreg, yuja, pulkit, gracinet, marmoute, mercurial-devel
phabricator - April 18, 2019, 5:59 p.m.
martinvonz added a comment.


  In https://phab.mercurial-scm.org/D6183#91060, @indygreg wrote:
  
  > I support experimenting with putting copy metadata in the changelog. And the patches before this one did a lot of work to allow copy metadata to be read from alternate sources, which is great, since it can allow flexibility in the future (think copy caches, copy modifications outside of a commit, etc).
  
  
  Yes, Pierre-Yves and Georges wanted to work on adding them to a cache. As you hint at, that would also allow copy detection to be run at a later point to update the cache. And yes, the previous patches should have decoupled the algorithms from the storage well (the new assumption is that we can cheaply get copy metadata for a whole changeset).
  
  > I haven't looked at all these patches in detail, but it seems to me there should be a repo requirement in the case(s) where (all) copy metadata is not in the filelogs. Without a repo requirement, an old client may attempt to open a repo and not be able to find the copy metadata. It is OK to duplicate copy metadata in the changelog and have newer clients use copy metadata from the changelog if it is available. But if all copy metadata isn't available in the filelogs, there needs to be a requirement to lock out old clients.
  
  I had considered that, but figured that copy information isn't essential enough to warrant a repo requirement. But I agree with your next paragraph.
  
  > That being said, we may want to be aggressive than this! If a new client is writing copy metadata to filelogs and the changelog, an old client may commit to the repo with the copy metadata just in the filelogs. I'm not sure about the code behavior, but presumably a new client configured to use changelog copy metadata would forego reading the filelog metadata since it is expecting to read it from the changelog. This could result in a new client missing copy metadata written by an old client. So we would need a repo requirement to lock out old clients from writing to the repo.
  
  That's still not a disaster, but I agree that it still seems better to lock them out.
  
  > Then there's the wire protocol aspect. How does the copy metadata writing setting get propagated to the client? If it fails to get propagated, it is a similar situation to the local repo situation. Again, there needs to be some kind of requirement/capability detection here and the server setting needs to find its way to the client or else bad things can happen.
  
  Good point!
  
  > Anyway, this is exciting work! It is still an experimental feature, so the implementation doesn't have to be perfect. But we will need to cross the repo requirements/capabilities bridge at some point. Can't wait to see the benefits of this work!

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D6183

To: martinvonz, #hg-reviewers
Cc: indygreg, yuja, pulkit, gracinet, marmoute, mercurial-devel
phabricator - April 18, 2019, 6:09 p.m.
martinvonz added a comment.


  In https://phab.mercurial-scm.org/D6183#91062, @indygreg wrote:
  
  > An idea to consider (which may have been proposed already) is to write a *no copy metadata* entry into extras when writing copy metadata to the changelog. If we did things this way, a new client could know definitively that no copy metadata is available and to not fall back to reading from the filelogs. I haven't fully thought this through, but that should provide better compatibility between older and newer clients. Obviously the tradeoff is you could have a mixed repo (some changesets wouldn't have copy metadata in changelog) and you would need to duplicate copy metadata across changelog and filelogs to maintain compatibility. Something to contemplate...
  
  
  That's actually what I did initially, and `context._copies()` is still written to work that way (not consult filelogs if an empty p1copies entry was recorded in the changeset), but I haven't added a mode where we write empty entries. I should do that. Thanks for pointing that out.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D6183

To: martinvonz, #hg-reviewers
Cc: indygreg, yuja, pulkit, gracinet, marmoute, mercurial-devel

Patch

diff --git a/tests/test-fastannotate-hg.t b/tests/test-fastannotate-hg.t
--- a/tests/test-fastannotate-hg.t
+++ b/tests/test-fastannotate-hg.t
@@ -443,7 +443,7 @@ 
   > def reposetup(ui, repo):
   >     class legacyrepo(repo.__class__):
   >         def _filecommit(self, fctx, manifest1, manifest2,
-  >                         linkrev, tr, changelist):
+  >                         linkrev, tr, changelist, includecopymeta):
   >             fname = fctx.path()
   >             text = fctx.data()
   >             flog = self.file(fname)
diff --git a/tests/test-copies-in-changeset.t b/tests/test-copies-in-changeset.t
new file mode 100644
--- /dev/null
+++ b/tests/test-copies-in-changeset.t
@@ -0,0 +1,113 @@ 
+
+  $ cat >> $HGRCPATH << EOF
+  > [experimental]
+  > copies.write-to=changeset-only
+  > [alias]
+  > changesetcopies = log -r . -T "files: {files}
+  >   p1copies: {get(extras,'p1copies')}
+  >   p2copies: {get(extras,'p2copies')}
+  >   "
+  > EOF
+
+Check that copies are recorded correctly
+
+  $ hg init repo
+  $ cd repo
+  $ echo a > a
+  $ hg add a
+  $ hg ci -m initial
+  $ hg cp a b
+  $ hg cp a c
+  $ hg cp a d
+  $ hg ci -m 'copy a to b, c, and d'
+  $ hg changesetcopies
+  files: b c d
+  p1copies: b\x00a (esc)
+  c\x00a (esc)
+  d\x00a (esc)
+  p2copies: 
+
+Check that renames are recorded correctly
+
+  $ hg mv b b2
+  $ hg ci -m 'rename b to b2'
+  $ hg changesetcopies
+  files: b b2
+  p1copies: b2\x00b (esc)
+  p2copies: 
+
+Rename onto existing file. This should get recorded in the changeset files list and in the extras,
+even though there is no filelog entry.
+
+  $ hg cp b2 c --force
+  $ hg st --copies
+  M c
+    b2
+  $ hg debugindex c
+     rev linkrev nodeid       p1           p2
+       0       1 b789fdd96dc2 000000000000 000000000000
+  $ hg ci -m 'move b onto d'
+  $ hg changesetcopies
+  files: c
+  p1copies: c\x00b2 (esc)
+  p2copies: 
+  $ hg debugindex c
+     rev linkrev nodeid       p1           p2
+       0       1 b789fdd96dc2 000000000000 000000000000
+
+Create a merge commit with copying done during merge.
+
+  $ hg co 0
+  0 files updated, 0 files merged, 3 files removed, 0 files unresolved
+  $ hg cp a e
+  $ hg cp a f
+  $ hg ci -m 'copy a to e and f'
+  created new head
+  $ hg merge 3
+  3 files updated, 0 files merged, 0 files removed, 0 files unresolved
+  (branch merge, don't forget to commit)
+File 'a' exists on both sides, so 'g' could be recorded as being from p1 or p2, but we currently
+always record it as being from p1
+  $ hg cp a g
+File 'd' exists only in p2, so 'h' should be from p2
+  $ hg cp d h
+File 'f' exists only in p1, so 'i' should be from p1
+  $ hg cp f i
+  $ hg ci -m 'merge'
+  $ hg changesetcopies
+  files: g h i
+  p1copies: g\x00a (esc)
+  i\x00f (esc)
+  p2copies: h\x00d (esc)
+
+Test writing to both changeset and filelog
+
+  $ hg cp a j
+  $ hg ci -m 'copy a to j' --config experimental.copies.write-to=compatibility
+  $ hg changesetcopies
+  files: j
+  p1copies: j\x00a (esc)
+  p2copies: 
+  $ hg debugdata j 0
+  \x01 (esc)
+  copy: a
+  copyrev: b789fdd96dc2f3bd229c1dd8eedf0fc60e2b68e3
+  \x01 (esc)
+  a
+
+Test writing only to filelog
+
+  $ hg cp a k
+  $ hg ci -m 'copy a to k' --config experimental.copies.write-to=filelog-only
+  $ hg changesetcopies
+  files: k
+  p1copies: 
+  p2copies: 
+  $ hg debugdata k 0
+  \x01 (esc)
+  copy: a
+  copyrev: b789fdd96dc2f3bd229c1dd8eedf0fc60e2b68e3
+  \x01 (esc)
+  a
+
+  $ cd ..
diff --git a/tests/test-annotate.t b/tests/test-annotate.t
--- a/tests/test-annotate.t
+++ b/tests/test-annotate.t
@@ -438,7 +438,7 @@ 
   > def reposetup(ui, repo):
   >     class legacyrepo(repo.__class__):
   >         def _filecommit(self, fctx, manifest1, manifest2,
-  >                         linkrev, tr, changelist):
+  >                         linkrev, tr, changelist, includecopymeta):
   >             fname = fctx.path()
   >             text = fctx.data()
   >             flog = self.file(fname)
diff --git a/mercurial/localrepo.py b/mercurial/localrepo.py
--- a/mercurial/localrepo.py
+++ b/mercurial/localrepo.py
@@ -2292,7 +2292,8 @@ 
         """Returns the wlock if it's held, or None if it's not."""
         return self._currentlock(self._wlockref)
 
-    def _filecommit(self, fctx, manifest1, manifest2, linkrev, tr, changelist):
+    def _filecommit(self, fctx, manifest1, manifest2, linkrev, tr, changelist,
+                    includecopymeta):
         """
         commit an individual file as part of a larger transaction
         """
@@ -2337,7 +2338,7 @@ 
             if manifest2: # branch merge
                 if fparent2 == nullid or cnode is None: # copied on remote side
                     if cfname in manifest2:
-                        cnode= manifest2[cfname]
+                        cnode = manifest2[cfname]
                         newfparent = fparent1
 
             # Here, we used to search backwards through history to try to find
@@ -2351,8 +2352,9 @@ 
 
             if cnode:
                 self.ui.debug(" %s: copy %s:%s\n" % (fname, cfname, hex(cnode)))
-                meta["copy"] = cfname
-                meta["copyrev"] = hex(cnode)
+                if includecopymeta:
+                    meta["copy"] = cfname
+                    meta["copyrev"] = hex(cnode)
                 fparent1, fparent2 = nullid, newfparent
             else:
                 self.ui.warn(_("warning: can't find ancestor for '%s' "
@@ -2520,6 +2522,12 @@ 
         p1, p2 = ctx.p1(), ctx.p2()
         user = ctx.user()
 
+        writecopiesto = self.ui.config('experimental', 'copies.write-to')
+        writefilecopymeta = writecopiesto != 'changeset-only'
+        p1copies, p2copies = None, None
+        if writecopiesto in ('changeset-only', 'compatibility'):
+            p1copies = ctx.p1copies()
+            p2copies = ctx.p2copies()
         with self.lock(), self.transaction("commit") as tr:
             trp = weakref.proxy(tr)
 
@@ -2553,7 +2561,8 @@ 
                         else:
                             added.append(f)
                             m[f] = self._filecommit(fctx, m1, m2, linkrev,
-                                                    trp, changed)
+                                                    trp, changed,
+                                                    writefilecopymeta)
                             m.setflag(f, fctx.flags())
                     except OSError:
                         self.ui.warn(_("trouble committing %s!\n") %
@@ -2607,7 +2616,8 @@ 
             self.changelog.delayupdate(tr)
             n = self.changelog.add(mn, files, ctx.description(),
                                    trp, p1.node(), p2.node(),
-                                   user, ctx.date(), ctx.extra().copy())
+                                   user, ctx.date(), ctx.extra().copy(),
+                                   p1copies, p2copies)
             xp1, xp2 = p1.hex(), p2 and p2.hex() or ''
             self.hook('pretxncommit', throw=True, node=hex(n), parent1=xp1,
                       parent2=xp2)
diff --git a/mercurial/configitems.py b/mercurial/configitems.py
--- a/mercurial/configitems.py
+++ b/mercurial/configitems.py
@@ -488,6 +488,9 @@ 
 coreconfigitem('experimental', 'copies.read-from',
     default="filelog-only",
 )
+coreconfigitem('experimental', 'copies.write-to',
+    default='file',
+)
 coreconfigitem('experimental', 'crecordtest',
     default=None,
 )
diff --git a/mercurial/changelog.py b/mercurial/changelog.py
--- a/mercurial/changelog.py
+++ b/mercurial/changelog.py
@@ -77,6 +77,13 @@ 
     ]
     return "\0".join(items)
 
+def encodecopies(copies):
+    items = [
+        '%s\0%s' % (_string_escape(k), _string_escape(copies[k]))
+        for k in sorted(copies)
+    ]
+    return "\n".join(items)
+
 def stripdesc(desc):
     """strip trailing whitespace and leading and trailing empty lines"""
     return '\n'.join([l.rstrip() for l in desc.splitlines()]).strip('\n')
@@ -530,7 +537,7 @@ 
         return l[3:]
 
     def add(self, manifest, files, desc, transaction, p1, p2,
-                  user, date=None, extra=None):
+                  user, date=None, extra=None, p1copies=None, p2copies=None):
         # Convert to UTF-8 encoded bytestrings as the very first
         # thing: calling any method on a localstr object will turn it
         # into a str object and the cached UTF-8 string is thus lost.
@@ -559,6 +566,13 @@ 
             elif branch in (".", "null", "tip"):
                 raise error.StorageError(_('the name \'%s\' is reserved')
                                          % branch)
+        if (p1copies or p2copies) and extra is None:
+            extra = {}
+        if p1copies:
+            extra['p1copies'] = encodecopies(p1copies)
+        if p2copies:
+            extra['p2copies'] = encodecopies(p2copies)
+
         if extra:
             extra = encodeextra(extra)
             parseddate = "%s %s" % (parseddate, extra)