Submitter | phabricator |
---|---|
Date | April 2, 2019, 7:31 p.m. |
Message ID | <differential-rev-PHID-DREV-z5ukrljwvjnta3cpmcke-req@phab.mercurial-scm.org> |
Download | mbox | patch |
Permalink | /patch/39448/ |
State | Superseded |
Headers | show |
Comments
marmoute added a comment. I am quite enthousiastic for a non-filelog based copy tracing using commit level information. However, I am very unenthousiastic at the idea of storing more copy data in the changeset itself. The "files" field in the changelog is already quite problematic (about 95% for changelog.d file, taking possibily hundred of megabytes). In addition, stoing copy in the changeset is kind of a "schema breakage" making its adoption slower. Instead I would advertise for keeping the copy data inside the filelog, using a changeset centric cache summing up the information. The entries from this changeset centric cache can be exchanged over the wire alongside their associated changesets, solving your remote-filelog usecase. REPOSITORY rHG Mercurial REVISION DETAIL https://phab.mercurial-scm.org/D6183 To: martinvonz, #hg-reviewers Cc: marmoute, mercurial-devel
martinvonz added a comment. In https://phab.mercurial-scm.org/D6183#90300, @marmoute wrote: > I am quite enthousiastic for a non-filelog based copy tracing using commit level information. However, I am very unenthousiastic at the idea of storing more copy data in the changeset itself. The "files" field in the changelog is already quite problematic (about 95% for changelog.d file, taking possibily hundred of megabytes). It seems like it should save a similar amount of data from the filelog, so what's your concern? That things that need to scan the changelog will need to read more data from disk? Most commits don't have copy information, so I'm not too worried about this. Feel free to experiment on a large repo and insert copy information in changesets there and see how much larger changelog.d becomes. (I was planning to do that later for performance testing.) > In addition, stoing copy in the changeset is kind of a "schema breakage" making its adoption slower. > > Instead I would advertise for keeping the copy data inside the filelog, using a changeset centric cache summing up the information. The entries from this changeset centric cache can be exchanged over the wire alongside their associated changesets, solving your remote-filelog usecase. That sounds more complicated for unclear benefit. REPOSITORY rHG Mercurial REVISION DETAIL https://phab.mercurial-scm.org/D6183 To: martinvonz, #hg-reviewers Cc: marmoute, mercurial-devel
gracinet added a comment. Hi Martin, Thanks for taking on copy tracing, it's been on our mind for a while, too. Some of our users would be very interested in the expected speedups of the copy tracing system, however the impact of putting that data in the changeset itself would not be acceptable to them in practice. For instance, if I understand correctly, it would affect all existing hashes and could lead to subtle problems while exchanging data. It's possible that some of that could be worked around over time, but from our perspective, it looks as if a cache-based system would avoid them entirely. This would make the benefits of your work available to all users in the short term. We understand it looks to be more complex to you, but we're willing to help. I'm pretty confident that, working together, we can nail this before the freeze in a way that would lift all concerns. This matter is important enough to us that we're ready to make working with you on that a priority in our schedule. If that suits you, we could have a video chat this week to get over the details - of which we could post a summary here to keep the whole community in the loop. What do you think? REPOSITORY rHG Mercurial REVISION DETAIL https://phab.mercurial-scm.org/D6183 To: martinvonz, #hg-reviewers Cc: gracinet, marmoute, mercurial-devel
martinvonz added a comment. In https://phab.mercurial-scm.org/D6183#90327, @martinvonz wrote: > In https://phab.mercurial-scm.org/D6183#90300, @marmoute wrote: > > > I am quite enthousiastic for a non-filelog based copy tracing using commit level information. However, I am very unenthousiastic at the idea of storing more copy data in the changeset itself. The "files" field in the changelog is already quite problematic (about 95% for changelog.d file, taking possibily hundred of megabytes). > > > It seems like it should save a similar amount of data from the filelog, so what's your concern? That things that need to scan the changelog will need to read more data from disk? Most commits don't have copy information, so I'm not too worried about this. Feel free to experiment on a large repo and insert copy information in changesets there and see how much larger changelog.d becomes. (I was planning to do that later for performance testing.) It took a few days to convert (an old version of) the mozilla-unified repo. I converted it once with copies in changeset and once with copies in filelogs (to remove any influence from different delta base selection in new versions of hg). Here's the result: size | in filelog | in changeset | increase | .hg/store/00changelog.d | 127067298 | 128173208 | 0.87% | .hg/ | 2866813806 | 2804688010 | -2.17% | The performance impact is terrible, however. `hg st --rev last-mozilla-central --rev GECKO_2_1_BASE` (~30k commits apart) went from about 5 seconds to about 6 minutes. That's because the current code reads manifests. We should be able to remove that. REPOSITORY rHG Mercurial REVISION DETAIL https://phab.mercurial-scm.org/D6183 To: martinvonz, #hg-reviewers Cc: gracinet, marmoute, mercurial-devel
martinvonz added a comment. In https://phab.mercurial-scm.org/D6183#90470, @gracinet wrote: > Hi Martin, > > Thanks for taking on copy tracing, it's been on our mind for a while, too. > > Some of our users would be very interested in the expected speedups of the copy tracing system, however the impact of putting that data in the changeset itself would not be acceptable to them in practice. For instance, if I understand correctly, it would affect all existing hashes and could lead to subtle problems while exchanging data. It's possible that some of that could be worked around over time, but from our perspective, it looks as if a cache-based system would avoid them entirely. This would make the benefits of your work available to all users in the short term. > > We understand it looks to be more complex to you, but we're willing to help. I'm pretty confident that, working together, we can nail this before the freeze in a way that would lift all concerns. This matter is important enough to us that we're ready to make working with you on that a priority in our schedule. > > If that suits you, we could have a video chat this week to get over the details - of which we could post a summary here to keep the whole community in the loop. > > What do you think? I see benefits of both solutions. As you said, the cache solution's primary benefit is that it works on existing repos. For new repos (in environments where everyone using the repo has a new Mercurial version), it seems simpler to store and exchange the copy information in the changeset instead of writing it to the filelog and also to a cache. I think the way I've written these patches, it shouldn't be too hard for you guys to extend it (in core) by adding another option for `experimental.copies.read-from` to make it read from the cache. I'm happy to talk about that. I'm honestly not happy to talk about doing only the cache solution. Okay with you? REPOSITORY rHG Mercurial REVISION DETAIL https://phab.mercurial-scm.org/D6183 To: martinvonz, #hg-reviewers Cc: gracinet, marmoute, mercurial-devel
pulkit added a comment. In https://phab.mercurial-scm.org/D6183#90481, @martinvonz wrote: > In https://phab.mercurial-scm.org/D6183#90327, @martinvonz wrote: > > > In https://phab.mercurial-scm.org/D6183#90300, @marmoute wrote: > > > > > I am quite enthousiastic for a non-filelog based copy tracing using commit level information. However, I am very unenthousiastic at the idea of storing more copy data in the changeset itself. The "files" field in the changelog is already quite problematic (about 95% for changelog.d file, taking possibily hundred of megabytes). > > > > > > It seems like it should save a similar amount of data from the filelog, so what's your concern? That things that need to scan the changelog will need to read more data from disk? Most commits don't have copy information, so I'm not too worried about this. Feel free to experiment on a large repo and insert copy information in changesets there and see how much larger changelog.d becomes. (I was planning to do that later for performance testing.) > > > It took a few days to convert (an old version of) the mozilla-unified repo. I converted it once with copies in changeset and once with copies in filelogs (to remove any influence from different delta base selection in new versions of hg). Here's the result: > > size | in filelog | in changeset | increase | > .hg/store/00changelog.d | 127067298 | 128173208 | 0.87% | > .hg/ | 2866813806 | 2804688010 | -2.17% | > I am following this discussion closely because copytracing is very painful for us too. The above numbers looks nice. How can I try this myself on some internal repo? REPOSITORY rHG Mercurial REVISION DETAIL https://phab.mercurial-scm.org/D6183 To: martinvonz, #hg-reviewers Cc: pulkit, gracinet, marmoute, mercurial-devel
martinvonz added a comment. In https://phab.mercurial-scm.org/D6183#90486, @pulkit wrote: > In https://phab.mercurial-scm.org/D6183#90481, @martinvonz wrote: > > > In https://phab.mercurial-scm.org/D6183#90327, @martinvonz wrote: > > > > > In https://phab.mercurial-scm.org/D6183#90300, @marmoute wrote: > > > > > > > I am quite enthousiastic for a non-filelog based copy tracing using commit level information. However, I am very unenthousiastic at the idea of storing more copy data in the changeset itself. The "files" field in the changelog is already quite problematic (about 95% for changelog.d file, taking possibily hundred of megabytes). > > > > > > > > > It seems like it should save a similar amount of data from the filelog, so what's your concern? That things that need to scan the changelog will need to read more data from disk? Most commits don't have copy information, so I'm not too worried about this. Feel free to experiment on a large repo and insert copy information in changesets there and see how much larger changelog.d becomes. (I was planning to do that later for performance testing.) > > > > > > It took a few days to convert (an old version of) the mozilla-unified repo. I converted it once with copies in changeset and once with copies in filelogs (to remove any influence from different delta base selection in new versions of hg). Here's the result: > > > > size | in filelog | in changeset | increase | > > .hg/store/00changelog.d | 127067298 | 128173208 | 0.87% | > > .hg/ | 2866813806 | 2804688010 | -2.17% | > > > > > I am following this discussion closely because copytracing is very painful for us too. The above numbers looks nice. How can I try this myself on some internal repo? You'll need to patch in this series and also https://phab.mercurial-scm.org/D6219. Then run `hg convert --config experimental.copies.write-to=changeset-only`. However, note that the performance is most likely going to be a lot *worse* for now, so there's not much reason to try it IMO (except to verify that my numbers above are valid, or to see that it won't be much worse in your repo). It took over 30 hours to convert the mozilla-unified repo for me. REPOSITORY rHG Mercurial REVISION DETAIL https://phab.mercurial-scm.org/D6183 To: martinvonz, #hg-reviewers Cc: pulkit, gracinet, marmoute, mercurial-devel
pulkit added a comment. In https://phab.mercurial-scm.org/D6183#90498, @martinvonz wrote: > In https://phab.mercurial-scm.org/D6183#90486, @pulkit wrote: > > > In https://phab.mercurial-scm.org/D6183#90481, @martinvonz wrote: > > > > > In https://phab.mercurial-scm.org/D6183#90327, @martinvonz wrote: > > > > > > > In https://phab.mercurial-scm.org/D6183#90300, @marmoute wrote: > > > > > > > > > I am quite enthousiastic for a non-filelog based copy tracing using commit level information. However, I am very unenthousiastic at the idea of storing more copy data in the changeset itself. The "files" field in the changelog is already quite problematic (about 95% for changelog.d file, taking possibily hundred of megabytes). > > > > > > > > > > > > It seems like it should save a similar amount of data from the filelog, so what's your concern? That things that need to scan the changelog will need to read more data from disk? Most commits don't have copy information, so I'm not too worried about this. Feel free to experiment on a large repo and insert copy information in changesets there and see how much larger changelog.d becomes. (I was planning to do that later for performance testing.) > > > > > > > > > It took a few days to convert (an old version of) the mozilla-unified repo. I converted it once with copies in changeset and once with copies in filelogs (to remove any influence from different delta base selection in new versions of hg). Here's the result: > > > > > > size | in filelog | in changeset | increase | > > > .hg/store/00changelog.d | 127067298 | 128173208 | 0.87% | > > > .hg/ | 2866813806 | 2804688010 | -2.17% | > > > > > > > > > I am following this discussion closely because copytracing is very painful for us too. The above numbers looks nice. How can I try this myself on some internal repo? > > > You'll need to patch in this series and also https://phab.mercurial-scm.org/D6219. Then run `hg convert --config experimental.copies.write-to=changeset-only`. However, note that the performance is most likely going to be a lot *worse* for now, so there's not much reason to try it IMO (except to verify that my numbers above are valid, or to see that it won't be much worse in your repo). It took over 30 hours to convert the mozilla-unified repo for me. Thanks, I will check the change in size of .hg/ and changelog size. REPOSITORY rHG Mercurial REVISION DETAIL https://phab.mercurial-scm.org/D6183 To: martinvonz, #hg-reviewers Cc: pulkit, gracinet, marmoute, mercurial-devel
gracinet added a comment. We had half an hour of direct chat about this yesterday, with @martinvonz and @marmoute. Here's a summary - The architecture of the proposed change makes the storage and conveying of the information orthogonal to its usage. Therefore, it can be built upon to introduce other storage strategies, such as the caching system suggested by @marmoute. - The documentation for the new config option should clearly state that `changeset-only` is applicable only in cases where all the peers enable it and prior changesets hashes don't matter. I'm not 100% sure of the exact technical condition at this point, but I suppose someone can find a simple sentence that would be correct enough. - Octobus hopes that the caching strategy could eventually become useful for the general public. - @martinvonz helped us understand his implementation by answering further technical questions by @marmoute. Overall, my feeling is that it's been a very productive talk and that we have a clear path forward. REPOSITORY rHG Mercurial REVISION DETAIL https://phab.mercurial-scm.org/D6183 To: martinvonz, #hg-reviewers Cc: pulkit, gracinet, marmoute, mercurial-devel
martinvonz added a comment. In https://phab.mercurial-scm.org/D6183#90573, @gracinet wrote: > We had half an hour of direct chat about this yesterday, with @martinvonz and @marmoute. Here's a summary > > - The architecture of the proposed change makes the storage and conveying of the information orthogonal to its usage. Therefore, it can be built upon to introduce other storage strategies, such as the caching system suggested by @marmoute. > - The documentation for the new config option should clearly state that `changeset-only` is applicable only in cases where all the peers enable it and prior changesets hashes don't matter. I'm not 100% sure of the exact technical condition at this point, but I suppose someone can find a simple sentence that would be correct enough. > - Octobus hopes that the caching strategy could eventually become useful for the general public. > - @martinvonz helped us understand his implementation by answering further technical questions by @marmoute. > > Overall, my feeling is that it's been a very productive talk and that we have a clear path forward. Thanks for the summary, Georges! Reviewers, feel free to queue this if you think it looks good. REPOSITORY rHG Mercurial REVISION DETAIL https://phab.mercurial-scm.org/D6183 To: martinvonz, #hg-reviewers Cc: pulkit, gracinet, marmoute, mercurial-devel
> +def encodecopies(copies): > + items = [ > + '%s\0%s' % (_string_escape(k), _string_escape(copies[k])) > + for k in sorted(copies) > + ] > + return "\n".join(items) It might be nitpicky, but I think it's better to not embed `\0` into the extras field. Almost all extras data are texts, and IIRC we regret that transplant sources are stored in binary form.
yuja added a comment. > +def encodecopies(copies): > + items = [ > + '%s\0%s' % (_string_escape(k), _string_escape(copies[k])) > + for k in sorted(copies) > + ] > + return "\n".join(items) It might be nitpicky, but I think it's better to not embed `\0` into the extras field. Almost all extras data are texts, and IIRC we regret that transplant sources are stored in binary form. REPOSITORY rHG Mercurial REVISION DETAIL https://phab.mercurial-scm.org/D6183 To: martinvonz, #hg-reviewers Cc: yuja, pulkit, gracinet, marmoute, mercurial-devel
martinvonz added a comment. In https://phab.mercurial-scm.org/D6183#90698, @yuja wrote: > > +def encodecopies(copies): > > + items = [ > > + '%s\0%s' % (_string_escape(k), _string_escape(copies[k])) > > + for k in sorted(copies) > > + ] > > + return "\n".join(items) > > It might be nitpicky, but I think it's better to not embed `\0` into the > extras field. Almost all extras data are texts, and IIRC we regret that > transplant sources are stored in binary form. Why not? I picked \0 and \n because they won't appear in filenames, so it's convenient in that way. REPOSITORY rHG Mercurial REVISION DETAIL https://phab.mercurial-scm.org/D6183 To: martinvonz, #hg-reviewers Cc: yuja, pulkit, gracinet, marmoute, mercurial-devel
> > It might be nitpicky, but I think it's better to not embed `\0` into the > > extras field. Almost all extras data are texts, and IIRC we regret that > > transplant sources are stored in binary form. > > Why not? I picked \0 and \n because they won't appear in filenames, so it's convenient in that way. I don't remember, but we do store even boolean value as text, not in binary `\0`/`\1` form. `transplant_source` is the solo exception. https://www.mercurial-scm.org/wiki/ChangesetExtra And if we pick \0/\n separators, _string_escape() wouldn't be needed at the encodecopies() layer.
yuja added a comment. > > It might be nitpicky, but I think it's better to not embed `\0` into the > > extras field. Almost all extras data are texts, and IIRC we regret that > > transplant sources are stored in binary form. > > Why not? I picked \0 and \n because they won't appear in filenames, so it's convenient in that way. I don't remember, but we do store even boolean value as text, not in binary `\0`/`\1` form. `transplant_source` is the solo exception. https://www.mercurial-scm.org/wiki/ChangesetExtra And if we pick \0/\n separators, _string_escape() wouldn't be needed at the encodecopies() layer. REPOSITORY rHG Mercurial REVISION DETAIL https://phab.mercurial-scm.org/D6183 To: martinvonz, #hg-reviewers Cc: yuja, pulkit, gracinet, marmoute, mercurial-devel
martinvonz added a comment. In https://phab.mercurial-scm.org/D6183#90722, @yuja wrote: > > > It might be nitpicky, but I think it's better to not embed `\0` into the > > > extras field. Almost all extras data are texts, and IIRC we regret that > > > transplant sources are stored in binary form. > > > > Why not? I picked \0 and \n because they won't appear in filenames, so it's convenient in that way. > > I don't remember, but we do store even boolean value as text, not in binary > `\0`/`\1` form. `transplant_source` is the solo exception. Perhaps it's just so `{extras}` doesn't print ANSI escape codes and such? (I assume that can still happen if put escape characters in your filenames, for example.) > https://www.mercurial-scm.org/wiki/ChangesetExtra > > And if we pick \0/\n separators, _string_escape() wouldn't be needed > at the encodecopies() layer. Oh, now I see what you're saying! That's embarrassing. So maybe we should `_string_escape()` the whole thing? I'll do that. REPOSITORY rHG Mercurial REVISION DETAIL https://phab.mercurial-scm.org/D6183 To: martinvonz, #hg-reviewers Cc: yuja, pulkit, gracinet, marmoute, mercurial-devel
> > > > extras field. Almost all extras data are texts, and IIRC we regret that > > > > transplant sources are stored in binary form. > > > > > > Why not? I picked \0 and \n because they won't appear in filenames, so it's convenient in that way. > > > > I don't remember, but we do store even boolean value as text, not in binary > > `\0`/`\1` form. `transplant_source` is the solo exception. > > > Perhaps it's just so `{extras}` doesn't print ANSI escape codes and such? (I assume that can still happen if put escape characters in your filenames, for example.) Ok. That might be the reason, and I'm fine with the `\0` separator. > > https://www.mercurial-scm.org/wiki/ChangesetExtra > > > > And if we pick \0/\n separators, _string_escape() wouldn't be needed > > at the encodecopies() layer. > > Oh, now I see what you're saying! That's embarrassing. So maybe we should `_string_escape()` the whole thing? I'll do that. Not really. I meant `_string_escape()` could be removed entirely if we store copies in binary (valid_filename + invalid_filename_separator) form. The extra dict will be encoded later.
yuja added a comment. > > > > extras field. Almost all extras data are texts, and IIRC we regret that > > > > transplant sources are stored in binary form. > > > > > > Why not? I picked \0 and \n because they won't appear in filenames, so it's convenient in that way. > > > > I don't remember, but we do store even boolean value as text, not in binary > > `\0`/`\1` form. `transplant_source` is the solo exception. > > > Perhaps it's just so `{extras}` doesn't print ANSI escape codes and such? (I assume that can still happen if put escape characters in your filenames, for example.) Ok. That might be the reason, and I'm fine with the `\0` separator. > > https://www.mercurial-scm.org/wiki/ChangesetExtra > > > > And if we pick \0/\n separators, _string_escape() wouldn't be needed > > at the encodecopies() layer. > > Oh, now I see what you're saying! That's embarrassing. So maybe we should `_string_escape()` the whole thing? I'll do that. Not really. I meant `_string_escape()` could be removed entirely if we store copies in binary (valid_filename + invalid_filename_separator) form. The extra dict will be encoded later. REPOSITORY rHG Mercurial REVISION DETAIL https://phab.mercurial-scm.org/D6183 To: martinvonz, #hg-reviewers Cc: yuja, pulkit, gracinet, marmoute, mercurial-devel
martinvonz added a comment. In https://phab.mercurial-scm.org/D6183#90738, @yuja wrote: > > > > > extras field. Almost all extras data are texts, and IIRC we regret that > > > > > transplant sources are stored in binary form. > > > > > > > > Why not? I picked \0 and \n because they won't appear in filenames, so it's convenient in that way. > > > > > > I don't remember, but we do store even boolean value as text, not in binary > > > `\0`/`\1` form. `transplant_source` is the solo exception. > > > > > > Perhaps it's just so `{extras}` doesn't print ANSI escape codes and such? (I assume that can still happen if put escape characters in your filenames, for example.) > > Ok. That might be the reason, and I'm fine with the `\0` separator. > > > > https://www.mercurial-scm.org/wiki/ChangesetExtra > > > > > > And if we pick \0/\n separators, _string_escape() wouldn't be needed > > > at the encodecopies() layer. > > > > Oh, now I see what you're saying! That's embarrassing. So maybe we should `_string_escape()` the whole thing? I'll do that. > > Not really. I meant `_string_escape()` could be removed entirely if we store > copies in binary (valid_filename + invalid_filename_separator) form. The extra > dict will be encoded later. Sure, if we're okay with the `\0` and `\n` separators being printed to the terminal when the user uses the `{extras}` template, then we can just drop the encoding. Sounds like you're okay with that, and I also don't care too much, so I'll drop the encoding. REPOSITORY rHG Mercurial REVISION DETAIL https://phab.mercurial-scm.org/D6183 To: martinvonz, #hg-reviewers Cc: yuja, pulkit, gracinet, marmoute, mercurial-devel
indygreg added a comment. I support experimenting with putting copy metadata in the changelog. And the patches before this one did a lot of work to allow copy metadata to be read from alternate sources, which is great, since it can allow flexibility in the future (think copy caches, copy modifications outside of a commit, etc). I haven't looked at all these patches in detail, but it seems to me there should be a repo requirement in the case(s) where (all) copy metadata is not in the filelogs. Without a repo requirement, an old client may attempt to open a repo and not be able to find the copy metadata. It is OK to duplicate copy metadata in the changelog and have newer clients use copy metadata from the changelog if it is available. But if all copy metadata isn't available in the filelogs, there needs to be a requirement to lock out old clients. That being said, we may want to be aggressive than this! If a new client is writing copy metadata to filelogs and the changelog, an old client may commit to the repo with the copy metadata just in the filelogs. I'm not sure about the code behavior, but presumably a new client configured to use changelog copy metadata would forego reading the filelog metadata since it is expecting to read it from the changelog. This could result in a new client missing copy metadata written by an old client. So we would need a repo requirement to lock out old clients from writing to the repo. Then there's the wire protocol aspect. How does the copy metadata writing setting get propagated to the client? If it fails to get propagated, it is a similar situation to the local repo situation. Again, there needs to be some kind of requirement/capability detection here and the server setting needs to find its way to the client or else bad things can happen. Anyway, this is exciting work! It is still an experimental feature, so the implementation doesn't have to be perfect. But we will need to cross the repo requirements/capabilities bridge at some point. Can't wait to see the benefits of this work! REPOSITORY rHG Mercurial REVISION DETAIL https://phab.mercurial-scm.org/D6183 To: martinvonz, #hg-reviewers Cc: indygreg, yuja, pulkit, gracinet, marmoute, mercurial-devel
indygreg added a comment. An idea to consider (which may have been proposed already) is to write a *no copy metadata* entry into extras when writing copy metadata to the changelog. If we did things this way, a new client could know definitively that no copy metadata is available and to not fall back to reading from the filelogs. I haven't fully thought this through, but that should provide better compatibility between older and newer clients. Obviously the tradeoff is you could have a mixed repo (some changesets wouldn't have copy metadata in changelog) and you would need to duplicate copy metadata across changelog and filelogs to maintain compatibility. Something to contemplate... REPOSITORY rHG Mercurial REVISION DETAIL https://phab.mercurial-scm.org/D6183 To: martinvonz, #hg-reviewers Cc: indygreg, yuja, pulkit, gracinet, marmoute, mercurial-devel
martinvonz added a comment. In https://phab.mercurial-scm.org/D6183#91060, @indygreg wrote: > I support experimenting with putting copy metadata in the changelog. And the patches before this one did a lot of work to allow copy metadata to be read from alternate sources, which is great, since it can allow flexibility in the future (think copy caches, copy modifications outside of a commit, etc). Yes, Pierre-Yves and Georges wanted to work on adding them to a cache. As you hint at, that would also allow copy detection to be run at a later point to update the cache. And yes, the previous patches should have decoupled the algorithms from the storage well (the new assumption is that we can cheaply get copy metadata for a whole changeset). > I haven't looked at all these patches in detail, but it seems to me there should be a repo requirement in the case(s) where (all) copy metadata is not in the filelogs. Without a repo requirement, an old client may attempt to open a repo and not be able to find the copy metadata. It is OK to duplicate copy metadata in the changelog and have newer clients use copy metadata from the changelog if it is available. But if all copy metadata isn't available in the filelogs, there needs to be a requirement to lock out old clients. I had considered that, but figured that copy information isn't essential enough to warrant a repo requirement. But I agree with your next paragraph. > That being said, we may want to be aggressive than this! If a new client is writing copy metadata to filelogs and the changelog, an old client may commit to the repo with the copy metadata just in the filelogs. I'm not sure about the code behavior, but presumably a new client configured to use changelog copy metadata would forego reading the filelog metadata since it is expecting to read it from the changelog. This could result in a new client missing copy metadata written by an old client. So we would need a repo requirement to lock out old clients from writing to the repo. That's still not a disaster, but I agree that it still seems better to lock them out. > Then there's the wire protocol aspect. How does the copy metadata writing setting get propagated to the client? If it fails to get propagated, it is a similar situation to the local repo situation. Again, there needs to be some kind of requirement/capability detection here and the server setting needs to find its way to the client or else bad things can happen. Good point! > Anyway, this is exciting work! It is still an experimental feature, so the implementation doesn't have to be perfect. But we will need to cross the repo requirements/capabilities bridge at some point. Can't wait to see the benefits of this work! REPOSITORY rHG Mercurial REVISION DETAIL https://phab.mercurial-scm.org/D6183 To: martinvonz, #hg-reviewers Cc: indygreg, yuja, pulkit, gracinet, marmoute, mercurial-devel
martinvonz added a comment.
In https://phab.mercurial-scm.org/D6183#91062, @indygreg wrote:
> An idea to consider (which may have been proposed already) is to write a *no copy metadata* entry into extras when writing copy metadata to the changelog. If we did things this way, a new client could know definitively that no copy metadata is available and to not fall back to reading from the filelogs. I haven't fully thought this through, but that should provide better compatibility between older and newer clients. Obviously the tradeoff is you could have a mixed repo (some changesets wouldn't have copy metadata in changelog) and you would need to duplicate copy metadata across changelog and filelogs to maintain compatibility. Something to contemplate...
That's actually what I did initially, and `context._copies()` is still written to work that way (not consult filelogs if an empty p1copies entry was recorded in the changeset), but I haven't added a mode where we write empty entries. I should do that. Thanks for pointing that out.
REPOSITORY
rHG Mercurial
REVISION DETAIL
https://phab.mercurial-scm.org/D6183
To: martinvonz, #hg-reviewers
Cc: indygreg, yuja, pulkit, gracinet, marmoute, mercurial-devel
Patch
diff --git a/tests/test-fastannotate-hg.t b/tests/test-fastannotate-hg.t --- a/tests/test-fastannotate-hg.t +++ b/tests/test-fastannotate-hg.t @@ -443,7 +443,7 @@ > def reposetup(ui, repo): > class legacyrepo(repo.__class__): > def _filecommit(self, fctx, manifest1, manifest2, - > linkrev, tr, changelist): + > linkrev, tr, changelist, includecopymeta): > fname = fctx.path() > text = fctx.data() > flog = self.file(fname) diff --git a/tests/test-copies-in-changeset.t b/tests/test-copies-in-changeset.t new file mode 100644 --- /dev/null +++ b/tests/test-copies-in-changeset.t @@ -0,0 +1,113 @@ + + $ cat >> $HGRCPATH << EOF + > [experimental] + > copies.write-to=changeset-only + > [alias] + > changesetcopies = log -r . -T "files: {files} + > p1copies: {get(extras,'p1copies')} + > p2copies: {get(extras,'p2copies')} + > " + > EOF + +Check that copies are recorded correctly + + $ hg init repo + $ cd repo + $ echo a > a + $ hg add a + $ hg ci -m initial + $ hg cp a b + $ hg cp a c + $ hg cp a d + $ hg ci -m 'copy a to b, c, and d' + $ hg changesetcopies + files: b c d + p1copies: b\x00a (esc) + c\x00a (esc) + d\x00a (esc) + p2copies: + +Check that renames are recorded correctly + + $ hg mv b b2 + $ hg ci -m 'rename b to b2' + $ hg changesetcopies + files: b b2 + p1copies: b2\x00b (esc) + p2copies: + +Rename onto existing file. This should get recorded in the changeset files list and in the extras, +even though there is no filelog entry. + + $ hg cp b2 c --force + $ hg st --copies + M c + b2 + $ hg debugindex c + rev linkrev nodeid p1 p2 + 0 1 b789fdd96dc2 000000000000 000000000000 + $ hg ci -m 'move b onto d' + $ hg changesetcopies + files: c + p1copies: c\x00b2 (esc) + p2copies: + $ hg debugindex c + rev linkrev nodeid p1 p2 + 0 1 b789fdd96dc2 000000000000 000000000000 + +Create a merge commit with copying done during merge. + + $ hg co 0 + 0 files updated, 0 files merged, 3 files removed, 0 files unresolved + $ hg cp a e + $ hg cp a f + $ hg ci -m 'copy a to e and f' + created new head + $ hg merge 3 + 3 files updated, 0 files merged, 0 files removed, 0 files unresolved + (branch merge, don't forget to commit) +File 'a' exists on both sides, so 'g' could be recorded as being from p1 or p2, but we currently +always record it as being from p1 + $ hg cp a g +File 'd' exists only in p2, so 'h' should be from p2 + $ hg cp d h +File 'f' exists only in p1, so 'i' should be from p1 + $ hg cp f i + $ hg ci -m 'merge' + $ hg changesetcopies + files: g h i + p1copies: g\x00a (esc) + i\x00f (esc) + p2copies: h\x00d (esc) + +Test writing to both changeset and filelog + + $ hg cp a j + $ hg ci -m 'copy a to j' --config experimental.copies.write-to=compatibility + $ hg changesetcopies + files: j + p1copies: j\x00a (esc) + p2copies: + $ hg debugdata j 0 + \x01 (esc) + copy: a + copyrev: b789fdd96dc2f3bd229c1dd8eedf0fc60e2b68e3 + \x01 (esc) + a + +Test writing only to filelog + + $ hg cp a k + $ hg ci -m 'copy a to k' --config experimental.copies.write-to=filelog-only + $ hg changesetcopies + files: k + p1copies: + p2copies: + $ hg debugdata k 0 + \x01 (esc) + copy: a + copyrev: b789fdd96dc2f3bd229c1dd8eedf0fc60e2b68e3 + \x01 (esc) + a + + $ cd .. diff --git a/tests/test-annotate.t b/tests/test-annotate.t --- a/tests/test-annotate.t +++ b/tests/test-annotate.t @@ -438,7 +438,7 @@ > def reposetup(ui, repo): > class legacyrepo(repo.__class__): > def _filecommit(self, fctx, manifest1, manifest2, - > linkrev, tr, changelist): + > linkrev, tr, changelist, includecopymeta): > fname = fctx.path() > text = fctx.data() > flog = self.file(fname) diff --git a/mercurial/localrepo.py b/mercurial/localrepo.py --- a/mercurial/localrepo.py +++ b/mercurial/localrepo.py @@ -2292,7 +2292,8 @@ """Returns the wlock if it's held, or None if it's not.""" return self._currentlock(self._wlockref) - def _filecommit(self, fctx, manifest1, manifest2, linkrev, tr, changelist): + def _filecommit(self, fctx, manifest1, manifest2, linkrev, tr, changelist, + includecopymeta): """ commit an individual file as part of a larger transaction """ @@ -2337,7 +2338,7 @@ if manifest2: # branch merge if fparent2 == nullid or cnode is None: # copied on remote side if cfname in manifest2: - cnode= manifest2[cfname] + cnode = manifest2[cfname] newfparent = fparent1 # Here, we used to search backwards through history to try to find @@ -2351,8 +2352,9 @@ if cnode: self.ui.debug(" %s: copy %s:%s\n" % (fname, cfname, hex(cnode))) - meta["copy"] = cfname - meta["copyrev"] = hex(cnode) + if includecopymeta: + meta["copy"] = cfname + meta["copyrev"] = hex(cnode) fparent1, fparent2 = nullid, newfparent else: self.ui.warn(_("warning: can't find ancestor for '%s' " @@ -2520,6 +2522,12 @@ p1, p2 = ctx.p1(), ctx.p2() user = ctx.user() + writecopiesto = self.ui.config('experimental', 'copies.write-to') + writefilecopymeta = writecopiesto != 'changeset-only' + p1copies, p2copies = None, None + if writecopiesto in ('changeset-only', 'compatibility'): + p1copies = ctx.p1copies() + p2copies = ctx.p2copies() with self.lock(), self.transaction("commit") as tr: trp = weakref.proxy(tr) @@ -2553,7 +2561,8 @@ else: added.append(f) m[f] = self._filecommit(fctx, m1, m2, linkrev, - trp, changed) + trp, changed, + writefilecopymeta) m.setflag(f, fctx.flags()) except OSError: self.ui.warn(_("trouble committing %s!\n") % @@ -2607,7 +2616,8 @@ self.changelog.delayupdate(tr) n = self.changelog.add(mn, files, ctx.description(), trp, p1.node(), p2.node(), - user, ctx.date(), ctx.extra().copy()) + user, ctx.date(), ctx.extra().copy(), + p1copies, p2copies) xp1, xp2 = p1.hex(), p2 and p2.hex() or '' self.hook('pretxncommit', throw=True, node=hex(n), parent1=xp1, parent2=xp2) diff --git a/mercurial/configitems.py b/mercurial/configitems.py --- a/mercurial/configitems.py +++ b/mercurial/configitems.py @@ -488,6 +488,9 @@ coreconfigitem('experimental', 'copies.read-from', default="filelog-only", ) +coreconfigitem('experimental', 'copies.write-to', + default='file', +) coreconfigitem('experimental', 'crecordtest', default=None, ) diff --git a/mercurial/changelog.py b/mercurial/changelog.py --- a/mercurial/changelog.py +++ b/mercurial/changelog.py @@ -77,6 +77,13 @@ ] return "\0".join(items) +def encodecopies(copies): + items = [ + '%s\0%s' % (_string_escape(k), _string_escape(copies[k])) + for k in sorted(copies) + ] + return "\n".join(items) + def stripdesc(desc): """strip trailing whitespace and leading and trailing empty lines""" return '\n'.join([l.rstrip() for l in desc.splitlines()]).strip('\n') @@ -530,7 +537,7 @@ return l[3:] def add(self, manifest, files, desc, transaction, p1, p2, - user, date=None, extra=None): + user, date=None, extra=None, p1copies=None, p2copies=None): # Convert to UTF-8 encoded bytestrings as the very first # thing: calling any method on a localstr object will turn it # into a str object and the cached UTF-8 string is thus lost. @@ -559,6 +566,13 @@ elif branch in (".", "null", "tip"): raise error.StorageError(_('the name \'%s\' is reserved') % branch) + if (p1copies or p2copies) and extra is None: + extra = {} + if p1copies: + extra['p1copies'] = encodecopies(p1copies) + if p2copies: + extra['p2copies'] = encodecopies(p2copies) + if extra: extra = encodeextra(extra) parseddate = "%s %s" % (parseddate, extra)