Patchwork [4,of,6] verify: invoke the file prefetch hook

login
register
mail settings
Submitter Matt Harbison
Date April 15, 2018, 6:44 a.m.
Message ID <3b0c3d4939b56ca038db.1523774648@Envy>
Download mbox | patch
Permalink /patch/31076/
State Superseded
Headers show

Comments

Matt Harbison - April 15, 2018, 6:44 a.m.
# HG changeset patch
# User Matt Harbison <matt_harbison@yahoo.com>
# Date 1523752111 14400
#      Sat Apr 14 20:28:31 2018 -0400
# Node ID 3b0c3d4939b56ca038dbbba17da424699a6b339d
# Parent  691a7d3f2df80908e52a170897a4492a528c3533
verify: invoke the file prefetch hook

It's unfortunate that verify wants to download everything.  Maybe we can create
a custom transfer handler for LFS.  But it shouldn't be painful in the meantime,
and it's likely that blobs will be served up from more than just hgweb.
Yuya Nishihara - April 16, 2018, 11:35 a.m.
On Sun, 15 Apr 2018 02:44:08 -0400, Matt Harbison wrote:
> # HG changeset patch
> # User Matt Harbison <matt_harbison@yahoo.com>
> # Date 1523752111 14400
> #      Sat Apr 14 20:28:31 2018 -0400
> # Node ID 3b0c3d4939b56ca038dbbba17da424699a6b339d
> # Parent  691a7d3f2df80908e52a170897a4492a528c3533
> verify: invoke the file prefetch hook
> 
> It's unfortunate that verify wants to download everything.  Maybe we can create
> a custom transfer handler for LFS.  But it shouldn't be painful in the meantime,
> and it's likely that blobs will be served up from more than just hgweb.
> 
> diff --git a/mercurial/verify.py b/mercurial/verify.py
> --- a/mercurial/verify.py
> +++ b/mercurial/verify.py
> @@ -25,6 +25,7 @@ from . import (
>  
>  def verify(repo):
>      with repo.lock():
> +        scmutil.fileprefetchhooks(repo, repo.set('all()'))
>          return verifier(repo).verify()

I don't think "hg verify" should go that abstraction level because the
repository might be inconsistent state.
Matt Harbison - April 16, 2018, 12:25 p.m.
> On Apr 16, 2018, at 7:35 AM, Yuya Nishihara <yuya@tcha.org> wrote:
> 
>> On Sun, 15 Apr 2018 02:44:08 -0400, Matt Harbison wrote:
>> # HG changeset patch
>> # User Matt Harbison <matt_harbison@yahoo.com>
>> # Date 1523752111 14400
>> #      Sat Apr 14 20:28:31 2018 -0400
>> # Node ID 3b0c3d4939b56ca038dbbba17da424699a6b339d
>> # Parent  691a7d3f2df80908e52a170897a4492a528c3533
>> verify: invoke the file prefetch hook
>> 
>> It's unfortunate that verify wants to download everything.  Maybe we can create
>> a custom transfer handler for LFS.  But it shouldn't be painful in the meantime,
>> and it's likely that blobs will be served up from more than just hgweb.
>> 
>> diff --git a/mercurial/verify.py b/mercurial/verify.py
>> --- a/mercurial/verify.py
>> +++ b/mercurial/verify.py
>> @@ -25,6 +25,7 @@ from . import (
>> 
>> def verify(repo):
>>     with repo.lock():
>> +        scmutil.fileprefetchhooks(repo, repo.set('all()'))
>>         return verifier(repo).verify()
> 
> I don't think "hg verify" should go that abstraction level because the
> repository might be inconsistent state.

Do you mean using a revset, or prefetching at all?  Grabbing one file at a time will likely be painfully slow.

If the revset bit is the bad part, what is the alternative with a corrupt repo?
Yuya Nishihara - April 16, 2018, 12:34 p.m.
On Mon, 16 Apr 2018 08:25:07 -0400, Matt Harbison wrote:
> 
> > On Apr 16, 2018, at 7:35 AM, Yuya Nishihara <yuya@tcha.org> wrote:
> > 
> >> On Sun, 15 Apr 2018 02:44:08 -0400, Matt Harbison wrote:
> >> # HG changeset patch
> >> # User Matt Harbison <matt_harbison@yahoo.com>
> >> # Date 1523752111 14400
> >> #      Sat Apr 14 20:28:31 2018 -0400
> >> # Node ID 3b0c3d4939b56ca038dbbba17da424699a6b339d
> >> # Parent  691a7d3f2df80908e52a170897a4492a528c3533
> >> verify: invoke the file prefetch hook
> >> 
> >> It's unfortunate that verify wants to download everything.  Maybe we can create
> >> a custom transfer handler for LFS.  But it shouldn't be painful in the meantime,
> >> and it's likely that blobs will be served up from more than just hgweb.
> >> 
> >> diff --git a/mercurial/verify.py b/mercurial/verify.py
> >> --- a/mercurial/verify.py
> >> +++ b/mercurial/verify.py
> >> @@ -25,6 +25,7 @@ from . import (
> >> 
> >> def verify(repo):
> >>     with repo.lock():
> >> +        scmutil.fileprefetchhooks(repo, repo.set('all()'))
> >>         return verifier(repo).verify()
> > 
> > I don't think "hg verify" should go that abstraction level because the
> > repository might be inconsistent state.
> 
> Do you mean using a revset, or prefetching at all?  Grabbing one file at a time will likely be painfully slow.
> 
> If the revset bit is the bad part, what is the alternative with a corrupt repo?

I think we should avoid any write operation on damaged repo. Is there any way
to disable fetching at all?
Matt Harbison - April 16, 2018, 1:15 p.m.
> On Apr 16, 2018, at 8:34 AM, Yuya Nishihara <yuya@tcha.org> wrote:
> 
>> On Mon, 16 Apr 2018 08:25:07 -0400, Matt Harbison wrote:
>> 
>>>> On Apr 16, 2018, at 7:35 AM, Yuya Nishihara <yuya@tcha.org> wrote:
>>>> 
>>>> On Sun, 15 Apr 2018 02:44:08 -0400, Matt Harbison wrote:
>>>> # HG changeset patch
>>>> # User Matt Harbison <matt_harbison@yahoo.com>
>>>> # Date 1523752111 14400
>>>> #      Sat Apr 14 20:28:31 2018 -0400
>>>> # Node ID 3b0c3d4939b56ca038dbbba17da424699a6b339d
>>>> # Parent  691a7d3f2df80908e52a170897a4492a528c3533
>>>> verify: invoke the file prefetch hook
>>>> 
>>>> It's unfortunate that verify wants to download everything.  Maybe we can create
>>>> a custom transfer handler for LFS.  But it shouldn't be painful in the meantime,
>>>> and it's likely that blobs will be served up from more than just hgweb.
>>>> 
>>>> diff --git a/mercurial/verify.py b/mercurial/verify.py
>>>> --- a/mercurial/verify.py
>>>> +++ b/mercurial/verify.py
>>>> @@ -25,6 +25,7 @@ from . import (
>>>> 
>>>> def verify(repo):
>>>>    with repo.lock():
>>>> +        scmutil.fileprefetchhooks(repo, repo.set('all()'))
>>>>        return verifier(repo).verify()
>>> 
>>> I don't think "hg verify" should go that abstraction level because the
>>> repository might be inconsistent state.
>> 
>> Do you mean using a revset, or prefetching at all?  Grabbing one file at a time will likely be painfully slow.
>> 
>> If the revset bit is the bad part, what is the alternative with a corrupt repo?
> 
> I think we should avoid any write operation on damaged repo. Is there any way
> to disable fetching at all?

Maybe we could wrap hg.verify() to stuff a flag is svfs?  But I think then the revlog flag processor will flag each revision as corrupt anyway, hiding the real error.

Avoiding writes on a corrupt repo makes sense, but in this case I think it is safe.  The storage is external, so nothing gets more corrupt, and the blobs are checksummed on their own before adding them to the store.

As a point of reference, largefiles will verify the external files (by asking the server to do it IIRC).  I’d like to do something like that here, but obviously that extension won’t be supported anywhere else.  So from a UX point of view, IDK how we can sanely do that if it doesn’t fallback to fetching them.
Yuya Nishihara - April 16, 2018, 1:39 p.m.
On Mon, 16 Apr 2018 09:15:48 -0400, Matt Harbison wrote:
> 
> > On Apr 16, 2018, at 8:34 AM, Yuya Nishihara <yuya@tcha.org> wrote:
> > 
> >> On Mon, 16 Apr 2018 08:25:07 -0400, Matt Harbison wrote:
> >> 
> >>>> On Apr 16, 2018, at 7:35 AM, Yuya Nishihara <yuya@tcha.org> wrote:
> >>>> 
> >>>> On Sun, 15 Apr 2018 02:44:08 -0400, Matt Harbison wrote:
> >>>> # HG changeset patch
> >>>> # User Matt Harbison <matt_harbison@yahoo.com>
> >>>> # Date 1523752111 14400
> >>>> #      Sat Apr 14 20:28:31 2018 -0400
> >>>> # Node ID 3b0c3d4939b56ca038dbbba17da424699a6b339d
> >>>> # Parent  691a7d3f2df80908e52a170897a4492a528c3533
> >>>> verify: invoke the file prefetch hook
> >>>> 
> >>>> It's unfortunate that verify wants to download everything.  Maybe we can create
> >>>> a custom transfer handler for LFS.  But it shouldn't be painful in the meantime,
> >>>> and it's likely that blobs will be served up from more than just hgweb.
> >>>> 
> >>>> diff --git a/mercurial/verify.py b/mercurial/verify.py
> >>>> --- a/mercurial/verify.py
> >>>> +++ b/mercurial/verify.py
> >>>> @@ -25,6 +25,7 @@ from . import (
> >>>> 
> >>>> def verify(repo):
> >>>>    with repo.lock():
> >>>> +        scmutil.fileprefetchhooks(repo, repo.set('all()'))
> >>>>        return verifier(repo).verify()
> >>> 
> >>> I don't think "hg verify" should go that abstraction level because the
> >>> repository might be inconsistent state.
> >> 
> >> Do you mean using a revset, or prefetching at all?  Grabbing one file at a time will likely be painfully slow.
> >> 
> >> If the revset bit is the bad part, what is the alternative with a corrupt repo?
> > 
> > I think we should avoid any write operation on damaged repo. Is there any way
> > to disable fetching at all?
> 
> Maybe we could wrap hg.verify() to stuff a flag is svfs?  But I think then the revlog flag processor will flag each revision as corrupt anyway, hiding the real error.
> 
> Avoiding writes on a corrupt repo makes sense, but in this case I think it is safe.  The storage is external, so nothing gets more corrupt, and the blobs are checksummed on their own before adding them to the store.

That's true for lfs, but not all prefetch-able storage would behave as
such. Another concern is we're doing prefetch *before* verifying revlogs.
If a revlog is damaged, the prefetch would fail and no useful error would
be reported.

Perhaps we'll need "hg debugprefetch && hg verify" ?

> As a point of reference, largefiles will verify the external files (by asking the server to do it IIRC).  I’d like to do something like that here, but obviously that extension won’t be supported anywhere else.  So from a UX point of view, IDK how we can sanely do that if it doesn’t fallback to fetching them.

IMHO, that is one of the most confusing features the largefiles has. Largefiles
doesn't need to extend the verify command.
Matt Harbison - April 16, 2018, 1:59 p.m.
> On Apr 16, 2018, at 9:39 AM, Yuya Nishihara <yuya@tcha.org> wrote:
> 
>> On Mon, 16 Apr 2018 09:15:48 -0400, Matt Harbison wrote:
>> 
>>>> On Apr 16, 2018, at 8:34 AM, Yuya Nishihara <yuya@tcha.org> wrote:
>>>> 
>>>> On Mon, 16 Apr 2018 08:25:07 -0400, Matt Harbison wrote:
>>>> 
>>>>>> On Apr 16, 2018, at 7:35 AM, Yuya Nishihara <yuya@tcha.org> wrote:
>>>>>> 
>>>>>> On Sun, 15 Apr 2018 02:44:08 -0400, Matt Harbison wrote:
>>>>>> # HG changeset patch
>>>>>> # User Matt Harbison <matt_harbison@yahoo.com>
>>>>>> # Date 1523752111 14400
>>>>>> #      Sat Apr 14 20:28:31 2018 -0400
>>>>>> # Node ID 3b0c3d4939b56ca038dbbba17da424699a6b339d
>>>>>> # Parent  691a7d3f2df80908e52a170897a4492a528c3533
>>>>>> verify: invoke the file prefetch hook
>>>>>> 
>>>>>> It's unfortunate that verify wants to download everything.  Maybe we can create
>>>>>> a custom transfer handler for LFS.  But it shouldn't be painful in the meantime,
>>>>>> and it's likely that blobs will be served up from more than just hgweb.
>>>>>> 
>>>>>> diff --git a/mercurial/verify.py b/mercurial/verify.py
>>>>>> --- a/mercurial/verify.py
>>>>>> +++ b/mercurial/verify.py
>>>>>> @@ -25,6 +25,7 @@ from . import (
>>>>>> 
>>>>>> def verify(repo):
>>>>>>   with repo.lock():
>>>>>> +        scmutil.fileprefetchhooks(repo, repo.set('all()'))
>>>>>>       return verifier(repo).verify()
>>>>> 
>>>>> I don't think "hg verify" should go that abstraction level because the
>>>>> repository might be inconsistent state.
>>>> 
>>>> Do you mean using a revset, or prefetching at all?  Grabbing one file at a time will likely be painfully slow.
>>>> 
>>>> If the revset bit is the bad part, what is the alternative with a corrupt repo?
>>> 
>>> I think we should avoid any write operation on damaged repo. Is there any way
>>> to disable fetching at all?
>> 
>> Maybe we could wrap hg.verify() to stuff a flag is svfs?  But I think then the revlog flag processor will flag each revision as corrupt anyway, hiding the real error.
>> 
>> Avoiding writes on a corrupt repo makes sense, but in this case I think it is safe.  The storage is external, so nothing gets more corrupt, and the blobs are checksummed on their own before adding them to the store.
> 
> That's true for lfs, but not all prefetch-able storage would behave as
> such. Another concern is we're doing prefetch *before* verifying revlogs.
> If a revlog is damaged, the prefetch would fail and no useful error would
> be reported.
> 
> Perhaps we'll need "hg debugprefetch && hg verify" ?

But that would also prefetch before verifying the revlog?

>> As a point of reference, largefiles will verify the external files (by asking the server to do it IIRC).  I’d like to do something like that here, but obviously that extension won’t be supported anywhere else.  So from a UX point of view, IDK how we can sanely do that if it doesn’t fallback to fetching them.
> 
> IMHO, that is one of the most confusing features the largefiles has. Largefiles
> doesn't need to extend the verify command.

I like that it’s easy to tell if a blob is missing from the server, and that you don’t need to script something with an external tool to verify the local blobs.  I do remember being confused by the latest round of work in that area, but I thought maybe it was awkward because of BC concerns.

I’ll drop this patch for now, because it won’t get resolved this cycle, even if there was a clear path forward.
Yuya Nishihara - April 16, 2018, 2:11 p.m.
On Mon, 16 Apr 2018 09:59:49 -0400, Matt Harbison wrote:
> 
> > On Apr 16, 2018, at 9:39 AM, Yuya Nishihara <yuya@tcha.org> wrote:
> > 
> >> On Mon, 16 Apr 2018 09:15:48 -0400, Matt Harbison wrote:
> >> 
> >>>> On Apr 16, 2018, at 8:34 AM, Yuya Nishihara <yuya@tcha.org> wrote:
> >>>> 
> >>>> On Mon, 16 Apr 2018 08:25:07 -0400, Matt Harbison wrote:
> >>>> 
> >>>>>> On Apr 16, 2018, at 7:35 AM, Yuya Nishihara <yuya@tcha.org> wrote:
> >>>>>> 
> >>>>>> On Sun, 15 Apr 2018 02:44:08 -0400, Matt Harbison wrote:
> >>>>>> # HG changeset patch
> >>>>>> # User Matt Harbison <matt_harbison@yahoo.com>
> >>>>>> # Date 1523752111 14400
> >>>>>> #      Sat Apr 14 20:28:31 2018 -0400
> >>>>>> # Node ID 3b0c3d4939b56ca038dbbba17da424699a6b339d
> >>>>>> # Parent  691a7d3f2df80908e52a170897a4492a528c3533
> >>>>>> verify: invoke the file prefetch hook
> >>>>>> 
> >>>>>> It's unfortunate that verify wants to download everything.  Maybe we can create
> >>>>>> a custom transfer handler for LFS.  But it shouldn't be painful in the meantime,
> >>>>>> and it's likely that blobs will be served up from more than just hgweb.
> >>>>>> 
> >>>>>> diff --git a/mercurial/verify.py b/mercurial/verify.py
> >>>>>> --- a/mercurial/verify.py
> >>>>>> +++ b/mercurial/verify.py
> >>>>>> @@ -25,6 +25,7 @@ from . import (
> >>>>>> 
> >>>>>> def verify(repo):
> >>>>>>   with repo.lock():
> >>>>>> +        scmutil.fileprefetchhooks(repo, repo.set('all()'))
> >>>>>>       return verifier(repo).verify()
> >>>>> 
> >>>>> I don't think "hg verify" should go that abstraction level because the
> >>>>> repository might be inconsistent state.
> >>>> 
> >>>> Do you mean using a revset, or prefetching at all?  Grabbing one file at a time will likely be painfully slow.
> >>>> 
> >>>> If the revset bit is the bad part, what is the alternative with a corrupt repo?
> >>> 
> >>> I think we should avoid any write operation on damaged repo. Is there any way
> >>> to disable fetching at all?
> >> 
> >> Maybe we could wrap hg.verify() to stuff a flag is svfs?  But I think then the revlog flag processor will flag each revision as corrupt anyway, hiding the real error.
> >> 
> >> Avoiding writes on a corrupt repo makes sense, but in this case I think it is safe.  The storage is external, so nothing gets more corrupt, and the blobs are checksummed on their own before adding them to the store.
> > 
> > That's true for lfs, but not all prefetch-able storage would behave as
> > such. Another concern is we're doing prefetch *before* verifying revlogs.
> > If a revlog is damaged, the prefetch would fail and no useful error would
> > be reported.
> > 
> > Perhaps we'll need "hg debugprefetch && hg verify" ?
> 
> But that would also prefetch before verifying the revlog?

Yes. The point is that "verify" never crashes because of prefetch. If we've
already downloaded all lfs data, revlog corruption can be verified by running
"hg verify."

"hg verify" has to be robust for data corruption.

Patch

diff --git a/mercurial/verify.py b/mercurial/verify.py
--- a/mercurial/verify.py
+++ b/mercurial/verify.py
@@ -25,6 +25,7 @@  from . import (
 
 def verify(repo):
     with repo.lock():
+        scmutil.fileprefetchhooks(repo, repo.set('all()'))
         return verifier(repo).verify()
 
 def _normpath(f):
diff --git a/tests/test-lfs.t b/tests/test-lfs.t
--- a/tests/test-lfs.t
+++ b/tests/test-lfs.t
@@ -728,21 +728,21 @@  because they aren't accessed.
 
 Verify will copy/link all lfs objects into the local store that aren't already
 present.  Bypass the corrupted usercache to show that verify works when fed by
-the (uncorrupted) remote store.
+the (uncorrupted) remote store.  Also, verify will prefetch all files needed.
 
   $ hg -R fromcorrupt --config lfs.usercache=emptycache verify -v
+  lfs: adding 66100b384bf761271b407d79fc30cdd0554f3b2c5d944836e936d584b88ce88e to the usercache
+  lfs: adding 89b6070915a3d573ff3599d1cda305bc5e38549b15c4847ab034169da66e1ca8 to the usercache
+  lfs: adding b1a6ea88da0017a0e77db139a54618986e9a2489bee24af9fe596de9daac498c to the usercache
   repository uses revlog format 1
   checking changesets
   checking manifests
   crosschecking files in changesets and manifests
   checking files
-  lfs: adding 66100b384bf761271b407d79fc30cdd0554f3b2c5d944836e936d584b88ce88e to the usercache
   lfs: found 66100b384bf761271b407d79fc30cdd0554f3b2c5d944836e936d584b88ce88e in the local lfs store
   lfs: found 22f66a3fc0b9bf3f012c814303995ec07099b3a9ce02a7af84b5970811074a3b in the local lfs store
   lfs: found 66100b384bf761271b407d79fc30cdd0554f3b2c5d944836e936d584b88ce88e in the local lfs store
-  lfs: adding 89b6070915a3d573ff3599d1cda305bc5e38549b15c4847ab034169da66e1ca8 to the usercache
   lfs: found 89b6070915a3d573ff3599d1cda305bc5e38549b15c4847ab034169da66e1ca8 in the local lfs store
-  lfs: adding b1a6ea88da0017a0e77db139a54618986e9a2489bee24af9fe596de9daac498c to the usercache
   lfs: found b1a6ea88da0017a0e77db139a54618986e9a2489bee24af9fe596de9daac498c in the local lfs store
   4 files, 5 changesets, 10 total revisions