Patchwork largefiles: RFC - customizable largefiles locations

Submitter Mads Kiilerich
Date March 19, 2016, 4:05 p.m.
Message ID <d0d9eab4d5fb107e00c6.1458403527@localhost.localdomain>
Permalink /patch/13954/
State RFC, archived

Comments

Mads Kiilerich - March 19, 2016, 4:05 p.m.
# HG changeset patch
# User Mads Kiilerich <madski@unity3d.com>
# Date 1458403499 25200
#      Sat Mar 19 09:04:59 2016 -0700
# Node ID d0d9eab4d5fb107e00c6b3beba53bb5817ab84d3
# Parent  b3d3dd7eeeb7077d756eeab794457458251929e1
largefiles: RFC - customizable largefiles locations

My users request more flexibility for largefile storage.

Some use cases:

They might want to use different disks for repos and cache, and hardlinking
will thus not work. They might thus not want the .hg/largefiles cache but only
store everything in the user cache.

They might have a store on another file system that they want to use - for
example when dualbooting and having different file systems but wanting to reuse
largefiles that already have been downloaded. It could also be a shared network
drive or something.

Will the functionality outlined in the documentation work for you? Would it be
a good idea?


Additional comments:

Observation: default-push - not relevant, only indirectly relevant when chosen
as push destination.

q: can we always assume all largefiles will be present in a "local" store? or
can it be streamed on demand when on fast reliable network? or how should
"fetch before merge" work? or should we just have better handling of missing
largefiles?

Observation: there are like 3 levels of largefile "presence":
* stat - to check if a remote server has the largefile ... but then why not just
try to fetch it and let it fail?
* pull - make sure everything is present locally so we don't find it missing
later at an awkward moment ... but why not just make it less awkward for it to
be missing?
* use - copy it to the working directory or serve it

Katsunori FUJIWARA - March 21, 2016, 1:03 a.m.
At Sat, 19 Mar 2016 09:05:27 -0700,
Mads Kiilerich wrote:
> 
> # HG changeset patch
> # User Mads Kiilerich <madski@unity3d.com>
> # Date 1458403499 25200
> #      Sat Mar 19 09:04:59 2016 -0700
> # Node ID d0d9eab4d5fb107e00c6b3beba53bb5817ab84d3
> # Parent  b3d3dd7eeeb7077d756eeab794457458251929e1
> largefiles: RFC - customizable largefiles locations
> 
> My users request more flexibility for largefile storage.
> 
> Some use cases:
> 
> They might want to use different disks for repos and cache, and hardlinking
> will thus not work. They might thus not want the .hg/largefiles cache but only
> store everything in the user cache.
> 
> They might have a store on another file system that they want to use - for
> example when dualbooting and having different file systems but wanting to reuse
> largefiles that already have been downloaded. It could also be a shared network
> drive or something.
> 
> Will the functionality outlined in the documentation work for you? Would it be
> a good idea?
> 
> 
> Additional comments:
> 
> Observation: default-push - not relevant, only indirectly relevant when chosen
> as push destination.
> 
> q: can we always assume all largefiles will be present in a "local" store? or
> can it be streamed on demand when on fast reliable network? or how should
> "fetch before merge" work? or should we just have better handling of missing
> largefiles?
> 
> Observation: there are like 3 levels of largefile "presence":
> * stat - to check if a remote server has the largefile ... but then why not just
> try to fetch it and let it fail?
> * pull - make sure everything is present locally so we don't find it missing
> later at an awkward moment ... but why not just make it less awkward for it to
> be missing?
> * use - copy it to the working directory or serve it
> 
> diff --git a/hgext/largefiles/__init__.py b/hgext/largefiles/__init__.py
> --- a/hgext/largefiles/__init__.py
> +++ b/hgext/largefiles/__init__.py
> @@ -103,6 +103,37 @@ will be ignored for any repositories not
>  largefile. To add the first largefile to a repository, you must
>  explicitly do so with the --large flag passed to the :hg:`add`
>  command.
> +
> +
> +PROPOSED FEATURE:
> +
> +When a new largefile is found or committed, ``largefiles.writepaths`` will be
> +used to get a list of locations where the largefile should be stored, indexed
> +by content. This serves to make sure largefiles are not stored and
> +transferred more times than necessary while also making the largefiles
> +available when pushing or when sharing just the repository at the file system
> +level.
> +
> +Default: ``.hg/largefiles`` ``usercache``
> +
> +``.hg/largefiles`` designates the repo-specific folder inside ``.hg``
> +``usercache`` designates the OS and user specific "global" cache
> +
> +When a largefile is needed, ``largefiles.readpaths`` will be used to get an
> +additional list of locations. The write locations and these read locations will
> +be searched in order when looking for a largefile.
> +
> +Default: ``lfpullsource`` ``default``
> +
> +``lfpullsource`` is only set and used when doing ``pull -u``
> +``default`` is a reference to the configured ``paths`` path.
> +
> +In both lists, it is also possible to specify other file system paths, path
> +aliases or remote URLs.
> +
> +
> +When pushing, all largefiles referenced in the pushed revisions will be pushed
> +from one of the readpaths to the push location (unless they already are there).
>  '''
>  
>  from mercurial import hg, localrepo

I asked Mads how each of the paths is used in each case, and
summarized his answer below.

         ========== ========= ======
         writepaths readpaths remote
         ========== ========= ======
CLIENT:
commit   W          -         -
pull     R#1(/W)    R#3       R#2
push     R#1        R#2       W

SERVER:
stat     R#1        R#2       -
get      R#1(/W)    R#2       -
put      W          -         -
         ========== ========= ======

  - The "N" in "R#N" is the order in which the stores are searched
    for a largefile.

  - "(/W)" means that if the largefile isn't found in writepaths, the
    copy found by a later lookup is also stored into writepaths.
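
The "R#N" order and the "(/W)" write-back can be sketched roughly as
below. This is only an illustration and not code from the patch: the
function name find_largefile and its arguments are hypothetical, every
store is modeled as a plain local directory holding files named by
their hash, and remote stores are not modeled at all.

```python
import os
import shutil


def find_largefile(lfhash, writepaths, readpaths, writeback=True):
    """Return a path to the largefile named lfhash, or None.

    Hypothetical helper: stores are plain directories in this sketch.
    """
    # R#1: the write locations are searched first, in order.
    for store in writepaths:
        candidate = os.path.join(store, lfhash)
        if os.path.exists(candidate):
            return candidate
    # R#2 (and later): the additional read locations, in order.
    for store in readpaths:
        candidate = os.path.join(store, lfhash)
        if os.path.exists(candidate):
            if writeback and writepaths:
                # "(/W)": keep a copy in every write location.
                for target_store in writepaths:
                    os.makedirs(target_store, exist_ok=True)
                    shutil.copyfile(candidate,
                                    os.path.join(target_store, lfhash))
                return os.path.join(writepaths[0], lfhash)
            return candidate
    return None
```

With writeback=False this corresponds to the rows without "(/W)",
such as "push" on the client or "stat" on the server.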


I thought of the corner cases below, but please disregard them if they
are outside the intended use cases or can be avoided easily :-)

- at "pull" on the client side, the "far" (= slow) remote store is
  consulted before the read-only "near" (= fast) stores listed in
  readpaths

  (R#2 and R#3 at "pull" on client side)

- on a "stat"/"get" request on the server side, "far" stores in
  readpaths are consulted unintentionally

  (R#2 at "stat"/"get" on server side)

  for example:

    - REPO-A and REPO-B on the same host are derived from the same
      repo, and

    - both have a "far" store in readpaths, but

    - neither of them lists a shared "near" store in writepaths. then,

    - "push REPO-B" on REPO-A causes a "stat" request to REPO-B acting
      as a server, and

    - REPO-B looks the largefiles up in the "far" stores, but

    - REPO-B should immediately reply "no, I don't have it", because
      putting the largefile from REPO-A to REPO-B is certainly faster
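
One possible way to avoid the second corner case would be to let the
server skip stores marked as "far" when it answers "stat". The sketch
below only illustrates that idea under assumptions of my own: the
function server_has_largefile and the far_stores argument are
hypothetical, they are not part of the patch, and stores are again
modeled as plain directories.

```python
import os


def server_has_largefile(lfhash, writepaths, readpaths, far_stores=()):
    """Answer a "stat" request using only stores that are cheap to check.

    Hypothetical helper: far_stores lists locations the server should
    not consult on behalf of a client.
    """
    far = set(far_stores)
    for store in list(writepaths) + list(readpaths):
        if store in far:
            # Skip slow stores; the client can push its copy instead.
            continue
        if os.path.exists(os.path.join(store, lfhash)):
            return True
    # Reply "no, I don't have it" without touching any "far" store.
    return False
```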

----------------------------------------------------------------------
[FUJIWARA Katsunori]                             foozy@lares.dti.ne.jp

Patch

diff --git a/hgext/largefiles/__init__.py b/hgext/largefiles/__init__.py
--- a/hgext/largefiles/__init__.py
+++ b/hgext/largefiles/__init__.py
@@ -103,6 +103,37 @@  will be ignored for any repositories not
 largefile. To add the first largefile to a repository, you must
 explicitly do so with the --large flag passed to the :hg:`add`
 command.
+
+
+PROPOSED FEATURE:
+
+When a new largefile is found or committed, ``largefiles.writepaths`` will be
+used to get a list of locations where the largefile should be stored, indexed
+by content. This serves to make sure largefiles are not stored and
+transferred more times than necessary while also making the largefiles
+available when pushing or when sharing just the repository at the file system
+level.
+
+Default: ``.hg/largefiles`` ``usercache``
+
+``.hg/largefiles`` designates the repo-specific folder inside ``.hg``
+``usercache`` designates the OS and user specific "global" cache
+
+When a largefile is needed, ``largefiles.readpaths`` will be used to get an
+additional list of locations. The write locations and these read locations will
+be searched in order when looking for a largefile.
+
+Default: ``lfpullsource`` ``default``
+
+``lfpullsource`` is only set and used when doing ``pull -u``
+``default`` is a reference to the configured ``paths`` path.
+
+In both lists, it is also possible to specify other file system paths, path
+aliases or remote URLs.
+
+
+When pushing, all largefiles referenced in the pushed revisions will be pushed
+from one of the readpaths to the push location (unless they already are there).
 '''
 
 from mercurial import hg, localrepo
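
The "stored, indexed by content" behaviour described in the proposed
documentation can be illustrated with the sketch below. It is not part
of the patch: store_largefile is a hypothetical name, the write
locations are modeled as plain directories, and the file name is
assumed to be the SHA-1 hex digest of the content, as in the existing
largefiles stores.

```python
import hashlib
import os


def store_largefile(data, writepaths):
    """Store data in every write location, named by its content hash.

    Hypothetical helper: returns the hash and the list of stored paths.
    """
    lfhash = hashlib.sha1(data).hexdigest()
    stored = []
    for store in writepaths:
        os.makedirs(store, exist_ok=True)
        target = os.path.join(store, lfhash)
        # Identical content maps to the same name, so an existing file
        # does not need to be written or transferred again.
        if not os.path.exists(target):
            with open(target, "wb") as f:
                f.write(data)
        stored.append(target)
    return lfhash, stored
```

Because the name depends only on the content, the same largefile
committed in two repositories sharing a write location is stored there
only once.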