Patchwork [2,of,2,V2] lfs: control tracked file selection via a tracked file

login
register
mail settings
Submitter Matt Harbison
Date Jan. 17, 2018, 6:02 a.m.
Message ID <074417bc25368171950f.1516168947@Envy>
Download mbox | patch
Permalink /patch/26805/
State Accepted
Headers show

Comments

Matt Harbison - Jan. 17, 2018, 6:02 a.m.
# HG changeset patch
# User Matt Harbison <matt_harbison@yahoo.com>
# Date 1515971571 18000
#      Sun Jan 14 18:12:51 2018 -0500
# Node ID 074417bc25368171950f0e8e292b151f9641b5a0
# Parent  03d737594660bb74823317e02a6ad2e91456c12d
lfs: control tracked file selection via a tracked file

Since the lfs tracking policy can dramatically affect the repository, it makes
more sense to have the policy file checked in, than to rely on all developers
configuring their .hgrc properly.  The inspiration for this is the .hgeol file.
The configuration lives under '[track]', so that other things can be added in
the future.  Eventually, the config option should be limited to `convert` only.

If the file can't be parsed for any reason (including unrecognized elements of
the minifileset language), the commit will abort until the problem is corrected.
This seems more useful than the warning that hgeol emits, and has no effect on
reading the data, so there's no compatibility concerns.

My initial thought was to read the file and change each "key = value" line into
"((key) & (value))", so that each line could be ORed together, and make a single
pass at compiling.  Unfortunately, that prevents exclusions if there's a
catchall rule.  Consider what happens to a large *.c file here:

  [track]
  **.c = none()
  ** = size('>1MB')
  # ((**.c) & (none())) | ((**) & (size('>1MB'))) => anything > 1MB

I also thought about having separate [include] and [exclude] sections.  But that
just seems to open things up to user mistakes.  Consider:

  [include]
  **.zip = all()
  **.php = size('>10MB')

  [exclude]
  **.zip = all()  # Who wins?
  **.php = none() # Effectively 'all()' (i.e. nothing excluded), or >10MB ?

Therefore, it just compiles each key and value separately, and walks until the
key matches something.  I'm not sure how to enforce just file patterns on LHS
without leaking knowledge about the minifileset here.  That means this will
allow odd looking lines like this:

  [track]
  **.c | **.txt = none()

But that's also fewer lines to compile, so slightly more efficient?  Some things
like 'none()' won't work as expected on LHS though, because that won't match, so
that line is skipped.  For now, these quirks are not mentioned in the
documentation.

Jun previously expressed concern about efficiency when scaling to large repos,
so I tried avoiding 'repo[None]'.  (localrepo.commit() gets repo[None] already,
but doesn't tie it to the workingcommitctx used here.)  Therefore, I looked at
the passed context for 'AMR' status.  But that doesn't help with the normal case
where the policy file is tracked, but clean.  That requires looking up p1() to
read the file.  I don't see any way to get the content of one file without first
creating the full parent context.
Yuya Nishihara - Jan. 17, 2018, 2:07 p.m.
On Wed, 17 Jan 2018 01:02:27 -0500, Matt Harbison wrote:
> # HG changeset patch
> # User Matt Harbison <matt_harbison@yahoo.com>
> # Date 1515971571 18000
> #      Sun Jan 14 18:12:51 2018 -0500
> # Node ID 074417bc25368171950f0e8e292b151f9641b5a0
> # Parent  03d737594660bb74823317e02a6ad2e91456c12d
> lfs: control tracked file selection via a tracked file

Looks generally good, and there was no comment for the previous version.
Queued, thanks.

>  Configs::
>  
>      [lfs]
> @@ -35,6 +59,9 @@
>      # - (**.php & size(">2MB")) | (**.js & size(">5MB")) | **.tar.gz
>      #     | ("path:bin" & !"path:/bin/README") | size(">1GB")
>      # (default: none())
> +    #
> +    # This is ignored if there is a tracked '.hglfs' file, and this setting
> +    # will eventually be deprecated and removed.

[...]

> @@ -149,18 +185,58 @@
>          ui.setconfig('hooks', 'commit.lfs', checkrequireslfs, 'lfs')
>          ui.setconfig('hooks', 'pretxnchangegroup.lfs', checkrequireslfs, 'lfs')
>  
> -def _trackedmatcher(repo):
> +def _trackedmatcher(repo, ctx):
>      """Return a function (path, size) -> bool indicating whether or not to
>      track a given file with lfs."""
> -    trackspec = repo.ui.config('lfs', 'track')
> +    data = ''
> +
> +    if '.hglfs' in ctx.added() or '.hglfs' in ctx.modified():
> +        data = ctx['.hglfs'].data()
> +    elif '.hglfs' not in ctx.removed():
> +        p1 = repo['.']
> +
> +        if '.hglfs' not in p1:
> +            # No '.hglfs' in wdir or in parent.  Fallback to config
> +            # for now.
> +            trackspec = repo.ui.config('lfs', 'track')
> +
> +            # deprecated config: lfs.threshold
> +            threshold = repo.ui.configbytes('lfs', 'threshold')
> +            if threshold:
> +                fileset.parse(trackspec)  # make sure syntax errors are confined
> +                trackspec = "(%s) | size('>%d')" % (trackspec, threshold)

This seems okay if we'll drop the support for lfs.threshold. If we won't,
I think lfs.threshold should be OR-ed with the .hglfs rules.

Patch

diff --git a/hgext/lfs/__init__.py b/hgext/lfs/__init__.py
--- a/hgext/lfs/__init__.py
+++ b/hgext/lfs/__init__.py
@@ -7,6 +7,30 @@ 
 
 """lfs - large file support (EXPERIMENTAL)
 
+The extension reads its configuration from a versioned ``.hglfs``
+configuration file found in the root of the working directory. The
+``.hglfs`` file uses the same syntax as all other Mercurial
+configuration files. It uses a single section, ``[track]``.
+
+The ``[track]`` section specifies which files are stored as LFS (or
+not). Each line is keyed by a file pattern, with a predicate value.
+The first file pattern match is used, so put more specific patterns
+first.  The available predicates are ``all()``, ``none()``, and
+``size()``. See "hg help filesets.size" for the latter.
+
+Example versioned ``.hglfs`` file::
+
+  [track]
+  # No Makefile or python file, anywhere, will be LFS
+  **Makefile = none()
+  **.py = none()
+
+  **.zip = all()
+  **.exe = size(">1MB")
+
+  # Catchall for everything not matched above
+  ** = size(">10MB")
+
 Configs::
 
     [lfs]
@@ -35,6 +59,9 @@ 
     # - (**.php & size(">2MB")) | (**.js & size(">5MB")) | **.tar.gz
     #     | ("path:bin" & !"path:/bin/README") | size(">1GB")
     # (default: none())
+    #
+    # This is ignored if there is a tracked '.hglfs' file, and this setting
+    # will eventually be deprecated and removed.
     track = size(">10M")
 
     # how many times to retry before giving up on transferring an object
@@ -52,7 +79,9 @@ 
 from mercurial import (
     bundle2,
     changegroup,
+    config,
     context,
+    error,
     exchange,
     extensions,
     filelog,
@@ -123,13 +152,20 @@ 
     if not repo.local():
         return
 
-    repo.svfs.options['lfstrack'] = _trackedmatcher(repo)
     repo.svfs.lfslocalblobstore = blobstore.local(repo)
     repo.svfs.lfsremoteblobstore = blobstore.remote(repo)
 
     # Push hook
     repo.prepushoutgoinghooks.add('lfs', wrapper.prepush)
 
+    class lfsrepo(repo.__class__):
+        @localrepo.unfilteredmethod
+        def commitctx(self, ctx, error=False):
+            repo.svfs.options['lfstrack'] = _trackedmatcher(self, ctx)
+            return super(lfsrepo, self).commitctx(ctx, error)
+
+    repo.__class__ = lfsrepo
+
     if 'lfs' not in repo.requirements:
         def checkrequireslfs(ui, repo, **kwargs):
             if 'lfs' not in repo.requirements:
@@ -149,18 +185,58 @@ 
         ui.setconfig('hooks', 'commit.lfs', checkrequireslfs, 'lfs')
         ui.setconfig('hooks', 'pretxnchangegroup.lfs', checkrequireslfs, 'lfs')
 
-def _trackedmatcher(repo):
+def _trackedmatcher(repo, ctx):
     """Return a function (path, size) -> bool indicating whether or not to
     track a given file with lfs."""
-    trackspec = repo.ui.config('lfs', 'track')
+    data = ''
+
+    if '.hglfs' in ctx.added() or '.hglfs' in ctx.modified():
+        data = ctx['.hglfs'].data()
+    elif '.hglfs' not in ctx.removed():
+        p1 = repo['.']
+
+        if '.hglfs' not in p1:
+            # No '.hglfs' in wdir or in parent.  Fallback to config
+            # for now.
+            trackspec = repo.ui.config('lfs', 'track')
+
+            # deprecated config: lfs.threshold
+            threshold = repo.ui.configbytes('lfs', 'threshold')
+            if threshold:
+                fileset.parse(trackspec)  # make sure syntax errors are confined
+                trackspec = "(%s) | size('>%d')" % (trackspec, threshold)
+
+            return minifileset.compile(trackspec)
+
+        data = p1['.hglfs'].data()
 
-    # deprecated config: lfs.threshold
-    threshold = repo.ui.configbytes('lfs', 'threshold')
-    if threshold:
-        fileset.parse(trackspec)  # make sure syntax errors are confined
-        trackspec = "(%s) | size('>%d')" % (trackspec, threshold)
+    # In removed, or not in parent
+    if not data:
+        return lambda p, s: False
+
+    # Parse errors here will abort with a message that points to the .hglfs file
+    # and line number.
+    cfg = config.config()
+    cfg.parse('.hglfs', data)
 
-    return minifileset.compile(trackspec)
+    try:
+        rules = [(minifileset.compile(pattern), minifileset.compile(rule))
+                 for pattern, rule in cfg.items('track')]
+    except error.ParseError as e:
+        # The original exception gives no indicator that the error is in the
+        # .hglfs file, so add that.
+
+        # TODO: See if the line number of the file can be made available.
+        raise error.Abort(_('parse error in .hglfs: %s') % e)
+
+    def _match(path, size):
+        for pat, rule in rules:
+            if pat(path, size):
+                return rule(path, size)
+
+        return False
+
+    return _match
 
 def wrapfilelog(filelog):
     wrapfunction = extensions.wrapfunction
diff --git a/tests/test-lfs.t b/tests/test-lfs.t
--- a/tests/test-lfs.t
+++ b/tests/test-lfs.t
@@ -931,6 +931,79 @@ 
   $ hg commit -m 'add A' -A A
   $ hg rm A
   $ hg commit -m 'rm A'
+
+Bad .hglfs files will block the commit with a useful message
+
+  $ cat > .hglfs << EOF
+  > [track]
+  > **.test = size(">5B")
+  > bad file ... no commit
+  > EOF
+
+  $ echo x > file.txt
+  $ hg ci -Aqm 'should fail'
+  hg: parse error at .hglfs:3: bad file ... no commit
+  [255]
+
+  $ cat > .hglfs << EOF
+  > [track]
+  > **.test = size(">5B")
+  > ** = nonexistent()
+  > EOF
+
+  $ hg ci -Aqm 'should fail'
+  abort: parse error in .hglfs: unknown identifier: nonexistent
+  [255]
+
+'**' works out to mean all files.
+
+  $ cat > .hglfs << EOF
+  > [track]
+  > **.test = size(">5B")
+  > **.exclude = none()
+  > ** = size(">10B")
+  > EOF
+
+The LFS policy takes effect as the .hglfs file is committed
+
+  $ echo 'largefile' > lfs.test
+  $ echo '012345678901234567890' > nolfs.exclude
+  $ echo '01234567890123456' > lfs.catchall
+  $ hg ci -Aqm 'added .hglfs'
+  $ hg log -r . -T '{rev}: {lfs_files % "{file}: {oid}\n"}\n'
+  2: lfs.catchall: d4ec46c2869ba22eceb42a729377432052d9dd75d82fc40390ebaadecee87ee9
+  lfs.test: 5489e6ced8c36a7b267292bde9fd5242a5f80a7482e8f23fa0477393dfaa4d6c
+  
+The existing .hglfs file is used even when it is not in the 'A' or 'M' states
+
+  $ echo 'largefile2' > lfs.test
+  $ echo '012345678901234567890a' > nolfs.exclude
+  $ echo '01234567890123456a' > lfs.catchall
+  $ hg ci -qm 'unmodified .hglfs'
+  $ hg log -r . -T '{rev}: {lfs_files % "{file}: {oid}\n"}\n'
+  3: lfs.catchall: 31f43b9c62b540126b0ad5884dc013d21a61c9329b77de1fceeae2fc58511573
+  lfs.test: 8acd23467967bc7b8cc5a280056589b0ba0b17ff21dbd88a7b6474d6290378a6
+  
+Excluding the .hglfs file from the commit postpones the policy change
+
+  $ hg rm .hglfs
+  $ echo 'largefile3' > lfs.test
+  $ echo '012345678901234567890abc' > nolfs.exclude
+  $ echo '01234567890123456abc' > lfs.catchall
+  $ hg ci -qm 'file test' -X .hglfs
+  $ hg log -r . -T '{rev}: {lfs_files % "{file}: {oid}\n"}\n'
+  4: lfs.catchall: 6747cfb1b83965b4a884e7a6061813ae31e4122028bc6a88d2ac5e5f9e05c5af
+  lfs.test: 3f40b70c2294e91e0fa789ebcf73c5a1d1c7aef270f83e477e40cb0513237e8c
+  
+The policy change takes effect when the .hglfs is committed
+
+  $ echo 'largefile4' > lfs.test
+  $ echo '012345678901234567890abcdef' > nolfs.exclude
+  $ echo '01234567890123456abcdef' > lfs.catchall
+  $ hg ci -qm 'file test'
+  $ hg log -r . -T '{rev}: {lfs_files % "{file}: {oid}\n"}\n'
+  5: 
+
   $ cd ..
 
 Unbundling adds a requirement to a non-lfs repo, if necessary.