Patchwork D5296: store: don't read the whole fncache in memory

Submitter phabricator
Date Nov. 22, 2018, 12:27 p.m.
Message ID <differential-rev-PHID-DREV-mah4iwncttgwdncaynd2-req@phab.mercurial-scm.org>
Permalink /patch/36708/
State Superseded

Comments

phabricator - Nov. 22, 2018, 12:27 p.m.
pulkit created this revision.
Herald added subscribers: mercurial-devel, mjpieters.
Herald added a reviewer: hg-reviewers.

REVISION SUMMARY
  In large repositories with a lot of files, the fncache grows beyond 100 MB,
  and reading the whole thing into memory slows things down. Let's not read
  the whole thing into memory.
  
  The following performance improvement was observed:
  
  `hg perffncacheload` (best of 4 runs)
  
  before: wall 3.253480 comb 3.440000 user 2.500000 sys 0.940000 (best of 3)
  after: wall 2.850717 comb 3.010000 user 2.370000 sys 0.640000 (best of 4)

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D5296

AFFECTED FILES
  mercurial/store.py

CHANGE DETAILS




To: pulkit, #hg-reviewers
Cc: mjpieters, mercurial-devel
Yuya Nishihara - Nov. 22, 2018, 1:27 p.m.
> diff --git a/mercurial/store.py b/mercurial/store.py
> --- a/mercurial/store.py
> +++ b/mercurial/store.py
> @@ -461,13 +461,13 @@
>              # skip nonexistent file
>              self.entries = set()
>              return
> -        self.entries = set(decodedir(fp.read()).splitlines())
> -        if '' in self.entries:
> -            fp.seek(0)
> -            for n, line in enumerate(util.iterfile(fp)):
> -                if not line.rstrip('\n'):
> -                    t = _('invalid entry in fncache, line %d') % (n + 1)
> -                    raise error.Abort(t)
> +        self.entries = set()
> +        for n, line in enumerate(util.iterfile(fp)):
> +            entry = line.rstrip('\n')
> +            if not entry:
> +                t = _('invalid entry in fncache, line %d') % (n + 1)
> +                raise error.Abort(t)
> +            self.entries.add(decodedir(entry))

This goes in the opposite direction to 9fca5b056c0a. I guess the current code
would be faster if the fncache is around ~10MB. Can you benchmark it?
phabricator - Nov. 22, 2018, 1:32 p.m.
yuja added a comment.


  > diff --git a/mercurial/store.py b/mercurial/store.py
  > --- a/mercurial/store.py
  > +++ b/mercurial/store.py
  > @@ -461,13 +461,13 @@
  >              # skip nonexistent file
  >              self.entries = set()
  >              return
  > -        self.entries = set(decodedir(fp.read()).splitlines())
  > -        if '' in self.entries:
  > -            fp.seek(0)
  > -            for n, line in enumerate(util.iterfile(fp)):
  > -                if not line.rstrip('\n'):
  > -                    t = _('invalid entry in fncache, line %d') % (n + 1)
  > -                    raise error.Abort(t)
  > +        self.entries = set()
  > +        for n, line in enumerate(util.iterfile(fp)):
  > +            entry = line.rstrip('\n')
  > +            if not entry:
  > +                t = _('invalid entry in fncache, line %d') % (n + 1)
  > +                raise error.Abort(t)
  > +            self.entries.add(decodedir(entry))
  
  This goes the opposite direction to https://phab.mercurial-scm.org/rHG9fca5b056c0a2f673aefa64f7ec7488bd9188d9d. I guess the current code
  would be faster if the fncache is around ~10MB. Can you benchmark it?

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D5296

To: pulkit, #hg-reviewers
Cc: yuja, mjpieters, mercurial-devel
phabricator - Nov. 26, 2018, 10:38 a.m.
pulkit added a comment.


  In https://phab.mercurial-scm.org/D5296#78834, @yuja wrote:
  
  > > diff --git a/mercurial/store.py b/mercurial/store.py
  > > --- a/mercurial/store.py
  > > +++ b/mercurial/store.py
  > > @@ -461,13 +461,13 @@
  > >              # skip nonexistent file
  > >              self.entries = set()
  > >              return
  > > -        self.entries = set(decodedir(fp.read()).splitlines())
  > > -        if '' in self.entries:
  > > -            fp.seek(0)
  > > -            for n, line in enumerate(util.iterfile(fp)):
  > > -                if not line.rstrip('\n'):
  > > -                    t = _('invalid entry in fncache, line %d') % (n + 1)
  > > -                    raise error.Abort(t)
  > > +        self.entries = set()
  > > +        for n, line in enumerate(util.iterfile(fp)):
  > > +            entry = line.rstrip('\n')
  > > +            if not entry:
  > > +                t = _('invalid entry in fncache, line %d') % (n + 1)
  > > +                raise error.Abort(t)
  > > +            self.entries.add(decodedir(entry))
  >
  > This goes the opposite direction to https://phab.mercurial-scm.org/rHG9fca5b056c0a2f673aefa64f7ec7488bd9188d9d. I guess the current code
  >  would be faster if the fncache is around ~10MB. Can you benchmark it?
  
  
  Yeah, the current code is much faster if the fncache is not large.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D5296

To: pulkit, #hg-reviewers
Cc: yuja, mjpieters, mercurial-devel
phabricator - Nov. 27, 2018, 10:48 a.m.
pulkit added a comment.


  Seeing the performance benefit this brings on our repo, I want to try other
  ways we can do this. How about a conditional that checks the size of the
  fncache and chooses one of the approaches depending on how large it is?
  
  If size(fncache) < 50M, use the current approach;
  else, use the approach described in this patch.
  
  @yuja what do you think?

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D5296

To: pulkit, #hg-reviewers
Cc: yuja, mjpieters, mercurial-devel
Yuya Nishihara - Nov. 27, 2018, 12:23 p.m.
On Tue, 27 Nov 2018 10:48:44 +0000, pulkit (Pulkit Goyal) wrote:
>   Seeing the performance benefit it brings on our repo, I want to try other ways we can do this. Do we like having a conditional which checks the size of fncache and choose one of the approaches depending on how large it is?
>   
>   If size(fncache) < 50M, use the current approach
>   else, use the approach described in this patch

Supposing the current code is fast because it avoids running a Python loop
per line, I think it can be extended to not use too much memory.

```
chunk = b''
while True:
    chunk += fp.read(chunk_size)  # maybe ~10MB?
    p = chunk.rindex(b'\n')  # need to handle error
    decodedir(chunk[:p + 1])...
    chunk = chunk[p + 1:]
```

Just an idea. Not profiled.
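[Editor's note] Filling in the EOF handling and the `rindex` error that the sketch leaves open, a self-contained version of the chunked-read idea might look like this. `decodedir` here is an identity stand-in for Mercurial's `store.decodedir`, and `loadfncache` and the sample data are made up for illustration; entry validation is omitted for brevity.

```python
import io

def decodedir(paths):
    # identity stand-in for Mercurial's store.decodedir, which rewrites
    # escaped directory names in a whole blob of text at once
    return paths

def loadfncache(fp, chunksize=10 ** 6):
    # read the file in ~chunksize pieces instead of slurping it whole;
    # decodedir still runs on a whole chunk at a time, not per line
    entries = set()
    chunk = b''
    while True:
        data = fp.read(chunksize)
        if not data:
            break
        chunk += data
        try:
            p = chunk.rindex(b'\n')
        except ValueError:
            continue  # no complete line buffered yet; keep reading
        for line in decodedir(chunk[:p]).split(b'\n'):
            entries.add(line)
        chunk = chunk[p + 1:]  # carry the partial last line forward
    if chunk:
        entries.add(decodedir(chunk))  # file did not end with a newline
    return entries

fp = io.BytesIO(b'data/a.i\ndata/b.i\ndata/c.i\n')
print(sorted(loadfncache(fp, chunksize=7)))
# -> [b'data/a.i', b'data/b.i', b'data/c.i']
```

Peak memory use is bounded by roughly one chunk plus one partial line, rather than the whole file.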
phabricator - Feb. 25, 2019, 3:53 p.m.
pulkit added a comment.


  Ping!

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D5296

To: pulkit, #hg-reviewers
Cc: yuja, mjpieters, mercurial-devel
phabricator - Feb. 26, 2019, 2:46 a.m.
indygreg added a comment.


  I suspect https://phab.mercurial-scm.org/rHG9fca5b056c0a2f673aefa64f7ec7488bd9188d9d made things faster because the code before was using 1 I/O operation for every entry. I would also not be surprised if CPython from that era did something very inefficient with regards to line reading.
  
  The current code is pretty bad because it buffers the entire file content in memory! I agree we should change it.
  
  I like this patch as written. If profiling shows it to be slow, I think there is room to optimize `util.iterfile()` or even to teach the vfs layer how to efficiently open files for line-based I/O. This is something I could help optimize if needed.
  
  While I'm here, the fncache file being a newline-delimited list of full file paths is kinda ridiculous. We could do much better by using compression and/or a more sophisticated data structure. It is kinda silly that we have to load this decoded data structure into memory: if your file on disk is ~100MB, you get a Python set that also consumes ~100MB. That's really not great.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D5296

To: pulkit, #hg-reviewers
Cc: indygreg, yuja, mjpieters, mercurial-devel
Yuya Nishihara - Feb. 26, 2019, 3:01 a.m.
(resend without the "On ... wrote:" line)

>   Seeing the performance benefit it brings on our repo, I want to try other ways we can do this. Do we like having a conditional which checks the size of fncache and choose one of the approaches depending on how large it is?
>   
>   If size(fncache) < 50M, use the current approach
>   else, use the approach described in this patch

Supposing the current code is fast because it avoids running a Python loop
per line, I think it can be extended to not use too much memory.

```
chunk = b''
while True:
    chunk += fp.read(chunk_size)  # maybe ~10MB?
    p = chunk.rindex(b'\n')  # need to handle error
    decodedir(chunk[:p + 1])...
    chunk = chunk[p + 1:]
```

Just an idea. Not profiled.
phabricator - Feb. 26, 2019, 3:03 a.m.
yuja added a comment.


  (resend without the "On ... wrote:" line)
  
  >   Seeing the performance benefit it brings on our repo, I want to try other ways we can do this. Do we like having a conditional which checks the size of fncache and choose one of the approaches depending on how large it is?
  >   
  >   If size(fncache) < 50M, use the current approach
  >   else, use the approach described in this patch
  
  Supposing the current code is fast because it avoids running a Python loop
  per line, I think it can be extended to not use too much memory.
  
    chunk = b''
    while True:
        chunk += fp.read(chunk_size)  # maybe ~10MB?
        p = chunk.rindex(b'\n')  # need to handle error
        decodedir(chunk[:p + 1])...
        chunk = chunk[p + 1:]
  
  Just an idea. Not profiled.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D5296

To: pulkit, #hg-reviewers
Cc: indygreg, yuja, mjpieters, mercurial-devel
phabricator - Feb. 27, 2019, 3:43 p.m.
pulkit added a comment.


  In https://phab.mercurial-scm.org/D5296#87892, @yuja wrote:
  
  > (resend without the "On ... wrote:" line)
  >
  > >   Seeing the performance benefit it brings on our repo, I want to try other ways we can do this. Do we like having a conditional which checks the size of fncache and choose one of the approaches depending on how large it is?
  > >   
  > >   If size(fncache) < 50M, use the current approach
  > >   else, use the approach described in this patch
  >
  > Suppose the current code is fast because it avoids running Python in loop,
  >  I think it can be extended to not use too much memory.
  >
  >   chunk = b''
  >   while True:
  >       chunk += fp.read(chunk_size)  # maybe ~10MB?
  >       p = chunk.rindex(b'\n')  # need to handle error
  >       decodedir(chunk[:p + 1])...
  >       chunk = chunk[p + 1:]
  >
  >
  > Just an idea. Not profiled.
  
  
  I implemented this idea. It saves ~1 sec on perffncacheload for us when the chunk size is 1 MB.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D5296

To: pulkit, #hg-reviewers
Cc: indygreg, yuja, mjpieters, mercurial-devel
phabricator - Feb. 27, 2019, 3:46 p.m.
pulkit added a comment.


  In https://phab.mercurial-scm.org/D5296#87890, @indygreg wrote:
  
  > I suspect https://phab.mercurial-scm.org/rHG9fca5b056c0a2f673aefa64f7ec7488bd9188d9d made things faster because the code before was using 1 I/O operation for every entry. I would also not be surprised if CPython from that era did something very inefficient with regards to line reading.
  >
  > The current code is pretty bad because it buffers the entire file content in memory! I agree we should change it.
  >
  > I like this patch as written. If profiling shows it to be slow, I think there is room to optimize `util.iterfile()` or even to teach the vfs layer how to efficiently open files for line-based I/O. This is something I could help optimize if needed.
  >
  > While I'm here, the fncache file being a newline delimited list of full file paths is kinda ridiculous. We could do much better by using compression and/or a more complicated data structure. It is kinda silly that we have to load this decoded data structure into memory. So if your file on disk is ~100MB, you are going to have a Python set that also consumes ~100MB. That's really not great.
  
  
  I agree. We should either use compression or a more sophisticated data structure here. Right now there is no way to check whether an entry exists without parsing the whole fncache. That does not matter much on small repositories, but on large repositories such as mozilla the difference is noticeable. We have perffncacheload taking ~4 seconds :/
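[Editor's note] As a rough illustration of the compression idea floated here (not Mercurial code; the paths below are invented), even plain zlib shrinks a repetitive newline-delimited path list substantially:

```python
import zlib

# made-up but realistically repetitive store paths, fncache-style
paths = [('data/dir%d/file%d.i' % (d, f)).encode()
         for d in range(100) for f in range(100)]
raw = b'\n'.join(paths) + b'\n'

packed = zlib.compress(raw)
print(len(raw), len(packed))  # compressed form is far smaller
```

This only reduces the on-disk and I/O cost; a membership check would still require decompressing and parsing, which is why the thread also mentions smarter data structures.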

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D5296

To: pulkit, #hg-reviewers
Cc: indygreg, yuja, mjpieters, mercurial-devel
phabricator - March 17, 2019, 12:37 a.m.
pulkit added a comment.


  In https://phab.mercurial-scm.org/D5296#88016, @yuja wrote:
  
  > > -        self.entries = set(decodedir(fp.read()).splitlines())
  > > +
  > > +        self.entries = []
  > > +        totalsize = self.vfs.stat('fncache').st_size
  >
  > I don't think `totalsize` has to be stat()ed. We can just loop until
  > `fp.read()` reaches EOF. It's unreliable to assume that stat('fncache')
  > sees the identical file to the `fp` open()ed before.
  >
  > > +        chunksize = (10 ** 6) # 1 MB
  > >  +        chunk = b''
  > >  +        chunksize = min(totalsize, chunksize)
  >
  > Do we have any test covering `totalsize > chunksize` case?
  
  
  Right now this patch does not have any tests. How should I add them?
  
  1. add a debug config option and pass that down to fncachestore and then to fncache
  2. have a function which returns the chunk size, write an extension in the tests which wraps that function, and enable that extension in the tests
  
  I think 2) is the better option here. I am unable to think of any solution except these two. I think you will have better ideas here. What do you think?
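[Editor's note] Option 2 could be sketched roughly as below. All names are hypothetical, and `wrapfunction` is a minimal self-contained stand-in for `mercurial.extensions.wrapfunction`, which a real test extension would call from its `extsetup()`:

```python
import sys

def fncachechunksize():
    # hypothetical helper: default chunk size used when loading the fncache
    return 10 ** 6  # 1 MB

def loadentries(data):
    # hypothetical loader: splits data using the current chunk size; calling
    # fncachechunksize() on every load means a test wrapper takes effect
    size = fncachechunksize()
    return [data[i:i + size] for i in range(0, len(data), size)]

def wrapfunction(container, name, wrapper):
    # minimal stand-in for mercurial.extensions.wrapfunction
    orig = getattr(container, name)
    setattr(container, name, lambda *args, **kw: wrapper(orig, *args, **kw))

# what the test extension would do: shrink the chunk size so that even a
# tiny fncache exercises the totalsize > chunksize code path
wrapfunction(sys.modules[__name__], 'fncachechunksize', lambda orig: 4)

print(loadentries(b'data/a.i\n'))  # -> [b'data', b'/a.i', b'\n']
```

The advantage over a debug config option is that nothing test-specific leaks into the shipped code beyond the small seam function.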

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D5296

To: pulkit, #hg-reviewers
Cc: indygreg, yuja, mjpieters, mercurial-devel

Patch

diff --git a/mercurial/store.py b/mercurial/store.py
--- a/mercurial/store.py
+++ b/mercurial/store.py
@@ -461,13 +461,13 @@ 
             # skip nonexistent file
             self.entries = set()
             return
-        self.entries = set(decodedir(fp.read()).splitlines())
-        if '' in self.entries:
-            fp.seek(0)
-            for n, line in enumerate(util.iterfile(fp)):
-                if not line.rstrip('\n'):
-                    t = _('invalid entry in fncache, line %d') % (n + 1)
-                    raise error.Abort(t)
+        self.entries = set()
+        for n, line in enumerate(util.iterfile(fp)):
+            entry = line.rstrip('\n')
+            if not entry:
+                t = _('invalid entry in fncache, line %d') % (n + 1)
+                raise error.Abort(t)
+            self.entries.add(decodedir(entry))
         fp.close()
 
     def write(self, tr):