Patchwork D5296: store: don't read the whole fncache in memory

Submitter phabricator
Date Feb. 27, 2019, 5:47 p.m.
Message ID <10cca117444ddc6233272806493fd60c@localhost.localdomain>
Permalink /patch/38952/
State Not Applicable

Comments

phabricator - Feb. 27, 2019, 5:47 p.m.
pulkit updated this revision to Diff 14256.

REPOSITORY
  rHG Mercurial

CHANGES SINCE LAST UPDATE
  https://phab.mercurial-scm.org/D5296?vs=14255&id=14256

REVISION DETAIL
  https://phab.mercurial-scm.org/D5296

AFFECTED FILES
  mercurial/store.py

CHANGE DETAILS




To: pulkit, #hg-reviewers
Cc: indygreg, yuja, mjpieters, mercurial-devel
Yuya Nishihara - Feb. 28, 2019, 1:03 a.m.
> -        self.entries = set(decodedir(fp.read()).splitlines())
> +
> +        self.entries = []
> +        totalsize = self.vfs.stat('fncache').st_size

I don't think `totalsize` has to be stat()ed. We can just loop until
`fp.read()` reaches EOF. It's unreliable to assume that stat('fncache') sees
the same file as the one already open()ed as `fp`.
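
For illustration, a minimal sketch of such a stat()-less loop (assuming the
same `fp`, `chunksize` and `decodedir` as in the patch, everything else as in
the revision under review), stopping once `fp.read()` returns an empty string:

  entries = []
  chunk = b''
  chunksize = 10 ** 6  # 1 MB
  while True:
      data = fp.read(chunksize)
      if not data:
          # EOF reached, no stat() needed
          break
      chunk += data
      try:
          # consume everything up to the last complete line
          p = chunk.rindex(b'\n')
          entries.extend(decodedir(chunk[:p + 1]).splitlines())
          chunk = chunk[p + 1:]
      except ValueError:
          # no '\n' yet, the entry may be longer than chunksize
          pass

This also sidesteps the race between the earlier open() and a later stat().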

> +        chunksize = (10 ** 6) # 1 MB
> +        chunk = b''
> +        chunksize = min(totalsize, chunksize)

Do we have any test covering `totalsize > chunksize` case?
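
One way to exercise the `totalsize > chunksize` path without a huge repository
might be to run the same chunked parsing against an in-memory buffer bigger
than one chunk. A hypothetical, self-contained sketch (plain bytes, `decodedir`
omitted); this is not an existing Mercurial test:

  import io

  def parsefncache(fp, chunksize=10 ** 6):
      # mirrors the chunked loop from the patch, reading until EOF
      entries = set()
      chunk = b''
      while True:
          data = fp.read(chunksize)
          if not data:
              break
          chunk += data
          try:
              p = chunk.rindex(b'\n')
              entries.update(chunk[:p + 1].splitlines())
              chunk = chunk[p + 1:]
          except ValueError:
              pass
      return entries

  # synthetic content comfortably larger than one 1 MB chunk
  lines = [b'data/some/deeply/nested/path%d.i' % i for i in range(50000)]
  content = b'\n'.join(lines) + b'\n'
  assert len(content) > 10 ** 6
  assert parsefncache(io.BytesIO(content)) == set(lines)

An in-tree test would more likely be a .t script creating enough tracked files
to push the real fncache past 1 MB.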

> +        totalsize -= chunksize
> +        while chunksize:
> +            chunk += fp.read(chunksize)
> +            try:
> +                p = chunk.rindex(b'\n')
> +                self.entries.extend(decodedir(chunk[:p + 1]).splitlines())

Any reason to not build a `set` directly? (A small illustration follows after
the quoted hunk.)

> +                chunk = chunk[p + 1:]
> +            except ValueError:
> +                # substring '\n' not found, maybe the entry is bigger than the
> +                # chunksize, so let's keep iterating
> +                pass
> +            chunksize = min(totalsize, chunksize)
> +            totalsize -= chunksize
> +
> +        self.entries = set(self.entries)
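
Regarding the `set` question above: `set.update()` accepts the same iterables
as `list.extend()`, so the intermediate list and the final `set(self.entries)`
conversion look avoidable. A tiny, hypothetical illustration with plain data:

  entries = set()
  for chunk in (b'data/a.i\ndata/b.i\n', b'data/c.i\n'):
      entries.update(chunk.splitlines())
  assert entries == {b'data/a.i', b'data/b.i', b'data/c.i'}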

Patch

diff --git a/mercurial/store.py b/mercurial/store.py
--- a/mercurial/store.py
+++ b/mercurial/store.py
@@ -463,7 +463,27 @@ 
             # skip nonexistent file
             self.entries = set()
             return
-        self.entries = set(decodedir(fp.read()).splitlines())
+
+        self.entries = []
+        totalsize = self.vfs.stat('fncache').st_size
+        chunksize = (10 ** 6) # 1 MB
+        chunk = b''
+        chunksize = min(totalsize, chunksize)
+        totalsize -= chunksize
+        while chunksize:
+            chunk += fp.read(chunksize)
+            try:
+                p = chunk.rindex(b'\n')
+                self.entries.extend(decodedir(chunk[:p + 1]).splitlines())
+                chunk = chunk[p + 1:]
+            except ValueError:
+                # substring '\n' not found, maybe the entry is bigger than the
+                # chunksize, so let's keep iterating
+                pass
+            chunksize = min(totalsize, chunksize)
+            totalsize -= chunksize
+
+        self.entries = set(self.entries)
         self._checkentries(fp)
         fp.close()