Patchwork [V2] util: reimplement lrucachedict

login
register
mail settings
Submitter Gregory Szorc
Date Dec. 13, 2015, 5:36 p.m.
Message ID <cc53351b99acff453333.1450028166@ubuntu-main>
Download mbox | patch
Permalink /patch/12015/
State Accepted
Headers show

Comments

Gregory Szorc - Dec. 13, 2015, 5:36 p.m.
# HG changeset patch
# User Gregory Szorc <gregory.szorc@gmail.com>
# Date 1449457450 28800
#      Sun Dec 06 19:04:10 2015 -0800
# Node ID cc53351b99acff4533332a7d75c0e205b9062d37
# Parent  944af8e2eb4cddf96ba5b8a96854528b40979715
util: reimplement lrucachedict

As part of attempting to more aggressively use the existing
lrucachedict, collections.deque operations were frequently
showing up in profiling output, negating benefits of caching.

Searching the internet seems to tell me that the most efficient
way to implement an LRU cache in Python is to have a dict indexing
the cached entries and then to use a doubly linked list to track
freshness of each entry. So, this patch replaces our existing
lrucachedict with a version using such a pattern.

The recently introduced perflrucachedict command reveals the
following timings for 10,000 operations for the following cache
sizes for the existing cache:

n=4     init=0.004079 gets=0.003632 sets=0.005188 mixed=0.005402
n=8     init=0.004045 gets=0.003998 sets=0.005064 mixed=0.005328
n=16    init=0.004011 gets=0.004496 sets=0.005021 mixed=0.005555
n=32    init=0.004064 gets=0.005611 sets=0.005188 mixed=0.006189
n=64    init=0.003975 gets=0.007684 sets=0.005178 mixed=0.007245
n=128   init=0.004121 gets=0.012005 sets=0.005422 mixed=0.009471
n=256   init=0.004143 gets=0.020295 sets=0.005227 mixed=0.013612
n=512   init=0.004039 gets=0.036703 sets=0.005243 mixed=0.020685
n=1024  init=0.004193 gets=0.068142 sets=0.005251 mixed=0.033064
n=2048  init=0.004070 gets=0.133383 sets=0.005160 mixed=0.050359
n=4096  init=0.004053 gets=0.265194 sets=0.004868 mixed=0.048352
n=8192  init=0.004087 gets=0.542218 sets=0.004562 mixed=0.032753
n=16384 init=0.004106 gets=1.064055 sets=0.004179 mixed=0.020367
n=32768 init=0.004034 gets=2.097620 sets=0.004260 mixed=0.013031
n=65536 init=0.004108 gets=4.106390 sets=0.004268 mixed=0.010191

As the data shows, the existing cache's retrieval performance
diminishes linearly with cache size. (Keep in mind the microbenchmark
is testing 100% cache hit rate.)

The new cache implementation reveals the following:

n=4     init=0.006665 gets=0.006541 sets=0.005733 mixed=0.006876
n=8     init=0.006649 gets=0.006374 sets=0.005663 mixed=0.006899
n=16    init=0.006570 gets=0.006504 sets=0.005799 mixed=0.007057
n=32    init=0.006854 gets=0.006459 sets=0.005747 mixed=0.007034
n=64    init=0.006580 gets=0.006495 sets=0.005740 mixed=0.006992
n=128   init=0.006534 gets=0.006739 sets=0.005648 mixed=0.007124
n=256   init=0.006669 gets=0.006773 sets=0.005824 mixed=0.007151
n=512   init=0.006701 gets=0.007061 sets=0.006042 mixed=0.007372
n=1024  init=0.006641 gets=0.007620 sets=0.006387 mixed=0.007464
n=2048  init=0.006517 gets=0.008598 sets=0.006871 mixed=0.008077
n=4096  init=0.006720 gets=0.010933 sets=0.007854 mixed=0.008663
n=8192  init=0.007383 gets=0.015969 sets=0.010288 mixed=0.008896
n=16384 init=0.006660 gets=0.025447 sets=0.011208 mixed=0.008826
n=32768 init=0.006658 gets=0.044390 sets=0.011192 mixed=0.008943
n=65536 init=0.006836 gets=0.082736 sets=0.011151 mixed=0.008826

Let's go through the results.

The new cache takes longer to construct. ~6.6ms vs ~4.1ms. However,
this is measuring 10,000 __init__ calls, so the difference is
~0.2us/instance. We currently only create lrucachedict for manifest
instances, so this regression is not likely relevant.

The new cache is slightly slower for retrievals for cache sizes
< 1024. It's worth noting that the only existing use of lurcachedict
is in manifest.py and the default cache size is 4. This regression
is worrisome. However, for n=4, the delta is ~2.9s for 10,000 lookups,
or ~0.29us/op. Again, this is a marginal regression and likely not
relevant in the real world. Timing `hg log -p -l 100` for
mozilla-central reveals that cache lookup times are dominated by
decompression and fulltext resolution (even with lz4 manifests).

The new cache is significantly faster for retrievals at larger
capacities. Whereas the old implementation has retrieval performance
linear with cache capacity, the new cache is constant time until much
larger values. And, when it does start to increase significantly, it
is a few magnitudes faster than the current cache.

The new cache does appear to be slower for sets when capacity is large.
However, performance is similar for smaller capacities. Of course,
caches should generally be optimized for retrieval performance because
if a cache is getting more sets than gets, it doesn't really make
sense to cache. If this regression is worrisome, again, taking the
largest regression at n=65536 of ~6.9ms for 10,000 results in a
regression of ~0.68us/op. This is not significant in the grand scheme
of things.

Overall, the new cache is performant at retrievals at much larger
capacity values which makes it a generally more useful cache backend.
While there are regressions, their absolute value is extremely small.
Since we aren't using lrucachedict aggressively today, these
regressions should not be relevant. The improved scalability of
lrucachedict should enable us to more aggressively utilize
lrucachedict for more granular caching (read: higher capacity caches)
in the near future. The impetus for this patch is to establish a cache
of decompressed revlog revisions, notably manifest revisions. And since
delta chains can grow to >10,000 and cache hit rate can be high, the
improved retrieval performance of lrucachedict should be relevant.
Matt Mackall - Dec. 16, 2015, 10:06 p.m.
On Sun, 2015-12-13 at 09:36 -0800, Gregory Szorc wrote:
> # HG changeset patch
> # User Gregory Szorc <gregory.szorc@gmail.com>
> # Date 1449457450 28800
> #      Sun Dec 06 19:04:10 2015 -0800
> # Node ID cc53351b99acff4533332a7d75c0e205b9062d37
> # Parent  944af8e2eb4cddf96ba5b8a96854528b40979715
> util: reimplement lrucachedict

Queued for default, thanks.

-- 
Mathematics is the supreme nostalgia of our time.

Patch

diff --git a/mercurial/util.py b/mercurial/util.py
--- a/mercurial/util.py
+++ b/mercurial/util.py
@@ -490,44 +490,188 @@  class sortdict(dict):
         return self._list.__iter__()
     def iteritems(self):
         for k in self._list:
             yield k, self[k]
     def insert(self, index, key, val):
         self._list.insert(index, key)
         dict.__setitem__(self, key, val)
 
+class _lrucachenode(object):
+    """A node in a doubly linked list.
+
+    Holds a reference to nodes on either side as well as a key-value
+    pair for the dictionary entry.
+    """
+    __slots__ = ('next', 'prev', 'key', 'value')
+
+    def __init__(self):
+        self.next = None
+        self.prev = None
+
+        self.key = _notset
+        self.value = None
+
+    def markempty(self):
+        """Mark the node as emptied."""
+        self.key = _notset
+
 class lrucachedict(object):
-    '''cache most recent gets from or sets to this dictionary'''
-    def __init__(self, maxsize):
+    """Dict that caches most recent accesses and sets.
+
+    The dict consists of an actual backing dict - indexed by original
+    key - and a doubly linked circular list defining the order of entries in
+    the cache.
+
+    The head node is the newest entry in the cache. If the cache is full,
+    we recycle head.prev and make it the new head. Cache accesses result in
+    the node being moved to before the existing head and being marked as the
+    new head node.
+
+    NOTE: construction of this class doesn't scale well if the cache size
+    is in the thousands. Avoid creating hundreds or thousands of instances
+    with large capacities.
+    """
+    def __init__(self, max):
         self._cache = {}
-        self._maxsize = maxsize
-        self._order = collections.deque()
 
-    def __getitem__(self, key):
-        value = self._cache[key]
-        self._order.remove(key)
-        self._order.append(key)
-        return value
+        self._head = head = _lrucachenode()
+        head.prev = head
+        head.next = head
+        self._size = 1
+        self._capacity = max
 
-    def __setitem__(self, key, value):
-        if key not in self._cache:
-            if len(self._cache) >= self._maxsize:
-                del self._cache[self._order.popleft()]
+    def __len__(self):
+        return len(self._cache)
+
+    def __contains__(self, k):
+        return k in self._cache
+
+    def __iter__(self):
+        # We don't have to iterate in cache order, but why not.
+        n = self._head
+        for i in range(len(self._cache)):
+            yield n.key
+            n = n.next
+
+    def __getitem__(self, k):
+        node = self._cache[k]
+        self._movetohead(node)
+        return node.value
+
+    def __setitem__(self, k, v):
+        node = self._cache.get(k)
+        # Replace existing value and mark as newest.
+        if node is not None:
+            node.value = v
+            self._movetohead(node)
+            return
+
+        if self._size < self._capacity:
+            node = self._addcapacity()
         else:
-            self._order.remove(key)
-        self._cache[key] = value
-        self._order.append(key)
+            # Grab the last/oldest item.
+            node = self._head.prev
 
-    def __contains__(self, key):
-        return key in self._cache
+        # At capacity. Kill the old entry.
+        if node.key is not _notset:
+            del self._cache[node.key]
+
+        node.key = k
+        node.value = v
+        self._cache[k] = node
+        # And mark it as newest entry. No need to adjust order since it
+        # is already self._head.prev.
+        self._head = node
+
+    def __delitem__(self, k):
+        node = self._cache.pop(k)
+        node.markempty()
+
+        # Temporarily mark as newest item before re-adjusting head to make
+        # this node the oldest item.
+        self._movetohead(node)
+        self._head = node.next
+
+    # Additional dict methods.
+
+    def get(self, k, default=None):
+        try:
+            return self._cache[k]
+        except KeyError:
+            return default
 
     def clear(self):
+        n = self._head
+        while n.key is not _notset:
+            n.markempty()
+            n = n.next
+
         self._cache.clear()
-        self._order = collections.deque()
+
+    def _movetohead(self, node):
+        """Mark a node as the newest, making it the new head.
+
+        When a node is accessed, it becomes the freshest entry in the LRU
+        list, which is denoted by self._head.
+
+        Visually, let's make ``N`` the new head node (* denotes head):
+
+            previous/oldest <-> head <-> next/next newest
+
+            ----<->--- A* ---<->-----
+            |                       |
+            E <-> D <-> N <-> C <-> B
+
+        To:
+
+            ----<->--- N* ---<->-----
+            |                       |
+            E <-> D <-> C <-> B <-> A
+
+        This requires the following moves:
+
+           C.next = D  (node.prev.next = node.next)
+           D.prev = C  (node.next.prev = node.prev)
+           E.next = N  (head.prev.next = node)
+           N.prev = E  (node.prev = head.prev)
+           N.next = A  (node.next = head)
+           A.prev = N  (head.prev = node)
+        """
+        head = self._head
+        # C.next = D
+        node.prev.next = node.next
+        # D.prev = C
+        node.next.prev = node.prev
+        # N.prev = E
+        node.prev = head.prev
+        # N.next = A
+        # It is tempting to do just "head" here, however if node is
+        # adjacent to head, this will do bad things.
+        node.next = head.prev.next
+        # E.next = N
+        node.next.prev = node
+        # A.prev = N
+        node.prev.next = node
+
+        self._head = node
+
+    def _addcapacity(self):
+        """Add a node to the circular linked list.
+
+        The new node is inserted before the head node.
+        """
+        head = self._head
+        node = _lrucachenode()
+        head.prev.next = node
+        node.prev = head.prev
+        node.next = head
+        head.prev = node
+        self._size += 1
+        return node
 
 def lrucachefunc(func):
     '''cache most recent results of function calls'''
     cache = {}
     order = collections.deque()
     if func.func_code.co_argcount == 1:
         def f(arg):
             if arg not in cache:
diff --git a/tests/test-lrucachedict.py b/tests/test-lrucachedict.py
--- a/tests/test-lrucachedict.py
+++ b/tests/test-lrucachedict.py
@@ -29,10 +29,18 @@  def test_lrucachedict():
 
     # 'e' should be dropped now
     d['f'] = 'vf'
     printifpresent(d, ['b', 'c', 'd', 'e', 'f'])
 
     d.clear()
     printifpresent(d, ['b', 'c', 'd', 'e', 'f'])
 
+    # Now test dicts that aren't full.
+    d = util.lrucachedict(4)
+    d['a'] = 1
+    d['b'] = 2
+    d['a']
+    d['b']
+    printifpresent(d, ['a', 'b'])
+
 if __name__ == '__main__':
     test_lrucachedict()
diff --git a/tests/test-lrucachedict.py.out b/tests/test-lrucachedict.py.out
--- a/tests/test-lrucachedict.py.out
+++ b/tests/test-lrucachedict.py.out
@@ -24,8 +24,12 @@  d['d']: vd
 'e' in d: False
 'f' in d: True
 d['f']: vf
 'b' in d: False
 'c' in d: False
 'd' in d: False
 'e' in d: False
 'f' in d: False
+'a' in d: True
+d['a']: 1
+'b' in d: True
+d['b']: 2