Patchwork D11204: hgwebdir: avoid systematic full garbage collection

Submitter phabricator
Date July 20, 2021, 4:19 p.m.
Message ID <differential-rev-PHID-DREV-ffpkn4haxzp3eqmbd6i3-req@mercurial-scm.org>
Permalink /patch/49518/
State Superseded

Comments

phabricator - July 20, 2021, 4:19 p.m.
gracinet created this revision.
Herald added a reviewer: hg-reviewers.
Herald added a subscriber: mercurial-patches.

REVISION SUMMARY
  Forcing a systematic full garbage collection upon each request
  can seriously harm performance. This is reported as
  https://bz.mercurial-scm.org/show_bug.cgi?id=6075
  
  With this change, the full collection is performed according
  to a new setting, `web.full-garbage-collection-rate`.
  The default value is 0, which means that no full garbage
  collection is forced.
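
  For reference, this is how the new setting would look in an hgrc
  (the value 100 is purely illustrative, not a recommendation):

```ini
[web]
# Run a full gc.collect() every 100 requests.
# 0 (the default) disables forced full collections.
full-garbage-collection-rate = 100
```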
  
  In my limited experience chasing memory leaks in Mercurial
  servers (biased towards Python 3), the full collection never
  made a difference in terms of memory footprint. It is possible that
  the condition it was mitigating when it was introduced no longer
  exists, or that it is specific to some repositories.
  
  On the other hand, as explained in the Python developer docs [1],
  frequent full collections are very harmful in terms of performance if
  lots of objects survive the collection and hence stay in the
  oldest generation. Note that `gc.collect()` does indeed collect
  the oldest generation [2]. This usually happens in two cases:
  
  - unwanted lingering objects (i.e., an actual memory leak that the GC cannot do anything about). Sadly, we have lots of those these days.
  - desirable long-term objects, typically in caches. This is an area of interest of mine.
  
  In short, the flat rate that this change still permits is
  probably a bad idea in most cases, which is why it is set to zero
  by default. It is possible, though, that a value like 100 or 1000
  could be a good trade-off if someone has a repository for which the GC
  can actually mitigate excessive memory usage.
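
  The rate-based gating amounts to a modulo check on a per-instance
  request counter. A minimal standalone sketch (the class and method
  names are hypothetical, not part of hgweb):

```python
import gc


class RateLimitedCollector:
    """Run a full gc.collect() every `rate` requests (0 disables it)."""

    def __init__(self, rate):
        self.rate = rate
        self.requests_count = 0
        self.full_collections_done = 0

    def after_request(self):
        # Not thread safe; the worst case is a skipped collection,
        # which is preferable to several threads collecting in parallel.
        self.requests_count += 1
        if self.rate > 0 and self.requests_count % self.rate == 0:
            gc.collect()
            self.full_collections_done += 1
```

  With rate=2, five requests trigger two full collections (after the
  2nd and 4th); with rate=0, none are triggered.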
  
  The test is inspired by test-hgwebdir-paths.py
  
  [1] https://devguide.python.org/garbage_collector/#collecting-the-oldest-generation
  [2] https://docs.python.org/3/library/gc.html#gc.collect
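
  As a reminder of why explicit collections can matter at all:
  objects caught in reference cycles are invisible to pure
  refcounting and are only reclaimed by the cyclic collector. A
  small illustration (not taken from the patch):

```python
import gc


class Node:
    """Minimal object used to build a reference cycle."""


a, b = Node(), Node()
a.other, b.other = b, a  # cycle: refcounts never drop to zero on their own
del a, b                 # both nodes are now unreachable, but still alive

# A full collection (oldest generation, like the hgweb call) frees them;
# gc.collect() returns the number of unreachable objects it found.
unreachable = gc.collect()
```

  This is the same mechanism the hgwebdir comment alludes to with the
  known cycles in localrepository.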

REPOSITORY
  rHG Mercurial

BRANCH
  default

REVISION DETAIL
  https://phab.mercurial-scm.org/D11204

AFFECTED FILES
  mercurial/configitems.py
  mercurial/hgweb/hgwebdir_mod.py
  tests/test-hgwebdir-gc.py

CHANGE DETAILS




To: gracinet, #hg-reviewers
Cc: mercurial-patches, mercurial-devel

Patch

diff --git a/tests/test-hgwebdir-gc.py b/tests/test-hgwebdir-gc.py
new file mode 100644
--- /dev/null
+++ b/tests/test-hgwebdir-gc.py
@@ -0,0 +1,44 @@ 
+from __future__ import absolute_import
+
+import os
+from mercurial.hgweb import hgwebdir_mod
+
+hgwebdir = hgwebdir_mod.hgwebdir
+
+os.mkdir(b'webdir')
+os.chdir(b'webdir')
+
+webdir = os.path.realpath(b'.')
+
+
+def trivial_response(req, res):
+    return []
+
+
+def make_hgwebdir(gc_rate):
+    config = os.path.join(webdir, b'hgwebdir.conf')
+    with open(config, 'wb') as configfile:
+        configfile.write(b'[web]\n')
+        configfile.write(b'full-garbage-collection-rate=%d\n' % gc_rate)
+
+    hg_wd = hgwebdir(config)
+    hg_wd._runwsgi = trivial_response
+    return hg_wd
+
+
+def process_requests(webdir_instance, number):
+    # we don't care for now about passing realistic arguments
+    for _ in range(number):
+        for chunk in webdir_instance.run_wsgi(None, None):
+            pass
+
+
+without_gc = make_hgwebdir(gc_rate=0)
+process_requests(without_gc, 5)
+assert without_gc.requests_count == 5
+assert without_gc.gc_full_collections_done == 0
+
+with_gc = make_hgwebdir(gc_rate=2)
+process_requests(with_gc, 5)
+assert with_gc.requests_count == 5
+assert with_gc.gc_full_collections_done == 2
diff --git a/mercurial/hgweb/hgwebdir_mod.py b/mercurial/hgweb/hgwebdir_mod.py
--- a/mercurial/hgweb/hgwebdir_mod.py
+++ b/mercurial/hgweb/hgwebdir_mod.py
@@ -285,6 +285,7 @@ 
         self.lastrefresh = 0
         self.motd = None
         self.refresh()
+        self.requests_count = 0
         if not baseui:
             # set up environment for new ui
             extensions.loadall(self.ui)
@@ -341,6 +342,10 @@ 
 
         self.repos = repos
         self.ui = u
+        self.gc_full_collect_rate = self.ui.configint(
+            b'web', b'full-garbage-collection-rate'
+        )
+        self.gc_full_collections_done = 0
         encoding.encoding = self.ui.config(b'web', b'encoding')
         self.style = self.ui.config(b'web', b'style')
         self.templatepath = self.ui.config(
@@ -383,12 +388,25 @@ 
             finally:
                 # There are known cycles in localrepository that prevent
                 # those objects (and tons of held references) from being
-                # collected through normal refcounting. We mitigate those
-                # leaks by performing an explicit GC on every request.
-                # TODO remove this once leaks are fixed.
-                # TODO only run this on requests that create localrepository
-                # instances instead of every request.
-                gc.collect()
+                # collected through normal refcounting.
+                # In some cases, the resulting memory consumption can
+                # be tamed by performing explicit garbage collections.
+                # In presence of actual leaks or big long-lived caches, the
+                # impact on performance of such collections can become a
+                # problem, hence the rate shouldn't be set too low.
+                # See "Collecting the oldest generation" in
+                # https://devguide.python.org/garbage_collector
+                # for more about such trade-offs.
+                rate = self.gc_full_collect_rate
+
+                # this is not thread safe, but the consequence (skipping
+                # a garbage collection) is arguably better than risking
+                # to have several threads perform a collection in parallel
+                # (long useless wait on all threads).
+                self.requests_count += 1
+                if rate > 0 and self.requests_count % rate == 0:
+                    gc.collect()
+                    self.gc_full_collections_done += 1
 
     def _runwsgi(self, req, res):
         try:
diff --git a/mercurial/configitems.py b/mercurial/configitems.py
--- a/mercurial/configitems.py
+++ b/mercurial/configitems.py
@@ -2486,6 +2486,11 @@ 
 )
 coreconfigitem(
     b'web',
+    b'full-garbage-collection-rate',
+    default=0,  # means not to force full collections
+)
+coreconfigitem(
+    b'web',
     b'hidden',
     default=False,
 )