Patchwork [2,of,7] revlog: make sure _cache only contain raw content

login
register
mail settings
Submitter Jun Wu
Date March 28, 2017, 7:49 a.m.
Message ID <99cfe31d37df62b50e53.1490687343@localhost.localdomain>
Download mbox | patch
Permalink /patch/19774/
State Changes Requested
Headers show

Comments

Jun Wu - March 28, 2017, 7:49 a.m.
# HG changeset patch
# User Jun Wu <quark@fb.com>
# Date 1490685282 25200
#      Tue Mar 28 00:14:42 2017 -0700
# Node ID 99cfe31d37df62b50e53a126f0eb31a1e352ac67
# Parent  1e84f9bd4385a8f95ac1ec15dee14c723071ab34
# Available At https://bitbucket.org/quark-zju/hg-draft
#              hg pull https://bitbucket.org/quark-zju/hg-draft -r 99cfe31d37df
revlog: make sure _cache only contain raw content

Previously, revlog._cache could cache both raw and non-raw content. That is
wrong, because _cache will be used directly for delta application. If the
non-raw content in _cache is used, it may corrupt delta application, or
return wrong content when deltas applied successfully but incorrectly, and a
flagprocessor says skip the hash check.

This patch changes _cache assignments to make sure only only raw contents
are assigned to it.
Ryan McElroy - March 29, 2017, 12:39 p.m.
On 3/28/17 8:49 AM, Jun Wu wrote:
> # HG changeset patch
> # User Jun Wu <quark@fb.com>
> # Date 1490685282 25200
> #      Tue Mar 28 00:14:42 2017 -0700
> # Node ID 99cfe31d37df62b50e53a126f0eb31a1e352ac67
> # Parent  1e84f9bd4385a8f95ac1ec15dee14c723071ab34
> revlog: make sure _cache only contain raw content
>
> Previously, revlog._cache could cache both raw and non-raw content. That is
> wrong, because _cache will be used directly for delta application. If the
> non-raw content in _cache is used, it may corrupt delta application, or
> return wrong content when deltas applied successfully but incorrectly, and a
> flagprocessor says skip the hash check.
>
> This patch changes _cache assignments to make sure only only raw contents
> are assigned to it.

This patch overall makes sense to me. We definitely want to only be 
caching the same type of thing in a particular cache. I've reviewed 
revlog.py to find places were self._cache is assigned and it looks like 
you got both places where it's assigned to something other than null.

>
> diff --git a/mercurial/revlog.py b/mercurial/revlog.py
> --- a/mercurial/revlog.py
> +++ b/mercurial/revlog.py
> @@ -1284,11 +1284,11 @@ class revlog(object):
>           text = mdiff.patches(text, bins)
>   
> -        text, validatehash = self._processflags(text, self.flags(rev), 'read',
> -                                                raw=raw)
> +        newtext, validatehash = self._processflags(text, self.flags(rev),
> +                                                   'read', raw=raw)
>           if validatehash:
> -            self.checkhash(text, node, rev=rev)
> +            self.checkhash(newtext, node, rev=rev)
>   
>           self._cache = (node, rev, text)
> -        return text
> +        return newtext

Rather than introducing newtext, wouldn't it be sufficient to move the 
self._cache assignment line up to above the processflags call? Or do you 
want to only cache if the checkhash call succeeds?

If the later, I'd suggest renaming "text" to "rawtext" and reducing 
churn here. "newtext" doesn't really tell me about what is stored in it 
("processedtext" would but it much longer). "rawtext" is pretty short 
and descriptive though.

>   
>       def hash(self, text, p1, p2):
> @@ -1715,5 +1715,6 @@ class revlog(object):
>   
>           if type(text) == str: # only accept immutable objects
> -            self._cache = (node, curr, text)
> +            if raw or not flags:
> +                self._cache = (node, curr, text)
>           self._chainbasecache[curr] = chainbase
>           return node
>

Ugh, this _addrevision code is incredibly dense. A 151-line "helper" 
function with two sub-functions defined inside of it. What a joy!

However, I think this change should also update the comment about "only 
accept...". I'd structure it something like this:

# Only cache immutable objects objects that are raw or have no flags.
# Mutable values are dangerous to cache because they can be changed
# after they are passed to usor after they are returned from cache,
# so we would need to deep copy them on both input and output which
# could be expensive.
# Strings are the only immutable objects we expect to see, to check for them
# If raw is true, it means we are bypassing any flag processors, so we 
have cacheable data.
# If there are no flags set, that means no flag processors will be run, 
so we also have cacheable data
if type(text) == str and (raw or not flags):
     self._cache = (node, curr, text)

For my own edification, what are the types that can show up here? Do we 
accept arbitrary things that have string representations? When would we 
get something that isn't a string here? I hate the lack of types here, 
it leads to so much ambiguity about what is going on.

Any comments you can add to the code from what you have learned mucking 
around here would be appreciated.
Jun Wu - March 29, 2017, 3:13 p.m.
Excerpts from Ryan McElroy's message of 2017-03-29 13:39:29 +0100:
> On 3/28/17 8:49 AM, Jun Wu wrote:
> > # HG changeset patch
> > # User Jun Wu <quark@fb.com>
> > # Date 1490685282 25200
> > #      Tue Mar 28 00:14:42 2017 -0700
> > # Node ID 99cfe31d37df62b50e53a126f0eb31a1e352ac67
> > # Parent  1e84f9bd4385a8f95ac1ec15dee14c723071ab34
> > revlog: make sure _cache only contain raw content
> >
> > Previously, revlog._cache could cache both raw and non-raw content. That is
> > wrong, because _cache will be used directly for delta application. If the
> > non-raw content in _cache is used, it may corrupt delta application, or
> > return wrong content when deltas applied successfully but incorrectly, and a
> > flagprocessor says skip the hash check.
> >
> > This patch changes _cache assignments to make sure only only raw contents
> > are assigned to it.
> 
> This patch overall makes sense to me. We definitely want to only be 
> caching the same type of thing in a particular cache. I've reviewed 
> revlog.py to find places were self._cache is assigned and it looks like 
> you got both places where it's assigned to something other than null.
> 
> >
> > diff --git a/mercurial/revlog.py b/mercurial/revlog.py
> > --- a/mercurial/revlog.py
> > +++ b/mercurial/revlog.py
> > @@ -1284,11 +1284,11 @@ class revlog(object):
> >           text = mdiff.patches(text, bins)
> >   
> > -        text, validatehash = self._processflags(text, self.flags(rev), 'read',
> > -                                                raw=raw)
> > +        newtext, validatehash = self._processflags(text, self.flags(rev),
> > +                                                   'read', raw=raw)
> >           if validatehash:
> > -            self.checkhash(text, node, rev=rev)
> > +            self.checkhash(newtext, node, rev=rev)
> >   
> >           self._cache = (node, rev, text)
> > -        return text
> > +        return newtext
> 
> Rather than introducing newtext, wouldn't it be sufficient to move the 
> self._cache assignment line up to above the processflags call? Or do you 
> want to only cache if the checkhash call succeeds?
> 
> If the later, I'd suggest renaming "text" to "rawtext" and reducing 
> churn here. "newtext" doesn't really tell me about what is stored in it 
> ("processedtext" would but it much longer). "rawtext" is pretty short 
> and descriptive though.

Will rename to "rawtext".

> 
> >   
> >       def hash(self, text, p1, p2):
> > @@ -1715,5 +1715,6 @@ class revlog(object):
> >   
> >           if type(text) == str: # only accept immutable objects
> > -            self._cache = (node, curr, text)
> > +            if raw or not flags:
> > +                self._cache = (node, curr, text)
> >           self._chainbasecache[curr] = chainbase
> >           return node
> >
> 
> Ugh, this _addrevision code is incredibly dense. A 151-line "helper" 
> function with two sub-functions defined inside of it. What a joy!

This function is also broken in multiple other ways regarding on raw
handling. I'll send a separate patch.

> However, I think this change should also update the comment about "only 
> accept...". I'd structure it something like this:
> 
> # Only cache immutable objects objects that are raw or have no flags.
> # Mutable values are dangerous to cache because they can be changed
> # after they are passed to usor after they are returned from cache,
> # so we would need to deep copy them on both input and output which
> # could be expensive.
> # Strings are the only immutable objects we expect to see, to check for them
> # If raw is true, it means we are bypassing any flag processors, so we 
> have cacheable data.
> # If there are no flags set, that means no flag processors will be run, 
> so we also have cacheable data
> if type(text) == str and (raw or not flags):
>      self._cache = (node, curr, text)
> 
> For my own edification, what are the types that can show up here? Do we 
> accept arbitrary things that have string representations? When would we 
> get something that isn't a string here? I hate the lack of types here, 
> it leads to so much ambiguity about what is going on.
> 
> Any comments you can add to the code from what you have learned mucking 
> around here would be appreciated.

Thanks for the suggestion. This series looks relatively easy so I didn't
spend much time on adding comments. I'll do so for the "_addrevision"
change.

Patch

diff --git a/mercurial/revlog.py b/mercurial/revlog.py
--- a/mercurial/revlog.py
+++ b/mercurial/revlog.py
@@ -1284,11 +1284,11 @@  class revlog(object):
         text = mdiff.patches(text, bins)
 
-        text, validatehash = self._processflags(text, self.flags(rev), 'read',
-                                                raw=raw)
+        newtext, validatehash = self._processflags(text, self.flags(rev),
+                                                   'read', raw=raw)
         if validatehash:
-            self.checkhash(text, node, rev=rev)
+            self.checkhash(newtext, node, rev=rev)
 
         self._cache = (node, rev, text)
-        return text
+        return newtext
 
     def hash(self, text, p1, p2):
@@ -1715,5 +1715,6 @@  class revlog(object):
 
         if type(text) == str: # only accept immutable objects
-            self._cache = (node, curr, text)
+            if raw or not flags:
+                self._cache = (node, curr, text)
         self._chainbasecache[curr] = chainbase
         return node