Patchwork [1,of,3,RFC] mercurial: implement a source transforming module loader on Python 3

login
register
mail settings
Submitter Gregory Szorc
Date May 16, 2016, 4:02 a.m.
Message ID <7c5d1f8db9618f511f40.1463371371@ubuntu-vm-main>
Download mbox | patch
Permalink /patch/15145/
State RFC, archived
Headers show

Comments

Gregory Szorc - May 16, 2016, 4:02 a.m.
# HG changeset patch
# User Gregory Szorc <gregory.szorc@gmail.com>
# Date 1463370916 25200
#      Sun May 15 20:55:16 2016 -0700
# Node ID 7c5d1f8db9618f511f40bc4089145310671ca57b
# Parent  f8b87a779c87586aa043bcd6030369715edfc9c1
mercurial: implement a source transforming module loader on Python 3

The most painful part of ensuring Python code runs on both Python 2
and 3 is string encoding. Making this difficult is that string
literals in Python 2 are bytes and string literals in Python 3 are
unicode. So, to ensure consistent types are used, you have to
use "from __future__ import unicode_literals" and/or prefix literals
with their type (e.g. b'foo' or u'foo').

Nearly every string in Mercurial is bytes. So, to use the same source
code on both Python 2 and 3 would require prefixing nearly every
string literal with "b" to make it a byte literal. This is ugly and
not something mpm is willing to do.

This patch implements a custom module loader on Python 3 that performs
source transformation to convert string literals (unicode in Python 3)
to byte literals. In effect, it changes Python 3's string literals to
behave like Python 2's.

The module loader is only used on mercurial.* and hgext.* modules.

The loader works by tokenizing the loaded source and replacing
"string" tokens if necessary. The modified token stream is
untokenized back to source and loaded like normal. This does add some
overhead. However, this all occurs before caching. So .pyc files should
cache the version with byte literals.

This patch isn't suitable for checkin. There are a few deficiencies,
including that changes to the loader won't result in the cache
being invalidated. As part of testing this, I've had to manually
blow away __pycache__ directories. We'll likely need to hack up
cache checking as well so caching is invalidated when
mercurial/__init__.py changes. This is going to be ugly.
Gregory Szorc - May 16, 2016, 6:12 a.m.
On Sun, May 15, 2016 at 9:02 PM, Gregory Szorc <gregory.szorc@gmail.com>
wrote:

> # HG changeset patch
> # User Gregory Szorc <gregory.szorc@gmail.com>
> # Date 1463370916 25200
> #      Sun May 15 20:55:16 2016 -0700
> # Node ID 7c5d1f8db9618f511f40bc4089145310671ca57b
> # Parent  f8b87a779c87586aa043bcd6030369715edfc9c1
> mercurial: implement a source transforming module loader on Python 3
>
> The most painful part of ensuring Python code runs on both Python 2
> and 3 is string encoding. Making this difficult is that string
> literals in Python 2 are bytes and string literals in Python 3 are
> unicode. So, to ensure consistent types are used, you have to
> use "from __future__ import unicode_literals" and/or prefix literals
> with their type (e.g. b'foo' or u'foo').
>
> Nearly every string in Mercurial is bytes. So, to use the same source
> code on both Python 2 and 3 would require prefixing nearly every
> string literal with "b" to make it a byte literal. This is ugly and
> not something mpm is willing to do.
>
> This patch implements a custom module loader on Python 3 that performs
> source transformation to convert string literals (unicode in Python 3)
> to byte literals. In effect, it changes Python 3's string literals to
> behave like Python 2's.
>
> The module loader is only used on mercurial.* and hgext.* modules.
>
> The loader works by tokenizing the loaded source and replacing
> "string" tokens if necessary. The modified token stream is
> untokenized back to source and loaded like normal. This does add some
> overhead. However, this all occurs before caching. So .pyc files should
> cache the version with byte literals.
>
> This patch isn't suitable for checkin. There are a few deficiencies,
> including that changes to the loader won't result in the cache
> being invalidated. As part of testing this, I've had to manually
> blow away __pycache__ directories. We'll likely need to hack up
> cache checking as well so caching is invalidated when
> mercurial/__init__.py changes. This is going to be ugly.
>

Slightly more context for this patch.

I initially tried to implement things at the AST level. However, Python's
AST APIs are quite convoluted. lib2to3 has a bit nicer "framework" for
doing source transformations, but the API isn't stable. There are 3rd party
libraries, but I don't want to introduce the dependency.

After realizing AST manipulation would be too much work, I briefly looked
at the "parser" module. It was also a bit gnarly. So, I went to the next
lowest level - the tokenizer module - and realized a solution was pretty
trivial to implement. The cost is it might be too low-level and more
advanced rewriting in the future could be difficult. It might also be a bit
more expensive than AST transforms. But it should be "fast enough,"
especially with .pyc caching.

Now that we're operating at the source level and we're effectively changing
the source string in its entirety, it would be possible to reimplement this
source transformation as a "# coding:" hack as Jun suggested and I earlier
discounted (because I think we'd have to operate at the AST level). I
didn't realize the tokenizer module could facilitate what we needed. Given
the gotchas with hacking up module importing, a "# coding" hack seems
attractive again. Although, I think we still have the .pyc caching problem:
if we change features of the source rewriter, we need a way to tell the
module loader that the .pyc is invalid (.pyc validation looks at source
file size and mtime to check cache freshness). We'd need to use a custom
module loader on Python 3 to provide custom validation functionality. Or
we'd need to do something real ugly like rewrite the installed files to
have a randomly generated "# coding" value. Yuck.

Anyway, before I do any more work on this, I'd like feedback. I'm
optimistic mpm will like it because it feels like the least invasive
approach, even if it does require sprinkling some u'' around the source. We
might even be able to undo some of the iteritems() and xrange() rewrites
we've done...


>
> diff --git a/mercurial/__init__.py b/mercurial/__init__.py
> --- a/mercurial/__init__.py
> +++ b/mercurial/__init__.py
> @@ -139,14 +139,89 @@ class hgimporter(object):
>              if not modinfo:
>                  raise ImportError('could not find mercurial module %s' %
>                                    name)
>
>          mod = imp.load_module(name, *modinfo)
>          sys.modules[name] = mod
>          return mod
>
> +if sys.version_info[0] >= 3:
> +    from . import pure
> +    import importlib
> +    import io
> +    import token
> +    import tokenize
> +
> +    class hgpathentryfinder(importlib.abc.PathEntryFinder):
> +        """A sys.meta_path finder."""
> +        def find_spec(self, fullname, path, target=None):
> +            # Our custom loader rewrites source code and Python code
> +            # that doesn't belong to Mercurial doesn't expect this.
> +            if not fullname.startswith(('mercurial.', 'hgext.')):
> +                return None
> +
> +            # This assumes Python 3 doesn't support loading C modules.
> +            if fullname in _dualmodules:
> +                stem = fullname.split('.')[-1]
> +                fullname = 'mercurial.pure.%s' % stem
> +                target = pure
> +                assert len(path) == 1
> +                path = [os.path.join(path[0], 'pure')]
> +
> +            # Try to find the module using other registered finders.
> +            spec = None
> +            for finder in sys.meta_path:
> +                if finder == self:
> +                    continue
> +
> +                spec = finder.find_spec(fullname, path, target=target)
> +                if spec:
> +                    break
> +
> +            if not spec:
> +                return None
> +
> +            if fullname.startswith('mercurial.pure.'):
> +                spec.name = spec.name.replace('.pure.', '.')
> +
> +            # TODO need to support loaders from alternate specs, like zip
> +            # loaders.
> +            spec.loader = hgloader(spec.name, spec.origin)
> +            return spec
> +
> +    def replacetoken(t):
> +        if t.type == token.STRING:
> +            s = t.string
> +
> +            # If a docstring, keep it as a string literal.
> +            if s[0:3] in ("'''", '"""'):
> +                return t
> +
> +            if s[0] not in ("'", '"'):
> +                return t
> +
> +            # String literal. Prefix to make a b'' string.
> +            return tokenize.TokenInfo(t.type, 'b%s' % s, t.start, t.end,
> t.line)
> +
> +        return t
> +
> +    class hgloader(importlib.machinery.SourceFileLoader):
> +        """Custom module loader that transforms source code.
> +
> +        When the source code is converted to code, we first transform
> +        string literals to byte literals using the tokenize API.
> +        """
> +        def source_to_code(self, data, path):
> +            buf = io.BytesIO(data)
> +            tokens = tokenize.tokenize(buf.readline)
> +            data = tokenize.untokenize(replacetoken(t) for t in tokens)
> +            return super(hgloader, self).source_to_code(data, path)
> +
>  # We automagically register our custom importer as a side-effect of
> loading.
>  # This is necessary to ensure that any entry points are able to import
>  # mercurial.* modules without having to perform this registration
> themselves.
> -if not any(isinstance(x, hgimporter) for x in sys.meta_path):
> -    # meta_path is used before any implicit finders and before sys.path.
> -    sys.meta_path.insert(0, hgimporter())
> +if sys.version_info[0] >= 3:
> +    sys.meta_path.insert(0, hgpathentryfinder())
> +else:
> +    if not any(isinstance(x, hgimporter) for x in sys.meta_path):
> +        # meta_path is used before any implicit finders and before
> sys.path.
> +        sys.meta_path.insert(0, hgimporter())
>
timeless - May 16, 2016, 3:31 p.m.
Fwiw, We already need some cache invalidation. Switching between Python 2.6
and 2.7 results in really bad outcomes. :)
On May 16, 2016 12:03 AM, "Gregory Szorc" <gregory.szorc@gmail.com> wrote:

> # HG changeset patch
> # User Gregory Szorc <gregory.szorc@gmail.com>
> # Date 1463370916 25200
> #      Sun May 15 20:55:16 2016 -0700
> # Node ID 7c5d1f8db9618f511f40bc4089145310671ca57b
> # Parent  f8b87a779c87586aa043bcd6030369715edfc9c1
> mercurial: implement a source transforming module loader on Python 3
>
> The most painful part of ensuring Python code runs on both Python 2
> and 3 is string encoding. Making this difficult is that string
> literals in Python 2 are bytes and string literals in Python 3 are
> unicode. So, to ensure consistent types are used, you have to
> use "from __future__ import unicode_literals" and/or prefix literals
> with their type (e.g. b'foo' or u'foo').
>
> Nearly every string in Mercurial is bytes. So, to use the same source
> code on both Python 2 and 3 would require prefixing nearly every
> string literal with "b" to make it a byte literal. This is ugly and
> not something mpm is willing to do.
>
> This patch implements a custom module loader on Python 3 that performs
> source transformation to convert string literals (unicode in Python 3)
> to byte literals. In effect, it changes Python 3's string literals to
> behave like Python 2's.
>
> The module loader is only used on mercurial.* and hgext.* modules.
>
> The loader works by tokenizing the loaded source and replacing
> "string" tokens if necessary. The modified token stream is
> untokenized back to source and loaded like normal. This does add some
> overhead. However, this all occurs before caching. So .pyc files should
> cache the version with byte literals.
>
> This patch isn't suitable for checkin. There are a few deficiencies,
> including that changes to the loader won't result in the cache
> being invalidated. As part of testing this, I've had to manually
> blow away __pycache__ directories. We'll likely need to hack up
> cache checking as well so caching is invalidated when
> mercurial/__init__.py changes. This is going to be ugly.
>
> diff --git a/mercurial/__init__.py b/mercurial/__init__.py
> --- a/mercurial/__init__.py
> +++ b/mercurial/__init__.py
> @@ -139,14 +139,89 @@ class hgimporter(object):
>              if not modinfo:
>                  raise ImportError('could not find mercurial module %s' %
>                                    name)
>
>          mod = imp.load_module(name, *modinfo)
>          sys.modules[name] = mod
>          return mod
>
> +if sys.version_info[0] >= 3:
> +    from . import pure
> +    import importlib
> +    import io
> +    import token
> +    import tokenize
> +
> +    class hgpathentryfinder(importlib.abc.PathEntryFinder):
> +        """A sys.meta_path finder."""
> +        def find_spec(self, fullname, path, target=None):
> +            # Our custom loader rewrites source code and Python code
> +            # that doesn't belong to Mercurial doesn't expect this.
> +            if not fullname.startswith(('mercurial.', 'hgext.')):
> +                return None
> +
> +            # This assumes Python 3 doesn't support loading C modules.
> +            if fullname in _dualmodules:
> +                stem = fullname.split('.')[-1]
> +                fullname = 'mercurial.pure.%s' % stem
> +                target = pure
> +                assert len(path) == 1
> +                path = [os.path.join(path[0], 'pure')]
> +
> +            # Try to find the module using other registered finders.
> +            spec = None
> +            for finder in sys.meta_path:
> +                if finder == self:
> +                    continue
> +
> +                spec = finder.find_spec(fullname, path, target=target)
> +                if spec:
> +                    break
> +
> +            if not spec:
> +                return None
> +
> +            if fullname.startswith('mercurial.pure.'):
> +                spec.name = spec.name.replace('.pure.', '.')
> +
> +            # TODO need to support loaders from alternate specs, like zip
> +            # loaders.
> +            spec.loader = hgloader(spec.name, spec.origin)
> +            return spec
> +
> +    def replacetoken(t):
> +        if t.type == token.STRING:
> +            s = t.string
> +
> +            # If a docstring, keep it as a string literal.
> +            if s[0:3] in ("'''", '"""'):
> +                return t
> +
> +            if s[0] not in ("'", '"'):
> +                return t
> +
> +            # String literal. Prefix to make a b'' string.
> +            return tokenize.TokenInfo(t.type, 'b%s' % s, t.start, t.end,
> t.line)
> +
> +        return t
> +
> +    class hgloader(importlib.machinery.SourceFileLoader):
> +        """Custom module loader that transforms source code.
> +
> +        When the source code is converted to code, we first transform
> +        string literals to byte literals using the tokenize API.
> +        """
> +        def source_to_code(self, data, path):
> +            buf = io.BytesIO(data)
> +            tokens = tokenize.tokenize(buf.readline)
> +            data = tokenize.untokenize(replacetoken(t) for t in tokens)
> +            return super(hgloader, self).source_to_code(data, path)
> +
>  # We automagically register our custom importer as a side-effect of
> loading.
>  # This is necessary to ensure that any entry points are able to import
>  # mercurial.* modules without having to perform this registration
> themselves.
> -if not any(isinstance(x, hgimporter) for x in sys.meta_path):
> -    # meta_path is used before any implicit finders and before sys.path.
> -    sys.meta_path.insert(0, hgimporter())
> +if sys.version_info[0] >= 3:
> +    sys.meta_path.insert(0, hgpathentryfinder())
> +else:
> +    if not any(isinstance(x, hgimporter) for x in sys.meta_path):
> +        # meta_path is used before any implicit finders and before
> sys.path.
> +        sys.meta_path.insert(0, hgimporter())
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@mercurial-scm.org
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
>
Simon King - May 16, 2016, 3:43 p.m.
I don't think that's supposed to happen, is it? Python should
automatically invalidate .pyc files based on a magic number that
changes when the format changes:

https://hg.python.org/cpython/file/2.7/Python/import.c#l31

Simon

On Mon, May 16, 2016 at 4:31 PM, timeless <timeless@gmail.com> wrote:
> Fwiw, We already need some cache invalidation. Switching between Python 2.6
> and 2.7 results in really bad outcomes. :)
>
> On May 16, 2016 12:03 AM, "Gregory Szorc" <gregory.szorc@gmail.com> wrote:
>>
>> # HG changeset patch
>> # User Gregory Szorc <gregory.szorc@gmail.com>
>> # Date 1463370916 25200
>> #      Sun May 15 20:55:16 2016 -0700
>> # Node ID 7c5d1f8db9618f511f40bc4089145310671ca57b
>> # Parent  f8b87a779c87586aa043bcd6030369715edfc9c1
>> mercurial: implement a source transforming module loader on Python 3
>>
>> The most painful part of ensuring Python code runs on both Python 2
>> and 3 is string encoding. Making this difficult is that string
>> literals in Python 2 are bytes and string literals in Python 3 are
>> unicode. So, to ensure consistent types are used, you have to
>> use "from __future__ import unicode_literals" and/or prefix literals
>> with their type (e.g. b'foo' or u'foo').
>>
>> Nearly every string in Mercurial is bytes. So, to use the same source
>> code on both Python 2 and 3 would require prefixing nearly every
>> string literal with "b" to make it a byte literal. This is ugly and
>> not something mpm is willing to do.
>>
>> This patch implements a custom module loader on Python 3 that performs
>> source transformation to convert string literals (unicode in Python 3)
>> to byte literals. In effect, it changes Python 3's string literals to
>> behave like Python 2's.
>>
>> The module loader is only used on mercurial.* and hgext.* modules.
>>
>> The loader works by tokenizing the loaded source and replacing
>> "string" tokens if necessary. The modified token stream is
>> untokenized back to source and loaded like normal. This does add some
>> overhead. However, this all occurs before caching. So .pyc files should
>> cache the version with byte literals.
>>
>> This patch isn't suitable for checkin. There are a few deficiencies,
>> including that changes to the loader won't result in the cache
>> being invalidated. As part of testing this, I've had to manually
>> blow away __pycache__ directories. We'll likely need to hack up
>> cache checking as well so caching is invalidated when
>> mercurial/__init__.py changes. This is going to be ugly.
>>
>> diff --git a/mercurial/__init__.py b/mercurial/__init__.py
>> --- a/mercurial/__init__.py
>> +++ b/mercurial/__init__.py
>> @@ -139,14 +139,89 @@ class hgimporter(object):
>>              if not modinfo:
>>                  raise ImportError('could not find mercurial module %s' %
>>                                    name)
>>
>>          mod = imp.load_module(name, *modinfo)
>>          sys.modules[name] = mod
>>          return mod
>>
>> +if sys.version_info[0] >= 3:
>> +    from . import pure
>> +    import importlib
>> +    import io
>> +    import token
>> +    import tokenize
>> +
>> +    class hgpathentryfinder(importlib.abc.PathEntryFinder):
>> +        """A sys.meta_path finder."""
>> +        def find_spec(self, fullname, path, target=None):
>> +            # Our custom loader rewrites source code and Python code
>> +            # that doesn't belong to Mercurial doesn't expect this.
>> +            if not fullname.startswith(('mercurial.', 'hgext.')):
>> +                return None
>> +
>> +            # This assumes Python 3 doesn't support loading C modules.
>> +            if fullname in _dualmodules:
>> +                stem = fullname.split('.')[-1]
>> +                fullname = 'mercurial.pure.%s' % stem
>> +                target = pure
>> +                assert len(path) == 1
>> +                path = [os.path.join(path[0], 'pure')]
>> +
>> +            # Try to find the module using other registered finders.
>> +            spec = None
>> +            for finder in sys.meta_path:
>> +                if finder == self:
>> +                    continue
>> +
>> +                spec = finder.find_spec(fullname, path, target=target)
>> +                if spec:
>> +                    break
>> +
>> +            if not spec:
>> +                return None
>> +
>> +            if fullname.startswith('mercurial.pure.'):
>> +                spec.name = spec.name.replace('.pure.', '.')
>> +
>> +            # TODO need to support loaders from alternate specs, like zip
>> +            # loaders.
>> +            spec.loader = hgloader(spec.name, spec.origin)
>> +            return spec
>> +
>> +    def replacetoken(t):
>> +        if t.type == token.STRING:
>> +            s = t.string
>> +
>> +            # If a docstring, keep it as a string literal.
>> +            if s[0:3] in ("'''", '"""'):
>> +                return t
>> +
>> +            if s[0] not in ("'", '"'):
>> +                return t
>> +
>> +            # String literal. Prefix to make a b'' string.
>> +            return tokenize.TokenInfo(t.type, 'b%s' % s, t.start, t.end,
>> t.line)
>> +
>> +        return t
>> +
>> +    class hgloader(importlib.machinery.SourceFileLoader):
>> +        """Custom module loader that transforms source code.
>> +
>> +        When the source code is converted to code, we first transform
>> +        string literals to byte literals using the tokenize API.
>> +        """
>> +        def source_to_code(self, data, path):
>> +            buf = io.BytesIO(data)
>> +            tokens = tokenize.tokenize(buf.readline)
>> +            data = tokenize.untokenize(replacetoken(t) for t in tokens)
>> +            return super(hgloader, self).source_to_code(data, path)
>> +
>>  # We automagically register our custom importer as a side-effect of
>> loading.
>>  # This is necessary to ensure that any entry points are able to import
>>  # mercurial.* modules without having to perform this registration
>> themselves.
>> -if not any(isinstance(x, hgimporter) for x in sys.meta_path):
>> -    # meta_path is used before any implicit finders and before sys.path.
>> -    sys.meta_path.insert(0, hgimporter())
>> +if sys.version_info[0] >= 3:
>> +    sys.meta_path.insert(0, hgpathentryfinder())
>> +else:
>> +    if not any(isinstance(x, hgimporter) for x in sys.meta_path):
>> +        # meta_path is used before any implicit finders and before
>> sys.path.
>> +        sys.meta_path.insert(0, hgimporter())
>> _______________________________________________
>> Mercurial-devel mailing list
>> Mercurial-devel@mercurial-scm.org
>> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
>
>
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@mercurial-scm.org
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
>
Gregory Szorc - May 16, 2016, 3:50 p.m.
> On May 16, 2016, at 08:43, Simon King <simon@simonking.org.uk> wrote:
> 
> I don't think that's supposed to happen, is it? Python should
> automatically invalidate .pyc files based on a magic number that
> changes when the format changes:
> 
> https://hg.python.org/cpython/file/2.7/Python/import.c#l31

The problem is we're inserting code between file reading and code generation - code that Python's importers don't know about. Python assumes file content gets converted into code in a deterministic manner that is defined by the Python distribution itself. We're outside of that, so we need to provide our own cache validation checking.

> 
>> On Mon, May 16, 2016 at 4:31 PM, timeless <timeless@gmail.com> wrote:
>> Fwiw, We already need some cache invalidation. Switching between Python 2.6
>> and 2.7 results in really bad outcomes. :)
>> 
>>> On May 16, 2016 12:03 AM, "Gregory Szorc" <gregory.szorc@gmail.com> wrote:
>>> 
>>> # HG changeset patch
>>> # User Gregory Szorc <gregory.szorc@gmail.com>
>>> # Date 1463370916 25200
>>> #      Sun May 15 20:55:16 2016 -0700
>>> # Node ID 7c5d1f8db9618f511f40bc4089145310671ca57b
>>> # Parent  f8b87a779c87586aa043bcd6030369715edfc9c1
>>> mercurial: implement a source transforming module loader on Python 3
>>> 
>>> The most painful part of ensuring Python code runs on both Python 2
>>> and 3 is string encoding. Making this difficult is that string
>>> literals in Python 2 are bytes and string literals in Python 3 are
>>> unicode. So, to ensure consistent types are used, you have to
>>> use "from __future__ import unicode_literals" and/or prefix literals
>>> with their type (e.g. b'foo' or u'foo').
>>> 
>>> Nearly every string in Mercurial is bytes. So, to use the same source
>>> code on both Python 2 and 3 would require prefixing nearly every
>>> string literal with "b" to make it a byte literal. This is ugly and
>>> not something mpm is willing to do.
>>> 
>>> This patch implements a custom module loader on Python 3 that performs
>>> source transformation to convert string literals (unicode in Python 3)
>>> to byte literals. In effect, it changes Python 3's string literals to
>>> behave like Python 2's.
>>> 
>>> The module loader is only used on mercurial.* and hgext.* modules.
>>> 
>>> The loader works by tokenizing the loaded source and replacing
>>> "string" tokens if necessary. The modified token stream is
>>> untokenized back to source and loaded like normal. This does add some
>>> overhead. However, this all occurs before caching. So .pyc files should
>>> cache the version with byte literals.
>>> 
>>> This patch isn't suitable for checkin. There are a few deficiencies,
>>> including that changes to the loader won't result in the cache
>>> being invalidated. As part of testing this, I've had to manually
>>> blow away __pycache__ directories. We'll likely need to hack up
>>> cache checking as well so caching is invalidated when
>>> mercurial/__init__.py changes. This is going to be ugly.
>>> 
>>> diff --git a/mercurial/__init__.py b/mercurial/__init__.py
>>> --- a/mercurial/__init__.py
>>> +++ b/mercurial/__init__.py
>>> @@ -139,14 +139,89 @@ class hgimporter(object):
>>>             if not modinfo:
>>>                 raise ImportError('could not find mercurial module %s' %
>>>                                   name)
>>> 
>>>         mod = imp.load_module(name, *modinfo)
>>>         sys.modules[name] = mod
>>>         return mod
>>> 
>>> +if sys.version_info[0] >= 3:
>>> +    from . import pure
>>> +    import importlib
>>> +    import io
>>> +    import token
>>> +    import tokenize
>>> +
>>> +    class hgpathentryfinder(importlib.abc.PathEntryFinder):
>>> +        """A sys.meta_path finder."""
>>> +        def find_spec(self, fullname, path, target=None):
>>> +            # Our custom loader rewrites source code and Python code
>>> +            # that doesn't belong to Mercurial doesn't expect this.
>>> +            if not fullname.startswith(('mercurial.', 'hgext.')):
>>> +                return None
>>> +
>>> +            # This assumes Python 3 doesn't support loading C modules.
>>> +            if fullname in _dualmodules:
>>> +                stem = fullname.split('.')[-1]
>>> +                fullname = 'mercurial.pure.%s' % stem
>>> +                target = pure
>>> +                assert len(path) == 1
>>> +                path = [os.path.join(path[0], 'pure')]
>>> +
>>> +            # Try to find the module using other registered finders.
>>> +            spec = None
>>> +            for finder in sys.meta_path:
>>> +                if finder == self:
>>> +                    continue
>>> +
>>> +                spec = finder.find_spec(fullname, path, target=target)
>>> +                if spec:
>>> +                    break
>>> +
>>> +            if not spec:
>>> +                return None
>>> +
>>> +            if fullname.startswith('mercurial.pure.'):
>>> +                spec.name = spec.name.replace('.pure.', '.')
>>> +
>>> +            # TODO need to support loaders from alternate specs, like zip
>>> +            # loaders.
>>> +            spec.loader = hgloader(spec.name, spec.origin)
>>> +            return spec
>>> +
>>> +    def replacetoken(t):
>>> +        if t.type == token.STRING:
>>> +            s = t.string
>>> +
>>> +            # If a docstring, keep it as a string literal.
>>> +            if s[0:3] in ("'''", '"""'):
>>> +                return t
>>> +
>>> +            if s[0] not in ("'", '"'):
>>> +                return t
>>> +
>>> +            # String literal. Prefix to make a b'' string.
>>> +            return tokenize.TokenInfo(t.type, 'b%s' % s, t.start, t.end,
>>> t.line)
>>> +
>>> +        return t
>>> +
>>> +    class hgloader(importlib.machinery.SourceFileLoader):
>>> +        """Custom module loader that transforms source code.
>>> +
>>> +        When the source code is converted to code, we first transform
>>> +        string literals to byte literals using the tokenize API.
>>> +        """
>>> +        def source_to_code(self, data, path):
>>> +            buf = io.BytesIO(data)
>>> +            tokens = tokenize.tokenize(buf.readline)
>>> +            data = tokenize.untokenize(replacetoken(t) for t in tokens)
>>> +            return super(hgloader, self).source_to_code(data, path)
>>> +
>>> # We automagically register our custom importer as a side-effect of
>>> loading.
>>> # This is necessary to ensure that any entry points are able to import
>>> # mercurial.* modules without having to perform this registration
>>> themselves.
>>> -if not any(isinstance(x, hgimporter) for x in sys.meta_path):
>>> -    # meta_path is used before any implicit finders and before sys.path.
>>> -    sys.meta_path.insert(0, hgimporter())
>>> +if sys.version_info[0] >= 3:
>>> +    sys.meta_path.insert(0, hgpathentryfinder())
>>> +else:
>>> +    if not any(isinstance(x, hgimporter) for x in sys.meta_path):
>>> +        # meta_path is used before any implicit finders and before
>>> sys.path.
>>> +        sys.meta_path.insert(0, hgimporter())
>>> _______________________________________________
>>> Mercurial-devel mailing list
>>> Mercurial-devel@mercurial-scm.org
>>> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
>> 
>> 
>> _______________________________________________
>> Mercurial-devel mailing list
>> Mercurial-devel@mercurial-scm.org
>> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
>>
Simon King - May 16, 2016, 4:13 p.m.
On Mon, May 16, 2016 at 4:50 PM, Gregory Szorc <gregory.szorc@gmail.com> wrote:
>
>
>> On May 16, 2016, at 08:43, Simon King <simon@simonking.org.uk> wrote:
>>
>> I don't think that's supposed to happen, is it? Python should
>> automatically invalidate .pyc files based on a magic number that
>> changes when the format changes:
>>
>> https://hg.python.org/cpython/file/2.7/Python/import.c#l31
>
> The problem is we're inserting code between file reading and code generation - code that Python's importers don't know about. Python assumes file content gets converted into code in a deterministic manner that is defined by the Python distribution itself. We're outside of that, so we need to provide our own cache validation checking.
>

Yes, sorry, my reply was to timeless' comment about switching between
Py2.6 and Py2.7 causing problems. Invalidating pyc files when we
update this encoding hack is certainly more complicated.

The converter could insert a module-level constant into each rewritten
file indicating the converter version that was used, but I guess you'd
need a custom module loader to inspect those files and compare that
constant with the *current* version.

Simon
timeless - May 16, 2016, 4:29 p.m.
Simon King wrote:
> I don't think that's supposed to happen, is it?

Dunno. I don't claim to be a python expert.

I do actively switch between at least py26 py27 py3 and to a lesser
extent pypy and some other flavors. So I hit all sorts of lousy edges.

> Python should automatically invalidate .pyc files based on a magic number that
> changes when the format changes:
>
> https://hg.python.org/cpython/file/2.7/Python/import.c#l31

(py26)[timeless@gcc2-power8 crew]$ make local
python setup.py  \
  build_py -c -d . \
  build_ext  -i \
  build_hgexe  -i \
  build_mo
running build_py
byte-compiling ./mercurial/byterange.py to byterange.pyc
byte-compiling ./mercurial/archival.py to archival.pyc
byte-compiling ./mercurial/crecord.py to crecord.pyc
byte-compiling ./mercurial/windows.py to windows.pyc
byte-compiling ./mercurial/lsprof.py to lsprof.pyc
byte-compiling ./mercurial/similar.py to similar.pyc
byte-compiling ./mercurial/scmwindows.py to scmwindows.pyc
byte-compiling ./mercurial/verify.py to verify.pyc
byte-compiling ./mercurial/py3kcompat.py to py3kcompat.pyc
byte-compiling ./mercurial/win32.py to win32.pyc
byte-compiling ./mercurial/__modulepolicy__.py to __modulepolicy__.pyc
byte-compiling ./mercurial/hgweb/webutil.py to webutil.pyc
byte-compiling ./mercurial/hgweb/wsgicgi.py to wsgicgi.pyc
byte-compiling ./mercurial/httpclient/socketutil.py to socketutil.pyc
byte-compiling ./hgext/convert/subversion.py to subversion.pyc
byte-compiling ./hgext/convert/convcmd.py to convcmd.pyc
byte-compiling ./hgext/convert/hg.py to hg.pyc
byte-compiling ./hgext/convert/cvsps.py to cvsps.pyc
byte-compiling ./hgext/convert/bzr.py to bzr.pyc
byte-compiling ./hgext/convert/filemap.py to filemap.pyc
byte-compiling ./hgext/convert/common.py to common.pyc
byte-compiling ./hgext/convert/gnuarch.py to gnuarch.pyc
byte-compiling ./hgext/convert/cvs.py to cvs.pyc
byte-compiling ./hgext/convert/monotone.py to monotone.pyc
byte-compiling ./hgext/fsmonitor/pywatchman/pybser.py to pybser.pyc
byte-compiling ./hgext/largefiles/wirestore.py to wirestore.pyc
byte-compiling ./hgext/largefiles/basestore.py to basestore.pyc
byte-compiling ./hgext/largefiles/lfutil.py to lfutil.pyc
byte-compiling ./hgext/largefiles/localstore.py to localstore.pyc
byte-compiling ./hgext/largefiles/proto.py to proto.pyc
byte-compiling ./hgext/largefiles/remotestore.py to remotestore.pyc
running build_ext
building 'mercurial.bdiff' extension
creating build/temp.linux-ppc64le-2.6
creating build/temp.linux-ppc64le-2.6/mercurial
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall
-Wstrict-prototypes -fPIC
-I/home/timeless/.localpython26/include/python2.6 -c mercurial/bdiff.c
-o build/temp.linux-ppc64le-2.6/mercurial/bdiff.o
In file included from
/home/timeless/.localpython26/include/python2.6/Python.h:125:0,
                 from mercurial/bdiff.c:13:
/home/timeless/.localpython26/include/python2.6/modsupport.h:27:1:
warning: '_PyArg_ParseTuple_SizeT' is an unrecognized format function
type [-Wformat=]
 PyAPI_FUNC(int) PyArg_ParseTuple(PyObject *, const char *, ...)
Py_FORMAT_PARSETUPLE(PyArg_ParseTuple, 2, 3);
 ^
gcc -pthread -shared build/temp.linux-ppc64le-2.6/mercurial/bdiff.o -o
/home/timeless/hg/crew/mercurial/bdiff.so
running build_hgexe
running build_mo
env HGRCPATH= python hg version
Traceback (most recent call last):
  File "hg", line 41, in <module>
    mercurial.util.setbinary(fp)
  File "/home/timeless/hg/crew/mercurial/demandimport.py", line 130,
in __getattribute__
    self._load()
  File "/home/timeless/hg/crew/mercurial/demandimport.py", line 96, in _load
    mod = _hgextimport(_import, head, globals, locals, None, level)
  File "/home/timeless/hg/crew/mercurial/demandimport.py", line 53, in
_hgextimport
    return importfunc(name, globals, *args, **kwargs)
  File "/home/timeless/hg/crew/mercurial/util.py", line 2636, in <module>
    if safehasattr(parsers, 'dirs'):
  File "/home/timeless/hg/crew/mercurial/util.py", line 136, in safehasattr
    return getattr(thing, attr, _notset) is not _notset
  File "/home/timeless/hg/crew/mercurial/demandimport.py", line 130,
in __getattribute__
    self._load()
  File "/home/timeless/hg/crew/mercurial/demandimport.py", line 96, in _load
    mod = _hgextimport(_import, head, globals, locals, None, level)
  File "/home/timeless/hg/crew/mercurial/demandimport.py", line 53, in
_hgextimport
    return importfunc(name, globals, *args, **kwargs)
  File "/home/timeless/hg/crew/mercurial/__init__.py", line 143, in load_module
    mod = imp.load_module(name, *modinfo)
ImportError: Python minor version mismatch: The Mercurial extension
modules were compiled with Python 2.7.8, but Mercurial is currently
using Python with sys.hexversion=33950192: Python 2.6.9 (unknown, Apr
13 2016, 12:40:12)
[GCC 4.9.2 20141101 (Red Hat 4.9.2-1)]
 at: /home/timeless/hg/py26/bin/python
Makefile:47: recipe for target 'local' failed
make: *** [local] Error 1

I expect someone to hand waive this as a "build system issue", or as a
"compiled library problem". But from my perspective, switching between
python versions is not fun.
Gregory Szorc - May 16, 2016, 4:56 p.m.
On Mon, May 16, 2016 at 9:13 AM, Simon King <simon@simonking.org.uk> wrote:

> On Mon, May 16, 2016 at 4:50 PM, Gregory Szorc <gregory.szorc@gmail.com>
> wrote:
> >
> >
> >> On May 16, 2016, at 08:43, Simon King <simon@simonking.org.uk> wrote:
> >>
> >> I don't think that's supposed to happen, is it? Python should
> >> automatically invalidate .pyc files based on a magic number that
> >> changes when the format changes:
> >>
> >> https://hg.python.org/cpython/file/2.7/Python/import.c#l31
> >
> > The problem is we're inserting code between file reading and code
> generation - code that Python's importers don't know about. Python assumes
> file content gets converted into code in a deterministic manner that is
> defined by the Python distribution itself. We're outside of that, so we
> need to provide our own cache validation checking.
> >
>
> Yes, sorry, my reply was to timeless' comment about switching between
> Py2.6 and Py2.7 causing problems. Invalidating pyc files when we
> update this encoding hack is certainly more complicated.
>

Email on mobile fail on my part. Sorry about that.


>
> The converter could insert a module-level constant into each rewritten
> file indicating the converter version that was used, but I guess you'd
> need a custom module loader to inspect those files and compare that
> constant with the *current* version.


Yes it would. Fortunately, this isn't too bad in Python 3.5 because the
module import code is all written in Python which means you can create
child classes that provide a custom implementation of just the function you
need. You can even define the path to use for the .pyc file in the module
spec instances. We could e.g. hash the content of mercurial/__init__.py and
put that in the cache path. Lots of options.
Simon King - May 17, 2016, 9:06 a.m.
On Mon, May 16, 2016 at 5:29 PM, timeless <timeless@gmail.com> wrote:
> Simon King wrote:
>> I don't think that's supposed to happen, is it?
>
> Dunno. I don't claim to be a python expert.
>
> I do actively switch between at least py26 py27 py3 and to a lesser
> extent pypy and some other flavors. So I hit all sorts of lousy edges.
>
>> Python should automatically invalidate .pyc files based on a magic number that
>> changes when the format changes:
>>
>> https://hg.python.org/cpython/file/2.7/Python/import.c#l31
>
> (py26)[timeless@gcc2-power8 crew]$ make local
> python setup.py  \
>   build_py -c -d . \
>   build_ext  -i \
>   build_hgexe  -i \
>   build_mo
> running build_py
> byte-compiling ./mercurial/byterange.py to byterange.pyc
> byte-compiling ./mercurial/archival.py to archival.pyc
> byte-compiling ./mercurial/crecord.py to crecord.pyc
> byte-compiling ./mercurial/windows.py to windows.pyc
> byte-compiling ./mercurial/lsprof.py to lsprof.pyc
> byte-compiling ./mercurial/similar.py to similar.pyc
> byte-compiling ./mercurial/scmwindows.py to scmwindows.pyc
> byte-compiling ./mercurial/verify.py to verify.pyc
> byte-compiling ./mercurial/py3kcompat.py to py3kcompat.pyc
> byte-compiling ./mercurial/win32.py to win32.pyc
> byte-compiling ./mercurial/__modulepolicy__.py to __modulepolicy__.pyc
> byte-compiling ./mercurial/hgweb/webutil.py to webutil.pyc
> byte-compiling ./mercurial/hgweb/wsgicgi.py to wsgicgi.pyc
> byte-compiling ./mercurial/httpclient/socketutil.py to socketutil.pyc
> byte-compiling ./hgext/convert/subversion.py to subversion.pyc
> byte-compiling ./hgext/convert/convcmd.py to convcmd.pyc
> byte-compiling ./hgext/convert/hg.py to hg.pyc
> byte-compiling ./hgext/convert/cvsps.py to cvsps.pyc
> byte-compiling ./hgext/convert/bzr.py to bzr.pyc
> byte-compiling ./hgext/convert/filemap.py to filemap.pyc
> byte-compiling ./hgext/convert/common.py to common.pyc
> byte-compiling ./hgext/convert/gnuarch.py to gnuarch.pyc
> byte-compiling ./hgext/convert/cvs.py to cvs.pyc
> byte-compiling ./hgext/convert/monotone.py to monotone.pyc
> byte-compiling ./hgext/fsmonitor/pywatchman/pybser.py to pybser.pyc
> byte-compiling ./hgext/largefiles/wirestore.py to wirestore.pyc
> byte-compiling ./hgext/largefiles/basestore.py to basestore.pyc
> byte-compiling ./hgext/largefiles/lfutil.py to lfutil.pyc
> byte-compiling ./hgext/largefiles/localstore.py to localstore.pyc
> byte-compiling ./hgext/largefiles/proto.py to proto.pyc
> byte-compiling ./hgext/largefiles/remotestore.py to remotestore.pyc
> running build_ext
> building 'mercurial.bdiff' extension
> creating build/temp.linux-ppc64le-2.6
> creating build/temp.linux-ppc64le-2.6/mercurial
> gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall
> -Wstrict-prototypes -fPIC
> -I/home/timeless/.localpython26/include/python2.6 -c mercurial/bdiff.c
> -o build/temp.linux-ppc64le-2.6/mercurial/bdiff.o
> In file included from
> /home/timeless/.localpython26/include/python2.6/Python.h:125:0,
>                  from mercurial/bdiff.c:13:
> /home/timeless/.localpython26/include/python2.6/modsupport.h:27:1:
> warning: '_PyArg_ParseTuple_SizeT' is an unrecognized format function
> type [-Wformat=]
>  PyAPI_FUNC(int) PyArg_ParseTuple(PyObject *, const char *, ...)
> Py_FORMAT_PARSETUPLE(PyArg_ParseTuple, 2, 3);
>  ^
> gcc -pthread -shared build/temp.linux-ppc64le-2.6/mercurial/bdiff.o -o
> /home/timeless/hg/crew/mercurial/bdiff.so
> running build_hgexe
> running build_mo
> env HGRCPATH= python hg version
> Traceback (most recent call last):
>   File "hg", line 41, in <module>
>     mercurial.util.setbinary(fp)
>   File "/home/timeless/hg/crew/mercurial/demandimport.py", line 130,
> in __getattribute__
>     self._load()
>   File "/home/timeless/hg/crew/mercurial/demandimport.py", line 96, in _load
>     mod = _hgextimport(_import, head, globals, locals, None, level)
>   File "/home/timeless/hg/crew/mercurial/demandimport.py", line 53, in
> _hgextimport
>     return importfunc(name, globals, *args, **kwargs)
>   File "/home/timeless/hg/crew/mercurial/util.py", line 2636, in <module>
>     if safehasattr(parsers, 'dirs'):
>   File "/home/timeless/hg/crew/mercurial/util.py", line 136, in safehasattr
>     return getattr(thing, attr, _notset) is not _notset
>   File "/home/timeless/hg/crew/mercurial/demandimport.py", line 130,
> in __getattribute__
>     self._load()
>   File "/home/timeless/hg/crew/mercurial/demandimport.py", line 96, in _load
>     mod = _hgextimport(_import, head, globals, locals, None, level)
>   File "/home/timeless/hg/crew/mercurial/demandimport.py", line 53, in
> _hgextimport
>     return importfunc(name, globals, *args, **kwargs)
>   File "/home/timeless/hg/crew/mercurial/__init__.py", line 143, in load_module
>     mod = imp.load_module(name, *modinfo)
> ImportError: Python minor version mismatch: The Mercurial extension
> modules were compiled with Python 2.7.8, but Mercurial is currently
> using Python with sys.hexversion=33950192: Python 2.6.9 (unknown, Apr
> 13 2016, 12:40:12)
> [GCC 4.9.2 20141101 (Red Hat 4.9.2-1)]
>  at: /home/timeless/hg/py26/bin/python
> Makefile:47: recipe for target 'local' failed
> make: *** [local] Error 1
>
> I expect someone to hand waive this as a "build system issue", or as a
> "compiled library problem". But from my perspective, switching between
> python versions is not fun.

Hmm, yes, I see what you mean. It's not a problem with .pyc files, and
doesn't really have any relation to these patches, but it's clearly a
problem. And indeed <handwaive>it does appear to be a build system
issue</handwaive> - the mercurial/*.so files are
python-version-dependent, but they don't get rebuilt when you run
"make local" with a different python version.

As a workaround, perhaps you could "rm mercurial/*.so; make local"?

Simon
Matt Mackall - June 10, 2016, 7:42 p.m.
On Mon, 2016-05-16 at 09:29 -0700, timeless wrote:
> Simon King wrote:
> > 
> > I don't think that's supposed to happen, is it?
> Dunno. I don't claim to be a python expert.

> ImportError: Python minor version mismatch: The Mercurial extension
> modules were compiled with Python 2.7.8, but Mercurial is currently
> using Python with sys.hexversion=33950192: Python 2.6.9 (unknown, Apr
> 13 2016, 12:40:12)
> [GCC 4.9.2 20141101 (Red Hat 4.9.2-1)]
>  at: /home/timeless/hg/py26/bin/python

This is not a message from Python. You can tell because it knows what a
"Mercurial" is.

https://www.selenic.com/hg/rev/3681de20b0a7

-- 
Mathematics is the supreme nostalgia of our time.

Patch

diff --git a/mercurial/__init__.py b/mercurial/__init__.py
--- a/mercurial/__init__.py
+++ b/mercurial/__init__.py
@@ -139,14 +139,89 @@  class hgimporter(object):
             if not modinfo:
                 raise ImportError('could not find mercurial module %s' %
                                   name)
 
         mod = imp.load_module(name, *modinfo)
         sys.modules[name] = mod
         return mod
 
+if sys.version_info[0] >= 3:
+    from . import pure
+    import importlib
+    import io
+    import token
+    import tokenize
+
+    class hgpathentryfinder(importlib.abc.PathEntryFinder):
+        """A sys.meta_path finder."""
+        def find_spec(self, fullname, path, target=None):
+            # Our custom loader rewrites source code and Python code
+            # that doesn't belong to Mercurial doesn't expect this.
+            if not fullname.startswith(('mercurial.', 'hgext.')):
+                return None
+
+            # This assumes Python 3 doesn't support loading C modules.
+            if fullname in _dualmodules:
+                stem = fullname.split('.')[-1]
+                fullname = 'mercurial.pure.%s' % stem
+                target = pure
+                assert len(path) == 1
+                path = [os.path.join(path[0], 'pure')]
+
+            # Try to find the module using other registered finders.
+            spec = None
+            for finder in sys.meta_path:
+                if finder == self:
+                    continue
+
+                spec = finder.find_spec(fullname, path, target=target)
+                if spec:
+                    break
+
+            if not spec:
+                return None
+
+            if fullname.startswith('mercurial.pure.'):
+                spec.name = spec.name.replace('.pure.', '.')
+
+            # TODO need to support loaders from alternate specs, like zip
+            # loaders.
+            spec.loader = hgloader(spec.name, spec.origin)
+            return spec
+
+    def replacetoken(t):
+        if t.type == token.STRING:
+            s = t.string
+
+            # If a docstring, keep it as a string literal.
+            if s[0:3] in ("'''", '"""'):
+                return t
+
+            if s[0] not in ("'", '"'):
+                return t
+
+            # String literal. Prefix to make a b'' string.
+            return tokenize.TokenInfo(t.type, 'b%s' % s, t.start, t.end, t.line)
+
+        return t
+
+    class hgloader(importlib.machinery.SourceFileLoader):
+        """Custom module loader that transforms source code.
+
+        When the source code is converted to code, we first transform
+        string literals to byte literals using the tokenize API.
+        """
+        def source_to_code(self, data, path):
+            buf = io.BytesIO(data)
+            tokens = tokenize.tokenize(buf.readline)
+            data = tokenize.untokenize(replacetoken(t) for t in tokens)
+            return super(hgloader, self).source_to_code(data, path)
+
 # We automagically register our custom importer as a side-effect of loading.
 # This is necessary to ensure that any entry points are able to import
 # mercurial.* modules without having to perform this registration themselves.
-if not any(isinstance(x, hgimporter) for x in sys.meta_path):
-    # meta_path is used before any implicit finders and before sys.path.
-    sys.meta_path.insert(0, hgimporter())
+if sys.version_info[0] >= 3:
+    sys.meta_path.insert(0, hgpathentryfinder())
+else:
+    if not any(isinstance(x, hgimporter) for x in sys.meta_path):
+        # meta_path is used before any implicit finders and before sys.path.
+        sys.meta_path.insert(0, hgimporter())