Patchwork [2,of,6] py3: pass native string to urlreq.url2pathname()

Submitter Manuel Jacob
Date June 16, 2020, 12:49 p.m.
Message ID <b5676e89a260a5394980.1592311764@tmp>
Permalink /patch/46515/
State Accepted

Comments

Manuel Jacob - June 16, 2020, 12:49 p.m.
# HG changeset patch
# User Manuel Jacob <me@manueljacob.de>
# Date 1592308820 -7200
#      Tue Jun 16 14:00:20 2020 +0200
# Branch stable
# Node ID b5676e89a260a539498013984dca533ce4e5159f
# Parent  a1d235193ad132ed0da790f26489a66b7fdc3b1d
# EXP-Topic convert-svn
py3: pass native string to urlreq.url2pathname()

Of course, I’m not happy with the warning, but it’s better than crashing.
Solving the problem properly is hard, and non-UTF-8 percent-encoded bytes in
file URLs seem rare enough that they should not block fixing the crash that
currently affects all file URLs (even those that are not SVN-specific).
Manuel Jacob - June 17, 2020, 1:51 a.m.
I was unhappy about this "fix" and after thinking about it again, I’m 
even more unhappy.

In the following situation, the behavior is problematic:

- We’re on Python 3.
- The URL path contains a percent-encoded valid UTF-8 byte sequence. 
urlreq.url2pathname()’s return value is unicode and will contain the 
corresponding code point.
- pycompat.fsencode() uses a different encoding than UTF-8 (e.g. 
ISO-8859-1). It will encode the code point to a different byte sequence.
- The file will not be found and the warning introduced in this patch is 
not shown.
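
The situation in the list above can be reproduced with a small sketch. It is an illustration only: it calls urllib.parse.unquote() directly instead of going through urlreq.url2pathname(), and it uses a plain str.encode("iso-8859-1") as a stand-in for pycompat.fsencode() on a system whose filesystem encoding is ISO-8859-1. The example path is made up.

```python
from urllib.parse import unquote

# Hypothetical URL path: "%C3%A9" is the percent-encoded UTF-8 form of "é".
url_path = "/caf%C3%A9"

# The bytes that are literally spelled out in the URL (what Python 2 preserved).
url_bytes = b"/caf\xc3\xa9"

# On Python 3, percent-decoding yields a str; UTF-8 is the default encoding.
unicode_path = unquote(url_path)

# Stand-in for pycompat.fsencode() with an ISO-8859-1 filesystem encoding.
fs_bytes = unicode_path.encode("iso-8859-1")

# The decoded string is valid, so the check added by the patch stays silent.
warning_shown = "\N{REPLACEMENT CHARACTER}" in unicode_path

print(fs_bytes)       # b'/caf\xe9', a different byte sequence than url_bytes
print(warning_shown)  # False
```

The path that reaches the filesystem differs from the bytes in the URL, and nothing tells the user why the file was not found.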

On Python 2, the percent-decoded bytes are preserved (at least on Linux, 
I don’t have access to a Windows machine to verify).

A proper fix would be to have our own implementation for 
urlreq.url2pathname() that works with bytes. This is the right thing to 
do on Unix. On Windows, I think that we should assume that the 
percent-decoded bytes are UTF-8 (see 
https://en.wikipedia.org/wiki/File_URI_scheme#Windows_2). But it seems 
like that would be a change from how it works on Python 2 (again, I 
don’t have a Windows machine to verify) and therefore should be changed 
in the default branch.
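
A minimal sketch of what a bytes-based replacement could look like on Unix follows. The name url2pathname_bytes is hypothetical and not part of Mercurial; the sketch assumes that percent-decoding is the only transformation needed on Unix and does not attempt the Windows drive-letter handling.

```python
from urllib.parse import unquote_to_bytes


def url2pathname_bytes(path):
    """Hypothetical helper: percent-decode a file URL path, staying in bytes.

    No text decoding happens, so non-UTF-8 bytes survive unchanged.
    """
    return unquote_to_bytes(path)


# Valid UTF-8 and non-UTF-8 sequences are both preserved byte for byte.
print(url2pathname_bytes(b"/caf%C3%A9"))  # b'/caf\xc3\xa9'
print(url2pathname_bytes(b"/%ff"))        # b'/\xff'
```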

This patch is committed, but still draft. I think it would be best to 
prune it. I can send a better fix. For that, access to a Windows machine 
would be helpful.

> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@mercurial-scm.org
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Yuya Nishihara - June 17, 2020, 11:26 a.m.
On Wed, 17 Jun 2020 03:51:29 +0200, Manuel Jacob wrote:
> In the following situation, the behavior is problematic:
> 
> - We’re on Python 3.
> - The URL path contains a percent-encoded valid UTF-8 byte sequence. 
> urlreq.url2pathname()’s return value is unicode and will contain the 
> corresponding code point.
> - pycompat.fsencode() uses a different encoding than UTF-8 (e.g. 
> ISO-8859-1). It will encode the code point to a different byte sequence.
> - The file will not be found and the warning introduced in this patch is 
> not shown.
> 
> On Python 2, the percent-decoded bytes are preserved (at least on Linux, 
> I don’t have access to a Windows machine to verify).
> 
> A proper fix would be to have our own implementation for 
> urlreq.url2pathname() that works with bytes. This is the right thing to 
> do on Unix. On Windows, I think that we should assume that the 
> percent-decoded bytes are UTF-8 (see 
> https://en.wikipedia.org/wiki/File_URI_scheme#Windows_2). But it seems 
> like that would be a change from how it works on Python 2 (again, I 
> don’t have a Windows machine to verify) and therefore should be changed 
> in the default branch.

What encoding is expected as a subversion URL? It might be UTF-8 since
it is Subversion. Encoding handling in the convert extension is sometimes
wrong. It's probably better to fix things rather than copying the Py2
behavior.
Manuel Jacob - June 17, 2020, 3:01 p.m.
On 2020-06-17 13:26, Yuya Nishihara wrote:
> On Wed, 17 Jun 2020 03:51:29 +0200, Manuel Jacob wrote:
>> In the following situation, the behavior is problematic:
>> 
>> - We’re on Python 3.
>> - The URL path contains a percent-encoded valid UTF-8 byte sequence.
>> urlreq.url2pathname()’s return value is unicode and will contain the
>> corresponding code point.
>> - pycompat.fsencode() uses a different encoding than UTF-8 (e.g.
>> ISO-8859-1). It will encode the code point to a different byte 
>> sequence.
>> - The file will not be found and the warning introduced in this patch 
>> is
>> not shown.
>> 
>> On Python 2, the percent-decoded bytes are preserved (at least on 
>> Linux,
>> I don’t have access to a Windows machine to verify).
>> 
>> A proper fix would be to have our own implementation for
>> urlreq.url2pathname() that works with bytes. This is the right thing 
>> to
>> do on Unix. On Windows, I think that we should assume that the
>> percent-decoded bytes are UTF-8 (see
>> https://en.wikipedia.org/wiki/File_URI_scheme#Windows_2). But it seems
>> like that would be a change from how it works on Python 2 (again, I
>> don’t have a Windows machine to verify) and therefore should be 
>> changed
>> in the default branch.
> 
> What encoding is expected as a subversion URL? It might be UTF-8 since
> it is Subversion. Encoding handling in the convert extension is 
> sometimes
> wrong. It's probably better to fix things rather than copying the Py2
> behavior.

You’re right. Subversion converts the command line arguments from the 
local encoding to UTF-8. The percent-encoded bytes are decoded as bytes 
within this UTF-8-encoded string and thus interpreted as UTF-8. When 
accessing the FS, the UTF-8 string is converted to the FS encoding.
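
The interpretation described above can be modelled in a few lines. This is a sketch of the described behavior, not Subversion's code; the function name svn_style_local_path and the example path are assumptions made for illustration.

```python
from urllib.parse import unquote_to_bytes


def svn_style_local_path(url_path, fs_encoding):
    """Hypothetical model: treat percent-decoded bytes as UTF-8, then
    convert the resulting text to the filesystem encoding."""
    utf8_bytes = unquote_to_bytes(url_path)  # percent-decode as bytes
    text = utf8_bytes.decode("utf-8")        # interpret them as UTF-8
    return text.encode(fs_encoding)          # convert for filesystem access


print(svn_style_local_path("/caf%C3%A9", "iso-8859-1"))  # b'/caf\xe9'
print(svn_style_local_path("/caf%C3%A9", "utf-8"))       # b'/caf\xc3\xa9'
```

Under this model, a URL containing %ff has no valid interpretation at all, because the decode step fails.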

The SVN APIs called by the convert extension also expect UTF-8 bytes. 
I’ve found a few bugs because of that and will send patches.
Augie Fackler - June 18, 2020, 2:41 p.m.
> On Jun 17, 2020, at 07:26, Yuya Nishihara <yuya@tcha.org> wrote:
> 
> On Wed, 17 Jun 2020 03:51:29 +0200, Manuel Jacob wrote:
>> In the following situation, the behavior is problematic:
>> 
>> - We’re on Python 3.
>> - The URL path contains a percent-encoded valid UTF-8 byte sequence. 
>> urlreq.url2pathname()’s return value is unicode and will contain the 
>> corresponding code point.
>> - pycompat.fsencode() uses a different encoding than UTF-8 (e.g. 
>> ISO-8859-1). It will encode the code point to a different byte sequence.
>> - The file will not be found and the warning introduced in this patch is 
>> not shown.
>> 
>> On Python 2, the percent-decoded bytes are preserved (at least on Linux, 
>> I don’t have access to a Windows machine to verify).
>> 
>> A proper fix would be to have our own implementation for 
>> urlreq.url2pathname() that works with bytes. This is the right thing to 
>> do on Unix. On Windows, I think that we should assume that the 
>> percent-decoded bytes are UTF-8 (see 
>> https://en.wikipedia.org/wiki/File_URI_scheme#Windows_2). But it seems 
>> like that would be a change from how it works on Python 2 (again, I 
>> don’t have a Windows machine to verify) and therefore should be changed 
>> in the default branch.
> 
> What encoding is expected as a subversion URL? It might be UTF-8 since
> it is Subversion. Encoding handling in the convert extension is sometimes
> wrong. It's probably better to fix things rather than copying the Py2
> behavior.

All pathnames in Subversion are UTF-8.

Manuel Jacob - June 19, 2020, 2:39 a.m.
On 2020-06-18 16:41, Augie Fackler wrote:
>> On Jun 17, 2020, at 07:26, Yuya Nishihara <yuya@tcha.org> wrote:
>> 
>> On Wed, 17 Jun 2020 03:51:29 +0200, Manuel Jacob wrote:
>>> In the following situation, the behavior is problematic:
>>> 
>>> - We’re on Python 3.
>>> - The URL path contains a percent-encoded valid UTF-8 byte sequence.
>>> urlreq.url2pathname()’s return value is unicode and will contain the
>>> corresponding code point.
>>> - pycompat.fsencode() uses a different encoding than UTF-8 (e.g.
>>> ISO-8859-1). It will encode the code point to a different byte 
>>> sequence.
>>> - The file will not be found and the warning introduced in this patch 
>>> is
>>> not shown.
>>> 
>>> On Python 2, the percent-decoded bytes are preserved (at least on 
>>> Linux,
>>> I don’t have access to a Windows machine to verify).
>>> 
>>> A proper fix would be to have our own implementation for
>>> urlreq.url2pathname() that works with bytes. This is the right thing 
>>> to
>>> do on Unix. On Windows, I think that we should assume that the
>>> percent-decoded bytes are UTF-8 (see
>>> https://en.wikipedia.org/wiki/File_URI_scheme#Windows_2). But it 
>>> seems
>>> like that would be a change from how it works on Python 2 (again, I
>>> don’t have a Windows machine to verify) and therefore should be 
>>> changed
>>> in the default branch.
>> 
>> What encoding is expected as a subversion URL? It might be UTF-8 since
>> it is Subversion. Encoding handling in the convert extension is 
>> sometimes
>> wrong. It's probably better to fix things rather than copying the Py2
>> behavior.
> 
> All pathnames in Subversion are UTF-8.

Almost no conversion from the local encoding to UTF-8 is attempted in 
the convert.subversion extension. Is this by design?

Let’s suppose I’m on an ISO-8859-1 system. I commit some file with a 
non-ASCII filename. Subversion internally stores the filename as UTF-8 
(converted from ISO-8859-1). The convert extension will pass the UTF-8 
filename unchanged. The resulting Mercurial repository contains a UTF-8 
filename. When looking at the repository on the same machine, I get 
mojibake.
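
The effect can be shown with a short sketch. The filename is made up, and decoding with ISO-8859-1 stands in for how a tool on that machine would display the stored bytes.

```python
# Hypothetical filename committed on an ISO-8859-1 system.
name = "\u00e9.txt"  # "é.txt"

# Subversion stores the name as UTF-8; convert passes these bytes through.
stored_bytes = name.encode("utf-8")

# A tool on the ISO-8859-1 machine interprets each byte as one character.
shown_locally = stored_bytes.decode("iso-8859-1")

print(stored_bytes)   # b'\xc3\xa9.txt'
print(shown_locally)  # 'Ã©.txt', two characters where one was committed
```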

Augie Fackler - June 19, 2020, 2:56 a.m.
> On Jun 18, 2020, at 22:39, Manuel Jacob <me@manueljacob.de> wrote:
> 
> On 2020-06-18 16:41, Augie Fackler wrote:
>>> On Jun 17, 2020, at 07:26, Yuya Nishihara <yuya@tcha.org> wrote:
>>> On Wed, 17 Jun 2020 03:51:29 +0200, Manuel Jacob wrote:
>>>> In the following situation, the behavior is problematic:
>>>> - We’re on Python 3.
>>>> - The URL path contains a percent-encoded valid UTF-8 byte sequence.
>>>> urlreq.url2pathname()’s return value is unicode and will contain the
>>>> corresponding code point.
>>>> - pycompat.fsencode() uses a different encoding than UTF-8 (e.g.
>>>> ISO-8859-1). It will encode the code point to a different byte sequence.
>>>> - The file will not be found and the warning introduced in this patch is
>>>> not shown.
>>>> On Python 2, the percent-decoded bytes are preserved (at least on Linux,
>>>> I don’t have access to a Windows machine to verify).
>>>> A proper fix would be to have our own implementation for
>>>> urlreq.url2pathname() that works with bytes. This is the right thing to
>>>> do on Unix. On Windows, I think that we should assume that the
>>>> percent-decoded bytes are UTF-8 (see
>>>> https://en.wikipedia.org/wiki/File_URI_scheme#Windows_2). But it seems
>>>> like that would be a change from how it works on Python 2 (again, I
>>>> don’t have a Windows machine to verify) and therefore should be changed
>>>> in the default branch.
>>> What encoding is expected as a subversion URL? It might be UTF-8 since
>>> it is Subversion. Encoding handling in the convert extension is sometimes
>>> wrong. It's probably better to fix things rather than copying the Py2
>>> behavior.
>> All pathnames in Subversion are UTF-8.
> 
> Almost no conversion from the local encoding to UTF-8 is attempted in the convert.subversion extension. Is this by design?

In which direction? Making a Mercurial repo from a Subversion one, I'd expect no conversions because the "original" bytes in the filenames were the svn-normalized UTF-8. In the other direction I'd kind of expect some fromlocal, but that code predates me.

> Let’s suppose I’m on an ISO-8859-1 system. I commit some file with a non-ASCII filename. Subversion internally stores the filename as UTF-8 (converted from ISO-8859-1). The convert extension will pass the UTF-8 filename unchanged. The resulting Mercurial repository contains a UTF-8 filename. When looking at the repository on the same machine, I get mojibake.

I think you're talking about Subversion -> Mercurial. I agree that this is mojibake, but it's also unavoidable if you want to avoid breaking any Makefiles that involve non-ASCII filenames.

Manuel Jacob - June 19, 2020, 4:17 a.m.
On 2020-06-19 04:56, Augie Fackler wrote:
>> On Jun 18, 2020, at 22:39, Manuel Jacob <me@manueljacob.de> wrote:
>> 
>> On 2020-06-18 16:41, Augie Fackler wrote:
>>>> On Jun 17, 2020, at 07:26, Yuya Nishihara <yuya@tcha.org> wrote:
>>>> On Wed, 17 Jun 2020 03:51:29 +0200, Manuel Jacob wrote:
>>>>> In the following situation, the behavior is problematic:
>>>>> - We’re on Python 3.
>>>>> - The URL path contains a percent-encoded valid UTF-8 byte 
>>>>> sequence.
>>>>> urlreq.url2pathname()’s return value is unicode and will contain 
>>>>> the
>>>>> corresponding code point.
>>>>> - pycompat.fsencode() uses a different encoding than UTF-8 (e.g.
>>>>> ISO-8859-1). It will encode the code point to a different byte 
>>>>> sequence.
>>>>> - The file will not be found and the warning introduced in this 
>>>>> patch is
>>>>> not shown.
>>>>> On Python 2, the percent-decoded bytes are preserved (at least on 
>>>>> Linux,
>>>>> I don’t have access to a Windows machine to verify).
>>>>> A proper fix would be to have our own implementation for
>>>>> urlreq.url2pathname() that works with bytes. This is the right 
>>>>> thing to
>>>>> do on Unix. On Windows, I think that we should assume that the
>>>>> percent-decoded bytes are UTF-8 (see
>>>>> https://en.wikipedia.org/wiki/File_URI_scheme#Windows_2). But it 
>>>>> seems
>>>>> like that would be a change from how it works on Python 2 (again, I
>>>>> don’t have a Windows machine to verify) and therefore should be 
>>>>> changed
>>>>> in the default branch.
>>>> What encoding is expected as a subversion URL? It might be UTF-8 
>>>> since
>>>> it is Subversion. Encoding handling in the convert extension is 
>>>> sometimes
>>>> wrong. It's probably better to fix things rather than copying the 
>>>> Py2
>>>> behavior.
>>> All pathnames in Subversion are UTF-8.
>> 
>> Almost no conversion from the local encoding to UTF-8 is attempted in 
>> the convert.subversion extension. Is this by design?
> 
> In which direction? Making a Mercurial repo from a Subversion one, I'd
> expect no conversions because the "original" bytes in the filenames
> were the svn-normalized UTF-8. In the other direction I'd kind of
> expect some fromlocal, but that code predates me.

Fromlocal is not used at all in convert.subversion. Tolocal is only used 
in one place where XML output is parsed. In the Mercurial -> Subversion 
case, the raw bytes filenames are passed as arguments to a svn 
subprocess (which converts them from the local encoding to UTF-8 
internally).

>> Let’s suppose I’m on an ISO-8859-1 system. I commit some file with a 
>> non-ASCII filename. Subversion internally stores the filename as UTF-8 
>> (converted from ISO-8859-1). The convert extension will pass the UTF-8 
>> filename unchanged. The resulting Mercurial repository contains a 
>> UTF-8 filename. When looking at the repository on the same machine, I 
>> get mojibake.
> 
> I think you're talking about Subversion -> Mercurial. I agree that
> this is mojibake, but it's also unavoidable if you want to avoid
> breaking any Makefiles that involve non-ASCII filenames.

Yes, I’m talking about Subversion -> Mercurial. In my example, Makefiles 
are broken by the convert extension. On an ISO-8859-1 system, the Makefile 
works fine in the SVN working copy if both the filename and the Makefile 
are ISO-8859-1. SVN stores the filename as UTF-8 (converted from 
ISO-8859-1) and the Makefile as ISO-8859-1. Since the convert extension 
preserves both the UTF-8 filename and the ISO-8859-1 Makefile and stores 
them in the Mercurial repository, the Makefile won’t work on the same 
machine.

As described above, Mercurial -> Subversion converts the encoding from 
local to UTF-8. So a Subversion -> Mercurial -> Subversion conversion 
produces nonsense on non-UTF-8 systems.
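
A sketch of that round trip on an ISO-8859-1 system, under the assumptions stated in this thread: Subversion -> Mercurial keeps the UTF-8 bytes unchanged, and Mercurial -> Subversion treats the bytes as local-encoded and converts them to UTF-8. The filename is made up.

```python
LOCAL_ENCODING = "iso-8859-1"

original_name = "\u00e9.txt"  # "é.txt"

# First Subversion repository: name stored as UTF-8.
svn_first = original_name.encode("utf-8")

# Subversion -> Mercurial: bytes are passed through unchanged.
hg_name = svn_first

# Mercurial -> Subversion: svn assumes local encoding and converts to UTF-8.
svn_second = hg_name.decode(LOCAL_ENCODING).encode("utf-8")

print(svn_first)   # b'\xc3\xa9.txt'
print(svn_second)  # b'\xc3\x83\xc2\xa9.txt', the name is now double-encoded
```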

Augie Fackler - June 21, 2020, 6:35 p.m.
> On Jun 19, 2020, at 12:17 AM, Manuel Jacob <me@manueljacob.de> wrote:
> 
> On 2020-06-19 04:56, Augie Fackler wrote:
>>> On Jun 18, 2020, at 22:39, Manuel Jacob <me@manueljacob.de> wrote:
>>> On 2020-06-18 16:41, Augie Fackler wrote:
>>>>> On Jun 17, 2020, at 07:26, Yuya Nishihara <yuya@tcha.org> wrote:
>>>>> On Wed, 17 Jun 2020 03:51:29 +0200, Manuel Jacob wrote:
>>>>>> In the following situation, the behavior is problematic:
>>>>>> - We’re on Python 3.
>>>>>> - The URL path contains a percent-encoded valid UTF-8 byte sequence.
>>>>>> urlreq.url2pathname()’s return value is unicode and will contain the
>>>>>> corresponding code point.
>>>>>> - pycompat.fsencode() uses a different encoding than UTF-8 (e.g.
>>>>>> ISO-8859-1). It will encode the code point to a different byte sequence.
>>>>>> - The file will not be found and the warning introduced in this patch is
>>>>>> not shown.
>>>>>> On Python 2, the percent-decoded bytes are preserved (at least on Linux,
>>>>>> I don’t have access to a Windows machine to verify).
>>>>>> A proper fix would be to have our own implementation for
>>>>>> urlreq.url2pathname() that works with bytes. This is the right thing to
>>>>>> do on Unix. On Windows, I think that we should assume that the
>>>>>> percent-decoded bytes are UTF-8 (see
>>>>>> https://en.wikipedia.org/wiki/File_URI_scheme#Windows_2). But it seems
>>>>>> like that would be a change from how it works on Python 2 (again, I
>>>>>> don’t have a Windows machine to verify) and therefore should be changed
>>>>>> in the default branch.
>>>>> What encoding is expected as a subversion URL? It might be UTF-8 since
>>>>> it is Subversion. Encoding handling in the convert extension is sometimes
>>>>> wrong. It's probably better to fix things rather than copying the Py2
>>>>> behavior.
>>>> All pathnames in Subversion are UTF-8.
>>> Almost no conversion from the local encoding to UTF-8 is attempted in the convert.subversion extension. Is this by design?
>> In which direction? Making a Mercurial repo from a Subversion one, I'd
>> expect no conversions because the "original" bytes in the filenames
>> were the svn-normalized UTF-8. In the other direction I'd kind of
>> expect some fromlocal, but that code predates me.
> 
> Fromlocal is not used at all in convert.subversion. Tolocal is only used in one place where XML output is parsed. In the Mercurial -> Subversion case, the raw bytes filenames are passed as arguments to a svn subprocess (which converts them from the local encoding to UTF-8 internally).
> 
>>> Let’s suppose I’m on an ISO-8859-1 system. I commit some file with a non-ASCII filename. Subversion internally stores the filename as UTF-8 (converted from ISO-8859-1). The convert extension will pass the UTF-8 filename unchanged. The resulting Mercurial repository contains a UTF-8 filename. When looking at the repository on the same machine, I get mojibake.
>> I think you're talking about Subversion -> Mercurial. I agree that
>> this is mojibake, but it's also unavoidable if you want to avoid
>> breaking any Makefiles that involve non-ASCII filenames.
> 
> Yes, I’m talking about Subversion -> Mercurial. In my example, Makefiles are broken by the convert extension. On an ISO-8859-1 system, the Makefile works fine in the SVN working copy if both the filename and the Makefile are ISO-8859-1. SVN stores the filename as UTF-8 (converted from ISO-8859-1) and the Makefile as ISO-8859-1. Since the convert extension preserves both the UTF-8 filename and the ISO-8859-1 Makefile and stores them in the Mercurial repository, the Makefile won’t work on the same machine.

Ah. You have one of the nightmare situations in Subversion[0]. I’ll gladly review a patch for convert to let you have a “destination filename encoding” knob you could set, but I can’t commit to doing it for you due to time constraints.

Thanks,
Augie

0: “Nightmare” because if you were to check out this repo on a macOS machine, the Makefiles would be broken. Probably on Linux too!

> As described above, Mercurial -> Subversion converts the encoding from local to UTF-8. So a Subversion -> Mercurial -> Subversion conversion produces nonsense on non-UTF-8 systems.

Patch

diff --git a/hgext/convert/subversion.py b/hgext/convert/subversion.py
--- a/hgext/convert/subversion.py
+++ b/hgext/convert/subversion.py
@@ -321,7 +321,26 @@ 
                 and path[2:6].lower() == b'%3a/'
             ):
                 path = path[:2] + b':/' + path[6:]
-            path = urlreq.url2pathname(path)
+            # pycompat.fsdecode() / pycompat.fsencode() are used so that bytes
+            # in the URL roundtrip correctly on Unix. urlreq.url2pathname() on
+            # py3 will decode percent-encoded bytes using the utf-8 encoding
+            # and the "replace" error handler. This means that it will not
+            # preserve non-UTF-8 bytes (https://bugs.python.org/issue40983).
+            # url.open() uses the reverse function (urlreq.pathname2url()) and
+            # has a similar problem
+            # (https://bz.mercurial-scm.org/show_bug.cgi?id=6357). It makes
+            # sense to solve both problems together and handle all file URLs
+            # consistently. For now, we warn.
+            unicodepath = urlreq.url2pathname(pycompat.fsdecode(path))
+            if pycompat.ispy3 and u'\N{REPLACEMENT CHARACTER}' in unicodepath:
+                ui.warn(
+                    _(
+                        b'on Python 3, we currently do not support non-UTF-8 '
+                        b'percent-encoded bytes in file URLs for Subversion '
+                        b'repositories\n'
+                    )
+                )
+            path = pycompat.fsencode(unicodepath)
     except ValueError:
         proto = b'file'
         path = os.path.abspath(url)
diff --git a/tests/test-convert-svn-encoding.t b/tests/test-convert-svn-encoding.t
--- a/tests/test-convert-svn-encoding.t
+++ b/tests/test-convert-svn-encoding.t
@@ -152,3 +152,23 @@ 
   f7e66f98380ed1e53a797c5c7a7a2616a7ab377d branch\xc3\xa9 (esc)
 
   $ cd ..
+
+#if py3
+For now, on Python 3, we abort when encountering non-UTF-8 percent-encoded
+bytes in a filename.
+
+  $ hg convert file:///%ff test
+  initializing destination test repository
+  on Python 3, we currently do not support non-UTF-8 percent-encoded bytes in file URLs for Subversion repositories
+  file:///%ff does not look like a CVS checkout
+  $TESTTMP/file:/%ff does not look like a Git repository
+  file:///%ff does not look like a Subversion repository
+  file:///%ff is not a local Mercurial repository
+  file:///%ff does not look like a darcs repository
+  file:///%ff does not look like a monotone repository
+  file:///%ff does not look like a GNU Arch repository
+  file:///%ff does not look like a Bazaar repository
+  file:///%ff does not look like a P4 repository
+  abort: file:///%ff: missing or unsupported repository
+  [255]
+#endif
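
The detection used by the patch relies on percent-decoding with the "replace" error handler. The sketch below shows that mechanism in isolation. It calls urllib.parse.unquote() directly, on the assumption that this is where the replacement happens on Unix, and it does not use any Mercurial code.

```python
from urllib.parse import unquote

# "%ff" is not valid UTF-8, so decoding substitutes U+FFFD for it.
decoded = unquote("/%ff", encoding="utf-8", errors="replace")

# Same test as in the patch: a replacement character means bytes were lost.
needs_warning = "\N{REPLACEMENT CHARACTER}" in decoded

# Encoding again cannot recover the original byte 0xff.
reencoded = decoded.encode("utf-8")

print(needs_warning)  # True
print(reencoded)      # b'/\xef\xbf\xbd'
```

One limitation of this check is that a URL which legitimately contains a percent-encoded U+FFFD (%EF%BF%BD) would also trigger the warning.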