Patchwork [6,of,6,stable] convert: handle percent-encoded bytes in file URLs like Subversion

login
register
mail settings
Submitter Manuel Jacob
Date June 30, 2020, 6:45 a.m.
Message ID <9915fdff8d1732ce62b6.1593499547@tmp>
Download mbox | patch
Permalink /patch/46599/
State Accepted
Headers show

Comments

Manuel Jacob - June 30, 2020, 6:45 a.m.
# HG changeset patch
# User Manuel Jacob <me@manueljacob.de>
# Date 1593494609 -7200
#      Tue Jun 30 07:23:29 2020 +0200
# Branch stable
# Node ID 9915fdff8d1732ce62b6df69b50106384d4ad4d1
# Parent  e1a4c7f23e804f37c3848fc408607af916d619d1
# EXP-Topic svn_encoding
convert: handle percent-encoded bytes in file URLs like Subversion

75b59d221aa3 added most of the code that gets removed by this patch. It helped
making progress on Python 3, but the reasoning was wrong in many ways. I tried
to retract it while it was queued, but it was too late.

Back then, I was asssuming that what happened on Python 2 (preserving bytes) is
correct and my Python 3 change is a hack. However it turned out that Subversion
interprets percent-encoded bytes as UTF-8. Accepting the same format as
Subversion is a good idea.

Consistency with urlreq.pathname2url() (as described in the removed comment)
doesn’t matter because that function is only used for passing paths to urllib.

This is not a backwards-incompatible change because before 5c0d5b48e58c,
non-ASCII filenames didn’t work at all on Python 2.

When the locale encoding is ISO-8859-15, `svn` accepts `file:///tmp/a%E2%82%AC`
for `/tmp/a€`. Before this patch, this was the case for this extension on
Python 3, but not on Python 2. This patch makes it work like with `svn` on both
Python 2 and Python 3.
Yuya Nishihara - June 30, 2020, 12:24 p.m.
On Tue, 30 Jun 2020 08:45:47 +0200, Manuel Jacob wrote:
> # HG changeset patch
> # User Manuel Jacob <me@manueljacob.de>
> # Date 1593494609 -7200
> #      Tue Jun 30 07:23:29 2020 +0200
> # Branch stable
> # Node ID 9915fdff8d1732ce62b6df69b50106384d4ad4d1
> # Parent  e1a4c7f23e804f37c3848fc408607af916d619d1
> # EXP-Topic svn_encoding
> convert: handle percent-encoded bytes in file URLs like Subversion

>  def issvnurl(ui, url):
>      try:
>          proto, path = url.split(b'://', 1)
> @@ -361,7 +387,7 @@
>              ):
>                  path = path[:2] + b':/' + path[6:]
>              try:
> -                path.decode(fsencoding)
> +                unicodepath = path.decode(fsencoding)
>              except UnicodeDecodeError:
>                  ui.warn(
>                      _(
> @@ -371,28 +397,17 @@
>                      % pycompat.sysbytes(fsencoding)
>                  )
>                  return False
> -            # FIXME: The following reasoning and logic is wrong and will be
> -            # fixed in a following changeset.
> -            # pycompat.fsdecode() / pycompat.fsencode() are used so that bytes
> -            # in the URL roundtrip correctly on Unix. urlreq.url2pathname() on
> -            # py3 will decode percent-encoded bytes using the utf-8 encoding
> -            # and the "replace" error handler. This means that it will not
> -            # preserve non-UTF-8 bytes (https://bugs.python.org/issue40983).
> -            # url.open() uses the reverse function (urlreq.pathname2url()) and
> -            # has a similar problem
> -            # (https://bz.mercurial-scm.org/show_bug.cgi?id=6357). It makes
> -            # sense to solve both problems together and handle all file URLs
> -            # consistently. For now, we warn.
> -            unicodepath = urlreq.url2pathname(pycompat.fsdecode(path))
> -            if pycompat.ispy3 and u'\N{REPLACEMENT CHARACTER}' in unicodepath:
> +            try:
> +                unicodepath = url2pathname_like_subversion(unicodepath)
> +            except NonUtf8PercentEncodedBytes:
>                  ui.warn(
>                      _(
> -                        b'on Python 3, we currently do not support non-UTF-8 '
> -                        b'percent-encoded bytes in file URLs for Subversion '
> -                        b'repositories\n'
> +                        b'Subversion does not support non-UTF-8 '
> +                        b'percent-encoded bytes in file URLs\n'
>                      )
>                  )
> -            path = pycompat.fsencode(unicodepath)
> +                return False
> +            path = unicodepath.encode(fsencoding)

I think pycompat.fsencode() is more correct since the path will be later
tested by os.path.*() functions. On Python 2, I'm not sure. Maybe our
encoding.unitolocal() is okay.
Manuel Jacob - June 30, 2020, 1:25 p.m.
On 2020-06-30 14:24, Yuya Nishihara wrote:
> On Tue, 30 Jun 2020 08:45:47 +0200, Manuel Jacob wrote:
>> # HG changeset patch
>> # User Manuel Jacob <me@manueljacob.de>
>> # Date 1593494609 -7200
>> #      Tue Jun 30 07:23:29 2020 +0200
>> # Branch stable
>> # Node ID 9915fdff8d1732ce62b6df69b50106384d4ad4d1
>> # Parent  e1a4c7f23e804f37c3848fc408607af916d619d1
>> # EXP-Topic svn_encoding
>> convert: handle percent-encoded bytes in file URLs like Subversion
> 
>>  def issvnurl(ui, url):
>>      try:
>>          proto, path = url.split(b'://', 1)
>> @@ -361,7 +387,7 @@
>>              ):
>>                  path = path[:2] + b':/' + path[6:]
>>              try:
>> -                path.decode(fsencoding)
>> +                unicodepath = path.decode(fsencoding)
>>              except UnicodeDecodeError:
>>                  ui.warn(
>>                      _(
>> @@ -371,28 +397,17 @@
>>                      % pycompat.sysbytes(fsencoding)
>>                  )
>>                  return False
>> -            # FIXME: The following reasoning and logic is wrong and 
>> will be
>> -            # fixed in a following changeset.
>> -            # pycompat.fsdecode() / pycompat.fsencode() are used so 
>> that bytes
>> -            # in the URL roundtrip correctly on Unix. 
>> urlreq.url2pathname() on
>> -            # py3 will decode percent-encoded bytes using the utf-8 
>> encoding
>> -            # and the "replace" error handler. This means that it 
>> will not
>> -            # preserve non-UTF-8 bytes 
>> (https://bugs.python.org/issue40983).
>> -            # url.open() uses the reverse function 
>> (urlreq.pathname2url()) and
>> -            # has a similar problem
>> -            # (https://bz.mercurial-scm.org/show_bug.cgi?id=6357). It 
>> makes
>> -            # sense to solve both problems together and handle all 
>> file URLs
>> -            # consistently. For now, we warn.
>> -            unicodepath = 
>> urlreq.url2pathname(pycompat.fsdecode(path))
>> -            if pycompat.ispy3 and u'\N{REPLACEMENT CHARACTER}' in 
>> unicodepath:
>> +            try:
>> +                unicodepath = 
>> url2pathname_like_subversion(unicodepath)
>> +            except NonUtf8PercentEncodedBytes:
>>                  ui.warn(
>>                      _(
>> -                        b'on Python 3, we currently do not support 
>> non-UTF-8 '
>> -                        b'percent-encoded bytes in file URLs for 
>> Subversion '
>> -                        b'repositories\n'
>> +                        b'Subversion does not support non-UTF-8 '
>> +                        b'percent-encoded bytes in file URLs\n'
>>                      )
>>                  )
>> -            path = pycompat.fsencode(unicodepath)
>> +                return False
>> +            path = unicodepath.encode(fsencoding)
> 
> I think pycompat.fsencode() is more correct since the path will be 
> later
> tested by os.path.*() functions. On Python 2, I'm not sure. Maybe our
> encoding.unitolocal() is okay.

It was deliberate to use something that fails when it can’t be encoded. 
However, I thought about failing in case of a logic error, not about 
that the URL could contain a (UTF-8-encoded) code point that is not 
available in `fsencoding`.

On Windows, Subversion will be able to handle all code points. We’re 
unnecessarily restricting ourselves by doing the rest of the function on 
bytes. Although not entirely Mercurial-idiomatic, we should maybe 
consider doing the rest of the function in Unicode. We can also bail 
out, but we have to make sure the warning mentions that it’s not the 
user’s fault.

On Unix, Subversion will try later to convert the path to the locale 
encoding, which fails if the code point is not encodable. We should bail 
out in this case.
Manuel Jacob - June 30, 2020, 1:43 p.m.
On 2020-06-30 15:25, Manuel Jacob wrote:
> On 2020-06-30 14:24, Yuya Nishihara wrote:
>> On Tue, 30 Jun 2020 08:45:47 +0200, Manuel Jacob wrote:
>>> # HG changeset patch
>>> # User Manuel Jacob <me@manueljacob.de>
>>> # Date 1593494609 -7200
>>> #      Tue Jun 30 07:23:29 2020 +0200
>>> # Branch stable
>>> # Node ID 9915fdff8d1732ce62b6df69b50106384d4ad4d1
>>> # Parent  e1a4c7f23e804f37c3848fc408607af916d619d1
>>> # EXP-Topic svn_encoding
>>> convert: handle percent-encoded bytes in file URLs like Subversion
>> 
>>>  def issvnurl(ui, url):
>>>      try:
>>>          proto, path = url.split(b'://', 1)
>>> @@ -361,7 +387,7 @@
>>>              ):
>>>                  path = path[:2] + b':/' + path[6:]
>>>              try:
>>> -                path.decode(fsencoding)
>>> +                unicodepath = path.decode(fsencoding)
>>>              except UnicodeDecodeError:
>>>                  ui.warn(
>>>                      _(
>>> @@ -371,28 +397,17 @@
>>>                      % pycompat.sysbytes(fsencoding)
>>>                  )
>>>                  return False
>>> -            # FIXME: The following reasoning and logic is wrong and 
>>> will be
>>> -            # fixed in a following changeset.
>>> -            # pycompat.fsdecode() / pycompat.fsencode() are used so 
>>> that bytes
>>> -            # in the URL roundtrip correctly on Unix. 
>>> urlreq.url2pathname() on
>>> -            # py3 will decode percent-encoded bytes using the utf-8 
>>> encoding
>>> -            # and the "replace" error handler. This means that it 
>>> will not
>>> -            # preserve non-UTF-8 bytes 
>>> (https://bugs.python.org/issue40983).
>>> -            # url.open() uses the reverse function 
>>> (urlreq.pathname2url()) and
>>> -            # has a similar problem
>>> -            # (https://bz.mercurial-scm.org/show_bug.cgi?id=6357). 
>>> It makes
>>> -            # sense to solve both problems together and handle all 
>>> file URLs
>>> -            # consistently. For now, we warn.
>>> -            unicodepath = 
>>> urlreq.url2pathname(pycompat.fsdecode(path))
>>> -            if pycompat.ispy3 and u'\N{REPLACEMENT CHARACTER}' in 
>>> unicodepath:
>>> +            try:
>>> +                unicodepath = 
>>> url2pathname_like_subversion(unicodepath)
>>> +            except NonUtf8PercentEncodedBytes:
>>>                  ui.warn(
>>>                      _(
>>> -                        b'on Python 3, we currently do not support 
>>> non-UTF-8 '
>>> -                        b'percent-encoded bytes in file URLs for 
>>> Subversion '
>>> -                        b'repositories\n'
>>> +                        b'Subversion does not support non-UTF-8 '
>>> +                        b'percent-encoded bytes in file URLs\n'
>>>                      )
>>>                  )
>>> -            path = pycompat.fsencode(unicodepath)
>>> +                return False
>>> +            path = unicodepath.encode(fsencoding)
>> 
>> I think pycompat.fsencode() is more correct since the path will be 
>> later
>> tested by os.path.*() functions. On Python 2, I'm not sure. Maybe our
>> encoding.unitolocal() is okay.
> 
> It was deliberate to use something that fails when it can’t be
> encoded. However, I thought about failing in case of a logic error,
> not about that the URL could contain a (UTF-8-encoded) code point that
> is not available in `fsencoding`.
> 
> On Windows, Subversion will be able to handle all code points. We’re
> unnecessarily restricting ourselves by doing the rest of the function
> on bytes. Although not entirely Mercurial-idiomatic, we should maybe
> consider doing the rest of the function in Unicode. We can also bail
> out, but we have to make sure the warning mentions that it’s not the
> user’s fault.
> 
> On Unix, Subversion will try later to convert the path to the locale
> encoding, which fails if the code point is not encodable. We should
> bail out in this case.

Let me add that with "bail out", I included that a warning describing 
the problem is shown.

Since what I described is mostly about showing nicer messages, should 
they be added in an amended patch or as follow-ups?
Yuya Nishihara - June 30, 2020, 2:03 p.m.
On Tue, 30 Jun 2020 15:43:29 +0200, Manuel Jacob wrote:
> >>> -            path = pycompat.fsencode(unicodepath)
> >>> +                return False
> >>> +            path = unicodepath.encode(fsencoding)
> >> 
> >> I think pycompat.fsencode() is more correct since the path will be 
> >> later
> >> tested by os.path.*() functions. On Python 2, I'm not sure. Maybe our
> >> encoding.unitolocal() is okay.
> > 
> > It was deliberate to use something that fails when it can’t be
> > encoded. However, I thought about failing in case of a logic error,
> > not about that the URL could contain a (UTF-8-encoded) code point that
> > is not available in `fsencoding`.
> > 
> > On Windows, Subversion will be able to handle all code points. We’re
> > unnecessarily restricting ourselves by doing the rest of the function
> > on bytes. Although not entirely Mercurial-idiomatic, we should maybe
> > consider doing the rest of the function in Unicode. We can also bail
> > out, but we have to make sure the warning mentions that it’s not the
> > user’s fault.
> > 
> > On Unix, Subversion will try later to convert the path to the locale
> > encoding, which fails if the code point is not encodable. We should
> > bail out in this case.
> 
> Let me add that with "bail out", I included that a warning describing 
> the problem is shown.
> 
> Since what I described is mostly about showing nicer messages, should 
> they be added in an amended patch or as follow-ups?

I'm not sure if I understood what you mean.

I just pointed out that the fsencoding is the encoding in which Subversion
will convert unicode to bytes, whereas fsencode() is the function for Python
layer. And the path should be a bytes encoded in Python way.
Manuel Jacob - June 30, 2020, 2:28 p.m.
On 2020-06-30 16:03, Yuya Nishihara wrote:
> On Tue, 30 Jun 2020 15:43:29 +0200, Manuel Jacob wrote:
>> >>> -            path = pycompat.fsencode(unicodepath)
>> >>> +                return False
>> >>> +            path = unicodepath.encode(fsencoding)
>> >>
>> >> I think pycompat.fsencode() is more correct since the path will be
>> >> later
>> >> tested by os.path.*() functions. On Python 2, I'm not sure. Maybe our
>> >> encoding.unitolocal() is okay.
>> >
>> > It was deliberate to use something that fails when it can’t be
>> > encoded. However, I thought about failing in case of a logic error,
>> > not about that the URL could contain a (UTF-8-encoded) code point that
>> > is not available in `fsencoding`.
>> >
>> > On Windows, Subversion will be able to handle all code points. We’re
>> > unnecessarily restricting ourselves by doing the rest of the function
>> > on bytes. Although not entirely Mercurial-idiomatic, we should maybe
>> > consider doing the rest of the function in Unicode. We can also bail
>> > out, but we have to make sure the warning mentions that it’s not the
>> > user’s fault.
>> >
>> > On Unix, Subversion will try later to convert the path to the locale
>> > encoding, which fails if the code point is not encodable. We should
>> > bail out in this case.
>> 
>> Let me add that with "bail out", I included that a warning describing
>> the problem is shown.
>> 
>> Since what I described is mostly about showing nicer messages, should
>> they be added in an amended patch or as follow-ups?
> 
> I'm not sure if I understood what you mean.
> 
> I just pointed out that the fsencoding is the encoding in which 
> Subversion
> will convert unicode to bytes, whereas fsencode() is the function for 
> Python
> layer. And the path should be a bytes encoded in Python way.

Subversion will do a similar check as in the rest of the function to 
determine whether the repo is actually a SVN repo.

If we use a different encoding, we have a higher chance of running into 
cases where our function thinks it’s a SVN repo, while Subversion thinks 
it’s not (and vice versa). Therefore, I think we should use 
`fsencoding`.
Yuya Nishihara - June 30, 2020, 2:48 p.m.
On Tue, 30 Jun 2020 16:28:55 +0200, Manuel Jacob wrote:
> On 2020-06-30 16:03, Yuya Nishihara wrote:
> > On Tue, 30 Jun 2020 15:43:29 +0200, Manuel Jacob wrote:
> >> >>> -            path = pycompat.fsencode(unicodepath)
> >> >>> +                return False
> >> >>> +            path = unicodepath.encode(fsencoding)
> >> >>
> >> >> I think pycompat.fsencode() is more correct since the path will be
> >> >> later
> >> >> tested by os.path.*() functions. On Python 2, I'm not sure. Maybe our
> >> >> encoding.unitolocal() is okay.
> >> >
> >> > It was deliberate to use something that fails when it can’t be
> >> > encoded. However, I thought about failing in case of a logic error,
> >> > not about that the URL could contain a (UTF-8-encoded) code point that
> >> > is not available in `fsencoding`.
> >> >
> >> > On Windows, Subversion will be able to handle all code points. We’re
> >> > unnecessarily restricting ourselves by doing the rest of the function
> >> > on bytes. Although not entirely Mercurial-idiomatic, we should maybe
> >> > consider doing the rest of the function in Unicode. We can also bail
> >> > out, but we have to make sure the warning mentions that it’s not the
> >> > user’s fault.
> >> >
> >> > On Unix, Subversion will try later to convert the path to the locale
> >> > encoding, which fails if the code point is not encodable. We should
> >> > bail out in this case.
> >> 
> >> Let me add that with "bail out", I included that a warning describing
> >> the problem is shown.
> >> 
> >> Since what I described is mostly about showing nicer messages, should
> >> they be added in an amended patch or as follow-ups?
> > 
> > I'm not sure if I understood what you mean.
> > 
> > I just pointed out that the fsencoding is the encoding in which 
> > Subversion
> > will convert unicode to bytes, whereas fsencode() is the function for 
> > Python
> > layer. And the path should be a bytes encoded in Python way.
> 
> Subversion will do a similar check as in the rest of the function to 
> determine whether the repo is actually a SVN repo.
> 
> If we use a different encoding, we have a higher chance of running into 
> cases where our function thinks it’s a SVN repo, while Subversion thinks 
> it’s not (and vice versa). Therefore, I think we should use 
> `fsencoding`.

Then, can you adjust the documentation of fsencoding? Without that, I would
see there are at least three types of path variables:

 - py_fs_bytes_t (or maybe hg_fs_bytes_t)
 - svn_fs_bytes_t
 - unicode_t (or svn_url_unicode_t)

Be aware that the path will be converted back to UCS2-ish encoding on
Windows by either Python 3 or Win32 ANSI API.
Manuel Jacob - July 1, 2020, 4:23 a.m.
On 2020-06-30 16:48, Yuya Nishihara wrote:
> On Tue, 30 Jun 2020 16:28:55 +0200, Manuel Jacob wrote:
>> On 2020-06-30 16:03, Yuya Nishihara wrote:
>> > On Tue, 30 Jun 2020 15:43:29 +0200, Manuel Jacob wrote:
>> >> >>> -            path = pycompat.fsencode(unicodepath)
>> >> >>> +                return False
>> >> >>> +            path = unicodepath.encode(fsencoding)
>> >> >>
>> >> >> I think pycompat.fsencode() is more correct since the path will be
>> >> >> later
>> >> >> tested by os.path.*() functions. On Python 2, I'm not sure. Maybe our
>> >> >> encoding.unitolocal() is okay.
>> >> >
>> >> > It was deliberate to use something that fails when it can’t be
>> >> > encoded. However, I thought about failing in case of a logic error,
>> >> > not about that the URL could contain a (UTF-8-encoded) code point that
>> >> > is not available in `fsencoding`.
>> >> >
>> >> > On Windows, Subversion will be able to handle all code points. We’re
>> >> > unnecessarily restricting ourselves by doing the rest of the function
>> >> > on bytes. Although not entirely Mercurial-idiomatic, we should maybe
>> >> > consider doing the rest of the function in Unicode. We can also bail
>> >> > out, but we have to make sure the warning mentions that it’s not the
>> >> > user’s fault.
>> >> >
>> >> > On Unix, Subversion will try later to convert the path to the locale
>> >> > encoding, which fails if the code point is not encodable. We should
>> >> > bail out in this case.
>> >>
>> >> Let me add that with "bail out", I included that a warning describing
>> >> the problem is shown.
>> >>
>> >> Since what I described is mostly about showing nicer messages, should
>> >> they be added in an amended patch or as follow-ups?
>> >
>> > I'm not sure if I understood what you mean.
>> >
>> > I just pointed out that the fsencoding is the encoding in which
>> > Subversion
>> > will convert unicode to bytes, whereas fsencode() is the function for
>> > Python
>> > layer. And the path should be a bytes encoded in Python way.
>> 
>> Subversion will do a similar check as in the rest of the function to
>> determine whether the repo is actually a SVN repo.
>> 
>> If we use a different encoding, we have a higher chance of running 
>> into
>> cases where our function thinks it’s a SVN repo, while Subversion 
>> thinks
>> it’s not (and vice versa). Therefore, I think we should use
>> `fsencoding`.
> 
> Then, can you adjust the documentation of fsencoding? Without that, I 
> would
> see there are at least three types of path variables:
> 
>  - py_fs_bytes_t (or maybe hg_fs_bytes_t)
>  - svn_fs_bytes_t
>  - unicode_t (or svn_url_unicode_t)
> 
> Be aware that the path will be converted back to UCS2-ish encoding on
> Windows by either Python 3 or Win32 ANSI API.

The basic distinction is between:

* Mercurial stores paths as bytes.
* The type for Subversion paths and URLs is Unicode. We store them as 
UTF-8-encoded bytes (because that is what the Subversion C API accepts), 
except for `unicodepath`, which is of Python type `unicode`.
* The OS uses bytes or Unicode for paths. I call that "native string" in 
the documentation for `fsencoding`.

In the bottom of issvnurl(), we approximate how Subversion checks the 
URL. On Unix, we should therefore convert the path to `fsencoding` (like 
Subversion does). On Windows, the right thing would actually be to leave 
the path as unicode. For now, I think that we can restrict the URL to 
MBCS, given that before, only ASCII worked reliably.

(Part of) that reasoning could be put into a code comment inside 
issvnurl().

What piece of information is missing in the documentation for 
`fsencoding`?

Obviously the name is not perfect. It’s for the encoding that should be 
used for converting hg_fs_bytes_t to Unicode. Plus, On Unix it’s also 
the encoding that Subversion actually uses to convert from Unicode to OS 
bytes.
Yuya Nishihara - July 1, 2020, 12:54 p.m.
On Wed, 01 Jul 2020 06:23:19 +0200, Manuel Jacob wrote:
> On 2020-06-30 16:48, Yuya Nishihara wrote:
> > On Tue, 30 Jun 2020 16:28:55 +0200, Manuel Jacob wrote:
> >> On 2020-06-30 16:03, Yuya Nishihara wrote:
> >> > On Tue, 30 Jun 2020 15:43:29 +0200, Manuel Jacob wrote:
> >> >> >>> -            path = pycompat.fsencode(unicodepath)
> >> >> >>> +                return False
> >> >> >>> +            path = unicodepath.encode(fsencoding)
> >> >> >>
> >> >> >> I think pycompat.fsencode() is more correct since the path will be
> >> >> >> later
> >> >> >> tested by os.path.*() functions. On Python 2, I'm not sure. Maybe our
> >> >> >> encoding.unitolocal() is okay.
> >> >> >
> >> >> > It was deliberate to use something that fails when it can’t be
> >> >> > encoded. However, I thought about failing in case of a logic error,
> >> >> > not about that the URL could contain a (UTF-8-encoded) code point that
> >> >> > is not available in `fsencoding`.
> >> >> >
> >> >> > On Windows, Subversion will be able to handle all code points. We’re
> >> >> > unnecessarily restricting ourselves by doing the rest of the function
> >> >> > on bytes. Although not entirely Mercurial-idiomatic, we should maybe
> >> >> > consider doing the rest of the function in Unicode. We can also bail
> >> >> > out, but we have to make sure the warning mentions that it’s not the
> >> >> > user’s fault.
> >> >> >
> >> >> > On Unix, Subversion will try later to convert the path to the locale
> >> >> > encoding, which fails if the code point is not encodable. We should
> >> >> > bail out in this case.
> >> >>
> >> >> Let me add that with "bail out", I included that a warning describing
> >> >> the problem is shown.
> >> >>
> >> >> Since what I described is mostly about showing nicer messages, should
> >> >> they be added in an amended patch or as follow-ups?
> >> >
> >> > I'm not sure if I understood what you mean.
> >> >
> >> > I just pointed out that the fsencoding is the encoding in which
> >> > Subversion
> >> > will convert unicode to bytes, whereas fsencode() is the function for
> >> > Python
> >> > layer. And the path should be a bytes encoded in Python way.
> >> 
> >> Subversion will do a similar check as in the rest of the function to
> >> determine whether the repo is actually a SVN repo.
> >> 
> >> If we use a different encoding, we have a higher chance of running 
> >> into
> >> cases where our function thinks it’s a SVN repo, while Subversion 
> >> thinks
> >> it’s not (and vice versa). Therefore, I think we should use
> >> `fsencoding`.
> > 
> > Then, can you adjust the documentation of fsencoding? Without that, I 
> > would
> > see there are at least three types of path variables:
> > 
> >  - py_fs_bytes_t (or maybe hg_fs_bytes_t)
> >  - svn_fs_bytes_t
> >  - unicode_t (or svn_url_unicode_t)
> > 
> > Be aware that the path will be converted back to UCS2-ish encoding on
> > Windows by either Python 3 or Win32 ANSI API.
> 
> The basic distinction is between:
> 
> * Mercurial stores paths as bytes.
> * The type for Subversion paths and URLs is Unicode. We store them as 
> UTF-8-encoded bytes (because that is what the Subversion C API accepts), 
> except for `unicodepath`, which is of Python type `unicode`.
> * The OS uses bytes or Unicode for paths. I call that "native string" in 
> the documentation for `fsencoding`.
> 
> In the bottom of issvnurl(), we approximate how Subversion checks the 
> URL. On Unix, we should therefore convert the path to `fsencoding` (like 
> Subversion does). On Windows, the right thing would actually be to leave 
> the path as unicode. For now, I think that we can restrict the URL to 
> MBCS, given that before, only ASCII worked reliably.
> 
> (Part of) that reasoning could be put into a code comment inside
> issvnurl().

+1, and adding the inline comment will address my concern.

> What piece of information is missing in the documentation for 
> `fsencoding`?

IIUC, it basically says fsencoding should be used when passing bytes to
Subversion where Subversion will convert it back. So passing "fsencod"-ed
bytes to Python os.* functions looks generally wrong (because Python has
its own encoding strategy on Windows.) That was my concern, but in this
issvnurl() case, it works as you've described above.

Patch

diff --git a/hgext/convert/subversion.py b/hgext/convert/subversion.py
--- a/hgext/convert/subversion.py
+++ b/hgext/convert/subversion.py
@@ -349,6 +349,32 @@ 
 }
 
 
+class NonUtf8PercentEncodedBytes(Exception):
+    pass
+
+
+# Subversion paths are Unicode. Since the percent-decoding is done on
+# UTF-8-encoded strings, percent-encoded bytes are interpreted as UTF-8.
+def url2pathname_like_subversion(unicodepath):
+    if pycompat.ispy3:
+        # On Python 3, we have to pass unicode to urlreq.url2pathname().
+        # Percent-decoded bytes get decoded using UTF-8 and the 'replace' error
+        # handler.
+        unicodepath = urlreq.url2pathname(unicodepath)
+        if u'\N{REPLACEMENT CHARACTER}' in unicodepath:
+            raise NonUtf8PercentEncodedBytes
+        else:
+            return unicodepath
+    else:
+        # If we passed unicode on Python 2, it would be converted using the
+        # latin-1 encoding. Therefore, we pass UTF-8-encoded bytes.
+        unicodepath = urlreq.url2pathname(unicodepath.encode('utf-8'))
+        try:
+            return unicodepath.decode('utf-8')
+        except UnicodeDecodeError:
+            raise NonUtf8PercentEncodedBytes
+
+
 def issvnurl(ui, url):
     try:
         proto, path = url.split(b'://', 1)
@@ -361,7 +387,7 @@ 
             ):
                 path = path[:2] + b':/' + path[6:]
             try:
-                path.decode(fsencoding)
+                unicodepath = path.decode(fsencoding)
             except UnicodeDecodeError:
                 ui.warn(
                     _(
@@ -371,28 +397,17 @@ 
                     % pycompat.sysbytes(fsencoding)
                 )
                 return False
-            # FIXME: The following reasoning and logic is wrong and will be
-            # fixed in a following changeset.
-            # pycompat.fsdecode() / pycompat.fsencode() are used so that bytes
-            # in the URL roundtrip correctly on Unix. urlreq.url2pathname() on
-            # py3 will decode percent-encoded bytes using the utf-8 encoding
-            # and the "replace" error handler. This means that it will not
-            # preserve non-UTF-8 bytes (https://bugs.python.org/issue40983).
-            # url.open() uses the reverse function (urlreq.pathname2url()) and
-            # has a similar problem
-            # (https://bz.mercurial-scm.org/show_bug.cgi?id=6357). It makes
-            # sense to solve both problems together and handle all file URLs
-            # consistently. For now, we warn.
-            unicodepath = urlreq.url2pathname(pycompat.fsdecode(path))
-            if pycompat.ispy3 and u'\N{REPLACEMENT CHARACTER}' in unicodepath:
+            try:
+                unicodepath = url2pathname_like_subversion(unicodepath)
+            except NonUtf8PercentEncodedBytes:
                 ui.warn(
                     _(
-                        b'on Python 3, we currently do not support non-UTF-8 '
-                        b'percent-encoded bytes in file URLs for Subversion '
-                        b'repositories\n'
+                        b'Subversion does not support non-UTF-8 '
+                        b'percent-encoded bytes in file URLs\n'
                     )
                 )
-            path = pycompat.fsencode(unicodepath)
+                return False
+            path = unicodepath.encode(fsencoding)
     except ValueError:
         proto = b'file'
         path = os.path.abspath(url)
diff --git a/tests/test-convert-svn-encoding.t b/tests/test-convert-svn-encoding.t
--- a/tests/test-convert-svn-encoding.t
+++ b/tests/test-convert-svn-encoding.t
@@ -197,13 +197,13 @@ 
   abort: file:/*/$TESTTMP/\xff: missing or unsupported repository (glob) (esc)
   [255]
 
-#if py3
-For now, on Python 3, we abort when encountering non-UTF-8 percent-encoded
-bytes in a filename.
+Subversion decodes percent-encoded bytes on the converted, UTF-8-encoded
+string. Therefore, if the percent-encoded bytes aren't valid UTF-8, Subversion
+would choke on them when converting them to the locale encoding.
 
   $ hg convert file://$TESTTMP/%FF test
   initializing destination test repository
-  on Python 3, we currently do not support non-UTF-8 percent-encoded bytes in file URLs for Subversion repositories
+  Subversion does not support non-UTF-8 percent-encoded bytes in file URLs
   file:/*/$TESTTMP/%FF does not look like a CVS checkout (glob)
   $TESTTMP/file:$TESTTMP/%FF does not look like a Git repository
   file:/*/$TESTTMP/%FF does not look like a Subversion repository (glob)
@@ -215,4 +215,3 @@ 
   file:/*/$TESTTMP/%FF does not look like a P4 repository (glob)
   abort: file:/*/$TESTTMP/%FF: missing or unsupported repository (glob)
   [255]
-#endif