Patchwork [stable] fsmonitor: match watchman and local encoding

login
register
mail settings
Submitter Olivier Trempe
Date March 6, 2017, 5:50 p.m.
Message ID <c9d3f8d1a57346228f5c.1488822636@ronce.innov.local>
Download mbox | patch
Permalink /patch/18944/
State Changes Requested
Headers show

Comments

Olivier Trempe - March 6, 2017, 5:50 p.m.
# HG changeset patch
# User Olivier Trempe <oliviertrempe@gmail.com>
# Date 1488810111 18000
#      Mon Mar 06 09:21:51 2017 -0500
# Branch stable
# Node ID c9d3f8d1a57346228f5c3bb749acdff90d37e194
# Parent  6b00c3ecd15b26587de8cca6fab811069cba3b2f
fsmonitor: match watchman and local encoding

watchman's paths encoding is os dependant. For example, on Windows, it's
always utf-8. This causes paths comparison mismatch when paths contain non ascii
characters.
Siddharth Agarwal - March 7, 2017, 1:14 a.m.
On 3/6/17 09:50, Olivier Trempe wrote:
> # HG changeset patch
> # User Olivier Trempe <oliviertrempe@gmail.com>
> # Date 1488810111 18000
> #      Mon Mar 06 09:21:51 2017 -0500
> # Branch stable
> # Node ID c9d3f8d1a57346228f5c3bb749acdff90d37e194
> # Parent  6b00c3ecd15b26587de8cca6fab811069cba3b2f
> fsmonitor: match watchman and local encoding
>
> watchman's paths encoding is os dependant. For example, on Windows, it's
> always utf-8. This causes paths comparison mismatch when paths contain non ascii
> characters.

I really doubt this is correct unixes, where Watchman returns bytes as 
they are on disk, which matches exactly with what Mercurial wants.

(On Windows Watchman indeed always returns UTF-8.)

>
> diff -r 6b00c3ecd15b -r c9d3f8d1a573 hgext/fsmonitor/__init__.py
> --- a/hgext/fsmonitor/__init__.py	Thu Mar 02 20:19:45 2017 -0500
> +++ b/hgext/fsmonitor/__init__.py	Mon Mar 06 09:21:51 2017 -0500
> @@ -99,6 +99,7 @@
>   from mercurial import (
>       context,
>       encoding,
> +    error,
>       extensions,
>       localrepo,
>       merge,
> @@ -110,6 +111,7 @@
>   from mercurial import match as matchmod
>   
>   from . import (
> +    pywatchman,
>       state,
>       watchmanclient,
>   )
> @@ -159,6 +161,20 @@
>       sha1.update('\0')
>       return sha1.hexdigest()
>   
> +def _watchmanencodingtolocal(path):
> +    """Fix path to match watchman and local encoding
> +
> +    watchman's paths encoding is os dependant. For example, on Windows, it's

"dependent"

> +    always utf-8. This converts watchman encoded paths to local encoding to
> +    avoid paths comparison mismatch.
> +    """
> +    try:
> +        decoded = pywatchman.encoding.decode_local(path)
> +    except UnicodeDecodeError as e:
> +        raise error.Abort(e, hint='watchman encoding error')

Could you elaborate a bit on when the exception can happen?

> +
> +    return decoded.encode(encoding.encoding, 'replace')
> +
>   def overridewalk(orig, self, match, subrepos, unknown, ignored, full=True):
>       '''Replacement for dirstate.walk, hooking into Watchman.
>   
> @@ -302,7 +318,7 @@
>       # Watchman tracks files.  We use this property to reconcile deletes
>       # for name case changes.
>       for entry in result['files']:
> -        fname = entry['name']
> +        fname = _watchmanencodingtolocal(entry['name'])

Adding a non-trivial function call here is likely going to regress 
performance. Have you measured the impact?

- Siddharth

>           if switch_slashes:
>               fname = fname.replace('\\', '/')
>           if normalize:
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@mercurial-scm.org
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Olivier Trempe - March 7, 2017, 1:41 p.m.
On Mon, Mar 6, 2017 at 8:14 PM, Siddharth Agarwal <sid@less-broken.com>
wrote:

> On 3/6/17 09:50, Olivier Trempe wrote:
>
>> # HG changeset patch
>> # User Olivier Trempe <oliviertrempe@gmail.com>
>> # Date 1488810111 18000
>> #      Mon Mar 06 09:21:51 2017 -0500
>> # Branch stable
>> # Node ID c9d3f8d1a57346228f5c3bb749acdff90d37e194
>> # Parent  6b00c3ecd15b26587de8cca6fab811069cba3b2f
>> fsmonitor: match watchman and local encoding
>>
>> watchman's paths encoding is os dependant. For example, on Windows, it's
>> always utf-8. This causes paths comparison mismatch when paths contain
>> non ascii
>> characters.
>>
>
> I really doubt this is correct unixes, where Watchman returns bytes as
> they are on disk, which matches exactly with what Mercurial wants.
>
> (On Windows Watchman indeed always returns UTF-8.)
>
> This is what I meant by "os dependent". You get a different behavior on a
different os. I can rewrite it to be more Windows specific.

>
>> diff -r 6b00c3ecd15b -r c9d3f8d1a573 hgext/fsmonitor/__init__.py
>> --- a/hgext/fsmonitor/__init__.py       Thu Mar 02 20:19:45 2017 -0500
>> +++ b/hgext/fsmonitor/__init__.py       Mon Mar 06 09:21:51 2017 -0500
>> @@ -99,6 +99,7 @@
>>   from mercurial import (
>>       context,
>>       encoding,
>> +    error,
>>       extensions,
>>       localrepo,
>>       merge,
>> @@ -110,6 +111,7 @@
>>   from mercurial import match as matchmod
>>     from . import (
>> +    pywatchman,
>>       state,
>>       watchmanclient,
>>   )
>> @@ -159,6 +161,20 @@
>>       sha1.update('\0')
>>       return sha1.hexdigest()
>>   +def _watchmanencodingtolocal(path):
>> +    """Fix path to match watchman and local encoding
>> +
>> +    watchman's paths encoding is os dependant. For example, on Windows,
>> it's
>>
>
> "dependent"
>
> +    always utf-8. This converts watchman encoded paths to local encoding
>> to
>> +    avoid paths comparison mismatch.
>> +    """
>> +    try:
>> +        decoded = pywatchman.encoding.decode_local(path)
>> +    except UnicodeDecodeError as e:
>> +        raise error.Abort(e, hint='watchman encoding error')
>>
>
> Could you elaborate a bit on when the exception can happen?
>

This is a defensive implementation. I did not encounter this exception
myself. However, even though watchman and pywatchman are maintained
together, there is no guarantee that the user will run the watchman build
matching the pywatchman version distributed with mercurial. Given that
getting the watchman encoding relies on pywatchman's encoding module, you
could get an error if something changes on either side. I just didn't want
to let a possible unhandled exception propagate up to the user.

>
> +
>> +    return decoded.encode(encoding.encoding, 'replace')
>> +
>>   def overridewalk(orig, self, match, subrepos, unknown, ignored,
>> full=True):
>>       '''Replacement for dirstate.walk, hooking into Watchman.
>>   @@ -302,7 +318,7 @@
>>       # Watchman tracks files.  We use this property to reconcile deletes
>>       # for name case changes.
>>       for entry in result['files']:
>> -        fname = entry['name']
>> +        fname = _watchmanencodingtolocal(entry['name'])
>>
>
> Adding a non-trivial function call here is likely going to regress
> performance. Have you measured the impact?


100000 calls to the function result in a ~0.3 seconds impact.
I can optimize it by triggering the conversion only if the encoding
returned by pywatchman is different than the local encoding. If they are
the same, the impact is close to null. If not, the impact is ~0.24 seconds.

>
>
- Siddharth
>
>           if switch_slashes:
>>               fname = fname.replace('\\', '/')
>>           if normalize:
>> _______________________________________________
>> Mercurial-devel mailing list
>> Mercurial-devel@mercurial-scm.org
>> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
>>
>
>
>
This is a very nasty bug on windows. In common cases, you don't get
modified files when calling status. No so bad. However, I lost
modifications while doing history editing operations. Files containing
non-ascii characters in their path were silently discarded from a rebased
changeset.

Thanks for your comments. I will submit a new patch when I get feedback
from you.

Patch

diff -r 6b00c3ecd15b -r c9d3f8d1a573 hgext/fsmonitor/__init__.py
--- a/hgext/fsmonitor/__init__.py	Thu Mar 02 20:19:45 2017 -0500
+++ b/hgext/fsmonitor/__init__.py	Mon Mar 06 09:21:51 2017 -0500
@@ -99,6 +99,7 @@ 
 from mercurial import (
     context,
     encoding,
+    error,
     extensions,
     localrepo,
     merge,
@@ -110,6 +111,7 @@ 
 from mercurial import match as matchmod
 
 from . import (
+    pywatchman,
     state,
     watchmanclient,
 )
@@ -159,6 +161,20 @@ 
     sha1.update('\0')
     return sha1.hexdigest()
 
+def _watchmanencodingtolocal(path):
+    """Fix path to match watchman and local encoding
+
+    watchman's paths encoding is os dependant. For example, on Windows, it's
+    always utf-8. This converts watchman encoded paths to local encoding to
+    avoid paths comparison mismatch.
+    """
+    try:
+        decoded = pywatchman.encoding.decode_local(path)
+    except UnicodeDecodeError as e:
+        raise error.Abort(e, hint='watchman encoding error')
+
+    return decoded.encode(encoding.encoding, 'replace')
+
 def overridewalk(orig, self, match, subrepos, unknown, ignored, full=True):
     '''Replacement for dirstate.walk, hooking into Watchman.
 
@@ -302,7 +318,7 @@ 
     # Watchman tracks files.  We use this property to reconcile deletes
     # for name case changes.
     for entry in result['files']:
-        fname = entry['name']
+        fname = _watchmanencodingtolocal(entry['name'])
         if switch_slashes:
             fname = fname.replace('\\', '/')
         if normalize: