Patchwork D7864: rust-utils: add Rust implementation of Python's "os.path.splitdrive"

login
register
mail settings
Submitter phabricator
Date Jan. 14, 2020, 5:33 p.m.
Message ID <differential-rev-PHID-DREV-wzsurufdwdx5zgeqjs2p-req@mercurial-scm.org>
Download mbox | patch
Permalink /patch/44330/
State Superseded
Headers show

Comments

phabricator - Jan. 14, 2020, 5:33 p.m.
Alphare created this revision.
Herald added subscribers: mercurial-devel, kevincox, durin42.
Herald added a reviewer: hg-reviewers.

REVISION SUMMARY
  I also wrote the NT version although I didn't mean to at first, so I thought
  I would keep it, so that any further effort to get the Rust code working on
  Windows is a little easier.

REPOSITORY
  rHG Mercurial

BRANCH
  default

REVISION DETAIL
  https://phab.mercurial-scm.org/D7864

AFFECTED FILES
  rust/hg-core/src/utils/files.rs

CHANGE DETAILS




To: Alphare, #hg-reviewers
Cc: durin42, kevincox, mercurial-devel
phabricator - Jan. 15, 2020, 2:58 p.m.
This revision now requires changes to proceed.
kevincox added inline comments.
kevincox requested changes to this revision.

INLINE COMMENTS

> files.rs:109
> +/// Paths cannot contain both a drive letter and a UNC path.
> +pub fn split_drive(path: impl AsRef<HgPath>) -> (HgPathBuf, HgPathBuf) {
> +    let path = path.as_ref();

I think this can be simplified. See https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=78136ead96596afe6305e7542a881ca4. I had to use &[u8] to avoid having to copy all of the HgPath stuff into the playground, but the code should be easy to integrate.

Notable changes:

- Avoid creating `norm_bytes` vector (and allocation) by creating an is_sep function.
  - Note, we can probably use std::path::is_separator <https://doc.rust-lang.org/std/path/fn.is_separator.html>
- Return references to the input since we guarantee that the output will be parts of the input. The caller can always clone.
- Use slice::split_at to make it obvious that we are returning the entire path just split.
- Use pattern matching rather than unwrapping.
- Use fall-through returns to make it obvious we handle every path.

> files.rs:230
> +        );
> +    }
> +

We should add an example of a UNC and Drive path here to ensure that they don't get split.

REPOSITORY
  rHG Mercurial

CHANGES SINCE LAST ACTION
  https://phab.mercurial-scm.org/D7864/new/

REVISION DETAIL
  https://phab.mercurial-scm.org/D7864

To: Alphare, #hg-reviewers, kevincox
Cc: durin42, kevincox, mercurial-devel
phabricator - Jan. 15, 2020, 7:42 p.m.
Alphare added inline comments.

INLINE COMMENTS

> kevincox wrote in files.rs:109
> I think this can be simplified. See https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=78136ead96596afe6305e7542a881ca4. I had to use &[u8] to avoid having to copy all of the HgPath stuff into the playground, but the code should be easy to integrate.
> 
> Notable changes:
> 
> - Avoid creating `norm_bytes` vector (and allocation) by creating an is_sep function.
>   - Note, we can probably use std::path::is_separator <https://doc.rust-lang.org/std/path/fn.is_separator.html>
> - Return references to the input since we guarantee that the output will be parts of the input. The caller can always clone.
> - Use slice::split_at to make it obvious that we are returning the entire path just split.
> - Use pattern matching rather than unwrapping.
> - Use fall-through returns to make it obvious we handle every path.

Indeed, this is much better. While trying to adapt this code to fit with `HgPath`, I find myself needing to translate to and from bytes whenever indexing or when using `split_at`. Should we give a `HgPath` a `split_at` method or also all the `Index<>` ones? I remember that we decided against that earlier.

> kevincox wrote in files.rs:230
> We should add an example of a UNC and Drive path here to ensure that they don't get split.

Good idea

REPOSITORY
  rHG Mercurial

CHANGES SINCE LAST ACTION
  https://phab.mercurial-scm.org/D7864/new/

REVISION DETAIL
  https://phab.mercurial-scm.org/D7864

To: Alphare, #hg-reviewers, kevincox
Cc: durin42, kevincox, mercurial-devel
phabricator - Jan. 15, 2020, 8:07 p.m.
kevincox added inline comments.

INLINE COMMENTS

> Alphare wrote in files.rs:109
> Indeed, this is much better. While trying to adapt this code to fit with `HgPath`, I find myself needing to translate to and from bytes whenever indexing or when using `split_at`. Should we give a `HgPath` a `split_at` method or also all the `Index<>` ones? I remember that we decided against that earlier.

I would recommend just converting to bytes at the top of the function then converting the return value to a path at the exit. I feel when you are doing manipulation like this it makes the most sense to treat it as plain bytes within the function.

Alternatively I wouldn't mind putting an index operator but have a slight preference for `path.as_bytes()[n]` to keep it explicit as most of the code shouldn't be reaching into paths.

REPOSITORY
  rHG Mercurial

CHANGES SINCE LAST ACTION
  https://phab.mercurial-scm.org/D7864/new/

REVISION DETAIL
  https://phab.mercurial-scm.org/D7864

To: Alphare, #hg-reviewers, kevincox
Cc: durin42, kevincox, mercurial-devel
phabricator - Jan. 15, 2020, 8:12 p.m.
Alphare added inline comments.

INLINE COMMENTS

> kevincox wrote in files.rs:109
> I would recommend just converting to bytes at the top of the function then converting the return value to a path at the exit. I feel when you are doing manipulation like this it makes the most sense to treat it as plain bytes within the function.
> 
> Alternatively I wouldn't mind putting an index operator but have a slight preference for `path.as_bytes()[n]` to keep it explicit as most of the code shouldn't be reaching into paths.

I agree. I'll send a follow-up for all your remarks in an hour.

REPOSITORY
  rHG Mercurial

CHANGES SINCE LAST ACTION
  https://phab.mercurial-scm.org/D7864/new/

REVISION DETAIL
  https://phab.mercurial-scm.org/D7864

To: Alphare, #hg-reviewers, kevincox
Cc: durin42, kevincox, mercurial-devel
phabricator - Jan. 16, 2020, 9:59 a.m.
kevincox added a comment.
kevincox accepted this revision.


  One last note is why not just put this in HgPath?

REPOSITORY
  rHG Mercurial

CHANGES SINCE LAST ACTION
  https://phab.mercurial-scm.org/D7864/new/

REVISION DETAIL
  https://phab.mercurial-scm.org/D7864

To: Alphare, #hg-reviewers, kevincox
Cc: durin42, kevincox, mercurial-devel
phabricator - Jan. 16, 2020, 10:02 a.m.
Alphare added a comment.


  In D7864#116100 <https://phab.mercurial-scm.org/D7864#116100>, @kevincox wrote:
  
  > One last note is why not just put this in HgPath?
  
  I was about to start making this a method on `HgPath`, I guess we agree.

REPOSITORY
  rHG Mercurial

CHANGES SINCE LAST ACTION
  https://phab.mercurial-scm.org/D7864/new/

REVISION DETAIL
  https://phab.mercurial-scm.org/D7864

To: Alphare, #hg-reviewers, kevincox
Cc: durin42, kevincox, mercurial-devel
Yuya Nishihara - Jan. 21, 2020, 2:37 p.m.
So `HgPath` type is no longer intended for "a repository-relative path"?
phabricator - Jan. 21, 2020, 2:41 p.m.
yuja added a comment.


  So `HgPath` type is no longer intended for "a repository-relative path"?

REPOSITORY
  rHG Mercurial

CHANGES SINCE LAST ACTION
  https://phab.mercurial-scm.org/D7864/new/

REVISION DETAIL
  https://phab.mercurial-scm.org/D7864

To: Alphare, #hg-reviewers, kevincox, pulkit
Cc: yuja, durin42, kevincox, mercurial-devel

Patch

diff --git a/rust/hg-core/src/utils/files.rs b/rust/hg-core/src/utils/files.rs
--- a/rust/hg-core/src/utils/files.rs
+++ b/rust/hg-core/src/utils/files.rs
@@ -85,6 +85,83 @@ 
     path.to_ascii_lowercase()
 }
 
+#[cfg(windows)]
+/// Copied from the Python stdlib's `os.path.splitdrive` implementation.
+///
+/// Split a pathname into drive/UNC sharepoint and relative path specifiers.
+/// Returns a 2-tuple (drive_or_unc, path); either part may be empty.
+///
+/// If you assign
+///  result = split_drive(p)
+/// It is always true that:
+///  result[0] + result[1] == p
+///
+/// If the path contained a drive letter, drive_or_unc will contain everything
+/// up to and including the colon.
+/// e.g. split_drive("c:/dir") returns ("c:", "/dir")
+///
+/// If the path contained a UNC path, the drive_or_unc will contain the host
+/// name and share up to but not including the fourth directory separator
+/// character.
+/// e.g. split_drive("//host/computer/dir") returns ("//host/computer", "/dir")
+///
+/// Paths cannot contain both a drive letter and a UNC path.
+pub fn split_drive(path: impl AsRef<HgPath>) -> (HgPathBuf, HgPathBuf) {
+    let path = path.as_ref();
+    let sep = std::path::MAIN_SEPARATOR as u8;
+    let bytes = path.as_bytes();
+    let norm_bytes: Vec<_> = path
+        .as_bytes()
+        .iter()
+        .map(|c| if *c == b'\\' { sep } else { *c })
+        .collect();
+    if norm_bytes.len() > 1 {
+        if norm_bytes[0] == sep
+            && norm_bytes[1] == sep
+            && (norm_bytes.len() < 3 || norm_bytes[2] != sep)
+        {
+            // Is a UNC path:
+            // vvvvvvvvvvvvvvvvvvvv drive letter or UNC path
+            // \\machine\mountpoint\directory\etc\...
+            //           directory ^^^^^^^^^^^^^^^
+            let index = norm_bytes[2..].iter().position(|b| *b == sep);
+            if index.is_none() {
+                return (HgPathBuf::new(), path.to_owned());
+            }
+            let index = index.unwrap() + 2;
+
+            let index2 =
+                norm_bytes[index + 1..].iter().position(|b| *b == sep);
+            // A UNC path can't have two slashes in a row
+            // (after the initial two)
+            if index2 == Some(0) {
+                return (HgPathBuf::new(), path.to_owned());
+            }
+            let index2 = match index2 {
+                Some(i) => i + index + 1,
+                None => norm_bytes.len(),
+            };
+            return (
+                HgPathBuf::from_bytes(&bytes[..index2]),
+                HgPathBuf::from_bytes(&bytes[index2..]),
+            );
+        }
+        if norm_bytes[1] == b':' {
+            return (
+                HgPathBuf::from_bytes(&bytes[..2]),
+                HgPathBuf::from_bytes(&bytes[2..]),
+            );
+        }
+    }
+    (HgPathBuf::new(), path.to_owned())
+}
+
+#[cfg(unix)]
+/// Split a pathname into drive and path. On Posix, drive is always empty.
+pub fn split_drive(path: impl AsRef<HgPath>) -> (HgPathBuf, HgPathBuf) {
+    (HgPathBuf::new(), path.as_ref().into())
+}
+
 #[derive(Eq, PartialEq, Ord, PartialOrd, Copy, Clone)]
 pub struct HgMetadata {
     pub st_dev: u64,
@@ -133,4 +210,101 @@ 
         assert_eq!(dirs.next(), None);
         assert_eq!(dirs.next(), None);
     }
+
+    #[test]
+    #[cfg(unix)]
+    fn test_split_drive() {
+        // Taken from the Python stdlib's tests
+        assert_eq!(
+            split_drive(HgPath::new(br"/foo/bar")),
+            (HgPathBuf::new(), HgPathBuf::from_bytes(br"/foo/bar"))
+        );
+        assert_eq!(
+            split_drive(HgPath::new(br"foo:bar")),
+            (HgPathBuf::new(), HgPathBuf::from_bytes(br"foo:bar"))
+        );
+        assert_eq!(
+            split_drive(HgPath::new(br":foo:bar")),
+            (HgPathBuf::new(), HgPathBuf::from_bytes(br":foo:bar"))
+        );
+    }
+
+    #[test]
+    #[cfg(windows)]
+    fn test_split_drive() {
+        assert_eq!(
+            split_drive(HgPathBuf::from_bytes(br"c:\foo\bar")),
+            (
+                HgPathBuf::from_bytes(br"c:"),
+                HgPathBuf::from_bytes(br"\foo\bar")
+            )
+        );
+        assert_eq!(
+            split_drive(HgPathBuf::from_bytes(b"c:/foo/bar")),
+            (
+                HgPathBuf::from_bytes(br"c:"),
+                HgPathBuf::from_bytes(br"/foo/bar")
+            )
+        );
+        assert_eq!(
+            split_drive(HgPathBuf::from_bytes(br"\\conky\mountpoint\foo\bar")),
+            (
+                HgPathBuf::from_bytes(br"\\conky\mountpoint"),
+                HgPathBuf::from_bytes(br"\foo\bar")
+            )
+        );
+        assert_eq!(
+            split_drive(HgPathBuf::from_bytes(br"//conky/mountpoint/foo/bar")),
+            (
+                HgPathBuf::from_bytes(br"//conky/mountpoint"),
+                HgPathBuf::from_bytes(br"/foo/bar")
+            )
+        );
+        assert_eq!(
+            split_drive(HgPathBuf::from_bytes(
+                br"\\\conky\mountpoint\foo\bar"
+            )),
+            (
+                HgPathBuf::from_bytes(br""),
+                HgPathBuf::from_bytes(br"\\\conky\mountpoint\foo\bar")
+            )
+        );
+        assert_eq!(
+            split_drive(HgPathBuf::from_bytes(
+                br"///conky/mountpoint/foo/bar"
+            )),
+            (
+                HgPathBuf::from_bytes(br""),
+                HgPathBuf::from_bytes(br"///conky/mountpoint/foo/bar")
+            )
+        );
+        assert_eq!(
+            split_drive(HgPathBuf::from_bytes(
+                br"\\conky\\mountpoint\foo\bar"
+            )),
+            (
+                HgPathBuf::from_bytes(br""),
+                HgPathBuf::from_bytes(br"\\conky\\mountpoint\foo\bar")
+            )
+        );
+        assert_eq!(
+            split_drive(HgPathBuf::from_bytes(
+                br"//conky//mountpoint/foo/bar"
+            )),
+            (
+                HgPathBuf::from_bytes(br""),
+                HgPathBuf::from_bytes(br"//conky//mountpoint/foo/bar")
+            )
+        );
+        // UNC part containing U+0130
+        assert_eq!(
+            split_drive(HgPathBuf::from_bytes(
+                b"//conky/MOUNTPO\xc4\xb0NT/foo/bar"
+            )),
+            (
+                HgPathBuf::from_bytes(b"//conky/MOUNTPO\xc4\xb0NT"),
+                HgPathBuf::from_bytes(br"/foo/bar")
+            )
+        );
+    }
 }