Patchwork D7788: rust-node: binary Node and conversion utilities

login
register
mail settings
Submitter phabricator
Date Jan. 6, 2020, 7:26 p.m.
Message ID <differential-rev-PHID-DREV-zijyu5vwcwp4zupgxeve-req@mercurial-scm.org>
Download mbox | patch
Permalink /patch/44154/
State Superseded
Headers show

Comments

phabricator - Jan. 6, 2020, 7:26 p.m.
gracinet created this revision.
Herald added subscribers: mercurial-devel, kevincox, durin42.
Herald added a reviewer: hg-reviewers.

REVISION SUMMARY
  Our choice of type makes sure that a `Node` has the exact
  wanted size. We'll use a different type for prefixes.

REPOSITORY
  rHG Mercurial

BRANCH
  default

REVISION DETAIL
  https://phab.mercurial-scm.org/D7788

AFFECTED FILES
  rust/hg-core/src/dirstate/dirs_multiset.rs
  rust/hg-core/src/matchers.rs
  rust/hg-core/src/revlog.rs
  rust/hg-core/src/revlog/node.rs
  rust/hg-core/src/utils.rs
  rust/hg-core/src/utils/hg_path.rs

CHANGE DETAILS




To: gracinet, #hg-reviewers
Cc: durin42, kevincox, mercurial-devel
phabricator - Jan. 16, 2020, 10:50 a.m.
kevincox added inline comments.
kevincox accepted this revision.

INLINE COMMENTS

> node.rs:14
> +/// Binary revisions SHA
> +pub type Node = [u8; 20];
> +

I would recommend making this a proper type because almost no one should be poking at the bytes of the hash.

  struct Node([u8; 20]);

You can make the field `pub` if necessary.

> node.rs:31
> +    for i in 0..20 {
> +        node[i] = u8::from_str_radix(&hex[i * 2..i * 2 + 2], 16)?
> +    }

I find progressing through the string easier to understand than this slicing. WDYT.

  for byte in node.iter_mut() {
      *byte = u8::from_str_radix(&hex[..2], 16)?;
      hex = &hex[2..];
  }

> node.rs:39
> +    as_vec.join("")
> +}
> +

If we want to avoid a number of allocations we can do:

  pub fn node_to_hex(n: &Node) -> String {
      let mut r = String::with_capacity(n.len() * 2);
      for byte in n {
          write!(r, "{:02x}", byte).unwrap();
      }
      debug_assert_eq!(r.len(), n.len() * 2);
      r
  }

The generated code for `write!()` doesn't look great but if hex printing shows up in benchmarks it would be trivial to write a custom hex-formatter.

REPOSITORY
  rHG Mercurial

CHANGES SINCE LAST ACTION
  https://phab.mercurial-scm.org/D7788/new/

REVISION DETAIL
  https://phab.mercurial-scm.org/D7788

To: gracinet, #hg-reviewers, kevincox
Cc: durin42, kevincox, mercurial-devel
phabricator - Jan. 18, 2020, 1:24 a.m.
This revision now requires changes to proceed.
martinvonz added inline comments.
martinvonz requested changes to this revision.

INLINE COMMENTS

> dirs_multiset.rs:111
>                          path.as_ref().to_owned(),
> -                    ))
> +                    ));
>                  }

please revert unrelated change

> kevincox wrote in node.rs:31
> I find progressing through the string easier to understand than this slicing. WDYT.
> 
>   for byte in node.iter_mut() {
>       *byte = u8::from_str_radix(&hex[..2], 16)?;
>       hex = &hex[2..];
>   }

Or even use `hex::decode()` from the `hex` crate? Can also use `hex::encode()` in `node_to_hex` if we're okay with that.

> node.rs:41-51
> +/// Retrieve the `i`th half-byte from a bytes slice
> +///
> +/// This is also the `i`th hexadecimal digit in numeric form,
> +/// also called a [nybble](https://en.wikipedia.org/wiki/Nibble).
> +pub fn get_nybble(i: usize, s: &[u8]) -> u8 {
> +    if i % 2 == 0 {
> +        s[i / 2] >> 4

This feels like something that will only be used in nodemap.rs, so put it there instead? Or do you think (or know) that it will be used elsewhere soon?

REPOSITORY
  rHG Mercurial

CHANGES SINCE LAST ACTION
  https://phab.mercurial-scm.org/D7788/new/

REVISION DETAIL
  https://phab.mercurial-scm.org/D7788

To: gracinet, #hg-reviewers, kevincox, martinvonz
Cc: martinvonz, durin42, kevincox, mercurial-devel
phabricator - Jan. 18, 2020, 10:57 p.m.
martinvonz added inline comments.

INLINE COMMENTS

> node.rs:14
> +/// Binary revisions SHA
> +pub type Node = [u8; 20];
> +

Oh, I think @durin42 and others were trying to remove the assumption that nodeids are 20 bytes (so we can use other hashes and hash lengths).

REPOSITORY
  rHG Mercurial

CHANGES SINCE LAST ACTION
  https://phab.mercurial-scm.org/D7788/new/

REVISION DETAIL
  https://phab.mercurial-scm.org/D7788

To: gracinet, #hg-reviewers, kevincox, martinvonz
Cc: martinvonz, durin42, kevincox, mercurial-devel
phabricator - Jan. 18, 2020, 11:18 p.m.
durin42 added a comment.


  Yes, I'd appreciate not adding any assumptions that nodes are 20 bytes.

REPOSITORY
  rHG Mercurial

CHANGES SINCE LAST ACTION
  https://phab.mercurial-scm.org/D7788/new/

REVISION DETAIL
  https://phab.mercurial-scm.org/D7788

To: gracinet, #hg-reviewers, kevincox, martinvonz
Cc: martinvonz, durin42, kevincox, mercurial-devel
phabricator - Jan. 20, 2020, 1:17 p.m.
gracinet added a comment.


  @durin42 @martinvonz we might have a nice path forward with this, with the suggestion of @kevincox to make `Node` a truly defined type, not just an alias, with a private member for the bytes.
  
  Indeed I find it important to know if a node information is complete or merely a prefix. With a truly defined type, the assumption that it's 20 bytes long would be completely encapsulated in the `Node` type. It would then be very easy (much easier than with Python code) to change it to something else, whether that'd be a longer bag of bytes, or a struct with two values, an enum with alternatives for sha-1 or sha-2 etc.

REPOSITORY
  rHG Mercurial

CHANGES SINCE LAST ACTION
  https://phab.mercurial-scm.org/D7788/new/

REVISION DETAIL
  https://phab.mercurial-scm.org/D7788

To: gracinet, #hg-reviewers, kevincox, martinvonz
Cc: martinvonz, durin42, kevincox, mercurial-devel
phabricator - Jan. 21, 2020, 6:54 p.m.
martinvonz added a comment.


  In D7788#116756 <https://phab.mercurial-scm.org/D7788#116756>, @gracinet wrote:
  
  > @durin42 @martinvonz we might have a nice path forward with this, with the suggestion of @kevincox to make `Node` a truly defined type, not just an alias, with a private member for the bytes.
  
  Sounds good. I think I would have preferred that anyway.

REPOSITORY
  rHG Mercurial

CHANGES SINCE LAST ACTION
  https://phab.mercurial-scm.org/D7788/new/

REVISION DETAIL
  https://phab.mercurial-scm.org/D7788

To: gracinet, #hg-reviewers, kevincox, martinvonz
Cc: martinvonz, durin42, kevincox, mercurial-devel
phabricator - Jan. 22, 2020, 8:27 p.m.
gracinet added a comment.
gracinet marked 6 inline comments as done.


  As the new commit description explains, I've done all I could to make the change of hash format easier
  
  I should add that my other colleagues at Octobus seem also to be involved in the improvement of hashing, there's no risk we would forget this one.

INLINE COMMENTS

> martinvonz wrote in dirs_multiset.rs:111
> please revert unrelated change

ah yes, sorry must have been a rebase artifact

> martinvonz wrote in node.rs:31
> Or even use `hex::decode()` from the `hex` crate? Can also use `hex::encode()` in `node_to_hex` if we're okay with that.

Yes, `hex` is nice, it's able to produce arrays without further copies and does length checks. Its error type does not repeat the input string, though, which in my experience is really very useful to understand problems when they occur.

`hex` does not support hexadecimal string representations with odd numbers of
digits, which will be easy to work around in the next patch.

> kevincox wrote in node.rs:39
> If we want to avoid a number of allocations we can do:
> 
>   pub fn node_to_hex(n: &Node) -> String {
>       let mut r = String::with_capacity(n.len() * 2);
>       for byte in n {
>           write!(r, "{:02x}", byte).unwrap();
>       }
>       debug_assert_eq!(r.len(), n.len() * 2);
>       r
>   }
> 
> The generated code for `write!()` doesn't look great but if hex printing shows up in benchmarks it would be trivial to write a custom hex-formatter.

Finally used the `hex` crate here too.

> martinvonz wrote in node.rs:41-51
> This feels like something that will only be used in nodemap.rs, so put it there instead? Or do you think (or know) that it will be used elsewhere soon?

Well it's become a public method in the new `Node` struct.
`NodePrefix` will be another user.

I tend to prefer encapsulation over where it's used for these choices. This way, there's no tempation for nodemap.rs to play with the binary data directly.

REPOSITORY
  rHG Mercurial

CHANGES SINCE LAST ACTION
  https://phab.mercurial-scm.org/D7788/new/

REVISION DETAIL
  https://phab.mercurial-scm.org/D7788

To: gracinet, #hg-reviewers, kevincox, martinvonz
Cc: martinvonz, durin42, kevincox, mercurial-devel

Patch

diff --git a/rust/hg-core/src/utils/hg_path.rs b/rust/hg-core/src/utils/hg_path.rs
--- a/rust/hg-core/src/utils/hg_path.rs
+++ b/rust/hg-core/src/utils/hg_path.rs
@@ -157,7 +157,7 @@ 
                     return Err(HgPathError::ContainsNullByte(
                         bytes.to_vec(),
                         index,
-                    ))
+                    ));
                 }
                 b'/' => {
                     if previous_byte.is_some() && previous_byte == Some(b'/') {
diff --git a/rust/hg-core/src/utils.rs b/rust/hg-core/src/utils.rs
--- a/rust/hg-core/src/utils.rs
+++ b/rust/hg-core/src/utils.rs
@@ -18,10 +18,7 @@ 
 /// use crate::hg::utils::replace_slice;
 /// let mut line = b"I hate writing tests!".to_vec();
 /// replace_slice(&mut line, b"hate", b"love");
-/// assert_eq!(
-///     line,
-///     b"I love writing tests!".to_vec()
-/// );
+/// assert_eq!(line, b"I love writing tests!".to_vec());
 /// ```
 pub fn replace_slice<T>(buf: &mut [T], from: &[T], to: &[T])
 where
@@ -66,18 +63,9 @@ 
 
     /// ```
     /// use hg::utils::SliceExt;
-    /// assert_eq!(
-    ///     b"  to trim  ".trim(),
-    ///     b"to trim"
-    /// );
-    /// assert_eq!(
-    ///     b"to trim  ".trim(),
-    ///     b"to trim"
-    /// );
-    /// assert_eq!(
-    ///     b"  to trim".trim(),
-    ///     b"to trim"
-    /// );
+    /// assert_eq!(b"  to trim  ".trim(), b"to trim");
+    /// assert_eq!(b"to trim  ".trim(), b"to trim");
+    /// assert_eq!(b"  to trim".trim(), b"to trim");
     /// ```
     fn trim(&self) -> &[u8] {
         self.trim_start().trim_end()
diff --git a/rust/hg-core/src/revlog/node.rs b/rust/hg-core/src/revlog/node.rs
new file mode 100644
--- /dev/null
+++ b/rust/hg-core/src/revlog/node.rs
@@ -0,0 +1,91 @@ 
+// Copyright 2019-2020 Georges Racinet <georges.racinet@octobus.net>
+//
+// This software may be used and distributed according to the terms of the
+// GNU General Public License version 2 or any later version.
+
+//! Definitions and utilities for Revision nodes
+//!
+//! In Mercurial code base, it is customary to call "a node" the binary SHA
+//! of a revision.
+
+use std::num::ParseIntError;
+
+/// Binary revisions SHA
+pub type Node = [u8; 20];
+
+/// The node value for NULL_REVISION
+pub const NULL_NODE: Node = [0; 20];
+
+#[derive(Debug, PartialEq)]
+pub enum NodeError {
+    ExactLengthRequired(String),
+    NotHexadecimal,
+}
+
+pub fn node_from_hex(hex: &str) -> Result<Node, NodeError> {
+    if hex.len() != 40 {
+        return Err(NodeError::ExactLengthRequired(hex.to_string()));
+    }
+    let mut node = [0; 20];
+    for i in 0..20 {
+        node[i] = u8::from_str_radix(&hex[i * 2..i * 2 + 2], 16)?
+    }
+    Ok(node)
+}
+
+pub fn node_to_hex(n: &Node) -> String {
+    let as_vec: Vec<String> = n.iter().map(|b| format!("{:02x}", b)).collect();
+    as_vec.join("")
+}
+
+/// Retrieve the `i`th half-byte from a bytes slice
+///
+/// This is also the `i`th hexadecimal digit in numeric form,
+/// also called a [nybble](https://en.wikipedia.org/wiki/Nibble).
+pub fn get_nybble(i: usize, s: &[u8]) -> u8 {
+    if i % 2 == 0 {
+        s[i / 2] >> 4
+    } else {
+        s[i / 2] & 0x0f
+    }
+}
+
+impl From<ParseIntError> for NodeError {
+    fn from(_: ParseIntError) -> Self {
+        NodeError::NotHexadecimal
+    }
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    const SAMPLE_NODE_HEX: &str = "0123456789abcdeffedcba9876543210deadbeef";
+
+    #[test]
+    fn test_node_from_hex() {
+        assert_eq!(
+            node_from_hex(SAMPLE_NODE_HEX),
+            Ok([
+                0x01, 0x23, 0x45, 0x67, 0x89, 0xab, 0xcd, 0xef, 0xfe, 0xdc,
+                0xba, 0x98, 0x76, 0x54, 0x32, 0x10, 0xde, 0xad, 0xbe, 0xef
+            ])
+        );
+        let short = "0123456789abcdeffedcba9876543210";
+        assert_eq!(
+            node_from_hex(short),
+            Err(NodeError::ExactLengthRequired(short.to_string())),
+        );
+    }
+
+    #[test]
+    fn test_node_to_hex() {
+        assert_eq!(
+            node_to_hex(&[
+                0x01, 0x23, 0x45, 0x67, 0x89, 0xab, 0xcd, 0xef, 0xfe, 0xdc,
+                0xba, 0x98, 0x76, 0x54, 0x32, 0x10, 0xde, 0xad, 0xbe, 0xef
+            ]),
+            SAMPLE_NODE_HEX
+        );
+    }
+}
diff --git a/rust/hg-core/src/revlog.rs b/rust/hg-core/src/revlog.rs
--- a/rust/hg-core/src/revlog.rs
+++ b/rust/hg-core/src/revlog.rs
@@ -5,7 +5,9 @@ 
 // GNU General Public License version 2 or any later version.
 //! Mercurial concepts for handling revision history
 
+pub mod node;
 pub mod nodemap;
+pub use node::{Node, NodeError};
 
 /// Mercurial revision numbers
 ///
diff --git a/rust/hg-core/src/matchers.rs b/rust/hg-core/src/matchers.rs
--- a/rust/hg-core/src/matchers.rs
+++ b/rust/hg-core/src/matchers.rs
@@ -81,7 +81,10 @@ 
 
 /// Matches everything.
 ///```
-/// use hg::{ matchers::{Matcher, AlwaysMatcher}, utils::hg_path::HgPath };
+/// use hg::{
+///     matchers::{AlwaysMatcher, Matcher},
+///     utils::hg_path::HgPath,
+/// };
 ///
 /// let matcher = AlwaysMatcher;
 ///
@@ -121,7 +124,10 @@ 
 /// patterns.
 ///
 ///```
-/// use hg::{ matchers::{Matcher, FileMatcher}, utils::hg_path::HgPath };
+/// use hg::{
+///     matchers::{FileMatcher, Matcher},
+///     utils::hg_path::HgPath,
+/// };
 ///
 /// let files = [HgPath::new(b"a.txt"), HgPath::new(br"re:.*\.c$")];
 /// let matcher = FileMatcher::new(&files).unwrap();
diff --git a/rust/hg-core/src/dirstate/dirs_multiset.rs b/rust/hg-core/src/dirstate/dirs_multiset.rs
--- a/rust/hg-core/src/dirstate/dirs_multiset.rs
+++ b/rust/hg-core/src/dirstate/dirs_multiset.rs
@@ -108,7 +108,7 @@ 
                 Entry::Vacant(_) => {
                     return Err(DirstateMapError::PathNotFound(
                         path.as_ref().to_owned(),
-                    ))
+                    ));
                 }
             };
         }