Patchwork D1973: bdiff: write a native version of splitnewlines

login
register
mail settings
Submitter phabricator
Date Feb. 1, 2018, 9:58 p.m.
Message ID <differential-rev-PHID-DREV-ftpldziaya7v5pnxmeff-req@phab.mercurial-scm.org>
Download mbox | patch
Permalink /patch/27155/
State Superseded
Headers show

Comments

phabricator - Feb. 1, 2018, 9:58 p.m.
durin42 created this revision.
Herald added a subscriber: mercurial-devel.
Herald added a reviewer: hg-reviewers.

REVISION SUMMARY
  ./hg perfunidiff mercurial/manifest.py 0 --count 500 --profile before:
  ! wall 0.309280 comb 0.350000 user 0.290000 sys 0.060000 (best of 32)
  
  ./hg perfunidiff mercurial/manifest.py 0 --count 500 --profile after:
  ! wall 0.241572 comb 0.260000 user 0.240000 sys 0.020000 (best of 39)
  
  so it's about 20% faster. I hate Python. I wish we could usefully
  write this in Rust, but it doesn't look like that's realistic without
  using the cpython crate, which I'd still like to avoid.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D1973

AFFECTED FILES
  mercurial/cext/bdiff.c
  mercurial/mdiff.py

CHANGE DETAILS




To: durin42, #hg-reviewers
Cc: mercurial-devel
phabricator - Feb. 1, 2018, 11:08 p.m.
indygreg added subscribers: yuja, indygreg.
indygreg requested changes to this revision.
indygreg added a comment.
This revision now requires changes to proceed.


  This looks mostly good. Needs some minor tweaks. Some of my comments are informative and can probably be ignored as far as reviewing goes.

INLINE COMMENTS

> bdiff.c:187
> +		   const char *source, Py_ssize_t len) {
> +	PyObject *sliced = PyString_FromStringAndSize(source, len);
> +	if (sliced == NULL)

Let's use `PyBytes` here for compatibility with Python 3.

Also, `Py_buffer` here would let us avoid a memory allocation. Unfortunately, various C extensions in older versions of Python 2.7 don't recognize the `Py_buffer` interface (I'm looking at you `zlib`). So we need to be careful about where all we use `Py_buffer` tricks :( It would be really nice if *all* of this code were in C so we didn't have to allocate a `PyObject` for every line: we could just pass an array of line offsets around.

> bdiff.c:190
> +		return false;
> +	PyList_SetItem(list, destidx, sliced);
> +	return true;

The `PyList` is pre-allocated. That means you can use `PyList_SET_ITEM()` for even faster execution.

> bdiff.c:201-202
> +
> +	if (!PyArg_ParseTuple(args, "s#", &text, &size))
> +		goto abort;
> +	if (!size) {

Does our style guideline not require braces yet? Seeing this reminds me of `goto fail` :(

> bdiff.c:208-212
> +	for (i = 0; i < size - 1; ++i) {
> +		if (text[i] == '\n') {
> +			++nelts;
> +		}
> +	}

For a micro optimization, I bet if you rewrite this to iterate over chunks of size `size_t` and do bit tests that this will be even faster. The searching for newlines is a hot loop in the `bdiff` code. Unfortunately, I never completed my refactor to optimize the line scanning.

> bdiff.c:216-217
> +	nelts = 0;
> +	for (i = 0; i < size - 1; ++i) {
> +		if (text[i] == '\n') {
> +			if (!sliceintolist(

I have a feeling this extra line scan will matter in a benchmark. Could you `perf record` the new `hg perf*` command and verify? If it is a big deal, I would allocate an `int[16384]` array on the stack or something to deal with the common case and store additional newline offsets on the heap so we only do the newline scan once.

> mdiff.py:43
>  
> +splitnewlines = getattr(bdiff, 'splitnewlines', splitnewlines)
> +

@yuja may have an opinion on this, but mine is that we now have stronger guarantees around C extension versioning, so if the active module policy is to use C extensions, we should ensure we get `splitnewlines` from the C module. I /think/ if we move `splitnewlines` to the `bdiff` module that things will sort themselves out? This could be done as a follow-up easily enough though.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D1973

To: durin42, #hg-reviewers, indygreg
Cc: indygreg, yuja, mercurial-devel
phabricator - Feb. 2, 2018, 1:34 p.m.
yuja added inline comments.

INLINE COMMENTS

> bdiff.c:197
> +	const char *text;
> +	int i, start = 0;
> +	Py_ssize_t nelts = 0, size;

Perhaps Py_ssize_t is preferred.

> bdiff.c:202
> +	if (!PyArg_ParseTuple(args, "s#", &text, &size))
> +		goto abort;
> +	if (!size) {

Here `result` isn't initialized to NULL yet.

> indygreg wrote in bdiff.c:216-217
> I have a feeling this extra line scan will matter in a benchmark. Could you `perf record` the new `hg perf*` command and verify? If it is a big deal, I would allocate an `int[16384]` array on the stack or something to deal with the common case and store additional newline offsets on the heap so we only do the newline scan once.

I don't know which is faster, but if we do preallocate a buffer,  we can create
a large `PyList` and shrink it by `PyList_SetSlice(..., NULL)` at the end.

`builtin_filter()` of Python 2 do that.

> indygreg wrote in mdiff.py:43
> @yuja may have an opinion on this, but mine is that we now have stronger guarantees around C extension versioning, so if the active module policy is to use C extensions, we should ensure we get `splitnewlines` from the C module. I /think/ if we move `splitnewlines` to the `bdiff` module that things will sort themselves out? This could be done as a follow-up easily enough though.

> if we move splitnewlines to the bdiff module that things will sort themselves out

Yes, splitnewlines should be moved to pure/bdiff.py. And we need to bump the C API version.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D1973

To: durin42, #hg-reviewers, indygreg
Cc: indygreg, yuja, mercurial-devel
phabricator - Feb. 2, 2018, 4:26 p.m.
durin42 added inline comments.

INLINE COMMENTS

> indygreg wrote in bdiff.c:216-217
> I have a feeling this extra line scan will matter in a benchmark. Could you `perf record` the new `hg perf*` command and verify? If it is a big deal, I would allocate an `int[16384]` array on the stack or something to deal with the common case and store additional newline offsets on the heap so we only do the newline scan once.

It's already enough of a win I'd like to land the simple version and then focus on spending effort on a Rust version rather than continuing with this. How do you feel about that as a plan?

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D1973

To: durin42, #hg-reviewers, indygreg
Cc: indygreg, yuja, mercurial-devel
phabricator - Feb. 3, 2018, 8:17 a.m.
yuja requested changes to this revision.
yuja added a comment.
This revision now requires changes to proceed.


  Looks mostly good to me. I hesitated to fix minor nits in flight because
  it's C.

INLINE COMMENTS

> bdiff.c:185
>  
> +bool sliceintolist(PyObject *list, Py_ssize_t destidx,
> +		   const char *source, Py_ssize_t len) {

Nit: `static bool sliceintolist(`

> bdiff.c:224
> +	}
> +	if (start < size) {
> +		if (!sliceintolist(result, nelts++, text+start, size-start))

`start < size` should be always true because `size > 0`. Otherwise,
`PyList_New(nelts + 1)` could be wrong.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D1973

To: durin42, #hg-reviewers, indygreg, yuja
Cc: indygreg, yuja, mercurial-devel
phabricator - Feb. 12, 2018, 3:24 p.m.
durin42 marked an inline comment as done.
durin42 added a comment.


  Good catches. This should be ready now.
  
  I also added bdiff.c to clang-format oversight in a newly inserted parent, because the file is simple enough that doing so was easy.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D1973

To: durin42, #hg-reviewers, indygreg, yuja
Cc: indygreg, yuja, mercurial-devel
phabricator - Feb. 12, 2018, 7:11 p.m.
indygreg accepted this revision.
indygreg added a comment.


  I'm happy with this as a first revision.
  
  While I'm accepting as hg-reviewers, I think C code should have an extra set of eyes. So I'll defer to @yuja to queue it.
  
  For the record, I'm no fan of not having braces for all bodies of conditionals. Can't wait to globally reformat our code to fix that.

INLINE COMMENTS

> yuja wrote in bdiff.c:185
> Nit: `static bool sliceintolist(`

This doesn't need to be static. I'd declare it as `inline bool sliceintolist(`.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D1973

To: durin42, #hg-reviewers, indygreg, yuja
Cc: indygreg, yuja, mercurial-devel
phabricator - Feb. 13, 2018, 12:10 p.m.
yuja accepted this revision.
yuja added a comment.
This revision is now accepted and ready to land.


  Queued, thanks.

INLINE COMMENTS

> indygreg wrote in bdiff.c:185
> This doesn't need to be static. I'd declare it as `inline bool sliceintolist(`.

IIRC, `inline` doesn't imply a function has a file scope, and `inline`
could be `` on unsupported platforms.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D1973

To: durin42, #hg-reviewers, indygreg, yuja
Cc: indygreg, yuja, mercurial-devel

Patch

diff --git a/mercurial/mdiff.py b/mercurial/mdiff.py
--- a/mercurial/mdiff.py
+++ b/mercurial/mdiff.py
@@ -40,6 +40,8 @@ 
             lines[-1] = lines[-1][:-1]
     return lines
 
+splitnewlines = getattr(bdiff, 'splitnewlines', splitnewlines)
+
 class diffopts(object):
     '''context is the number of context lines
     text treats all files as text
diff --git a/mercurial/cext/bdiff.c b/mercurial/cext/bdiff.c
--- a/mercurial/cext/bdiff.c
+++ b/mercurial/cext/bdiff.c
@@ -182,13 +182,64 @@ 
 	return result ? result : PyErr_NoMemory();
 }
 
+bool sliceintolist(PyObject *list, Py_ssize_t destidx,
+		   const char *source, Py_ssize_t len) {
+	PyObject *sliced = PyString_FromStringAndSize(source, len);
+	if (sliced == NULL)
+		return false;
+	PyList_SetItem(list, destidx, sliced);
+	return true;
+}
+
+static PyObject *splitnewlines(PyObject *self, PyObject *args)
+{
+	const char *text;
+	int i, start = 0;
+	Py_ssize_t nelts = 0, size;
+	PyObject *result;
+
+	if (!PyArg_ParseTuple(args, "s#", &text, &size))
+		goto abort;
+	if (!size) {
+		return PyList_New(0);
+	}
+	/* This loops to size-1 because if the last byte is a newline,
+	 * we don't want to perform a split there. */
+	for (i = 0; i < size - 1; ++i) {
+		if (text[i] == '\n') {
+			++nelts;
+		}
+	}
+	if ((result = PyList_New(nelts+1)) == NULL)
+		goto abort;
+	nelts = 0;
+	for (i = 0; i < size - 1; ++i) {
+		if (text[i] == '\n') {
+			if (!sliceintolist(
+				    result, nelts++, text+start, i-start+1))
+				goto abort;
+			start = i+1;
+		}
+	}
+	if (start < size) {
+		if (!sliceintolist(result, nelts++, text+start, size-start))
+			goto abort;
+	}
+	return result;
+abort:
+	Py_XDECREF(result);
+	return NULL;
+}
+
 
 static char mdiff_doc[] = "Efficient binary diff.";
 
 static PyMethodDef methods[] = {
 	{"bdiff", bdiff, METH_VARARGS, "calculate a binary diff\n"},
 	{"blocks", blocks, METH_VARARGS, "find a list of matching lines\n"},
 	{"fixws", fixws, METH_VARARGS, "normalize diff whitespaces\n"},
+	{"splitnewlines", splitnewlines, METH_VARARGS,
+	 "like str.splitlines, but only split on newlines\n"},
 	{NULL, NULL}
 };