Patchwork [06,of,11] internals: document compression negotiation

Submitter Gregory Szorc
Date Nov. 20, 2016, 10:23 p.m.
Message ID <952478a50f2583be4400.1479680623@ubuntu-vm-main>
Permalink /patch/17657/
State Accepted

Comments

Gregory Szorc - Nov. 20, 2016, 10:23 p.m.
# HG changeset patch
# User Gregory Szorc <gregory.szorc@gmail.com>
# Date 1479679271 28800
#      Sun Nov 20 14:01:11 2016 -0800
# Node ID 952478a50f2583be4400c0f6fcc156d73d46711c
# Parent  8d1b65503e8b360dd5121488f31d52a3587a0819
internals: document compression negotiation

As part of adding zstd support to all of the things, we'll need
to teach the wire protocol to support non-zlib compression formats.

This commit documents how we'll implement that.

To understand how we arrived at this proposal, let's look at how
things are done today.

The wire protocol today doesn't have a unified format. Instead,
there is a limited facility for differentiating replies as successful
or not. And, each command essentially defines its own response format.

A significant deficiency in the current protocol is the lack of
payload framing over the SSH transport. In the HTTP transport,
chunked transfer is used and the end of an HTTP response body (and
the end of a Mercurial command response) can be identified by a 0
length chunk. This is how HTTP chunked transfer works. But in the
SSH transport, there is no such framing, at least for certain
responses (notably the response to "getbundle" requests). Clients
can't simply read until end of stream because the socket is
persistent and reused for multiple requests. Clients need to know
when they've encountered the end of a request but there is nothing
simple for them to key off of to detect this. So what happens is
the client must decode the payload (as opposed to being dumb and
forwarding frames/packets). This means the payload itself needs
to support identifying end of stream. In some cases (bundle2), it
also means the payload can encode "error" or "interrupt" events
telling the client to e.g. abort processing. The lack of framing
on the SSH transport and the transfer of its responsibilities to
e.g. bundle2 is a massive layering violation and a wart on the
protocol architecture. It needs to be fixed someday by inventing a
proper framing protocol.
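
For readers unfamiliar with chunked transfer, here is a rough sketch (not
part of this patch) of how a reader finds the end of a response by looking
for the 0 length chunk. The helper name read_chunked and the sample bytes
are made up for illustration; real clients rely on their HTTP library.

```python
import io


def read_chunked(stream):
    """Yield chunk payloads from a chunked body until the 0 length chunk."""
    while True:
        size_line = stream.readline()
        if not size_line:
            raise EOFError("stream ended before the terminating chunk")
        # The chunk size is hexadecimal; anything after ';' is an extension.
        size = int(size_line.split(b";", 1)[0].strip(), 16)
        if size == 0:
            # Skip optional trailer fields up to the final blank line.
            while stream.readline() not in (b"\r\n", b""):
                pass
            return
        data = stream.read(size)
        if len(data) != size:
            raise EOFError("truncated chunk")
        stream.read(2)  # CRLF that terminates the chunk data
        yield data


# Two responses back to back on one persistent connection.
conn = io.BytesIO(b"5\r\nhello\r\n0\r\n\r\n3\r\nabc\r\n0\r\n\r\n")
first = b"".join(read_chunked(conn))
second = b"".join(read_chunked(conn))
```

The reader stops at each response boundary without understanding the
payload. The SSH transport has no equivalent marker, which is why the
client has to decode the payload to find the end.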

So about compression.

The client transport abstractions have a "_callcompressable()"
API. This API is called to invoke a remote command that will
send a compressable response. The response is essentially a
"streaming" response (no framing data at the Mercurial layer)
that is fed into a decompressor.

On the HTTP transport, the decompressor is zlib and only zlib.
There is currently no mechanism for the client to specify an
alternate compression format. And, clients don't advertise what
compression formats they support or ask the server to send a
specific compression format. Instead, it is assumed that non-error
responses to "compressable" commands are zlib compressed.

On the SSH transport, there is no compression at the Mercurial
protocol layer. Instead, compression must be handled by SSH
itself (e.g. `ssh -C`) or within the payload data (e.g. bundle
compression).

For the HTTP transport, adding new compression formats is pretty
straightforward. Once you know what decompressor to use, you can
stream data into the decompressor until you reach a 0 size HTTP
chunk, at which point you are at end of stream.
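
As a minimal sketch of that streaming model (an illustration only, using
zlib from the Python standard library; the function name and the 16 byte
chunk size are arbitrary), the receiver can feed chunks into a decompressor
as they arrive and flush once the body ends:

```python
import zlib


def decompress_stream(chunks):
    """Feed body chunks into a zlib decompressor as they arrive."""
    decompressor = zlib.decompressobj()
    for chunk in chunks:
        out = decompressor.decompress(chunk)
        if out:
            yield out
    # The iterator ending stands in for the 0 size HTTP chunk.
    tail = decompressor.flush()
    if tail:
        yield tail


payload = zlib.compress(b"changegroup data " * 100)
# Pretend the transport delivered the body in 16 byte chunks.
chunks = [payload[i:i + 16] for i in range(0, len(payload), 16)]
restored = b"".join(decompress_stream(chunks))
```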

So our wire protocol changes for the HTTP transport are pretty
straightforward: the client and server advertise what compression
formats they support and an appropriate compression format is
chosen. We introduce a new HTTP media type to hold compressed
payloads. The first 2 bytes of the payload define the compression
format being used. Whoever is on the receiving end can sniff the
first 2 bytes and handle the remaining data accordingly.
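
A rough sketch of what the receiving side could look like follows. It is
not the implementation in this series. The identifiers "ZL" and "UN" appear
in the documentation added by the patch, but their mapping to zlib and to
uncompressed data is an assumption here, and zstd is left out because it is
not in the Python standard library.

```python
import zlib

# Assumed mapping from 2 byte identifier to a decoder function.
DECODERS = {
    b"UN": lambda data: data,  # assumed: payload sent as-is
    b"ZL": zlib.decompress,    # assumed: zlib compressed payload
}


def decode_payload(body):
    """Sniff the first 2 bytes and decode the rest accordingly."""
    ident, rest = body[:2], body[2:]
    if len(ident) != 2:
        raise ValueError("payload too short to hold a format identifier")
    try:
        decoder = DECODERS[ident]
    except KeyError:
        raise ValueError("unsupported compression format: %r" % ident)
    return decoder(rest)


plain = decode_payload(b"UN" + b"raw data")
inflated = decode_payload(b"ZL" + zlib.compress(b"raw data"))
```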

Support for multiple compression formats is advertised on both
server and client. The server advertises a "compression" capability
saying which compression formats it supports and in what order they
are preferred. Clients advertise their support for multiple
compression formats via the HTTP "Accept" header.

Strictly speaking, servers don't need to advertise which compression
formats they support. But doing so allows clients to fail fast if
they don't support any of the formats the server does. This is useful
in situations like sending bundles, where the client may have to
perform expensive computation before sending data to the server.
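
To illustrate the fail fast idea, here is a small sketch. The function name
and the sample values are hypothetical; the only thing taken from the patch
is that the capability value is a comma-delimited list in server-preferred
order.

```python
def pick_format(server_capability, client_formats):
    """Return the first server-preferred format the client supports."""
    for fmt in server_capability.split(","):
        if fmt in client_formats:
            return fmt
    return None  # no overlap: abort before doing any expensive work


chosen = pick_format("ZS,ZL,UN", {"ZL", "UN"})
no_overlap = pick_format("ZS", {"ZL", "UN"})
```

A client that gets None back can stop before generating a bundle.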

By advertising compression support on each request in the "Accept"
header and by introducing a new media type, the server is able
to gradually transition existing commands/responses to use compression,
even if they don't do so today. Contrast with the old world, where
"application/mercurial-0.1" may or may not use zlib compression
depending on the command being called. Compression is defined as
part of "application/mercurial-0.2," so if a client supports this
media type it supports compression.
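
As a sketch of how a server might read such an "Accept" header: the
function below is hypothetical and deliberately simplistic (it does not
implement full HTTP header parsing). It assumes the ``comp`` parameter
syntax documented in the patch, e.g.
``application/mercurial-0.2; comp="ZS ZL UN"``, and the documented default
of ``ZL UN`` when the parameter is absent.

```python
import re

_COMP_RE = re.compile(r'comp="([^"]*)"')


def client_formats(accept):
    """Return (media type, compression formats) inferred from Accept."""
    for item in accept.split(","):
        media, _, params = item.strip().partition(";")
        if media.strip() == "application/mercurial-0.2":
            match = _COMP_RE.search(params)
            # An absent comp parameter is treated as "ZL UN".
            comp = match.group(1) if match else "ZL UN"
            return "application/mercurial-0.2", comp.split()
    # No 0.2 value: fall back to the legacy media type, where the
    # command being called decides whether zlib is used.
    return "application/mercurial-0.1", []


negotiated = client_formats(
    'application/mercurial-0.1, application/mercurial-0.2; comp="ZS ZL UN"'
)
```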

It's worth noting that we explicitly don't use "Accept-Encoding,"
"Content-Encoding," or "Transfer-Encoding" for handling compression.
People knowledgeable of the HTTP specifications will say that we
should use these because compression is a media or transfer encoding,
not a media type and dynamic compression is exactly what these
headers should be used for. They have a point and I sympathize with
the argument. However, my years of experience rolling out services
leveraging HTTP have taught me not to trust the HTTP layer, especially
if you are going outside the normal spec (such as using a custom
"Content-Encoding" value to represent zstd streams). I've seen load
balancers, proxies, and other network devices do very bad and
unexpected things to HTTP messages (like insisting zlib compressed
content is decoded and then re-encoded at a different compression level
or even stripping compression completely). I've found that the best
way to avoid surprises when writing protocols on top of HTTP is to use
HTTP as a dumb transport as much as possible to minimize the chances
that an "intelligent" agent between endpoints will muck with your data.
While the widespread use of TLS is mitigating many intermediate
network agents interfering with HTTP, there are still problems at the
edges, with e.g. the origin HTTP server needing to convert HTTP to and
from WSGI and buggy or feature-lacking HTTP client implementations.
I've found the best way to avoid these problems is to avoid using
headers like "Content-Encoding" and to bake as much logic as possible
into media types and HTTP message bodies. The protocol changes in this
commit do rely on the "Accept" and "Content-Type" headers. But we
used them before, so we shouldn't be increasing our exposure to "bad"
HTTP agents.

What about SSH.

For the SSH transport, we can't easily implement content negotiation
to determine compression formats because the SSH transport has no
content negotiation capabilities today. And without a framing protocol,
we don't know how much data to feed into a decompressor. So in order
to implement compression support on the SSH transport, we'd need to
invent a mechanism to represent content types and an outer framing
protocol to stream data robustly. While I'm fully capable of doing
that, it is a lot of work and not something that should be undertaken
lightly. My opinion is that if we're going to change the SSH transport
protocol, we should take a long hard look at implementing a grand
unified protocol that attempts to address all the deficiencies with
the existing protocol. While I want this to happen, that would be
massive scope bloat standing in the way of zstd support. So, I've
decided to take the easy solution: the SSH transport will not gain
support for multiple compression formats. Keep in mind it doesn't
support *any* compression today. So essentially nothing is changing
on the SSH front.

Augie Fackler - Nov. 21, 2016, 10:46 p.m.
On Sun, Nov 20, 2016 at 02:23:43PM -0800, Gregory Szorc wrote:
> # HG changeset patch
> # User Gregory Szorc <gregory.szorc@gmail.com>
> # Date 1479679271 28800
> #      Sun Nov 20 14:01:11 2016 -0800
> # Node ID 952478a50f2583be4400c0f6fcc156d73d46711c
> # Parent  8d1b65503e8b360dd5121488f31d52a3587a0819
> internals: document compression negotiation

Lots of comments here. I've taken patches 1-5, and will review the
rest of the series before stopping.

>
> As part of adding zstd support to all of the things, we'll need
> to teach the wire protocol to support non-zlib compression formats.
>
> This commit documents how we'll implement that.
>
> To understand how we arrived at this proposal, let's look at how
> things are done today.
>
> The wire protocol today doesn't have a unified format. Instead,
> there is a limited facility for differentiating replies as successful
> or not. And, each command essentially defines its own response format.
>
> A significant deficiency in the current protocol is the lack of
> payload framing over the SSH transport. In the HTTP transport,
> chunked transfer is used and the end of an HTTP response body (and
> the end of a Mercurial command response) can be identified by a 0
> length chunk. This is how HTTP chunked transfer works. But in the
> SSH transport, there is no such framing, at least for certain
> responses (notably the response to "getbundle" requests). Clients
> can't simply read until end of stream because the socket is
> persistent and reused for multiple requests. Clients need to know
> when they've encountered the end of a request but there is nothing
> simple for them to key off of to detect this. So what happens is
> the client must decode the payload (as opposed to being dumb and
> forwarding frames/packets). This means the payload itself needs
> to support identifying end of stream. In some cases (bundle2), it
> also means the payload can encode "error" or "interrupt" events
> telling the client to e.g. abort processing. The lack of framing
> on the SSH transport and the transfer of its responsibilities to
> e.g. bundle2 is a massive layering violation and a wart on the
> protocol architecture. It needs to be fixed someday by inventing a
> proper framing protocol.

Love this paragraph. It's a huge loss that the framing delegation
happened, because it means the existing batch() method isn't streaming
over ssh (but it is over http).

>
> So about compression.
>
> The client transport abstractions have a "_callcompressable()"
> API. This API is called to invoke a remote command that will
> send a compressable response. The response is essentially a
> "streaming" response (no framing data at the Mercurial layer)
> that is fed into a decompressor.
>
> On the HTTP transport, the decompressor is zlib and only zlib.
> There is currently no mechanism for the client to specify an
> alternate compression format. And, clients don't advertise what
> compression formats they support or ask the server to send a
> specific compression format. Instead, it is assumed that non-error
> responses to "compressable" commands are zlib compressed.
>
> On the SSH transport, there is no compression at the Mercurial
> protocol layer. Instead, compression must be handled by SSH
> itself (e.g. `ssh -C`) or within the payload data (e.g. bundle
> compression).
>
> For the HTTP transport, adding new compression formats is pretty
> straightforward. Once you know what decompressor to use, you can
> stream data into the decompressor until you reach a 0 size HTTP
> chunk, at which point you are at end of stream.
>
> So our wire protocol changes for the HTTP transport are pretty
> straightforward: the client and server advertise what compression
> formats they support and an appropriate compression format is
> chosen. We introduce a new HTTP media type to hold compressed
> payloads. The first 2 bytes of the payload define the compression
> format being used. Whoever is on the receiving end can sniff the
> first 2 bytes and handle the remaining data accordingly.
>
> Support for multiple compression formats is advertised on both
> server and client. The server advertises a "compression" capability
> saying which compression formats it supports and in what order they
> are preferred. Clients advertise their support for multiple
> compression formats via the HTTP "Accept" header.
>
> Strictly speaking, servers don't need to advertise which compression
> formats they support. But doing so allows clients to fail fast if
> they don't support any of the formats the server does. This is useful
> in situations like sending bundles, where the client may have to
> perform expensive computation before sending data to the server.
>
> By advertising compression support on each request in the "Accept"
> header and by introducing a new media type, the server is able
> to gradually transition existing commands/responses to use compression,
> even if they don't do so today. Contrast with the old world, where
> "application/mercurial-0.1" may or may not use zlib compression
> depending on the command being called. Compression is defined as
> part of "application/mercurial-0.2," so if a client supports this
> media type it supports compression.
>
> It's worth noting that we explicitly don't use "Accept-Encoding,"
> "Content-Encoding," or "Transfer-Encoding" for handling compression.
> People knowledgeable of the HTTP specifications will say that we
> should use these because compression is a media or transfer encoding,
> not a media type and dynamic compression is exactly what these
> headers should be used for. They have a point and I sympathize with
> the argument. However, my years of experience rolling out services
> leveraging HTTP has taught me to not trust the HTTP layer, especially
> if you are going outside the normal spec (such as using a custom
> "Content-Encoding" value to represent zstd streams). I've seen load
> balancers, proxies, and other network devices do very bad and
> unexpected things to HTTP messages (like insisting zlib compressed
> content is decoded and then re-encoded at a different compression level
> or even stripping compression completely). I've found that the best
> way to avoid surprises when writing protocols on top of HTTP is to use
> HTTP as a dumb transport as much as possible to minimize the chances
> that an "intelligent" agent between endpoints will muck with your data.

Totally agreed on this front, in case others have qualms. Experience
with git's smart-http protocol in the wild has motivated me to never
trust invisible intermediate proxies to do reasonable things. I've got
a variety of war stories debugging supposed bugs in the old google
code stack only to discover that it was an http proxy problem between
us and the client.

> While the widespread use of TLS is mitigating many intermediate
> network agents interfering with HTTP, there are still problems at the
> edges, with e.g. the origin HTTP server needing to convert HTTP to and
> from WSGI and buggy or feature-lacking HTTP client implementations.
> I've found the best way to avoid these problems is to avoid using
> headers like "Content-Encoding" and to bake as much logic as possible
> into media types and HTTP message bodies. The protocol changes in this
> commit do rely on the "Accept" and "Content-Type" headers. But we
> used them before, so we shouldn't be increasing our exposure to "bad"
> HTTP agents.
>
> What about SSH.

s/\./?/

>
> For the SSH transport, we can't easily implement content negotiation
> to determine compression formats because the SSH transport has no
> content negotiation capabilities today. And without a framing protocol,
> we don't know how much data to feed into a decompressor. So in order
> to implement compression support on the SSH transport, we'd need to
> invent a mechanism to represent content types and an outer framing
> protocol to stream data robustly. While I'm fully capable of doing
> that, it is a lot of work and not something that should be undertaken
> lightly.

> My opinion is that if we're going to change the SSH transport
> protocol, we should take a long hard look at implementing a grand
> unified protocol that attempts to address all the deficiencies with
> the existing protocol.

Yes, totally agreed.

> While I want this to happen, that would be
> massive scope bloat standing in the way of zstd support. So, I've
> decided to take the easy solution: the SSH transport will not gain
> support for multiple compression formats. Keep in mind it doesn't
> support *any* compression today. So essentially nothing is changing
> on the SSH front.

This sounds like a reasonable approach here. I'd like to get a clean
re-do on the wire protocol some day, but I suspect it'll be many moons
before someone is willing to pay for that work (and it seems unlikely
I'll get around to it for the laughs).

I wonder if it'd be reasonable to have an sshv2 protocol just *be*
http-over-ssh via stdin/stdout? I feel a touch dirty even suggesting
it, but the framing rules etc are already there...

>
> diff --git a/mercurial/help/internals/wireprotocol.txt b/mercurial/help/internals/wireprotocol.txt
> --- a/mercurial/help/internals/wireprotocol.txt
> +++ b/mercurial/help/internals/wireprotocol.txt
> @@ -68,8 +68,16 @@ Example HTTP requests::
>  The ``Content-Type`` HTTP response header identifies the response as coming
>  from Mercurial and can also be used to signal an error has occurred.
>
> -The ``application/mercurial-0.1`` media type indicates a generic Mercurial
> -response. It matches the media type sent by the client.
> +The ``application/mercurial-*`` media types indicate a generic Mercurial
> +data type.
> +
> +The ``application/mercurial-0.1`` media type is raw Mercurial data.

Perhaps the word legacy wants to be in this statement.

> +
> +The ``application/mercurial-0.2`` media type is compression framed Mercurial
> +data. The first 2 bytes of the payload indicate the compression format
> +used. The remaining bytes are compressed according to that compression
> +format. The decompressed data behaves the same as with
> +``application/mercurial-0.1``.
>
>  The ``application/hg-error`` media type indicates a generic error occurred.
>  The content of the HTTP response body typically holds text describing the
> @@ -81,15 +89,19 @@ type.
>  Clients also accept the ``text/plain`` media type. All other media
>  types should cause the client to error.
>
> +Behavior of media types is further described in the ``Content Negotiation``
> +section below.
> +
>  Clients should issue a ``User-Agent`` request header that identifies the client.
>  The server should not use the ``User-Agent`` for feature detection.
>
> -A command returning a ``string`` response issues the
> -``application/mercurial-0.1`` media type and the HTTP response body contains
> -the raw string value. A ``Content-Length`` header is typically issued.
> +A command returning a ``string`` response issues a
> +``application/mercurial-0.*`` media type and the HTTP response body contains
> +the raw string value (after compression decoding, if used). A
> +``Content-Length`` header is typically issued, but not required.
>
> -A command returning a ``stream`` response issues the
> -``application/mercurial-0.1`` media type and the HTTP response is typically
> +A command returning a ``stream`` response issues a
> +``application/mercurial-0.*`` media type and the HTTP response is typically
>  using *chunked transfer* (``Transfer-Encoding: chunked``).
>
>  SSH Transport
> @@ -233,6 +245,29 @@ 2006).
>  This capability was introduced at the same time as the ``lookup``
>  capability/command.
>
> +compression
> +-----------
> +
> +Declares support for negotiating compression formats.
> +
> +Presence of this capability indicates the server supports dynamic selection
> +of compression formats based on the client request.
> +
> +Servers advertising this capability are required to support the
> +``application/mercurial-0.2`` media type in response to commands returning
> +streams. Servers may support this media type on any command.
> +
> +The value of the capability is a comma-delimited list of strings declaring
> +supported compression formats. The order of the compression formats is in
> +server-preferred order, most preferred first.
> +
> +The compression format strings are 2 byte identifiers. These are the same
> +2 byte *header* values at the beginning of ``application/mercurial-0.2``
> +media types (as used by the HTTP transport).
> +
> +This capability was introduced in Mercurial 4.1 (released February
> +2017).

Mention that as of that release it was not yet used over the ssh
transport? Or state that it was only used over http?

> +
>  getbundle
>  ---------
>
> @@ -416,6 +451,46 @@ Mercurial server replies to the client-i
>  not conforming to the expected command responses is assumed to be not related
>  to Mercurial and can be ignored.
>
> +Content Negotiation
> +===================
> +
> +The wire protocol has some mechanisms to help peers determine what content
> +types and encoding the other side will accept. Historically, these mechanisms
> +have been built into commands themselves because most commands only send a
> +well-defined response type and only certain commands needed to support
> +functionality like compression.
> +
> +Currently, only the HTTP transport supports content negotiation at the protocol
> +layer.
> +
> +HTTP requests advertise accepted media types via the ``Accept`` header.
> +
> +All clients should advertise an ``application/mercurial-0.1`` value.
> +
> +Clients supporting it can also advertise ``application/mercurial-0.2``.
> +This media type supports the ``comp`` parameter to declare which compression
> +formats the client accepts. The value is a ``quoted-string`` (defined by
> +HTTP specification) containing a space-delimited list of 2 byte compression
> +format identifiers. e.g. ``application/mercurial-0.2; comp="ZS ZL UN"``.
> +If the ``comp`` parameter is absent, the server interprets this as equivalent
> +to ``ZL UN``.
> +
> +Clients may choose to only advertise the ``application/mercurial-0.2`` media
> +type if the server advertises the ``compression`` capability.
> +
> +A server that doesn't receive an ``Accept`` header listing any
> +``application/mercurial-*`` values should infer that
> +``application/mercurial-0.1`` was sent, as this media type should be supported
> +by all clients ever written.

I'd like to be more cautious in the wording here, and give servers
room to reject old clients for not understanding
application/mercurial-0.2. It's not hard for me to envision a future
where someone writes a modern-proto-only client in
Java/Go/Rust/Piet/whatever, and it'd be nice to have a defined way for
the client and server to identify each other in that case. I also want
to be able to run servers that intentionally lock out old clients
(potentially for scaling reasons, since we're talking about
compression performance).

> +
> +A server receiving multiple ``application/mercurial-*`` values may choose any
> +of them. For example, a server may issue ``application/mercurial-0.2`` only
> +for responses that it chooses to compress.
> +
> +A server may issue ``application/hg-*`` media types even though the client
> +does not specify support for them in an ``Accept`` header. This is for
> +backwards compatibility reasons.

Can we provide guidance here that new servers should not issue these
forms? Other than hg-error, we don't send them today, so I'd like to
put a stop to any growth of that now (especially
application/hg-changegroup, which looks super crufty and ancient).

> +
>  Commands
>  ========
>

Kyle Lippincott - Nov. 22, 2016, midnight
On Sun, Nov 20, 2016 at 2:23 PM, Gregory Szorc <gregory.szorc@gmail.com>
wrote:

> # HG changeset patch
> # User Gregory Szorc <gregory.szorc@gmail.com>
> # Date 1479679271 28800
> #      Sun Nov 20 14:01:11 2016 -0800
> # Node ID 952478a50f2583be4400c0f6fcc156d73d46711c
> # Parent  8d1b65503e8b360dd5121488f31d52a3587a0819
> internals: document compression negotiation
>
> diff --git a/mercurial/help/internals/wireprotocol.txt b/mercurial/help/internals/wireprotocol.txt
> --- a/mercurial/help/internals/wireprotocol.txt
> +++ b/mercurial/help/internals/wireprotocol.txt
> @@ -68,8 +68,16 @@ Example HTTP requests::
>  The ``Content-Type`` HTTP response header identifies the response as coming
>  from Mercurial and can also be used to signal an error has occurred.
>
> -The ``application/mercurial-0.1`` media type indicates a generic Mercurial
> -response. It matches the media type sent by the client.
> +The ``application/mercurial-*`` media types indicate a generic Mercurial
> +data type.
> +
> +The ``application/mercurial-0.1`` media type is raw Mercurial data.
> +
> +The ``application/mercurial-0.2`` media type is compression framed Mercurial
> +data. The first 2 bytes of the payload indicate the compression format
> +used. The remaining bytes are compressed according to that compression
> +format. The decompressed data behaves the same as with
> +``application/mercurial-0.1``.
>

The 2 character limitation concerns me, because it doesn't give many usable
values (considering that a lot of compression has a Z in it, this is
perhaps fewer than you might expect) or mechanisms to describe variations.

Examples:
Would lz4 be encoded as z4, l4, or lz?  (lz seems bad, since lzma, lzf,
lzo, quicklz)
Would lz4-without-framing (currently in use by remotefilelog) be
represented differently than lz4-with-framing?  How so?

This wasn't a large problem with a single list of compressors (typically
when interacting with bundles, such as HG10UN or whatever), but with
pluggable compressors this becomes a bigger problem :)

Regardless of how this ends up being implemented
(X-MercurialCompressionFormat: zstd, or using a size byte ("4zstd<data>"),
or using a delimiter ("zstd\0<data>"), etc.) we should document case
sensitivity here.
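
To make the trade-off between these framing options concrete, here is a rough Python sketch of how a receiver might split the identifier from the data under each scheme. The function names are hypothetical, and the "size byte" variant assumes the length is a single ASCII digit (as the "4zstd" example suggests); none of this is taken from the patch series.

```python
def split_fixed(payload: bytes, width: int = 2):
    """Fixed-width identifier, as proposed in the patch: b"ZS<data>"."""
    if len(payload) < width:
        raise ValueError("payload shorter than the identifier")
    return payload[:width], payload[width:]


def split_size_byte(payload: bytes):
    """Length-prefixed identifier: b"4zstd<data>".

    Assumption: the length is one ASCII digit, so names are 1-9 bytes long.
    """
    prefix = payload[:1]
    if not prefix.isdigit() or prefix == b"0":
        raise ValueError("missing or invalid size byte")
    n = int(prefix)
    if len(payload) < 1 + n:
        raise ValueError("payload shorter than the declared identifier")
    return payload[1:1 + n], payload[1 + n:]


def split_delimited(payload: bytes):
    """NUL-delimited identifier: b"zstd\\0<data>"."""
    name, sep, rest = payload.partition(b"\0")
    if not sep:
        raise ValueError("missing NUL delimiter")
    return name, rest


# All three return the identifier bytes unchanged, so any comparison against
# a table of known formats is byte-exact (case-sensitive) unless the spec
# says otherwise.
print(split_fixed(b"ZSabc"))
print(split_size_byte(b"4zstdabc"))
print(split_delimited(b"zstd\0abc"))
```

The fixed-width form is the cheapest to parse, but only the other two leave room for longer names such as "lz4-noframe".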

Is 'batch' handled in any special fashion by this?  Does the batch response
need to be either entirely compressed or entirely uncompressed, or are we
anticipating individual commands making the decision, and batch being an
intermediate (non-compressed) framing format around the commands?  It might
be nice if I could do something like batch cmds=getfile foo;heads;getfile
bar  and return "lz4\0<foo_data>;un\0<heads_data>;zstd\0<bar_data>"
depending on which compressed files I happened to already have available.
I know augie had been possibly re-thinking batch, we may want to have that
discussion now :)

If we're going to be putting data at the beginning of these blocks (and if
we're doing batch, I think we might need to), can we make it extensible?
I'm at risk of sounding like a broken record, but any time we find a reason
to put *some* metadata somewhere, we essentially always end up finding a
reason to put some *other* metadata somewhere and then have to design for
that.  Specifically, it might be better if instead of just assuming
'<compressiontype>\0' prefixes the response, we have something like we have
for bundlecaps: some format of specifying key/value pairs.  I can think of
three that I want, already: 1) compressiontype, 2) batchnum [to handle
out-of-order batch responses], 3) batchcode [if an individual batch message
encounters an error, there needs to be some way of indicating the error
status vs. the success status.  I think it's remotefilelog that currently
does this inline, via <int>\0<data_or_error_str>, formalizing that and
making it available to other things would be nice :)]
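
As a strawman for the key/value idea, something like the following could work. The wire syntax (``key=value`` pairs joined by ``;`` and terminated by a NUL) and the function names are purely hypothetical, chosen only to illustrate the shape of an extensible header; they are not bundlecaps syntax and not part of the patch.

```python
def encode_block(meta: dict, data: bytes) -> bytes:
    """Prefix data with a hypothetical 'k=v;k=v' header terminated by NUL."""
    for key, value in meta.items():
        if any(c in key + value for c in "=;\0"):
            raise ValueError("reserved character in metadata")
    header = ";".join(f"{k}={v}" for k, v in meta.items())
    return header.encode("ascii") + b"\0" + data


def decode_block(block: bytes):
    """Split a block into (metadata dict, remaining data)."""
    header, sep, data = block.partition(b"\0")
    if not sep:
        raise ValueError("missing NUL terminator after header")
    meta = {}
    if header:
        for pair in header.decode("ascii").split(";"):
            key, eq, value = pair.partition("=")
            if not eq:
                raise ValueError("malformed key/value pair: %r" % pair)
            meta[key] = value
    return meta, data


block = encode_block(
    {"compressiontype": "zstd", "batchnum": "2", "batchcode": "0"},
    b"<data>",
)
print(block)
print(decode_block(block))
```

Unknown keys simply end up in the dict, so a receiver can ignore metadata it does not understand, which is the extensibility property I am after.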


>  The ``application/hg-error`` media type indicates a generic error occurred.
>  The content of the HTTP response body typically holds text describing the
> @@ -81,15 +89,19 @@ type.
>  Clients also accept the ``text/plain`` media type. All other media
>  types should cause the client to error.
>
> +Behavior of media types is further described in the ``Content Negotiation``
> +section below.
> +
>  Clients should issue a ``User-Agent`` request header that identifies the client.
>  The server should not use the ``User-Agent`` for feature detection.
>
> -A command returning a ``string`` response issues the
> -``application/mercurial-0.1`` media type and the HTTP response body contains
> -the raw string value. A ``Content-Length`` header is typically issued.
> +A command returning a ``string`` response issues a
> +``application/mercurial-0.*`` media type and the HTTP response body contains
> +the raw string value (after compression decoding, if used). A
> +``Content-Length`` header is typically issued, but not required.
>
> -A command returning a ``stream`` response issues the
> -``application/mercurial-0.1`` media type and the HTTP response is typically
> +A command returning a ``stream`` response issues a
> +``application/mercurial-0.*`` media type and the HTTP response is typically
>  using *chunked transfer* (``Transfer-Encoding: chunked``).
>
>  SSH Transport
> @@ -233,6 +245,29 @@ 2006).
>  This capability was introduced at the same time as the ``lookup``
>  capability/command.
>
> +compression
> +-----------
> +
> +Declares support for negotiating compression formats.
> +
> +Presence of this capability indicates the server supports dynamic selection
> +of compression formats based on the client request.
> +
> +Servers advertising this capability are required to support the
> +``application/mercurial-0.2`` media type in response to commands returning
> +streams. Servers may support this media type on any command.
> +
> +The value of the capability is a comma-delimited list of strings declaring
> +supported compression formats. The order of the compression formats is in
> +server-preferred order, most preferred first.
> +
> +The compression format strings are 2 byte identifiers. These are the same
> +2 byte *header* values at the beginning of ``application/mercurial-0.2``
> +media types (as used by the HTTP transport).
> +
> +This capability was introduced in Mercurial 4.1 (released February
> +2017).
> +
>  getbundle
>  ---------
>
> @@ -416,6 +451,46 @@ Mercurial server replies to the client-i
>  not conforming to the expected command responses is assumed to be not related
>  to Mercurial and can be ignored.
>
> +Content Negotiation
> +===================
> +
> +The wire protocol has some mechanisms to help peers determine what content
> +types and encoding the other side will accept. Historically, these mechanisms
> +have been built into commands themselves because most commands only send a
> +well-defined response type and only certain commands needed to support
> +functionality like compression.
> +
> +Currently, only the HTTP transport supports content negotiation at the protocol
> +layer.
> +
> +HTTP requests advertise accepted media types via the ``Accept`` header.
> +
> +All clients should advertise an ``application/mercurial-0.1`` value.
> +
> +Clients supporting it can also advertise ``application/mercurial-0.2``.
> +This media type supports the ``comp`` parameter to declare which compression
> +formats the client accepts. The value is a ``quoted-string`` (defined by
> +HTTP specification) containing a space-delimited list of 2 byte compression
> +format identifiers. e.g. ``application/mercurial-0.2; comp="ZS ZL UN"``.
> +If the ``comp`` parameter is absent, the server interprets this as equivalent
> +to ``ZL UN``.
> +
> +Clients may choose to only advertise the ``application/mercurial-0.2`` media
> +type if the server advertises the ``compression`` capability.
> +
> +A server that doesn't receive an ``Accept`` header listing any
> +``application/mercurial-*`` values should infer that
> +``application/mercurial-0.1`` was sent, as this media type should be supported
> +by all clients ever written.
> +
> +A server receiving multiple ``application/mercurial-*`` values may choose any
> +of them. For example, a server may issue ``application/mercurial-0.2`` only
> +for responses that it chooses to compress.
> +
> +A server may issue ``application/hg-*`` media types even though the client
> +does not specify support for them in an ``Accept`` header. This is for
> +backwards compatibility reasons.
> +
>  Commands
>  ========
>
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel@mercurial-scm.org
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
>
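
For reference, the negotiation rules in the quoted "Content Negotiation" section could be sketched on the server side roughly as below. This is only an illustration: the function names are made up, the ``Accept`` parsing is deliberately naive (it splits on commas and ignores q-values), and picking the first server-preferred format the client also accepts is an assumption, since the proposed text lets the server choose freely.

```python
DEFAULT_COMP = ["ZL", "UN"]  # the patch: an absent ``comp`` means "ZL UN"


def client_formats(accept: str):
    """Return the compression IDs accepted for mercurial-0.2, or None."""
    for item in accept.split(","):
        parts = [p.strip() for p in item.split(";")]
        if parts[0] != "application/mercurial-0.2":
            continue
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "comp":
                # quoted-string holding a space-delimited list of IDs
                return value.strip().strip('"').split()
        return list(DEFAULT_COMP)
    return None  # client did not advertise mercurial-0.2 at all


def choose(accept: str, server_prefs):
    """Pick (media type, compression ID) for a response."""
    formats = client_formats(accept)
    if formats is not None:
        for fmt in server_prefs:  # server-preferred order, best first
            if fmt in formats:
                return "application/mercurial-0.2", fmt
    # Fall back to the legacy media type, per the proposed text.
    return "application/mercurial-0.1", None


header = 'application/mercurial-0.1, application/mercurial-0.2; comp="ZS ZL UN"'
print(choose(header, ["ZS", "ZL", "UN"]))
print(choose("application/mercurial-0.2", ["ZS", "ZL", "UN"]))
```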
Augie Fackler - Nov. 22, 2016, 2:26 a.m.
> On Nov 21, 2016, at 7:00 PM, Kyle Lippincott <spectral@pewpew.net> wrote:
> 
> The 2 character limitation concerns me, because it doesn't give many usable values (considering that a lot of compression has a Z in it, this is perhaps fewer than you might expect) or mechanisms to describe variations.
> 
> Examples:
> Would lz4 be encoded as z4, l4, or lz?  (lz seems bad, since lzma, lzf, lzo, quicklz)
> Would lz4-without-framing (currently in use by remotefilelog) be represented differently than lz4-with-framing?  How so?

I thought about suggesting what my old Mac brain would call a fourcc instead of a two character code. I’m +0 on the idea of at least using four characters.
Gregory Szorc - Nov. 25, 2016, 6:26 p.m.
On Mon, Nov 21, 2016 at 2:46 PM, Augie Fackler <raf@durin42.com> wrote:

> On Sun, Nov 20, 2016 at 02:23:43PM -0800, Gregory Szorc wrote:
> > # HG changeset patch
> > # User Gregory Szorc <gregory.szorc@gmail.com>
> > # Date 1479679271 28800
> > #      Sun Nov 20 14:01:11 2016 -0800
> > # Node ID 952478a50f2583be4400c0f6fcc156d73d46711c
> > # Parent  8d1b65503e8b360dd5121488f31d52a3587a0819
> > internals: document compression negotiation
>
> Lots of comments here. I've taken patches 1-5, and will review the
> rest of the series before stopping.
>
> >
> > As part of adding zstd support to all of the things, we'll need
> > to teach the wire protocol to support non-zlib compression formats.
> >
> > This commit documents how we'll implement that.
> >
> > To understand how we arrived at this proposal, let's look at how
> > things are done today.
> >
> > The wire protocol today doesn't have a unified format. Instead,
> > there is a limited facility for differentiating replies as successful
> > or not. And, each command essentially defines its own response format.
> >
> > A significant deficiency in the current protocol is the lack of
> > payload framing over the SSH transport. In the HTTP transport,
> > chunked transfer is used and the end of an HTTP response body (and
> > the end of a Mercurial command response) can be identified by a 0
> > length chunk. This is how HTTP chunked transfer works. But in the
> > SSH transport, there is no such framing, at least for certain
> > responses (notably the response to "getbundle" requests). Clients
> > can't simply read until end of stream because the socket is
> > persistent and reused for multiple requests. Clients need to know
> > when they've encountered the end of a request but there is nothing
> > simple for them to key off of to detect this. So what happens is
> > the client must decode the payload (as opposed to being dumb and
> > forwarding frames/packets). This means the payload itself needs
> > to support identifying end of stream. In some cases (bundle2), it
> > also means the payload can encode "error" or "interrupt" events
> > telling the client to e.g. abort processing. The lack of framing
> > on the SSH transport and the transfer of its responsibilities to
> > e.g. bundle2 is a massive layering violation and a wart on the
> > protocol architecture. It needs to be fixed someday by inventing a
> > proper framing protocol.
>
> Love this paragraph. It's a huge loss that the framing delegation
> happened, because it means the existing batch() method isn't streaming
> over ssh (but it is over http).
>
> >
> > So about compression.
> >
> > The client transport abstractions have a "_callcompressable()"
> > API. This API is called to invoke a remote command that will
> > send a compressable response. The response is essentially a
> > "streaming" response (no framing data at the Mercurial layer)
> > that is fed into a decompressor.
> >
> > On the HTTP transport, the decompressor is zlib and only zlib.
> > There is currently no mechanism for the client to specify an
> > alternate compression format. And, clients don't advertise what
> > compression formats they support or ask the server to send a
> > specific compression format. Instead, it is assumed that non-error
> > responses to "compressable" commands are zlib compressed.
> >
> > On the SSH transport, there is no compression at the Mercurial
> > protocol layer. Instead, compression must be handled by SSH
> > itself (e.g. `ssh -C`) or within the payload data (e.g. bundle
> > compression).
> >
> > For the HTTP transport, adding new compression formats is pretty
> > straightforward. Once you know what decompressor to use, you can
> > stream data into the decompressor until you reach a 0 size HTTP
> > chunk, at which point you are at end of stream.
> >
> > So our wire protocol changes for the HTTP transport are pretty
> > straightforward: the client and server advertise what compression
> > formats they support and an appropriate compression format is
> > chosen. We introduce a new HTTP media type to hold compressed
> > payloads. The first 2 bytes of the payload define the compression
> > format being used. Whoever is on the receiving end can sniff the
> > first 2 bytes and handle the remaining data accordingly.
> >
> > Support for multiple compression formats is advertised on both
> > server and client. The server advertises a "compression" capability
> > saying which compression formats it supports and in what order they
> > are preferred. Clients advertise their support for multiple
> > compression formats via the HTTP "Accept" header.
> >
> > Strictly speaking, servers don't need to advertise which compression
> > formats they support. But doing so allows clients to fail fast if
> > they don't support any of the formats the server does. This is useful
> > in situations like sending bundles, where the client may have to
> > perform expensive computation before sending data to the server.
> >
> > By advertising compression support on each request in the "Accept"
> > header and by introducing a new media type, the server is able
> > to gradually transition existing commands/responses to use compression,
> > even if they don't do so today. Contrast with the old world, where
> > "application/mercurial-0.1" may or may not use zlib compression
> > depending on the command being called. Compression is defined as
> > part of "application/mercurial-0.2," so if a client supports this
> > media type it supports compression.
> >
> > It's worth noting that we explicitly don't use "Accept-Encoding,"
> > "Content-Encoding," or "Transfer-Encoding" for handling compression.
> > People knowledgeable of the HTTP specifications will say that we
> > should use these because compression is a media or transfer encoding,
> > not a media type and dynamic compression is exactly what these
> > headers should be used for. They have a point and I sympathize with
> > the argument. However, my years of experience rolling out services
> > leveraging HTTP has taught me to not trust the HTTP layer, especially
> > if you are going outside the normal spec (such as using a custom
> > "Content-Encoding" value to represent zstd streams). I've seen load
> > balancers, proxies, and other network devices do very bad and
> > unexpected things to HTTP messages (like insisting zlib compressed
> > content is decoded and then re-encoded at a different compression level
> > or even stripping compression completely). I've found that the best
> > way to avoid surprises when writing protocols on top of HTTP is to use
> > HTTP as a dumb transport as much as possible to minimize the chances
> > that an "intelligent" agent between endpoints will muck with your data.
>
> Totally agreed on this front, in case others have qualms. Experience
> with git's smart-http protocol in the wild has motivated me to never
> trust invisible intermediate proxies to do reasonable things. I've got
> a variety of war stories debugging supposed bugs in the old google
> code stack only to discover that it was an http proxy problem between
> us and the client.
>
> > While the widespread use of TLS is mitigating many intermediate
> > network agents interfering with HTTP, there are still problems at the
> > edges, with e.g. the origin HTTP server needing to convert HTTP to and
> > from WSGI and buggy or feature-lacking HTTP client implementations.
> > I've found the best way to avoid these problems is to avoid using
> > headers like "Content-Encoding" and to bake as much logic as possible
> > into media types and HTTP message bodies. The protocol changes in this
> > commit do rely on the "Accept" and "Content-Type" headers. But we
> > used them before, so we shouldn't be increasing our exposure to "bad"
> > HTTP agents.
> >
> > What about SSH.
>
> s/\./?/
>
> >
> > For the SSH transport, we can't easily implement content negotiation
> > to determine compression formats because the SSH transport has no
> > content negotiation capabilities today. And without a framing protocol,
> > we don't know how much data to feed into a decompressor. So in order
> > to implement compression support on the SSH transport, we'd need to
> > invent a mechanism to represent content types and an outer framing
> > protocol to stream data robustly. While I'm fully capable of doing
> > that, it is a lot of work and not something that should be undertaken
> > lightly.
>
> > My opinion is that if we're going to change the SSH transport
> > protocol, we should take a long hard look at implementing a grand
> > unified protocol that attempts to address all the deficiencies with
> > the existing protocol.
>
> Yes, totally agreed.
>
> > While I want this to happen, that would be
> > massive scope bloat standing in the way of zstd support. So, I've
> > decided to take the easy solution: the SSH transport will not gain
> > support for multiple compression formats. Keep in mind it doesn't
> > support *any* compression today. So essentially nothing is changing
> > on the SSH front.
>
> This sounds like a reasonable approach here. I'd like to get a clean
> re-do on the wire protocol some day, but I suspect it'll be many moons
> before someone is willing to pay for that work (and it seems unlikely
> I'll get around to it for the laughs).
>
> I wonder if it'd be reasonable to have an sshv2 protocol just *be*
> http-over-ssh via stdin/stdout? I feel a touch dirty even suggesting
> it, but the framing rules etc are already there...
>

It's not a horrible idea. It's certainly convenient. Of course, that means
Python's built-in HTTP server will be generating HTTP messages. I'm not
sure how I feel about that. I assume most production Mercurial HTTP servers
today are doing WSGI into a "real" HTTP server. I don't want to think about
what differences in behavior the Python HTTP server stack brings to the
table.

The features of HTTP we're relying on in the protocol today are chunked
transfer and a limited set of HTTP headers.

Chunked transfer is kinda bad from a performance perspective because it
relies on readline() to find the next chunk's size. Some of the many
readline() implementations in the code today are so poorly done that they
show up as hotspots for some operations. I've seen it enough times that I
want to abolish reliance on readline() and use fixed width (or varint)
encoding for frame sizes everywhere possible.
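
To illustrate what I mean by fixed-width frame sizes, here is a minimal Python sketch. The 4-byte big-endian length prefix, the zero-length end-of-stream frame, and the function names are all assumptions for the sake of the example, not a proposed wire format.

```python
import io
import struct

_LEN = struct.Struct(">I")  # 4-byte big-endian unsigned frame length


def write_frame(out, data: bytes) -> None:
    """Write one frame: fixed-width length followed by the payload."""
    out.write(_LEN.pack(len(data)))
    out.write(data)


def read_exact(fh, n: int) -> bytes:
    """Read exactly n bytes or raise EOFError on a short stream."""
    buf = bytearray()
    while len(buf) < n:
        chunk = fh.read(n - len(buf))
        if not chunk:
            raise EOFError("stream ended mid-frame")
        buf += chunk
    return bytes(buf)


def read_frames(fh):
    """Yield payloads until a zero-length frame marks end of stream."""
    while True:
        (size,) = _LEN.unpack(read_exact(fh, _LEN.size))
        if size == 0:
            return
        yield read_exact(fh, size)


stream = io.BytesIO()
write_frame(stream, b"hello")
write_frame(stream, b"world!")
write_frame(stream, b"")  # end-of-stream marker
stream.seek(0)
print(list(read_frames(stream)))
```

The reader always knows how many bytes to ask for next, so it never has to scan for a line terminator the way chunked transfer decoding does.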

Anyway, I don't think it's too difficult for us to devise a custom protocol
that provides the parts we need without all the HTTP baggage. We could even
optionally use this protocol on HTTP servers via the Upgrade header
(although I know there are good reasons to avoid that).

A new wire protocol is definitely something a few of us should set time
aside to discuss, possibly at the next sprint.


>
> >
> > diff --git a/mercurial/help/internals/wireprotocol.txt b/mercurial/help/internals/wireprotocol.txt
> > --- a/mercurial/help/internals/wireprotocol.txt
> > +++ b/mercurial/help/internals/wireprotocol.txt
> > @@ -68,8 +68,16 @@ Example HTTP requests::
> >  The ``Content-Type`` HTTP response header identifies the response as coming
> >  from Mercurial and can also be used to signal an error has occurred.
> >
> > -The ``application/mercurial-0.1`` media type indicates a generic Mercurial
> > -response. It matches the media type sent by the client.
> > +The ``application/mercurial-*`` media types indicate a generic Mercurial
> > +data type.
> > +
> > +The ``application/mercurial-0.1`` media type is raw Mercurial data.
>
> Perhaps the word legacy wants to be in this statement.
>
> > +
> > +The ``application/mercurial-0.2`` media type is compression framed Mercurial
> > +data. The first 2 bytes of the payload indicate the compression format
> > +used. The remaining bytes are compressed according to that compression
> > +format. The decompressed data behaves the same as with
> > +``application/mercurial-0.1``.
> >
> >  The ``application/hg-error`` media type indicates a generic error occurred.
> >  The content of the HTTP response body typically holds text describing the
> > @@ -81,15 +89,19 @@ type.
> >  Clients also accept the ``text/plain`` media type. All other media
> >  types should cause the client to error.
> >
> > +Behavior of media types is further described in the ``Content Negotiation``
> > +section below.
> > +
> >  Clients should issue a ``User-Agent`` request header that identifies the client.
> >  The server should not use the ``User-Agent`` for feature detection.
> >
> > -A command returning a ``string`` response issues the
> > -``application/mercurial-0.1`` media type and the HTTP response body contains
> > -the raw string value. A ``Content-Length`` header is typically issued.
> > +A command returning a ``string`` response issues a
> > +``application/mercurial-0.*`` media type and the HTTP response body contains
> > +the raw string value (after compression decoding, if used). A
> > +``Content-Length`` header is typically issued, but not required.
> >
> > -A command returning a ``stream`` response issues the
> > -``application/mercurial-0.1`` media type and the HTTP response is typically
> > +A command returning a ``stream`` response issues a
> > +``application/mercurial-0.*`` media type and the HTTP response is typically
> >  using *chunked transfer* (``Transfer-Encoding: chunked``).
> >
> >  SSH Transport
> > @@ -233,6 +245,29 @@ 2006).
> >  This capability was introduced at the same time as the ``lookup``
> >  capability/command.
> >
> > +compression
> > +-----------
> > +
> > +Declares support for negotiating compression formats.
> > +
> > +Presence of this capability indicates the server supports dynamic selection
> > +of compression formats based on the client request.
> > +
> > +Servers advertising this capability are required to support the
> > +``application/mercurial-0.2`` media type in response to commands returning
> > +streams. Servers may support this media type on any command.
> > +
> > +The value of the capability is a comma-delimited list of strings declaring
> > +supported compression formats. The order of the compression formats is in
> > +server-preferred order, most preferred first.
> > +
> > +The compression format strings are 2 byte identifiers. These are the same
> > +2 byte *header* values at the beginning of ``application/mercurial-0.2``
> > +media types (as used by the HTTP transport).
> > +
> > +This capability was introduced in Mercurial 4.1 (released February
> > +2017).
>
> Mention that as of that release it was not yet used over the ssh
> transport? Or state that it was only used over http?
>
> > +
> >  getbundle
> >  ---------
> >
> > @@ -416,6 +451,46 @@ Mercurial server replies to the client-i
> >  not conforming to the expected command responses is assumed to be not related
> >  to Mercurial and can be ignored.
> >
> > +Content Negotiation
> > +===================
> > +
> > +The wire protocol has some mechanisms to help peers determine what content
> > +types and encoding the other side will accept. Historically, these mechanisms
> > +have been built into commands themselves because most commands only send a
> > +well-defined response type and only certain commands needed to support
> > +functionality like compression.
> > +
> > +Currently, only the HTTP transport supports content negotiation at the protocol
> > +layer.
> > +
> > +HTTP requests advertise accepted media types via the ``Accept`` header.
> > +
> > +All clients should advertise an ``application/mercurial-0.1`` value.
> > +
> > +Clients supporting it can also advertise ``application/mercurial-0.2``.
> > +This media type supports the ``comp`` parameter to declare which compression
> > +formats the client accepts. The value is a ``quoted-string`` (defined by
> > +HTTP specification) containing a space-delimited list of 2 byte compression
> > +format identifiers. e.g. ``application/mercurial-0.2; comp="ZS ZL UN"``.
> > +If the ``comp`` parameter is absent, the server interprets this as equivalent
> > +to ``ZL UN``.
> > +
> > +Clients may choose to only advertise the ``application/mercurial-0.2`` media
> > +type if the server advertises the ``compression`` capability.
> > +
> > +A server that doesn't receive an ``Accept`` header listing any
> > +``application/mercurial-*`` values should infer that
> > +``application/mercurial-0.1`` was sent, as this media type should be supported
> > +by all clients ever written.
>
> I'd like to be more cautious in the wording here, and give servers
> room to reject old clients for not understanding
> application/mercurial-0.2. It's not hard for me to envision a future
> where someone writes a modern-proto-only client in
> Java/Go/Rust/Piet/whatever, and it'd be nice to have a defined way for
> the client and server to identify each other in that case. I also want
> to be able to run servers that intentionally lock out old clients
> (potentially for scaling reasons, since we're talking about
> compression performance).
>
> > +
> > +A server receiving multiple ``application/mercurial-*`` values may choose any
> > +of them. For example, a server may issue ``application/mercurial-0.2`` only
> > +for responses that it chooses to compress.
> > +
> > +A server may issue ``application/hg-*`` media types even though the client
> > +does not specify support for them in an ``Accept`` header. This is for
> > +backwards compatibility reasons.
>
> Can we provide guidance here that new servers should not issue these
> forms? Other than hg-error, we don't send them today, so I'd like to
> put a stop to any growth of that now (especially
> application/hg-changegroup, which looks super crufty and ancient).
>
> > +
> >  Commands
> >  ========
> >
>
Gregory Szorc - Nov. 26, 2016, 5:33 p.m.
On Mon, Nov 21, 2016 at 4:00 PM, Kyle Lippincott <spectral@pewpew.net>
wrote:

>
>
> On Sun, Nov 20, 2016 at 2:23 PM, Gregory Szorc <gregory.szorc@gmail.com>
> wrote:
>
>> # HG changeset patch
>> # User Gregory Szorc <gregory.szorc@gmail.com>
>> # Date 1479679271 28800
>> #      Sun Nov 20 14:01:11 2016 -0800
>> # Node ID 952478a50f2583be4400c0f6fcc156d73d46711c
>> # Parent  8d1b65503e8b360dd5121488f31d52a3587a0819
>> internals: document compression negotiation
>>
>> As part of adding zstd support to all of the things, we'll need
>> to teach the wire protocol to support non-zlib compression formats.
>>
>> This commit documents how we'll implement that.
>>
>> To understand how we arrived at this proposal, let's look at how
>> things are done today.
>>
>> The wire protocol today doesn't have a unified format. Instead,
>> there is a limited facility for differentiating replies as successful
>> or not. And, each command essentially defines its own response format.
>>
>> A significant deficiency in the current protocol is the lack of
>> payload framing over the SSH transport. In the HTTP transport,
>> chunked transfer is used and the end of an HTTP response body (and
>> the end of a Mercurial command response) can be identified by a 0
>> length chunk. This is how HTTP chunked transfer works. But in the
>> SSH transport, there is no such framing, at least for certain
>> responses (notably the response to "getbundle" requests). Clients
>> can't simply read until end of stream because the socket is
>> persistent and reused for multiple requests. Clients need to know
>> when they've encountered the end of a request but there is nothing
>> simple for them to key off of to detect this. So what happens is
>> the client must decode the payload (as opposed to being dumb and
>> forwarding frames/packets). This means the payload itself needs
>> to support identifying end of stream. In some cases (bundle2), it
>> also means the payload can encode "error" or "interrupt" events
>> telling the client to e.g. abort processing. The lack of framing
>> on the SSH transport and the transfer of its responsibilities to
>> e.g. bundle2 is a massive layering violation and a wart on the
>> protocol architecture. It needs to be fixed someday by inventing a
>> proper framing protocol.
>>
>> So about compression.
>>
>> The client transport abstractions have a "_callcompressable()"
>> API. This API is called to invoke a remote command that will
>> send a compressable response. The response is essentially a
>> "streaming" response (no framing data at the Mercurial layer)
>> that is fed into a decompressor.
>>
>> On the HTTP transport, the decompressor is zlib and only zlib.
>> There is currently no mechanism for the client to specify an
>> alternate compression format. And, clients don't advertise what
>> compression formats they support or ask the server to send a
>> specific compression format. Instead, it is assumed that non-error
>> responses to "compressable" commands are zlib compressed.
>>
>> On the SSH transport, there is no compression at the Mercurial
>> protocol layer. Instead, compression must be handled by SSH
>> itself (e.g. `ssh -C`) or within the payload data (e.g. bundle
>> compression).
>>
>> For the HTTP transport, adding new compression formats is pretty
>> straightforward. Once you know what decompressor to use, you can
>> stream data into the decompressor until you reach a 0 size HTTP
>> chunk, at which point you are at end of stream.
>>
>> So our wire protocol changes for the HTTP transport are pretty
>> straightforward: the client and server advertise what compression
>> formats they support and an appropriate compression format is
>> chosen. We introduce a new HTTP media type to hold compressed
>> payloads. The first 2 bytes of the payload define the compression
>> format being used. Whoever is on the receiving end can sniff the
>> first 2 bytes and handle the remaining data accordingly.
>>
>> Support for multiple compression formats is advertised on both
>> server and client. The server advertises a "compression" capability
>> saying which compression formats it supports and in what order they
>> are preferred. Clients advertise their support for multiple
>> compression formats via the HTTP "Accept" header.
>>
>> Strictly speaking, servers don't need to advertise which compression
>> formats they support. But doing so allows clients to fail fast if
>> they don't support any of the formats the server does. This is useful
>> in situations like sending bundles, where the client may have to
>> perform expensive computation before sending data to the server.
>>
>> By advertising compression support on each request in the "Accept"
>> header and by introducing a new media type, the server is able
>> to gradually transition existing commands/responses to use compression,
>> even if they don't do so today. Contrast with the old world, where
>> "application/mercurial-0.1" may or may not use zlib compression
>> depending on the command being called. Compression is defined as
>> part of "application/mercurial-0.2," so if a client supports this
>> media type it supports compression.
>>
>> It's worth noting that we explicitly don't use "Accept-Encoding,"
>> "Content-Encoding," or "Transfer-Encoding" for handling compression.
>> People knowledgeable of the HTTP specifications will say that we
>> should use these because compression is a media or transfer encoding,
>> not a media type and dynamic compression is exactly what these
>> headers should be used for. They have a point and I sympathize with
>> the argument. However, my years of experience rolling out services
>> leveraging HTTP have taught me not to trust the HTTP layer, especially
>> if you are going outside the normal spec (such as using a custom
>> "Content-Encoding" value to represent zstd streams). I've seen load
>> balancers, proxies, and other network devices do very bad and
>> unexpected things to HTTP messages (like insisting zlib compressed
>> content is decoded and then re-encoded at a different compression level
>> or even stripping compression completely). I've found that the best
>> way to avoid surprises when writing protocols on top of HTTP is to use
>> HTTP as a dumb transport as much as possible to minimize the chances
>> that an "intelligent" agent between endpoints will muck with your data.
>> While the widespread use of TLS is mitigating many intermediate
>> network agents interfering with HTTP, there are still problems at the
>> edges, with e.g. the origin HTTP server needing to convert HTTP to and
>> from WSGI and buggy or feature-lacking HTTP client implementations.
>> I've found the best way to avoid these problems is to avoid using
>> headers like "Content-Encoding" and to bake as much logic as possible
>> into media types and HTTP message bodies. The protocol changes in this
>> commit do rely on the "Accept" and "Content-Type" headers. But we
>> used them before, so we shouldn't be increasing our exposure to "bad"
>> HTTP agents.
>>
>> What about SSH?
>>
>> For the SSH transport, we can't easily implement content negotiation
>> to determine compression formats because the SSH transport has no
>> content negotiation capabilities today. And without a framing protocol,
>> we don't know how much data to feed into a decompressor. So in order
>> to implement compression support on the SSH transport, we'd need to
>> invent a mechanism to represent content types and an outer framing
>> protocol to stream data robustly. While I'm fully capable of doing
>> that, it is a lot of work and not something that should be undertaken
>> lightly. My opinion is that if we're going to change the SSH transport
>> protocol, we should take a long hard look at implementing a grand
>> unified protocol that attempts to address all the deficiencies with
>> the existing protocol. While I want this to happen, that would be
>> massive scope bloat standing in the way of zstd support. So, I've
>> decided to take the easy solution: the SSH transport will not gain
>> support for multiple compression formats. Keep in mind it doesn't
>> support *any* compression today. So essentially nothing is changing
>> on the SSH front.
>>
>> diff --git a/mercurial/help/internals/wireprotocol.txt b/mercurial/help/internals/wireprotocol.txt
>> --- a/mercurial/help/internals/wireprotocol.txt
>> +++ b/mercurial/help/internals/wireprotocol.txt
>> @@ -68,8 +68,16 @@ Example HTTP requests::
>>  The ``Content-Type`` HTTP response header identifies the response as coming
>>  from Mercurial and can also be used to signal an error has occurred.
>>
>> -The ``application/mercurial-0.1`` media type indicates a generic Mercurial
>> -response. It matches the media type sent by the client.
>> +The ``application/mercurial-*`` media types indicate a generic Mercurial
>> +data type.
>> +
>> +The ``application/mercurial-0.1`` media type is raw Mercurial data.
>> +
>> +The ``application/mercurial-0.2`` media type is compression framed Mercurial
>> +data. The first 2 bytes of the payload indicate the compression format
>> +used. The remaining bytes are compressed according to that compression
>> +format. The decompressed data behaves the same as with
>> +``application/mercurial-0.1``.
>>
>
> The 2 character limitation concerns me, because it doesn't give many
> usable values (considering that a lot of compression has a Z in it, this is
> perhaps fewer than you might expect) or mechanisms to describe variations.
>
> Examples:
> Would lz4 be encoded as z4, l4, or lz?  (lz seems bad, since lzma, lzf,
> lzo, quicklz)
> Would lz4-without-framing (currently in use by remotefilelog) be
> represented differently than lz4-with-framing?  How so?
>
> This wasn't a large problem with a single list of compressors (typically
> when interacting with bundles, such as HG10UN or whatever), but with
> pluggable compressors this becomes a bigger problem :)
>

I agree with this concern. I'll find a way to address it in v2.
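
For concreteness, the 2 byte framing under discussion can be sketched roughly as below. This is only an illustration in Python 3, not the eventual implementation: the function name ``decode_payload`` and the ``DECODERS`` table are hypothetical, and the ``ZL`` and ``UN`` identifiers are taken from the ``comp="ZS ZL UN"`` example in the patch. ``ZS`` is deliberately left out because zstd needs a third-party module.

```python
import zlib

# Hypothetical decoder table keyed by 2 byte identifiers.
# "ZL" is assumed to mean zlib and "UN" uncompressed, following the
# comp="ZS ZL UN" example in the patch. "ZS" (zstd) is omitted here.
DECODERS = {
    b"ZL": zlib.decompress,
    b"UN": lambda data: data,
}


def decode_payload(body):
    """Sniff the 2 byte header and decode the rest of the body."""
    if len(body) < 2:
        raise ValueError("payload too short to hold a compression header")
    ident, rest = body[:2], body[2:]
    decoder = DECODERS.get(ident)
    if decoder is None:
        raise ValueError("unsupported compression format: %r" % ident)
    return decoder(rest)
```

With only 2 bytes available, every new engine has to claim one of a small set of readable identifiers, which is the collision problem raised above.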


>
> Regardless of how this ends up being implemented
> (X-MercurialCompressionFormat: zstd, or using a size byte ("4zstd<data>"),
> or using a delimiter ("zstd\0<data>"), etc.) we should document case
> sensitivity here.
>
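
The two in-band alternatives mentioned above (a size byte and a NUL delimiter) could be parsed along these lines. This is a sketch in Python 3 under stated assumptions: the function names are hypothetical, the size byte is assumed to be a single raw byte holding the length of the name, and names are compared as exact bytes (that is, case-sensitively), which is precisely the point that would need documenting.

```python
def split_length_prefixed(body):
    """Parse b"\\x04zstd<data>": one size byte, then the name, then data."""
    if not body:
        raise ValueError("empty payload")
    size = body[0]  # indexing bytes yields an int in Python 3
    name = body[1:1 + size]
    if len(name) != size:
        raise ValueError("truncated compression name")
    return name, body[1 + size:]


def split_nul_delimited(body):
    """Parse b"zstd\\0<data>": the name runs up to the first NUL byte."""
    name, sep, rest = body.partition(b"\0")
    if not sep:
        raise ValueError("missing NUL delimiter")
    return name, rest
```

Both variants allow names of arbitrary length such as ``lz4`` or ``zstd``. The NUL-delimited form only splits on the first NUL, so the compressed data itself may contain NUL bytes.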
> Is 'batch' handled in any special fashion by this?  Does the batch
> response need to be either entirely compressed or entirely uncompressed, or
> are we anticipating individual commands making the decision, and batch
> being an intermediate (non-compressed) framing format around the commands?
> It might be nice if I could do something like batch cmds=getfile
> foo;heads;getfile bar  and return "lz4\0<foo_data>;un\0<heads_data>;zstd\0<bar_data>"
> depending on which compressed files I happened to already have available.
> I know augie had been possibly re-thinking batch, we may want to have that
> discussion now :)
>

In the new world, any command /could/ send a streaming and compressible
application/mercurial-0.2 response. However, that would only work on the
HTTP transport. Furthermore, we'd need a new API on the server to indicate
a response had to be a simple string for SSH and legacy HTTP and streaming
for modern HTTP (because we can't just convert the response type of
existing commands).


>
> If we're going to be putting data at the beginning of these blocks (and if
> we're doing batch, I think we might need to), can we make it extensible?
> I'm at risk of sounding like a broken record, but any time we find a reason
> to put *some* metadata somewhere, we essentially always end up finding a
> reason to put some *other* metadata somewhere and then have to design for
> that.  Specifically, it might be better if instead of just assuming
> '<compressiontype>\0' prefixes the response, we have something like we have
> for bundlecaps: some format of specifying key/value pairs.  I can think of
> three that I want, already: 1) compressiontype, 2) batchnum [to handle
> out-of-order batch responses], 3) batchcode [if an individual batch message
> encounters an error, there needs to be some way of indicating the error
> status vs. the success status.  I think it's remotefilelog that currently
> does this inline, via <int>\0<data_or_error_str>, formalizing that and
> making it available to other things would be nice :)]
>

For HTTP, we kinda/sorta have that in HTTP headers. But it only works for
metadata known before any response body data is sent (I'm pretending HTTP
Trailers don't exist because that little-used HTTP feature should not be
used).

For SSH, we'd need to convey metadata as part of the response payload.

I agree that having a formalized place to stuff metadata would be nice. But
that seemingly requires a new wire protocol format to satisfy SSH. So I'm
going to call scope bloat.

Regarding batch responses, keep in mind that the existing batch command is
essentially a workaround for the fact that the protocol is a synchronous
request-response protocol. While that will always be true for HTTP/1, SSH,
HTTP/2, or a custom protocol tunneled over HTTP/1 via the Upgrade header could
allow concurrent requests with out-of-order responses. However, I believe
we have issues with some servers still requiring HTTP/1 and not allowing
Upgrade header magic. So perhaps we'll always be stuck with the existing
batch command and its request semantics. That being said, if the client
doesn't signal end of request body, we could theoretically perform all
server interaction as part of a single HTTP request/response by having the
client keep sending new commands as part of a POST body *after* it has
received responses from that very same HTTP request. I don't want to think
about the ugly hacks we may have to employ for HTTP/1...




>
>
>>  The ``application/hg-error`` media type indicates a generic error occurred.
>>  The content of the HTTP response body typically holds text describing the
>> @@ -81,15 +89,19 @@ type.
>>  Clients also accept the ``text/plain`` media type. All other media
>>  types should cause the client to error.
>>
>> +Behavior of media types is further described in the ``Content Negotiation``
>> +section below.
>> +
>>  Clients should issue a ``User-Agent`` request header that identifies the client.
>>  The server should not use the ``User-Agent`` for feature detection.
>>
>> -A command returning a ``string`` response issues the
>> -``application/mercurial-0.1`` media type and the HTTP response body contains
>> -the raw string value. A ``Content-Length`` header is typically issued.
>> +A command returning a ``string`` response issues a
>> +``application/mercurial-0.*`` media type and the HTTP response body contains
>> +the raw string value (after compression decoding, if used). A
>> +``Content-Length`` header is typically issued, but not required.
>>
>> -A command returning a ``stream`` response issues the
>> -``application/mercurial-0.1`` media type and the HTTP response is typically
>> +A command returning a ``stream`` response issues a
>> +``application/mercurial-0.*`` media type and the HTTP response is typically
>>  using *chunked transfer* (``Transfer-Encoding: chunked``).
>>
>>  SSH Transport
>> @@ -233,6 +245,29 @@ 2006).
>>  This capability was introduced at the same time as the ``lookup``
>>  capability/command.
>>
>> +compression
>> +-----------
>> +
>> +Declares support for negotiating compression formats.
>> +
>> +Presence of this capability indicates the server supports dynamic selection
>> +of compression formats based on the client request.
>> +
>> +Servers advertising this capability are required to support the
>> +``application/mercurial-0.2`` media type in response to commands returning
>> +streams. Servers may support this media type on any command.
>> +
>> +The value of the capability is a comma-delimited list of strings declaring
>> +supported compression formats. The order of the compression formats is in
>> +server-preferred order, most preferred first.
>> +
>> +The compression format strings are 2 byte identifiers. These are the same
>> +2 byte *header* values at the beginning of ``application/mercurial-0.2``
>> +media types (as used by the HTTP transport).
>> +
>> +This capability was introduced in Mercurial 4.1 (released February
>> +2017).
>> +
>>  getbundle
>>  ---------
>>
>> @@ -416,6 +451,46 @@ Mercurial server replies to the client-i
>>  not conforming to the expected command responses is assumed to be not related
>>  to Mercurial and can be ignored.
>>
>> +Content Negotiation
>> +===================
>> +
>> +The wire protocol has some mechanisms to help peers determine what content
>> +types and encoding the other side will accept. Historically, these mechanisms
>> +have been built into commands themselves because most commands only send a
>> +well-defined response type and only certain commands needed to support
>> +functionality like compression.
>> +
>> +Currently, only the HTTP transport supports content negotiation at the protocol
>> +layer.
>> +
>> +HTTP requests advertise accepted media types via the ``Accept`` header.
>> +
>> +All clients should advertise an ``application/mercurial-0.1`` value.
>> +
>> +Clients supporting it can also advertise ``application/mercurial-0.2``.
>> +This media type supports the ``comp`` parameter to declare which compression
>> +formats the client accepts. The value is a ``quoted-string`` (defined by
>> +HTTP specification) containing a space-delimited list of 2 byte compression
>> +format identifiers. e.g. ``application/mercurial-0.2; comp="ZS ZL UN"``.
>> +If the ``comp`` parameter is absent, the server interprets this as equivalent
>> +to ``ZL UN``.
>> +
>> +Clients may choose to only advertise the ``application/mercurial-0.2`` media
>> +type if the server advertises the ``compression`` capability.
>> +
>> +A server that doesn't receive an ``Accept`` header listing any
>> +``application/mercurial-*`` values should infer that
>> +``application/mercurial-0.1`` was sent, as this media type should be supported
>> +by all clients ever written.
>> +
>> +A server receiving multiple ``application/mercurial-*`` values may choose any
>> +of them. For example, a server may issue ``application/mercurial-0.2`` only
>> +for responses that it chooses to compress.
>> +
>> +A server may issue ``application/hg-*`` media types even though the client
>> +does not specify support for them in an ``Accept`` header. This is for
>> +backwards compatibility reasons.
>> +
>>  Commands
>>  ========
>>

Patch

diff --git a/mercurial/help/internals/wireprotocol.txt b/mercurial/help/internals/wireprotocol.txt
--- a/mercurial/help/internals/wireprotocol.txt
+++ b/mercurial/help/internals/wireprotocol.txt
@@ -68,8 +68,16 @@  Example HTTP requests::
 The ``Content-Type`` HTTP response header identifies the response as coming
 from Mercurial and can also be used to signal an error has occurred.
 
-The ``application/mercurial-0.1`` media type indicates a generic Mercurial
-response. It matches the media type sent by the client.
+The ``application/mercurial-*`` media types indicate a generic Mercurial
+data type.
+
+The ``application/mercurial-0.1`` media type is raw Mercurial data.
+
+The ``application/mercurial-0.2`` media type is compression framed Mercurial
+data. The first 2 bytes of the payload indicate the compression format
+used. The remaining bytes are compressed according to that compression
+format. The decompressed data behaves the same as with
+``application/mercurial-0.1``.
 
 The ``application/hg-error`` media type indicates a generic error occurred.
 The content of the HTTP response body typically holds text describing the
@@ -81,15 +89,19 @@  type.
 Clients also accept the ``text/plain`` media type. All other media
 types should cause the client to error.
 
+Behavior of media types is further described in the ``Content Negotiation``
+section below.
+
 Clients should issue a ``User-Agent`` request header that identifies the client.
 The server should not use the ``User-Agent`` for feature detection.
 
-A command returning a ``string`` response issues the
-``application/mercurial-0.1`` media type and the HTTP response body contains
-the raw string value. A ``Content-Length`` header is typically issued.
+A command returning a ``string`` response issues a
+``application/mercurial-0.*`` media type and the HTTP response body contains
+the raw string value (after compression decoding, if used). A
+``Content-Length`` header is typically issued, but not required.
 
-A command returning a ``stream`` response issues the
-``application/mercurial-0.1`` media type and the HTTP response is typically
+A command returning a ``stream`` response issues a
+``application/mercurial-0.*`` media type and the HTTP response is typically
 using *chunked transfer* (``Transfer-Encoding: chunked``).
 
 SSH Transport
@@ -233,6 +245,29 @@  2006).
 This capability was introduced at the same time as the ``lookup``
 capability/command.
 
+compression
+-----------
+
+Declares support for negotiating compression formats.
+
+Presence of this capability indicates the server supports dynamic selection
+of compression formats based on the client request.
+
+Servers advertising this capability are required to support the
+``application/mercurial-0.2`` media type in response to commands returning
+streams. Servers may support this media type on any command.
+
+The value of the capability is a comma-delimited list of strings declaring
+supported compression formats. The order of the compression formats is in
+server-preferred order, most preferred first.
+
+The compression format strings are 2 byte identifiers. These are the same
+2 byte *header* values at the beginning of ``application/mercurial-0.2``
+media types (as used by the HTTP transport).
+
+This capability was introduced in Mercurial 4.1 (released February
+2017).
+
 getbundle
 ---------
 
@@ -416,6 +451,46 @@  Mercurial server replies to the client-i
 not conforming to the expected command responses is assumed to be not related
 to Mercurial and can be ignored.
 
+Content Negotiation
+===================
+
+The wire protocol has some mechanisms to help peers determine what content
+types and encoding the other side will accept. Historically, these mechanisms
+have been built into commands themselves because most commands only send a
+well-defined response type and only certain commands needed to support
+functionality like compression.
+
+Currently, only the HTTP transport supports content negotiation at the protocol
+layer.
+
+HTTP requests advertise accepted media types via the ``Accept`` header.
+
+All clients should advertise an ``application/mercurial-0.1`` value.
+
+Clients supporting it can also advertise ``application/mercurial-0.2``.
+This media type supports the ``comp`` parameter to declare which compression
+formats the client accepts. The value is a ``quoted-string`` (defined by
+HTTP specification) containing a space-delimited list of 2 byte compression
+format identifiers. e.g. ``application/mercurial-0.2; comp="ZS ZL UN"``.
+If the ``comp`` parameter is absent, the server interprets this as equivalent
+to ``ZL UN``.
+
+Clients may choose to only advertise the ``application/mercurial-0.2`` media
+type if the server advertises the ``compression`` capability.
+
+A server that doesn't receive an ``Accept`` header listing any
+``application/mercurial-*`` values should infer that
+``application/mercurial-0.1`` was sent, as this media type should be supported
+by all clients ever written.
+
+A server receiving multiple ``application/mercurial-*`` values may choose any
+of them. For example, a server may issue ``application/mercurial-0.2`` only
+for responses that it chooses to compress.
+
+A server may issue ``application/hg-*`` media types even though the client
+does not specify support for them in an ``Accept`` header. This is for
+backwards compatibility reasons.
+
 Commands
 ========
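
The negotiation rules in the ``Content Negotiation`` section of this patch can be sketched server-side roughly as follows. This is a hedged illustration in Python 3, not code from Mercurial: ``parse_accept`` and ``negotiate`` are hypothetical names, the ``Accept`` parsing is deliberately naive (it splits on ``,`` and ``;`` and ignores ``q`` values), and picking the first server-preferred format that the client also lists is an assumed policy, since the text only says the server may choose.

```python
MEDIA_01 = "application/mercurial-0.1"
MEDIA_02 = "application/mercurial-0.2"


def parse_accept(header):
    """Return {media_type: params} for application/mercurial-* entries."""
    result = {}
    for entry in header.split(","):
        parts = [p.strip() for p in entry.split(";")]
        media = parts[0]
        if not media.startswith("application/mercurial-"):
            continue
        params = {}
        for param in parts[1:]:
            key, sep, value = param.partition("=")
            if sep:
                # comp is a quoted-string; drop the surrounding quotes.
                params[key.strip()] = value.strip().strip('"')
        result[media] = params
    return result


def negotiate(accept_header, server_formats):
    """Return (media_type, compression_identifier_or_None).

    server_formats is the server's list in preferred order, e.g. the
    comma-delimited ``compression`` capability value split on ",".
    """
    accepted = parse_accept(accept_header or "")
    if MEDIA_02 not in accepted:
        # No mercurial media type advertised, or only 0.1: infer 0.1.
        return MEDIA_01, None
    # An absent comp parameter is equivalent to "ZL UN".
    client_formats = accepted[MEDIA_02].get("comp", "ZL UN").split()
    for fmt in server_formats:
        if fmt in client_formats:
            return MEDIA_02, fmt
    # Assumption: fall back to 0.1 when no format is shared.
    return MEDIA_01, None
```

For example, ``negotiate('application/mercurial-0.2; comp="ZS ZL UN", application/mercurial-0.1', "ZS,ZL,UN".split(","))`` selects ``application/mercurial-0.2`` with ``ZS``, while a request without any ``application/mercurial-*`` value is treated as ``application/mercurial-0.1``.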