Proposal for introducing Protobuf serialization

2016-03-05 17:33:12 -05:00 · 2016-03-05 17:33:12 -05:00 · 95cf60be23
parent 791c694c21
commit 95cf60be23
1 changed files with 509 additions and 0 deletions
--- a/docs/proposals/protobuf.md
+++ b/docs/proposals/protobuf.md
@@ -0,0 +1,509 @@
 <!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
 <!-- BEGIN STRIP_FOR_RELEASE -->
 <img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
 <h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>
 If you are using a released version of Kubernetes, you should
 refer to the docs that go with that version.
 <strong>
 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
 </strong>
 --
 <!-- END STRIP_FOR_RELEASE -->
 <!-- END MUNGE: UNVERSIONED_WARNING -->
 # Protobuf serialization and internal storage
 @smarterclayton
 March 2016
 ## Proposal and Motivation
 The Kubernetes API server is a "dumb server" which offers storage, versioning,
 validation, update, and watch semantics on API resources. In a large cluster
 the API server must efficiently retrieve, store, and deliver large numbers
 of coarse-grained objects to many clients. In addition, Kubernetes traffic is
 heavily biased towards intra-cluster traffic - as much as 90% of the requests
 served by the APIs are for internal cluster components like nodes, controllers,
 and proxies. Today the primary format for intra-cluster API communication is
 JSON, chosen for ease of client construction.
 At the current time, the latency of reaction to change in the cluster is
 dominated by the time required to load objects from persistent store (etcd),
 convert them to an output version, serialize them to JSON over the network, and
 then perform the reverse operation in clients. The cost of
 serialization/deserialization and the size of the bytes on the wire, as well
 as the memory garbage created during those operations, dominate the CPU and
 network usage of the API servers.
 In order to reach clusters of 10k nodes, we need roughly an order of magnitude
 efficiency improvement in a number of areas of the cluster, starting with the
 masters but also including API clients like controllers, kubelets, and node
 proxies.
 We propose to introduce a Protobuf serialization for all common API objects
 that can optionally be used by intra-cluster components. Experiments have
 demonstrated a 10x reduction in CPU use during serialization and deserialization,
 a 2x reduction in size in bytes on the wire, and a 6-9x reduction in the amount
 of objects created on the heap during serialization. The Protobuf schema
 for each object will be automatically generated from the external API Go structs
 we use to serialize to JSON.
 Benchmarking showed that the time spent on the server in a typical GET
 resembles:
          etcd -> decode -> defaulting -> convert to internal ->
    JSON          50us      5us           15us
    Proto         5us
    JSON          150allocs               80allocs
    Proto         100allocs
          process -> convert to external -> encode -> client
    JSON             15us                   40us
    Proto                                   5us
    JSON             80allocs               100allocs
    Proto                                   4allocs
 Protobuf has a huge benefit on encoding because it does not need to allocate
 temporary objects, just one large buffer. Changing to protobuf moves our
 hotspot back to conversion, not serialization.
 ## Design Points
 * Generate Protobuf schema from Go structs (like we do for JSON) to avoid
  manual schema update and drift
 * Generate Protobuf schema that is field equivalent to the JSON fields (no
  special types or enumerations), reducing drift for clients across formats.
 * Follow our existing API versioning rules (backwards compatible in major
  API versions, breaking changes across major versions) by creating one
  Protobuf schema per API type.
 * Continue to use the existing REST API patterns but offer an alternative
  serialization, which means existing client and server tooling can remain
  the same while benefiting from faster decoding.
 * Protobuf objects on disk or in etcd will need to be self identifying at
  rest, like JSON, in order for backwards compatibility in storage to work,
  so we must add an envelope with apiVersion and kind to wrap the nested
  object, and make the data format recognizable to clients.
 * Use the [gogo-protobuf](https://github.com/gogo/protobuf) Golang library to generate marshal/unmarshal
  operations, allowing us to bypass the expensive reflection used by the
  Golang JSON operations.
 ## Alternatives
 * We considered JSON compression to reduce size on wire, but that does not
  reduce the amount of memory garbage created during serialization and
  deserialization.
 * More efficient formats like Msgpack were considered, but they only offer
  a 2x speedup vs the 10x observed for Protobuf.
 * gRPC was considered, but is a larger change that requires more core
  refactoring. This approach does not eliminate the possibility of switching
  to gRPC in the future.
 * We considered attempting to improve JSON serialization, but the cost of
  implementing a more efficient serializer library than ugorji is
  significantly higher than creating a protobuf schema from our Go structs.
 ## Schema
 The Protobuf schema for each API group and version will be generated from
 the objects in that API group and version. The schema will be named using
 the package identifier of the Go package, i.e.
    k8s.io/kubernetes/pkg/api/v1
 Each top level object will be generated as a Protobuf message, i.e.:
    type Pod struct { ... }
    message Pod {}
 Since the Go structs are designed to be serialized to JSON (with only the
 int, string, bool, map, and array primitive types), we will use the
 canonical JSON serialization as the protobuf field type wherever possible,
 i.e.:
    JSON      Protobuf
    string -> string
    int    -> varint
    bool   -> bool
    array  -> repeated message|primitive
 We disallow the use of the Go `int` type in external fields because it is
 ambiguous depending on compiler platform, and instead always use `int32` or
 `int64`.
 We will use maps (a protobuf 3 extension that can serialize to protobuf 2)
 to represent JSON maps:
    JSON      Protobuf            Wire (proto2)
    map    -> map<string, ...> -> repeated Message { key string; value bytes }
 We will not convert known string constants to enumerations, since that
 would require extra logic we do not already have in JSON.
 To begin with, we will use Protobuf 3 to generate a Protobuf 2 schema, and
 in the future investigate a Protobuf 3 serialization. We will introduce
 abstractions that let us have more than a single protobuf serialization if
 necessary. Protobuf 3 would require us to support message types for
 pointer primitive (nullable) fields, which is more complex than Protobuf 2's
 support for pointers.
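 The type mapping described in this section can be sketched in a few lines of Go.
 This is only an illustration under stated assumptions, not the real generator:
 the names `baseType`, `fieldDecl`, `Message`, `Example`, and `Bad` are
 hypothetical, field numbers are assigned by position for brevity, and only the
 primitive kinds listed above are handled.

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
	"strings"
)

// baseType maps a Go type to a proto2 scalar or message type name.
func baseType(t reflect.Type) (string, error) {
	switch t.Kind() {
	case reflect.String:
		return "string", nil
	case reflect.Bool:
		return "bool", nil
	case reflect.Int32:
		return "int32", nil
	case reflect.Int64:
		return "int64", nil
	case reflect.Int:
		// The proposal disallows plain int in external types.
		return "", errors.New("plain int is platform dependent; use int32 or int64")
	case reflect.Ptr:
		return baseType(t.Elem())
	case reflect.Struct:
		return t.Name(), nil
	}
	return "", fmt.Errorf("unsupported Go kind %s", t.Kind())
}

// fieldDecl returns the label and type part of a proto2 field declaration.
func fieldDecl(t reflect.Type) (string, error) {
	switch t.Kind() {
	case reflect.Slice:
		elem, err := baseType(t.Elem())
		if err != nil {
			return "", err
		}
		return "repeated " + elem, nil
	case reflect.Map:
		key, err := baseType(t.Key())
		if err != nil {
			return "", err
		}
		val, err := baseType(t.Elem())
		if err != nil {
			return "", err
		}
		return fmt.Sprintf("map<%s, %s>", key, val), nil
	}
	base, err := baseType(t)
	if err != nil {
		return "", err
	}
	return "optional " + base, nil
}

// Message renders a proto2 message for struct v, naming fields after their
// JSON tags. Numbering fields by position is a simplification of this sketch.
func Message(v interface{}) (string, error) {
	t := reflect.TypeOf(v)
	var b strings.Builder
	fmt.Fprintf(&b, "message %s {\n", t.Name())
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		name := strings.Split(f.Tag.Get("json"), ",")[0]
		decl, err := fieldDecl(f.Type)
		if err != nil {
			return "", fmt.Errorf("field %s: %v", f.Name, err)
		}
		fmt.Fprintf(&b, "  %s %s = %d;\n", decl, name, i+1)
	}
	b.WriteString("}\n")
	return b.String(), nil
}

// Example and Bad are hypothetical API types used only for this sketch.
type Example struct {
	VolumeID  string            `json:"volumeID"`
	Partition int32             `json:"partition,omitempty"`
	ReadOnly  bool              `json:"readOnly,omitempty"`
	Labels    map[string]string `json:"labels,omitempty"`
	Args      []string          `json:"args,omitempty"`
}

type Bad struct {
	Count int `json:"count"`
}

func main() {
	out, err := Message(Example{})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
	_, err = Message(Bad{}) // rejected because of the plain int field
	fmt.Println("error:", err)
}
```

 Rejecting `int` at generation time, rather than silently picking a width, mirrors
 the rule above that external fields must use `int32` or `int64`.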
 ### Example of generated proto IDL
 Without gogo extensions:
 ```
 syntax = 'proto2';
 package k8s.io.kubernetes.pkg.api.v1;
 import "k8s.io/kubernetes/pkg/api/resource/generated.proto";
 import "k8s.io/kubernetes/pkg/api/unversioned/generated.proto";
 import "k8s.io/kubernetes/pkg/runtime/generated.proto";
 import "k8s.io/kubernetes/pkg/util/intstr/generated.proto";
 // Package-wide variables from generator "generated".
 option go_package = "v1";
 // Represents a Persistent Disk resource in AWS.
 //
 // An AWS EBS disk must exist before mounting to a container. The disk
 // must also be in the same AWS zone as the kubelet. An AWS EBS disk
 // can only be mounted as read/write once. AWS EBS volumes support
 // ownership management and SELinux relabeling.
 message AWSElasticBlockStoreVolumeSource {
  // Unique ID of the persistent disk resource in AWS (Amazon EBS volume).
  // More info: http://releases.k8s.io/HEAD/docs/user-guide/volumes.md#awselasticblockstore
  optional string volumeID = 1;
  // Filesystem type of the volume that you want to mount.
  // Tip: Ensure that the filesystem type is supported by the host operating system.
  // Examples: "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified.
  // More info: http://releases.k8s.io/HEAD/docs/user-guide/volumes.md#awselasticblockstore
  // TODO: how do we prevent errors in the filesystem from compromising the machine
  optional string fsType = 2;
  // The partition in the volume that you want to mount.
  // If omitted, the default is to mount by volume name.
  // Examples: For volume /dev/sda1, you specify the partition as "1".
  // Similarly, the volume partition for /dev/sda is "0" (or you can leave the property empty).
  optional int32 partition = 3;
  // Specify "true" to force and set the ReadOnly property in VolumeMounts to "true".
  // If omitted, the default is "false".
  // More info: http://releases.k8s.io/HEAD/docs/user-guide/volumes.md#awselasticblockstore
  optional bool readOnly = 4;
 }
 // Affinity is a group of affinity scheduling rules, currently
 // only node affinity, but in the future also inter-pod affinity.
 message Affinity {
  // Describes node affinity scheduling rules for the pod.
  optional NodeAffinity nodeAffinity = 1;
 }
 ```
 With extensions:
 ```
 syntax = 'proto2';
 package k8s.io.kubernetes.pkg.api.v1;
 import "github.com/gogo/protobuf/gogoproto/gogo.proto";
 import "k8s.io/kubernetes/pkg/api/resource/generated.proto";
 import "k8s.io/kubernetes/pkg/api/unversioned/generated.proto";
 import "k8s.io/kubernetes/pkg/runtime/generated.proto";
 import "k8s.io/kubernetes/pkg/util/intstr/generated.proto";
 // Package-wide variables from generator "generated".
 option (gogoproto.marshaler_all) = true;
 option (gogoproto.sizer_all) = true;
 option (gogoproto.unmarshaler_all) = true;
 option (gogoproto.goproto_unrecognized_all) = false;
 option (gogoproto.goproto_enum_prefix_all) = false;
 option (gogoproto.goproto_getters_all) = false;
 option go_package = "v1";
 // Represents a Persistent Disk resource in AWS.
 //
 // An AWS EBS disk must exist before mounting to a container. The disk
 // must also be in the same AWS zone as the kubelet. An AWS EBS disk
 // can only be mounted as read/write once. AWS EBS volumes support
 // ownership management and SELinux relabeling.
 message AWSElasticBlockStoreVolumeSource {
  // Unique ID of the persistent disk resource in AWS (Amazon EBS volume).
  // More info: http://releases.k8s.io/HEAD/docs/user-guide/volumes.md#awselasticblockstore
  optional string volumeID = 1 [(gogoproto.customname) = "VolumeID", (gogoproto.nullable) = false];
  // Filesystem type of the volume that you want to mount.
  // Tip: Ensure that the filesystem type is supported by the host operating system.
  // Examples: "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified.
  // More info: http://releases.k8s.io/HEAD/docs/user-guide/volumes.md#awselasticblockstore
  // TODO: how do we prevent errors in the filesystem from compromising the machine
  optional string fsType = 2 [(gogoproto.customname) = "FSType", (gogoproto.nullable) = false];
  // The partition in the volume that you want to mount.
  // If omitted, the default is to mount by volume name.
  // Examples: For volume /dev/sda1, you specify the partition as "1".
  // Similarly, the volume partition for /dev/sda is "0" (or you can leave the property empty).
  optional int32 partition = 3 [(gogoproto.customname) = "Partition", (gogoproto.nullable) = false];
  // Specify "true" to force and set the ReadOnly property in VolumeMounts to "true".
  // If omitted, the default is "false".
  // More info: http://releases.k8s.io/HEAD/docs/user-guide/volumes.md#awselasticblockstore
  optional bool readOnly = 4 [(gogoproto.customname) = "ReadOnly", (gogoproto.nullable) = false];
 }
 // Affinity is a group of affinity scheduling rules, currently
 // only node affinity, but in the future also inter-pod affinity.
 message Affinity {
  // Describes node affinity scheduling rules for the pod.
  optional NodeAffinity nodeAffinity = 1 [(gogoproto.customname) = "NodeAffinity"];
 }
 ```
 ## Wire format
 In order to make Protobuf serialized objects recognizable in a binary form,
 the encoded object must be prefixed by a magic number, and the
 non-self-describing Protobuf object must be wrapped in a Protobuf object that
 contains schema information. The inner Protobuf object is referred to as the
 `raw` object and the encapsulating object is referred to as the `wrapper` object.
 The simplest serialization is the raw Protobuf object with no identifying
 information. In some use cases, we may wish to have the server identify the
 raw object type on the wire using a protocol dependent format (gRPC uses
 a type HTTP header). This works when all objects are of the same type, but
 we occasionally have reasons to encode different object types in the same
 context (watches, lists of objects on disk, and API calls that may return
 errors).
 To identify the type of a wrapped Protobuf object, we wrap it in a message
 in package `k8s.io/kubernetes/pkg/runtime` with message name `Unknown`
 having the following schema:
    message Unknown {
      optional TypeMeta typeMeta = 1;
      optional bytes value = 2;
      optional string contentEncoding = 3;
      optional string contentType = 4;
    }
    message TypeMeta {
      optional string apiVersion = 1;
      optional string kind = 2;
    }
 The `value` field is an encoded protobuf object that matches the schema
 defined in `typeMeta`. The `contentType` and `contentEncoding` fields are
 optional and have the same meaning as in HTTP: if unspecified, `contentType`
 means "raw protobuf object" and `contentEncoding` defaults to no encoding.
 If `contentEncoding` is specified, the defined transformation should be
 applied to `value` before attempting to decode the value.
 The `contentType` field is required to support objects without a defined
 protobuf schema, like the ThirdPartyResource or templates. Those objects
 would have to be encoded as JSON or another structure-compatible form
 when used with Protobuf. Generic clients must deal with the possibility
 that the returned value is not of a known type.
 We add the `contentEncoding` field here to preserve room for future
 optimizations like encryption-at-rest or compression of the nested content.
 Clients should error when receiving an encoding they do not support.
 Negotiating encoding is not defined here, but introducing new encodings
 is similar to introducing a schema change or new API version.
 A client should use the `kind` and `apiVersion` fields to identify the
 correct protobuf IDL for that message and version, and then decode the
 `value` field (of type `bytes`) into that Protobuf message.
 Any `Unknown` value written to stable storage will be given a 4-byte prefix
 `0x6b, 0x38, 0x73, 0x00`, which corresponds to `k8s` followed by a zero byte.
 The content-type `application/vnd.kubernetes.protobuf` is defined as
 representing the following schema:
    MESSAGE = '0x6b 0x38 0x73 0x00' UNKNOWN
    UNKNOWN = <protobuf serialization of k8s.io/kubernetes/pkg/runtime#Unknown>
 A client should check for the first four bytes, then perform a protobuf
 deserialization of the remaining bytes into the `runtime.Unknown` type.
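 As a rough illustration of the envelope, the sketch below writes and reads the
 magic prefix plus an `Unknown` wrapper using only the Go standard library. It is
 hand-rolled for this document and is not the generated gogo-protobuf code: the
 Go names (`Unknown` as a flattened struct, `appendField`, `eachField`, `Encode`,
 `Decode`) are assumptions, and it only understands length-delimited fields,
 which is all the `Unknown` and `TypeMeta` messages contain.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
)

// magic is the 4-byte prefix from the proposal: "k8s" followed by a zero byte.
var magic = []byte{0x6b, 0x38, 0x73, 0x00}

// Unknown mirrors the wrapper message, with TypeMeta flattened for brevity.
type Unknown struct {
	APIVersion, Kind             string // typeMeta, field 1
	Value                        []byte // field 2
	ContentEncoding, ContentType string // fields 3 and 4
}

// appendField appends a length-delimited field (wire type 2), skipping
// empty values the way unset proto2 optional fields are omitted.
func appendField(b []byte, num int, data []byte) []byte {
	if len(data) == 0 {
		return b
	}
	var size [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(size[:], uint64(len(data)))
	b = append(b, byte(num<<3|2))
	b = append(b, size[:n]...)
	return append(b, data...)
}

// Encode returns the magic prefix followed by the serialized Unknown.
func Encode(u Unknown) []byte {
	tm := appendField(nil, 1, []byte(u.APIVersion))
	tm = appendField(tm, 2, []byte(u.Kind))
	out := append([]byte{}, magic...)
	out = appendField(out, 1, tm)
	out = appendField(out, 2, u.Value)
	out = appendField(out, 3, []byte(u.ContentEncoding))
	return appendField(out, 4, []byte(u.ContentType))
}

// eachField calls fn for every length-delimited field in b. Other wire
// types are rejected, which is enough for the messages used here.
func eachField(b []byte, fn func(num int, data []byte) error) error {
	for len(b) > 0 {
		key, n := binary.Uvarint(b)
		if n <= 0 || key&7 != 2 {
			return errors.New("malformed or unsupported field")
		}
		b = b[n:]
		size, n := binary.Uvarint(b)
		if n <= 0 || size > uint64(len(b)-n) {
			return errors.New("truncated field")
		}
		end := n + int(size)
		if err := fn(int(key>>3), b[n:end]); err != nil {
			return err
		}
		b = b[end:]
	}
	return nil
}

// Decode checks the first four bytes, then decodes the remainder as Unknown.
func Decode(b []byte) (Unknown, error) {
	var u Unknown
	if !bytes.HasPrefix(b, magic) {
		return u, errors.New("missing k8s magic prefix")
	}
	err := eachField(b[len(magic):], func(num int, data []byte) error {
		switch num {
		case 1: // nested TypeMeta message
			return eachField(data, func(num int, data []byte) error {
				switch num {
				case 1:
					u.APIVersion = string(data)
				case 2:
					u.Kind = string(data)
				}
				return nil
			})
		case 2:
			u.Value = data
		case 3:
			u.ContentEncoding = string(data)
		case 4:
			u.ContentType = string(data)
		}
		return nil
	})
	return u, err
}

func main() {
	enc := Encode(Unknown{APIVersion: "v1", Kind: "Pod", Value: []byte{0x0a}})
	fmt.Printf("% x\n", enc)
	dec, err := Decode(enc)
	fmt.Println(dec.APIVersion, dec.Kind, err)
}
```

 A real client would pass `Value` to the generated unmarshaller selected by
 `APIVersion` and `Kind`; the sketch stops at the envelope on purpose.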
 ## Streaming wire format
 While the majority of Kubernetes APIs return single objects that can vary
 in type (Pod vs Status, PodList vs Status), the watch APIs return a stream
 of identical objects (Events). At the time of this writing, this is the only
 current or anticipated streaming RESTful protocol (logging, port-forwarding,
 and exec protocols use a binary protocol over Websockets or SPDY).
 In JSON, this API is implemented as a stream of JSON objects that are
 separated by their syntax (the closing `}` brace is followed by whitespace
 and the opening `{` brace starts the next object). There is no formal
 specification covering this pattern, nor a unique content-type. Each object
 is expected to be of type `watch.Event`, and is currently not self describing.
 For expediency and consistency, we define a format for Protobuf watch Events
 that is similar. Since protobuf messages are not self describing, we must
 identify the boundaries between Events (a `frame`). We do that by prefixing
 each frame of N bytes with a 4-byte, big-endian, unsigned integer with the
 value N.
    frame  = length body
    length = 32-bit unsigned integer in big-endian order, denoting length of
             bytes of body
    body = <bytes>
    # frame containing a single byte 0a (length 1 encoded big-endian)
    frame = 00 00 00 01 0a
    # equivalent JSON
    frame = {"type": "added", ...}
 The body of each frame is a serialized Protobuf message `Event` in package
 `k8s.io/kubernetes/pkg/watch/versioned`. The content type used for this
 format is `application/vnd.kubernetes.protobuf;type=watch`.
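 The framing rule above is small enough to sketch directly. The following is a
 minimal illustration, assuming the helper names `WriteFrame`, `ReadFrame`,
 `EncodeFrames`, and `DecodeFrames` (none of which come from the proposal); it
 treats each body as opaque bytes and does not decode the `Event` message.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// WriteFrame writes a 4-byte big-endian length followed by the body.
func WriteFrame(w io.Writer, body []byte) error {
	var hdr [4]byte
	binary.BigEndian.PutUint32(hdr[:], uint32(len(body)))
	if _, err := w.Write(hdr[:]); err != nil {
		return err
	}
	_, err := w.Write(body)
	return err
}

// ReadFrame reads one frame. It returns io.EOF only when the stream ends
// cleanly on a frame boundary.
func ReadFrame(r io.Reader) ([]byte, error) {
	var hdr [4]byte
	if _, err := io.ReadFull(r, hdr[:]); err != nil {
		return nil, err
	}
	body := make([]byte, binary.BigEndian.Uint32(hdr[:]))
	if _, err := io.ReadFull(r, body); err != nil {
		if err == io.EOF {
			err = io.ErrUnexpectedEOF // header promised more bytes
		}
		return nil, err
	}
	return body, nil
}

// EncodeFrames concatenates the given bodies into one framed stream.
func EncodeFrames(bodies ...[]byte) ([]byte, error) {
	var buf bytes.Buffer
	for _, b := range bodies {
		if err := WriteFrame(&buf, b); err != nil {
			return nil, err
		}
	}
	return buf.Bytes(), nil
}

// DecodeFrames splits a framed stream back into its bodies.
func DecodeFrames(stream []byte) ([][]byte, error) {
	r := bytes.NewReader(stream)
	var out [][]byte
	for {
		body, err := ReadFrame(r)
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return out, err
		}
		out = append(out, body)
	}
}

func main() {
	stream, _ := EncodeFrames([]byte{0x0a}, []byte("abc"))
	fmt.Printf("% x\n", stream)
	frames, err := DecodeFrames(stream)
	fmt.Println(len(frames), err)
}
```

 A production reader would likely also cap the accepted frame length before
 allocating, since the length comes from the peer.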
 ## Negotiation
 To allow clients to request protobuf serialization optionally, the `Accept`
 HTTP header is used by callers to indicate which serialization they wish
 returned in the response, and the `Content-Type` header is used to tell the
 server how to decode the bytes sent in the request (for DELETE/POST/PUT/PATCH
 requests). The server will return 406 if the `Accept` header is not
 recognized or 415 if the `Content-Type` is not recognized (as defined in
 RFC2616).
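 A server-side check along these lines could look like the sketch below. It is
 only an illustration: `Negotiate`, `mediaType`, and `supported` are hypothetical
 names, media type parameters and q-values are ignored, and defaulting to JSON
 when no `Accept` header is sent is an assumption based on JSON being the
 primary format today.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

const (
	jsonType  = "application/json"
	protoType = "application/vnd.kubernetes.protobuf"
)

// mediaType strips parameters such as ";type=watch" or ";q=0.5".
func mediaType(v string) string {
	if i := strings.Index(v, ";"); i >= 0 {
		v = v[:i]
	}
	return strings.ToLower(strings.TrimSpace(v))
}

// supported reports whether this sketch server can read and write mt.
func supported(mt string) bool {
	return mt == jsonType || mt == protoType
}

// Negotiate validates the request Content-Type and picks the response media
// type from the Accept header. It returns the chosen type and an HTTP status.
func Negotiate(accept, contentType string) (string, int) {
	if contentType != "" && !supported(mediaType(contentType)) {
		return "", http.StatusUnsupportedMediaType // 415
	}
	if strings.TrimSpace(accept) == "" {
		return jsonType, http.StatusOK // assumed default
	}
	// Take the first acceptable entry in the order the client listed them.
	for _, part := range strings.Split(accept, ",") {
		mt := mediaType(part)
		if mt == "*/*" {
			return jsonType, http.StatusOK
		}
		if supported(mt) {
			return mt, http.StatusOK
		}
	}
	return "", http.StatusNotAcceptable // 406
}

func main() {
	fmt.Println(Negotiate(protoType+", "+jsonType, ""))
	fmt.Println(Negotiate("text/html", ""))
	fmt.Println(Negotiate(jsonType, "text/plain"))
}
```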
 To be backwards compatible, clients must consider that the server may not
 support protobuf serialization. A number of options are possible:
 ### Preconfigured
 Clients can have a configuration setting that instructs them which
 serialization to use. This is the simplest option, but requires manual
 intervention when the component upgrades to protobuf.
 ### Include serialization information in api-discovery
 Servers can define the list of content types they accept and return in
 their API discovery docs, and clients can use protobuf if they support it.
 This allows dynamic configuration during upgrade if the client is already
 using API-discovery.
 ### Optimistically attempt to send and receive requests using protobuf
 Using multiple `Accept` values:
    Accept: application/vnd.kubernetes.protobuf, application/json
 clients can indicate their preferences and handle whichever `Content-Type`
 the server returns. On update operations, clients can try protobuf and, if
 they receive a 415 error, record that and fall back to JSON. This allows the
 client to be backwards compatible with any server, but comes at the cost of
 some implementation complexity.
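 The optimistic approach can be sketched as a small client wrapper. Everything
 named here (`Client`, `Update`, `jsonOnlyServer`, `newStubClient`) is
 hypothetical, the URL is a placeholder, and the server is simulated with a
 stub `http.RoundTripper` so the example runs without a network.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

const (
	jsonType  = "application/json"
	protoType = "application/vnd.kubernetes.protobuf"
)

// Client remembers whether the server has rejected protobuf request bodies.
type Client struct {
	HTTP     *http.Client
	jsonOnly bool
}

// send issues a PUT with the given body encoding, accepting either format back.
func (c *Client) send(url, contentType string, body []byte) (*http.Response, error) {
	req, err := http.NewRequest(http.MethodPut, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", contentType)
	req.Header.Set("Accept", protoType+", "+jsonType)
	return c.HTTP.Do(req)
}

// Update tries protobuf first and falls back to JSON after a 415 response.
func (c *Client) Update(url string, protoBody, jsonBody []byte) (*http.Response, error) {
	if !c.jsonOnly {
		resp, err := c.send(url, protoType, protoBody)
		if err != nil || resp.StatusCode != http.StatusUnsupportedMediaType {
			return resp, err
		}
		resp.Body.Close()
		c.jsonOnly = true // record the 415 so later updates skip protobuf
	}
	return c.send(url, jsonType, jsonBody)
}

// jsonOnlyServer is a stub transport standing in for an older server that
// only understands JSON request bodies.
type jsonOnlyServer struct{ requests int }

func (s *jsonOnlyServer) RoundTrip(r *http.Request) (*http.Response, error) {
	s.requests++
	if r.Body != nil {
		r.Body.Close()
	}
	status := http.StatusOK
	if r.Header.Get("Content-Type") != jsonType {
		status = http.StatusUnsupportedMediaType
	}
	return &http.Response{
		StatusCode: status,
		Header:     http.Header{},
		Body:       http.NoBody,
		Request:    r,
	}, nil
}

// newStubClient wires a Client to the stub server.
func newStubClient() (*Client, *jsonOnlyServer) {
	srv := &jsonOnlyServer{}
	return &Client{HTTP: &http.Client{Transport: srv}}, srv
}

func main() {
	c, srv := newStubClient()
	resp, err := c.Update("http://example.invalid/api/v1/pods/a", []byte{0x0a}, []byte(`{}`))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(resp.StatusCode, "after", srv.requests, "requests")
}
```

 Remembering the 415 keeps the extra round trip to a single failed request per
 client rather than one per update.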
 ## Generation process
 Generation proceeds in five phases:
 1. Generate a gogo-protobuf annotated IDL from the source Go struct.
 2. Generate temporary Go structs from the IDL using gogo-protobuf.
 3. Generate marshallers/unmarshallers based on the IDL using gogo-protobuf.
 4. Take all tag numbers generated for the IDL and apply them as struct tags
   to the original Go types.
 5. Generate a final IDL without gogo-protobuf annotations as the canonical IDL.
 The output is a `generated.proto` file in each package containing a standard
 proto2 IDL, and a `generated.pb.go` file in each package that contains the
 generated marshallers/unmarshallers.
 The Go struct generated by gogo-protobuf from the first IDL must be identical
 to the original struct - a number of changes have been made to gogo-protobuf
 to ensure exact 1-1 conversion. A small number of additions may be necessary
 in the future if we introduce more exotic field types (Go type aliases, maps
 with aliased Go types, and embedded fields were fixed). If they are identical,
 the output marshallers/unmarshallers can then work on the original struct.
 Whenever a new field is added, generation will assign that field a unique tag
 and the 4th phase will write that tag back to the original Go struct as a
 `protobuf` struct tag. This ensures subsequent generation passes are stable,
 even in the face of internal refactors. The first time a field is added, the
 author will need to check in both the new IDL and the protobuf struct tag
 changes.
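 To make the stability argument concrete, the sketch below reads existing
 `protobuf` struct tags and hands out numbers only to fields that lack one. It
 is an illustration, not the generator: `AssignTags`, `existingTag`, and the
 `Volume` type are hypothetical, and "highest existing tag plus one" is an
 assumed policy for choosing new numbers.

```go
package main

import (
	"fmt"
	"reflect"
	"strconv"
	"strings"
)

// existingTag extracts the field number from a struct tag such as
// `protobuf:"bytes,1,opt,name=volumeID"`, returning 0 when there is none.
func existingTag(f reflect.StructField) int {
	parts := strings.Split(f.Tag.Get("protobuf"), ",")
	if len(parts) < 2 {
		return 0
	}
	n, err := strconv.Atoi(parts[1])
	if err != nil {
		return 0
	}
	return n
}

// AssignTags keeps every tag already written back to the struct and gives
// untagged fields the next unused numbers. Duplicate tags are an error.
func AssignTags(v interface{}) (map[string]int, error) {
	t := reflect.TypeOf(v)
	tags := map[string]int{}
	used := map[int]string{}
	next := 1
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		n := existingTag(f)
		if n == 0 {
			continue
		}
		if other, ok := used[n]; ok {
			return nil, fmt.Errorf("tag %d used by both %s and %s", n, other, f.Name)
		}
		used[n] = f.Name
		tags[f.Name] = n
		if n >= next {
			next = n + 1
		}
	}
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		if _, ok := tags[f.Name]; !ok {
			tags[f.Name] = next
			next++
		}
	}
	return tags, nil
}

// Volume is a hypothetical type: FSType was added after the first
// generation pass and therefore has no protobuf struct tag yet.
type Volume struct {
	VolumeID string `json:"volumeID" protobuf:"bytes,1,opt,name=volumeID"`
	FSType   string `json:"fsType,omitempty"`
	ReadOnly bool   `json:"readOnly,omitempty" protobuf:"varint,4,opt,name=readOnly"`
}

func main() {
	tags, err := AssignTags(Volume{})
	fmt.Println(tags, err)
}
```

 Because the existing numbers are read from the struct itself, inserting
 `FSType` between two tagged fields does not renumber either of them.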
 The second IDL is generated without gogo-protobuf annotations to allow clients
 in other languages to generate easily.
 Any errors in the generation process are considered fatal and must be resolved
 early (being unable to identify a field type for conversion, duplicate fields,
 duplicate tags, protoc errors, etc). The conversion fuzzer is used to ensure
 that a Go struct can be round-tripped to protobuf and back, as we do for JSON
 and conversion testing.
 ## Changes to development process
 All existing API change rules would still apply. New fields added would be
 automatically assigned a tag by the generation process. New API versions will
 have a new proto IDL, and field renames and other changes across API versions would be
 handled using our existing API change rules. Tags cannot change within an
 API version.
 Generation would be done by developers and then checked into source control,
 like conversions and ugorji JSON codecs.
 Because protoc is not packaged well across all platforms, we will add it to
 the `kube-cross` Docker image and developers can use that to generate
 updated protobufs. Protobuf 3 beta is required.
 The generated protobuf will be checked with a verify script before merging.
 ## Implications
 * The generated marshal code is large and will increase build times and binary
  size. We may be able to remove ugorji after protobuf is added, since the
  bulk of our decoding would switch to protobuf.
 * The protobuf schema is naive, which means it may not be as minimal as
  possible.
 * Debugging of protobuf related errors is harder due to the binary nature of
  the format.
 * Migrating API object storage from JSON to protobuf will require that all
  API servers are upgraded before beginning to write protobuf to disk, since
  old servers won't recognize protobuf.
 * Transport of protobuf between etcd and the API server will be less efficient
  in etcd2 than etcd3 (since etcd2 must encode binary values returned as JSON).
  It should still be smaller than the current JSON request.
 * Third-party API objects must be stored as JSON inside of a protobuf wrapper
  in etcd, and the API endpoints will not benefit from clients that speak
  protobuf. Clients will have to deal with some API objects not supporting
  protobuf.
 ## Open Questions
 * Is supporting stored protobuf files on disk in the kubectl client worth it?