From 95cf60be23b40e35b8ab25e7873f366974dea1c4 Mon Sep 17 00:00:00 2001 From: Clayton Coleman Date: Sat, 5 Mar 2016 17:33:12 -0500 Subject: [PATCH] Proposal for introducing Protobuf serialization --- docs/proposals/protobuf.md | 509 +++++++++++++++++++++++++++++++++++++ 1 file changed, 509 insertions(+) create mode 100644 docs/proposals/protobuf.md diff --git a/docs/proposals/protobuf.md b/docs/proposals/protobuf.md new file mode 100644 index 0000000000..e169e08288 --- /dev/null +++ b/docs/proposals/protobuf.md @@ -0,0 +1,509 @@ + + + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- + + + + + +# Protobuf serialization and internal storage + +@smarterclayton + +March 2016 + +## Proposal and Motivation + +The Kubernetes API server is a "dumb server" which offers storage, versioning, +validation, update, and watch semantics on API resources. In a large cluster +the API server must efficiently retrieve, store, and deliver large numbers +of coarse-grained objects to many clients. In addition, Kubernetes traffic is +heavily biased towards intra-cluster traffic - as much as 90% of the requests +served by the APIs are for internal cluster components like nodes, controllers, +and proxies. The primary format for intercluster API communication is JSON +today for ease of client construction. + +At the current time, the latency of reaction to change in the cluster is +dominated by the time required to load objects from persistent store (etcd), +convert them to an output version, serialize them JSON over the network, and +then perform the reverse operation in clients. The cost of +serialization/deserialization and the size of the bytes on the wire, as well +as the memory garbage created during those operations, dominate the CPU and +network usage of the API servers. + +In order to reach clusters of 10k nodes, we need roughly an order of magnitude +efficiency improvement in a number of areas of the cluster, starting with the +masters but also including API clients like controllers, kubelets, and node +proxies. + +We propose to introduce a Protobuf serialization for all common API objects +that can optionally be used by intra-cluster components. Experiments have +demonstrated a 10x reduction in CPU use during serialization and deserialization, +a 2x reduction in size in bytes on the wire, and a 6-9x reduction in the amount +of objects created on the heap during serialization. The Protobuf schema +for each object will be automatically generated from the external API Go structs +we use to serialize to JSON. + +Benchmarking showed that the time spent on the server in a typical GET +resembles: + + etcd -> decode -> defaulting -> convert to internal -> + JSON 50us 5us 15us + Proto 5us + JSON 150allocs 80allocs + Proto 100allocs + + process -> convert to external -> encode -> client + JSON 15us 40us + Proto 5us + JSON 80allocs 100allocs + Proto 4allocs + + Protobuf has a huge benefit on encoding because it does not need to allocate + temporary objects, just one large buffer. Changing to protobuf moves our + hotspot back to conversion, not serialization. + + +## Design Points + +* Generate Protobuf schema from Go structs (like we do for JSON) to avoid + manual schema update and drift +* Generate Protobuf schema that is field equivalent to the JSON fields (no + special types or enumerations), reducing drift for clients across formats. +* Follow our existing API versioning rules (backwards compatible in major + API versions, breaking changes across major versions) by creating one + Protobuf schema per API type. +* Continue to use the existing REST API patterns but offer an alternative + serialization, which means existing client and server tooling can remain + the same while benefiting from faster decoding. +* Protobuf objects on disk or in etcd will need to be self identifying at + rest, like JSON, in order for backwards compatibility in storage to work, + so we must add an envelope with apiVersion and kind to wrap the nested + object, and make the data format recognizable to clients. +* Use the [gogo-protobuf](https://github.com/gogo/protobuf) Golang library to generate marshal/unmarshal + operations, allowing us to bypass the expensive reflection used by the + golang JSOn operation + + +## Alternatives + +* We considered JSON compression to reduce size on wire, but that does not + reduce the amount of memory garbage created during serialization and + deserialization. +* More efficient formats like Msgpack were considered, but they only offer + 2x speed up vs the 10x observed for Protobuf +* gRPC was considered, but is a larger change that requires more core + refactoring. This approach does not eliminate the possibility of switching + to gRPC in the future. +* We considered attempting to improve JSON serialization, but the cost of + implementing a more efficient serializer library than ugorji is + significantly higher than creating a protobuf schema from our Go structs. + + +## Schema + +The Protobuf schema for each API group and version will be generated from +the objects in that API group and version. The schema will be named using +the package identifier of the Go package, i.e. + + k8s.io/kubernetes/pkg/api/v1 + +Each top level object will be generated as a Protobuf message, i.e.: + + type Pod struct { ... } + + message Pod {} + +Since the Go structs are designed to be serialized to JSON (with only the +int, string, bool, map, and array primitive types), we will use the +canonical JSON serialization as the protobuf field type wherever possible, +i.e.: + + JSON Protobuf + string -> string + int -> varint + bool -> bool + array -> repeating message|primitive + +We disallow the use of the Go `int` type in external fields because it is +ambiguous depending on compiler platform, and instead always use `int32` or +`int64`. + +We will use maps (a protobuf 3 extension that can serialize to protobuf 2) +to represent JSON maps: + + JSON Protobuf Wire (proto2) + map -> map -> repeated Message { key string; value bytes } + +We will not convert known string constants to enumerations, since that +would require extra logic we do not already have in JSOn. + +To begin with, we will use Protobuf 3 to generate a Protobuf 2 schema, and +in the future investigate a Protobuf 3 serialization. We will introduce +abstractions that let us have more than a single protobuf serialization if +necessary. Protobuf 3 would require us to support message types for +pointer primitive (nullable) fields, which is more complex than Protobuf 2's +support for pointers. + +### Example of generated proto IDL + +Without gogo extensions: + +``` +syntax = 'proto2'; + +package k8s.io.kubernetes.pkg.api.v1; + +import "k8s.io/kubernetes/pkg/api/resource/generated.proto"; +import "k8s.io/kubernetes/pkg/api/unversioned/generated.proto"; +import "k8s.io/kubernetes/pkg/runtime/generated.proto"; +import "k8s.io/kubernetes/pkg/util/intstr/generated.proto"; + +// Package-wide variables from generator "generated". +option go_package = "v1"; + +// Represents a Persistent Disk resource in AWS. +// +// An AWS EBS disk must exist before mounting to a container. The disk +// must also be in the same AWS zone as the kubelet. An AWS EBS disk +// can only be mounted as read/write once. AWS EBS volumes support +// ownership management and SELinux relabeling. +message AWSElasticBlockStoreVolumeSource { + // Unique ID of the persistent disk resource in AWS (Amazon EBS volume). + // More info: http://releases.k8s.io/HEAD/docs/user-guide/volumes.md#awselasticblockstore + optional string volumeID = 1; + + // Filesystem type of the volume that you want to mount. + // Tip: Ensure that the filesystem type is supported by the host operating system. + // Examples: "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified. + // More info: http://releases.k8s.io/HEAD/docs/user-guide/volumes.md#awselasticblockstore + // TODO: how do we prevent errors in the filesystem from compromising the machine + optional string fsType = 2; + + // The partition in the volume that you want to mount. + // If omitted, the default is to mount by volume name. + // Examples: For volume /dev/sda1, you specify the partition as "1". + // Similarly, the volume partition for /dev/sda is "0" (or you can leave the property empty). + optional int32 partition = 3; + + // Specify "true" to force and set the ReadOnly property in VolumeMounts to "true". + // If omitted, the default is "false". + // More info: http://releases.k8s.io/HEAD/docs/user-guide/volumes.md#awselasticblockstore + optional bool readOnly = 4; +} + +// Affinity is a group of affinity scheduling rules, currently +// only node affinity, but in the future also inter-pod affinity. +message Affinity { + // Describes node affinity scheduling rules for the pod. + optional NodeAffinity nodeAffinity = 1; +} +``` + +With extensions: + +``` +syntax = 'proto2'; + +package k8s.io.kubernetes.pkg.api.v1; + +import "github.com/gogo/protobuf/gogoproto/gogo.proto"; +import "k8s.io/kubernetes/pkg/api/resource/generated.proto"; +import "k8s.io/kubernetes/pkg/api/unversioned/generated.proto"; +import "k8s.io/kubernetes/pkg/runtime/generated.proto"; +import "k8s.io/kubernetes/pkg/util/intstr/generated.proto"; + +// Package-wide variables from generator "generated". +option (gogoproto.marshaler_all) = true; +option (gogoproto.sizer_all) = true; +option (gogoproto.unmarshaler_all) = true; +option (gogoproto.goproto_unrecognized_all) = false; +option (gogoproto.goproto_enum_prefix_all) = false; +option (gogoproto.goproto_getters_all) = false; +option go_package = "v1"; + +// Represents a Persistent Disk resource in AWS. +// +// An AWS EBS disk must exist before mounting to a container. The disk +// must also be in the same AWS zone as the kubelet. An AWS EBS disk +// can only be mounted as read/write once. AWS EBS volumes support +// ownership management and SELinux relabeling. +message AWSElasticBlockStoreVolumeSource { + // Unique ID of the persistent disk resource in AWS (Amazon EBS volume). + // More info: http://releases.k8s.io/HEAD/docs/user-guide/volumes.md#awselasticblockstore + optional string volumeID = 1 [(gogoproto.customname) = "VolumeID", (gogoproto.nullable) = false]; + + // Filesystem type of the volume that you want to mount. + // Tip: Ensure that the filesystem type is supported by the host operating system. + // Examples: "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified. + // More info: http://releases.k8s.io/HEAD/docs/user-guide/volumes.md#awselasticblockstore + // TODO: how do we prevent errors in the filesystem from compromising the machine + optional string fsType = 2 [(gogoproto.customname) = "FSType", (gogoproto.nullable) = false]; + + // The partition in the volume that you want to mount. + // If omitted, the default is to mount by volume name. + // Examples: For volume /dev/sda1, you specify the partition as "1". + // Similarly, the volume partition for /dev/sda is "0" (or you can leave the property empty). + optional int32 partition = 3 [(gogoproto.customname) = "Partition", (gogoproto.nullable) = false]; + + // Specify "true" to force and set the ReadOnly property in VolumeMounts to "true". + // If omitted, the default is "false". + // More info: http://releases.k8s.io/HEAD/docs/user-guide/volumes.md#awselasticblockstore + optional bool readOnly = 4 [(gogoproto.customname) = "ReadOnly", (gogoproto.nullable) = false]; +} + +// Affinity is a group of affinity scheduling rules, currently +// only node affinity, but in the future also inter-pod affinity. +message Affinity { + // Describes node affinity scheduling rules for the pod. + optional NodeAffinity nodeAffinity = 1 [(gogoproto.customname) = "NodeAffinity"]; +} +``` + +## Wire format + +In order to make Protobuf serialized objects recognizable in a binary form, +the encoded object must be prefixed by a magic number, and then wrap the +non-self-describing Protobuf object in a Protobuf object that contains +schema information. The protobuf object is referred to as the `raw` object +and the encapsulation is referred to as `wrapper` object. + +The simplest serialization is the raw Protobuf object with no identifying +information. In some use cases, we may wish to have the server identify the +raw object type on the wire using a protocol dependent format (gRPC uses +a type HTTP header). This works when all objects are of the same type, but +we occasionally have reasons to encode different object types in the same +context (watches, lists of objects on disk, and API calls that may return +errors). + +To identify the type of a wrapped Protobuf object, we wrap it in a message +in package `k8s.io/kubernetes/pkg/runtime` with message name `Unknown` +having the following schema: + + message Unknown { + optional TypeMeta typeMeta = 1; + optional bytes value = 2; + optional string contentEncoding = 3; + optional string contentType = 4; + } + + message TypeMeta { + optional string apiVersion = 1; + optional string kind = 2; + } + +The `value` field is an encoded protobuf object that matches the schema +defined in `typeMeta` and has optional `contentType` and `contentEncoding` +fields. `contentType` and `contentEncoding` have the same meaning as in +HTTP, if unspecified `contentType` means "raw protobuf object", and +`contentEncoding` defaults to no encoding. If `contentEncoding` is +specified, the defined transformation should be applied to `value` before +attempting to decode the value. + +The `contentType` field is required to support objects without a defined +protobuf schema, like the ThirdPartyResource or templates. Those objects +would have to be encoded as JSON or another structure compatible form +when used with Protobuf. Generic clients must deal with the possibility +that the returned value is not in the known type. + +We add the `contentEncoding` field here to preserve room for future +optimizations like encryption-at-rest or compression of the nested content. +Clients should error when receiving an encoding they do not support. +Negotioting encoding is not defined here, but introducing new encodings +is similar to introducing a schema change or new API version. + +A client should use the `kind` and `apiVersion` fields to identify the +correct protobuf IDL for that message and version, and then decode the +`bytes` field into that Protobuf message. + +Any Unknown value written to stable storage will be given a 4 byte prefix +`0x6b, 0x38, 0x73, 0x00`, which correspond to `k8s` followed by a zero byte. +The content-type `application/vnd.kubernetes.protobuf` is defined as +representing the following schema: + + MESSAGE = '0x6b 0x38 0x73 0x00' UNKNOWN + UNKNOWN = + +A client should check for the first four bytes, then perform a protobuf +deserialization of the remaining bytes into the `runtime.Unknown` type. + +## Streaming wire format + +While the majority of Kubernetes APIs return single objects that can vary +in type (Pod vs Status, PodList vs Status), the watch APIs return a stream +of identical objects (Events). At the time of this writing, this is the only +current or anticipated streaming RESTful protocol (logging, port-forwarding, +and exec protocols use a binary protocol over Websockets or SPDY). + +In JSON, this API is implemented as a stream of JSON objects that are +separated by their syntax (the closing `}` brace is followed by whitespace +and the opening `{` brace starts the next object). There is no formal +specification covering this pattern, nor a unique content-type. Each object +is expected to be of type `watch.Event`, and is currently not self describing. + +For expediency and consistency, we define a format for Protobuf watch Events +that is similar. Since protobuf messages are not self describing, we must +identify the boundaries between Events (a `frame`). We do that by prefixing +each frame of N bytes with a 4-byte, big-endian, unsigned integer with the +value N. + + frame = length body + length = 32-bit unsigned integer in big-endian order, denoting length of + bytes of body + body = + + # frame containing a single byte 0a + frame = 01 00 00 00 0a + + # equivalent JSON + frame = {"type": "added", ...} + +The body of each frame is a serialized Protobuf message `Event` in package +`k8s.io/kubernetes/pkg/watch/versioned`. The content type used for this +format is `application/vnd.kubernetes.protobuf;type=watch`. + +## Negotiation + +To allow clients to request protobuf serialization optionally, the `Accept` +HTTP header is used by callers to indicate which serialization they wish +returned in the response, and the `Content-Type` header is used to tell the +server how to decode the bytes sent in the request (for DELETE/POST/PUT/PATCH +requests). The server will return 406 if the `Accept` header is not +recognized or 415 if the `Content-Type` is not recognized (as defined in +RFC2616). + +To be backwards compatible, clients must consider that the server does not +support protobuf serialization. A number of options are possible: + +### Preconfigured + +Clients can have a configuration setting that instructs them which version +to use. This is the simplest option, but requires intervention when the +component upgrades to protobuf. + +### Include serialization information in api-discovery + +Servers can define the list of content types they accept and return in +their API discovery docs, and clients can use protobuf if they support it. +Allows dynamic configuration during upgrade if the client is already using +API-discovery. + +### Optimistically attempt to send and receive requests using protobuf + +Using multiple `Accept` values: + + Accept: application/vnd.kubernetes.protobuf, application/json + +clients can indicate their preferences and handle the returned +`Content-Type` using whatever the server responds. On update operations, +clients can try protobuf and if they receive a 415 error, record that and +fall back to JSON. Allows the client to be backwards compatible with +any server, but comes at the cost of some implementation complexity. + + +## Generation process + +Generation proceeds in five phases: + +1. Generate a gogo-protobuf annotated IDL from the source Go struct. +2. Generate temporary Go structs from the IDL using gogo-protobuf. +3. Generate marshaller/unmarshallers based on the IDL using gogo-protobuf. +4. Take all tag numbers generated for the IDL and apply them as struct tags + to the original Go types. +5. Generate a final IDL without gogo-protobuf annotations as the canonical IDL. + +The output is a `generated.proto` file in each package containing a standard +proto2 IDL, and a `generated.pb.go` file in each package that contains the +generated marshal/unmarshallers. + +The Go struct generated by gogo-protobuf from the first IDL must be identical +to the origin struct - a number of changes have been made to gogo-protobuf +to ensure exact 1-1 conversion. A small number of additions may be necessary +in the future if we introduce more exotic field types (Go type aliases, maps +with aliased Go types, and embedded fields were fixed). If they are identical, +the output marshallers/unmarshallers can then work on the origin struct. + +Whenever a new field is added, generation will assign that field a unique tag +and the 4th phase will write that tag back to the origin Go struct as a `protobuf` +struct tag. This ensures subsequent generation passes are stable, even in the +face of internal refactors. The first time a field is added, the author will +need to check in both the new IDL AND the protobuf struct tag changes. + +The second IDL is generated without gogo-protobuf annotations to allow clients +in other languages to generate easily. + +Any errors in the generation process are considered fatal and must be resolved +early (being unable to identify a field type for conversion, duplicate fields, +duplicate tags, protoc errors, etc). The conversion fuzzer is used to ensure +that a Go struct can be round-tripped to protobuf and back, as we do for JSON +and conversion testing. + + +## Changes to development process + +All existing API change rules would still apply. New fields added would be +automatically assigned a tag by the generation process. New API versions will +have a new proto IDL, and field name and changes across API versions would be +handled using our existing API change rules. Tags cannot change within an +API version. + +Generation would be done by developers and then checked into source control, +like conversions and ugorji JSON codecs. + +Because protoc is not packaged well across all platforms, we will add it to +the `kube-cross` Docker image and developers can use that to generate +updated protobufs. Protobuf 3 beta is required. + +The generated protobuf will be checked with a verify script before merging. + + +## Implications + +* The generated marshal code is large and will increase build times and binary + size. We may be able to remove ugorji after protobuf is added, since the + bulk of our decoding would switch to protobuf. +* The protobuf schema is naive, which means it may not be as a minimal as + possible. +* Debugging of protobuf related errors is harder due to the binary nature of + the format. +* Migrating API object storage from JSON to protobuf will require that all + API servers are upgraded before beginning to write protobuf to disk, since + old servers won't recognize protobuf. +* Transport of protobuf between etcd and the api server will be less efficient + in etcd2 than etcd3 (since etcd2 must encode binary values returned as JSON). + Should still be smaller than current JSON request. +* Third-party API objects must be stored as JSON inside of a protobuf wrapper + in etcd, and the API endpoints will not benefit from clients that speak + protobuf. Clients will have to deal with some API objects not supporting + protobuf. + + +## Open Questions + +* Is supporting stored protobuf files on disk in the kubectl client worth it? + + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/protobuf.md?pixel)]() +