Merge pull request #15554 from prometheus/waldocs

docs: Added native histogram WAL record documentation.
Bartlomiej Plotka 2024-12-09 13:35:50 +01:00 committed by GitHub
commit 664177bc1f
1 changed files with 123 additions and 13 deletions

@ -1,15 +1,32 @@
# WAL Disk Format
This document describes the official Prometheus WAL format.
The write ahead log operates in segments that are numbered and sequential,
e.g. `000000`, `000001`, `000002`, etc., and are limited to 128MB by default.
A segment is written to in pages of 32KB. Only the last page of the most recent segment
and are limited to 128MB by default.
## Segment filename
The sequence number is captured in the segment filename,
e.g. `000000`, `000001`, `000002`, etc. The first unsigned integer represents
the sequence number of the segment, typically encoded with six digits.
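A minimal sketch of the naming scheme described above, assuming plain zero-padded decimal names. The helper names `segment_name` and `segment_index` are hypothetical and are not part of Prometheus.

```python
def segment_name(index: int) -> str:
    """Format a segment sequence number as a zero-padded filename."""
    if index < 0:
        raise ValueError("segment index must be non-negative")
    # Six digits is the typical width; larger numbers simply use more digits.
    return f"{index:06d}"


def segment_index(name: str) -> int:
    """Parse a segment filename back into its sequence number."""
    if not (name.isascii() and name.isdigit()):
        raise ValueError(f"not a segment filename: {name!r}")
    return int(name)
```

Because the names are zero-padded, a plain lexicographic sort of the filenames matches the numeric order as long as all names have the same width.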
## Segment encoding
This section describes the segment encoding.
A segment encodes an array of records. It does not contain any header. A segment
is written in pages of 32KB. Only the last page of the most recent segment
may be partial. A WAL record is an opaque byte slice that gets split up into sub-records
should it exceed the remaining space of the current page. Records are never split across
segment boundaries. If a single record exceeds the default segment size, a segment with
a larger size will be created.
The encoding of pages is largely borrowed from [LevelDB's/RocksDB's write ahead log](https://github.com/facebook/rocksdb/wiki/Write-Ahead-Log-File-Format).
Notable deviations are that the record fragment is encoded as:
### Records encoding
Each record fragment is encoded as:
```
┌───────────┬──────────┬────────────┬──────────────┐
@ -17,7 +34,8 @@ Notable deviations are that the record fragment is encoded as:
└───────────┴──────────┴────────────┴──────────────┘
```
The initial type byte is made up of four components: a 3-bit reserved field, a 1-bit zstd compression flag, a 1-bit snappy compression flag, and a 3-bit type flag.
The initial type byte is made up of four components: a 3-bit reserved field,
a 1-bit zstd compression flag, a 1-bit snappy compression flag, and a 3-bit type flag.
```
┌─────────────────┬──────────────────┬────────────────────┬──────────────────┐
@ -25,7 +43,7 @@ The initial type byte is made up of three components: a 3-bit reserved field, a
└─────────────────┴──────────────────┴────────────────────┴──────────────────┘
```
The lowest 3 bits within this flag represent the record type as follows:
The lowest 3 bits of the type byte (the type flag) represent the record type as follows:
* `0`: rest of page will be empty
* `1`: a full record encoded in a single fragment
@ -33,11 +51,16 @@ The lowest 3 bits within this flag represent the record type as follows:
* `3`: middle fragment of a record
* `4`: final fragment of a record
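The type byte layout can be illustrated with a small decoder. This is a sketch, not the Prometheus implementation: it assumes the four fields are listed from the most significant bit to the least significant bit, so that snappy is bit 3 and zstd is bit 4, and it assumes type `2` is the first fragment of a record. The names `TypeByte` and `decode_type_byte` are hypothetical.

```python
from typing import NamedTuple

TYPE_MASK = 0b0000_0111    # lowest 3 bits: fragment type
SNAPPY_FLAG = 0b0000_1000  # assumed position of the snappy flag
ZSTD_FLAG = 0b0001_0000    # assumed position of the zstd flag

# Type 2 is assumed to be the first fragment of a record.
FRAGMENT_TYPES = {0: "empty", 1: "full", 2: "first", 3: "middle", 4: "last"}


class TypeByte(NamedTuple):
    reserved: int
    zstd: bool
    snappy: bool
    fragment: str


def decode_type_byte(b: int) -> TypeByte:
    """Split a fragment's type byte into its four fields."""
    if not 0 <= b <= 0xFF:
        raise ValueError("type byte must fit in 8 bits")
    kind = b & TYPE_MASK
    if kind not in FRAGMENT_TYPES:
        raise ValueError(f"unknown fragment type {kind}")
    return TypeByte(
        reserved=b >> 5,
        zstd=bool(b & ZSTD_FLAG),
        snappy=bool(b & SNAPPY_FLAG),
        fragment=FRAGMENT_TYPES[kind],
    )
```

For example, `decode_type_byte(0b0000_1001)` describes a snappy-compressed record stored in a single fragment.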
## Record encoding
After the type byte, a 2-byte length and then a 4-byte checksum of the following data are encoded.
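To make the page and fragment rules concrete, the sketch below plans how a record of a given length would be split into fragments. It assumes a 7-byte fragment header (1-byte type, 2-byte length, 4-byte checksum), 32KB pages, and that a page with no room for a header plus at least one payload byte is skipped. The function `plan_fragments` is hypothetical; the sketch ignores segment boundaries and does not compute checksums.

```python
from typing import List, Tuple

PAGE_SIZE = 32 * 1024
HEADER_SIZE = 1 + 2 + 4  # type byte + length + checksum

FULL, FIRST, MIDDLE, LAST = 1, 2, 3, 4


def plan_fragments(record_len: int, page_offset: int = 0) -> List[Tuple[int, int]]:
    """Return (fragment_type, payload_len) pairs for one record.

    page_offset is the number of bytes already used in the current page.
    """
    if record_len < 0 or not 0 <= page_offset < PAGE_SIZE:
        raise ValueError("invalid record length or page offset")
    fragments = []
    remaining = record_len
    first = True
    while first or remaining > 0:
        space = PAGE_SIZE - page_offset - HEADER_SIZE
        if space <= 0:
            # Simplifying assumption: no useful room left, move to a new page.
            page_offset = 0
            continue
        take = min(remaining, space)
        is_last = take == remaining
        if first and is_last:
            kind = FULL
        elif first:
            kind = FIRST
        elif is_last:
            kind = LAST
        else:
            kind = MIDDLE
        fragments.append((kind, take))
        remaining -= take
        page_offset = (page_offset + HEADER_SIZE + take) % PAGE_SIZE
        first = False
    return fragments
```

A record that fits in the remaining page space yields a single `FULL` fragment; a record of exactly 32KB written at the start of a page yields a `FIRST` fragment of 32761 bytes followed by a `LAST` fragment of 7 bytes.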
The records written to the write ahead log are encoded as follows:
All float values are represented using the [IEEE 754 format](https://en.wikipedia.org/wiki/IEEE_754).
### Series records
### Record types
In the following sections, all the known record types are described. New types
can be added in the future.
#### Series records
Series records encode the labels that identify a series and its unique ID.
@ -58,7 +81,7 @@ Series records encode the labels that identifies a series and its unique ID.
└────────────────────────────────────────────┘
```
### Sample records
#### Sample records
Sample records encode samples as a list of triples `(series_id, timestamp, value)`.
Series reference and timestamp are encoded as deltas w.r.t the first sample.
@ -79,7 +102,7 @@ The first sample record begins at the second row.
└──────────────────────────────────────────────────────────────────┘
```
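The delta scheme for sample records can be sketched as follows. This only shows the arithmetic on `(series_id, timestamp, value)` triples, not the byte encoding, and it assumes the first sample itself is stored as a zero-delta row after the base values. The helpers `to_deltas` and `from_deltas` are hypothetical.

```python
def to_deltas(samples):
    """Express (series_id, timestamp, value) triples relative to the first one."""
    if not samples:
        raise ValueError("a sample record needs at least one sample")
    base_id, base_ts, _ = samples[0]
    # Every row, including the first, is stored relative to the base values.
    rows = [(sid - base_id, ts - base_ts, value) for sid, ts, value in samples]
    return base_id, base_ts, rows


def from_deltas(base_id, base_ts, rows):
    """Rebuild the absolute triples from the base values and delta rows."""
    return [(base_id + d_id, base_ts + d_ts, value) for d_id, d_ts, value in rows]
```

Deltas may be negative when a later sample refers to a series with a smaller ID than the first sample.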
### Tombstone records
#### Tombstone records
Tombstone records encode tombstones as a list of triples `(series_id, min_time, max_time)`
and specify an interval for which samples of a series got deleted.
@ -95,9 +118,9 @@ and specify an interval for which samples of a series got deleted.
└─────────────────────────────────────────────────────┘
```
### Exemplar records
#### Exemplar records
Exemplar records encode exemplars as a list of triples `(series_id, timestamp, value)`
plus the length of the labels list, and all the labels.
The first row stores the starting id and the starting timestamp.
Series reference and timestamp are encoded as deltas w.r.t the first exemplar.
@ -127,7 +150,7 @@ See: https://github.com/OpenObservability/OpenMetrics/blob/main/specification/Op
└──────────────────────────────────────────────────────────────────┘
```
### Metadata records
#### Metadata records
Metadata records encode the metadata updates associated with a series.
@ -156,3 +179,90 @@ Metadata records encode the metadata updates associated with a series.
└────────────────────────────────────────────┘
```
#### Histogram records
Histogram records encode the integer and float native histogram samples.
A record with integer native histograms using exponential bucketing:
```
┌───────────────────────────────────────────────────────────────────────┐
│ type = 7 <1b>                                                         │
├───────────────────────────────────────────────────────────────────────┤
│ ┌────────────────────┬───────────────────────────┐                    │
│ │ id <8b>            │ timestamp <8b>            │                    │
│ └────────────────────┴───────────────────────────┘                    │
│ ┌────────────────────┬──────────────────────────────────────────────┐ │
│ │ id_delta <uvarint> │ timestamp_delta <uvarint>                    │ │
│ ├────────────────────┴────┬─────────────────────────────────────────┤ │
│ │ counter_reset_hint <1b> │ schema <varint>                         │ │
│ ├─────────────────────────┴────┬────────────────────────────────────┤ │
│ │ zero_threshold (float) <8b>  │ zero_count <uvarint>               │ │
│ ├─────────────────┬────────────┴────────────────────────────────────┤ │
│ │ count <uvarint> │ sum (float) <8b>                                │ │
│ ├─────────────────┴─────────────────────────────────────────────────┤ │
│ │ positive_spans_num <uvarint>                                      │ │
│ ├─────────────────────────────────┬─────────────────────────────────┤ │
│ │ positive_span_offset_1 <varint> │ positive_span_len_1 <uvarint32> │ │
│ ├─────────────────────────────────┴─────────────────────────────────┤ │
│ │ . . .                                                             │ │
│ ├───────────────────────────────────────────────────────────────────┤ │
│ │ negative_spans_num <uvarint>                                      │ │
│ ├───────────────────────────────┬───────────────────────────────────┤ │
│ │ negative_span_offset <varint> │ negative_span_len <uvarint32>     │ │
│ ├───────────────────────────────┴───────────────────────────────────┤ │
│ │ . . .                                                             │ │
│ ├───────────────────────────────────────────────────────────────────┤ │
│ │ positive_bkts_num <uvarint>                                       │ │
│ ├─────────────────────────┬───────┬─────────────────────────────────┤ │
│ │ positive_bkt_1 <varint> │ . . . │ positive_bkt_n <varint>         │ │
│ ├─────────────────────────┴───────┴─────────────────────────────────┤ │
│ │ negative_bkts_num <uvarint>                                       │ │
│ ├─────────────────────────┬───────┬─────────────────────────────────┤ │
│ │ negative_bkt_1 <varint> │ . . . │ negative_bkt_n <varint>         │ │
│ └─────────────────────────┴───────┴─────────────────────────────────┘ │
│ . . .                                                                 │
└───────────────────────────────────────────────────────────────────────┘
```
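Many fields above are marked `<uvarint>` or `<varint>`. The sketch below shows one way such values can be produced, assuming they follow the variable-length integer encoding of Go's `encoding/binary` package: 7 payload bits per byte, least significant group first, with zigzag mapping for signed values. The document does not state this explicitly, and the function names are hypothetical.

```python
def put_uvarint(value: int) -> bytes:
    """Encode an unsigned integer, 7 bits per byte, low-order group first."""
    if value < 0:
        raise ValueError("uvarint cannot encode negative values")
    out = bytearray()
    while value >= 0x80:
        # Set the continuation bit on every byte except the last.
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def put_varint(value: int) -> bytes:
    """Encode a signed integer using zigzag mapping followed by uvarint."""
    # Zigzag: 0, -1, 1, -2, 2, ... map to 0, 1, 2, 3, 4, ...
    zigzag = (value << 1) if value >= 0 else (((-value) << 1) - 1)
    return put_uvarint(zigzag)
```

Small magnitudes take a single byte, which is why deltas and bucket counts are cheap to store in this form.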
A record with float native histograms:
```
┌───────────────────────────────────────────────────────────────────────┐
│ type = 8 <1b>                                                         │
├───────────────────────────────────────────────────────────────────────┤
│ ┌────────────────────┬───────────────────────────┐                    │
│ │ id <8b>            │ timestamp <8b>            │                    │
│ └────────────────────┴───────────────────────────┘                    │
│ ┌────────────────────┬──────────────────────────────────────────────┐ │
│ │ id_delta <uvarint> │ timestamp_delta <uvarint>                    │ │
│ ├────────────────────┴────┬─────────────────────────────────────────┤ │
│ │ counter_reset_hint <1b> │ schema <varint>                         │ │
│ ├─────────────────────────┴────┬────────────────────────────────────┤ │
│ │ zero_threshold (float) <8b>  │ zero_count (float) <8b>            │ │
│ ├────────────────────┬─────────┴────────────────────────────────────┤ │
│ │ count (float) <8b> │ sum (float) <8b>                             │ │
│ ├────────────────────┴──────────────────────────────────────────────┤ │
│ │ positive_spans_num <uvarint>                                      │ │
│ ├─────────────────────────────────┬─────────────────────────────────┤ │
│ │ positive_span_offset_1 <varint> │ positive_span_len_1 <uvarint32> │ │
│ ├─────────────────────────────────┴─────────────────────────────────┤ │
│ │ . . .                                                             │ │
│ ├───────────────────────────────────────────────────────────────────┤ │
│ │ negative_spans_num <uvarint>                                      │ │
│ ├───────────────────────────────┬───────────────────────────────────┤ │
│ │ negative_span_offset <varint> │ negative_span_len <uvarint32>     │ │
│ ├───────────────────────────────┴───────────────────────────────────┤ │
│ │ . . .                                                             │ │
│ ├───────────────────────────────────────────────────────────────────┤ │
│ │ positive_bkts_num <uvarint>                                       │ │
│ ├─────────────────────────────┬───────┬─────────────────────────────┤ │
│ │ positive_bkt_1 (float) <8b> │ . . . │ positive_bkt_n (float) <8b> │ │
│ ├─────────────────────────────┴───────┴─────────────────────────────┤ │
│ │ negative_bkts_num <uvarint>                                       │ │
│ ├─────────────────────────────┬───────┬─────────────────────────────┤ │
│ │ negative_bkt_1 (float) <8b> │ . . . │ negative_bkt_n (float) <8b> │ │
│ └─────────────────────────────┴───────┴─────────────────────────────┘ │
│ . . .                                                                 │
└───────────────────────────────────────────────────────────────────────┘
```
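Both histogram layouts store spans as a count followed by `(offset, length)` pairs. The sketch below reads one such span list from a byte buffer. It assumes `<uvarint>` and `<varint>` use the Go `encoding/binary` variable-length encoding (with zigzag for signed values) and that `<uvarint32>` is a uvarint whose value fits in 32 bits. All function names are hypothetical.

```python
def read_uvarint(buf: bytes, pos: int = 0):
    """Decode an unsigned varint at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated uvarint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:  # no continuation bit: this was the last byte
            return result, pos
        shift += 7


def read_varint(buf: bytes, pos: int = 0):
    """Decode a zigzag-encoded signed varint; return (value, next_pos)."""
    raw, pos = read_uvarint(buf, pos)
    value = (raw >> 1) if raw & 1 == 0 else -((raw + 1) >> 1)
    return value, pos


def decode_spans(buf: bytes, pos: int = 0):
    """Read spans_num, then that many (offset, length) pairs."""
    count, pos = read_uvarint(buf, pos)
    spans = []
    for _ in range(count):
        offset, pos = read_varint(buf, pos)
        length, pos = read_uvarint(buf, pos)
        if length > 0xFFFFFFFF:
            raise ValueError("span length does not fit in 32 bits")
        spans.append((offset, length))
    return spans, pos
```

Since each call returns the next read position, the negative span list can be decoded by passing the position returned for the positive span list.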