
# Memory Snapshot Format

The memory snapshot uses the WAL package and writes each series as a WAL record.
The formats of the individual records are described below.

Records in the snapshot always appear in this order:
1. Series records come first, one per series, in no particular order.
2. After all series records, a single tombstone record containing all the tombstones is written.
3. Finally, one or more exemplar records are written, each batching multiple exemplars. Exemplars appear in the order they were written to the circular buffer.
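
The ordering above can be checked mechanically by a reader. The following is a minimal sketch, not the actual Prometheus loader: `recordKind`, its constants, and `checkOrder` are hypothetical names, and the rule that at most one tombstone record appears is an assumption drawn from the list above.

```go
package main

import "fmt"

// recordKind is a hypothetical label for the three record kinds. The real
// record type bytes are not given in this document.
type recordKind int

const (
	kindSeries recordKind = iota
	kindTombstones
	kindExemplars
)

// checkOrder returns an error if the kinds do not follow the documented
// layout: series records, then a tombstone record, then exemplar records.
func checkOrder(kinds []recordKind) error {
	last := kindSeries
	tombstones := 0
	for i, k := range kinds {
		// Kinds must never go backwards, e.g. a series after an exemplar.
		if k < last {
			return fmt.Errorf("record %d: kind %d appears after kind %d", i, k, last)
		}
		if k == kindTombstones {
			tombstones++
			if tombstones > 1 {
				return fmt.Errorf("record %d: more than one tombstone record", i)
			}
		}
		last = k
	}
	return nil
}

func main() {
	ok := []recordKind{kindSeries, kindSeries, kindTombstones, kindExemplars, kindExemplars}
	bad := []recordKind{kindSeries, kindExemplars, kindSeries}
	fmt.Println("ok:", checkOrder(ok))
	fmt.Println("bad:", checkOrder(bad))
}
```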

### Series records

Each record is a snapshot of exactly one series.
It includes the metadata of the series and the in-memory chunk data if it exists.
`sampleBuf` holds the last 4 samples of the in-memory chunk.

```
┌──────────────────────────┬────────────────────────────┐
│     Record Type <byte>   │   Series Ref <uint64>      │
├──────────────────────────┴────────────────────────────┤
│               Number of Labels <uvarint>              │
├──────────────────────────────┬────────────────────────┤
│     len(name_1) <uvarint>    │    name_1 <bytes>      │
├──────────────────────────────┼────────────────────────┤
│     len(val_1) <uvarint>     │    val_1 <bytes>       │
├──────────────────────────────┴────────────────────────┤
│                         . . .                         │
├──────────────────────────────┬────────────────────────┤
│     len(name_N) <uvarint>    │    name_N <bytes>      │
├──────────────────────────────┼────────────────────────┤
│     len(val_N) <uvarint>     │    val_N <bytes>       │
├──────────────────────────────┴────────────────────────┤
│                  Chunk Range <int64>                  │
├───────────────────────────────────────────────────────┤
│                 Chunk Exists <uvarint>                │
│ # 1 if head chunk exists, 0 otherwise to detect a nil │
│ # chunk. The fields below exist only when it is 1.    │
├───────────────────────────┬───────────────────────────┤
│     Chunk Mint <int64>    │    Chunk Maxt <int64>     │
├───────────────────────────┴───────────────────────────┤
│                 Chunk Encoding <byte>                 │
├──────────────────────────────┬────────────────────────┤
│      len(Chunk) <uvarint>    │    Chunk <bytes>       │
├──────────────────────────┬───┴────────────────────────┤
│  sampleBuf[0].t <int64>  │  sampleBuf[0].v <float64>  │
├──────────────────────────┼────────────────────────────┤
│  sampleBuf[1].t <int64>  │  sampleBuf[1].v <float64>  │
├──────────────────────────┼────────────────────────────┤
│  sampleBuf[2].t <int64>  │  sampleBuf[2].v <float64>  │
├──────────────────────────┼────────────────────────────┤
│  sampleBuf[3].t <int64>  │  sampleBuf[3].v <float64>  │
└──────────────────────────┴────────────────────────────┘
```
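
As an illustration of the layout in the diagram, here is a minimal encoding sketch using only the Go standard library. It is not the Prometheus implementation. The types `label`, `sample`, and `headChunk`, the helper names, and the `recSeries` value are hypothetical. Writing fixed-width integers in big-endian order is also an assumption, since the diagram does not state the byte order.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// label and sample are simplified stand-ins for the real Prometheus types.
type label struct{ name, value string }

type sample struct {
	t int64
	v float64
}

// headChunk holds the fields that are present only when Chunk Exists is 1.
type headChunk struct {
	mint, maxt int64
	encoding   byte
	data       []byte
	sampleBuf  [4]sample
}

// recSeries is a placeholder; the real record type byte is not given here.
const recSeries byte = 1

// appendBytes writes len(p) as a uvarint followed by p itself.
func appendBytes(b, p []byte) []byte {
	b = binary.AppendUvarint(b, uint64(len(p)))
	return append(b, p...)
}

// encodeSeries lays out the fields in the order shown in the diagram.
// Big-endian byte order for fixed-width fields is an assumption.
func encodeSeries(ref uint64, lbls []label, chunkRange int64, c *headChunk) []byte {
	b := []byte{recSeries}
	b = binary.BigEndian.AppendUint64(b, ref)
	b = binary.AppendUvarint(b, uint64(len(lbls)))
	for _, l := range lbls {
		b = appendBytes(b, []byte(l.name))
		b = appendBytes(b, []byte(l.value))
	}
	b = binary.BigEndian.AppendUint64(b, uint64(chunkRange))
	if c == nil {
		// Chunk Exists = 0: no further fields follow.
		return binary.AppendUvarint(b, 0)
	}
	b = binary.AppendUvarint(b, 1)
	b = binary.BigEndian.AppendUint64(b, uint64(c.mint))
	b = binary.BigEndian.AppendUint64(b, uint64(c.maxt))
	b = append(b, c.encoding)
	b = appendBytes(b, c.data)
	for _, s := range c.sampleBuf {
		b = binary.BigEndian.AppendUint64(b, uint64(s.t))
		b = binary.BigEndian.AppendUint64(b, math.Float64bits(s.v))
	}
	return b
}

func main() {
	lbls := []label{{"__name__", "up"}, {"job", "node"}}
	noChunk := encodeSeries(42, lbls, 7200000, nil)
	withChunk := encodeSeries(42, lbls, 7200000, &headChunk{mint: 1, maxt: 9, encoding: 1, data: []byte{0xde, 0xad}})
	fmt.Println(len(noChunk), len(withChunk))
}
```

Passing a nil `*headChunk` models a series without a head chunk, in which case the record ends right after the `Chunk Exists` flag.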

### Tombstone record

This record includes all the tombstones in the Head block. A single record is written into
the snapshot for all the tombstones. The encoded tombstones use the same encoding
as the tombstone file in blocks.

```
┌─────────────────────────────────────────────────────────────────┐
│                        Record Type <byte>                       │
├───────────────────────────────────┬─────────────────────────────┤
│ len(Encoded Tombstones) <uvarint> │ Encoded Tombstones <bytes>  │
└───────────────────────────────────┴─────────────────────────────┘
```

### Exemplar record

A single exemplar record contains one or more exemplars, encoded the same way as in the WAL but with a different record type.

```
┌───────────────────────────────────────────────────────────────────┐
│                      Record Type <byte>                           │
├───────────────────────────────────────────────────────────────────┤
│ ┌────────────────────┬───────────────────────────┐                │
│ │ series ref <8b>    │ timestamp <8b>            │                │
│ └────────────────────┴───────────────────────────┘                │
│ ┌─────────────────────┬───────────────────────────┬─────────────┐ │
│ │ ref_delta <uvarint> │ timestamp_delta <uvarint> │ value <8b>  │ │
│ ├─────────────────────┴───────────────────────────┴─────────────┤ │
│ │  n = len(labels) <uvarint>                                    │ │
│ ├───────────────────────────────┬───────────────────────────────┤ │
│ │     len(str_1) <uvarint>      │       str_1 <bytes>           │ │
│ ├───────────────────────────────┴───────────────────────────────┤ │
│ │                              ...                              │ │
│ ├───────────────────────────────┬───────────────────────────────┤ │
│ │     len(str_2n) <uvarint>     │       str_2n <bytes>          │ │
│ ├───────────────────────────────┴───────────────────────────────┤ │
│                               . . .                               │
└───────────────────────────────────────────────────────────────────┘
```
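
The batching in the diagram can be illustrated with a short encoding sketch. It is not the Prometheus encoder, and several details are assumptions: the `exemplar` type, the `recExemplars` value, and the helper names are hypothetical; fixed-width fields are written big-endian; and deltas are computed against the first exemplar in the record. The diagram labels both deltas as `<uvarint>`, but a delta can be negative when refs are not monotonic, so this sketch writes them as signed varints with `binary.AppendVarint`, which may not match the actual encoder.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// exemplar is a simplified stand-in for the real exemplar type.
type exemplar struct {
	ref    uint64
	t      int64
	v      float64
	labels [][2]string // name/value pairs
}

// recExemplars is a placeholder; the real record type byte is not given here.
const recExemplars byte = 3

// appendString writes len(s) as a uvarint followed by the bytes of s.
func appendString(b []byte, s string) []byte {
	b = binary.AppendUvarint(b, uint64(len(s)))
	return append(b, s...)
}

// encodeExemplars batches exemplars into a single record. It returns nil for
// an empty batch, since there is no first exemplar to take deltas against.
func encodeExemplars(exs []exemplar) []byte {
	if len(exs) == 0 {
		return nil
	}
	first := exs[0]
	b := []byte{recExemplars}
	b = binary.BigEndian.AppendUint64(b, first.ref)
	b = binary.BigEndian.AppendUint64(b, uint64(first.t))
	for _, e := range exs {
		// Assumption: deltas are relative to the first exemplar and signed.
		b = binary.AppendVarint(b, int64(e.ref)-int64(first.ref))
		b = binary.AppendVarint(b, e.t-first.t)
		b = binary.BigEndian.AppendUint64(b, math.Float64bits(e.v))
		// n label pairs are followed by 2n length-prefixed strings.
		b = binary.AppendUvarint(b, uint64(len(e.labels)))
		for _, l := range e.labels {
			b = appendString(b, l[0])
			b = appendString(b, l[1])
		}
	}
	return b
}

func main() {
	exs := []exemplar{
		{ref: 10, t: 100, v: 1.5, labels: [][2]string{{"trace_id", "abc"}}},
		{ref: 8, t: 105, v: 2},
	}
	fmt.Println(len(encodeExemplars(exs)))
}
```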