Skip to content
nevzheng
Go back

Beyond Inline Values: Evolving Strata's Storage Engine

Strata is a single-node LSM storage engine I’m building in Rust, with a SQL layer on top. Right now every value is stored inline in the LSM — a deliberate bootstrapping decision to get SQL running end to end. Now that the basics work, there’s an opportunity to replace inline storage with something better.

This post is an Architectural Decision Record (ADR). It records a decision I’m proposing for Strata — to replace inline storage with a block-based abstraction — and walks through the reasoning: the problem, the requirements, the options I compared, and why I landed where I did.

Status: Proposed. The repo is at github.com/nevzheng/strata-db.

Table of contents

Open Table of contents

1. Context

Strata is a single-node LSM storage engine written in Rust with a SQL layer on top. Currently all values are stored inline in the LSM. This was a deliberate bootstrapping decision to get SQL running end-to-end. Now that the basics work, there is an opportunity to improve performance and flexibility by replacing inline storage with a better abstraction.

This document argues for replacing inline storage with a block-based abstraction. We define the problem, establish requirements, compare options, and make a decision.

2. Problem Statement

Inline storage couples keys, values, and metadata with no abstraction boundary, making it hard to build on and expensive to operate.

Flexibility

Performance

Good storage design maximizes I/O utilization. Every byte read or written should contribute to a query result or a durable write. Inline storage wastes I/O on data that doesn’t matter.

Roadmap Blockers

3. Requirements

In priority order:

🥇 1. Flexibility / Evolvability / Maintainability

New features should plug in without breaking existing ones. Changes should be isolated to their own component. Components should be replaceable without churn.

🥈 2. Correctness

The system should be easy to reason about and verify. Components should be testable in isolation. Invariants should be assertable from first principles. If we can’t test it, we can’t trust it.

🥉 3. Performance

We should be able to reason about performance from the design before measuring it. Key metrics: read amplification, write amplification, space amplification, p99 latency, cache hit rate, and scan throughput. Performance claims should be testable and backed by benchmarks.

4. Evaluation

Dimensions chosen where options differ meaningfully. Dimensions where all options score similarly (concurrency, correctness/verifiability) are not included.

DimensionA. InlineB. WiscKeyC. Block-Based
Flexibility 🥇🔴 No extension point. Any new capability requires touching the whole encoding.🟡 Key/value separation helps but the value log format is rigid and unstructured.🟢 High extensibility. New block types can be invented as needed without touching existing blocks.
MVCC Readiness 🥇🥈🔴 Each version = full tuple copy. N versions = N full copies in tree.🔴 Value log has no native version structure. MVCC requires additional indexing into the log. GC and snapshot isolation conflict directly.🟢 Natural fit. Stable block IDs enable delta chains without data copies.
Complexity 🥇🟢 Dead simple. Already working.🟡 Moderate. GC complexity. Two-path I/O.🔴 Most complex. Upfront design decisions required.
Read Amplification 🥉 useful work per read I/O🔴 Scans full tuple regardless of selectivity. We pay for data we don’t need on every read.🟡 Two I/Os per read, both may return irrelevant data. No opportunity to optimize what gets read.🟢 More seeks but locality and prefetching mean every byte read can contribute to the query result. Partial reads possible.
Write Amplification 🥉 useful work per write I/O🔴 Rewrites full tuple and metadata regardless of what changed. No delta support.🟢 Append-only write path. Only the new value is written. Read path becomes more complex but write usefulness is high.🟡 Delta blocks allow writing only what changed. Not yet implemented. Currently rewrites full blocks. High ceiling, not yet realized.
Compaction Amplification 🥉 reads and writes required to complete compaction vs live data kept🔴 Must read and rewrite full tuples. Physical layout offers no shortcuts.🟢 Compaction touches small keys only. Values untouched. Log GC is separate cost.🟢 Indirection lets compaction touch only blocks that need merging or deletion. Opportunities to skip rewriting unchanged blocks entirely.
Space Amplification 🥉 physical storage cost per logical tuple or MVCC chain🔴 Physical cost scales with tuple width × number of versions. Dead versions sit on disk until compaction. For wide rows, cost is dominated by value size.🟡 LSM stays lean — keys only. Value log accumulates dead values until GC runs. Physical cost depends heavily on GC frequency.🟢 Physical cost bounded at the block level. Dead blocks tracked and reclaimed during compaction. Indirection means logical deletes don’t require immediate physical rewrites.
Caching / Locality 🥉 ability to cache and exploit data locality🔴 No cache. Key-to-tuple is a poor basis for locality or prefetching.🔴 No natural unit of locality. A caching strategy would need to be designed from scratch with no good logical structure to build on.🟢 Buffer pool keyed by stable block ID. Natural unit for locality and prefetching.

5. Decision

We choose Option C — Block-Based Storage.

Block-based storage best balances our three requirements. It provides strong indirection, extensibility, and a clean abstraction boundary — new features are easy to reason about and plug in as new block types without touching existing storage.

On performance, there are real trade-offs. Lookups require additional I/O for block resolution and cache misses add latency. But the buffer pool and stable block IDs create meaningful opportunities: prefetching, locality exploitation, and smart caching strategies that maximize useful work per I/O. The performance ceiling is high and the trade-offs are measurable.

Inline storage is eliminated because its limitations are structural and not fixable. It is not a good foundation for a database storage engine.

WiscKey is eliminated because while the value log is fast for writes, it is hard to build additional features on top without effectively reinventing block-based storage. We want the insight — indirection and append-only writes — without the structural limitations.

6. Options

A. Status Quo — Inline Values

Row data is stored inline alongside keys in SSTable entries. The tuple is encoded directly in the entry — there is no indirection.

[ key_len | key_bytes | version | tuple_bytes ... ]
                                  ^--- full row encoded here

Key features:

B. WiscKey — Value Log

Lu et al., FAST ‘16

Keys stay in the LSM. Values are written to a separate append-only value log. The SSTable entry stores a pointer (log offset) instead of the value itself. LSM compaction only touches keys — values are never moved. Value log GC is a separate process responsible for reclaiming dead values and consolidating versions.

SSTable:  [ key | log_offset ] [ key | log_offset ] [ key | log_offset ] ...

Log:      [ tuple ]            [ tuple ]            [ tuple ]            ...

Key features:

C. Block-Based Storage — SSTable Format

LevelDB Table Format; RocksDB Block Cache; Pebble

Values are stored in fixed-size typed blocks on disk. The SSTable index stores a block ID and offset instead of an inline value. An in-memory buffer pool sits in front of disk I/O, keyed by stable block ID.

SSTable:  [ key | block_id | offset ] [ key | block_id | offset ] ...

Blocks:   [ header          ]         [ header          ]
          [----------------]         [----------------]
          [ tuple ][ tuple ]         [ tuple ][ tuple ]
          [ tuple ][ tuple ]         [ tuple ][ tuple ]
          ...                        ...

Key features:

Appendix A — Future Work & Out of Scope

This document is an Architectural Decision Record (ADR). It records the decision to adopt block-based storage and the reasoning behind it. Implementation details are out of scope and each warrants a dedicated design doc.

Performance claims in this document are reasoned from design principles. Benchmarks will be required to validate them.

Deferred to follow-on design docs:

References

  1. O’Neil et al. — The Log-Structured Merge-Tree (LSM-Tree), 1996
  2. Lu et al. — WiscKey: Separating Keys from Values in SSD-Conscious Storage, FAST ‘16
  3. Google — LevelDB Table Format
  4. Meta — RocksDB Block Cache Documentation
  5. CockroachDB — Pebble
  6. Matei et al. — CockroachDB: The Resilient Geo-Distributed SQL Database, SIGMOD ‘20
  7. ADR Format Reference

If you enjoyed this post or found it useful, consider buying me a coffee.

If you’re curious, the code is at github.com/nevzheng/strata-db.


Share this post on:

Next Post
Keys and Values Are All You Need