Storage Engines: B-Tree vs LSM Tree & Beyond

Understanding storage engines is fundamental to database performance and system design. Let's explore how different storage engines work, their trade-offs, and when to use each.

Overview of Storage Engines

What are Storage Engines?

Storage engines are the core components of databases that handle how data is stored, organized, and retrieved from disk. They determine the database's performance characteristics for different workloads.

Key Considerations

  • ⚑ Write performance vs. read performance
  • πŸ’Ύ Storage efficiency and compression
  • πŸ” Range query performance
  • ⏰ Compaction and maintenance overhead

B-Tree Based Storage Engines

🌲 B-Tree & B+Tree

Examples: MySQL InnoDB, PostgreSQL, SQLite

How B-Trees Work

B-Trees are self-balancing tree data structures that maintain sorted data and allow searches, sequential access, insertions, and deletions in logarithmic time.

  • Each node contains multiple keys and child pointers
  • All leaf nodes are at the same depth (balanced tree)
  • Nodes are typically the size of a disk page (4KB or 8KB)
  • B+Trees store all data in leaf nodes with internal nodes only for navigation
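The lookup path described above can be sketched in a few lines of Python. The `Node` class and the degree-3 example tree below are illustrative assumptions, not any real engine's on-disk layout:

```python
import bisect

class Node:
    """Simplified B-Tree node: a sorted key list plus child pointers."""
    def __init__(self, keys, children=None):
        self.keys = keys                  # sorted list of keys
        self.children = children or []    # internal nodes have len(keys) + 1 children

    def is_leaf(self):
        return not self.children

def search(node, key):
    """Descend from the root, choosing one child per level: O(log n) node visits."""
    i = bisect.bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return True
    if node.is_leaf():
        return False
    return search(node.children[i], key)

# A small balanced tree of degree 3 (max 2 keys per node):
#          [10]
#         /    \
#     [1, 5]  [20, 40]
root = Node([10], [Node([1, 5]), Node([20, 40])])
```

Because every node holds many keys, a real B-Tree of billions of entries is only a handful of levels deep, so each lookup touches only a few disk pages.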

βœ… Advantages

  • Excellent read performance (O(log n))
  • Efficient range queries
  • Predictable performance
  • No compaction needed
  • In-place updates possible

❌ Disadvantages

  • Random I/O for writes
  • Write amplification
  • Fragmentation over time
  • Less efficient for write-heavy workloads

Best Use Cases

  • OLTP systems with balanced read/write workloads
  • Applications requiring strong consistency
  • Systems with frequent range queries
  • Traditional relational databases

🌲 B-Tree Split & Merge Visualization (degree 3: max 2 keys per node)

[Interactive demo: a 16-step sequence that inserts 10, 20, 1, 5, 40, 90, 12, 7, and 19 into an empty degree-3 tree, then deletes keys to exercise every case: a leaf delete with no merge, an internal-node delete, leaf and internal merges, and finally shrinking the root.]

πŸ’‘ Split Operation

When a node exceeds maximum keys (degree-1), it splits into two nodes. The middle key moves up to the parent, maintaining balance.

πŸ”— Merge Operation

When deletion makes a node too small, it merges with a sibling. A parent key moves down to maintain minimum key requirements.

βš–οΈ Self-Balancing

All leaf nodes remain at the same level, ensuring O(log n) performance for all operations by maintaining tree balance.

πŸ“Š Degree = 3 (Order = 3)

Min keys per node: ⌈(degree-1)/2βŒ‰ = 1 key
Max keys per node: degree - 1 = 2 keys
Root exception: Can have 1 to degree-1 keys

πŸ—‘οΈ Deletion Cases

Leaf: Remove directly or merge
Internal: Replace with predecessor/successor
Underflow: Borrow from sibling or merge

πŸ”„ Always Sorted

Keys within nodes and across the tree remain sorted, enabling efficient range queries and ordered traversals.
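The split rule described above can be sketched on a bare key list; the `split` helper below is a hypothetical illustration of just that one step, not a full insert routine:

```python
def split(keys):
    """Split an overfull node's sorted key list: the middle key is promoted
    to the parent, and the remaining keys form two new sibling nodes."""
    mid = len(keys) // 2
    return keys[:mid], keys[mid], keys[mid + 1:]

# Inserting 20 into a full degree-3 node [1, 10] overfills it to [1, 10, 20]:
left, promoted, right = split([1, 10, 20])
# left = [1], promoted = 10, right = [20]
```

Because the promoted key may in turn overfill the parent, splits can ripple upward, which is how the tree grows in height while all leaves stay at the same depth.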


Why Don't We Use B-Trees in DSA Problems?

πŸ€” Tell me Why: B-Trees vs BST in Coding Problems

Great question! Even though B-Trees are amazing for databases, we rarely see them in coding contests or DSA problems. Here's why:

πŸ’Ύ Memory vs Disk

B-Trees: Optimized for disk I/O (databases)
BST/AVL: Optimized for memory operations
DSA problems: Focus on in-memory algorithms

βš–οΈ Complexity Trade-off

Same O(log n) time complexity, but:
BST: ~10 lines of code
B-Tree: ~100+ lines with splits/merges

πŸ“ Problem Scale

DSA problems: n ≀ 10⁢ (fits in memory)
Databases: n β‰₯ 10⁹ (requires disk)
Memory trees work fine for contest sizes!

🎯 Different Focus

DSA: Algorithm design & time complexity
Systems: I/O efficiency & storage optimization
B-Trees solve a different problem!

πŸ† When You DO See B-Trees:

System Design
"Design a database"
Database Courses
"Explain indexing"
Advanced DSA
"External sorting"

πŸ’‘ Key Insight: B-Trees minimize expensive disk I/O, while BST variants minimize CPU operations in memory!

LSM Tree Based Storage Engines

πŸ“š LSM Tree (Log-Structured Merge Tree)

Examples: RocksDB, Cassandra, HBase

How LSM Trees Work

LSM Trees optimize for write performance by buffering writes in memory and periodically flushing them to disk in sorted batches.

  • Writes go to an in-memory table (MemTable)
  • When MemTable is full, it's flushed to disk as an SSTable
  • SSTables are immutable and sorted
  • Background compaction merges SSTables to maintain read performance
  • Reads check MemTable, then SSTables from newest to oldest
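The write and read paths above can be sketched as a toy class. `TinyLSM`, its `memtable_limit`, and the in-memory list of "SSTables" are all assumptions of this illustration; a real engine persists SSTables to disk and speeds up reads with bloom filters and binary search:

```python
class TinyLSM:
    """Toy LSM tree: a dict as the MemTable plus a list of immutable,
    sorted SSTables (newest first). Illustrative only."""

    def __init__(self, memtable_limit=3):
        self.memtable = {}
        self.sstables = []              # newest first; each is a sorted [(key, value)]
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value      # writes touch memory only
        if len(self.memtable) >= self.limit:
            self.flush()

    def flush(self):
        # The full MemTable becomes an immutable, sorted SSTable
        self.sstables.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Read path: MemTable first, then SSTables from newest to oldest
        if key in self.memtable:
            return self.memtable[key]
        for table in self.sstables:
            for k, v in table:
                if k == key:
                    return v
        return None

db = TinyLSM()
for k, v in [("a", 1), ("b", 2), ("c", 3), ("d", 4)]:
    db.put(k, v)
# "a", "b", "c" were flushed into one SSTable; "d" still sits in the MemTable
```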

βœ… Advantages

  • Excellent write performance
  • Sequential I/O patterns
  • Better compression ratios
  • Efficient for append-only workloads
  • No write amplification for initial writes

❌ Disadvantages

  • Slower reads (multiple files to check)
  • Compaction overhead
  • Space amplification
  • Less predictable performance
  • Complex tuning parameters

Best Use Cases

  • Write-heavy workloads (logs, time-series data)
  • NoSQL databases and distributed systems
  • Systems where writes vastly outnumber reads
  • Applications that can tolerate eventual consistency

πŸ“š LSM Tree Operations Visualization (write-optimized storage)

[Interactive demo: writes A, B, C fill the MemTable (capacity 3) and flush to L0 as an SSTable; writes D, E, F flush a second SSTable; L0 then compacts into L1; finally a read of B checks the MemTable and then the SSTables from newest to oldest.]

πŸ“ˆ Write amplification: total data written to disk vs. the original write
πŸ” Read amplification: data read from disk vs. the data actually requested
πŸ’Ύ Space amplification: total storage used vs. live data

πŸ’‘ Write Path

Writes go to MemTable (fast!), then flush to L0 when full. Background compaction merges levels to maintain read performance.

πŸ” Read Path

Reads check MemTable first, then SSTables from newest (L0) to oldest (L2+). May need to check multiple files (read amplification).

πŸ”„ Compaction

Background process merges SSTables, removing deleted keys and reducing file count. Essential for read performance.

βš–οΈ Trade-offs

Excellent write performance, but reads get slower as data ages. Compaction helps but creates write amplification.
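The compaction step described above can be sketched as a merge of two sorted SSTables, where the newer table wins on duplicate keys and deleted keys are dropped. The `TOMBSTONE` marker and function name are assumptions of this sketch:

```python
TOMBSTONE = object()  # deletion marker, an assumption of this sketch

def compact(newer, older):
    """Merge two sorted SSTables (lists of (key, value)). The newer table
    wins on duplicate keys; tombstoned keys are dropped entirely."""
    merged = {}
    for k, v in older:
        merged[k] = v
    for k, v in newer:                  # newer entries overwrite older ones
        merged[k] = v
    return sorted((k, v) for k, v in merged.items() if v is not TOMBSTONE)

newer = [("a", 9), ("b", TOMBSTONE)]    # "a" was updated, "b" was deleted
older = [("a", 1), ("b", 2), ("c", 3)]
# compact(newer, older) -> [("a", 9), ("c", 3)]
```

Rewriting surviving keys into the merged file is exactly where the write amplification mentioned above comes from: the same live data gets rewritten each time its level compacts.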


LSM Tree Level Population & Cascading Compaction

πŸ”„ Tell me More: How LSM Tree Levels Get Populated

Great question! LSM tree levels are populated through cascading compaction, not recursive processes.

πŸ“Š Level Population Flow

MemTable (3/3 full)
   ↓ Flush
L0: [SST1] [SST2] [SST3] [SST4]      ← gets crowded (4+ SSTables)
   ↓ Size-tiered compaction
L1: [────────────SST1────────────]   ← merged from L0 SSTables
   ↓ when L1 gets too large
L2: [────────────SST1────────────]   ← compacted from L1
   ↓ when L2 gets too large
L3: [────────────SST1────────────]   ← and so on...

βœ… Cascading (What Happens)

  • L0 β†’ L1 (independent operation)
  • Later: L1 β†’ L2 (when L1 gets full)
  • Later: L2 β†’ L3 (when L2 gets full)
  • Each triggered by size/count thresholds

❌ Recursive (Doesn't Happen)

  • L0 compaction triggers L1 compaction
  • Which triggers L2 compaction...
  • Chain reaction through all levels
  • (This would be too expensive!)

⚑ Compaction Triggers & SSTable Counts

L0 characteristics:
  • Count: typically 0-10 SSTables
  • Trigger: when 4+ SSTables exist
  • Overlapping key ranges: yes, e.g. (A-Z) and (D-M)
  • Source: MemTable flushes only
L1+ characteristics:
  • Count: usually one large SSTable per key range
  • Trigger: when the level exceeds its size limit
  • Non-overlapping ranges: (A-F), (G-M), (N-T)
  • Size growth: L1: 10MB β†’ L2: 100MB β†’ L3: 1GB
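The size thresholds above (10 MB, 100 MB, 1 GB with a fanout of 10) can be expressed as a small helper; both function names and the default parameters are hypothetical, and real engines expose these as tuning knobs:

```python
def level_capacity(level, base_mb=10, fanout=10):
    """Each level holds roughly `fanout` times the previous one:
    L1 = 10 MB, L2 = 100 MB, L3 = 1000 MB with these defaults."""
    return base_mb * fanout ** (level - 1)

def needs_compaction(level, current_size_mb):
    """A level compacts into the next one once it exceeds its size limit."""
    return current_size_mb > level_capacity(level)
```

Checking each level against its own threshold, independently of the others, is exactly the cascading (not recursive) behavior described above.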

πŸ’‘ Key Insight: Each level compacts independently when it gets "crowded", creating a natural cascading effect without expensive recursive operations!

Comparison: B-Tree vs LSM Tree

Aspect             | B-Tree                     | LSM Tree
-------------------|----------------------------|------------------------------------
Write performance  | O(log n) with random I/O   | O(1) amortized with sequential I/O
Read performance   | O(log n), predictable      | O(log n) Γ— num_levels, variable
Space efficiency   | ~70% utilization typical   | Can have space amplification
I/O pattern        | Random reads and writes    | Sequential writes, random reads
Compaction         | Not required               | Required (background process)
Best for           | OLTP, balanced workloads   | Write-heavy, append-only

Other Modern Storage Engines

🌿 Fractal Tree Index

TokuDB

Fractal Trees use message buffers at each internal node to batch changes, reducing I/O for both reads and writes.

  • Better write performance than B-Trees
  • Better read performance than LSM Trees
  • Good compression ratios
  • Used in TokuDB and PerconaFT
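The message-buffer idea can be sketched with a single buffered internal node; `BufferedNode`, its plain-dict children, and the buffer size are all assumptions of this illustration, not TokuDB's actual structures:

```python
class BufferedNode:
    """Sketch of a Fractal-Tree-style internal node: updates are appended
    to a per-node message buffer and pushed down to children in batches,
    turning many small random writes into fewer large ones."""

    def __init__(self, pivot, left, right, buffer_limit=4):
        self.pivot = pivot                    # keys < pivot go left, others right
        self.left, self.right = left, right   # children: plain dicts in this sketch
        self.buffer = []                      # pending (key, value) messages
        self.limit = buffer_limit

    def insert(self, key, value):
        self.buffer.append((key, value))      # O(1): no child touched yet
        if len(self.buffer) >= self.limit:
            self.flush()

    def flush(self):
        # Push all buffered messages to the children in one batch
        for k, v in self.buffer:
            (self.left if k < self.pivot else self.right)[k] = v
        self.buffer = []

node = BufferedNode(pivot="m", left={}, right={})
for k in ["a", "z", "b", "q"]:
    node.insert(k, 1)
# the 4th insert filled the buffer and flushed: "a", "b" went left; "z", "q" right
```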

πŸ“ R-Tree

PostGIS, SQLite

R-Trees are specialized for spatial and multi-dimensional data, organizing data using bounding rectangles.

  • Optimized for spatial queries
  • Supports nearest neighbor search
  • Used in GIS applications
  • Found in spatial database extensions
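The bounding-rectangle idea can be sketched with a minimal intersection test; the `candidates` helper and the park data are hypothetical illustrations of one level of an R-Tree search:

```python
def intersects(a, b):
    """Axis-aligned bounding rectangles as (min_x, min_y, max_x, max_y).
    An R-Tree only descends into subtrees whose rectangle overlaps the query."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return ax1 <= bx2 and bx1 <= ax2 and ay1 <= by2 and by1 <= ay2

def candidates(entries, query):
    """One level of an R-Tree search: keep entries whose box overlaps the query."""
    return [name for name, rect in entries if intersects(rect, query)]

parks = [("riverside", (0, 0, 2, 2)), ("hilltop", (5, 5, 7, 7))]
# candidates(parks, (1, 1, 3, 3)) -> ["riverside"]
```

Pruning whole subtrees whose rectangles miss the query is what makes spatial range queries fast without ever imposing a total order on 2-D points.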

πŸ—‚οΈ Bitcask

Riak

Bitcask is a log-structured hash table designed for fast key-value storage with predictable lookup times.

  • O(1) disk seeks for reads
  • All keys kept in memory
  • Append-only writes
  • Simple and predictable performance
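The log-plus-keydir design above can be sketched with an in-memory byte buffer standing in for the append-only data file; `TinyBitcask` and its record layout are assumptions of this sketch, not Riak's actual file format:

```python
import io

class TinyBitcask:
    """Bitcask sketch: an append-only log plus an in-memory 'keydir' that
    maps each key to the offset of its latest value, so a read costs at
    most one seek."""

    def __init__(self):
        self.log = io.BytesIO()     # stands in for the append-only data file
        self.keydir = {}            # key -> (offset, length)

    def put(self, key, value):
        data = value.encode()
        offset = self.log.seek(0, io.SEEK_END)   # always append, never overwrite
        self.log.write(data)
        self.keydir[key] = (offset, len(data))   # newest record wins

    def get(self, key):
        if key not in self.keydir:
            return None
        offset, length = self.keydir[key]
        self.log.seek(offset)                    # exactly one seek per read
        return self.log.read(length).decode()

store = TinyBitcask()
store.put("user:1", "alice")
store.put("user:1", "bob")   # the old record stays in the log; keydir skips past it
```

Stale records accumulate in the log until a periodic merge rewrites only the live entries, which is Bitcask's equivalent of compaction.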

πŸ“Š Column-Oriented Storage

Parquet, ClickHouse

Column-oriented storage engines store data by columns rather than rows, optimizing for analytical workloads.

  • Excellent compression ratios
  • Fast aggregation queries
  • Efficient for OLAP workloads
  • Poor for transactional updates
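The row-vs-column trade-off can be illustrated with the same tiny table in both layouts; the data is made up for illustration:

```python
# Row-oriented layout: each record stored contiguously
rows = [("2024-01-01", "clicks", 10),
        ("2024-01-02", "clicks", 30),
        ("2024-01-03", "clicks", 20)]

# Column-oriented layout: each column stored contiguously
columns = {
    "date":   ["2024-01-01", "2024-01-02", "2024-01-03"],
    "metric": ["clicks", "clicks", "clicks"],   # repetitive values compress well
    "value":  [10, 30, 20],
}

# An aggregation scans only the one column it needs:
total = sum(columns["value"])   # dates and metric names are never read
```

In the row layout the same query would drag every column of every record through memory, which is why analytical engines like ClickHouse and formats like Parquet store columns separately.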

Choosing the Right Storage Engine

Decision Framework

1. Analyze Your Workload

  • Read/write ratio
  • Query patterns (point lookups vs range scans)
  • Data size and growth rate
  • Consistency requirements

2. Consider Operational Aspects

  • Maintenance overhead
  • Backup and recovery
  • Monitoring and tuning complexity
  • Team expertise

3. Quick Decision Guide

Use B-Tree when:
  • Balanced read/write workload
  • Need predictable latency
  • Strong consistency required
Use LSM Tree when:
  • Write-heavy workload
  • Can tolerate read latency
  • Working with time-series data

Real-World Examples

MySQL InnoDB

B+Tree

Uses clustered B+Tree indexes where data is stored with the primary key index. Secondary indexes point to primary key values.

RocksDB

LSM Tree

Facebook's embedded database, powers many distributed systems including Apache Kafka, CockroachDB, and TiKV.

PostgreSQL

B-Tree

Uses heap files with B-Tree indexes. Supports multiple index types including GiST for spatial data and GIN for full-text search.

Key Takeaways

  1. No one-size-fits-all: Choose based on your specific workload patterns and requirements.
  2. B-Trees excel at reads: Use for OLTP systems with balanced workloads and strong consistency needs.
  3. LSM Trees excel at writes: Perfect for write-heavy workloads like logs and time-series data.
  4. Consider hybrid approaches: Some systems use multiple storage engines for different data types.
  5. Benchmark with your data: Always test with realistic workloads before making a final decision.

Further Reading

Recommended Resources

  • The Log-Structured Merge-Tree (LSM-Tree) - original paper by O'Neil et al.
  • Log Structured Merge Trees - Ben Stopford's explanation
  • RocksDB Wiki - deep dive into LSM implementation
  • "Database Internals" by Alex Petrov - comprehensive coverage of storage engines
  • "Designing Data-Intensive Applications" by Martin Kleppmann - Chapter 3 on storage engines