Storage Engines: B-Tree vs LSM Tree & Beyond
Understanding storage engines is fundamental to database performance and system design. Let's explore how different storage engines work, their trade-offs, and when to use each.
Overview of Storage Engines
What are Storage Engines?
Storage engines are the core components of databases that handle how data is stored, organized, and retrieved from disk. They determine the database's performance characteristics for different workloads.
Key Considerations
- Write performance vs. read performance
- Storage efficiency and compression
- Range query performance
- Compaction and maintenance overhead
B-Tree Based Storage Engines
B-Tree & B+Tree
How B-Trees Work
B-Trees are self-balancing tree data structures that maintain sorted data and allow searches, sequential access, insertions, and deletions in logarithmic time.
- Each node contains multiple keys and child pointers
- All leaf nodes are at the same depth (balanced tree)
- Nodes are typically the size of a disk page (4KB or 8KB)
- B+Trees store all data in leaf nodes with internal nodes only for navigation
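To make this layout concrete, here is a minimal sketch of a B+Tree-style node and point lookup, assuming a toy in-memory representation (real engines lay nodes out on fixed-size disk pages and keep them on disk):

```python
from bisect import bisect_left, bisect_right

class BPlusNode:
    """Toy B+Tree node: internal nodes hold only keys and child pointers,
    leaf nodes hold keys and values (the 'data only in leaves' rule)."""
    def __init__(self, leaf=True):
        self.leaf = leaf
        self.keys = []        # sorted keys within the node
        self.children = []    # child nodes (internal nodes only)
        self.values = []      # values aligned with keys (leaf nodes only)

def search(node, key):
    """Walk from the root down to a leaf, then look the key up in that leaf.
    Each level visited corresponds to one page read, so cost is O(height) = O(log n)."""
    while not node.leaf:
        i = bisect_right(node.keys, key)   # pick the child whose range covers `key`
        node = node.children[i]
    i = bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.values[i]
    return None

# Tiny usage: a root with two leaves, separated by key 20.
left, right = BPlusNode(), BPlusNode()
left.keys, left.values = [5, 10], ["a", "b"]
right.keys, right.values = [20, 30], ["c", "d"]
root = BPlusNode(leaf=False)
root.keys, root.children = [20], [left, right]
print(search(root, 30))   # "d"
```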
Advantages
- Excellent read performance (O(log n))
- Efficient range queries
- Predictable performance
- No compaction needed
- In-place updates possible
Disadvantages
- Random I/O for writes
- Write amplification
- Fragmentation over time
- Less efficient for write-heavy workloads
Best Use Cases
- OLTP systems with balanced read/write workloads
- Applications requiring strong consistency
- Systems with frequent range queries
- Traditional relational databases
B-Tree Split & Merge Visualization (degree 3: max 2 keys per node, max 3 children) - an interactive operations sequence that starts from an empty B-Tree and applies insertions step by step.
Split Operation
When a node exceeds the maximum number of keys (degree-1), it splits into two nodes. The middle key moves up to the parent, maintaining balance.
Merge Operation
When deletion makes a node too small, it merges with a sibling. A parent key moves down to maintain minimum key requirements.
Self-Balancing
All leaf nodes remain at the same level, ensuring O(log n) performance for all operations by maintaining tree balance.
Degree = 3 (Order = 3)
Min keys per node: ⌈(degree-1)/2⌉ = 1 key
Max keys per node: degree - 1 = 2 keys
Root exception: can have 1 to degree-1 keys
Deletion Cases
Leaf: remove directly or merge
Internal: replace with predecessor/successor
Underflow: borrow from sibling or merge
Always Sorted
Keys within nodes and across the tree remain sorted, enabling efficient range queries and ordered traversals.
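As a rough illustration of the split rule above, the sketch below splits a degree-3 node that has overflowed past its 2-key maximum and returns the middle key so a caller could push it into the parent (a full implementation also splits parents recursively, and B+Trees copy rather than move the middle key at the leaf level):

```python
def split_overflowing_node(keys):
    """Split a node that exceeded its degree-1 = 2 key maximum (degree 3).
    Returns (left_keys, middle_key, right_keys); the middle key is what
    gets pushed up into the parent node."""
    mid = len(keys) // 2
    return keys[:mid], keys[mid], keys[mid + 1:]

# Inserting 5, 12, 30 into a single node overflows it (3 keys > 2), so it splits:
left, promoted, right = split_overflowing_node([5, 12, 30])
print(left, promoted, right)   # [5] 12 [30]
```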
Why Don't We Use B-Trees in DSA Problems?
Tell me Why: B-Trees vs BST in Coding Problems
Great question! Even though B-Trees are amazing for databases, we rarely see them in coding contests or DSA problems. Here's why:
Memory vs Disk
B-Trees: Optimized for disk I/O (databases)
BST/AVL: Optimized for memory operations
DSA problems: Focus on in-memory algorithms
Complexity Trade-off
Same O(log n) time complexity, but:
BST: ~10 lines of code
B-Tree: ~100+ lines with splits/merges
Problem Scale
DSA problems: n ≤ 10⁶ (fits in memory)
Databases: n ≥ 10⁹ (requires disk)
Memory trees work fine for contest sizes!
Different Focus
DSA: Algorithm design & time complexity
Systems: I/O efficiency & storage optimization
B-Trees solve a different problem!
Key Insight: B-Trees minimize expensive disk I/O, while BST variants minimize CPU operations in memory!
LSM Tree Based Storage Engines
LSM Tree (Log-Structured Merge Tree)
How LSM Trees Work
LSM Trees optimize for write performance by buffering writes in memory and periodically flushing them to disk in sorted batches.
- Writes go to an in-memory table (MemTable)
- When MemTable is full, it's flushed to disk as an SSTable
- SSTables are immutable and sorted
- Background compaction merges SSTables to maintain read performance
- Reads check MemTable, then SSTables from newest to oldest
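A minimal sketch of this write/read path, assuming a dict-backed MemTable and immutable, sorted in-memory lists standing in for on-disk SSTables (real engines persist SSTables as files and add bloom filters and block indexes to cut down lookups):

```python
import bisect

class ToyLSM:
    """Toy LSM tree: a mutable in-memory MemTable plus immutable sorted runs."""
    def __init__(self, memtable_limit=4):
        self.memtable = {}       # in-memory buffer that absorbs all writes
        self.sstables = []       # list of sorted (key, value) lists, newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value                 # writes are pure memory operations
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # MemTable is full: freeze it into an immutable, sorted SSTable.
        self.sstables.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:                   # 1. newest data lives in the MemTable
            return self.memtable[key]
        for run in self.sstables:                  # 2. then SSTables, newest to oldest
            keys = [k for k, _ in run]
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return run[i][1]
        return None                                # not found anywhere

db = ToyLSM()
for i in range(10):
    db.put(f"key{i}", i)
print(db.get("key3"), len(db.sstables))   # 3 2  (two MemTable flushes occurred)
```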
Advantages
- Excellent write performance
- Sequential I/O patterns
- Better compression ratios
- Efficient for append-only workloads
- No write amplification for initial writes
Disadvantages
- Slower reads (multiple files to check)
- Compaction overhead
- Space amplification
- Less predictable performance
- Complex tuning parameters
Best Use Cases
- Write-heavy workloads (logs, time-series data)
- NoSQL databases and distributed systems
- Systems where writes vastly outnumber reads
- Applications that can tolerate eventual consistency
LSM Tree Operations Visualization (write-optimized storage) - an interactive sequence that starts from an empty LSM tree, where writes go to the in-memory MemTable, and shows the tree structure across the memory and disk layers.
Amplification metrics:
- Write amplification: data written to disk vs. the original write
- Read amplification: data read from disk vs. data requested
- Space amplification: total storage used vs. live data
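A back-of-the-envelope illustration of how these three ratios are computed; every number below is made up for the example:

```python
# Hypothetical figures for one workload (all values in GB).
user_bytes_written = 1.0      # data the application actually wrote
disk_bytes_written = 10.0     # data hitting disk once flushes and re-compactions are counted
disk_bytes_read = 0.03        # data read from disk to answer a query
bytes_requested = 0.01        # data the query actually asked for
live_data = 1.0               # current, non-deleted data
total_on_disk = 1.4           # on-disk footprint including stale versions awaiting compaction

write_amplification = disk_bytes_written / user_bytes_written   # ~10x
read_amplification = disk_bytes_read / bytes_requested          # ~3x
space_amplification = total_on_disk / live_data                 # ~1.4x
print(round(write_amplification, 2), round(read_amplification, 2), round(space_amplification, 2))
```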
Write Path
Writes go to the MemTable (fast!), then flush to L0 when full. Background compaction merges levels to maintain read performance.
Read Path
Reads check the MemTable first, then SSTables from newest (L0) to oldest (L2+). They may need to check multiple files (read amplification).
Compaction
A background process merges SSTables, removing deleted keys and reducing file count. Essential for read performance.
Trade-offs
Excellent write performance, but reads get slower as data ages. Compaction helps but creates write amplification.
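A minimal sketch of what one compaction step does to two overlapping SSTables, assuming (key, value) pairs where value = None marks a tombstone; the newer run wins on conflicts and tombstones are dropped from the merged output (real compactions stream data block by block and must keep tombstones until no older level can still contain the key):

```python
def compact(newer, older):
    """Merge two sorted SSTables into one: the newest version of each key
    wins, and deleted keys (tombstones, value None) are dropped."""
    merged = dict(older)
    merged.update(newer)                 # entries from the newer run shadow older ones
    return sorted((k, v) for k, v in merged.items() if v is not None)

newer = [("a", 1), ("c", None), ("d", 4)]   # "c" was deleted after the older run was written
older = [("a", 0), ("b", 2), ("c", 3)]
print(compact(newer, older))                # [('a', 1), ('b', 2), ('d', 4)]
```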
LSM Tree Level Population & Cascading Compaction
Tell me More: How LSM Tree Levels Get Populated
Great question! LSM tree levels are populated through cascading compaction, not recursive processes.
Level Population Flow
Cascading (What Happens)
- L0 → L1 (independent operation)
- Later: L1 → L2 (when L1 gets full)
- Later: L2 → L3 (when L2 gets full)
- Each triggered by size/count thresholds
Recursive (Doesn't Happen)
- L0 compaction triggers L1 compaction
- Which triggers L2 compaction...
- Chain reaction through all levels
- (This would be too expensive!)
Compaction Triggers & SSTable Counts
L0 Characteristics:
- Count: 0-10 SSTables typically
- Trigger: when 4+ SSTables exist
- Overlapping key ranges: yes, e.g. (A-Z), (D-M)
- Source: MemTable flushes only
L1+ Characteristics:
- Count: usually 1 large SSTable per range
- Trigger: when level exceeds size limit
- Non-overlapping ranges: (A-F), (G-M), (N-T)
- Size growth: L1: 10MB → L2: 100MB → L3: 1GB
Key Insight: Each level compacts independently when it gets "crowded", creating a natural cascading effect without expensive recursive operations!
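A sketch of how those per-level size thresholds might be expressed, using the illustrative 10x-per-level growth above; the base size, growth multiplier, and L0 file-count trigger below are assumptions, not fixed values (LSM engines such as RocksDB expose them as tunables):

```python
def level_limits(base_mb=10, multiplier=10, levels=4):
    """Target size per level beyond L0, e.g. L1=10MB, L2=100MB, L3=1GB, ..."""
    return {f"L{i}": base_mb * multiplier ** (i - 1) for i in range(1, levels + 1)}

def compactions_due(level_sizes_mb, l0_file_count, l0_file_trigger=4):
    """Check each level independently against its own threshold (no recursion)."""
    limits = level_limits()
    due = ["L0"] if l0_file_count >= l0_file_trigger else []
    due += [lvl for lvl, size in level_sizes_mb.items() if size > limits.get(lvl, float("inf"))]
    return due

print(level_limits())                                          # {'L1': 10, 'L2': 100, 'L3': 1000, 'L4': 10000}
print(compactions_due({"L1": 12, "L2": 40}, l0_file_count=5))  # ['L0', 'L1']
```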
Comparison: B-Tree vs LSM Tree
| Aspect | B-Tree | LSM Tree |
|---|---|---|
| Write Performance | O(log n) with random I/O | O(1) amortized with sequential I/O |
| Read Performance | O(log n) - predictable | O(log n) × num_levels - variable |
| Space Efficiency | ~70% utilization typical | Can have space amplification |
| I/O Pattern | Random reads and writes | Sequential writes, random reads |
| Compaction | Not required | Required (background process) |
| Best For | OLTP, balanced workloads | Write-heavy, append-only |
Other Modern Storage Engines
Fractal Tree Index
Fractal Trees use message buffers at each internal node to batch changes, reducing I/O for both reads and writes.
- Better write performance than B-Trees
- Better read performance than LSM Trees
- Good compression ratios
- Used in TokuDB and PerconaFT
R-Tree
R-Trees (used in PostGIS and SQLite) are specialized for spatial and multi-dimensional data, organizing data using bounding rectangles.
- Optimized for spatial queries
- Supports nearest neighbor search
- Used in GIS applications
- Found in spatial database extensions
Bitcask
Bitcask (used in Riak) is a log-structured hash table designed for fast key-value storage with predictable lookup times.
- O(1) disk seeks for reads
- All keys kept in memory
- Append-only writes
- Simple and predictable performance
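A minimal sketch of the Bitcask idea, assuming a single append-only log file and an in-memory keydir that maps each key to the offset of its latest value (the real on-disk format also records CRCs, timestamps, and sizes, and uses hint files to rebuild the keydir quickly on restart):

```python
import os

class ToyBitcask:
    """Toy Bitcask: one append-only log on disk, all keys held in an in-memory keydir."""
    def __init__(self, path="toy_bitcask.log"):
        self.path = path
        self.keydir = {}                       # key -> (offset, length) of the latest value
        open(self.path, "ab").close()          # make sure the log file exists

    def put(self, key, value):
        offset = os.path.getsize(self.path)    # the append will land at the end of the log
        data = value.encode()
        with open(self.path, "ab") as f:       # append-only, sequential write
            f.write(data)
        self.keydir[key] = (offset, len(data))

    def get(self, key):
        if key not in self.keydir:
            return None
        offset, length = self.keydir[key]      # one seek + one read: O(1) disk I/O per lookup
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(length).decode()

db = ToyBitcask()
db.put("user:1", "alice")
db.put("user:1", "bob")        # the old value stays in the log until a merge/compaction pass
print(db.get("user:1"))        # bob
```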
Column-Oriented Storage
Column-oriented storage engines (e.g. Parquet, ClickHouse) store data by columns rather than rows, optimizing for analytical workloads.
- Excellent compression ratios
- Fast aggregation queries
- Efficient for OLAP workloads
- Poor for transactional updates
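A small sketch contrasting row-oriented and column-oriented layouts of the same toy table, showing why an aggregate over one column touches less data in the columnar layout (the table and numbers are made up):

```python
# Row-oriented: each record is stored together (good for "fetch one order").
rows = [
    {"order_id": 1, "region": "EU", "amount": 30.0},
    {"order_id": 2, "region": "US", "amount": 12.5},
    {"order_id": 3, "region": "EU", "amount": 100.0},
]

# Column-oriented: each column is stored contiguously (good for scans and aggregates);
# repetitive columns like "region" compress well (dictionary / run-length encoding).
columns = {
    "order_id": [1, 2, 3],
    "region":   ["EU", "US", "EU"],
    "amount":   [30.0, 12.5, 100.0],
}

# SUM(amount): the columnar layout reads only the 'amount' column,
# not every field of every row.
print(sum(row["amount"] for row in rows))   # 142.5
print(sum(columns["amount"]))               # 142.5
```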
Choosing the Right Storage Engine
Decision Framework
1. Analyze Your Workload
- Read/Write ratio
- Query patterns (point lookups vs range scans)
- Data size and growth rate
- Consistency requirements
2. Consider Operational Aspects
- Maintenance overhead
- Backup and recovery
- Monitoring and tuning complexity
- Team expertise
3. Quick Decision Guide
Choose a B-Tree engine if you have:
- A balanced read/write workload
- A need for predictable latency
- Strong consistency requirements
Choose an LSM-based engine if you have:
- A write-heavy workload
- Tolerance for higher read latency
- Time-series or log-style data
Real-World Examples
MySQL InnoDB
B+Tree: uses clustered B+Tree indexes where data is stored with the primary key index; secondary indexes point to primary key values.
RocksDB
LSM Tree: Facebook's embedded key-value store; powers many distributed systems, including Kafka Streams state stores, CockroachDB, and TiKV.
PostgreSQL
B-Tree: uses heap files with B-Tree indexes; supports multiple index types, including GiST for spatial data and GIN for full-text search.
Key Takeaways
1. No one-size-fits-all: Choose based on your specific workload patterns and requirements.
2. B-Trees excel at reads: Use for OLTP systems with balanced workloads and strong consistency needs.
3. LSM Trees excel at writes: Perfect for write-heavy workloads like logs and time-series data.
4. Consider hybrid approaches: Some systems use multiple storage engines for different data types.
5. Benchmark with your data: Always test with realistic workloads before making a final decision.
Further Reading
Recommended Resources
- The Log-Structured Merge-Tree (LSM-Tree) - original paper by O'Neil et al.
- Log Structured Merge Trees - Ben Stopford's explanation
- RocksDB Wiki - deep dive into the LSM implementation
- "Database Internals" by Alex Petrov - comprehensive coverage of storage engines
- "Designing Data-Intensive Applications" by Martin Kleppmann - Chapter 3 on storage engines