Storage Engines: B-Tree vs LSM Tree & Beyond
Understanding storage engines is fundamental to database performance and system design. Let's explore how different storage engines work, their trade-offs, and when to use each.
Overview of Storage Engines
What are Storage Engines?
Storage engines are the core components of databases that handle how data is stored, organized, and retrieved from disk. They determine the database's performance characteristics for different workloads.
Key Considerations
- Write performance vs. read performance
- Storage efficiency and compression
- Range query performance
- Compaction and maintenance overhead
B-Tree Based Storage Engines
B-Tree & B+Tree
How B-Trees Work
B-Trees are self-balancing tree data structures that maintain sorted data and allow searches, sequential access, insertions, and deletions in logarithmic time.
- Each node contains multiple keys and child pointers
- All leaf nodes are at the same depth (balanced tree)
- Nodes are typically the size of a disk page (4KB or 8KB)
- B+Trees store all data in leaf nodes with internal nodes only for navigation
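To make this layout concrete, here is a minimal sketch of a B+Tree-style node and point lookup, assuming a toy in-memory representation (real engines lay nodes out on fixed-size disk pages and keep them on disk):

```python
from bisect import bisect_left, bisect_right

class BPlusNode:
    """Toy B+Tree node: internal nodes hold only keys and child pointers,
    leaf nodes hold keys and values (the 'data only in leaves' rule)."""
    def __init__(self, leaf=True):
        self.leaf = leaf
        self.keys = []        # sorted keys within the node
        self.children = []    # child nodes (internal nodes only)
        self.values = []      # values aligned with keys (leaf nodes only)

def search(node, key):
    """Walk from the root down to a leaf, then look the key up in that leaf.
    Each level visited corresponds to one page read, so cost is O(height) = O(log n)."""
    while not node.leaf:
        i = bisect_right(node.keys, key)   # pick the child whose range covers `key`
        node = node.children[i]
    i = bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.values[i]
    return None

# Tiny usage: a root with two leaves, separated by key 20.
left, right = BPlusNode(), BPlusNode()
left.keys, left.values = [5, 10], ["a", "b"]
right.keys, right.values = [20, 30], ["c", "d"]
root = BPlusNode(leaf=False)
root.keys, root.children = [20], [left, right]
print(search(root, 30))   # "d"
```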
Advantages
- Excellent read performance (O(log n))
- Efficient range queries
- Predictable performance
- No compaction needed
- In-place updates possible
Disadvantages
- Random I/O for writes
- Write amplification
- Fragmentation over time
- Less efficient for write-heavy workloads
Best Use Cases
- OLTP systems with balanced read/write workloads
- Applications requiring strong consistency
- Systems with frequent range queries
- Traditional relational databases
B-Tree Split & Merge Visualization (degree 3: max 2 keys per node, max 3 children) - an interactive operations sequence that starts from an empty B-Tree and applies insertions step by step.
Split Operation
When a node exceeds the maximum number of keys (degree-1), it splits into two nodes. The middle key moves up to the parent, maintaining balance.
Merge Operation
When deletion makes a node too small, it merges with a sibling. A parent key moves down to maintain minimum key requirements.
Self-Balancing
All leaf nodes remain at the same level, ensuring O(log n) performance for all operations by maintaining tree balance.
Degree = 3 (Order = 3)
Min keys per node: ⌈(degree-1)/2⌉ = 1 key
Max keys per node: degree - 1 = 2 keys
Root exception: can have 1 to degree-1 keys
Deletion Cases
Leaf: remove directly or merge
Internal: replace with predecessor/successor
Underflow: borrow from sibling or merge
Always Sorted
Keys within nodes and across the tree remain sorted, enabling efficient range queries and ordered traversals.
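As a rough illustration of the split rule above, the sketch below splits a degree-3 node that has overflowed past its 2-key maximum and returns the middle key so a caller could push it into the parent (a full implementation also splits parents recursively, and B+Trees copy rather than move the middle key at the leaf level):

```python
def split_overflowing_node(keys):
    """Split a node that exceeded its degree-1 = 2 key maximum (degree 3).
    Returns (left_keys, middle_key, right_keys); the middle key is what
    gets pushed up into the parent node."""
    mid = len(keys) // 2
    return keys[:mid], keys[mid], keys[mid + 1:]

# Inserting 5, 12, 30 into a single node overflows it (3 keys > 2), so it splits:
left, promoted, right = split_overflowing_node([5, 12, 30])
print(left, promoted, right)   # [5] 12 [30]
```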
Why Don't We Use B-Trees in DSA Problems?
Tell me Why: B-Trees vs BST in Coding Problems
Great question! Even though B-Trees are amazing for databases, we rarely see them in coding contests or DSA problems. Here's why:
Memory vs Disk
B-Trees: Optimized for disk I/O (databases)
BST/AVL: Optimized for memory operations
DSA problems: Focus on in-memory algorithms
Complexity Trade-off
Same O(log n) time complexity, but:
BST: ~10 lines of code
B-Tree: ~100+ lines with splits/merges
Problem Scale
DSA problems: n ≤ 10⁶ (fits in memory)
Databases: n ≥ 10⁹ (requires disk)
Memory trees work fine for contest sizes!
Different Focus
DSA: Algorithm design & time complexity
Systems: I/O efficiency & storage optimization
B-Trees solve a different problem!
Key Insight: B-Trees minimize expensive disk I/O, while BST variants minimize CPU operations in memory!
LSM Tree Based Storage Engines
LSM Tree (Log-Structured Merge Tree)
How LSM Trees Work
LSM Trees optimize for write performance by buffering writes in memory and periodically flushing them to disk in sorted batches.
- Writes go to an in-memory table (MemTable)
- When MemTable is full, it's flushed to disk as an SSTable
- SSTables are immutable and sorted
- Background compaction merges SSTables to maintain read performance
- Reads check MemTable, then SSTables from newest to oldest
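A minimal sketch of this write/read path, assuming a dict-backed MemTable and immutable, sorted in-memory lists standing in for on-disk SSTables (real engines persist SSTables as files and add bloom filters and block indexes to cut down lookups):

```python
import bisect

class ToyLSM:
    """Toy LSM tree: a mutable in-memory MemTable plus immutable sorted runs."""
    def __init__(self, memtable_limit=4):
        self.memtable = {}       # in-memory buffer that absorbs all writes
        self.sstables = []       # list of sorted (key, value) lists, newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value                 # writes are pure memory operations
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # MemTable is full: freeze it into an immutable, sorted SSTable.
        self.sstables.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:                   # 1. newest data lives in the MemTable
            return self.memtable[key]
        for run in self.sstables:                  # 2. then SSTables, newest to oldest
            keys = [k for k, _ in run]
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return run[i][1]
        return None                                # not found anywhere

db = ToyLSM()
for i in range(10):
    db.put(f"key{i}", i)
print(db.get("key3"), len(db.sstables))   # 3 2  (two MemTable flushes occurred)
```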
Advantages
- Excellent write performance
- Sequential I/O patterns
- Better compression ratios
- Efficient for append-only workloads
- No write amplification for initial writes
Disadvantages
- Slower reads (multiple files to check)
- Compaction overhead
- Space amplification
- Less predictable performance
- Complex tuning parameters
Best Use Cases
- Write-heavy workloads (logs, time-series data)
- NoSQL databases and distributed systems
- Systems where writes vastly outnumber reads
- Applications that can tolerate eventual consistency
LSM Tree Operations Visualization (write-optimized storage) - an interactive sequence that starts from an empty LSM tree, where writes go to the in-memory MemTable, and shows the tree structure across the memory and disk layers.
Amplification metrics:
- Write amplification: data written to disk vs. the original write
- Read amplification: data read from disk vs. data requested
- Space amplification: total storage used vs. live data
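A back-of-the-envelope illustration of how these three ratios are computed; every number below is made up for the example:

```python
# Hypothetical figures for one workload (all values in GB).
user_bytes_written = 1.0      # data the application actually wrote
disk_bytes_written = 10.0     # data hitting disk once flushes and re-compactions are counted
disk_bytes_read = 0.03        # data read from disk to answer a query
bytes_requested = 0.01        # data the query actually asked for
live_data = 1.0               # current, non-deleted data
total_on_disk = 1.4           # on-disk footprint including stale versions awaiting compaction

write_amplification = disk_bytes_written / user_bytes_written   # ~10x
read_amplification = disk_bytes_read / bytes_requested          # ~3x
space_amplification = total_on_disk / live_data                 # ~1.4x
print(round(write_amplification, 2), round(read_amplification, 2), round(space_amplification, 2))
```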
Write Path
Writes go to the MemTable (fast!), then flush to L0 when full. Background compaction merges levels to maintain read performance.
Read Path
Reads check the MemTable first, then SSTables from newest (L0) to oldest (L2+). They may need to check multiple files (read amplification).
Compaction
A background process merges SSTables, removing deleted keys and reducing file count. Essential for read performance.
Trade-offs
Excellent write performance, but reads get slower as data ages. Compaction helps but creates write amplification.
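A minimal sketch of what one compaction step does to two overlapping SSTables, assuming (key, value) pairs where value = None marks a tombstone; the newer run wins on conflicts and tombstones are dropped from the merged output (real compactions stream data block by block and must keep tombstones until no older level can still contain the key):

```python
def compact(newer, older):
    """Merge two sorted SSTables into one: the newest version of each key
    wins, and deleted keys (tombstones, value None) are dropped."""
    merged = dict(older)
    merged.update(newer)                 # entries from the newer run shadow older ones
    return sorted((k, v) for k, v in merged.items() if v is not None)

newer = [("a", 1), ("c", None), ("d", 4)]   # "c" was deleted after the older run was written
older = [("a", 0), ("b", 2), ("c", 3)]
print(compact(newer, older))                # [('a', 1), ('b', 2), ('d', 4)]
```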
LSM Tree Level Population & Cascading Compaction
Tell me More: How LSM Tree Levels Get Populated
Great question! LSM tree levels are populated through cascading compaction, not recursive processes.
Level Population Flow
Cascading (What Happens)
- L0 → L1 (independent operation)
- Later: L1 → L2 (when L1 gets full)
- Later: L2 → L3 (when L2 gets full)
- Each triggered by size/count thresholds
Recursive (Doesn't Happen)
- L0 compaction triggers L1 compaction
- Which triggers L2 compaction...
- Chain reaction through all levels
- (This would be too expensive!)
Compaction Triggers & SSTable Counts
L0 Characteristics:
- Count: 0-10 SSTables typically
- Trigger: when 4+ SSTables exist
- Overlapping key ranges: yes, e.g. (A-Z), (D-M)
- Source: MemTable flushes only
L1+ Characteristics:
- Count: usually 1 large SSTable per range
- Trigger: when level exceeds size limit
- Non-overlapping ranges: (A-F), (G-M), (N-T)
- Size growth: L1: 10MB → L2: 100MB → L3: 1GB
Key Insight: Each level compacts independently when it gets "crowded", creating a natural cascading effect without expensive recursive operations!
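A sketch of how those per-level size thresholds might be expressed, using the illustrative 10x-per-level growth above; the base size, growth multiplier, and L0 file-count trigger below are assumptions, not fixed values (LSM engines such as RocksDB expose them as tunables):

```python
def level_limits(base_mb=10, multiplier=10, levels=4):
    """Target size per level beyond L0, e.g. L1=10MB, L2=100MB, L3=1GB, ..."""
    return {f"L{i}": base_mb * multiplier ** (i - 1) for i in range(1, levels + 1)}

def compactions_due(level_sizes_mb, l0_file_count, l0_file_trigger=4):
    """Check each level independently against its own threshold (no recursion)."""
    limits = level_limits()
    due = ["L0"] if l0_file_count >= l0_file_trigger else []
    due += [lvl for lvl, size in level_sizes_mb.items() if size > limits.get(lvl, float("inf"))]
    return due

print(level_limits())                                          # {'L1': 10, 'L2': 100, 'L3': 1000, 'L4': 10000}
print(compactions_due({"L1": 12, "L2": 40}, l0_file_count=5))  # ['L0', 'L1']
```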
Comparison: B-Tree vs LSM Tree
| Aspect | B-Tree | LSM Tree |
|---|---|---|
| Write Performance | O(log n) with random I/O | O(1) amortized with sequential I/O |
| Read Performance | O(log n) - predictable | O(log n) × num_levels - variable |
| Space Efficiency | ~70% utilization typical | Can have space amplification |
| I/O Pattern | Random reads and writes | Sequential writes, random reads |
| Compaction | Not required | Required (background process) |
| Best For | OLTP, balanced workloads | Write-heavy, append-only |
Other Modern Storage Engines
Fractal Tree Index
Fractal Trees use message buffers at each internal node to batch changes, reducing I/O for both reads and writes.
- Better write performance than B-Trees
- Better read performance than LSM Trees
- Good compression ratios
- Used in TokuDB and PerconaFT
R-Tree
R-Trees (used in PostGIS and SQLite) are specialized for spatial and multi-dimensional data, organizing data using bounding rectangles.
- Optimized for spatial queries
- Supports nearest neighbor search
- Used in GIS applications
- Found in spatial database extensions
Bitcask
Bitcask (used in Riak) is a log-structured hash table designed for fast key-value storage with predictable lookup times.
- O(1) disk seeks for reads
- All keys kept in memory
- Append-only writes
- Simple and predictable performance
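A minimal sketch of the Bitcask idea, assuming a single append-only log file and an in-memory keydir that maps each key to the offset of its latest value (the real on-disk format also records CRCs, timestamps, and sizes, and uses hint files to rebuild the keydir quickly on restart):

```python
import os

class ToyBitcask:
    """Toy Bitcask: one append-only log on disk, all keys held in an in-memory keydir."""
    def __init__(self, path="toy_bitcask.log"):
        self.path = path
        self.keydir = {}                       # key -> (offset, length) of the latest value
        open(self.path, "ab").close()          # make sure the log file exists

    def put(self, key, value):
        offset = os.path.getsize(self.path)    # the append will land at the end of the log
        data = value.encode()
        with open(self.path, "ab") as f:       # append-only, sequential write
            f.write(data)
        self.keydir[key] = (offset, len(data))

    def get(self, key):
        if key not in self.keydir:
            return None
        offset, length = self.keydir[key]      # one seek + one read: O(1) disk I/O per lookup
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(length).decode()

db = ToyBitcask()
db.put("user:1", "alice")
db.put("user:1", "bob")        # the old value stays in the log until a merge/compaction pass
print(db.get("user:1"))        # bob
```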
Column-Oriented Storage
Column-oriented storage engines (e.g. Parquet, ClickHouse) store data by columns rather than rows, optimizing for analytical workloads.
- Excellent compression ratios
- Fast aggregation queries
- Efficient for OLAP workloads
- Poor for transactional updates
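A small sketch contrasting row-oriented and column-oriented layouts of the same toy table, showing why an aggregate over one column touches less data in the columnar layout (the table and numbers are made up):

```python
# Row-oriented: each record is stored together (good for "fetch one order").
rows = [
    {"order_id": 1, "region": "EU", "amount": 30.0},
    {"order_id": 2, "region": "US", "amount": 12.5},
    {"order_id": 3, "region": "EU", "amount": 100.0},
]

# Column-oriented: each column is stored contiguously (good for scans and aggregates);
# repetitive columns like "region" compress well (dictionary / run-length encoding).
columns = {
    "order_id": [1, 2, 3],
    "region":   ["EU", "US", "EU"],
    "amount":   [30.0, 12.5, 100.0],
}

# SUM(amount): the columnar layout reads only the 'amount' column,
# not every field of every row.
print(sum(row["amount"] for row in rows))   # 142.5
print(sum(columns["amount"]))               # 142.5
```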
Choosing the Right Storage Engine
Decision Framework
1. Analyze Your Workload
- Read/Write ratio
- Query patterns (point lookups vs range scans)
- Data size and growth rate
- Consistency requirements
2. Consider Operational Aspects
- Maintenance overhead
- Backup and recovery
- Monitoring and tuning complexity
- Team expertise
3. Quick Decision Guide
Choose a B-Tree engine if you have:
- A balanced read/write workload
- A need for predictable latency
- Strong consistency requirements
Choose an LSM-based engine if you have:
- A write-heavy workload
- Tolerance for higher read latency
- Time-series or log-style data
Real-World Examples
MySQL InnoDB
B+Tree: uses clustered B+Tree indexes where data is stored with the primary key index; secondary indexes point to primary key values.
RocksDB
LSM Tree: Facebook's embedded key-value store; powers many distributed systems, including Kafka Streams state stores, CockroachDB, and TiKV.
PostgreSQL
B-Tree: uses heap files with B-Tree indexes; supports multiple index types, including GiST for spatial data and GIN for full-text search.
Key Takeaways
1. No one-size-fits-all: Choose based on your specific workload patterns and requirements.
2. B-Trees excel at reads: Use for OLTP systems with balanced workloads and strong consistency needs.
3. LSM Trees excel at writes: Perfect for write-heavy workloads like logs and time-series data.
4. Consider hybrid approaches: Some systems use multiple storage engines for different data types.
5. Benchmark with your data: Always test with realistic workloads before making a final decision.
Further Reading
Recommended Resources
- The Log-Structured Merge-Tree (LSM-Tree) - original paper by O'Neil et al.
- Log Structured Merge Trees - Ben Stopford's explanation
- RocksDB Wiki - deep dive into the LSM implementation
- "Database Internals" by Alex Petrov - comprehensive coverage of storage engines
- "Designing Data-Intensive Applications" by Martin Kleppmann - Chapter 3 on storage engines