Distributed Key-Value Store
Building a distributed key-value store requires understanding fundamental trade-offs in distributed systems. Let's design a system that can scale to handle billions of key-value pairs across hundreds of nodes.
Requirements
Our distributed key-value store must handle:
- Small key-value pairs: Less than 10 KB each
- Big data: Store and retrieve billions of items
- High availability: Respond quickly even during failures
- High scalability: Scale horizontally with traffic
- Auto-scaling: Add/remove nodes automatically
- Tunable consistency: Choose between strong and eventual consistency
- Low latency: Sub-millisecond response times
CAP Theorem: The Fundamental Trade-off
The CAP theorem states that a distributed system can only guarantee two of the following three properties:
- Consistency (C): All nodes see the same data simultaneously
- Availability (A): Every request receives a response, even when some nodes are down
- Partition Tolerance (P): System continues despite network failures
Since network partitions are inevitable in distributed systems, we must choose between CP or AP:
CP Systems (Consistency + Partition Tolerance)
- Examples: HBase, MongoDB, Redis
- Sacrifices availability for consistency
- Better for: Financial transactions, inventory systems
AP Systems (Availability + Partition Tolerance)
- Examples: Cassandra, DynamoDB, CouchDB
- Sacrifices consistency for availability
- Better for: Social media, recommendations, analytics
Architecture Overview
┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│   Client    │   │   Client    │   │   Client    │
└──────┬──────┘   └──────┬──────┘   └──────┬──────┘
       │                 │                 │
       └─────────────────┴─────────────────┘
                         │
                  ┌──────▼──────┐
                  │Load Balancer│
                  └──────┬──────┘
                         │
       ┌─────────────────┼─────────────────┐
       │                 │                 │
┌──────▼──────┐   ┌──────▼──────┐   ┌──────▼──────┐
│ API Server  │   │ API Server  │   │ API Server  │
└──────┬──────┘   └──────┬──────┘   └──────┬──────┘
       │                 │                 │
       └─────────────────┴─────────────────┘
                         │
                  ┌──────▼──────┐
                  │ Coordinator │
                  └──────┬──────┘
                         │
       ┌─────────────────┼─────────────────┐
       │                 │                 │
┌──────▼──────┐   ┌──────▼──────┐   ┌──────▼──────┐
│Storage Node1│   │Storage Node2│   │Storage Node3│
│  (Replica)  │   │  (Replica)  │   │  (Replica)  │
└─────────────┘   └─────────────┘   └─────────────┘
Core Components
1. Data Partitioning
We use consistent hashing to distribute data across nodes; a minimal Go sketch follows the list of benefits below.
Benefits:
- Minimal data movement when nodes join/leave
- Even distribution of keys
- Each node manages ~1/N of total data
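As a concrete illustration, here is a minimal consistent-hash ring in Go. All names here (the kvstore package, ConsistentHash, NewConsistentHash) are illustrative assumptions for this article, not the API of any particular library:

package kvstore

import (
    "crypto/sha1"
    "encoding/binary"
    "sort"
    "strconv"
)

// ConsistentHash places each node at several points ("virtual nodes") on a
// hash ring; a key is owned by the first node clockwise from the key's hash.
type ConsistentHash struct {
    virtualNodes int               // virtual nodes per physical node
    ring         []uint32          // sorted hash positions
    owners       map[uint32]string // hash position -> physical node
}

func NewConsistentHash(virtualNodes int) *ConsistentHash {
    return &ConsistentHash{
        virtualNodes: virtualNodes,
        owners:       make(map[uint32]string),
    }
}

func hashKey(s string) uint32 {
    sum := sha1.Sum([]byte(s))
    return binary.BigEndian.Uint32(sum[:4])
}

// AddNode adds the node's virtual points to the ring; only ~1/N of keys move.
func (c *ConsistentHash) AddNode(node string) {
    for i := 0; i < c.virtualNodes; i++ {
        h := hashKey(node + "#" + strconv.Itoa(i))
        c.owners[h] = node
        c.ring = append(c.ring, h)
    }
    sort.Slice(c.ring, func(i, j int) bool { return c.ring[i] < c.ring[j] })
}

// GetNode returns the node that owns the given key.
func (c *ConsistentHash) GetNode(key string) string {
    h := hashKey(key)
    i := sort.Search(len(c.ring), func(i int) bool { return c.ring[i] >= h })
    if i == len(c.ring) {
        i = 0 // wrap around the ring
    }
    return c.owners[c.ring[i]]
}

Virtual nodes are what keep the distribution even: each physical node appears at many points on the ring, so hot spots from an unlucky hash placement are smoothed out.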
2. Replication Strategy
For high availability, we replicate each key to N nodes. Common placement approaches are listed below, followed by a preference-list sketch.
Replication approaches:
- Simple: Next N nodes on hash ring
- Rack-aware: Different racks/data centers
- Zone-aware: Different availability zones
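Continuing the hash-ring sketch above (same assumed ConsistentHash type), the simple approach becomes a preference list: walk clockwise from the key's position and collect the next N distinct physical nodes. A rack- or zone-aware variant would additionally skip candidates that share a rack or zone with a node already chosen:

// GetNodes returns up to n distinct physical nodes for a key by walking
// clockwise from the key's hash ("next N nodes on the hash ring").
func (c *ConsistentHash) GetNodes(key string, n int) []string {
    if len(c.ring) == 0 || n <= 0 {
        return nil
    }
    start := sort.Search(len(c.ring), func(i int) bool { return c.ring[i] >= hashKey(key) })
    seen := make(map[string]bool)
    nodes := make([]string, 0, n)
    for i := 0; i < len(c.ring) && len(nodes) < n; i++ {
        node := c.owners[c.ring[(start+i)%len(c.ring)]]
        if !seen[node] { // skip further virtual nodes of an already-chosen node
            seen[node] = true
            nodes = append(nodes, node)
        }
    }
    return nodes
}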
3. Consistency Levels
Different operations require different consistency guarantees:
Write Consistency Levels:
- ONE: Write completes after one replica acknowledges
- QUORUM: Write completes after a majority of replicas acknowledge
- ALL: Write completes after all replicas acknowledge
Read Consistency Levels:
- ONE: Return after reading from one replica
- QUORUM: Return the most recent value from a majority of replicas
- ALL: Return after reading from all replicas
Quorum Formula: With N replicas, a write quorum of W, and a read quorum of R, choosing R + W > N guarantees that every read set overlaps every write set, so a read always sees the latest acknowledged write (e.g. N = 3, W = 2, R = 2).
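A sketch of a QUORUM write under those definitions: fan the write out to all N replicas in parallel and succeed once W = N/2 + 1 of them acknowledge. The Replica interface is a hypothetical stand-in for whatever client the storage nodes expose:

package kvstore

import "fmt"

// Replica is an assumed client interface to a single storage node.
type Replica interface {
    Put(key, value string) error
}

// writeQuorum sends the write to every replica in parallel and succeeds once
// a majority acknowledges; it fails as soon as a majority is out of reach.
func writeQuorum(replicas []Replica, key, value string) error {
    n := len(replicas)
    w := n/2 + 1 // write quorum (majority)
    acks := make(chan error, n)
    for _, r := range replicas {
        go func(r Replica) { acks <- r.Put(key, value) }(r)
    }
    ok, failed := 0, 0
    for i := 0; i < n; i++ {
        if err := <-acks; err != nil {
            failed++
        } else {
            ok++
        }
        if ok >= w {
            return nil // quorum reached; stragglers finish in the background
        }
        if failed > n-w {
            break // too many failures for a majority to ever acknowledge
        }
    }
    return fmt.Errorf("write quorum not reached: %d of %d acks", ok, w)
}

The writeQuorum/readQuorum methods in the implementation section below could follow this same pattern, with the read side additionally returning the most recent of the values it collects.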
4. Versioning & Conflict Resolution
We use vector clocks to track causality:
Client A writes v1: [A:1]
Client B writes v2: [B:1]
Client A writes v3: [A:2, B:1]
Conflict detected: v1 and v2 are concurrent
Resolution: Last-write-wins or application-level merge
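A minimal vector clock in Go that matches the example above; the type and method names are illustrative. Two versions conflict exactly when neither clock descends from the other:

// VectorClock maps a client/node ID to the number of updates it has made.
type VectorClock map[string]int

// Tick records one update by the given client (e.g. A: [A:1] -> [A:2]).
func (v VectorClock) Tick(client string) { v[client]++ }

// Descends reports whether v has seen every update recorded in other,
// i.e. v is causally equal to or later than other.
func (v VectorClock) Descends(other VectorClock) bool {
    for client, n := range other {
        if v[client] < n {
            return false
        }
    }
    return true
}

// Concurrent reports a conflict: neither clock descends from the other,
// as with v1 = [A:1] and v2 = [B:1] above.
func Concurrent(a, b VectorClock) bool {
    return !a.Descends(b) && !b.Descends(a)
}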
5. Failure Handling
Techniques for high availability:
- Sloppy Quorum: Accept writes on the first N healthy nodes instead of insisting on the exact home replicas
- Hinted Handoff: Store the write temporarily on a stand-in node, with a hint naming the intended replica (sketched below)
- Read Repair: Fix inconsistencies detected while serving reads
- Anti-entropy: Background process that compares and syncs replicas
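To make hinted handoff concrete, here is a small sketch with hypothetical types: when a home replica is unreachable, a stand-in node accepts the write along with a hint naming the intended owner, and replays it once that owner is healthy again:

// Hint is a write accepted on behalf of an unreachable replica.
type Hint struct {
    IntendedNode string // replica that should eventually own this write
    Key, Value   string
}

// HintStore is a hypothetical local queue of hints held by the stand-in node.
type HintStore struct {
    hints []Hint
}

func (h *HintStore) Add(hint Hint) { h.hints = append(h.hints, hint) }

// Replay delivers stored writes back to their intended owners once they are
// reachable again; anything undelivered stays queued for the next pass.
func (h *HintStore) Replay(isUp func(node string) bool, deliver func(Hint) error) {
    var remaining []Hint
    for _, hint := range h.hints {
        if isUp(hint.IntendedNode) && deliver(hint) == nil {
            continue // delivered successfully, drop the hint
        }
        remaining = append(remaining, hint)
    }
    h.hints = remaining
}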
Storage Engine
LSM-Tree (Log-Structured Merge Tree)
Most distributed KV stores use LSM-trees for write optimization:
┌─────────────┐
│  MemTable   │ ← In-memory writes (sorted)
└──────┬──────┘
       │ Flush when full
┌──────▼──────┐
│   SSTable   │ ← Immutable on-disk files
│  (Level 0)  │
└──────┬──────┘
       │ Compact & merge
┌──────▼──────┐
│   SSTable   │
│  (Level 1)  │
└─────────────┘
Write Path:
- Write to commit log (durability)
- Write to MemTable (fast access)
- Flush to SSTable when full
- Background compaction
Read Path (sketched in Go after this list):
- Check MemTable
- Check Bloom filters
- Binary search in SSTables
- Return value or not found
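The read path can be sketched roughly as follows; MemTable, BloomFilter, and SSTable are assumed interfaces standing in for the real storage structures:

// Assumed interfaces for the pieces of the read path.
type MemTable interface {
    Get(key string) (value string, ok bool)
}

type BloomFilter interface {
    MightContain(key string) bool
}

type SSTable interface {
    Bloom() BloomFilter
    Lookup(key string) (value string, ok bool) // binary search in the sorted file
}

type DB struct {
    memtable MemTable
    sstables []SSTable // ordered newest level first
}

// Get checks the MemTable, then each SSTable's Bloom filter, and only touches
// a file on disk when the filter says the key might be present there.
func (db *DB) Get(key string) (string, bool) {
    if v, ok := db.memtable.Get(key); ok {
        return v, true
    }
    for _, sst := range db.sstables {
        if !sst.Bloom().MightContain(key) {
            continue // key definitely absent from this file; skip the disk read
        }
        if v, ok := sst.Lookup(key); ok {
            return v, true
        }
    }
    return "", false // not found
}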
Auto-scaling
Monitor metrics and scale automatically:
func autoScale(metrics Metrics) {
    if metrics.CPUUsage > 80 || metrics.LatencyP99 > 100 {
        addNode() // scale out when CPU or tail latency crosses its threshold
    } else if metrics.CPUUsage < 20 && nodeCount > minNodes {
        removeNode() // scale in, but never drop below the minimum node count
    }
}
Scaling triggers:
- CPU usage > 80%
- Memory usage > 85%
- P99 latency > threshold
- Disk usage > 80%
Performance Optimizations
1. Bloom Filters
Skip unnecessary disk reads:
if !bloomFilter.MightContain(key) {
    return nil, NotFound // key definitely absent from this SSTable; skip the disk read
}
2. Compression
Reduce storage and network I/O:
- Snappy for speed
- Zstandard for ratio
- LZ4 for balance
3. Caching
Multi-level caching strategy (a key-cache sketch follows the list):
- Row cache: Full key-value pairs
- Block cache: SSTable blocks
- Key cache: Key → disk location
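As one example, the key cache (key → disk location) can be a small fixed-capacity LRU. This sketch uses Go's container/list; the DiskLocation and KeyCache types are illustrative:

package kvstore

import "container/list"

// DiskLocation points at a key's entry inside an SSTable (illustrative).
type DiskLocation struct {
    File   string
    Offset int64
}

type entry struct {
    key string
    loc DiskLocation
}

// KeyCache is a fixed-capacity LRU from key to on-disk location.
type KeyCache struct {
    capacity int
    order    *list.List // front = most recently used
    items    map[string]*list.Element
}

func NewKeyCache(capacity int) *KeyCache {
    return &KeyCache{capacity: capacity, order: list.New(), items: make(map[string]*list.Element)}
}

func (c *KeyCache) Get(key string) (DiskLocation, bool) {
    if el, ok := c.items[key]; ok {
        c.order.MoveToFront(el)
        return el.Value.(*entry).loc, true
    }
    return DiskLocation{}, false
}

func (c *KeyCache) Put(key string, loc DiskLocation) {
    if el, ok := c.items[key]; ok {
        el.Value.(*entry).loc = loc
        c.order.MoveToFront(el)
        return
    }
    if c.order.Len() >= c.capacity { // evict the least recently used entry
        if oldest := c.order.Back(); oldest != nil {
            c.order.Remove(oldest)
            delete(c.items, oldest.Value.(*entry).key)
        }
    }
    c.items[key] = c.order.PushFront(&entry{key: key, loc: loc})
}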
4. Batch Operations
Reduce network round trips:
batch := NewBatch()
batch.Put("key1", "value1")
batch.Put("key2", "value2")
batch.Delete("key3")
db.Write(batch)
Monitoring & Operations
Key Metrics
- Latency: P50, P95, P99
- Throughput: Reads/writes per second
- Availability: Uptime percentage
- Consistency: Divergence between replicas
- Storage: Disk usage, compaction stats
Health Checks
func healthCheck() HealthStatus {
    return HealthStatus{
        Replication: checkReplicationLag(),
        Storage:     checkDiskSpace(),
        Network:     checkNodeConnectivity(),
        Load:        checkLoadDistribution(),
    }
}
Implementation Example
type KVStore struct {
    ring              *ConsistentHash
    storage           StorageEngine
    replicator        Replicator
    coordinator       Coordinator
    replicationFactor int // N: number of replicas per key
}

func (kv *KVStore) Put(key, value string, consistency Level) error {
    // Find the replica nodes responsible for this key
    nodes := kv.ring.GetNodes(key, kv.replicationFactor)

    // Write based on the requested consistency level
    switch consistency {
    case ONE:
        return kv.writeOne(nodes, key, value)
    case QUORUM:
        return kv.writeQuorum(nodes, key, value)
    case ALL:
        return kv.writeAll(nodes, key, value)
    default:
        return fmt.Errorf("unknown consistency level: %v", consistency)
    }
}

func (kv *KVStore) Get(key string, consistency Level) (string, error) {
    nodes := kv.ring.GetNodes(key, kv.replicationFactor)

    switch consistency {
    case ONE:
        return kv.readOne(nodes, key)
    case QUORUM:
        return kv.readQuorum(nodes, key)
    case ALL:
        return kv.readAll(nodes, key)
    default:
        return "", fmt.Errorf("unknown consistency level: %v", consistency)
    }
}
Trade-offs & Decisions
| Decision | Option A | Option B | Our Choice |
|----------|----------|----------|------------|
| Consistency Model | Strong (CP) | Eventual (AP) | Tunable (Both) |
| Partitioning | Range-based | Hash-based | Consistent Hash |
| Storage Engine | B-tree | LSM-tree | LSM-tree |
| Replication | Master-slave | Peer-to-peer | Peer-to-peer |
| Conflict Resolution | Last-write-wins | Vector clocks | Vector clocks |
Real-World Examples
Amazon DynamoDB
- Eventually consistent by default
- Consistent hashing with virtual nodes
- Multi-master replication
- Auto-scaling based on traffic
Apache Cassandra
- Tunable consistency (ONE to ALL)
- Gossip protocol for membership
- LSM-tree storage engine
- No single point of failure
Redis Cluster
- Master-slave replication
- 16384 hash slots
- Synchronous replication option
- In-memory with persistence
Next Steps
- Implement a simple distributed KV store
- Learn about consensus algorithms
- Explore distributed transactions