Distributed Key-Value Store
Building a distributed key-value store requires understanding fundamental trade-offs in distributed systems. The CAP theorem underlies most of these design decisions, so let's explore the trade-offs before diving into the architecture.
Key Insights
Why P is Essential
Network partitions are inevitable in distributed systems. Cables break, switches fail, and data centers lose connectivity. Therefore, you must choose between C and A.
CA is Theoretical
CA systems work only in non-distributed environments. Once you distribute across multiple nodes, network partitions become possible, making CA impractical.
Now that we understand the fundamental constraints, let's design a system that can scale to handle billions of key-value pairs across hundreds of nodes while making informed decisions about these trade-offs.
Requirements
Our distributed key-value store must handle:
- Small key-value pairs: Less than 10 KB each
- Big data: Store and retrieve billions of items
- High availability: Respond quickly even during failures
- High scalability: Scale horizontally with traffic
- Auto-scaling: Add/remove nodes automatically
- Tunable consistency: Choose between strong and eventual consistency
- Low latency: Sub-millisecond response times
CAP Theorem: The Fundamental Trade-off
The CAP theorem, conjectured by Eric Brewer and later proved by Gilbert and Lynch, states that any distributed system can only guarantee two out of three properties:
- Consistency (C): All nodes see the same data at the same time - reads receive the most recent write
- Availability (A): System remains operational - every request receives a response (success or failure)
- Partition Tolerance (P): System continues operating despite network failures between nodes
Why You Must Choose
In real-world distributed systems, network partitions are inevitable. Cables break, switches fail, data centers lose connectivity, and network congestion causes delays. This means:
You cannot avoid P, so you must choose between C and A
The Two Practical Choices
CP Systems (Consistency + Partition Tolerance)
- Strategy: Reject requests when consistency cannot be guaranteed
- Examples: MongoDB, HBase, Redis Cluster, Consul, Zookeeper
- Use Cases: Financial transactions, inventory management, configuration stores
- Trade-off: System may become unavailable during partitions
AP Systems (Availability + Partition Tolerance)
- Strategy: Accept requests even if data might be inconsistent
- Examples: Cassandra, DynamoDB, CouchDB, Riak, DNS
- Use Cases: Social media feeds, content delivery, analytics, user preferences
- Trade-off: Data may be temporarily inconsistent across nodes
Why CA Systems Don't Work
CA Systems (Traditional RDBMS in single location) work only when:
- All data is on one machine, OR
- All machines are in the same rack with perfect network
Once you distribute across multiple locations, network partitions become possible, making pure CA systems impractical for distributed architectures.
Architecture Overview
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Client │ │ Client │ │ Client │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │
└───────────────────┴───────────────────┘
│
┌──────▼──────┐
│ Load Balancer│
└──────┬──────┘
│
┌───────────────────┼───────────────────┐
│ │ │
┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
│ API Server │ │ API Server │ │ API Server │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │
└───────────────────┴───────────────────┘
│
┌──────▼──────┐
│ Coordinator │
└──────┬──────┘
│
┌───────────────────┼───────────────────┐
│ │ │
┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
│Storage Node1│ │Storage Node2│ │Storage Node3│
│ (Replica) │ │ (Replica) │ │ (Replica) │
└─────────────┘ └─────────────┘ └─────────────┘
Core Components
1. Data Partitioning
We use consistent hashing to distribute data across nodes:
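A minimal hash ring sketch in Go shows the idea (the type and function names here are illustrative, not from a specific library): node positions are hashed onto a ring, and each key belongs to the first node clockwise from the key's hash.
import (
    "fmt"
    "hash/fnv"
    "sort"
)

// consistentHash places nodes (and virtual nodes) on a hash ring so that
// adding or removing a node only remaps the keys adjacent to it.
type consistentHash struct {
    ring   []uint32          // sorted positions on the ring
    owners map[uint32]string // position -> physical node
}

func hashKey(s string) uint32 {
    h := fnv.New32a()
    h.Write([]byte(s))
    return h.Sum32()
}

func newConsistentHash(nodes []string, virtualNodes int) *consistentHash {
    c := &consistentHash{owners: make(map[uint32]string)}
    for _, n := range nodes {
        // Virtual nodes smooth the key distribution across physical nodes.
        for v := 0; v < virtualNodes; v++ {
            pos := hashKey(fmt.Sprintf("%s#%d", n, v))
            c.ring = append(c.ring, pos)
            c.owners[pos] = n
        }
    }
    sort.Slice(c.ring, func(i, j int) bool { return c.ring[i] < c.ring[j] })
    return c
}

// getNode returns the node that owns key: the first ring position at or
// after the key's hash, wrapping around to the start of the ring.
func (c *consistentHash) getNode(key string) string {
    pos := hashKey(key)
    i := sort.Search(len(c.ring), func(i int) bool { return c.ring[i] >= pos })
    if i == len(c.ring) {
        i = 0
    }
    return c.owners[c.ring[i]]
}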
Benefits:
- Minimal data movement when nodes join/leave
- Even distribution of keys
- Each node manages ~1/N of total data
2. Database Scaling Fundamentals
Before diving into our replication strategy, we need to understand database scaling concepts. For a comprehensive guide on replication vs sharding, MySQL scaling with Vitess, multi-master setups, and cross-shard join challenges, see:
📚 Database Concepts: Scaling Strategies
Essential reading covering replication, sharding, MySQL vs PostgreSQL, Vitess architecture, multi-master replication, write scaling challenges, and cross-shard join patterns.
Quick Summary for our context:
- Sharding: Splits data across nodes for storage/write scaling
- Replication: Copies data across nodes for availability
- Combined: Our KV store uses both - sharding for scale + replication for fault tolerance
1. Replication (MySQL's Traditional Approach):
┌─────────────┐ Async/Semi-sync ┌─────────────┐
│ Master │ ──────────────────────> │ Slave 1 │
│ (Writes) │ │ (Reads) │
└─────────────┘ ──────────────────────> └─────────────┘
┌─────────────┐
│ Slave 2 │
│ (Reads) │
└─────────────┘
- Master handles all writes
- Slaves handle read queries
- Asynchronous by default (lag possible)
- Semi-synchronous ensures at least one slave has data
2. Sharding (MySQL Cluster/NDB or Application-level):
Application determines shard key (e.g., user_id % 3)
│
┌────┴────┬─────────┬─────────┐
│ │ │ │
┌───▼───┐ ┌──▼───┐ ┌──▼───┐ ┌──▼───┐
│Shard 1│ │Shard 2│ │Shard 3│ │Shard 4│
│US-West│ │US-East│ │Europe │ │ Asia │
└───────┘ └───────┘ └───────┘ └───────┘
- Often implemented at application level
- MySQL Fabric for automated sharding
- Vitess for large-scale sharding
PostgreSQL
1. Replication (Streaming Replication):
┌─────────────┐ WAL Streaming ┌─────────────┐
│ Primary │ ──────────────────────>│ Standby 1 │
│ │ │ (Hot) │
└─────────────┘ ──────────────────────>└─────────────┘
┌─────────────┐
│ Standby 2 │
│ (Warm) │
└─────────────┘
- Write-Ahead Log (WAL) shipping
- Synchronous or asynchronous
- Hot standby can serve read queries
- Automatic failover with tools like Patroni
2. Sharding (PostgreSQL 10+ with Partitioning):
┌─────────────────────────────┐
│ Users Table (Parent) │
└──────────────┬──────────────┘
│ Declarative Partitioning
┌──────────┼──────────┬──────────┐
│ │ │ │
┌───▼────┐ ┌──▼────┐ ┌──▼────┐ ┌──▼────┐
│Users_Q1│ │Users_Q2│ │Users_Q3│ │Users_Q4│
│Jan-Mar │ │Apr-Jun │ │Jul-Sep │ │Oct-Dec │
└────────┘ └────────┘ └────────┘ └────────┘
- Built-in table partitioning
- Foreign Data Wrappers for cross-node sharding
- Citus extension for true distributed sharding
MySQL vs PostgreSQL: Replication & Sharding Comparison
| Feature | MySQL | PostgreSQL |
|---|---|---|
| Replication | | |
| Primary Method | Master-Slave (Binary Log) | Streaming Replication (WAL) |
| Replication Types | Asynchronous (default) • Semi-synchronous • Group Replication | Asynchronous (default) • Synchronous • Logical Replication |
| Multi-Master | ✓ Yes (MySQL Cluster, Galera) | ⚠ Limited (BDR, pgLogical) |
| Failover | Manual / MHA / Orchestrator | pg_auto_failover / Patroni |
| Read Replicas | ✓ Unlimited | ✓ Unlimited + Hot Standby |
| Sharding | | |
| Native Support | ❌ No (application-level) | ⚠ Partitioning only (table-level) |
| Sharding Solutions | Vitess • MySQL Fabric • ProxySQL • Spider Engine | Citus • postgres_fdw • PL/Proxy • pg_shard (deprecated) |
| Cross-Shard Queries | Complex (app handles joins) | Better (FDW, Citus distributed queries) |
| Auto-Sharding | ✓ Vitess | ✓ Citus |
| Performance & Scale | | |
| Replication Lag | Variable (binary log based) | Lower (WAL streaming) |
| Write Scaling | Good (with sharding) | Good (with Citus) |
| Best For | Web applications • Read-heavy workloads • Simple schemas | Complex queries • Data integrity critical • Analytics workloads |
Key Takeaways
MySQL Strengths:
- Mature ecosystem with tools like Vitess for large-scale sharding
- True multi-master replication with Galera Cluster
- Simpler replication setup for basic use cases
- Better suited for simple, high-volume web applications
PostgreSQL Strengths:
- Superior built-in partitioning capabilities
- More reliable replication with WAL streaming
- Better cross-shard query capabilities with FDW
- Stronger consistency guarantees and data integrity
When to Choose:
- MySQL + Vitess: Large-scale web apps needing horizontal scaling
- PostgreSQL + Citus: Complex analytics with distributed queries
- MySQL Galera: Multi-master writes across regions
- PostgreSQL Streaming: High-availability with strong consistency
Deep Dive: Vitess - YouTube's Gift to MySQL Scaling
Vitess is a database clustering system for horizontal scaling of MySQL. Originally developed at YouTube to handle massive growth, it's now used by companies like Slack, Square, and GitHub.
How Vitess Works
┌─────────────────────────────────────────────────────────────┐
│ Application │
└───────────────────────┬─────────────────────────────────────┘
│ gRPC/MySQL Protocol
┌───────────────────────▼─────────────────────────────────────┐
│ VTGate (Proxy) │
│ • Query routing • Connection pooling │
│ • Query rewriting • Scatter-gather engine │
└───────────────┬───────────────┬─────────────────┬──────────┘
│ │ │
┌───────▼────┐ ┌──────▼────┐ ┌──────▼────┐
│ VTTablet │ │ VTTablet │ │ VTTablet │
│ (Primary) │ │ (Primary)│ │ (Primary)│
│ Shard -80 │ │ Shard 80-│ │ Shard c0- │
└─────┬──────┘ └─────┬────┘ └─────┬─────┘
│ │ │
┌─────▼────┐ ┌─────▼───┐ ┌─────▼────┐
│ MySQL │ │ MySQL │ │ MySQL │
│ Primary │ │ Primary │ │ Primary │
└─────┬────┘ └────┬────┘ └────┬─────┘
│ │ │
┌─────▼────┐ ┌────▼────┐ ┌────▼─────┐
│ Replica │ │ Replica │ │ Replica │
└──────────┘ └─────────┘ └──────────┘
Components:
• VTGate: Stateless proxy, routes queries to correct shards
• VTTablet: Controls MySQL instance, handles replication
• Topology Service: Stores metadata (uses etcd/Consul/ZooKeeper)
Vitess Auto-Sharding Process
1. Initial State - Unsharded
┌────────────────────┐
│ Unsharded DB │
│ users (100GB) │
│ All data in one │
│ MySQL │
└────────────────────┘
2. Prepare for Sharding
Step 1: Define Sharding Key (e.g., user_id)
Step 2: Create VSchema
{
"sharded": true,
"vindexes": {
"hash": {
"type": "hash"
}
},
"tables": {
"users": {
"column_vindexes": [{
"column": "user_id",
"name": "hash"
}]
}
}
}
3. Split Process (Zero-Downtime)
Phase 1: Create Target Shards
┌──────────────┐ ┌─────────────┐ ┌─────────────┐
│ Source DB │ │ Shard -80 │ │ Shard 80- │
│ (All Data) │ ──> │ (user_id │ │ (user_id │
│ │ │ hash < 80) │ │ hash >= 80)│
└──────────────┘ └─────────────┘ └─────────────┘
▲ ▲
│ │
└─────────┬───────────┘
│
VReplication
(Filtered Copy)
Phase 2: Switch Reads (instant)
Phase 3: Switch Writes (brief lock)
Phase 4: Clean up source
4. Resharding (Splitting Further)
Before: 2 shards [-80, 80-]
After: 4 shards [-40, 40-80, 80-c0, c0-]
┌─────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ Shard │ │ Shard │ │ Shard │ │ Shard │ │ Shard │
│ -80 │ ──> │ -40 │ │ 40-80 │ │ 80-c0 │ │ c0- │
└─────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘
│
└── Splits into 2 shards automatically
Key Vitess Features
1. Query Routing
-- Application sends:
SELECT * FROM users WHERE user_id = 12345;
-- VTGate determines shard:
user_id 12345 → hash → shard 40-80
-- Routes to correct MySQL:
/* vtgate:: keyspace:users shard:40-80 */
SELECT * FROM users WHERE user_id = 12345;
2. Scatter-Gather for Cross-Shard Queries
-- Query needing all shards:
SELECT COUNT(*) FROM users WHERE status = 'active';
-- VTGate scatter phase:
┌────────┐ ┌────────┐ ┌────────┐
│Shard -40│ │Shard 40-│ │Shard 80-│
│COUNT: 25│ │COUNT: 30│ │COUNT: 45│
└────────┘ └────────┘ └────────┘
│ │ │
└───────────────┼───────────────┘
▼
VTGate Gather
Total: 100
Deep Dive: Multi-Master Replication
Multi-master replication allows writes to any node, unlike traditional master-slave setups. Let's explore the two main approaches:
1. MySQL Group Replication / Galera Cluster
Architecture:
┌─────────────────────────────────────────────────────────────┐
│ Write to ANY Node │
└─────────────┬──────────────┬──────────────┬────────────────┘
│ │ │
┌────▼────┐ ┌────▼────┐ ┌────▼────┐
│ Node 1 │ │ Node 2 │ │ Node 3 │
│ Primary │◄───│ Primary │───►│ Primary │
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
└──────────────┼──────────────┘
│
Group Communication
(Consensus Protocol)
How Writes Work:
1. Client writes to Node 1
UPDATE users SET status='active' WHERE id=123;
2. Certification Process:
┌─────────┐ ┌─────────────────┐ ┌──────────┐
│ Node 1 │────►│ Broadcast Write │────►│ Node 2,3 │
│(Origin) │ │ Transaction │ │(Certify) │
└─────────┘ └─────────────────┘ └──────────┘
│ │
│ ┌─────────────────┐ │
└──────────│ Conflict Check │◄─────────┘
│ - Same row? │
│ - Write order? │
└────────┬────────┘
│
┌────────▼────────┐
│ Commit or Abort │
└─────────────────┘
3. If certified: All nodes apply in same order
4. If conflict: Transaction rolled back
Conflict Resolution Example:
Simultaneous writes to same row:
Node 1: UPDATE users SET age=25 WHERE id=1; (Time: 10:00:00.100)
Node 2: UPDATE users SET age=26 WHERE id=1; (Time: 10:00:00.101)
Resolution:
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Node 1 │ │ Node 2 │ │ Node 3 │
│ age=25 │ │ age=26 │ │ Arbiter │
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
└───────────────┼────────────────┘
▼
Global Order Applied
(First write wins)
All nodes: age=25
2. Application-Level Multi-Master
Some systems implement multi-master at the application level using techniques like CRDTs (Conflict-free Replicated Data Types):
Shopping Cart Example - Merge on Read:
Node A (US): Node B (EU):
┌──────────────┐ ┌──────────────┐
│ Cart_User123 │ │ Cart_User123 │
│ + iPhone │ │ + iPad │
│ + AirPods │ │ + MacBook │
└──────────────┘ └──────────────┘
│ │
└───────────┬───────────────┘
▼
Merge Strategy
┌──────────────┐
│ Cart_User123 │
│ + iPhone │
│ + AirPods │
│ + iPad │
│ + MacBook │
└──────────────┘
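One way to implement this merge-on-read is with a grow-only set, the simplest CRDT. The sketch below is illustrative Go (the gSet type and merge function are assumptions, not a specific library): each region adds items to its own replica, and a read merges replicas by set union so no update is lost.
// gSet is a grow-only set: the only mutation is Add, so any two replicas
// can always be reconciled by taking their union.
type gSet map[string]struct{}

func (s gSet) Add(item string) { s[item] = struct{}{} }

// merge returns the union of two replicas. The operation is commutative,
// associative, and idempotent, so the merge order never matters.
func merge(a, b gSet) gSet {
    out := gSet{}
    for item := range a {
        out[item] = struct{}{}
    }
    for item := range b {
        out[item] = struct{}{}
    }
    return out
}

// Example: the US replica {iPhone, AirPods} merged with the EU replica
// {iPad, MacBook} yields all four items, matching the carts above.
Multi-Master Trade-offs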
✅ Advantages
- Write anywhere - no single point of failure
- Better geographic distribution
- Read scaling across all nodes
- Automatic failover
❌ Challenges
- Write conflicts need resolution
- Lower write throughput (consensus overhead)
- Complex debugging
- Network partitions can cause split-brain
When to Use Multi-Master
Good Fit:
- Geographic distribution with local writes
- High availability requirements
- Read-heavy workloads
- Systems that can tolerate/resolve conflicts
Poor Fit:
- High-volume writes to same data
- Strong consistency requirements
- Complex transactions
- Limited network bandwidth
Vitess vs Multi-Master: When Do You Need Both?
While Vitess solves horizontal scaling through sharding, multi-master replication addresses a different problem: write availability and geographic distribution. Let's break down when you need each:
Vitess Solves: Data Volume Scaling
Problem: Single database can't handle data size
┌─────────────────────────────────────────┐
│ Single MySQL Instance │
│ users table: 500GB, 1B records │
│ 💥 Too big for one machine │
└─────────────────────────────────────────┘
Vitess Solution: Horizontal partitioning
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Shard 1 │ │ Shard 2 │ │ Shard 3 │
│ users │ │ users │ │ users │
│ (1-333M) │ │ (334-666M) │ │ (667M-1B) │
└─────────────┘ └─────────────┘ └─────────────┘
Multi-Master Solves: Write Availability & Geography
Problem: Single write point fails or is too far
┌─────────────────────────────────────────┐
│ US West (Primary) │
│ All writes must go here │
│ 💥 Far from EU/Asia users │
│ 💥 Single point of failure │
└─────────────────────────────────────────┘
Multi-Master Solution: Write anywhere
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ US West │ │ EU Central │ │ Asia Pac │
│ (Master) │ │ (Master) │ │ (Master) │
│ Low latency│ │ Low latency│ │ Low latency│
│ for US │ │ for EU │ │ for Asia │
└─────────────┘ └─────────────┘ └─────────────┘
The Reality: You Often Need BOTH
Here's how real-world systems combine them:
Example: Global Social Media Platform
Geographic Multi-Master Setup
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ US Region │ │ EU Region │ │ Asia Region │
│ │ │ │ │ │
│ ┌─────┐ │ │ ┌─────┐ │ │ ┌─────┐ │
│ │User │ │◄──►│ │User │ │◄──►│ │User │ │
│ │Posts│ │ │ │Posts│ │ │ │Posts│ │
│ └─────┘ │ │ └─────┘ │ │ └─────┘ │
│ │ │ │ │ │ │ │ │
│ Vitess │ │ Vitess │ │ Vitess │
│ Sharding │ │ Sharding │ │ Sharding │
│ │ │ │ │ │ │ │ │
│ ┌───┬─┴─┬───┐│ │ ┌───┬─┴─┬───┐│ │ ┌───┬─┴─┬───┐│
│ │S1 │S2 │S3 ││ │ │S1 │S2 │S3 ││ │ │S1 │S2 │S3 ││
│ └───┴───┴───┘│ │ └───┴───┴───┘│ │ └───┴───┴───┘│
└──────────────┘ └──────────────┘ └──────────────┘
Each region:
- Multi-Master: Handles regional writes, replicates globally
- Vitess: Shards data within each region for scale
Decision Matrix
| Scenario | Vitess | Multi-Master | Both |
|---|---|---|---|
| Single region, large dataset | ✓ Perfect | Not needed | Overkill |
| Multi-region, small dataset | Not needed | ✓ Perfect | Overkill |
| Global scale, large dataset | Helps | Helps | ✓ Best |
| High write availability needed | Doesn't help | ✓ Perfect | If also large |
| Latency-sensitive writes | Doesn't help | ✓ Perfect | If also large |
Real-World Examples
Vitess Only:
YouTube Analytics Dashboard
- Massive historical data (petabytes)
- Complex aggregation queries
- Single region acceptable
- Vitess shards by time/user_id
Multi-Master Only:
Global User Preferences Service
- Small dataset (user settings)
- Needs local writes everywhere
- Galera Cluster across regions
- No sharding needed
Both Together:
Instagram/Facebook
- Massive media storage (Vitess for sharding)
- Global user base (Multi-master for regions)
- Each region shards its data locally
- Cross-region replication for user data
MySQL Write Throughput Challenge
Critical Point: Even with Vitess, MySQL's fundamental limitation remains - writes only happen on the primary node. This creates a single point of write operations per shard.
For scaling write throughput in a single region, you have limited options:
⚠️ MySQL Write Scaling Options
1. Multi-Master Replication
- Multiple nodes accept writes
- True write throughput scaling
- Conflict resolution complexity
- Network partition challenges
2. Alternative Approaches
- Write buffering/batching
- Asynchronous processing
- Move to NoSQL (MongoDB, Cassandra)
- Event sourcing patterns
Reality Check:
Vitess Sharding: Multi-Master:
┌─────────────────────┐ ┌─────────────────────┐
│ Shard 1: [Primary] │ │ Master 1: ✓ Writes │
│ [Replica] │ VS │ Master 2: ✓ Writes │
│ Shard 2: [Primary] │ │ Master 3: ✓ Writes │
│ [Replica] │ │ (All can write) │
└─────────────────────┘ └─────────────────────┘
↑ ↑
Single writer per shard        Multiple concurrent writers
Key Insights
Vitess Limitations:
- Still has single write point per shard
- Cross-region writes still slow
- Master failure affects that shard's writes
- Write throughput = sum of shard primaries
Multi-Master Limitations:
- Doesn't solve data volume scaling
- Complex conflict resolution
- Write performance decreases with scale
- But scales write throughput linearly
Why Both:
Multi-Master handles: Vitess handles:
┌─────────────────────┐ ┌─────────────────────┐
│ • Geographic writes │ + │ • Data volume │
│ • Write availability│ │ • Query distribution│
│ • Network latency │ │ • Storage scaling │
└─────────────────────┘ └─────────────────────┘
│ │
└────────────┬───────────────┘
▼
Complete Distributed Solution
Cross-Shard Joins: The Efficiency Challenge
When using sharded systems like Vitess, cross-shard joins are problematic and should be avoided when possible.
How Cross-Shard Joins Work
⚠️ Cross-Shard Join Process
-- Query: Get user details with their recent orders
SELECT u.name, u.email, o.order_id, o.total
FROM users u
JOIN orders o ON u.user_id = o.user_id
WHERE u.region = 'US'
What Vitess has to do:
1. Scatter Phase:
┌─────────────────────────────────────────────────────┐
│ Query sent to ALL shards (because we don't know │
│ which shards contain matching users/orders) │
└─────────────────────────────────────────────────────┘
↓
Shard 1 Shard 2 Shard 3 Shard 4
┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐
│Users│ │Users│ │Users│ │Users│
│Orders│ │Orders│ │Orders│ │Orders│
└─────┘ └─────┘ └─────┘ └─────┘
2. Gather Phase:
┌─────────────────────────────────────────────────────┐
│ Collect partial results from each shard │
│ Vitess proxy performs the JOIN operation │
│ Sort, filter, and return final results │
└─────────────────────────────────────────────────────┘
Efficiency Problems
🐌 Performance Issues
- Network overhead: Data from all shards
- Memory usage: Vitess proxy holds all data
- No index usage: Can't use MySQL join optimizations
- Serial processing: Wait for slowest shard
- Bandwidth waste: Moving large datasets
📊 Complexity Issues
- Query planning: Vitess must understand joins
- Transaction boundaries: ACID across shards
- Deadlocks: Cross-shard dependencies
- Limited SQL support: Not all joins supported
- Debugging difficulty: Multi-shard query plans
Vitess Join Support Levels
Same Shard Joins
Fast & efficient. Uses MySQL's native join optimization.
Cross-Shard Joins
Supported but slow. Avoid for high-traffic queries.
Complex Joins
Limited support. Subqueries, CTEs often not supported.
Design Patterns to Avoid Cross-Shard Joins
1. Denormalization:
-- Instead of joining across shards
-- users (shard by user_id) + orders (shard by user_id)
-- Store user info in orders table
CREATE TABLE orders (
order_id BIGINT,
user_id BIGINT,
user_name VARCHAR(100), -- Denormalized
user_email VARCHAR(100), -- Denormalized
total DECIMAL(10,2),
created_at TIMESTAMP
) -- Shard by user_id
2. Application-Level Joins:
// Fetch from multiple shards in application
users := fetchUsers(userIDs) // From user shards
orders := fetchOrders(userIDs) // From order shards
result := joinInMemory(users, orders) // App does the join
3. Materialized Views:
-- Pre-computed join results
CREATE TABLE user_order_summary (
user_id BIGINT,
total_orders INT,
total_spent DECIMAL(12,2),
last_order_date DATE
) -- Updated via events/batch jobs
4. Event-Driven Consistency:
User Update Event → Update orders.user_name
Order Created Event → Update user_order_summary
When Cross-Shard Joins are Acceptable
✅ Acceptable Use Cases
- Admin/Analytics queries: Low frequency, can be slow
- Small result sets: Limited data movement
- Non-critical paths: Batch processing, reports
- Development/Testing: Temporary queries
Performance Comparison:
Same-Shard Join: 1-10ms (MySQL native optimization)
Cross-Shard Join: 100-1000ms (Network + coordination overhead)
Application Join: 50-200ms (Parallel fetches + in-memory join)
Bottom Line:
- Vitess = Scale your data horizontally
- Multi-Master = Write from anywhere with high availability
- Both = Global, massive-scale applications (Facebook, Instagram, YouTube)
You choose based on your specific constraints: data size, geographic distribution, write patterns, and availability requirements.
Key Differences Summary
| Aspect | Sharding | Replication |
|---|---|---|
| Purpose | Scalability | Availability |
| Data Distribution | Different data on each node | Same data on all nodes |
| Storage Efficiency | ✓ High (data divided) | ⚠ Low (data duplicated) |
| Write Performance | 📈 Scales linearly | 🔒 Limited by slowest replica |
| Read Performance | Scales for different data | Scales for same data |
| Failure Impact | ❌ Partial data loss | ✓ No data loss |
| Complexity | High (routing, joins) | Medium (consistency) |
3. Our Replication Strategy
Now that we understand the concepts, for our distributed KV store, we'll use both techniques. First, let's look at how we replicate each key to N nodes for high availability:
Replication approaches:
- Simple: Next N nodes on hash ring
- Rack-aware: Different racks/data centers
- Zone-aware: Different availability zones
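For example, a rack-aware variant walks the ring clockwise from the key's position and prefers nodes in racks that do not yet hold a copy. A rough Go sketch (all type and function names assumed):
// ringNode is one position on the hash ring with its physical placement.
type ringNode struct {
    name string
    rack string
}

// replicasFor collects n distinct nodes clockwise from the key's owner,
// skipping nodes whose rack already holds a copy, then relaxes the rack
// constraint if there are fewer racks than replicas.
func replicasFor(ring []ringNode, start, n int) []ringNode {
    var out []ringNode
    usedNode := map[string]bool{}
    usedRack := map[string]bool{}
    for i := 0; i < len(ring) && len(out) < n; i++ {
        cand := ring[(start+i)%len(ring)]
        if usedNode[cand.name] || usedRack[cand.rack] {
            continue
        }
        usedNode[cand.name], usedRack[cand.rack] = true, true
        out = append(out, cand)
    }
    for i := 0; i < len(ring) && len(out) < n; i++ {
        cand := ring[(start+i)%len(ring)]
        if !usedNode[cand.name] {
            usedNode[cand.name] = true
            out = append(out, cand)
        }
    }
    return out
}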
4. Consistency Levels
Different operations require different consistency guarantees:
Write Consistency Levels:
- ONE: Write completes after one replica acknowledges
- QUORUM: Write completes after a majority acknowledge
- ALL: Write completes after all replicas acknowledge
Read Consistency Levels:
- ONE: Return after reading from one replica
- QUORUM: Return the most recent value from a majority
- ALL: Return after reading from all replicas
The Quorum Formula Explained
The formula R + W > N ensures strong consistency. Let's break this down:
- N = Total number of replicas (replication factor)
- W = Write consistency level (number of nodes that must acknowledge a write)
- R = Read consistency level (number of nodes that must respond to a read)
Why does R + W > N guarantee strong consistency?
When R + W > N, there's guaranteed overlap between the nodes that wrote the data and the nodes you're reading from. This means at least one node in your read set has the latest value.
Visual Example with N=3:
Write with W=2: Read with R=2:
┌─────┐ ✓ Written ┌─────┐ ✓ Read
│Node1│ │Node1│
└─────┘ └─────┘
┌─────┐ ✓ Written ┌─────┐
│Node2│ │Node2│
└─────┘ └─────┘
┌─────┐ ✗ Not yet ┌─────┐ ✓ Read
│Node3│ │Node3│
└─────┘ └─────┘
W=2 + R=2 = 4 > N=3 ✓
Overlap guaranteed: Node1 has latest data
Common Quorum Configurations:
| Config | N | W | R | R+W | Consistency | Use Case |
|--------|---|---|---|-----|-------------|----------|
| Strong | 3 | 2 | 2 | 4 > 3 | ✓ Strong | Critical data |
| Read-Heavy | 3 | 3 | 1 | 4 > 3 | ✓ Strong | Read-optimized |
| Write-Heavy | 3 | 1 | 3 | 4 > 3 | ✓ Strong | Write-optimized |
| Eventual | 3 | 1 | 1 | 2 < 3 | ✗ Eventual | Performance-critical |
Trade-offs:
- Higher W: Slower writes, better durability
- Higher R: Slower reads, fresher data
- W=N: Maximum durability, writes fail if any node is down
- R=N: Always read latest, reads fail if any node is down
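To make the R + W > N rule and the consistency levels above concrete, here is a rough quorum-write sketch in Go; the Replica interface and error handling are assumptions for illustration, not a particular system's API.
import "errors"

// Replica is a minimal stand-in for one node holding a copy of the key.
type Replica interface {
    Put(key, value string) error
}

// isStronglyConsistent checks the quorum overlap rule R + W > N.
func isStronglyConsistent(n, w, r int) bool {
    return r+w > n
}

// writeQuorum sends the write to all replicas in parallel and returns as
// soon as w of them acknowledge, or fails once a quorum is unreachable.
func writeQuorum(replicas []Replica, key, value string, w int) error {
    acks := make(chan error, len(replicas))
    for _, rep := range replicas {
        go func(rep Replica) { acks <- rep.Put(key, value) }(rep)
    }
    succeeded, failed := 0, 0
    for range replicas {
        if err := <-acks; err != nil {
            failed++
        } else {
            succeeded++
        }
        if succeeded >= w {
            return nil // quorum of acknowledgements reached
        }
        if failed > len(replicas)-w {
            return errors.New("write quorum not reachable")
        }
    }
    return errors.New("write quorum not reached")
}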
5. Versioning & Conflict Resolution
We use vector clocks to track causality:
Client A writes v1: [A:1]
Client B writes v2: [B:1]
Client A writes v3: [A:2, B:1]
Conflict detected: v1 and v2 are concurrent
Resolution: Last-write-wins or application-level merge
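To make the comparison in the trace above concrete, here is a minimal vector clock sketch in Go (the type and function names are assumptions for illustration): each version carries a counter per writer, and two versions conflict when neither dominates the other.
// VectorClock maps each writer (node or client ID) to its update counter.
type VectorClock map[string]int

// happensBefore reports whether b has seen every update recorded in a.
func happensBefore(a, b VectorClock) bool {
    for node, counter := range a {
        if counter > b[node] {
            return false
        }
    }
    return true
}

// concurrent reports a conflict: neither version dominates the other,
// e.g. [A:1] vs [B:1] above, which must be resolved by last-write-wins
// or an application-level merge.
func concurrent(a, b VectorClock) bool {
    return !happensBefore(a, b) && !happensBefore(b, a)
}
6. Failure Handling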
Techniques for high availability:
- Sloppy Quorum: Use first N healthy nodes
- Hinted Handoff: Store temporarily on another node
- Read Repair: Fix inconsistencies during reads
- Anti-entropy: Background process to sync replicas
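As an illustration of sloppy quorum plus hinted handoff, here is a rough Go sketch; the Node interface, its methods, and the handoff logic are all assumptions, not taken from a real system.
import "errors"

// Node is a minimal stand-in for a storage node in the preference list.
type Node interface {
    ID() string
    Healthy() bool
    Put(key, value string) error
    PutWithHint(key, value, intendedOwner string) error
}

// putWithHandoff writes to the intended owner if it is healthy; otherwise
// it falls through to the next healthy node in the preference list, which
// stores the value together with a hint naming the intended owner so the
// write can be handed back once that node recovers.
func putWithHandoff(preferenceList []Node, key, value string) error {
    if len(preferenceList) == 0 {
        return errors.New("empty preference list")
    }
    owner := preferenceList[0]
    for _, n := range preferenceList {
        if !n.Healthy() {
            continue
        }
        if n.ID() == owner.ID() {
            return n.Put(key, value) // normal path: owner is up
        }
        // Sloppy quorum: a stand-in node accepts the write plus a hint.
        return n.PutWithHint(key, value, owner.ID())
    }
    return errors.New("no healthy node available for handoff")
}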
Storage Engine
LSM-Tree (Log-Structured Merge Tree)
Most distributed KV stores use LSM-trees for write optimization:
┌─────────────┐
│ MemTable │ ← In-memory writes (sorted)
└──────┬──────┘
│ Flush when full
┌──────▼──────┐
│ SSTable │ ← Immutable on-disk files
│ (Level 0) │
└──────┬──────┘
│ Compact & merge
┌──────▼──────┐
│ SSTable │
│ (Level 1) │
└─────────────┘
Write Path:
- Write to commit log (durability)
- Write to MemTable (fast access)
- Flush to SSTable when full
- Background compaction
Read Path:
- Check MemTable
- Check Bloom filters
- Binary search in SSTables
- Return value or not found
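A toy version of these two paths in Go (the maps stand in for a real commit log, MemTable, SSTables, and Bloom filters; everything here is simplified for illustration):
// sstable is a stand-in for one immutable on-disk file.
type sstable struct {
    bloom map[string]bool   // stand-in for a Bloom filter
    data  map[string]string // stand-in for sorted key-value blocks
}

// lsmStore keeps recent writes in memory and older data in SSTables.
type lsmStore struct {
    commitLog []string          // append-only log for durability
    memtable  map[string]string // in-memory structure (simplified to a map)
    sstables  []sstable         // ordered newest first
}

// Put follows the write path: commit log first, then the MemTable. A real
// engine would flush the MemTable to a new SSTable once it grows too large.
func (s *lsmStore) Put(key, value string) {
    s.commitLog = append(s.commitLog, key+"="+value)
    s.memtable[key] = value
}

// Get follows the read path: MemTable first, then SSTables newest to
// oldest, skipping any file whose Bloom filter rules the key out.
func (s *lsmStore) Get(key string) (string, bool) {
    if v, ok := s.memtable[key]; ok {
        return v, true
    }
    for _, t := range s.sstables {
        if !t.bloom[key] { // definitely not in this SSTable
            continue
        }
        if v, ok := t.data[key]; ok {
            return v, true
        }
    }
    return "", false
}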
Auto-scaling
Monitor metrics and scale automatically:
func autoScale(metrics Metrics) {
    // Scale out when CPU is hot or P99 latency (in ms) exceeds the target.
    if metrics.CPUUsage > 80 || metrics.LatencyP99 > 100 {
        addNode()
    } else if metrics.CPUUsage < 20 && nodeCount > minNodes {
        // Scale in only while staying above the minimum cluster size.
        removeNode()
    }
}
Scaling triggers:
- CPU usage > 80%
- Memory usage > 85%
- P99 latency > threshold
- Disk usage > 80%
Performance Optimizations
1. Bloom Filters
Skip unnecessary disk reads:
if !bloomFilter.MightContain(key) {
return nil, NotFound
}
2. Compression
Reduce storage and network I/O:
- Snappy for speed
- Zstandard for ratio
- LZ4 for balance
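As a small hedged example, compressing SSTable blocks with the github.com/golang/snappy package might look like this (where and how blocks are compressed is an assumption for illustration):
import "github.com/golang/snappy"

// compressBlock shrinks an SSTable block before it is written to disk.
func compressBlock(block []byte) []byte {
    return snappy.Encode(nil, block)
}

// decompressBlock restores a block read from disk; an error indicates a
// corrupted or truncated block.
func decompressBlock(compressed []byte) ([]byte, error) {
    return snappy.Decode(nil, compressed)
}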
3. Caching
Multi-level caching strategy:
- Row cache: Full key-value pairs
- Block cache: SSTable blocks
- Key cache: Key → disk location
4. Batch Operations
Reduce network round trips:
batch := NewBatch()
batch.Put("key1", "value1")
batch.Put("key2", "value2")
batch.Delete("key3")
db.Write(batch)
Monitoring & Operations
Key Metrics
- Latency: P50, P95, P99
- Throughput: Reads/writes per second
- Availability: Uptime percentage
- Consistency: Divergence between replicas
- Storage: Disk usage, compaction stats
Health Checks
func healthCheck() HealthStatus {
return HealthStatus{
Replication: checkReplicationLag(),
Storage: checkDiskSpace(),
Network: checkNodeConnectivity(),
Load: checkLoadDistribution(),
}
}
Implementation Example
type KVStore struct {
    ring              *ConsistentHash
    storage           StorageEngine
    replicator        Replicator
    coordinator       Coordinator
    replicationFactor int // N: number of replicas per key
}

func (kv *KVStore) Put(key, value string, consistency Level) error {
    // Find the N nodes responsible for this key on the hash ring
    nodes := kv.ring.GetNodes(key, kv.replicationFactor)
    // Write based on the requested consistency level
    switch consistency {
    case ONE:
        return kv.writeOne(nodes, key, value)
    case QUORUM:
        return kv.writeQuorum(nodes, key, value)
    case ALL:
        return kv.writeAll(nodes, key, value)
    default:
        return ErrUnknownConsistency // sentinel error assumed defined elsewhere
    }
}

func (kv *KVStore) Get(key string, consistency Level) (string, error) {
    nodes := kv.ring.GetNodes(key, kv.replicationFactor)
    switch consistency {
    case ONE:
        return kv.readOne(nodes, key)
    case QUORUM:
        return kv.readQuorum(nodes, key)
    case ALL:
        return kv.readAll(nodes, key)
    default:
        return "", ErrUnknownConsistency
    }
}
Trade-offs & Decisions
| Decision | Option A | Option B | Our Choice |
|----------|----------|----------|------------|
| Consistency Model | Strong (CP) | Eventual (AP) | Tunable (Both) |
| Partitioning | Range-based | Hash-based | Consistent Hash |
| Storage Engine | B-tree | LSM-tree | LSM-tree |
| Replication | Master-slave | Peer-to-peer | Peer-to-peer |
| Conflict Resolution | Last-write-wins | Vector clocks | Vector clocks |
Real-World Examples
Amazon DynamoDB
- Eventually consistent by default
- Consistent hashing with virtual nodes
- Multi-master replication
- Auto-scaling based on traffic
Apache Cassandra
- Tunable consistency (ONE to ALL)
- Gossip protocol for membership
- LSM-tree storage engine
- No single point of failure
Redis Cluster
- Master-slave replication
- 16384 hash slots
- Synchronous replication option
- In-memory with persistence
Next Steps
- Implement a simple distributed KV store
- Learn about consensus algorithms
- Explore distributed transactions