Distributed Key-Value Store

Building a distributed key-value store requires understanding fundamental trade-offs in distributed systems. The CAP theorem frames the most fundamental of these trade-offs and shapes many distributed system design decisions. Let's explore these trade-offs before diving into the architecture.

CAP Theorem Interactive Demo

In distributed systems, you can only guarantee two of these three properties:

  • Consistency: All nodes see the same data
  • Availability: System stays operational
  • Partition Tolerance: Works despite network failures

Typical combinations: CP (MongoDB, HBase), AP (Cassandra, DynamoDB), CA (traditional RDBMS).

Key Insights

Why P is Essential

Network partitions are inevitable in distributed systems. Cables break, switches fail, and data centers lose connectivity. Therefore, you must choose between C and A.

CA is Theoretical

CA systems work only in non-distributed environments. Once you distribute across multiple nodes, network partitions become possible, making CA impractical.

Now that we understand the fundamental constraints, let's design a system that can scale to handle billions of key-value pairs across hundreds of nodes while making informed decisions about these trade-offs.

Requirements

Our distributed key-value store must handle:

  • Small key-value pairs: Less than 10 KB each
  • Big data: Store and retrieve billions of items
  • High availability: Respond quickly even during failures
  • High scalability: Scale horizontally with traffic
  • Auto-scaling: Add/remove nodes automatically
  • Tunable consistency: Choose between strong and eventual consistency
  • Low latency: Sub-millisecond response times

CAP Theorem: The Fundamental Trade-off

The CAP theorem, conjectured by Eric Brewer and later proved by Seth Gilbert and Nancy Lynch, states that any distributed system can only guarantee two out of three properties:

  • Consistency (C): All nodes see the same data at the same time - reads receive the most recent write
  • Availability (A): System remains operational - every request receives a response (success or failure)
  • Partition Tolerance (P): System continues operating despite network failures between nodes

Why You Must Choose

In real-world distributed systems, network partitions are inevitable. Cables break, switches fail, data centers lose connectivity, and network congestion causes delays. This means:

You cannot avoid P, so you must choose between C and A

The Two Practical Choices

CP Systems (Consistency + Partition Tolerance)

  • Strategy: Reject requests when consistency cannot be guaranteed
  • Examples: MongoDB, HBase, Redis Cluster, Consul, Zookeeper
  • Use Cases: Financial transactions, inventory management, configuration stores
  • Trade-off: System may become unavailable during partitions

AP Systems (Availability + Partition Tolerance)

  • Strategy: Accept requests even if data might be inconsistent
  • Examples: Cassandra, DynamoDB, CouchDB, Riak, DNS
  • Use Cases: Social media feeds, content delivery, analytics, user preferences
  • Trade-off: Data may be temporarily inconsistent across nodes

Why CA Systems Don't Work

CA Systems (Traditional RDBMS in single location) work only when:

  • All data is on one machine, OR
  • All machines are in the same rack with perfect network

Once you distribute across multiple locations, network partitions become possible, making pure CA systems impractical for distributed architectures.

Architecture Overview

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Client    │     │   Client    │     │   Client    │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │
       └───────────────────┴───────────────────┘
                    ┌──────▼──────┐
                    │Load Balancer│
                    └──────┬──────┘
       ┌───────────────────┼───────────────────┐
       │                   │                   │
┌──────▼──────┐     ┌──────▼──────┐     ┌──────▼──────┐
│  API Server │     │  API Server │     │  API Server │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │
       └───────────────────┴───────────────────┘
                    ┌──────▼──────┐
                    │ Coordinator │
                    └──────┬──────┘
       ┌───────────────────┼───────────────────┐
       │                   │                   │
┌──────▼──────┐     ┌──────▼──────┐     ┌──────▼──────┐
│Storage Node1│     │Storage Node2│     │Storage Node3│
│  (Replica)  │     │  (Replica)  │     │  (Replica)  │
└─────────────┘     └─────────────┘     └─────────────┘

Core Components

1. Data Partitioning

We use consistent hashing to distribute data across nodes:

Partitioning Demo

Interactive partitioning demo coming soon...

Benefits:

  • Minimal data movement when nodes join/leave
  • Even distribution of keys
  • Each node manages ~1/N of total data
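
To make this concrete, here is a minimal sketch of a consistent-hash ring with virtual nodes, assuming CRC32 as the hash function; the Ring type and its methods are illustrative, not taken from a specific library:

package main

import (
    "fmt"
    "hash/crc32"
    "sort"
)

// Ring maps virtual-node hash positions to physical node names.
type Ring struct {
    vnodes int
    hashes []uint32          // sorted virtual-node positions on the ring
    owners map[uint32]string // position -> physical node
}

func NewRing(vnodes int) *Ring {
    return &Ring{vnodes: vnodes, owners: make(map[uint32]string)}
}

// AddNode places `vnodes` virtual nodes for one physical node on the ring.
func (r *Ring) AddNode(name string) {
    for i := 0; i < r.vnodes; i++ {
        h := crc32.ChecksumIEEE([]byte(fmt.Sprintf("%s#%d", name, i)))
        r.owners[h] = name
        r.hashes = append(r.hashes, h)
    }
    sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
}

// GetNode returns the first virtual node clockwise from the key's hash.
func (r *Ring) GetNode(key string) string {
    if len(r.hashes) == 0 {
        return ""
    }
    h := crc32.ChecksumIEEE([]byte(key))
    i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= h })
    if i == len(r.hashes) {
        i = 0 // wrap around the ring
    }
    return r.owners[r.hashes[i]]
}

func main() {
    ring := NewRing(100)
    ring.AddNode("node-1")
    ring.AddNode("node-2")
    ring.AddNode("node-3")
    fmt.Println(ring.GetNode("user:12345")) // one of the three nodes
}

Virtual nodes (100 per physical node here) smooth out the key distribution and keep data movement small when a node joins or leaves.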

2. Database Scaling Fundamentals

Before diving into our replication strategy, we need to understand database scaling concepts. For a comprehensive guide on replication vs sharding, MySQL scaling with Vitess, multi-master setups, and cross-shard join challenges, see:

📚 Database Concepts: Scaling Strategies
Essential reading covering replication, sharding, MySQL vs PostgreSQL, Vitess architecture, multi-master replication, write scaling challenges, and cross-shard join patterns.

Quick Summary for our context:

  • Sharding: Splits data across nodes for storage/write scaling
  • Replication: Copies data across nodes for availability
  • Combined: Our KV store uses both - sharding for scale + replication for fault tolerance

MySQL

1. Replication (MySQL's Traditional Approach):

┌─────────────┐     Async/Semi-sync     ┌─────────────┐
│   Master    │ ──────────────────────> │   Slave 1   │
│  (Writes)   │                         │  (Reads)    │
└─────────────┘ ──────────────────────> └─────────────┘
                                        ┌─────────────┐
                                        │   Slave 2   │
                                        │  (Reads)    │
                                        └─────────────┘

- Master handles all writes
- Slaves handle read queries
- Asynchronous by default (lag possible)
- Semi-synchronous ensures at least one slave has data

2. Sharding (MySQL Cluster/NDB or Application-level):

Application determines shard key (e.g., user_id % 4)
    ┌────┴────┬─────────┬─────────┐
    │         │         │         │
┌───▼───┐ ┌──▼───┐ ┌──▼───┐ ┌──▼───┐
│Shard 1│ │Shard 2│ │Shard 3│ │Shard 4│
│US-West│ │US-East│ │Europe │ │ Asia  │
└───────┘ └───────┘ └───────┘ └───────┘

- Often implemented at application level
- MySQL Fabric for automated sharding
- Vitess for large-scale sharding

PostgreSQL

1. Replication (Streaming Replication):

┌─────────────┐     WAL Streaming      ┌─────────────┐
│   Primary   │ ──────────────────────>│  Standby 1  │
│             │                        │  (Hot)      │
└─────────────┘ ──────────────────────>└─────────────┘
                                       ┌─────────────┐
                                       │  Standby 2  │
                                       │  (Warm)     │
                                       └─────────────┘

- Write-Ahead Log (WAL) shipping
- Synchronous or asynchronous
- Hot standby can serve read queries
- Automatic failover with tools like Patroni

2. Sharding (PostgreSQL 10+ with Partitioning):

┌─────────────────────────────┐
│    Users Table (Parent)     │
└──────────────┬──────────────┘
               │ Declarative Partitioning
    ┌──────────┼──────────┬──────────┐
    │          │          │          │
┌───▼────┐ ┌──▼────┐ ┌──▼────┐ ┌──▼────┐
│Users_Q1│ │Users_Q2│ │Users_Q3│ │Users_Q4│
│Jan-Mar │ │Apr-Jun │ │Jul-Sep │ │Oct-Dec │
└────────┘ └────────┘ └────────┘ └────────┘

- Built-in table partitioning
- Foreign Data Wrappers for cross-node sharding
- Citus extension for true distributed sharding

MySQL vs PostgreSQL: Replication & Sharding Comparison

Replication

| Feature | MySQL | PostgreSQL |
|---------|-------|------------|
| Primary Method | Master-Slave (Binary Log) | Streaming Replication (WAL) |
| Replication Types | Asynchronous (default), Semi-synchronous, Group Replication | Asynchronous (default), Synchronous, Logical Replication |
| Multi-Master | ✓ Yes (MySQL Cluster, Galera) | ⚠ Limited (BDR, pgLogical) |
| Failover | Manual / MHA / Orchestrator | pg_auto_failover / Patroni |
| Read Replicas | ✓ Unlimited | ✓ Unlimited + Hot Standby |

Sharding

| Feature | MySQL | PostgreSQL |
|---------|-------|------------|
| Native Support | ❌ No (Application-level) | ⚠ Partitioning Only (Table-level) |
| Sharding Solutions | Vitess, MySQL Fabric, ProxySQL, Spider Engine | Citus, postgres_fdw, PL/Proxy, pg_shard (deprecated) |
| Cross-Shard Queries | Complex (App handles joins) | Better (FDW, Citus distributed queries) |
| Auto-Sharding | ✓ Vitess | ✓ Citus |

Performance & Scale

| Feature | MySQL | PostgreSQL |
|---------|-------|------------|
| Replication Lag | Variable (Binary log based) | Lower (WAL streaming) |
| Write Scaling | Good (With sharding) | Good (With Citus) |
| Best For | Web applications, Read-heavy workloads, Simple schemas | Complex queries, Data integrity critical, Analytics workloads |

Key Takeaways

MySQL Strengths:

  • Mature ecosystem with tools like Vitess for large-scale sharding
  • True multi-master replication with Galera Cluster
  • Simpler replication setup for basic use cases
  • Better suited for simple, high-volume web applications

PostgreSQL Strengths:

  • Superior built-in partitioning capabilities
  • More reliable replication with WAL streaming
  • Better cross-shard query capabilities with FDW
  • Stronger consistency guarantees and data integrity

When to Choose:

  • MySQL + Vitess: Large-scale web apps needing horizontal scaling
  • PostgreSQL + Citus: Complex analytics with distributed queries
  • MySQL Galera: Multi-master writes across regions
  • PostgreSQL Streaming: High-availability with strong consistency

Deep Dive: Vitess - YouTube's Gift to MySQL Scaling

Vitess is a database clustering system for horizontal scaling of MySQL. Originally developed at YouTube to handle massive growth, it's now used by companies like Slack, Square, and GitHub.

How Vitess Works

┌─────────────────────────────────────────────────────────────┐
│                        Application                          │
└───────────────────────┬─────────────────────────────────────┘
                        │ gRPC/MySQL Protocol
┌───────────────────────▼─────────────────────────────────────┐
│                      VTGate (Proxy)                         │
│  • Query routing      • Connection pooling                  │
│  • Query rewriting    • Scatter-gather engine              │
└───────────────┬───────────────┬─────────────────┬──────────┘
                │               │                 │
        ┌───────▼────┐   ┌──────▼────┐    ┌──────▼────┐
        │  VTTablet  │   │  VTTablet │    │  VTTablet │
        │  (Primary) │   │  (Primary)│    │  (Primary)│
        │  Shard -80 │   │Shard 80-c0│    │ Shard c0- │
        └─────┬──────┘   └─────┬────┘    └─────┬─────┘
              │                │                │
        ┌─────▼────┐     ┌─────▼───┐     ┌─────▼────┐
        │  MySQL   │     │  MySQL  │     │  MySQL   │
        │ Primary  │     │ Primary │     │ Primary  │
        └─────┬────┘     └────┬────┘     └────┬─────┘
              │               │               │
        ┌─────▼────┐     ┌────▼────┐     ┌────▼─────┐
        │ Replica  │     │ Replica │     │ Replica  │
        └──────────┘     └─────────┘     └──────────┘

Components:
• VTGate: Stateless proxy, routes queries to correct shards
• VTTablet: Controls MySQL instance, handles replication
• Topology Service: Stores metadata (uses etcd/Consul/ZooKeeper)

Vitess Auto-Sharding Process

1. Initial State - Unsharded

┌────────────────────┐
│   Unsharded DB     │
│  users (100GB)     │
│  All data in one   │
│     MySQL          │
└────────────────────┘

2. Prepare for Sharding

Step 1: Define Sharding Key (e.g., user_id)
Step 2: Create VSchema
{
  "sharded": true,
  "vindexes": {
    "hash": {
      "type": "hash"
    }
  },
  "tables": {
    "users": {
      "column_vindexes": [{
        "column": "user_id",
        "name": "hash"
      }]
    }
  }
}

3. Split Process (Zero-Downtime)

Phase 1: Create Target Shards
┌──────────────┐     ┌─────────────┐     ┌─────────────┐
│  Source DB   │     │  Shard -80  │     │  Shard 80-  │
│ (All Data)   │ ──> │ (user_id    │     │ (user_id    │
│              │     │  hash < 80) │     │  hash >= 80)│
└──────────────┘     └─────────────┘     └─────────────┘
                           ▲                     ▲
                           │                     │
                           └─────────┬───────────┘
                            VReplication
                            (Filtered Copy)

Phase 2: Switch Reads (instant)
Phase 3: Switch Writes (brief lock)
Phase 4: Clean up source

4. Resharding (Splitting Further)

Before: 2 shards [-80, 80-]
After:  4 shards [-40, 40-80, 80-c0, c0-]

┌─────────┐      ┌─────────┐  ┌─────────┐
│ Shard   │ ──>  │ Shard   │  │ Shard   │
│  -80    │      │  -40    │  │  40-80  │
└─────────┘      └─────────┘  └─────────┘
┌─────────┐      ┌─────────┐  ┌─────────┐
│ Shard   │ ──>  │ Shard   │  │ Shard   │
│  80-    │      │  80-c0  │  │  c0-    │
└─────────┘      └─────────┘  └─────────┘
     └── Each source shard splits into two target shards automatically

Key Vitess Features

1. Query Routing

-- Application sends:
SELECT * FROM users WHERE user_id = 12345;

-- VTGate determines shard:
user_id 12345 → hash → shard 40-80

-- Routes to correct MySQL:
/* vtgate:: keyspace:users shard:40-80 */
SELECT * FROM users WHERE user_id = 12345;

2. Scatter-Gather for Cross-Shard Queries

-- Query needing all shards:
SELECT COUNT(*) FROM users WHERE status = 'active';

-- VTGate scatter phase:
┌─────────┐     ┌─────────┐     ┌─────────┐
│Shard -40│     │Shard 40-│     │Shard 80-│
│COUNT: 25│     │COUNT: 30│     │COUNT: 45│
└─────────┘     └─────────┘     └─────────┘
     │               │               │
     └───────────────┼───────────────┘
              VTGate Gather
              Total: 100
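
Conceptually, the gather step is a fan-out of the same query to every shard followed by aggregation of the partial results. Here is a small Go sketch of that pattern, where countActive stands in for running the per-shard COUNT query (this is illustrative, not Vitess code):

package main

import (
    "fmt"
    "sync"
)

// countActive stands in for running COUNT(*) against one shard.
func countActive(shard string) int {
    fake := map[string]int{"-40": 25, "40-80": 30, "80-": 45}
    return fake[shard]
}

func main() {
    shards := []string{"-40", "40-80", "80-"}

    var wg sync.WaitGroup
    partial := make([]int, len(shards))

    // Scatter: query every shard concurrently.
    for i, s := range shards {
        wg.Add(1)
        go func(i int, s string) {
            defer wg.Done()
            partial[i] = countActive(s)
        }(i, s)
    }
    wg.Wait()

    // Gather: combine the partial counts into the final answer.
    total := 0
    for _, c := range partial {
        total += c
    }
    fmt.Println("total active users:", total) // 100
}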

Deep Dive: Multi-Master Replication

Multi-master replication allows writes to any node, unlike traditional master-slave setups. Let's explore the two main approaches:

1. MySQL Group Replication / Galera Cluster

Architecture:

┌─────────────────────────────────────────────────────────────┐
│                     Write to ANY Node                       │
└─────────────┬──────────────┬──────────────┬────────────────┘
              │              │              │
         ┌────▼────┐    ┌────▼────┐   ┌────▼────┐
         │ Node 1  │    │ Node 2  │   │ Node 3  │
         │ Primary │◄───│ Primary │───►│ Primary │
         └────┬────┘    └────┬────┘   └────┬────┘
              │              │              │
              └──────────────┼──────────────┘
                    Group Communication
                    (Consensus Protocol)

How Writes Work:

1. Client writes to Node 1
   UPDATE users SET status='active' WHERE id=123;

2. Certification Process:
   ┌─────────┐     ┌─────────────────┐     ┌──────────┐
   │ Node 1  │────►│ Broadcast Write │────►│ Node 2,3 │
   │(Origin) │     │   Transaction   │     │(Certify) │
   └─────────┘     └─────────────────┘     └──────────┘
        │                                        │
        │          ┌─────────────────┐          │
        └──────────│ Conflict Check  │◄─────────┘
                   │ - Same row?     │
                   │ - Write order?  │
                   └────────┬────────┘
                   ┌────────▼────────┐
                   │ Commit or Abort │
                   └─────────────────┘

3. If certified: All nodes apply in same order
4. If conflict: Transaction rolled back

Conflict Resolution Example:

Simultaneous writes to same row:

Node 1: UPDATE users SET age=25 WHERE id=1; (Time: 10:00:00.100)
Node 2: UPDATE users SET age=26 WHERE id=1; (Time: 10:00:00.101)

Resolution:
┌─────────┐     ┌─────────┐     ┌─────────┐
│ Node 1  │     │ Node 2  │     │ Node 3  │
│ age=25  │     │ age=26  │     │ Arbiter │
└────┬────┘     └────┬────┘     └────┬────┘
     │               │                │
     └───────────────┼────────────────┘
              Global Order Applied
              (First write wins)
              All nodes: age=25

2. Application-Level Multi-Master

Some systems implement multi-master at the application level using techniques like CRDTs (Conflict-free Replicated Data Types):

Shopping Cart Example - Merge on Read:

Node A (US):                Node B (EU):
┌──────────────┐           ┌──────────────┐
│ Cart_User123 │           │ Cart_User123 │
│ + iPhone     │           │ + iPad       │
│ + AirPods    │           │ + MacBook    │
└──────────────┘           └──────────────┘
       │                           │
       └───────────┬───────────────┘
            Merge Strategy
            ┌──────────────┐
            │ Cart_User123 │
            │ + iPhone     │
            │ + AirPods    │
            │ + iPad       │
            │ + MacBook    │
            └──────────────┘
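
This merge behaves like a grow-only set (G-Set) CRDT: each replica keeps a set of items, and merging is set union, which is commutative, associative, and idempotent, so replicas converge regardless of merge order. A minimal sketch, not tied to any particular CRDT library:

package main

import (
    "fmt"
    "sort"
)

// Cart is a grow-only set of item names.
type Cart map[string]bool

// Merge is set union: applying merges in any order yields the same cart.
func (c Cart) Merge(other Cart) {
    for item := range other {
        c[item] = true
    }
}

// Items returns the cart contents in a stable order.
func (c Cart) Items() []string {
    items := make([]string, 0, len(c))
    for item := range c {
        items = append(items, item)
    }
    sort.Strings(items)
    return items
}

func main() {
    us := Cart{"iPhone": true, "AirPods": true} // Node A (US)
    eu := Cart{"iPad": true, "MacBook": true}   // Node B (EU)

    us.Merge(eu)
    fmt.Println(us.Items()) // [AirPods MacBook iPad iPhone]
}

Removing items needs a richer type (for example an OR-Set), since a plain union would resurrect deleted entries.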

Multi-Master Trade-offs

✅ Advantages

  • Write anywhere - no single point of failure
  • Better geographic distribution
  • Read scaling across all nodes
  • Automatic failover

❌ Challenges

  • Write conflicts need resolution
  • Lower write throughput (consensus overhead)
  • Complex debugging
  • Network partitions can cause split-brain

When to Use Multi-Master

Good Fit:

  • Geographic distribution with local writes
  • High availability requirements
  • Read-heavy workloads
  • Systems that can tolerate/resolve conflicts

Poor Fit:

  • High-volume writes to same data
  • Strong consistency requirements
  • Complex transactions
  • Limited network bandwidth

Vitess vs Multi-Master: When Do You Need Both?

While Vitess solves horizontal scaling through sharding, multi-master replication addresses a different problem: write availability and geographic distribution. Let's break down when you need each:

Vitess Solves: Data Volume Scaling

Problem: Single database can't handle data size
┌─────────────────────────────────────────┐
│        Single MySQL Instance           │
│    users table: 500GB, 1B records      │
│    💥 Too big for one machine          │
└─────────────────────────────────────────┘

Vitess Solution: Horizontal partitioning
┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│  Shard 1    │  │  Shard 2    │  │  Shard 3    │
│  users      │  │  users      │  │  users      │
│  (1-333M)   │  │  (334-666M) │  │  (667M-1B)  │
└─────────────┘  └─────────────┘  └─────────────┘

Multi-Master Solves: Write Availability & Geography

Problem: Single write point fails or is too far
┌─────────────────────────────────────────┐
│         US West (Primary)               │
│    All writes must go here              │
│    💥 Far from EU/Asia users            │
│    💥 Single point of failure           │
└─────────────────────────────────────────┘

Multi-Master Solution: Write anywhere
┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│  US West    │  │  EU Central │  │  Asia Pac   │
│  (Master)   │  │  (Master)   │  │  (Master)   │
│  Low latency│  │  Low latency│  │  Low latency│
│  for US     │  │  for EU     │  │  for Asia   │
└─────────────┘  └─────────────┘  └─────────────┘

The Reality: You Often Need BOTH

Here's how real-world systems combine them:

Example: Global Social Media Platform

             Geographic Multi-Master Setup
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   US Region  │    │  EU Region   │    │ Asia Region  │
│              │    │              │    │              │
│   ┌─────┐    │    │   ┌─────┐    │    │   ┌─────┐    │
│   │User │    │◄──►│   │User │    │◄──►│   │User │    │
│   │Posts│    │    │   │Posts│    │    │   │Posts│    │
│   └─────┘    │    │   └─────┘    │    │   └─────┘    │
│       │      │    │       │      │    │       │      │
│    Vitess    │    │    Vitess    │    │    Vitess    │
│   Sharding   │    │   Sharding   │    │   Sharding   │
│       │      │    │       │      │    │       │      │
│ ┌───┬─┴─┬───┐│    │ ┌───┬─┴─┬───┐│    │ ┌───┬─┴─┬───┐│
│ │S1 │S2 │S3 ││    │ │S1 │S2 │S3 ││    │ │S1 │S2 │S3 ││
│ └───┴───┴───┘│    │ └───┴───┴───┘│    │ └───┴───┴───┘│
└──────────────┘    └──────────────┘    └──────────────┘

Each region:
- Multi-Master: Handles regional writes, replicates globally
- Vitess: Shards data within each region for scale

Decision Matrix

| Scenario | Vitess | Multi-Master | Both |
|----------|--------|--------------|------|
| Single region, large dataset | ✓ Perfect | Not needed | Overkill |
| Multi-region, small dataset | Not needed | ✓ Perfect | Overkill |
| Global scale, large dataset | Helps | Helps | ✓ Best |
| High write availability needed | Doesn't help | ✓ Perfect | If also large |
| Latency-sensitive writes | Doesn't help | ✓ Perfect | If also large |

Real-World Examples

Vitess Only:

YouTube Analytics Dashboard
- Massive historical data (petabytes)
- Complex aggregation queries
- Single region acceptable
- Vitess shards by time/user_id

Multi-Master Only:

Global User Preferences Service
- Small dataset (user settings)
- Needs local writes everywhere
- Galera Cluster across regions
- No sharding needed

Both Together:

Instagram/Facebook
- Massive media storage (Vitess for sharding)
- Global user base (Multi-master for regions)
- Each region shards its data locally
- Cross-region replication for user data

MySQL Write Throughput Challenge

Critical Point: Even with Vitess, MySQL's fundamental limitation remains - writes only happen on the primary node. This creates a single point of write operations per shard.

For scaling write throughput in a single region, you have limited options:

⚠️ MySQL Write Scaling Options

1. Multi-Master Replication

  • Multiple nodes accept writes
  • True write throughput scaling
  • Conflict resolution complexity
  • Network partition challenges

2. Alternative Approaches

  • Write buffering/batching
  • Asynchronous processing
  • Move to NoSQL (MongoDB, Cassandra)
  • Event sourcing patterns

Reality Check:

Vitess Sharding:               Multi-Master:
┌─────────────────────┐        ┌─────────────────────┐
│ Shard 1: [Primary] │        │ Master 1: ✓ Writes │
│         [Replica]  │   VS   │ Master 2: ✓ Writes │
│ Shard 2: [Primary] │        │ Master 3: ✓ Writes │
│         [Replica]  │        │ (All can write)     │
└─────────────────────┘        └─────────────────────┘
     ↑                                ↑
 Single writer per shard      Multiple concurrent writers

Key Insights

Vitess Limitations:

  • Still has single write point per shard
  • Cross-region writes still slow
  • Master failure affects that shard's writes
  • Write throughput = sum of shard primaries

Multi-Master Limitations:

  • Doesn't solve data volume scaling
  • Complex conflict resolution
  • Per-write coordination overhead grows with node count
  • Write throughput improves with more writers, but not linearly

Why Both:

Multi-Master handles:           Vitess handles:
┌─────────────────────┐        ┌─────────────────────┐
│ • Geographic writes │   +    │ • Data volume       │
│ • Write availability│        │ • Query distribution│
│ • Network latency   │        │ • Storage scaling   │
└─────────────────────┘        └─────────────────────┘
           │                            │
           └────────────┬───────────────┘
            Complete Distributed Solution

Cross-Shard Joins: The Efficiency Challenge

When using sharded systems like Vitess, cross-shard joins are problematic and should be avoided when possible.

How Cross-Shard Joins Work

⚠️ Cross-Shard Join Process

-- Query: Get user details with their recent orders
SELECT u.name, u.email, o.order_id, o.total 
FROM users u 
JOIN orders o ON u.user_id = o.user_id 
WHERE u.region = 'US'

What Vitess has to do:

1. Scatter Phase:
   ┌─────────────────────────────────────────────────────┐
   │ Query sent to ALL shards (because we don't know     │
   │ which shards contain matching users/orders)         │
   └─────────────────────────────────────────────────────┘
   Shard 1    Shard 2    Shard 3    Shard 4
   ┌─────┐    ┌─────┐    ┌─────┐    ┌─────┐
   │Users│    │Users│    │Users│    │Users│
   │Orders│   │Orders│   │Orders│   │Orders│
   └─────┘    └─────┘    └─────┘    └─────┘

2. Gather Phase:
   ┌─────────────────────────────────────────────────────┐
   │ Collect partial results from each shard            │
   │ Vitess proxy performs the JOIN operation           │
   │ Sort, filter, and return final results             │
   └─────────────────────────────────────────────────────┘

Efficiency Problems

🐌 Performance Issues
  • Network overhead: Data from all shards
  • Memory usage: Vitess proxy holds all data
  • No index usage: Can't use MySQL join optimizations
  • Serial processing: Wait for slowest shard
  • Bandwidth waste: Moving large datasets
📊 Complexity Issues
  • Query planning: Vitess must understand joins
  • Transaction boundaries: ACID across shards
  • Deadlocks: Cross-shard dependencies
  • Limited SQL support: Not all joins supported
  • Debugging difficulty: Multi-shard query plans

Vitess Join Support Levels

  • Same-shard joins: fast and efficient; uses MySQL's native join optimization.
  • ⚠️ Cross-shard joins: supported but slow; avoid for high-traffic queries.
  • Complex joins: limited support; subqueries and CTEs are often not supported.

Design Patterns to Avoid Cross-Shard Joins

1. Denormalization:

-- Instead of joining across shards
-- users (shard by user_id) + orders (shard by user_id)
-- Store user info in orders table

CREATE TABLE orders (
  order_id BIGINT,
  user_id BIGINT,
  user_name VARCHAR(100),  -- Denormalized
  user_email VARCHAR(100), -- Denormalized  
  total DECIMAL(10,2),
  created_at TIMESTAMP
) -- Shard by user_id

2. Application-Level Joins:

// Fetch from multiple shards in application
users := fetchUsers(userIDs)      // From user shards
orders := fetchOrders(userIDs)    // From order shards
result := joinInMemory(users, orders) // App does the join
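
Fleshing that pseudocode out, here is a hedged, self-contained sketch of the in-memory join; the User and Order types and the joinInMemory helper are illustrative stand-ins for your own shard-aware data access layer:

package main

import "fmt"

type User struct {
    ID    int64
    Name  string
    Email string
}

type Order struct {
    ID     int64
    UserID int64
    Total  float64
}

// UserOrders is the joined result the application assembles itself.
type UserOrders struct {
    User   User
    Orders []Order
}

// joinInMemory indexes users by ID, then attaches each order to its user.
func joinInMemory(users []User, orders []Order) []UserOrders {
    byID := make(map[int64]*UserOrders, len(users))
    for _, u := range users {
        byID[u.ID] = &UserOrders{User: u}
    }
    for _, o := range orders {
        if uo, ok := byID[o.UserID]; ok {
            uo.Orders = append(uo.Orders, o)
        }
    }
    result := make([]UserOrders, 0, len(byID))
    for _, uo := range byID {
        result = append(result, *uo)
    }
    return result
}

func main() {
    users := []User{{ID: 1, Name: "Ada", Email: "ada@example.com"}} // from user shards
    orders := []Order{{ID: 100, UserID: 1, Total: 42.50}}           // from order shards
    fmt.Println(joinInMemory(users, orders))
}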

3. Materialized Views:

-- Pre-computed join results
CREATE TABLE user_order_summary (
  user_id BIGINT,
  total_orders INT,
  total_spent DECIMAL(12,2),
  last_order_date DATE
) -- Updated via events/batch jobs

4. Event-Driven Consistency:

User Update Event → Update orders.user_name
Order Created Event → Update user_order_summary

When Cross-Shard Joins are Acceptable

✅ Acceptable Use Cases
  • Admin/Analytics queries: Low frequency, can be slow
  • Small result sets: Limited data movement
  • Non-critical paths: Batch processing, reports
  • Development/Testing: Temporary queries

Performance Comparison:

Same-Shard Join:     1-10ms    (MySQL native optimization)
Cross-Shard Join:    100-1000ms (Network + coordination overhead)
Application Join:    50-200ms   (Parallel fetches + in-memory join)

Bottom Line:

  • Vitess = Scale your data horizontally
  • Multi-Master = Write from anywhere with high availability
  • Both = Global, massive-scale applications (Facebook, Instagram, YouTube)

You choose based on your specific constraints: data size, geographic distribution, write patterns, and availability requirements.

Key Differences Summary

| Aspect | Sharding | Replication |
|--------|----------|-------------|
| Purpose | Scalability | Availability |
| Data Distribution | Different data on each node | Same data on all nodes |
| Storage Efficiency | ✓ High (data divided) | ⚠ Low (data duplicated) |
| Write Performance | 📈 Scales linearly | 🔒 Limited by slowest replica |
| Read Performance | Scales for different data | Scales for same data |
| Failure Impact | ❌ Partial data loss | ✓ No data loss |
| Complexity | High (routing, joins) | Medium (consistency) |

3. Our Replication Strategy

Now that we understand the concepts, for our distributed KV store, we'll use both techniques. First, let's look at how we replicate each key to N nodes for high availability:

Replication Visualizer

Interactive replication demo coming soon...

Replication approaches:

  • Simple: Next N nodes on hash ring
  • Rack-aware: Different racks/data centers
  • Zone-aware: Different availability zones
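
As a sketch of the simple approach above: walk clockwise from the key's position on the ring and collect the next N distinct physical nodes. The vnode type and replicasFor helper are illustrative; a rack- or zone-aware variant would additionally skip nodes whose rack or zone is already represented.

package main

import (
    "fmt"
    "hash/crc32"
    "sort"
)

// vnode is one position on the ring owned by a physical node.
type vnode struct {
    hash uint32
    node string
}

// replicasFor walks clockwise from the key's hash and collects the next
// n distinct physical nodes (the "simple" replication strategy).
func replicasFor(ring []vnode, key string, n int) []string {
    h := crc32.ChecksumIEEE([]byte(key))
    start := sort.Search(len(ring), func(i int) bool { return ring[i].hash >= h })

    seen := make(map[string]bool)
    var replicas []string
    for i := 0; i < len(ring) && len(replicas) < n; i++ {
        v := ring[(start+i)%len(ring)] // wrap around the ring
        if !seen[v.node] {
            seen[v.node] = true
            replicas = append(replicas, v.node)
        }
    }
    return replicas
}

func main() {
    // A toy ring with one virtual node per physical node, sorted by hash.
    var ring []vnode
    for _, n := range []string{"node-1", "node-2", "node-3", "node-4"} {
        ring = append(ring, vnode{crc32.ChecksumIEEE([]byte(n)), n})
    }
    sort.Slice(ring, func(i, j int) bool { return ring[i].hash < ring[j].hash })

    // Three distinct replicas; exact order depends on hash positions.
    fmt.Println(replicasFor(ring, "user:12345", 3))
}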

4. Consistency Levels

Different operations require different consistency guarantees:

Consistency Level Demo

Interactive consistency levels demo coming soon...

Write Consistency Levels:

  • ONE: Write completes after one replica acknowledges
  • QUORUM: Write completes after majority acknowledge
  • ALL: Write completes after all replicas acknowledge

Read Consistency Levels:

  • ONE: Return after reading from one replica
  • QUORUM: Return most recent from majority
  • ALL: Return after reading from all replicas

The Quorum Formula Explained

The formula R + W > N ensures strong consistency. Let's break this down:

  • N = Total number of replicas (replication factor)
  • W = Write consistency level (number of nodes that must acknowledge a write)
  • R = Read consistency level (number of nodes that must respond to a read)

Why does R + W > N guarantee strong consistency?

When R + W > N, there's guaranteed overlap between the nodes that wrote the data and the nodes you're reading from. This means at least one node in your read set has the latest value.

Visual Example with N=3:

Write with W=2:            Read with R=2:
┌─────┐ ✓ Written          ┌─────┐ ✓ Read
│Node1│                    │Node1│
└─────┘                    └─────┘
┌─────┐ ✓ Written          ┌─────┐ 
│Node2│                    │Node2│ 
└─────┘                    └─────┘
┌─────┐ ✗ Not yet          ┌─────┐ ✓ Read
│Node3│                    │Node3│
└─────┘                    └─────┘

W=2 + R=2 = 4 > N=3 ✓
Overlap guaranteed: Node1 has latest data

Common Quorum Configurations:

| Config | N | W | R | R+W | Consistency | Use Case |
|--------|---|---|---|-----|-------------|----------|
| Strong | 3 | 2 | 2 | 4 > 3 | ✓ Strong | Critical data |
| Read-Heavy | 3 | 3 | 1 | 4 > 3 | ✓ Strong | Read-optimized |
| Write-Heavy | 3 | 1 | 3 | 4 > 3 | ✓ Strong | Write-optimized |
| Eventual | 3 | 1 | 1 | 2 < 3 | ✗ Eventual | Performance-critical |

Trade-offs:

  • Higher W: Slower writes, better durability
  • Higher R: Slower reads, fresher data
  • W=N: Maximum durability, writes fail if any node is down
  • R=N: Always read latest, reads fail if any node is down
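
These rules are easy to encode as a configuration sanity check; a tiny illustrative sketch:

package main

import "fmt"

// isStronglyConsistent reports whether a read quorum R and write quorum W
// are guaranteed to overlap on at least one replica out of N.
func isStronglyConsistent(n, w, r int) bool {
    return r+w > n
}

func main() {
    fmt.Println(isStronglyConsistent(3, 2, 2)) // true:  strong (2+2 > 3)
    fmt.Println(isStronglyConsistent(3, 3, 1)) // true:  strong, read-optimized
    fmt.Println(isStronglyConsistent(3, 1, 1)) // false: eventual (1+1 < 3)
}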

5. Versioning & Conflict Resolution

We use vector clocks to track causality:

Client A writes v1: [A:1]
Client B writes v2: [B:1]
Client A writes v3: [A:2, B:1]

Conflict detected: v1 and v2 are concurrent
Resolution: Last-write-wins or application-level merge
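
A hedged sketch of how that conflict detection works: one vector clock "happens before" another only if none of its counters is larger and at least one is smaller; otherwise the two versions are concurrent and must be reconciled. The types below are illustrative:

package main

import "fmt"

// VectorClock maps a client/node ID to its logical counter.
type VectorClock map[string]int

// happensBefore reports whether a causally precedes b: every counter in a
// is <= the matching counter in b, and b has at least one larger entry.
func happensBefore(a, b VectorClock) bool {
    for id, ca := range a {
        if ca > b[id] {
            return false
        }
    }
    for id, cb := range b {
        if cb > a[id] {
            return true
        }
    }
    return false
}

// concurrent means neither version descends from the other: a conflict.
func concurrent(a, b VectorClock) bool {
    return !happensBefore(a, b) && !happensBefore(b, a)
}

func main() {
    v1 := VectorClock{"A": 1}         // Client A writes v1
    v2 := VectorClock{"B": 1}         // Client B writes v2
    v3 := VectorClock{"A": 2, "B": 1} // Client A writes v3 after seeing both

    fmt.Println(concurrent(v1, v2))    // true  -> needs reconciliation
    fmt.Println(happensBefore(v2, v3)) // true  -> v3 supersedes v2
}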

6. Failure Handling

Techniques for high availability:

  1. Sloppy Quorum: Use first N healthy nodes
  2. Hinted Handoff: Store temporarily on another node
  3. Read Repair: Fix inconsistencies during reads
  4. Anti-entropy: Background process to sync replicas
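
As an illustration of the first two techniques, here is a hedged sketch of a sloppy-quorum write with hinted handoff: if a preferred replica is down, the coordinator writes to the next healthy node and stores a hint so the write can be replayed when the home node recovers. All names below (Node, sloppyWrite, and so on) are hypothetical:

package main

import (
    "errors"
    "fmt"
)

type Node struct {
    Name string
    Up   bool
    Data map[string]string
    // Hints holds writes destined for another node that was down.
    Hints map[string]map[string]string // home node -> key -> value
}

func newNode(name string, up bool) *Node {
    return &Node{Name: name, Up: up, Data: map[string]string{}, Hints: map[string]map[string]string{}}
}

// sloppyWrite tries the preferred replicas first; for each one that is down,
// it writes to the next healthy fallback node together with a hint.
func sloppyWrite(preferred, fallbacks []*Node, key, value string) error {
    fi := 0
    for _, n := range preferred {
        if n.Up {
            n.Data[key] = value
            continue
        }
        // Hinted handoff: park the write on a healthy fallback node.
        for fi < len(fallbacks) && !fallbacks[fi].Up {
            fi++
        }
        if fi == len(fallbacks) {
            return errors.New("not enough healthy nodes")
        }
        f := fallbacks[fi]
        fi++
        f.Data[key] = value
        if f.Hints[n.Name] == nil {
            f.Hints[n.Name] = map[string]string{}
        }
        f.Hints[n.Name][key] = value
    }
    return nil
}

func main() {
    a, b, c := newNode("A", true), newNode("B", false), newNode("C", true)
    d := newNode("D", true) // outside the preference list

    _ = sloppyWrite([]*Node{a, b, c}, []*Node{d}, "user:1", "alice")
    fmt.Println(d.Hints["B"]) // handed back to B when it recovers
}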

Storage Engine

LSM-Tree (Log-Structured Merge Tree)

Most distributed KV stores use LSM-trees for write optimization:

┌─────────────┐
│  MemTable   │ ← In-memory writes (sorted)
└──────┬──────┘
       │ Flush when full
┌──────▼──────┐
│   SSTable   │ ← Immutable on-disk files
│  (Level 0)  │
└──────┬──────┘
       │ Compact & merge
┌──────▼──────┐
│   SSTable   │
│  (Level 1)  │
└─────────────┘

Write Path:

  1. Write to commit log (durability)
  2. Write to MemTable (fast access)
  3. Flush to SSTable when full
  4. Background compaction

Read Path:

  1. Check MemTable
  2. Check Bloom filters
  3. Binary search in SSTables
  4. Return value or not found
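
A hedged sketch of that read path, with the MemTable, SSTables, and Bloom filters reduced to in-memory stand-ins (a real engine would consult per-SSTable filters and on-disk files):

package main

import "fmt"

// sstable is a stand-in for an immutable on-disk file: its key-value data
// plus a membership filter (exact here, probabilistic in a real engine).
type sstable struct {
    data  map[string]string
    bloom map[string]bool // stand-in for a real Bloom filter
}

type lsm struct {
    memtable map[string]string
    sstables []sstable // newest first
}

func (db *lsm) Get(key string) (string, bool) {
    // 1. Check the MemTable (most recent writes).
    if v, ok := db.memtable[key]; ok {
        return v, true
    }
    // 2./3. For each SSTable, consult the filter before touching its data.
    for _, t := range db.sstables {
        if !t.bloom[key] {
            continue // filter says "definitely not here": skip this file
        }
        if v, ok := t.data[key]; ok {
            return v, true
        }
    }
    // 4. Not found anywhere.
    return "", false
}

func main() {
    db := &lsm{
        memtable: map[string]string{"k1": "v1-new"},
        sstables: []sstable{
            {data: map[string]string{"k2": "v2"}, bloom: map[string]bool{"k2": true}},
        },
    }
    fmt.Println(db.Get("k1")) // v1-new true (served from MemTable)
    fmt.Println(db.Get("k2")) // v2 true     (filter hit, then SSTable)
    fmt.Println(db.Get("k3")) // "" false    (filters skip every SSTable)
}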

Auto-scaling

Monitor metrics and scale automatically:

func autoScale(metrics Metrics) {
    if metrics.CPUUsage > 80 || metrics.LatencyP99 > 100 {
        addNode()
    } else if metrics.CPUUsage < 20 && nodeCount > minNodes {
        removeNode()
    }
}

Scaling triggers:

  • CPU usage > 80%
  • Memory usage > 85%
  • P99 latency > threshold
  • Disk usage > 80%

Performance Optimizations

1. Bloom Filters

Skip unnecessary disk reads:

if !bloomFilter.MightContain(key) {
    return nil, NotFound
}

2. Compression

Reduce storage and network I/O:

  • Snappy for speed
  • Zstandard for ratio
  • LZ4 for balance

3. Caching

Multi-level caching strategy:

  • Row cache: Full key-value pairs
  • Block cache: SSTable blocks
  • Key cache: Key → disk location

4. Batch Operations

Reduce network round trips:

batch := NewBatch()
batch.Put("key1", "value1")
batch.Put("key2", "value2")
batch.Delete("key3")
db.Write(batch)

Monitoring & Operations

Key Metrics

  • Latency: P50, P95, P99
  • Throughput: Reads/writes per second
  • Availability: Uptime percentage
  • Consistency: Divergence between replicas
  • Storage: Disk usage, compaction stats

Health Checks

func healthCheck() HealthStatus {
    return HealthStatus{
        Replication: checkReplicationLag(),
        Storage:     checkDiskSpace(),
        Network:     checkNodeConnectivity(),
        Load:        checkLoadDistribution(),
    }
}

Implementation Example

type KVStore struct {
    ring              *ConsistentHash
    storage           StorageEngine
    replicator        Replicator
    coordinator       Coordinator
    replicationFactor int
}

func (kv *KVStore) Put(key, value string, consistency Level) error {
    // Find the replica nodes responsible for this key
    nodes := kv.ring.GetNodes(key, kv.replicationFactor)

    // Write based on consistency level
    switch consistency {
    case ONE:
        return kv.writeOne(nodes, key, value)
    case QUORUM:
        return kv.writeQuorum(nodes, key, value)
    case ALL:
        return kv.writeAll(nodes, key, value)
    default:
        return fmt.Errorf("unknown consistency level: %v", consistency)
    }
}

func (kv *KVStore) Get(key string, consistency Level) (string, error) {
    nodes := kv.ring.GetNodes(key, kv.replicationFactor)

    switch consistency {
    case ONE:
        return kv.readOne(nodes, key)
    case QUORUM:
        return kv.readQuorum(nodes, key)
    case ALL:
        return kv.readAll(nodes, key)
    default:
        return "", fmt.Errorf("unknown consistency level: %v", consistency)
    }
}
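
The quorum helpers referenced above (writeQuorum and friends) are left undefined; here is a hedged, self-contained sketch of what writeQuorum might look like, assuming a per-replica client with a blocking Put and counting a majority of acknowledgements as success:

package main

import (
    "errors"
    "fmt"
    "sync"
)

// replica is a stand-in for a client connection to one storage node.
type replica struct {
    name string
    data map[string]string
    down bool
}

func (r *replica) Put(key, value string) error {
    if r.down {
        return errors.New(r.name + " unreachable")
    }
    r.data[key] = value
    return nil
}

// writeQuorum sends the write to all replicas in parallel and succeeds once
// a majority (N/2 + 1) have acknowledged it.
func writeQuorum(replicas []*replica, key, value string) error {
    need := len(replicas)/2 + 1
    acks := make(chan error, len(replicas))

    var wg sync.WaitGroup
    for _, r := range replicas {
        wg.Add(1)
        go func(r *replica) {
            defer wg.Done()
            acks <- r.Put(key, value)
        }(r)
    }
    wg.Wait()
    close(acks)

    ok := 0
    for err := range acks {
        if err == nil {
            ok++
        }
    }
    if ok >= need {
        return nil
    }
    return fmt.Errorf("quorum not reached: %d/%d acks", ok, need)
}

func main() {
    nodes := []*replica{
        {name: "n1", data: map[string]string{}},
        {name: "n2", data: map[string]string{}},
        {name: "n3", data: map[string]string{}, down: true},
    }
    fmt.Println(writeQuorum(nodes, "user:1", "alice")) // <nil>: 2 of 3 acked
}

A production coordinator would return as soon as the quorum is reached rather than waiting for every replica, and would hand failed writes to hinted handoff.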

Trade-offs & Decisions

| Decision | Option A | Option B | Our Choice |
|----------|----------|----------|------------|
| Consistency Model | Strong (CP) | Eventual (AP) | Tunable (Both) |
| Partitioning | Range-based | Hash-based | Consistent Hash |
| Storage Engine | B-tree | LSM-tree | LSM-tree |
| Replication | Master-slave | Peer-to-peer | Peer-to-peer |
| Conflict Resolution | Last-write-wins | Vector clocks | Vector clocks |

Real-World Examples

Amazon DynamoDB

  • Eventually consistent by default
  • Consistent hashing with virtual nodes
  • Multi-master replication
  • Auto-scaling based on traffic

Apache Cassandra

  • Tunable consistency (ONE to ALL)
  • Gossip protocol for membership
  • LSM-tree storage engine
  • No single point of failure

Redis Cluster

  • Master-slave replication
  • 16384 hash slots
  • Synchronous replication option
  • In-memory with persistence
