Distributed Key-Value Store

Building a distributed key-value store requires understanding fundamental trade-offs in distributed systems. The CAP theorem frames the most fundamental of these trade-offs and shapes many distributed system design decisions. Let's explore these trade-offs before diving into the architecture.

CAP Theorem Interactive Demo

In distributed systems, you can only guarantee two of these three properties:

  • Consistency: All nodes see the same data
  • Availability: System stays operational
  • Partition Tolerance: Works despite network failures

Typical combinations: CP (MongoDB, HBase), AP (Cassandra, DynamoDB), CA (traditional RDBMS).

Key Insights

Why P is Essential

Network partitions are inevitable in distributed systems. Cables break, switches fail, and data centers lose connectivity. Therefore, you must choose between C and A.

CA is Theoretical

CA systems work only in non-distributed environments. Once you distribute across multiple nodes, network partitions become possible, making CA impractical.

Now that we understand the fundamental constraints, let's design a system that can scale to handle billions of key-value pairs across hundreds of nodes while making informed decisions about these trade-offs.

Requirements

Our distributed key-value store must handle:

  • Small key-value pairs: Less than 10 KB each
  • Big data: Store and retrieve billions of items
  • High availability: Respond quickly even during failures
  • High scalability: Scale horizontally with traffic
  • Auto-scaling: Add/remove nodes automatically
  • Tunable consistency: Choose between strong and eventual consistency
  • Low latency: Sub-millisecond response times

CAP Theorem: The Fundamental Trade-off

The CAP theorem, conjectured by Eric Brewer and later proved by Seth Gilbert and Nancy Lynch, states that any distributed system can only guarantee two out of three properties:

  • Consistency (C): All nodes see the same data at the same time - reads receive the most recent write
  • Availability (A): System remains operational - every request receives a response (success or failure)
  • Partition Tolerance (P): System continues operating despite network failures between nodes

Why You Must Choose

In real-world distributed systems, network partitions are inevitable. Cables break, switches fail, data centers lose connectivity, and network congestion causes delays. This means:

You cannot avoid P, so you must choose between C and A

The Two Practical Choices

CP Systems (Consistency + Partition Tolerance)

  • Strategy: Reject requests when consistency cannot be guaranteed
  • Examples: MongoDB, HBase, Redis Cluster, Consul, Zookeeper
  • Use Cases: Financial transactions, inventory management, configuration stores
  • Trade-off: System may become unavailable during partitions

AP Systems (Availability + Partition Tolerance)

  • Strategy: Accept requests even if data might be inconsistent
  • Examples: Cassandra, DynamoDB, CouchDB, Riak, DNS
  • Use Cases: Social media feeds, content delivery, analytics, user preferences
  • Trade-off: Data may be temporarily inconsistent across nodes

Why CA Systems Don't Work

CA Systems (Traditional RDBMS in single location) work only when:

  • All data is on one machine, OR
  • All machines are in the same rack with perfect network

Once you distribute across multiple locations, network partitions become possible, making pure CA systems impractical for distributed architectures.

Architecture Overview

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Client    │     │   Client    │     │   Client    │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │
       └───────────────────┴───────────────────┘
                    ┌──────▼──────┐
                    │Load Balancer│
                    └──────┬──────┘
       ┌───────────────────┼───────────────────┐
       │                   │                   │
┌──────▼──────┐     ┌──────▼──────┐     ┌──────▼──────┐
│  API Server │     │  API Server │     │  API Server │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │
       └───────────────────┴───────────────────┘
                    ┌──────▼──────┐
                    │ Coordinator │
                    └──────┬──────┘
       ┌───────────────────┼───────────────────┐
       │                   │                   │
┌──────▼──────┐     ┌──────▼──────┐     ┌──────▼──────┐
│Storage Node1│     │Storage Node2│     │Storage Node3│
│  (Replica)  │     │  (Replica)  │     │  (Replica)  │
└─────────────┘     └─────────────┘     └─────────────┘

Core Components

1. Data Partitioning

We use consistent hashing to distribute data across nodes:

Partitioning Demo

Interactive partitioning demo coming soon...

Benefits:

  • Minimal data movement when nodes join/leave
  • Even distribution of keys
  • Each node manages ~1/N of total data
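
To make this concrete, here is a minimal sketch of a consistent-hash ring with virtual nodes, assuming CRC32 as the hash function; the Ring type and its methods are illustrative, not taken from a specific library:

package main

import (
    "fmt"
    "hash/crc32"
    "sort"
)

// Ring maps virtual-node hash positions to physical node names.
type Ring struct {
    vnodes int
    hashes []uint32          // sorted virtual-node positions on the ring
    owners map[uint32]string // position -> physical node
}

func NewRing(vnodes int) *Ring {
    return &Ring{vnodes: vnodes, owners: make(map[uint32]string)}
}

// AddNode places `vnodes` virtual nodes for one physical node on the ring.
func (r *Ring) AddNode(name string) {
    for i := 0; i < r.vnodes; i++ {
        h := crc32.ChecksumIEEE([]byte(fmt.Sprintf("%s#%d", name, i)))
        r.owners[h] = name
        r.hashes = append(r.hashes, h)
    }
    sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
}

// GetNode returns the first virtual node clockwise from the key's hash.
func (r *Ring) GetNode(key string) string {
    if len(r.hashes) == 0 {
        return ""
    }
    h := crc32.ChecksumIEEE([]byte(key))
    i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= h })
    if i == len(r.hashes) {
        i = 0 // wrap around the ring
    }
    return r.owners[r.hashes[i]]
}

func main() {
    ring := NewRing(100)
    ring.AddNode("node-1")
    ring.AddNode("node-2")
    ring.AddNode("node-3")
    fmt.Println(ring.GetNode("user:12345")) // one of the three nodes
}

Virtual nodes (100 per physical node here) smooth out the key distribution and keep data movement small when a node joins or leaves.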

2. Database Scaling Fundamentals

Before diving into our replication strategy, we need to understand database scaling concepts. For a comprehensive guide on replication vs sharding, MySQL scaling with Vitess, multi-master setups, and cross-shard join challenges, see:

📚 Database Concepts: Scaling Strategies
Essential reading covering replication, sharding, MySQL vs PostgreSQL, Vitess architecture, multi-master replication, write scaling challenges, and cross-shard join patterns.

Quick Summary for our context:

  • Sharding: Splits data across nodes for storage/write scaling
  • Replication: Copies data across nodes for availability
  • Combined: Our KV store uses both - sharding for scale + replication for fault tolerance

MySQL

1. Replication (MySQL's Traditional Approach):

┌─────────────┐     Async/Semi-sync     ┌─────────────┐
│   Master    │ ──────────────────────> │   Slave 1   │
│  (Writes)   │                         │  (Reads)    │
└─────────────┘ ──────────────────────> └─────────────┘
                                        ┌─────────────┐
                                        │   Slave 2   │
                                        │  (Reads)    │
                                        └─────────────┘

- Master handles all writes
- Slaves handle read queries
- Asynchronous by default (lag possible)
- Semi-synchronous ensures at least one slave has data

2. Sharding (MySQL Cluster/NDB or Application-level):

Application determines shard key (e.g., user_id % 4)
    ┌────┴────┬─────────┬─────────┐
    │         │         │         │
┌───▼───┐ ┌──▼───┐ ┌──▼───┐ ┌──▼───┐
│Shard 1│ │Shard 2│ │Shard 3│ │Shard 4│
│US-West│ │US-East│ │Europe │ │ Asia  │
└───────┘ └───────┘ └───────┘ └───────┘

- Often implemented at application level
- MySQL Fabric for automated sharding
- Vitess for large-scale sharding

PostgreSQL

1. Replication (Streaming Replication):

┌─────────────┐     WAL Streaming      ┌─────────────┐
│   Primary   │ ──────────────────────>│  Standby 1  │
│             │                        │  (Hot)      │
└─────────────┘ ──────────────────────>└─────────────┘
                                       ┌─────────────┐
                                       │  Standby 2  │
                                       │  (Warm)     │
                                       └─────────────┘

- Write-Ahead Log (WAL) shipping
- Synchronous or asynchronous
- Hot standby can serve read queries
- Automatic failover with tools like Patroni

2. Sharding (PostgreSQL 10+ with Partitioning):

┌─────────────────────────────┐
│    Users Table (Parent)     │
└──────────────┬──────────────┘
               │ Declarative Partitioning
    ┌──────────┼──────────┬──────────┐
    │          │          │          │
┌───▼────┐ ┌──▼────┐ ┌──▼────┐ ┌──▼────┐
│Users_Q1│ │Users_Q2│ │Users_Q3│ │Users_Q4│
│Jan-Mar │ │Apr-Jun │ │Jul-Sep │ │Oct-Dec │
└────────┘ └────────┘ └────────┘ └────────┘

- Built-in table partitioning
- Foreign Data Wrappers for cross-node sharding
- Citus extension for true distributed sharding

MySQL vs PostgreSQL: Replication & Sharding Comparison

Replication

| Feature | MySQL | PostgreSQL |
|---------|-------|------------|
| Primary Method | Master-Slave (Binary Log) | Streaming Replication (WAL) |
| Replication Types | Asynchronous (default), Semi-synchronous, Group Replication | Asynchronous (default), Synchronous, Logical Replication |
| Multi-Master | ✓ Yes (MySQL Cluster, Galera) | ⚠ Limited (BDR, pgLogical) |
| Failover | Manual / MHA / Orchestrator | pg_auto_failover / Patroni |
| Read Replicas | ✓ Unlimited | ✓ Unlimited + Hot Standby |

Sharding

| Feature | MySQL | PostgreSQL |
|---------|-------|------------|
| Native Support | ❌ No (Application-level) | ⚠ Partitioning Only (Table-level) |
| Sharding Solutions | Vitess, MySQL Fabric, ProxySQL, Spider Engine | Citus, postgres_fdw, PL/Proxy, pg_shard (deprecated) |
| Cross-Shard Queries | Complex (App handles joins) | Better (FDW, Citus distributed queries) |
| Auto-Sharding | ✓ Vitess | ✓ Citus |

Performance & Scale

| Feature | MySQL | PostgreSQL |
|---------|-------|------------|
| Replication Lag | Variable (Binary log based) | Lower (WAL streaming) |
| Write Scaling | Good (With sharding) | Good (With Citus) |
| Best For | Web applications, Read-heavy workloads, Simple schemas | Complex queries, Data integrity critical, Analytics workloads |

Key Takeaways

MySQL Strengths:

  • Mature ecosystem with tools like Vitess for large-scale sharding
  • True multi-master replication with Galera Cluster
  • Simpler replication setup for basic use cases
  • Better suited for simple, high-volume web applications

PostgreSQL Strengths:

  • Superior built-in partitioning capabilities
  • More reliable replication with WAL streaming
  • Better cross-shard query capabilities with FDW
  • Stronger consistency guarantees and data integrity

When to Choose:

  • MySQL + Vitess: Large-scale web apps needing horizontal scaling
  • PostgreSQL + Citus: Complex analytics with distributed queries
  • MySQL Galera: Multi-master writes across regions
  • PostgreSQL Streaming: High-availability with strong consistency

Deep Dive: Vitess - YouTube's Gift to MySQL Scaling

Vitess is a database clustering system for horizontal scaling of MySQL. Originally developed at YouTube to handle massive growth, it's now used by companies like Slack, Square, and GitHub.

How Vitess Works

┌─────────────────────────────────────────────────────────────┐
│                        Application                          │
└───────────────────────┬─────────────────────────────────────┘
                        │ gRPC/MySQL Protocol
┌───────────────────────▼─────────────────────────────────────┐
│                      VTGate (Proxy)                         │
│  • Query routing      • Connection pooling                  │
│  • Query rewriting    • Scatter-gather engine              │
└───────────────┬───────────────┬─────────────────┬──────────┘
                │               │                 │
        ┌───────▼────┐   ┌──────▼────┐    ┌──────▼────┐
        │  VTTablet  │   │  VTTablet │    │  VTTablet │
        │  (Primary) │   │  (Primary)│    │  (Primary)│
        │  Shard -80 │   │Shard 80-c0│    │ Shard c0- │
        └─────┬──────┘   └─────┬────┘    └─────┬─────┘
              │                │                │
        ┌─────▼────┐     ┌─────▼───┐     ┌─────▼────┐
        │  MySQL   │     │  MySQL  │     │  MySQL   │
        │ Primary  │     │ Primary │     │ Primary  │
        └─────┬────┘     └────┬────┘     └────┬─────┘
              │               │               │
        ┌─────▼────┐     ┌────▼────┐     ┌────▼─────┐
        │ Replica  │     │ Replica │     │ Replica  │
        └──────────┘     └─────────┘     └──────────┘

Components:
• VTGate: Stateless proxy, routes queries to correct shards
• VTTablet: Controls MySQL instance, handles replication
• Topology Service: Stores metadata (uses etcd/Consul/ZooKeeper)

Vitess Auto-Sharding Process

1. Initial State - Unsharded

┌────────────────────┐
│   Unsharded DB     │
│  users (100GB)     │
│  All data in one   │
│     MySQL          │
└────────────────────┘

2. Prepare for Sharding

Step 1: Define Sharding Key (e.g., user_id)
Step 2: Create VSchema
{
  "sharded": true,
  "vindexes": {
    "hash": {
      "type": "hash"
    }
  },
  "tables": {
    "users": {
      "column_vindexes": [{
        "column": "user_id",
        "name": "hash"
      }]
    }
  }
}

3. Split Process (Zero-Downtime)

Phase 1: Create Target Shards
┌──────────────┐     ┌─────────────┐     ┌─────────────┐
│  Source DB   │     │  Shard -80  │     │  Shard 80-  │
│ (All Data)   │ ──> │ (user_id    │     │ (user_id    │
│              │     │  hash < 80) │     │  hash >= 80)│
└──────────────┘     └─────────────┘     └─────────────┘
                           ▲                     ▲
                           │                     │
                           └─────────┬───────────┘
                            VReplication
                            (Filtered Copy)

Phase 2: Switch Reads (instant)
Phase 3: Switch Writes (brief lock)
Phase 4: Clean up source

4. Resharding (Splitting Further)

Before: 2 shards [-80, 80-]
After:  4 shards [-40, 40-80, 80-c0, c0-]

┌─────────┐      ┌─────────┐  ┌─────────┐
│ Shard   │ ──>  │ Shard   │  │ Shard   │
│  -80    │      │  -40    │  │  40-80  │
└─────────┘      └─────────┘  └─────────┘
┌─────────┐      ┌─────────┐  ┌─────────┐
│ Shard   │ ──>  │ Shard   │  │ Shard   │
│  80-    │      │  80-c0  │  │  c0-    │
└─────────┘      └─────────┘  └─────────┘
     └── Each source shard splits into two target shards automatically

Key Vitess Features

1. Query Routing

-- Application sends:
SELECT * FROM users WHERE user_id = 12345;

-- VTGate determines shard:
user_id 12345 → hash → shard 40-80

-- Routes to correct MySQL:
/* vtgate:: keyspace:users shard:40-80 */
SELECT * FROM users WHERE user_id = 12345;

2. Scatter-Gather for Cross-Shard Queries

-- Query needing all shards:
SELECT COUNT(*) FROM users WHERE status = 'active';

-- VTGate scatter phase:
┌─────────┐     ┌─────────┐     ┌─────────┐
│Shard -40│     │Shard 40-│     │Shard 80-│
│COUNT: 25│     │COUNT: 30│     │COUNT: 45│
└─────────┘     └─────────┘     └─────────┘
     │               │               │
     └───────────────┼───────────────┘
              VTGate Gather
              Total: 100
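
Conceptually, the gather step is a fan-out of the same query to every shard followed by aggregation of the partial results. Here is a small Go sketch of that pattern, where countActive stands in for running the per-shard COUNT query (this is illustrative, not Vitess code):

package main

import (
    "fmt"
    "sync"
)

// countActive stands in for running COUNT(*) against one shard.
func countActive(shard string) int {
    fake := map[string]int{"-40": 25, "40-80": 30, "80-": 45}
    return fake[shard]
}

func main() {
    shards := []string{"-40", "40-80", "80-"}

    var wg sync.WaitGroup
    partial := make([]int, len(shards))

    // Scatter: query every shard concurrently.
    for i, s := range shards {
        wg.Add(1)
        go func(i int, s string) {
            defer wg.Done()
            partial[i] = countActive(s)
        }(i, s)
    }
    wg.Wait()

    // Gather: combine the partial counts into the final answer.
    total := 0
    for _, c := range partial {
        total += c
    }
    fmt.Println("total active users:", total) // 100
}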

Deep Dive: Multi-Master Replication

Multi-master replication allows writes to any node, unlike traditional master-slave setups. Let's explore the two main approaches:

1. MySQL Group Replication / Galera Cluster

Architecture:

┌─────────────────────────────────────────────────────────────┐
│                     Write to ANY Node                       │
└─────────────┬──────────────┬──────────────┬────────────────┘
              │              │              │
         ┌────▼────┐    ┌────▼────┐   ┌────▼────┐
         │ Node 1  │    │ Node 2  │   │ Node 3  │
         │ Primary │◄───│ Primary │───►│ Primary │
         └────┬────┘    └────┬────┘   └────┬────┘
              │              │              │
              └──────────────┼──────────────┘
                    Group Communication
                    (Consensus Protocol)

How Writes Work:

1. Client writes to Node 1
   UPDATE users SET status='active' WHERE id=123;

2. Certification Process:
   ┌─────────┐     ┌─────────────────┐     ┌──────────┐
   │ Node 1  │────►│ Broadcast Write │────►│ Node 2,3 │
   │(Origin) │     │   Transaction   │     │(Certify) │
   └─────────┘     └─────────────────┘     └──────────┘
        │                                        │
        │          ┌─────────────────┐          │
        └──────────│ Conflict Check  │◄─────────┘
                   │ - Same row?     │
                   │ - Write order?  │
                   └────────┬────────┘
                   ┌────────▼────────┐
                   │ Commit or Abort │
                   └─────────────────┘

3. If certified: All nodes apply in same order
4. If conflict: Transaction rolled back

Conflict Resolution Example:

Simultaneous writes to same row:

Node 1: UPDATE users SET age=25 WHERE id=1; (Time: 10:00:00.100)
Node 2: UPDATE users SET age=26 WHERE id=1; (Time: 10:00:00.101)

Resolution:
┌─────────┐     ┌─────────┐     ┌─────────┐
│ Node 1  │     │ Node 2  │     │ Node 3  │
│ age=25  │     │ age=26  │     │ Arbiter │
└────┬────┘     └────┬────┘     └────┬────┘
     │               │                │
     └───────────────┼────────────────┘
              Global Order Applied
              (First write wins)
              All nodes: age=25

2. Application-Level Multi-Master

Some systems implement multi-master at the application level using techniques like CRDTs (Conflict-free Replicated Data Types):

Shopping Cart Example - Merge on Read:

Node A (US):                Node B (EU):
┌──────────────┐           ┌──────────────┐
│ Cart_User123 │           │ Cart_User123 │
│ + iPhone     │           │ + iPad       │
│ + AirPods    │           │ + MacBook    │
└──────────────┘           └──────────────┘
       │                           │
       └───────────┬───────────────┘
            Merge Strategy
            ┌──────────────┐
            │ Cart_User123 │
            │ + iPhone     │
            │ + AirPods    │
            │ + iPad       │
            │ + MacBook    │
            └──────────────┘
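
This merge behaves like a grow-only set (G-Set) CRDT: each replica keeps a set of items, and merging is set union, which is commutative, associative, and idempotent, so replicas converge regardless of merge order. A minimal sketch, not tied to any particular CRDT library:

package main

import (
    "fmt"
    "sort"
)

// Cart is a grow-only set of item names.
type Cart map[string]bool

// Merge is set union: applying merges in any order yields the same cart.
func (c Cart) Merge(other Cart) {
    for item := range other {
        c[item] = true
    }
}

// Items returns the cart contents in a stable order.
func (c Cart) Items() []string {
    items := make([]string, 0, len(c))
    for item := range c {
        items = append(items, item)
    }
    sort.Strings(items)
    return items
}

func main() {
    us := Cart{"iPhone": true, "AirPods": true} // Node A (US)
    eu := Cart{"iPad": true, "MacBook": true}   // Node B (EU)

    us.Merge(eu)
    fmt.Println(us.Items()) // [AirPods MacBook iPad iPhone]
}

Removing items needs a richer type (for example an OR-Set), since a plain union would resurrect deleted entries.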

Multi-Master Trade-offs

✅ Advantages

  • Write anywhere - no single point of failure
  • Better geographic distribution
  • Read scaling across all nodes
  • Automatic failover

❌ Challenges

  • Write conflicts need resolution
  • Lower write throughput (consensus overhead)
  • Complex debugging
  • Network partitions can cause split-brain

When to Use Multi-Master

Good Fit:

  • Geographic distribution with local writes
  • High availability requirements
  • Read-heavy workloads
  • Systems that can tolerate/resolve conflicts

Poor Fit:

  • High-volume writes to same data
  • Strong consistency requirements
  • Complex transactions
  • Limited network bandwidth

Vitess vs Multi-Master: When Do You Need Both?

While Vitess solves horizontal scaling through sharding, multi-master replication addresses a different problem: write availability and geographic distribution. Let's break down when you need each:

Vitess Solves: Data Volume Scaling

Problem: Single database can't handle data size
┌─────────────────────────────────────────┐
│        Single MySQL Instance           │
│    users table: 500GB, 1B records      │
│    💥 Too big for one machine          │
└─────────────────────────────────────────┘

Vitess Solution: Horizontal partitioning
┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│  Shard 1    │  │  Shard 2    │  │  Shard 3    │
│  users      │  │  users      │  │  users      │
│  (1-333M)   │  │  (334-666M) │  │  (667M-1B)  │
└─────────────┘  └─────────────┘  └─────────────┘

Multi-Master Solves: Write Availability & Geography

Problem: Single write point fails or is too far
┌─────────────────────────────────────────┐
│         US West (Primary)               │
│    All writes must go here              │
│    💥 Far from EU/Asia users            │
│    💥 Single point of failure           │
└─────────────────────────────────────────┘

Multi-Master Solution: Write anywhere
┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│  US West    │  │  EU Central │  │  Asia Pac   │
│  (Master)   │  │  (Master)   │  │  (Master)   │
│  Low latency│  │  Low latency│  │  Low latency│
│  for US     │  │  for EU     │  │  for Asia   │
└─────────────┘  └─────────────┘  └─────────────┘

The Reality: You Often Need BOTH

Here's how real-world systems combine them:

Example: Global Social Media Platform

             Geographic Multi-Master Setup
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   US Region  │    │  EU Region   │    │ Asia Region  │
│              │    │              │    │              │
│   ┌─────┐    │    │   ┌─────┐    │    │   ┌─────┐    │
│   │User │    │◄──►│   │User │    │◄──►│   │User │    │
│   │Posts│    │    │   │Posts│    │    │   │Posts│    │
│   └─────┘    │    │   └─────┘    │    │   └─────┘    │
│       │      │    │       │      │    │       │      │
│    Vitess    │    │    Vitess    │    │    Vitess    │
│   Sharding   │    │   Sharding   │    │   Sharding   │
│       │      │    │       │      │    │       │      │
│ ┌───┬─┴─┬───┐│    │ ┌───┬─┴─┬───┐│    │ ┌───┬─┴─┬───┐│
│ │S1 │S2 │S3 ││    │ │S1 │S2 │S3 ││    │ │S1 │S2 │S3 ││
│ └───┴───┴───┘│    │ └───┴───┴───┘│    │ └───┴───┴───┘│
└──────────────┘    └──────────────┘    └──────────────┘

Each region:
- Multi-Master: Handles regional writes, replicates globally
- Vitess: Shards data within each region for scale

Decision Matrix

| Scenario | Vitess | Multi-Master | Both |
|----------|--------|--------------|------|
| Single region, large dataset | ✓ Perfect | Not needed | Overkill |
| Multi-region, small dataset | Not needed | ✓ Perfect | Overkill |
| Global scale, large dataset | Helps | Helps | ✓ Best |
| High write availability needed | Doesn't help | ✓ Perfect | If also large |
| Latency-sensitive writes | Doesn't help | ✓ Perfect | If also large |

Real-World Examples

Vitess Only:

YouTube Analytics Dashboard
- Massive historical data (petabytes)
- Complex aggregation queries
- Single region acceptable
- Vitess shards by time/user_id

Multi-Master Only:

Global User Preferences Service
- Small dataset (user settings)
- Needs local writes everywhere
- Galera Cluster across regions
- No sharding needed

Both Together:

Instagram/Facebook
- Massive media storage (Vitess for sharding)
- Global user base (Multi-master for regions)
- Each region shards its data locally
- Cross-region replication for user data

MySQL Write Throughput Challenge

Critical Point: Even with Vitess, MySQL's fundamental limitation remains - writes only happen on the primary node. This creates a single point of write operations per shard.

For scaling write throughput in a single region, you have limited options:

⚠️ MySQL Write Scaling Options

1. Multi-Master Replication

  • Multiple nodes accept writes
  • True write throughput scaling
  • Conflict resolution complexity
  • Network partition challenges

2. Alternative Approaches

  • Write buffering/batching
  • Asynchronous processing
  • Move to NoSQL (MongoDB, Cassandra)
  • Event sourcing patterns

Reality Check:

Vitess Sharding:               Multi-Master:
┌─────────────────────┐        ┌─────────────────────┐
│ Shard 1: [Primary] │        │ Master 1: ✓ Writes │
│         [Replica]  │   VS   │ Master 2: ✓ Writes │
│ Shard 2: [Primary] │        │ Master 3: ✓ Writes │
│         [Replica]  │        │ (All can write)     │
└─────────────────────┘        └─────────────────────┘
     ↑                                ↑
 Single writer per shard      Multiple concurrent writers

Key Insights

Vitess Limitations:

  • Still has single write point per shard
  • Cross-region writes still slow
  • Master failure affects that shard's writes
  • Write throughput = sum of shard primaries

Multi-Master Limitations:

  • Doesn't solve data volume scaling
  • Complex conflict resolution
  • Per-write coordination overhead grows with node count
  • Write throughput improves with more writers, but not linearly

Why Both:

Multi-Master handles:           Vitess handles:
┌─────────────────────┐        ┌─────────────────────┐
│ • Geographic writes │   +    │ • Data volume       │
│ • Write availability│        │ • Query distribution│
│ • Network latency   │        │ • Storage scaling   │
└─────────────────────┘        └─────────────────────┘
           │                            │
           └────────────┬───────────────┘
            Complete Distributed Solution

Cross-Shard Joins: The Efficiency Challenge

When using sharded systems like Vitess, cross-shard joins are problematic and should be avoided when possible.

How Cross-Shard Joins Work

⚠️ Cross-Shard Join Process

-- Query: Get user details with their recent orders
SELECT u.name, u.email, o.order_id, o.total 
FROM users u 
JOIN orders o ON u.user_id = o.user_id 
WHERE u.region = 'US'

What Vitess has to do:

1. Scatter Phase:
   ┌─────────────────────────────────────────────────────┐
   │ Query sent to ALL shards (because we don't know     │
   │ which shards contain matching users/orders)         │
   └─────────────────────────────────────────────────────┘
   Shard 1    Shard 2    Shard 3    Shard 4
   ┌─────┐    ┌─────┐    ┌─────┐    ┌─────┐
   │Users│    │Users│    │Users│    │Users│
   │Orders│   │Orders│   │Orders│   │Orders│
   └─────┘    └─────┘    └─────┘    └─────┘

2. Gather Phase:
   ┌─────────────────────────────────────────────────────┐
   │ Collect partial results from each shard            │
   │ Vitess proxy performs the JOIN operation           │
   │ Sort, filter, and return final results             │
   └─────────────────────────────────────────────────────┘

Efficiency Problems

🐌 Performance Issues
  • Network overhead: Data from all shards
  • Memory usage: Vitess proxy holds all data
  • No index usage: Can't use MySQL join optimizations
  • Serial processing: Wait for slowest shard
  • Bandwidth waste: Moving large datasets
📊 Complexity Issues
  • Query planning: Vitess must understand joins
  • Transaction boundaries: ACID across shards
  • Deadlocks: Cross-shard dependencies
  • Limited SQL support: Not all joins supported
  • Debugging difficulty: Multi-shard query plans

Vitess Join Support Levels

  • Same-shard joins: fast and efficient; uses MySQL's native join optimization.
  • ⚠️ Cross-shard joins: supported but slow; avoid for high-traffic queries.
  • Complex joins: limited support; subqueries and CTEs are often not supported.

Design Patterns to Avoid Cross-Shard Joins

1. Denormalization:

-- Instead of joining across shards
-- users (shard by user_id) + orders (shard by user_id)
-- Store user info in orders table

CREATE TABLE orders (
  order_id BIGINT,
  user_id BIGINT,
  user_name VARCHAR(100),  -- Denormalized
  user_email VARCHAR(100), -- Denormalized  
  total DECIMAL(10,2),
  created_at TIMESTAMP
) -- Shard by user_id

2. Application-Level Joins:

// Fetch from multiple shards in application
users := fetchUsers(userIDs)      // From user shards
orders := fetchOrders(userIDs)    // From order shards
result := joinInMemory(users, orders) // App does the join
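
Fleshing that pseudocode out, here is a hedged, self-contained sketch of the in-memory join; the User and Order types and the joinInMemory helper are illustrative stand-ins for your own shard-aware data access layer:

package main

import "fmt"

type User struct {
    ID    int64
    Name  string
    Email string
}

type Order struct {
    ID     int64
    UserID int64
    Total  float64
}

// UserOrders is the joined result the application assembles itself.
type UserOrders struct {
    User   User
    Orders []Order
}

// joinInMemory indexes users by ID, then attaches each order to its user.
func joinInMemory(users []User, orders []Order) []UserOrders {
    byID := make(map[int64]*UserOrders, len(users))
    for _, u := range users {
        byID[u.ID] = &UserOrders{User: u}
    }
    for _, o := range orders {
        if uo, ok := byID[o.UserID]; ok {
            uo.Orders = append(uo.Orders, o)
        }
    }
    result := make([]UserOrders, 0, len(byID))
    for _, uo := range byID {
        result = append(result, *uo)
    }
    return result
}

func main() {
    users := []User{{ID: 1, Name: "Ada", Email: "ada@example.com"}} // from user shards
    orders := []Order{{ID: 100, UserID: 1, Total: 42.50}}           // from order shards
    fmt.Println(joinInMemory(users, orders))
}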

3. Materialized Views:

-- Pre-computed join results
CREATE TABLE user_order_summary (
  user_id BIGINT,
  total_orders INT,
  total_spent DECIMAL(12,2),
  last_order_date DATE
) -- Updated via events/batch jobs

4. Event-Driven Consistency:

User Update Event → Update orders.user_name
Order Created Event → Update user_order_summary

When Cross-Shard Joins are Acceptable

✅ Acceptable Use Cases
  • Admin/Analytics queries: Low frequency, can be slow
  • Small result sets: Limited data movement
  • Non-critical paths: Batch processing, reports
  • Development/Testing: Temporary queries

Performance Comparison:

Same-Shard Join:     1-10ms    (MySQL native optimization)
Cross-Shard Join:    100-1000ms (Network + coordination overhead)
Application Join:    50-200ms   (Parallel fetches + in-memory join)

Bottom Line:

  • Vitess = Scale your data horizontally
  • Multi-Master = Write from anywhere with high availability
  • Both = Global, massive-scale applications (Facebook, Instagram, YouTube)

You choose based on your specific constraints: data size, geographic distribution, write patterns, and availability requirements.

Key Differences Summary

| Aspect | Sharding | Replication |
|--------|----------|-------------|
| Purpose | Scalability | Availability |
| Data Distribution | Different data on each node | Same data on all nodes |
| Storage Efficiency | ✓ High (data divided) | ⚠ Low (data duplicated) |
| Write Performance | 📈 Scales linearly | 🔒 Limited by slowest replica |
| Read Performance | Scales for different data | Scales for same data |
| Failure Impact | ❌ Partial data loss | ✓ No data loss |
| Complexity | High (routing, joins) | Medium (consistency) |

3. Our Replication Strategy

Now that we understand the concepts, for our distributed KV store, we'll use both techniques. First, let's look at how we replicate each key to N nodes for high availability:

Replication Visualizer

Interactive replication demo coming soon...

Replication approaches:

  • Simple: Next N nodes on hash ring
  • Rack-aware: Different racks/data centers
  • Zone-aware: Different availability zones
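
As a sketch of the simple approach above: walk clockwise from the key's position on the ring and collect the next N distinct physical nodes. The vnode type and replicasFor helper are illustrative; a rack- or zone-aware variant would additionally skip nodes whose rack or zone is already represented.

package main

import (
    "fmt"
    "hash/crc32"
    "sort"
)

// vnode is one position on the ring owned by a physical node.
type vnode struct {
    hash uint32
    node string
}

// replicasFor walks clockwise from the key's hash and collects the next
// n distinct physical nodes (the "simple" replication strategy).
func replicasFor(ring []vnode, key string, n int) []string {
    h := crc32.ChecksumIEEE([]byte(key))
    start := sort.Search(len(ring), func(i int) bool { return ring[i].hash >= h })

    seen := make(map[string]bool)
    var replicas []string
    for i := 0; i < len(ring) && len(replicas) < n; i++ {
        v := ring[(start+i)%len(ring)] // wrap around the ring
        if !seen[v.node] {
            seen[v.node] = true
            replicas = append(replicas, v.node)
        }
    }
    return replicas
}

func main() {
    // A toy ring with one virtual node per physical node, sorted by hash.
    var ring []vnode
    for _, n := range []string{"node-1", "node-2", "node-3", "node-4"} {
        ring = append(ring, vnode{crc32.ChecksumIEEE([]byte(n)), n})
    }
    sort.Slice(ring, func(i, j int) bool { return ring[i].hash < ring[j].hash })

    // Three distinct replicas; exact order depends on hash positions.
    fmt.Println(replicasFor(ring, "user:12345", 3))
}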

4. Consistency Levels

Different operations require different consistency guarantees:

Consistency Level Demo

Interactive consistency levels demo coming soon...

Write Consistency Levels:

  • ONE: Write completes after one replica acknowledges
  • QUORUM: Write completes after majority acknowledge
  • ALL: Write completes after all replicas acknowledge

Read Consistency Levels:

  • ONE: Return after reading from one replica
  • QUORUM: Return most recent from majority
  • ALL: Return after reading from all replicas

The Quorum Formula Explained

The formula R + W > N ensures strong consistency. Let's break this down:

  • N = Total number of replicas (replication factor)
  • W = Write consistency level (number of nodes that must acknowledge a write)
  • R = Read consistency level (number of nodes that must respond to a read)

Why does R + W > N guarantee strong consistency?

When R + W > N, there's guaranteed overlap between the nodes that wrote the data and the nodes you're reading from. This means at least one node in your read set has the latest value.

Visual Example with N=3:

Write with W=2:            Read with R=2:
┌─────┐ ✓ Written          ┌─────┐ ✓ Read
│Node1│                    │Node1│
└─────┘                    └─────┘
┌─────┐ ✓ Written          ┌─────┐ 
│Node2│                    │Node2│ 
└─────┘                    └─────┘
┌─────┐ ✗ Not yet          ┌─────┐ ✓ Read
│Node3│                    │Node3│
└─────┘                    └─────┘

W=2 + R=2 = 4 > N=3 ✓
Overlap guaranteed: Node1 has latest data

Common Quorum Configurations:

| Config | N | W | R | R+W | Consistency | Use Case |
|--------|---|---|---|-----|-------------|----------|
| Strong | 3 | 2 | 2 | 4 > 3 | ✓ Strong | Critical data |
| Read-Heavy | 3 | 3 | 1 | 4 > 3 | ✓ Strong | Read-optimized |
| Write-Heavy | 3 | 1 | 3 | 4 > 3 | ✓ Strong | Write-optimized |
| Eventual | 3 | 1 | 1 | 2 < 3 | ✗ Eventual | Performance-critical |

Trade-offs:

  • Higher W: Slower writes, better durability
  • Higher R: Slower reads, fresher data
  • W=N: Maximum durability, writes fail if any node is down
  • R=N: Always read latest, reads fail if any node is down
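
These rules are easy to encode as a configuration sanity check; a tiny illustrative sketch:

package main

import "fmt"

// isStronglyConsistent reports whether a read quorum R and write quorum W
// are guaranteed to overlap on at least one replica out of N.
func isStronglyConsistent(n, w, r int) bool {
    return r+w > n
}

func main() {
    fmt.Println(isStronglyConsistent(3, 2, 2)) // true:  strong (2+2 > 3)
    fmt.Println(isStronglyConsistent(3, 3, 1)) // true:  strong, read-optimized
    fmt.Println(isStronglyConsistent(3, 1, 1)) // false: eventual (1+1 < 3)
}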

5. Versioning & Conflict Resolution

We use vector clocks to track causality:

Client A writes v1: [A:1]
Client B writes v2: [B:1]
Client A writes v3: [A:2, B:1]

Conflict detected: v1 and v2 are concurrent
Resolution: Last-write-wins or application-level merge
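
A hedged sketch of how that conflict detection works: one vector clock "happens before" another only if none of its counters is larger and at least one is smaller; otherwise the two versions are concurrent and must be reconciled. The types below are illustrative:

package main

import "fmt"

// VectorClock maps a client/node ID to its logical counter.
type VectorClock map[string]int

// happensBefore reports whether a causally precedes b: every counter in a
// is <= the matching counter in b, and b has at least one larger entry.
func happensBefore(a, b VectorClock) bool {
    for id, ca := range a {
        if ca > b[id] {
            return false
        }
    }
    for id, cb := range b {
        if cb > a[id] {
            return true
        }
    }
    return false
}

// concurrent means neither version descends from the other: a conflict.
func concurrent(a, b VectorClock) bool {
    return !happensBefore(a, b) && !happensBefore(b, a)
}

func main() {
    v1 := VectorClock{"A": 1}         // Client A writes v1
    v2 := VectorClock{"B": 1}         // Client B writes v2
    v3 := VectorClock{"A": 2, "B": 1} // Client A writes v3 after seeing both

    fmt.Println(concurrent(v1, v2))    // true  -> needs reconciliation
    fmt.Println(happensBefore(v2, v3)) // true  -> v3 supersedes v2
}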

6. Failure Handling

Techniques for high availability:

  1. Sloppy Quorum: Use first N healthy nodes
  2. Hinted Handoff: Store temporarily on another node
  3. Read Repair: Fix inconsistencies during reads
  4. Anti-entropy: Background process to sync replicas
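
As an illustration of the first two techniques, here is a hedged sketch of a sloppy-quorum write with hinted handoff: if a preferred replica is down, the coordinator writes to the next healthy node and stores a hint so the write can be replayed when the home node recovers. All names below (Node, sloppyWrite, and so on) are hypothetical:

package main

import (
    "errors"
    "fmt"
)

type Node struct {
    Name string
    Up   bool
    Data map[string]string
    // Hints holds writes destined for another node that was down.
    Hints map[string]map[string]string // home node -> key -> value
}

func newNode(name string, up bool) *Node {
    return &Node{Name: name, Up: up, Data: map[string]string{}, Hints: map[string]map[string]string{}}
}

// sloppyWrite tries the preferred replicas first; for each one that is down,
// it writes to the next healthy fallback node together with a hint.
func sloppyWrite(preferred, fallbacks []*Node, key, value string) error {
    fi := 0
    for _, n := range preferred {
        if n.Up {
            n.Data[key] = value
            continue
        }
        // Hinted handoff: park the write on a healthy fallback node.
        for fi < len(fallbacks) && !fallbacks[fi].Up {
            fi++
        }
        if fi == len(fallbacks) {
            return errors.New("not enough healthy nodes")
        }
        f := fallbacks[fi]
        fi++
        f.Data[key] = value
        if f.Hints[n.Name] == nil {
            f.Hints[n.Name] = map[string]string{}
        }
        f.Hints[n.Name][key] = value
    }
    return nil
}

func main() {
    a, b, c := newNode("A", true), newNode("B", false), newNode("C", true)
    d := newNode("D", true) // outside the preference list

    _ = sloppyWrite([]*Node{a, b, c}, []*Node{d}, "user:1", "alice")
    fmt.Println(d.Hints["B"]) // handed back to B when it recovers
}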

Storage Engine

LSM-Tree (Log-Structured Merge Tree)

Most distributed KV stores use LSM-trees for write optimization:

┌─────────────┐
│  MemTable   │ ← In-memory writes (sorted)
└──────┬──────┘
       │ Flush when full
┌──────▼──────┐
│   SSTable   │ ← Immutable on-disk files
│  (Level 0)  │
└──────┬──────┘
       │ Compact & merge
┌──────▼──────┐
│   SSTable   │
│  (Level 1)  │
└─────────────┘

Write Path:

  1. Write to commit log (durability)
  2. Write to MemTable (fast access)
  3. Flush to SSTable when full
  4. Background compaction

Read Path:

  1. Check MemTable
  2. Check Bloom filters
  3. Binary search in SSTables
  4. Return value or not found
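
A hedged sketch of that read path, with the MemTable, SSTables, and Bloom filters reduced to in-memory stand-ins (a real engine would consult per-SSTable filters and on-disk files):

package main

import "fmt"

// sstable is a stand-in for an immutable on-disk file: its key-value data
// plus a membership filter (exact here, probabilistic in a real engine).
type sstable struct {
    data  map[string]string
    bloom map[string]bool // stand-in for a real Bloom filter
}

type lsm struct {
    memtable map[string]string
    sstables []sstable // newest first
}

func (db *lsm) Get(key string) (string, bool) {
    // 1. Check the MemTable (most recent writes).
    if v, ok := db.memtable[key]; ok {
        return v, true
    }
    // 2./3. For each SSTable, consult the filter before touching its data.
    for _, t := range db.sstables {
        if !t.bloom[key] {
            continue // filter says "definitely not here": skip this file
        }
        if v, ok := t.data[key]; ok {
            return v, true
        }
    }
    // 4. Not found anywhere.
    return "", false
}

func main() {
    db := &lsm{
        memtable: map[string]string{"k1": "v1-new"},
        sstables: []sstable{
            {data: map[string]string{"k2": "v2"}, bloom: map[string]bool{"k2": true}},
        },
    }
    fmt.Println(db.Get("k1")) // v1-new true (served from MemTable)
    fmt.Println(db.Get("k2")) // v2 true     (filter hit, then SSTable)
    fmt.Println(db.Get("k3")) // "" false    (filters skip every SSTable)
}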

Auto-scaling

Monitor metrics and scale automatically:

func autoScale(metrics Metrics) {
    if metrics.CPUUsage > 80 || metrics.LatencyP99 > 100 {
        addNode()
    } else if metrics.CPUUsage < 20 && nodeCount > minNodes {
        removeNode()
    }
}

Scaling triggers:

  • CPU usage > 80%
  • Memory usage > 85%
  • P99 latency > threshold
  • Disk usage > 80%

Performance Optimizations

1. Bloom Filters

Skip unnecessary disk reads:

if !bloomFilter.MightContain(key) {
    return nil, NotFound
}

2. Compression

Reduce storage and network I/O:

  • Snappy for speed
  • Zstandard for ratio
  • LZ4 for balance

3. Caching

Multi-level caching strategy:

  • Row cache: Full key-value pairs
  • Block cache: SSTable blocks
  • Key cache: Key → disk location

4. Batch Operations

Reduce network round trips:

batch := NewBatch()
batch.Put("key1", "value1")
batch.Put("key2", "value2")
batch.Delete("key3")
db.Write(batch)

Monitoring & Operations

Key Metrics

  • Latency: P50, P95, P99
  • Throughput: Reads/writes per second
  • Availability: Uptime percentage
  • Consistency: Divergence between replicas
  • Storage: Disk usage, compaction stats

Health Checks

func healthCheck() HealthStatus {
    return HealthStatus{
        Replication: checkReplicationLag(),
        Storage:     checkDiskSpace(),
        Network:     checkNodeConnectivity(),
        Load:        checkLoadDistribution(),
    }
}

Implementation Example

type KVStore struct {
    ring              *ConsistentHash
    storage           StorageEngine
    replicator        Replicator
    coordinator       Coordinator
    replicationFactor int
}

func (kv *KVStore) Put(key, value string, consistency Level) error {
    // Find the replica nodes responsible for this key
    nodes := kv.ring.GetNodes(key, kv.replicationFactor)

    // Write based on consistency level
    switch consistency {
    case ONE:
        return kv.writeOne(nodes, key, value)
    case QUORUM:
        return kv.writeQuorum(nodes, key, value)
    case ALL:
        return kv.writeAll(nodes, key, value)
    default:
        return fmt.Errorf("unknown consistency level: %v", consistency)
    }
}

func (kv *KVStore) Get(key string, consistency Level) (string, error) {
    nodes := kv.ring.GetNodes(key, kv.replicationFactor)

    switch consistency {
    case ONE:
        return kv.readOne(nodes, key)
    case QUORUM:
        return kv.readQuorum(nodes, key)
    case ALL:
        return kv.readAll(nodes, key)
    default:
        return "", fmt.Errorf("unknown consistency level: %v", consistency)
    }
}
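
The quorum helpers referenced above (writeQuorum and friends) are left undefined; here is a hedged, self-contained sketch of what writeQuorum might look like, assuming a per-replica client with a blocking Put and counting a majority of acknowledgements as success:

package main

import (
    "errors"
    "fmt"
    "sync"
)

// replica is a stand-in for a client connection to one storage node.
type replica struct {
    name string
    data map[string]string
    down bool
}

func (r *replica) Put(key, value string) error {
    if r.down {
        return errors.New(r.name + " unreachable")
    }
    r.data[key] = value
    return nil
}

// writeQuorum sends the write to all replicas in parallel and succeeds once
// a majority (N/2 + 1) have acknowledged it.
func writeQuorum(replicas []*replica, key, value string) error {
    need := len(replicas)/2 + 1
    acks := make(chan error, len(replicas))

    var wg sync.WaitGroup
    for _, r := range replicas {
        wg.Add(1)
        go func(r *replica) {
            defer wg.Done()
            acks <- r.Put(key, value)
        }(r)
    }
    wg.Wait()
    close(acks)

    ok := 0
    for err := range acks {
        if err == nil {
            ok++
        }
    }
    if ok >= need {
        return nil
    }
    return fmt.Errorf("quorum not reached: %d/%d acks", ok, need)
}

func main() {
    nodes := []*replica{
        {name: "n1", data: map[string]string{}},
        {name: "n2", data: map[string]string{}},
        {name: "n3", data: map[string]string{}, down: true},
    }
    fmt.Println(writeQuorum(nodes, "user:1", "alice")) // <nil>: 2 of 3 acked
}

A production coordinator would return as soon as the quorum is reached rather than waiting for every replica, and would hand failed writes to hinted handoff.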

Trade-offs & Decisions

| Decision | Option A | Option B | Our Choice |
|----------|----------|----------|------------|
| Consistency Model | Strong (CP) | Eventual (AP) | Tunable (Both) |
| Partitioning | Range-based | Hash-based | Consistent Hash |
| Storage Engine | B-tree | LSM-tree | LSM-tree |
| Replication | Master-slave | Peer-to-peer | Peer-to-peer |
| Conflict Resolution | Last-write-wins | Vector clocks | Vector clocks |

Real-World Examples

Amazon DynamoDB

  • Eventually consistent by default
  • Consistent hashing with virtual nodes
  • Multi-master replication
  • Auto-scaling based on traffic

Apache Cassandra

  • Tunable consistency (ONE to ALL)
  • Gossip protocol for membership
  • LSM-tree storage engine
  • No single point of failure

Redis Cluster

  • Master-slave replication
  • 16384 hash slots
  • Synchronous replication option
  • In-memory with persistence
