Distributed Key-Value Store

Building a distributed key-value store requires understanding fundamental trade-offs in distributed systems. Let's design a system that can scale to handle billions of key-value pairs across hundreds of nodes.

Requirements

Our distributed key-value store must handle:

  • Small key-value pairs: Less than 10 KB each
  • Big data: Store and retrieve billions of items
  • High availability: Continue serving requests even when nodes fail
  • High scalability: Scale horizontally with traffic
  • Auto-scaling: Add/remove nodes automatically
  • Tunable consistency: Choose between strong and eventual consistency
  • Low latency: Sub-millisecond response times

CAP Theorem: The Fundamental Trade-off

The CAP theorem states that a distributed system can guarantee at most two of the following three properties:

  • Consistency (C): Every read sees the most recent write, so all nodes return the same data
  • Availability (A): Every request receives a response, even when some nodes have failed
  • Partition Tolerance (P): The system keeps operating despite network partitions between nodes

Since network partitions are unavoidable in real networks, partition tolerance is not optional; in practice the choice is between CP and AP:

CP Systems (Consistency + Partition Tolerance)

  • Example: HBase, MongoDB, Redis
  • Sacrifices availability for consistency
  • Better for: Financial transactions, inventory systems

AP Systems (Availability + Partition Tolerance)

  • Example: Cassandra, DynamoDB, CouchDB
  • Sacrifices consistency for availability
  • Better for: Social media, recommendations, analytics

Architecture Overview

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Client    │     │   Client    │     │   Client    │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │
       └───────────────────┴───────────────────┘
                    ┌──────▼──────┐
                    │ Load Balancer│
                    └──────┬──────┘
       ┌───────────────────┼───────────────────┐
       │                   │                   │
┌──────▼──────┐     ┌──────▼──────┐     ┌──────▼──────┐
│  API Server │     │  API Server │     │  API Server │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │
       └───────────────────┴───────────────────┘
                    ┌──────▼──────┐
                    │ Coordinator │
                    └──────┬──────┘
       ┌───────────────────┼───────────────────┐
       │                   │                   │
┌──────▼──────┐     ┌──────▼──────┐     ┌──────▼──────┐
│Storage Node1│     │Storage Node2│     │Storage Node3│
│  (Replica)  │     │  (Replica)  │     │  (Replica)  │
└─────────────┘     └─────────────┘     └─────────────┘

Core Components

1. Data Partitioning

We use consistent hashing to distribute data across nodes; a minimal sketch follows the list of benefits below.

Benefits:

  • Minimal data movement when nodes join/leave
  • Even distribution of keys
  • Each node manages ~1/N of total data
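
As a concrete illustration, here is a minimal hash-ring sketch in Go with virtual nodes. The names (ConsistentHash, NewRing, GetNode, hashKey) are assumptions made for this article, not a particular library's API:

import (
    "crypto/sha1"
    "encoding/binary"
    "fmt"
    "sort"
)

// ConsistentHash is an illustrative hash ring with virtual nodes.
type ConsistentHash struct {
    hashes []uint64          // sorted ring positions
    owners map[uint64]string // ring position -> physical node
}

// hashKey maps a string to a position on the ring.
func hashKey(s string) uint64 {
    sum := sha1.Sum([]byte(s))
    return binary.BigEndian.Uint64(sum[:8])
}

// NewRing places virtualPerNode points on the ring for every physical node,
// which evens out the distribution of keys.
func NewRing(nodes []string, virtualPerNode int) *ConsistentHash {
    ch := &ConsistentHash{owners: make(map[uint64]string)}
    for _, n := range nodes {
        for v := 0; v < virtualPerNode; v++ {
            h := hashKey(fmt.Sprintf("%s#%d", n, v))
            ch.hashes = append(ch.hashes, h)
            ch.owners[h] = n
        }
    }
    sort.Slice(ch.hashes, func(i, j int) bool { return ch.hashes[i] < ch.hashes[j] })
    return ch
}

// GetNode returns the first node clockwise from the key's ring position.
func (ch *ConsistentHash) GetNode(key string) string {
    h := hashKey(key)
    i := sort.Search(len(ch.hashes), func(i int) bool { return ch.hashes[i] >= h })
    if i == len(ch.hashes) {
        i = 0 // wrap around the ring
    }
    return ch.owners[ch.hashes[i]]
}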

2. Replication Strategy

For high availability, we replicate each key to N nodes; a simple placement sketch follows the list below.

Replication approaches:

  • Simple: Next N nodes on hash ring
  • Rack-aware: Different racks/data centers
  • Zone-aware: Different availability zones
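
Continuing the hash-ring sketch above, the simple strategy walks clockwise and collects the next N distinct physical nodes. GetNodes is again an assumed name, chosen to line up with the implementation example later in this article; rack- and zone-aware variants would additionally filter on node metadata:

// GetNodes walks clockwise from the key's ring position and collects the
// next n distinct physical nodes.
func (ch *ConsistentHash) GetNodes(key string, n int) []string {
    if len(ch.hashes) == 0 || n <= 0 {
        return nil
    }
    h := hashKey(key)
    start := sort.Search(len(ch.hashes), func(i int) bool { return ch.hashes[i] >= h })
    seen := make(map[string]bool)
    var replicas []string
    for i := 0; i < len(ch.hashes) && len(replicas) < n; i++ {
        node := ch.owners[ch.hashes[(start+i)%len(ch.hashes)]]
        if !seen[node] { // several virtual nodes can map to the same physical node
            seen[node] = true
            replicas = append(replicas, node)
        }
    }
    return replicas
}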

3. Consistency Levels

Different operations require different consistency guarantees:

Write Consistency Levels:

  • ONE: Write completes after one replica acknowledges
  • QUORUM: Write completes after majority acknowledge
  • ALL: Write completes after all replicas acknowledge

Read Consistency Levels:

  • ONE: Return after reading from one replica
  • QUORUM: Return most recent from majority
  • ALL: Return after reading from all replicas

Quorum Formula: With N replicas, write quorum W, and read quorum R, choosing R + W > N guarantees that every read overlaps at least one replica that saw the latest write (for example, N = 3, W = 2, R = 2).
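
As one way to picture a QUORUM write, the sketch below fans the write out to every replica and returns as soon as a majority acknowledge. It is illustrative only: writeReplica is a hypothetical per-node helper, and the equivalent method in the implementation example further down would hang off the KVStore type:

import (
    "fmt"
    "time"
)

// writeQuorum sends the write to all replicas concurrently and succeeds once
// W = N/2 + 1 of them acknowledge; writeReplica is a hypothetical helper
// that writes key/value to a single node and returns its error.
func writeQuorum(nodes []string, key, value string, timeout time.Duration) error {
    acks := make(chan error, len(nodes))
    for _, node := range nodes {
        go func(node string) { acks <- writeReplica(node, key, value) }(node)
    }

    needed := len(nodes)/2 + 1
    deadline := time.After(timeout)
    for got := 0; got < needed; {
        select {
        case err := <-acks:
            if err == nil {
                got++ // one more replica acknowledged the write
            }
        case <-deadline:
            return fmt.Errorf("quorum not reached: %d/%d acks", got, needed)
        }
    }
    return nil
}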

4. Versioning & Conflict Resolution

We use vector clocks to track causality:

Client A writes v1: [A:1]
Client B writes v2: [B:1]
Client A writes v3: [A:2, B:1]

Conflict detected: v1 and v2 are concurrent (neither clock descends from the other)
Resolution: last-write-wins (simple, but may silently drop an update) or an application-level merge
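
A vector clock can be modeled as a map from writer ID to counter; two versions conflict exactly when neither clock dominates the other. A small comparison sketch (type and function names are assumptions):

// VectorClock maps a writer ID (e.g. "A", "B") to its logical counter.
type VectorClock map[string]int

// Descends reports whether version a happened after (or equals) version b,
// i.e. every counter in b is <= the corresponding counter in a.
func (a VectorClock) Descends(b VectorClock) bool {
    for id, n := range b {
        if a[id] < n {
            return false
        }
    }
    return true
}

// Concurrent reports whether neither version descends from the other,
// which is exactly the conflict case that needs resolution.
func Concurrent(a, b VectorClock) bool {
    return !a.Descends(b) && !b.Descends(a)
}

With the history above, {A:1} and {B:1} are concurrent, while {A:2, B:1} descends from both and therefore supersedes them.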

5. Failure Handling

Techniques for high availability:

  1. Sloppy Quorum: Accept writes on the first N healthy nodes found on the ring, even if they are not the designated replicas
  2. Hinted Handoff: A stand-in node keeps the write plus a hint and forwards it once the intended replica recovers, as sketched below
  3. Read Repair: Detect stale replicas while serving a read and update them with the newest version
  4. Anti-entropy: Background process (often Merkle-tree based) that compares and synchronizes replicas
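
Hinted handoff, for instance, boils down to storing the write locally together with a hint naming the intended replica and replaying it later. A rough sketch, where Hint, isHealthy, and deliver are hypothetical:

import "time"

// Hint records a write accepted on behalf of an unreachable replica.
type Hint struct {
    TargetNode string // the replica that should ultimately own this write
    Key, Value string
    WrittenAt  time.Time
}

// replayHints retries delivery and keeps only the hints that still could not
// be handed off; isHealthy and deliver are hypothetical helpers.
func replayHints(hints []Hint) []Hint {
    var remaining []Hint
    for _, h := range hints {
        if isHealthy(h.TargetNode) && deliver(h) == nil {
            continue // handed off successfully, drop the hint
        }
        remaining = append(remaining, h)
    }
    return remaining
}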

Storage Engine

LSM-Tree (Log-Structured Merge Tree)

Most distributed KV stores use LSM-trees for write optimization:

┌─────────────┐
│  MemTable   │ ← In-memory writes (sorted)
└──────┬──────┘
       │ Flush when full
┌──────▼──────┐
│   SSTable   │ ← Immutable on-disk files
│  (Level 0)  │
└──────┬──────┘
       │ Compact & merge
┌──────▼──────┐
│   SSTable   │
│  (Level 1)  │
└─────────────┘

Write Path:

  1. Write to commit log (durability)
  2. Write to MemTable (fast access)
  3. Flush to SSTable when full
  4. Background compaction

Read Path:

  1. Check MemTable
  2. Check Bloom filters
  3. Binary search in SSTables
  4. Return value or not found
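
A toy version of both paths, with a map-backed MemTable and SSTables searched newest-first. CommitLog, SSTable, flushMemtable, and the flush threshold are simplified assumptions, not a real engine's API:

// LSMEngine is an illustrative storage engine: an in-memory MemTable plus
// immutable SSTables ordered newest-first.
type LSMEngine struct {
    wal      *CommitLog        // append-only commit log for durability (hypothetical type)
    memtable map[string]string // kept sorted in a real engine; a plain map here
    sstables []*SSTable        // index 0 = most recently flushed (hypothetical type)
}

const maxMemtableEntries = 1 << 20 // illustrative flush threshold

func (e *LSMEngine) Put(key, value string) error {
    if err := e.wal.Append(key, value); err != nil { // 1. commit log first, for durability
        return err
    }
    e.memtable[key] = value // 2. fast in-memory write
    if len(e.memtable) > maxMemtableEntries {
        e.flushMemtable() // 3. flush to a new SSTable; compaction runs in the background
    }
    return nil
}

func (e *LSMEngine) Get(key string) (string, bool) {
    if v, ok := e.memtable[key]; ok { // 1. newest data lives in the MemTable
        return v, true
    }
    for _, t := range e.sstables { // 2. newest SSTable first
        if !t.Bloom.MightContain(key) {
            continue // 3. Bloom filter rules this file out without touching disk
        }
        if v, ok := t.Lookup(key); ok { // 4. binary search within the file
            return v, true
        }
    }
    return "", false // not found anywhere
}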

Auto-scaling

Monitor metrics and scale automatically:

func autoScale(metrics Metrics, nodeCount, minNodes int) {
    // CPU is a percentage; the latency threshold is assumed to be milliseconds.
    if metrics.CPUUsage > 80 || metrics.LatencyP99 > 100 {
        addNode() // scale out under CPU or latency pressure
    } else if metrics.CPUUsage < 20 && nodeCount > minNodes {
        removeNode() // scale in, but never below the minimum cluster size
    }
}

Scaling triggers:

  • CPU usage > 80%
  • Memory usage > 85%
  • P99 latency > threshold
  • Disk usage > 80%

Performance Optimizations

1. Bloom Filters

Skip unnecessary disk reads:

// A Bloom filter never gives false negatives, so "not present" means the key
// is definitely absent and the disk read can be skipped.
if !bloomFilter.MightContain(key) {
    return nil, NotFound
}

2. Compression

Reduce storage and network I/O:

  • Snappy: very fast with moderate compression ratios
  • Zstandard: higher compression ratios at a modest CPU cost
  • LZ4: extremely fast compression and decompression
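
For example, blocks can be compressed just before they are written to disk and decompressed on read. The sketch below uses the github.com/golang/snappy package as one possible codec:

import "github.com/golang/snappy"

// compressBlock is applied to an SSTable block right before it hits disk.
func compressBlock(block []byte) []byte {
    return snappy.Encode(nil, block)
}

// decompressBlock is applied when a block is read back from disk.
func decompressBlock(compressed []byte) ([]byte, error) {
    return snappy.Decode(nil, compressed)
}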

3. Caching

Multi-level caching strategy:

  • Row cache: Full key-value pairs
  • Block cache: SSTable blocks
  • Key cache: Key → disk location
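
A block cache, for instance, can be a small LRU keyed by a block identifier such as "file:offset". The sketch below uses only the standard library; the blockCache type and its capacity handling are assumptions:

import "container/list"

// blockCache is a minimal LRU cache for SSTable blocks.
type blockCache struct {
    capacity int
    order    *list.List               // front = most recently used
    items    map[string]*list.Element // block ID -> list element
}

type cacheEntry struct {
    key   string
    block []byte
}

func newBlockCache(capacity int) *blockCache {
    return &blockCache{
        capacity: capacity,
        order:    list.New(),
        items:    make(map[string]*list.Element),
    }
}

// Get returns a cached block and marks it as recently used.
func (c *blockCache) Get(key string) ([]byte, bool) {
    if el, ok := c.items[key]; ok {
        c.order.MoveToFront(el)
        return el.Value.(*cacheEntry).block, true
    }
    return nil, false
}

// Put inserts or refreshes a block, evicting the least recently used entry
// once the cache is over capacity.
func (c *blockCache) Put(key string, block []byte) {
    if el, ok := c.items[key]; ok {
        el.Value.(*cacheEntry).block = block
        c.order.MoveToFront(el)
        return
    }
    c.items[key] = c.order.PushFront(&cacheEntry{key: key, block: block})
    if c.order.Len() > c.capacity {
        oldest := c.order.Back()
        c.order.Remove(oldest)
        delete(c.items, oldest.Value.(*cacheEntry).key)
    }
}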

4. Batch Operations

Reduce network round trips:

batch := NewBatch()
batch.Put("key1", "value1")
batch.Put("key2", "value2")
batch.Delete("key3")
db.Write(batch)

Monitoring & Operations

Key Metrics

  • Latency: P50, P95, P99
  • Throughput: Reads/writes per second
  • Availability: Uptime percentage
  • Consistency: Divergence between replicas
  • Storage: Disk usage, compaction stats

Health Checks

func healthCheck() HealthStatus {
    return HealthStatus{
        Replication: checkReplicationLag(),
        Storage:     checkDiskSpace(),
        Network:     checkNodeConnectivity(),
        Load:        checkLoadDistribution(),
    }
}

Implementation Example

type Level int

const (
    ONE Level = iota
    QUORUM
    ALL
)

type KVStore struct {
    ring              *ConsistentHash
    storage           StorageEngine
    replicator        Replicator
    coordinator       Coordinator
    replicationFactor int
}

func (kv *KVStore) Put(key, value string, consistency Level) error {
    // Find the replica nodes for this key
    nodes := kv.ring.GetNodes(key, kv.replicationFactor)

    // Write based on the requested consistency level
    switch consistency {
    case ONE:
        return kv.writeOne(nodes, key, value)
    case QUORUM:
        return kv.writeQuorum(nodes, key, value)
    case ALL:
        return kv.writeAll(nodes, key, value)
    default:
        return fmt.Errorf("unknown consistency level: %d", consistency)
    }
}

func (kv *KVStore) Get(key string, consistency Level) (string, error) {
    nodes := kv.ring.GetNodes(key, kv.replicationFactor)

    switch consistency {
    case ONE:
        return kv.readOne(nodes, key)
    case QUORUM:
        return kv.readQuorum(nodes, key)
    case ALL:
        return kv.readAll(nodes, key)
    default:
        return "", fmt.Errorf("unknown consistency level: %d", consistency)
    }
}

Trade-offs & Decisions

| Decision | Option A | Option B | Our Choice |
|----------|----------|----------|------------|
| Consistency Model | Strong (CP) | Eventual (AP) | Tunable (Both) |
| Partitioning | Range-based | Hash-based | Consistent Hash |
| Storage Engine | B-tree | LSM-tree | LSM-tree |
| Replication | Master-slave | Peer-to-peer | Peer-to-peer |
| Conflict Resolution | Last-write-wins | Vector clocks | Vector clocks |

Real-World Examples

Amazon DynamoDB

  • Eventually consistent by default
  • Consistent hashing with virtual nodes
  • Multi-master replication
  • Auto-scaling based on traffic

Apache Cassandra

  • Tunable consistency (ONE to ALL)
  • Gossip protocol for membership
  • LSM-tree storage engine
  • No single point of failure

Redis Cluster

  • Master-slave replication
  • 16384 hash slots
  • Asynchronous replication by default (the WAIT command can require replica acknowledgements)
  • In-memory with persistence
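
The slot mapping itself is small enough to sketch: Redis Cluster assigns a key to slot CRC16(key) mod 16384 using the CRC-16/XMODEM polynomial (hash-tag handling is omitted here for simplicity):

// crc16 is a bit-by-bit CRC-16/XMODEM (polynomial 0x1021, initial value 0).
func crc16(data []byte) uint16 {
    var crc uint16
    for _, b := range data {
        crc ^= uint16(b) << 8
        for i := 0; i < 8; i++ {
            if crc&0x8000 != 0 {
                crc = crc<<1 ^ 0x1021
            } else {
                crc <<= 1
            }
        }
    }
    return crc
}

// hashSlot maps a key to one of Redis Cluster's 16384 slots.
func hashSlot(key string) uint16 {
    return crc16([]byte(key)) % 16384
}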
