System Design Patterns

Step 5 of 6: E - Edge Cases & Failure Handling

Handle system failures, data inconsistencies, and unexpected scenarios

Critical Failure Scenarios

🚨 Network Partitions

• Cross-region communication failure
• DynamoDB Global Table sync delays
• Kafka cluster split-brain scenarios
• Cache invalidation inconsistencies

⚡ Cascading Failures

• Celebrity post overwhelming system
• Cache stampede during peak hours
• Database connection pool exhaustion
• Feed generation service overload
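
One way to keep connection pool exhaustion from cascading is to fail fast once the pool is full instead of letting requests queue indefinitely. The sketch below is a minimal illustration under that assumption; `ConnectionLimiter` and its method names are hypothetical and not part of any database driver.

```javascript
// Hypothetical fail-fast limiter placed in front of a database connection pool.
class ConnectionLimiter {
  constructor(maxConnections) {
    this.maxConnections = maxConnections;
    this.inUse = 0;
  }

  // Returns true if a slot was reserved, false if the caller should shed load.
  tryAcquire() {
    if (this.inUse >= this.maxConnections) return false;
    this.inUse += 1;
    return true;
  }

  // Frees a previously reserved slot; never drops below zero.
  release() {
    if (this.inUse > 0) this.inUse -= 1;
  }

  // Fraction of the pool currently in use, suitable for exporting as a metric.
  utilization() {
    return this.inUse / this.maxConnections;
  }
}
```

A caller that gets `false` from `tryAcquire()` can return a cached or degraded response immediately, which keeps one slow dependency from tying up every request thread.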

🔐 Security Breaches

• Compromised user credentials
• API rate limiting bypass attempts
• SQL injection on follower queries
• Unauthorized feed access
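
Rate limiting bypass attempts often rely on rotating IP addresses or API keys, so one common mitigation is to key the limiter on the authenticated user. The following is a minimal token-bucket sketch under that assumption; the class name, capacity of 5, and refill rate of 1 token per second are illustrative values, not figures from this design.

```javascript
// Hypothetical token-bucket limiter; the clock is injectable so it can be tested.
class TokenBucket {
  constructor(capacity, refillPerSecond, now = () => Date.now()) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.now = now;
    this.tokens = capacity;
    this.lastRefill = now();
  }

  // Returns true if the request may proceed, false if it should be rejected.
  tryConsume() {
    const current = this.now();
    const elapsedSeconds = (current - this.lastRefill) / 1000;
    // Refill in proportion to elapsed time, capped at the bucket capacity.
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSeconds * this.refillPerSecond
    );
    this.lastRefill = current;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

// One bucket per authenticated user ID (assumed limits: burst of 5, 1 request/second).
const buckets = new Map();
function allowRequest(userId, now = () => Date.now()) {
  if (!buckets.has(userId)) buckets.set(userId, new TokenBucket(5, 1, now));
  return buckets.get(userId).tryConsume();
}
```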

💾 Data Corruption

• Inconsistent follower counts
• Duplicate posts in feed
• Missing post content
• Stale cache serving old data
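
Duplicate posts typically appear when a feed is assembled from more than one source, for example precomputed entries merged with posts pulled at read time. A minimal sketch of read-time deduplication follows, assuming each post carries a unique `id` field (an assumption; the schema is not shown in this step).

```javascript
// Hypothetical helper: remove repeated posts, keeping the first occurrence of each id.
function dedupeFeed(posts) {
  const seen = new Set();
  const result = [];
  for (const post of posts) {
    if (seen.has(post.id)) continue; // already emitted this post
    seen.add(post.id);
    result.push(post);
  }
  return result;
}
```

Because the first occurrence wins, the relative order of the remaining posts is preserved.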

Failure Analysis & Solutions

🌪️ Celebrity Post Storm

🚨 Problem Scenario:

Taylor Swift posts: "New album out now! 🎵"

• 200M followers get notified instantly

• 50M simultaneous likes in first minute

• Feed Generation Service: 💥 OVERLOADED

✅ Solution:

// Hybrid fanout: pull-only for celebrity accounts, rate-limited push otherwise
const CELEBRITY_FOLLOWER_THRESHOLD = 10_000_000; // 10M followers

class CelebrityPostHandler {
  async handleViralPost(post, user) {
    if (user.followerCount > CELEBRITY_FOLLOWER_THRESHOLD) {
      // Skip fanout, serve from cache only
      await this.cachePostGlobally(post);
      return "PULL_ONLY_MODE";
    }

    // Rate limit fanout processing: batches of 10,000 followers
    const batches = this.chunkFollowers(user.followers, 10000);

    for (const batch of batches) {
      await this.rateLimitedFanout(batch, post);
      await this.sleep(100); // 100ms delay between batches
    }
  }
}
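
The handler above decides how a post is fanned out, but it does not stop calling a dependency that is already failing, which is the job of a circuit breaker. The sketch below is a minimal synchronous version for illustration; `CircuitBreaker`, its thresholds, and the fallback convention are assumptions rather than part of the original design, and a production version would wrap async calls.

```javascript
// Hypothetical circuit breaker with CLOSED, OPEN, and HALF_OPEN states.
class CircuitBreaker {
  constructor({ failureThreshold = 5, resetTimeoutMs = 30000, now = () => Date.now() } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;
    this.state = 'CLOSED';
    this.failures = 0;
    this.openedAt = 0;
  }

  // Runs fn unless the breaker is open; fallback() supplies the degraded result.
  call(fn, fallback) {
    if (this.state === 'OPEN') {
      // While the reset timeout has not elapsed, do not touch the dependency.
      if (this.now() - this.openedAt < this.resetTimeoutMs) return fallback();
      this.state = 'HALF_OPEN'; // allow a single trial call
    }
    try {
      const result = fn();
      this.state = 'CLOSED';
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures += 1;
      // A failed trial call, or too many consecutive failures, opens the breaker.
      if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
        this.state = 'OPEN';
        this.openedAt = this.now();
      }
      return fallback();
    }
  }
}
```

Wrapping the feed generation call in a breaker like this would let the service fall back to cached feeds while the overloaded component recovers.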

🌐 Cross-Region Network Partition

🚨 Problem:

• US-East can't reach EU-West region
• DynamoDB Global Tables sync broken
• Users see stale feeds for hours
• New posts not propagating globally

✅ Solution:

// Eventual Consistency Strategy
class PartitionHandler {
  async handlePartition(newPosts = []) {
    // 1. Detect partition; nothing to do if the remote region is reachable
    if (await this.canReachRegion('eu-west')) return;
    this.enterPartitionMode();

    // 2. Serve from local cache
    this.useLocalCacheOnly = true;

    // 3. Queue writes for later sync
    this.queuedWrites.push(...newPosts);

    // 4. Show staleness warning
    this.showStaleDataBanner = true;
  }

  async onPartitionHealed() {
    await this.syncQueuedWrites();
    await this.invalidateStaleCache();
    this.useLocalCacheOnly = false;
    this.showStaleDataBanner = false;
  }
}
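
When the partition heals, both regions may hold queued writes for the same post, so the sync step needs a conflict rule. The sketch below assumes a last-writer-wins policy, which is also how DynamoDB global tables reconcile concurrent updates; the function name and the `postId` and `updatedAt` fields are hypothetical.

```javascript
// Hypothetical merge step that syncQueuedWrites() could run after a partition heals.
function mergeQueuedWrites(localWrites, remoteWrites) {
  const byId = new Map();
  for (const write of [...localWrites, ...remoteWrites]) {
    const existing = byId.get(write.postId);
    // Last-writer-wins: keep the version with the newest timestamp.
    if (!existing || write.updatedAt > existing.updatedAt) {
      byId.set(write.postId, write);
    }
  }
  // Replay in timestamp order so feeds receive posts in the order they were written.
  return [...byId.values()].sort((a, b) => a.updatedAt - b.updatedAt);
}
```

Last-writer-wins depends on reasonably synchronized clocks across regions, and it silently discards the older of two conflicting edits, which is usually acceptable for feed posts but not for counters.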

🏃‍♂️💨 Cache Stampede During Peak Hours

🚨 Timeline:

09:00 AM: Cache expires for popular feeds

09:01 AM: 100K requests hit database

09:02 AM: Database connections exhausted

09:03 AM: 💥 Complete system failure

✅ Prevention:

// Lock-Based Cache Refresh (TTL arguments are assumed to be in seconds)
class SmartCache {
  async getFeed(userId) {
    const cached = await this.cache.get(userId);
    if (cached) return cached;

    // Only one caller rebuilds the feed; the lock expires after 30s
    // so a crashed rebuilder cannot block refreshes forever
    const lockKey = `rebuild:${userId}`;
    const lock = await this.acquireLock(lockKey, 30);

    if (!lock) {
      // Another caller holds the lock: return stale data while it rebuilds
      return await this.getStaleCache(userId);
    }

    try {
      const feed = await this.buildFeed(userId);
      await this.cache.set(userId, feed, 3600); // cache for 1 hour
      return feed;
    } finally {
      await this.releaseLock(lockKey);
    }
  }
}
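
The timeline above starts with many popular feeds expiring at the same moment. A complementary technique to locking is to add random jitter to each TTL so that expirations spread out over time. This is a minimal sketch, assuming a jitter of plus or minus 10 percent; the function name and default are illustrative.

```javascript
// Hypothetical helper: spread cache expirations by randomizing each TTL.
// The random source is injectable so the behavior can be tested deterministically.
function jitteredTtl(baseTtlSeconds, jitterFraction = 0.1, random = Math.random) {
  const spread = baseTtlSeconds * jitterFraction;
  // random() is in [0, 1), so the offset falls in [-spread, +spread).
  const offset = (random() * 2 - 1) * spread;
  return Math.round(baseTtlSeconds + offset);
}
```

With a base TTL of 3600 seconds and the default jitter, each entry expires somewhere between 3240 and 3960 seconds after it was written, so a burst of writes no longer produces a burst of simultaneous misses.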

Disaster Recovery Plan

🔥 RTO: 15 minutes

Recovery Time Objective for core feed serving

💾 RPO: 1 minute

Recovery Point Objective for data loss

🌍 Multi-Region

Automatic failover across 6 regions

🚀 Recovery Procedures:

# Disaster Recovery Runbook

## Region Failure (Complete Data Center Loss)
1. DNS failover to nearest region (automated, 2min)
2. Scale up backup region capacity (5min)
3. Restore DynamoDB from point-in-time backup (8min)
4. Validate data integrity and resume traffic

## Database Corruption
1. Stop all write operations immediately
2. Restore from latest backup (RPO: 1 minute)
3. Replay transaction logs to current state
4. Comprehensive data validation before resuming

## Security Breach
1. Revoke all API keys and access tokens
2. Force password reset for affected users
3. Audit logs for data exfiltration attempts
4. Implement additional security measures
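
It can help to check the region failure runbook against the 15 minute RTO. The sketch below simply adds up the step durations stated in the runbook, assuming the steps run sequentially; the data structure and function name are hypothetical.

```javascript
// Step durations copied from the Region Failure runbook; null means no budget is stated.
const RTO_MINUTES = 15;
const regionFailureSteps = [
  { name: 'DNS failover to nearest region', minutes: 2 },
  { name: 'Scale up backup region capacity', minutes: 5 },
  { name: 'Restore DynamoDB from point-in-time backup', minutes: 8 },
  { name: 'Validate data integrity and resume traffic', minutes: null },
];

// Sums the budgeted minutes and lists the steps that have no budget at all.
function budgetReport(steps, rtoMinutes) {
  const budgeted = steps.reduce((sum, step) => sum + (step.minutes ?? 0), 0);
  const unbudgeted = steps
    .filter((step) => step.minutes === null)
    .map((step) => step.name);
  return { budgeted, remaining: rtoMinutes - budgeted, unbudgeted };
}

const report = budgetReport(regionFailureSteps, RTO_MINUTES);
// report.budgeted is 15 and report.remaining is 0
```

Under the sequential assumption, steps 1 to 3 already consume the full 15 minutes, so validation in step 4 either has to overlap with earlier steps or the earlier budgets need to shrink.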

Monitoring & Early Warning System

📊 Key Metrics

• Feed Generation Latency: P95 < 100ms
• Cache Hit Rate: > 85%
• Database Connection Pool: < 80% usage
• Error Rate: < 0.1%
• Celebrity Post Processing: < 10min fanout
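
For reference, a P95 target means that 95 percent of requests complete at or below the stated latency. The sketch below computes a percentile with the nearest-rank method, which is one of several common definitions; monitoring systems may interpolate or use histograms instead.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  // Integer-friendly form of ceil(p / 100 * n) to avoid floating point drift.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}
```

For example, with 20 latency samples the P95 is the 19th smallest value, so a single slow outlier does not move it but two do.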

🚨 Alert Thresholds

• P95 Latency > 200ms: Page on-call engineer
• Error rate > 1%: Immediate escalation
• Cache hit < 70%: Investigate performance
• Cross-region sync lag > 5min: Check network
• Viral post detected: Enable celebrity mode
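
The numeric thresholds above can be expressed as data and evaluated against a metrics snapshot. This is a minimal sketch; the metric names and the snapshot shape are assumptions, and the viral post trigger is left out because it is event-driven rather than a threshold.

```javascript
// Alert rules taken from the thresholds above; metric keys are hypothetical.
const alertRules = [
  { metric: 'p95LatencyMs', breached: (v) => v > 200, action: 'Page on-call engineer' },
  { metric: 'errorRate', breached: (v) => v > 0.01, action: 'Immediate escalation' },
  { metric: 'cacheHitRate', breached: (v) => v < 0.7, action: 'Investigate performance' },
  { metric: 'crossRegionSyncLagSec', breached: (v) => v > 300, action: 'Check network' },
];

// Returns the actions for every rule whose metric is present and out of bounds.
function evaluateAlerts(metrics, rules = alertRules) {
  return rules
    .filter((rule) => rule.metric in metrics && rule.breached(metrics[rule.metric]))
    .map((rule) => rule.action);
}
```

Note that the alert thresholds are looser than the targets (for example, a 200ms page versus a 100ms P95 goal), which leaves a band where the system is off target without waking anyone.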

Chaos Engineering & Testing

🔬 Chaos Experiments

• Random service failures during peak hours
• Network latency injection between regions
• Database connection drops
• Cache cluster node failures
• Celebrity post load testing
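
Experiments such as random service failures can be prototyped in-process before reaching for dedicated chaos tooling. The sketch below wraps a function so that a configurable fraction of calls fail; `withChaos` and its options are hypothetical names, and the random source is injectable so tests stay deterministic.

```javascript
// Hypothetical fault-injection wrapper: fails a fraction of calls on purpose.
function withChaos(fn, { failureRate = 0.1, random = Math.random } = {}) {
  return (...args) => {
    if (random() < failureRate) {
      throw new Error('chaos: injected failure');
    }
    return fn(...args);
  };
}
```

Wrapping a cache or database client this way in a staging environment is a cheap check that fallbacks and retries actually engage when a dependency fails.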

🧪 Disaster Drills

• Monthly region failover exercises
• Quarterly data recovery simulations
• Security breach response training
• Load testing with 10x normal traffic
• End-to-end backup restoration

🎯 Coming Up Next: Deep Dives

In the final step, we'll dive deep into implementation details:

🛠️ DynamoDB

Schema design, GSI optimization, and cost analysis

⚡ Feed Algorithm

Ranking models, personalization, and ML integration

🔧 Technology Stack

Service architecture, deployment, and monitoring tools
