Distributed Unique ID Generator

Designing a distributed ID generator that can produce unique, ordered identifiers at massive scale. Think Twitter's Snowflake, Instagram's ID generation, or Discord's unique message IDs.

System Requirements

Functional Requirements

  • Uniqueness: IDs must be globally unique across all servers
  • Numerical: IDs are numerical values only (primary requirement)
  • 64-bit constraint: Must fit in a 64-bit integer (long/bigint)
  • Time-ordered: IDs generated later have larger values
  • High throughput: Generate 10,000+ unique IDs per second
  • Low latency: ID generation should be fast (~1ms)

Why Not Use Traditional Approaches?

❌ Auto-increment Database ID

  • Single point of failure
  • Doesn't scale horizontally
  • Database bottleneck at high throughput
  • Hard to shard

❌ UUID/GUID

  • 128-bit (doesn't fit 64-bit requirement)
  • Not time-ordered (UUID v4)
  • Poor index performance
  • Not sequential

❌ Multi-Master Replication

  • ID collision risk across masters
  • Complex conflict resolution
  • Network partition issues
  • Synchronization overhead

🔍 Why Multi-Master Fails at Scale

Scenario: Two masters generating auto-increment IDs independently:

Master A: 1001, 1002, 1003...
Master B: 1001, 1002, 1003...
Result: Duplicate IDs! 💥

Common "Solutions" and Their Limitations:

• Increment by K (K = number of servers):

Master A: 1, 4, 7, 10... (start=1, increment=3)
Master B: 2, 5, 8, 11... (start=2, increment=3)
Master C: 3, 6, 9, 12... (start=3, increment=3)
✓ No collisions, but ✗ Adding/removing servers is complex
✗ IDs reveal infrastructure (3→6→9 shows 3 servers)
✗ Gaps grow with more servers (100 servers = increment by 100)

• ID ranges: Master A gets 1 to 1M, Master B gets 1M+1 to 2M (range exhaustion, rebalancing needed)

• Coordination: Masters sync before generating (network latency kills throughput)

• Central allocator: Request ID blocks from a central service (single point of failure)

The Real Problem: While increment-by-K prevents collisions, it doesn't meet our requirement of time-ordered IDs. IDs from different servers interleave randomly:

Time order: A₁(t=1), B₁(t=2), A₂(t=3), C₁(t=4), B₂(t=5)
Actual IDs: 1, 2, 4, 3, 5
Result: IDs are NOT time-ordered! 🔀
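The interleaving above can be reproduced with a small simulation. This is a sketch: the three masters, their offsets, and the arrival order all follow the example directly.

```python
def increment_by_k_demo():
    """Simulate 3 masters using start=i, increment=3 and show that
    IDs taken in arrival order are not monotonically increasing."""
    counters = {"A": 1, "B": 2, "C": 3}  # per-master start offsets

    def next_id(master):
        value = counters[master]
        counters[master] += 3  # increment by K=3
        return value

    # Arrival order from the example: A₁, B₁, A₂, C₁, B₂
    arrivals = ["A", "B", "A", "C", "B"]
    return [next_id(m) for m in arrivals]

ids = increment_by_k_demo()
print(ids)                 # [1, 2, 4, 3, 5]
print(ids == sorted(ids))  # False -> not time-ordered
```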

Approach 1: Twitter Snowflake

❄️ Snowflake Algorithm (Most Popular)

Twitter's Snowflake packs multiple components into a 64-bit integer to guarantee uniqueness and ordering.

Snowflake Bit Layout

64-bit ID Structure:
┌─────────────────────────────────────────────────────────────────┐
│ 0 │   41 bits timestamp   │ 10 bits machine │ 12 bits sequence │
└─────────────────────────────────────────────────────────────────┘
  ↑            ↑                      ↑                ↑
Sign    Milliseconds since     Data center +    Sequence counter
bit     custom epoch           Worker ID         (per millisecond)

Example ID: 419430400000135169 (datacenter 1, worker 1, sequence 1, ~March 2023 with a 2020 epoch)
Binary:     0|00001011101001000011101101110100000000000|00001|00001|000000000001
            ↑ └────────────────41 bits────────────────┘└─5──┘└─5──┘└────12────┘
         Sign        timestamp (ms since epoch)      datacenter worker sequence

Component Breakdown

| Component  | Bits    | Range         | Purpose                                     |
|------------|---------|---------------|---------------------------------------------|
| Sign bit   | 1 bit   | Always 0      | Keep number positive                        |
| Timestamp  | 41 bits | ~69 years     | Milliseconds since epoch (ensures ordering) |
| Machine ID | 10 bits | 1024 machines | Unique server identifier                    |
| Sequence   | 12 bits | 4096/ms       | Counter for IDs in same millisecond         |

Implementation

import time
import threading

class SnowflakeIDGenerator:
    def __init__(self, datacenter_id, worker_id):
        # Bit allocation
        self.TIMESTAMP_BITS = 41
        self.DATACENTER_BITS = 5
        self.WORKER_BITS = 5
        self.SEQUENCE_BITS = 12
        
        # Max values
        self.MAX_DATACENTER_ID = (1 << self.DATACENTER_BITS) - 1  # 31
        self.MAX_WORKER_ID = (1 << self.WORKER_BITS) - 1          # 31
        self.MAX_SEQUENCE = (1 << self.SEQUENCE_BITS) - 1         # 4095
        
        # Bit shift amounts
        self.WORKER_SHIFT = self.SEQUENCE_BITS                    # 12
        self.DATACENTER_SHIFT = self.SEQUENCE_BITS + self.WORKER_BITS  # 17
        self.TIMESTAMP_SHIFT = self.DATACENTER_SHIFT + self.DATACENTER_BITS  # 22
        
        # Custom epoch (January 1, 2020)
        self.EPOCH = 1577836800000
        
        # Instance configuration
        if datacenter_id > self.MAX_DATACENTER_ID or datacenter_id < 0:
            raise ValueError(f"Datacenter ID must be between 0 and {self.MAX_DATACENTER_ID}")
        if worker_id > self.MAX_WORKER_ID or worker_id < 0:
            raise ValueError(f"Worker ID must be between 0 and {self.MAX_WORKER_ID}")
            
        self.datacenter_id = datacenter_id
        self.worker_id = worker_id
        self.sequence = 0
        self.last_timestamp = -1
        self.lock = threading.Lock()
    
    def _current_millis(self):
        return int(time.time() * 1000)
    
    def _wait_next_millis(self, last_timestamp):
        timestamp = self._current_millis()
        while timestamp <= last_timestamp:
            timestamp = self._current_millis()
        return timestamp
    
    def generate(self):
        with self.lock:
            timestamp = self._current_millis()
            
            # Clock moved backwards - refuse to generate
            if timestamp < self.last_timestamp:
                raise RuntimeError(f"Clock moved backwards. Refusing to generate IDs for {self.last_timestamp - timestamp} ms")
            
            # Same millisecond - increment sequence
            if timestamp == self.last_timestamp:
                self.sequence = (self.sequence + 1) & self.MAX_SEQUENCE
                # Sequence overflow - wait for next millisecond
                if self.sequence == 0:
                    timestamp = self._wait_next_millis(self.last_timestamp)
            else:
                # New millisecond - reset sequence
                self.sequence = 0
            
            self.last_timestamp = timestamp
            
            # Generate ID by combining all components
            id_value = ((timestamp - self.EPOCH) << self.TIMESTAMP_SHIFT) | \
                       (self.datacenter_id << self.DATACENTER_SHIFT) | \
                       (self.worker_id << self.WORKER_SHIFT) | \
                       self.sequence
            
            return id_value

# Usage
generator = SnowflakeIDGenerator(datacenter_id=1, worker_id=1)
for _ in range(5):
    print(generator.generate())
# Sample output (actual values depend on the current time; five increasing IDs):
# 412376492842516481
# 412376492842516482
# 412376492842516483
# ...
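Because the bit layout is fixed, any Snowflake ID can be decoded back into its components with shifts and masks. A minimal decoder for the 5/5/12 layout and 2020 epoch used above:

```python
EPOCH = 1577836800000  # 2020-01-01 in ms, the same custom epoch as above

def decode_snowflake(id_value):
    """Split a 64-bit Snowflake ID (5/5/12 layout) back into components."""
    sequence = id_value & 0xFFF                # low 12 bits
    worker_id = (id_value >> 12) & 0x1F        # next 5 bits
    datacenter_id = (id_value >> 17) & 0x1F    # next 5 bits
    timestamp_ms = (id_value >> 22) + EPOCH    # high 41 bits, plus epoch
    return timestamp_ms, datacenter_id, worker_id, sequence

# Round-trip an ID built from known components
ts = EPOCH + 100_000_000_000                   # ~March 2023
id_value = ((ts - EPOCH) << 22) | (1 << 17) | (1 << 12) | 7
print(decode_snowflake(id_value))              # (1677836800000, 1, 1, 7)
```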

Snowflake Analysis

⚠️ The 69-Year Limit: What Happens After?

Question: What happens when the 41-bit timestamp overflows after ~69 years?

The Problem: Simply resetting to a new epoch would cause ID collisions - the same IDs would be generated again!

2020 epoch:       t=1ms in, machine=1, seq=0 → ID X
2089 epoch reset: t=1ms in, machine=1, seq=0 → same ID X 💥 Collision!

Real-World Solutions:

  • Most common: Migrate to a new scheme before the epoch runs out (most systems are redesigned long before 69 years pass)
  • Version prefix: Add epoch version outside 64-bit (v1_id, v2_id)
  • Reserve bits: Use 2 bits for epoch version (4 epochs × 17 years each)

Reality: Twitter (2010 epoch) has until 2079. They'll likely migrate to a completely different system decades before then.
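The ~69-year and ~17-year figures fall directly out of the timestamp bit widths:

```python
MS_PER_YEAR = 1000 * 60 * 60 * 24 * 365.25

years_41_bits = (2 ** 41) / MS_PER_YEAR  # full 41-bit timestamp
years_39_bits = (2 ** 39) / MS_PER_YEAR  # after reserving 2 bits for an epoch version

print(round(years_41_bits, 1))  # 69.7 years
print(round(years_39_bits, 1))  # 17.4 years per epoch version
```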

✅ Advantages

  • Time-ordered (sortable)
  • Highly scalable (1024 machines)
  • No network calls needed
  • 4096 IDs per millisecond per machine
  • Works for 69 years from epoch

⚠️ Considerations

  • Requires machine ID coordination
  • Clock synchronization critical
  • Clock rollback handling needed
  • ID reveals timestamp (privacy)

Approach 2: Database Ticket Server

🎫 Centralized Ticket Server

Flickr's approach: Use dedicated MySQL servers to generate IDs with auto-increment.

Architecture

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│ App Server 1 │     │ App Server 2 │     │ App Server 3 │
└──────┬───────┘     └──────┬───────┘     └──────┬───────┘
       │                    │                    │
       └────────────────────┼────────────────────┘
                    ┌──────▼──────┐
                    │Load Balancer│
                    └──────┬──────┘
            ┌──────────────┼──────────────┐
            │                             │
    ┌───────▼────────┐          ┌────────▼───────┐
    │ Ticket Server 1│          │ Ticket Server 2│
    │  (Odd IDs)     │          │  (Even IDs)    │
    │  AUTO_INCREMENT│          │  AUTO_INCREMENT│
    │  increment = 2 │          │  increment = 2 │
    └────────────────┘          └────────────────┘

MySQL Configuration

-- Ticket Server 1 (generates odd IDs)
CREATE TABLE tickets64 (
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    stub CHAR(1) NOT NULL DEFAULT '',
    PRIMARY KEY (id),
    UNIQUE KEY (stub)
);

ALTER TABLE tickets64 AUTO_INCREMENT = 1;
SET @@auto_increment_increment = 2;
SET @@auto_increment_offset = 1;

-- Ticket Server 2 (generates even IDs; same CREATE TABLE as Server 1)
ALTER TABLE tickets64 AUTO_INCREMENT = 2;
SET @@auto_increment_increment = 2;
SET @@auto_increment_offset = 2;

-- To get a new ID:
REPLACE INTO tickets64 (stub) VALUES ('a');
SELECT LAST_INSERT_ID();
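The odd/even allocation can be simulated without MySQL. This is a sketch of the offset/increment behavior only; real ticket servers would run the SQL above.

```python
class TicketServer:
    """Simulates one MySQL ticket server configured with
    auto_increment_offset and auto_increment_increment."""
    def __init__(self, offset, increment):
        self.next_value = offset
        self.increment = increment

    def get_id(self):
        value = self.next_value
        self.next_value += self.increment
        return value

server1 = TicketServer(offset=1, increment=2)  # odd IDs
server2 = TicketServer(offset=2, increment=2)  # even IDs

odd_ids = [server1.get_id() for _ in range(4)]
even_ids = [server2.get_id() for _ in range(4)]
print(odd_ids)   # [1, 3, 5, 7]
print(even_ids)  # [2, 4, 6, 8]
```

The two ID streams never overlap, which is exactly why Flickr could run two ticket servers behind a load balancer.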

✅ Advantages

  • Simple implementation
  • IDs are roughly time-ordered
  • Works with any MySQL setup
  • No clock sync issues

❌ Disadvantages

  • Network call for every ID
  • Single point of failure risk
  • Database becomes bottleneck
  • Not suitable for 10K+ IDs/sec

Approach 3: UUID with Timestamp (UUIDv7)

🔑 Time-Ordered UUID (128-bit Alternative)

📝 Note: This doesn't fit the 64-bit requirement but is included for completeness as it's widely used when 128-bit is acceptable.

UUIDv7 Structure (Time-ordered):
┌────────────────────────────────────────────────────────────────┐
│  48-bit timestamp  │ 4-bit │  12-bit random  │ 2-bit │ 62-bit │
│   (milliseconds)   │version│    or counter   │variant│ random │
└────────────────────────────────────────────────────────────────┘

Example: 018e3f7a-5ec9-7a2b-9e4c-4f2b6c3d8e9f
         └─timestamp─┘└ver┘└─────random──────┘
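Recent Python (3.14+) ships uuid.uuid7() natively; on older versions, a UUIDv7-style value can be assembled by hand from the bit layout above. A sketch:

```python
import secrets
import time
import uuid

def uuid7_sketch():
    """Build a UUIDv7-style value: 48-bit ms timestamp, version 7,
    RFC 4122 variant, remaining 74 bits random."""
    timestamp_ms = int(time.time() * 1000) & ((1 << 48) - 1)
    rand_a = secrets.randbits(12)
    rand_b = secrets.randbits(62)
    value = (timestamp_ms << 80) | (0x7 << 76) | (rand_a << 64) \
            | (0b10 << 62) | rand_b
    return uuid.UUID(int=value)

u = uuid7_sketch()
print(u)          # timestamp-prefixed, so IDs sort by creation time
print(u.version)  # 7
```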

Approach 4: Alphanumeric IDs

🔤 Alphanumeric ID Generation

When alphanumeric IDs are acceptable, we gain more density and flexibility.

Approach 4a: Encoded Snowflake

def generate_alphanumeric_id():
    # Generate numeric Snowflake ID
    numeric_id = snowflake_generator.generate()  # e.g., 412376492842516481
    
    # Encode to Base62 for URL-safe alphanumeric
    base62_id = base62_encode(numeric_id)  # ~10-11 characters
    
    return base62_id

# Benefits:
# - 10-11 characters instead of up to 19 digits
# - Still time-ordered (pad to a fixed length to preserve lexicographic order)
# - URL-safe without encoding
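Both snippets in this section assume a base62_encode helper that isn't defined above. A minimal version:

```python
import string

# 0-9, A-Z, a-z: 62 URL-safe symbols
BASE62_ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase

def base62_encode(number):
    """Encode a non-negative integer as a Base62 string."""
    if number == 0:
        return BASE62_ALPHABET[0]
    digits = []
    while number > 0:
        number, remainder = divmod(number, 62)
        digits.append(BASE62_ALPHABET[remainder])
    return ''.join(reversed(digits))

print(base62_encode(412376492842516481))  # 19 digits -> 10 Base62 characters
```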

Approach 4b: Nano ID

import string
import secrets
import time

class NanoIDGenerator:
    def __init__(self, alphabet=None, size=21):
        self.alphabet = alphabet or string.ascii_letters + string.digits + '-_'
        self.size = size
        
    def generate(self):
        # Add timestamp prefix for ordering (optional)
        timestamp_part = base62_encode(int(time.time() * 1000))
        
        # Random suffix for uniqueness
        random_part = ''.join(secrets.choice(self.alphabet) 
                             for _ in range(self.size - len(timestamp_part)))
        
        return timestamp_part + random_part

# Example: "1Ky8_dXl9pRb2C4mNvQ7F"

Approach 4c: KSUID (K-Sortable Unique ID)

KSUID Structure (160 bits = 20 bytes):
┌─────────────────────────────────────────┐
│ 32-bit timestamp │ 128-bit random payload│
│  (seconds since  │   (16 random bytes)   │
│   custom epoch)  │                       │
└─────────────────────────────────────────┘

Base62 encoded: 27 characters
Example: "2LjpHjBdPXKDSJ4gVcNvYjZ9MqT"

Comparison Matrix

| Approach         | 64-bit     | Ordered  | Throughput | Complexity | Use Case                    |
|------------------|------------|----------|------------|------------|-----------------------------|
| Snowflake        | ✅ Yes     | ✅ Yes   | 4M/sec     | Medium     | Twitter, Discord, Instagram |
| Ticket Server    | ✅ Yes     | ⚠️ Rough | 10K/sec    | Simple     | Flickr, small scale         |
| UUIDv7           | ❌ 128-bit | ✅ Yes   | Unlimited  | Simple     | When 128-bit OK             |
| Base62 Snowflake | ✅ Yes*    | ✅ Yes   | 4M/sec     | Medium     | URL shorteners              |
| KSUID            | ❌ 160-bit | ✅ Yes   | Unlimited  | Simple     | Segment, Stripe             |

*64-bit integer underneath; stored and shared as a short string.

Real-World Implementations

🐦 Twitter Snowflake

Original Snowflake implementation for tweet IDs

41 bits: timestamp
10 bits: machine ID
12 bits: sequence
= 64-bit sortable IDs

📸 Instagram

Modified Snowflake with PostgreSQL

41 bits: timestamp
13 bits: shard ID
10 bits: sequence
Uses PostgreSQL schemas

💬 Discord

Snowflake for messages, channels, users

Discord epoch: 2015-01-01
42 bits: timestamp
5 bits: worker ID
5 bits: process ID
12 bits: increment
Visible in all Discord IDs

🔧 MongoDB ObjectId

96-bit IDs with embedded timestamp

32 bits: Unix timestamp
40 bits: random value
24 bits: counter
Hex encoded: 24 chars

Handling Edge Cases

⚠️ Critical Considerations

Clock Synchronization

  • Use NTP to keep clocks in sync
  • Detect clock rollback and refuse to generate
  • Consider monotonic clocks
  • Add jitter buffer (wait if clock moves back)
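A common refinement combines the last two bullets: wait out a small backward drift, but refuse to generate on a large one. A sketch; the 5 ms tolerance is an assumed value to tune against your NTP drift:

```python
import time

MAX_BACKWARD_DRIFT_MS = 5  # assumed tolerance; tune for your environment

def handle_clock_rollback(current_ms, last_ms):
    """Small rollback: wait it out. Large rollback: refuse to generate."""
    drift = last_ms - current_ms
    if drift <= 0:
        return current_ms                       # clock is fine
    if drift <= MAX_BACKWARD_DRIFT_MS:
        time.sleep(drift / 1000.0)              # wait for the clock to catch up
        return max(int(time.time() * 1000), last_ms)
    raise RuntimeError(f"Clock moved back {drift} ms; refusing to generate IDs")

print(handle_clock_rollback(1000, 998))  # 1000 (no rollback)
```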

Machine ID Assignment

  • Use ZooKeeper for dynamic assignment
  • Environment variables for static assignment
  • IP address hashing (risk of collisions)
  • Container orchestrator labels (K8s)
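The static-assignment option can be as simple as reading and validating an environment variable (the WORKER_ID name here is hypothetical; in K8s it might be derived from a StatefulSet ordinal):

```python
import os

MAX_WORKER_ID = (1 << 10) - 1  # 10 machine-ID bits -> 0..1023

def worker_id_from_env(env_var="WORKER_ID"):
    """Read a statically assigned worker ID, failing loudly rather than
    guessing (a duplicate machine ID silently breaks uniqueness)."""
    raw = os.environ.get(env_var)
    if raw is None:
        raise RuntimeError(f"{env_var} is not set; refusing to guess a machine ID")
    worker_id = int(raw)
    if not 0 <= worker_id <= MAX_WORKER_ID:
        raise ValueError(f"{env_var}={worker_id} outside 0..{MAX_WORKER_ID}")
    return worker_id

os.environ["WORKER_ID"] = "42"  # e.g. injected by deployment tooling
print(worker_id_from_env())      # 42
```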

Sequence Overflow

  • Wait for next millisecond if sequence exhausted
  • Pre-generate ID pools during low traffic
  • Use larger sequence bits if needed
  • Monitor sequence usage patterns

Final Recommendation

🎯 For Your Requirements

Given the requirements (64-bit, numerical, ordered, 10K+ IDs/sec), Snowflake is the best choice.

Implementation Plan:

  1. Use 41-bit timestamp (69 years from custom epoch)
  2. Allocate 10 bits for machine ID (1024 servers)
  3. Use 12-bit sequence (4096 IDs/ms = 4M IDs/sec per machine)
  4. Implement clock rollback protection
  5. Use ZooKeeper/etcd for machine ID coordination
  6. For alphanumeric: Encode Snowflake IDs with Base62

Summary

Distributed ID generation is a fundamental problem with multiple solutions. Snowflake provides the best balance of:

  • ✅ Fits in 64-bit
  • ✅ Time-ordered for database indexing
  • ✅ No coordination needed after setup
  • ✅ Scales to thousands of machines
  • ✅ Proven at Twitter, Discord, Instagram scale

For alphanumeric IDs, encoding Snowflake IDs with Base62 gives you shorter, URL-safe identifiers while maintaining all the benefits.

💡 Pro Tip: Start with Snowflake for numerical IDs. If you need shorter strings for URLs or user-facing IDs, Base62-encode them. This gives you the best of both worlds: sortable 64-bit integers in your database and compact strings for your users.