Bloom Filters

A space-efficient probabilistic data structure that tells you if an element is definitely not in a set or possibly in a set. Perfect for cases where false positives are acceptable but false negatives are not.

The Core Idea

Why Bloom Filters?

Traditional Set Lookup

  • HashSet: ~32 bytes per element
  • 1 billion URLs = ~32 GB RAM
  • Perfect accuracy
  • O(1) lookup, but memory intensive

Bloom Filter

  • ~10 bits per element
  • 1 billion URLs = ~1.2 GB RAM
  • ~99% accuracy (tunable)
  • O(k) lookup, where k = number of hash functions

How Bloom Filters Work

🔧 The Algorithm

Components:

  • Bit array of size m (initially all 0s)
  • k independent hash functions

Add element:

  1. Hash the element with the k hash functions
  2. Set the bits at those k positions to 1

Check element:

  1. Hash the element with the same k hash functions
  2. If all k positions are 1 → "possibly in set"
  3. If any position is 0 → "definitely not in set"
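The two operations above are only a few lines of code. Here is a toy sketch (fixed m = 32 and k = 3, with Python's built-in hashlib standing in for proper independent hash functions; a tunable implementation follows later in this article):

import hashlib

M, K = 32, 3              # toy parameters: 32-bit array, 3 hash functions
bits = [0] * M            # bit array, initially all zeros

def positions(item):
    """Derive k bit positions by hashing the item with k different seeds."""
    return [int(hashlib.sha256(f"{seed}:{item}".encode()).hexdigest(), 16) % M
            for seed in range(K)]

def add(item):
    for pos in positions(item):
        bits[pos] = 1     # set all k bits for this item

def might_contain(item):
    # any zero bit → definitely not in set; all ones → possibly in set
    return all(bits[pos] for pos in positions(item))

add("apple")
print(might_contain("apple"))   # True ("possibly in set")
print(might_contain("orange"))  # almost certainly False ("definitely not in set")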

Interactive Demo

[Interactive demo: a Bloom filter with a 20-bit array and k = 3 hash functions; add items and check membership to see which bits get set and how the fill rate changes.]

Visual Example

Initial Bloom Filter (m=10 bits, k=3 hash functions):
[0][0][0][0][0][0][0][0][0][0]
 0  1  2  3  4  5  6  7  8  9

Add "apple":
hash1("apple") = 2, hash2("apple") = 5, hash3("apple") = 8
[0][0][1][0][0][1][0][0][1][0]
 0  1  2  3  4  5  6  7  8  9

Add "banana":
hash1("banana") = 1, hash2("banana") = 5, hash3("banana") = 7
[0][1][1][0][0][1][0][1][1][0]
 0  1  2  3  4  5  6  7  8  9

Check "apple":
Positions 2, 5, 8 → All are 1 → "Possibly in set" ✓

Check "orange":
hash1("orange") = 2, hash2("orange") = 4, hash3("orange") = 8
Positions 2, 4, 8 → Position 4 is 0 → "Definitely not in set" ✗

Mathematical Foundation

Optimal Parameters

False Positive Rate

p ≈ (1 - e^(-kn/m))^k

  • p = false positive probability
  • n = number of elements
  • m = bit array size
  • k = number of hash functions

Optimal Values

k = (m/n) × ln(2) ≈ 0.7 × (m/n)
m = -n × ln(p) / (ln(2)^2)

  • For p = 0.01 (1% FP): ~10 bits/element
  • For p = 0.001 (0.1% FP): ~15 bits/element
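Plugging concrete (illustrative) numbers into these formulas: for n = 1,000,000 keys and a target p = 1%, the optimal array is about 9.6 million bits (~1.2 MB) with k = 7 hash functions, and the resulting false positive rate lands right around the target:

import math

def bloom_parameters(n, p):
    """Optimal bit-array size m and hash count k for n elements at target FP rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round(m / n * math.log(2)))
    return m, k

def false_positive_rate(n, m, k):
    """Expected FP rate: p ≈ (1 - e^(-kn/m))^k."""
    return (1 - math.exp(-k * n / m)) ** k

n, p = 1_000_000, 0.01
m, k = bloom_parameters(n, p)
print(m, k)                          # 9,585,059 bits (~1.2 MB), k = 7
print(false_positive_rate(n, m, k))  # ~0.0100, right at the 1% target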

Real-World Use Case: Distributed KV Store Read Path

🗄️ LSM Tree + Bloom Filter Optimization

In our distributed key-value store using LSM trees, Bloom filters dramatically speed up reads:

Without Bloom Filters:

Read "user:123":

  1. Check MemTable → Not found
  2. Check SSTable L0_1 → Disk read → Not found
  3. Check SSTable L0_2 → Disk read → Not found
  4. Check SSTable L1_1 → Disk read → Not found
  5. Check SSTable L1_2 → Disk read → Found!

Result: 4 unnecessary disk reads 💀

With Bloom Filters:

Read "user:123":

  1. Check MemTable → Not found
  2. Check L0_1 Bloom filter → "Not present" (skip)
  3. Check L0_2 Bloom filter → "Not present" (skip)
  4. Check L1_1 Bloom filter → "Not present" (skip)
  5. Check L1_2 Bloom filter → "Possibly present"
  6. Check SSTable L1_2 → Disk read → Found!

Result: Only 1 disk read! 🚀

Implementation Example

import mmh3  # MurmurHash3 for better distribution
import math

class BloomFilter:
    def __init__(self, expected_elements, false_positive_rate=0.01):
        # Calculate optimal size and hash functions
        self.size = self._optimal_size(expected_elements, false_positive_rate)
        self.num_hashes = self._optimal_hashes(self.size, expected_elements)
        self.bit_array = [0] * self.size
        self.count = 0
        
    def _optimal_size(self, n, p):
        """Calculate optimal bit array size"""
        return int(-n * math.log(p) / (math.log(2) ** 2))
    
    def _optimal_hashes(self, m, n):
        """Calculate optimal number of hash functions"""
        return max(1, int(m / n * math.log(2)))
    
    def _hash(self, item, seed):
        """Generate hash value for item with seed"""
        return mmh3.hash(str(item), seed) % self.size
    
    def add(self, item):
        """Add item to the Bloom filter"""
        for i in range(self.num_hashes):
            index = self._hash(item, i)
            self.bit_array[index] = 1
        self.count += 1
    
    def contains(self, item):
        """Check if item might be in the set"""
        for i in range(self.num_hashes):
            index = self._hash(item, i)
            if self.bit_array[index] == 0:
                return False  # Definitely not in set
        return True  # Possibly in set

# Usage in KV Store
class LSMTreeWithBloom:
    def __init__(self):
        self.sstables = []
        
    def create_sstable(self, data):
        bloom = BloomFilter(len(data), false_positive_rate=0.001)
        for key in data:
            bloom.add(key)
        
        sstable = {
            'bloom_filter': bloom,
            'data': data,
            'min_key': min(data.keys()),
            'max_key': max(data.keys())
        }
        self.sstables.append(sstable)
    
    def get(self, key):
        for sstable in self.sstables:
            # First check Bloom filter (memory operation)
            if not sstable['bloom_filter'].contains(key):
                continue  # Skip this SSTable
            
            # Bloom filter says "maybe" - do actual disk read
            if key in sstable['data']:
                return sstable['data'][key]
        
        return None  # Key not found
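A quick usage sketch of the classes above (requires the mmh3 package; the keys and values are made up for illustration):

# Standalone Bloom filter
bf = BloomFilter(expected_elements=1000, false_positive_rate=0.01)
bf.add("user:123")
print(bf.contains("user:123"))   # True  ("possibly in set")
print(bf.contains("user:999"))   # False ("definitely not in set"), barring a false positive

# Bloom-filter-guarded SSTable lookups
store = LSMTreeWithBloom()
store.create_sstable({"user:123": "Alice", "user:456": "Bob"})
store.create_sstable({"user:789": "Carol"})
print(store.get("user:789"))     # "Carol" — typically only the second SSTable's data is consulted
print(store.get("user:000"))     # None — both Bloom filters say "not present"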

Comparison with Alternatives

Data Structure   | Best For               | Pros                                        | Cons
Bloom Filter     | Set membership         | Simple, space-efficient                     | No deletions, no counting
Cuckoo Filter    | Dynamic sets           | Supports deletion, better cache locality    | Slightly larger, insertion can fail
Count-Min Sketch | Frequency counting     | Tracks frequencies, handles streams         | Only for counting, not membership
HyperLogLog      | Cardinality estimation | Extremely space-efficient for unique counts | Only estimates count, not membership
XOR Filter       | Static sets            | 25% smaller, 2x faster queries              | Immutable after construction

Real-World Applications

🌐 Web Crawlers

Check if URL already crawled without storing millions of URLs

if not bloom.contains(url):
  crawl(url)
  bloom.add(url)

🗄️ Databases

Skip disk reads for non-existent keys in LSM trees

Cassandra, RocksDB, and LevelDB all use Bloom filters.

🔒 Security

Check passwords against breach lists without storing them

Chrome has used Bloom filters for Safe Browsing.

📧 Spam Filtering

Quick check if email/IP in spam list before detailed analysis

if bloom.contains(sender_ip):
  deep_spam_check()

When to Use Bloom Filters

✅ Perfect For

  • Avoiding expensive operations (disk reads, network calls)
  • Large datasets where space is critical
  • Applications tolerating false positives
  • Write-once, read-many scenarios
  • Distributed systems coordination

❌ Not Suitable For

  • Need 100% accuracy
  • Frequent deletions required
  • Need to retrieve actual values
  • Small datasets (overhead not worth it)
  • Need to count occurrences

Advanced Topics

🔬Deep Dive: Counting Bloom Filters & Scalable Variants

Counting Bloom Filter

Instead of bits, use small counters (3-4 bits each). Allows deletions!

  • Add: increment the counters at the k positions
  • Delete: decrement the counters at the k positions
  • Check: all k counters > 0 → possibly in set
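As a rough sketch, the BloomFilter class from the implementation above can be extended by treating each slot as a small counter; the 4-bit cap and the naive decrement are simplifications (production implementations avoid decrementing saturated counters):

class CountingBloomFilter(BloomFilter):
    """Same layout as BloomFilter, but each slot is a small counter instead of a bit."""

    def add(self, item):
        for i in range(self.num_hashes):
            index = self._hash(item, i)
            if self.bit_array[index] < 15:   # cap at 4 bits per counter
                self.bit_array[index] += 1
        self.count += 1

    def remove(self, item):
        # Only safe for items that were actually added; deleting a never-added
        # item would corrupt counters shared with other items.
        if not self.contains(item):
            return
        for i in range(self.num_hashes):
            index = self._hash(item, i)
            if self.bit_array[index] > 0:
                self.bit_array[index] -= 1
        self.count -= 1

    # contains() is inherited unchanged: any counter at 0 → "definitely not in set"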

Scalable Bloom Filter

Grows dynamically as more elements are added, maintaining the target false positive rate.

  • Start with a small filter (e.g., sized for 1,000 elements)
  • When ~70% full → add a new, larger filter (new items go only into the newest filter)
  • Check: test all filters, newest to oldest
  • Overall FP rate: 1 - (1-p₀)(1-p₁)(1-p₂)... ≤ p₀ + p₁ + p₂ + ...
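A rough sketch of the growth logic, again building on the BloomFilter class above (the 70% threshold and 2x growth factor are illustrative choices; a production version would also tighten the per-filter FP rate, e.g. pᵢ = p₀·rⁱ, to keep the compound rate bounded):

class ScalableBloomFilter:
    """Chain of Bloom filters that grows as items are added."""

    def __init__(self, initial_capacity=1000, false_positive_rate=0.001):
        self.fp_rate = false_positive_rate
        self.capacities = [initial_capacity]
        self.filters = [BloomFilter(initial_capacity, false_positive_rate)]

    def add(self, item):
        # Grow once the newest filter reaches ~70% of its intended capacity
        if self.filters[-1].count >= 0.7 * self.capacities[-1]:
            new_capacity = self.capacities[-1] * 2   # illustrative growth factor
            self.capacities.append(new_capacity)
            self.filters.append(BloomFilter(new_capacity, self.fp_rate))
        self.filters[-1].add(item)

    def contains(self, item):
        # An item may live in any generation, so test every filter (newest first)
        return any(f.contains(item) for f in reversed(self.filters))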

Distributed Bloom Filters

Split filter across multiple nodes for massive scale.

  • Partitioned: each node owns a range of the bit array
  • Replicated: a full copy on each node
  • Hierarchical: summary filters plus detailed filters

Summary

Bloom filters are a powerful tool in the system designer's toolkit. They trade a small amount of accuracy for massive space savings and performance improvements. In distributed systems like our key-value store, they're essential for avoiding expensive disk I/O operations.

Key Takeaways:

  • 10-30x space savings compared to traditional sets
  • O(k) lookups and inserts, independent of dataset size
  • Perfect for "definitely not in set" queries
  • Critical component in modern databases and distributed systems
  • Choose variants (Cuckoo, XOR) based on specific requirements

💡 Pro Tip: In production systems, always benchmark with your actual data. The theoretical false positive rate might differ from real-world performance due to hash function quality and data distribution.