Bloom Filters
A space-efficient probabilistic data structure that tells you if an element is definitely not in a set or possibly in a set. Perfect for cases where false positives are acceptable but false negatives are not.
The Core Idea
Why Bloom Filters?
Traditional Set Lookup
- • HashSet: ~32 bytes per element
- • 1 billion URLs = ~32 GB RAM
- • Perfect accuracy
- • O(1) lookup, but memory intensive
Bloom Filter
- • ~10 bits per element
- • 1 billion URLs = ~1.2 GB RAM
- • 99.9% accuracy (tunable)
- • O(k) lookup where k = hash functions
How Bloom Filters Work
🔧 The Algorithm
Components:
- • Bit array of size m (initially all 0s)
- • k independent hash functions
Add element:
- 1. Hash element with k hash functions
- 2. Set bits at those k positions to 1
Check element:
- 1. Hash element with same k hash functions
- 2. If all k positions are 1 → "possibly in set"
- 3. If any position is 0 → "definitely not in set"
Interactive Demo
Interactive Bloom Filter
Bit Array (m=20, k=3 hash functions)
Add Items
Check Membership
Added Items
Filter Stats
Bits set: 0 / 20
Fill rate: 0.0%
Items added: 0
Legend
Visual Example
Initial Bloom Filter (m=10 bits, k=3 hash functions):
[0][0][0][0][0][0][0][0][0][0]
0 1 2 3 4 5 6 7 8 9
Add "apple":
hash1("apple") = 2, hash2("apple") = 5, hash3("apple") = 8
[0][0][1][0][0][1][0][0][1][0]
0 1 2 3 4 5 6 7 8 9
Add "banana":
hash1("banana") = 1, hash2("banana") = 5, hash3("banana") = 7
[0][1][1][0][0][1][0][1][1][0]
0 1 2 3 4 5 6 7 8 9
Check "apple":
Positions 2, 5, 8 → All are 1 → "Possibly in set" ✓
Check "orange":
hash1("orange") = 2, hash2("orange") = 4, hash3("orange") = 8
Positions 2, 4, 8 → Position 4 is 0 → "Definitely not in set" ✗Mathematical Foundation
Optimal Parameters
False Positive Rate
p ≈ (1 - e^(-kn/m))^k
- • p = false positive probability
- • n = number of elements
- • m = bit array size
- • k = number of hash functions
Optimal Values
k = (m/n) × ln(2) ≈ 0.7 × (m/n) m = -n × ln(p) / (ln(2)^2)
- • For p=0.01 (1% FP): ~10 bits/element
- • For p=0.001 (0.1% FP): ~15 bits/element
Real-World Use Case: Distributed KV Store Read Path
🗄️ LSM Tree + Bloom Filter Optimization
In our distributed key-value store using LSM trees, Bloom filters dramatically speed up reads:
Without Bloom Filters:
Read "user:123":
- Check MemTable → Not found
- Check SSTable L0_1 → Disk read → Not found
- Check SSTable L0_2 → Disk read → Not found
- Check SSTable L1_1 → Disk read → Not found
- Check SSTable L1_2 → Disk read → Found!
Result: 4 unnecessary disk reads 💀
With Bloom Filters:
Read "user:123":
- Check MemTable → Not found
- Check L0_1 Bloom filter → "Not present" (skip)
- Check L0_2 Bloom filter → "Not present" (skip)
- Check L1_1 Bloom filter → "Not present" (skip)
- Check L1_2 Bloom filter → "Possibly present"
- Check SSTable L1_2 → Disk read → Found!
Result: Only 1 disk read! 🚀
Implementation Example
import mmh3 # MurmurHash3 for better distribution
import math
class BloomFilter:
def __init__(self, expected_elements, false_positive_rate=0.01):
# Calculate optimal size and hash functions
self.size = self._optimal_size(expected_elements, false_positive_rate)
self.num_hashes = self._optimal_hashes(self.size, expected_elements)
self.bit_array = [0] * self.size
self.count = 0
def _optimal_size(self, n, p):
"""Calculate optimal bit array size"""
return int(-n * math.log(p) / (math.log(2) ** 2))
def _optimal_hashes(self, m, n):
"""Calculate optimal number of hash functions"""
return max(1, int(m / n * math.log(2)))
def _hash(self, item, seed):
"""Generate hash value for item with seed"""
return mmh3.hash(str(item), seed) % self.size
def add(self, item):
"""Add item to the Bloom filter"""
for i in range(self.num_hashes):
index = self._hash(item, i)
self.bit_array[index] = 1
self.count += 1
def contains(self, item):
"""Check if item might be in the set"""
for i in range(self.num_hashes):
index = self._hash(item, i)
if self.bit_array[index] == 0:
return False # Definitely not in set
return True # Possibly in set
# Usage in KV Store
class LSMTreeWithBloom:
def __init__(self):
self.sstables = []
def create_sstable(self, data):
bloom = BloomFilter(len(data), false_positive_rate=0.001)
for key in data:
bloom.add(key)
sstable = {
'bloom_filter': bloom,
'data': data,
'min_key': min(data.keys()),
'max_key': max(data.keys())
}
self.sstables.append(sstable)
def get(self, key):
for sstable in self.sstables:
# First check Bloom filter (memory operation)
if not sstable['bloom_filter'].contains(key):
continue # Skip this SSTable
# Bloom filter says "maybe" - do actual disk read
if key in sstable['data']:
return sstable['data'][key]
return None # Key not foundComparison with Alternatives
| Data Structure | Best For | Pros | Cons |
|---|---|---|---|
| Bloom Filter | Set membership | Simple, space-efficient | No deletions, no counting |
| Cuckoo Filter | Dynamic sets | Supports deletion, better cache locality | Slightly larger, insertion can fail |
| Count-Min Sketch | Frequency counting | Tracks frequencies, handles streams | Only for counting, not membership |
| HyperLogLog | Cardinality estimation | Extremely space-efficient for unique counts | Only estimates count, not membership |
| XOR Filter | Static sets | 25% smaller, 2x faster queries | Immutable after construction |
Real-World Applications
🌐 Web Crawlers
Check if URL already crawled without storing millions of URLs
if not bloom.contains(url):
crawl(url)
bloom.add(url)
🗄️ Databases
Skip disk reads for non-existent keys in LSM trees
Cassandra, RocksDB, LevelDB
all use Bloom filters
🔒 Security
Check passwords against breach lists without storing them
Chrome uses Bloom filters
for Safe Browsing
📧 Spam Filtering
Quick check if email/IP in spam list before detailed analysis
if bloom.contains(sender_ip):
deep_spam_check()
When to Use Bloom Filters
✅ Perfect For
- • Avoiding expensive operations (disk reads, network calls)
- • Large datasets where space is critical
- • Applications tolerating false positives
- • Write-once, read-many scenarios
- • Distributed systems coordination
❌ Not Suitable For
- • Need 100% accuracy
- • Frequent deletions required
- • Need to retrieve actual values
- • Small datasets (overhead not worth it)
- • Need to count occurrences
Advanced Topics
🔬Deep Dive: Counting Bloom Filters & Scalable Variants▼
Counting Bloom Filter
Instead of bits, use small counters (3-4 bits each). Allows deletions!
Add: Increment counters at k positions Delete: Decrement counters at k positions Check: All k counters > 0 → possibly in set
Scalable Bloom Filter
Grows dynamically as more elements added, maintaining target false positive rate.
Start with small filter (e.g., 1000 elements) When 70% full → Add new larger filter Check: Test all filters (newest to oldest) FP rate: p₀ × (1 + p₁ + p₁×p₂ + ...)
Distributed Bloom Filters
Split filter across multiple nodes for massive scale.
Partitioned: Each node owns bit range Replicated: Full copy on each node Hierarchical: Summary filters + detailed filters
Summary
Bloom filters are a powerful tool in the system designer's toolkit. They trade a small amount of accuracy for massive space savings and performance improvements. In distributed systems like our key-value store, they're essential for avoiding expensive disk I/O operations.
Key Takeaways:
- 10-30x space savings compared to traditional sets
- O(k) constant time operations regardless of dataset size
- Perfect for "definitely not in set" queries
- Critical component in modern databases and distributed systems
- Choose variants (Cuckoo, XOR) based on specific requirements
💡 Pro Tip: In production systems, always benchmark with your actual data. The theoretical false positive rate might differ from real-world performance due to hash function quality and data distribution.