Bloom Filters

A space-efficient probabilistic data structure that tells you if an element is definitely not in a set or possibly in a set. Perfect for cases where false positives are acceptable but false negatives are not.

The Core Idea

Why Bloom Filters?

Traditional Set Lookup

  • HashSet: ~32 bytes per element
  • 1 billion URLs = ~32 GB RAM
  • Perfect accuracy
  • O(1) lookup, but memory intensive

Bloom Filter

  • ~10 bits per element
  • 1 billion URLs = ~1.2 GB RAM
  • ~99% accuracy (tunable)
  • O(k) lookup, where k = number of hash functions

How Bloom Filters Work

🔧 The Algorithm

Components:

  • Bit array of size m (initially all 0s)
  • k independent hash functions

Add element:

  1. Hash the element with the k hash functions
  2. Set the bits at those k positions to 1

Check element:

  1. Hash the element with the same k hash functions
  2. If all k positions are 1 → "possibly in set"
  3. If any position is 0 → "definitely not in set"
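The two operations above are only a few lines of code. Here is a toy sketch (fixed m = 32 and k = 3, with Python's built-in hashlib standing in for proper independent hash functions; a tunable implementation follows later in this article):

import hashlib

M, K = 32, 3              # toy parameters: 32-bit array, 3 hash functions
bits = [0] * M            # bit array, initially all zeros

def positions(item):
    """Derive k bit positions by hashing the item with k different seeds."""
    return [int(hashlib.sha256(f"{seed}:{item}".encode()).hexdigest(), 16) % M
            for seed in range(K)]

def add(item):
    for pos in positions(item):
        bits[pos] = 1     # set all k bits for this item

def might_contain(item):
    # any zero bit → definitely not in set; all ones → possibly in set
    return all(bits[pos] for pos in positions(item))

add("apple")
print(might_contain("apple"))   # True ("possibly in set")
print(might_contain("orange"))  # almost certainly False ("definitely not in set")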

Interactive Demo

[Interactive demo: a Bloom filter with a 20-bit array and k = 3 hash functions; add items and check membership to see which bits get set and how the fill rate changes.]

Visual Example

Initial Bloom Filter (m=10 bits, k=3 hash functions):
[0][0][0][0][0][0][0][0][0][0]
 0  1  2  3  4  5  6  7  8  9

Add "apple":
hash1("apple") = 2, hash2("apple") = 5, hash3("apple") = 8
[0][0][1][0][0][1][0][0][1][0]
 0  1  2  3  4  5  6  7  8  9

Add "banana":
hash1("banana") = 1, hash2("banana") = 5, hash3("banana") = 7
[0][1][1][0][0][1][0][1][1][0]
 0  1  2  3  4  5  6  7  8  9

Check "apple":
Positions 2, 5, 8 → All are 1 → "Possibly in set" ✓

Check "orange":
hash1("orange") = 2, hash2("orange") = 4, hash3("orange") = 8
Positions 2, 4, 8 → Position 4 is 0 → "Definitely not in set" ✗

Mathematical Foundation

Optimal Parameters

False Positive Rate

p ≈ (1 - e^(-kn/m))^k

  • p = false positive probability
  • n = number of elements
  • m = bit array size
  • k = number of hash functions

Optimal Values

k = (m/n) × ln(2) ≈ 0.7 × (m/n)
m = -n × ln(p) / (ln(2)^2)

  • For p = 0.01 (1% FP): ~10 bits/element
  • For p = 0.001 (0.1% FP): ~15 bits/element
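Plugging concrete (illustrative) numbers into these formulas: for n = 1,000,000 keys and a target p = 1%, the optimal array is about 9.6 million bits (~1.2 MB) with k = 7 hash functions, and the resulting false positive rate lands right around the target:

import math

def bloom_parameters(n, p):
    """Optimal bit-array size m and hash count k for n elements at target FP rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round(m / n * math.log(2)))
    return m, k

def false_positive_rate(n, m, k):
    """Expected FP rate: p ≈ (1 - e^(-kn/m))^k."""
    return (1 - math.exp(-k * n / m)) ** k

n, p = 1_000_000, 0.01
m, k = bloom_parameters(n, p)
print(m, k)                          # 9,585,059 bits (~1.2 MB), k = 7
print(false_positive_rate(n, m, k))  # ~0.0100, right at the 1% target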

Real-World Use Case: Distributed KV Store Read Path

🗄️ LSM Tree + Bloom Filter Optimization

In our distributed key-value store using LSM trees, Bloom filters dramatically speed up reads:

Without Bloom Filters:

Read "user:123":

  1. Check MemTable → Not found
  2. Check SSTable L0_1 → Disk read → Not found
  3. Check SSTable L0_2 → Disk read → Not found
  4. Check SSTable L1_1 → Disk read → Not found
  5. Check SSTable L1_2 → Disk read → Found!

Result: 4 unnecessary disk reads 💀

With Bloom Filters:

Read "user:123":

  1. Check MemTable → Not found
  2. Check L0_1 Bloom filter → "Not present" (skip)
  3. Check L0_2 Bloom filter → "Not present" (skip)
  4. Check L1_1 Bloom filter → "Not present" (skip)
  5. Check L1_2 Bloom filter → "Possibly present"
  6. Check SSTable L1_2 → Disk read → Found!

Result: Only 1 disk read! 🚀

Implementation Example

import mmh3  # MurmurHash3 for better distribution
import math

class BloomFilter:
    def __init__(self, expected_elements, false_positive_rate=0.01):
        # Calculate optimal size and hash functions
        self.size = self._optimal_size(expected_elements, false_positive_rate)
        self.num_hashes = self._optimal_hashes(self.size, expected_elements)
        self.bit_array = [0] * self.size
        self.count = 0
        
    def _optimal_size(self, n, p):
        """Calculate optimal bit array size"""
        return int(-n * math.log(p) / (math.log(2) ** 2))
    
    def _optimal_hashes(self, m, n):
        """Calculate optimal number of hash functions"""
        return max(1, int(m / n * math.log(2)))
    
    def _hash(self, item, seed):
        """Generate hash value for item with seed"""
        return mmh3.hash(str(item), seed) % self.size
    
    def add(self, item):
        """Add item to the Bloom filter"""
        for i in range(self.num_hashes):
            index = self._hash(item, i)
            self.bit_array[index] = 1
        self.count += 1
    
    def contains(self, item):
        """Check if item might be in the set"""
        for i in range(self.num_hashes):
            index = self._hash(item, i)
            if self.bit_array[index] == 0:
                return False  # Definitely not in set
        return True  # Possibly in set

# Usage in KV Store
class LSMTreeWithBloom:
    def __init__(self):
        self.sstables = []
        
    def create_sstable(self, data):
        bloom = BloomFilter(len(data), false_positive_rate=0.001)
        for key in data:
            bloom.add(key)
        
        sstable = {
            'bloom_filter': bloom,
            'data': data,
            'min_key': min(data.keys()),
            'max_key': max(data.keys())
        }
        self.sstables.append(sstable)
    
    def get(self, key):
        for sstable in self.sstables:
            # First check Bloom filter (memory operation)
            if not sstable['bloom_filter'].contains(key):
                continue  # Skip this SSTable
            
            # Bloom filter says "maybe" - do actual disk read
            if key in sstable['data']:
                return sstable['data'][key]
        
        return None  # Key not found
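A quick usage sketch of the classes above (requires the mmh3 package; the keys and values are made up for illustration):

# Standalone Bloom filter
bf = BloomFilter(expected_elements=1000, false_positive_rate=0.01)
bf.add("user:123")
print(bf.contains("user:123"))   # True  ("possibly in set")
print(bf.contains("user:999"))   # False ("definitely not in set"), barring a false positive

# Bloom-filter-guarded SSTable lookups
store = LSMTreeWithBloom()
store.create_sstable({"user:123": "Alice", "user:456": "Bob"})
store.create_sstable({"user:789": "Carol"})
print(store.get("user:789"))     # "Carol" — typically only the second SSTable's data is consulted
print(store.get("user:000"))     # None — both Bloom filters say "not present"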

Comparison with Alternatives

Data Structure   | Best For               | Pros                                        | Cons
Bloom Filter     | Set membership         | Simple, space-efficient                     | No deletions, no counting
Cuckoo Filter    | Dynamic sets           | Supports deletion, better cache locality    | Slightly larger, insertion can fail
Count-Min Sketch | Frequency counting     | Tracks frequencies, handles streams         | Only for counting, not membership
HyperLogLog      | Cardinality estimation | Extremely space-efficient for unique counts | Only estimates count, not membership
XOR Filter       | Static sets            | 25% smaller, 2x faster queries              | Immutable after construction

Real-World Applications

🌐 Web Crawlers

Check if URL already crawled without storing millions of URLs

if not bloom.contains(url):
  crawl(url)
  bloom.add(url)

🗄️ Databases

Skip disk reads for non-existent keys in LSM trees

Cassandra, RocksDB, and LevelDB all use Bloom filters.

🔒 Security

Check passwords against breach lists without storing them

Chrome has used Bloom filters for Safe Browsing.

📧 Spam Filtering

Quick check if email/IP in spam list before detailed analysis

if bloom.contains(sender_ip):
  deep_spam_check()

When to Use Bloom Filters

✅ Perfect For

  • Avoiding expensive operations (disk reads, network calls)
  • Large datasets where space is critical
  • Applications tolerating false positives
  • Write-once, read-many scenarios
  • Distributed systems coordination

❌ Not Suitable For

  • Need 100% accuracy
  • Frequent deletions required
  • Need to retrieve actual values
  • Small datasets (overhead not worth it)
  • Need to count occurrences

Advanced Topics

🔬Deep Dive: Counting Bloom Filters & Scalable Variants

Counting Bloom Filter

Instead of bits, use small counters (3-4 bits each). Allows deletions!

  • Add: increment the counters at the k positions
  • Delete: decrement the counters at the k positions
  • Check: all k counters > 0 → possibly in set
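As a rough sketch, the BloomFilter class from the implementation above can be extended by treating each slot as a small counter; the 4-bit cap and the naive decrement are simplifications (production implementations avoid decrementing saturated counters):

class CountingBloomFilter(BloomFilter):
    """Same layout as BloomFilter, but each slot is a small counter instead of a bit."""

    def add(self, item):
        for i in range(self.num_hashes):
            index = self._hash(item, i)
            if self.bit_array[index] < 15:   # cap at 4 bits per counter
                self.bit_array[index] += 1
        self.count += 1

    def remove(self, item):
        # Only safe for items that were actually added; deleting a never-added
        # item would corrupt counters shared with other items.
        if not self.contains(item):
            return
        for i in range(self.num_hashes):
            index = self._hash(item, i)
            if self.bit_array[index] > 0:
                self.bit_array[index] -= 1
        self.count -= 1

    # contains() is inherited unchanged: any counter at 0 → "definitely not in set"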

Scalable Bloom Filter

Grows dynamically as more elements are added, maintaining the target false positive rate.

  • Start with a small filter (e.g., sized for 1,000 elements)
  • When ~70% full → add a new, larger filter (new items go only into the newest filter)
  • Check: test all filters, newest to oldest
  • Overall FP rate: 1 - (1-p₀)(1-p₁)(1-p₂)... ≤ p₀ + p₁ + p₂ + ...
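A rough sketch of the growth logic, again building on the BloomFilter class above (the 70% threshold and 2x growth factor are illustrative choices; a production version would also tighten the per-filter FP rate, e.g. pᵢ = p₀·rⁱ, to keep the compound rate bounded):

class ScalableBloomFilter:
    """Chain of Bloom filters that grows as items are added."""

    def __init__(self, initial_capacity=1000, false_positive_rate=0.001):
        self.fp_rate = false_positive_rate
        self.capacities = [initial_capacity]
        self.filters = [BloomFilter(initial_capacity, false_positive_rate)]

    def add(self, item):
        # Grow once the newest filter reaches ~70% of its intended capacity
        if self.filters[-1].count >= 0.7 * self.capacities[-1]:
            new_capacity = self.capacities[-1] * 2   # illustrative growth factor
            self.capacities.append(new_capacity)
            self.filters.append(BloomFilter(new_capacity, self.fp_rate))
        self.filters[-1].add(item)

    def contains(self, item):
        # An item may live in any generation, so test every filter (newest first)
        return any(f.contains(item) for f in reversed(self.filters))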

Distributed Bloom Filters

Split filter across multiple nodes for massive scale.

  • Partitioned: each node owns a range of the bit array
  • Replicated: a full copy on each node
  • Hierarchical: summary filters plus detailed filters

Summary

Bloom filters are a powerful tool in the system designer's toolkit. They trade a small amount of accuracy for massive space savings and performance improvements. In distributed systems like our key-value store, they're essential for avoiding expensive disk I/O operations.

Key Takeaways:

  • 10-30x space savings compared to traditional sets
  • O(k) lookups and inserts, independent of dataset size
  • Perfect for "definitely not in set" queries
  • Critical component in modern databases and distributed systems
  • Choose variants (Cuckoo, XOR) based on specific requirements

💡 Pro Tip: In production systems, always benchmark with your actual data. The theoretical false positive rate might differ from real-world performance due to hash function quality and data distribution.