Encoding Schemes

Understanding different encoding schemes is crucial for system design. From shortening URLs to generating unique IDs, encoding helps us represent data efficiently while meeting specific constraints.

Why Encoding Matters

Common Use Cases

URL Shortening

  • Convert long URLs to short codes
  • bit.ly: 2VhK8pQ
  • YouTube: dQw4w9WgXcQ

Unique ID Generation

  • Human-readable identifiers
  • Distributed system IDs
  • Session tokens & API keys

Common Encoding Schemes

1. Base10 (Decimal)

Standard Decimal

Character Set: 0-9 (10 chars)

Use Cases: Human-readable numbers

Example: 12345678

✅ Universal understanding

✅ Easy validation

❌ Long representation

2. Base16 (Hexadecimal)

Hexadecimal

Character Set: 0-9, A-F (16 chars)

Use Cases: Memory addresses, color codes

Example: 4A3F2B1C

✅ Compact for binary data

✅ Direct byte mapping

⚠️ Case inconsistency: a-f and A-F are both valid, so normalize case before comparing strings
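
As a small illustration of the case issue, the sketch below round-trips the example bytes through hex with Python's built-in `bytes.hex` and `bytes.fromhex`. The helper name `same_hex` is hypothetical and only shows one way to compare hex strings safely.

```python
# The four bytes behind the example string 4A3F2B1C.
raw = bytes([0x4A, 0x3F, 0x2B, 0x1C])

lower_hex = raw.hex()           # bytes.hex() emits lowercase
upper_hex = lower_hex.upper()   # the same value written in uppercase

# bytes.fromhex accepts either case, so decoding is unaffected.
assert bytes.fromhex(lower_hex) == bytes.fromhex(upper_hex) == raw


def same_hex(a: str, b: str) -> bool:
    """Compare two hex strings after normalizing case (hypothetical helper)."""
    return a.lower() == b.lower()
```

A plain string comparison of the two spellings fails even though they encode the same bytes, which is why stored hex values are usually normalized to one case.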

3. Base32

Base32 Encoding

Character Set: A-Z, 2-7 (32 chars)

Use Cases: Case-insensitive systems

Example: JBSWY3DPEBLW64TM

✅ No case sensitivity

✅ Avoids ambiguous chars

❌ 20% longer than Base64

📝 Note: The standard (RFC 4648) alphabet omits the digits 0, 1, 8, and 9; 0, 1, and 8 are easily confused with O, I (or l), and B
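
A minimal sketch of the case-insensitivity point, using the standard-library `base64` module. The input `b"Hello"` is chosen only because five bytes encode to exactly eight Base32 characters with no padding.

```python
import base64
import binascii

encoded = base64.b32encode(b"Hello").decode("ascii")

# casefold=True lets the decoder accept lowercase input.
decoded = base64.b32decode(encoded.lower(), casefold=True)

# Without casefold, lowercase input is rejected.
try:
    base64.b32decode(encoded.lower())
    strict_accepts_lowercase = True
except binascii.Error:
    strict_accepts_lowercase = False

# The example string from this section decodes to the first ten bytes of "Hello World".
section_example = base64.b32decode("JBSWY3DPEBLW64TM")
```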

4. Base62

Base62 - The URL Shortener's Choice

Character Set: 0-9, a-z, A-Z (62 chars)

Use Cases: URL shorteners, readable IDs

Example: 3D7xmK9p

✅ URL-safe without encoding

✅ High density

✅ Human-friendly

Why Base62 for URLs?
  • No special characters that need URL encoding
  • Case-sensitive for maximum density
  • 62^7 = 3.5 trillion combinations in just 7 characters
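
The capacity figure above is simple exponentiation. The sketch below checks it and shows how one might pick a code length for a target number of URLs; both function names are illustrative, not from any library.

```python
def base62_capacity(length: int) -> int:
    """Number of distinct Base62 strings of exactly `length` characters."""
    return 62 ** length


def min_length_for(count: int) -> int:
    """Smallest code length whose capacity covers `count` distinct IDs."""
    length = 1
    while base62_capacity(length) < count:
        length += 1
    return length


print(base62_capacity(7))             # 3521614606208, about 3.5 trillion
print(min_length_for(1_000_000_000))  # 6, since 62^5 is just under one billion
```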

5. Base64

Base64 Encoding

Character Set: A-Z, a-z, 0-9, +, / (64 chars)

Use Cases: Binary data in text format

Example: SGVsbG8gV29ybGQh

✅ Efficient for binary

✅ Standard padding with =

❌ Not URL-safe (+, /)

Base64 Variants

Standard Base64

Uses + and / (requires URL encoding)

URL-Safe Base64

Uses - and _ instead of + and /
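
To make the difference concrete, the sketch below encodes two bytes chosen so that the output hits both problem characters. It uses only the standard-library `base64` module.

```python
import base64

# 0xFB 0xFF is picked because its 6-bit groups map to indices 62 and 63,
# the two positions where the alphabets differ.
raw = bytes([0xFB, 0xFF])

standard = base64.b64encode(raw).decode("ascii")          # uses + and /
url_safe = base64.urlsafe_b64encode(raw).decode("ascii")  # uses - and _

# Both variants decode back to the same bytes.
assert base64.b64decode(standard) == raw
assert base64.urlsafe_b64decode(url_safe) == raw
```

Note that the URL-safe variant still emits `=` padding; many systems strip it and restore it before decoding.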

Encoding Density Comparison

Encoding        Bits per Char   Encoded Length of 8 Bytes (64 bits)   Efficiency
Base10          ~3.32 bits      20 characters                         41.5%
Base16 (Hex)    4 bits          16 characters                         50%
Base32          5 bits          13 characters                         62.5%
Base62          ~5.95 bits      11 characters                         74.4%
Base64          6 bits          11 characters                         75%
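
The figures in this comparison can be reproduced with a few lines of arithmetic. This is a sketch under two assumptions: the 64-bit value is treated as an unsigned integer written without padding, and efficiency means bits carried per 8-bit character. The function names are illustrative.

```python
import math


def encoded_length(base: int, bits: int = 64) -> int:
    """Digits needed to write the largest `bits`-bit unsigned value in `base`."""
    value = (1 << bits) - 1
    digits = 0
    while value > 0:
        value //= base
        digits += 1
    return digits


def efficiency_percent(base: int) -> float:
    """Bits of information per character, as a share of an 8-bit byte."""
    return round(math.log2(base) / 8 * 100, 1)


for base in (10, 16, 32, 62, 64):
    print(base, round(math.log2(base), 2), encoded_length(base), efficiency_percent(base))
```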

Implementation Examples

Base62 Encoding/Decoding

class Base62Encoder:
    # Character set for Base62
    CHARSET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
    BASE = 62
    
    @staticmethod
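    def encode(num):
        """Convert a non-negative integer to a Base62 string"""
        if num < 0:
            # Without this guard a negative input would silently return ""
            raise ValueError("num must be non-negative")
        if num == 0:
            return "0"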
        
        result = []
        while num > 0:
            remainder = num % Base62Encoder.BASE
            result.append(Base62Encoder.CHARSET[remainder])
            num = num // Base62Encoder.BASE
        
        return ''.join(reversed(result))
    
    @staticmethod
    def decode(encoded):
        """Convert Base62 string back to number"""
        num = 0
        for char in encoded:
            num = num * Base62Encoder.BASE + Base62Encoder.CHARSET.index(char)
        return num

# Example usage
encoder = Base62Encoder()

# URL shortener use case
url_id = 125432985  # Database ID
short_code = encoder.encode(url_id)  # "8uiQF"

# Decode back
original_id = encoder.decode(short_code)  # 125432985

Custom Base Encoding

class CustomBaseEncoder:
    def __init__(self, charset):
        """Create encoder with custom character set"""
        self.charset = charset
        self.base = len(charset)
        # Create reverse lookup for decoding
        self.char_to_index = {char: i for i, char in enumerate(charset)}
    
    def encode(self, num):
        if num == 0:
            return self.charset[0]
        
        result = []
        while num > 0:
            result.append(self.charset[num % self.base])
            num //= self.base
        
        return ''.join(reversed(result))
    
    def decode(self, encoded):
        num = 0
        for char in encoded:
            num = num * self.base + self.char_to_index[char]
        return num

# Example: Crockford's Base32 (excludes I, L, O to avoid confusion with 1 and 0,
# and U to avoid accidental obscenities)
crockford_charset = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"
crockford = CustomBaseEncoder(crockford_charset)

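# Example: URL-safe Base64 alphabet. Note: this encodes an integer digit by digit;
# it is not byte-oriented Base64 (RFC 4648), which works on 6-bit groups and pads.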
urlsafe_base64_charset = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"
)
urlsafe = CustomBaseEncoder(urlsafe_base64_charset)
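
Crockford's scheme is also forgiving on input: it is case-insensitive, allows hyphens as separators, and reads the look-alike letters O as 0 and I or L as 1. The standalone sketch below illustrates that decoding rule; the function names are hypothetical and check symbols are left out.

```python
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"
_INDEX = {ch: i for i, ch in enumerate(CROCKFORD)}
# Map look-alike letters to digits: O -> 0, I -> 1, L -> 1.
_NORMALIZE = str.maketrans("OIL", "011")


def crockford_encode(num: int) -> str:
    """Encode a non-negative integer with Crockford's Base32 alphabet."""
    if num < 0:
        raise ValueError("num must be non-negative")
    if num == 0:
        return "0"
    digits = []
    while num > 0:
        num, rem = divmod(num, 32)
        digits.append(CROCKFORD[rem])
    return "".join(reversed(digits))


def crockford_decode(text: str) -> int:
    """Decode, tolerating lowercase, hyphens, and O/I/L typed for 0/1/1."""
    cleaned = text.upper().replace("-", "").translate(_NORMALIZE)
    num = 0
    for ch in cleaned:
        num = num * 32 + _INDEX[ch]  # raises KeyError on invalid symbols such as U
    return num


print(crockford_encode(1234))    # 16J
print(crockford_decode("l6-j"))  # 1234
```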

Choosing the Right Encoding

🎯 Decision Matrix

Use Base62 when:

  • ✓ Building URL shorteners
  • ✓ Need human-readable IDs
  • ✓ Want maximum density without special characters
  • ✓ Case-sensitivity is acceptable

Use Base32 when:

  • ✓ Case-insensitive systems
  • ✓ Voice/phone communication of codes
  • ✓ QR codes or OCR systems
  • ✓ Need to avoid ambiguous characters

Use Base64 when:

  • ✓ Encoding binary data (images, files)
  • ✓ Email attachments (MIME)
  • ✓ JWTs (which use the URL-safe variant without padding)
  • ✓ Data URIs in web development

Use Hexadecimal when:

  • ✓ Debugging binary data
  • ✓ Color codes (#FF5733)
  • ✓ Memory addresses
  • ✓ Cryptographic hashes

Real-World Applications

🔗 TinyURL / bit.ly

A common design encodes a numeric database ID as a Base62 short code

ID: 125432985
Base62: "8uiQF" (alphabet order 0-9, a-z, A-Z)
URL: https://bit.ly/8uiQF

🎥 YouTube Video IDs

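11-character IDs drawn from the URL-safe Base64 alphabet (A-Z, a-z, 0-9, -, _)

Video ID: dQw4w9WgXcQ
11 characters × 6 bits = 66 bits, enough to hold a 64-bit value
Up to 2^64 possible IDs if each ID encodes a 64-bit number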
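
One way to produce IDs of this shape is to encode 8 random bytes with URL-safe Base64 and drop the padding. This is a sketch of the general technique, not a description of how YouTube actually generates IDs; both function names are hypothetical.

```python
import base64
import secrets


def random_video_style_id() -> str:
    """Return an 11-character URL-safe ID built from 8 random bytes."""
    raw = secrets.token_bytes(8)  # 64 random bits
    # 8 bytes encode to 12 characters including one '=' pad, which is stripped.
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def id_to_int(video_id: str) -> int:
    """Recover the 64-bit integer behind an 11-character ID."""
    padded = video_id + "=" * (-len(video_id) % 4)  # restore stripped padding
    return int.from_bytes(base64.urlsafe_b64decode(padded), "big")
```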

🔑 API Keys

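Text encoding of random bytes, often behind a readable prefix

Random: 32 bytes (43 Base64 characters without padding)
Example format: sk_live_4eC39HqLyjWDarjtT1zdp7dc (prefix plus 24 alphanumeric characters)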
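
A minimal sketch of generating such a key with Python's `secrets` module, which returns URL-safe Base64 text for a given number of random bytes. The function name `new_api_key` and the default prefix are assumptions for illustration, not any provider's real key format.

```python
import secrets


def new_api_key(prefix: str = "sk_live_", nbytes: int = 32) -> str:
    """Return prefix + URL-safe Base64 text for `nbytes` random bytes."""
    # token_urlsafe uses the URL-safe alphabet and strips '=' padding,
    # so 32 bytes become 43 characters.
    return prefix + secrets.token_urlsafe(nbytes)


key = new_api_key()
```

In practice only a hash of the key would normally be stored server-side, with the prefix kept readable for identification.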

🎟️ Ticket/Coupon Codes

A Base32 variant such as Crockford's (digits plus letters, minus I, L, O, U) for case-insensitive, typo-resistant codes

Code: SAVE-2KQ3-XM9P-7TRY
No 0/O, 1/I confusion
Voice-friendly
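
A sketch of generating codes in this style, assuming Crockford's Base32 alphabet and a fixed campaign prefix. The function name, the prefix, and the grouping are illustrative choices.

```python
import secrets

# Crockford's Base32 alphabet: no I, L, O, or U.
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def coupon_code(prefix: str = "SAVE", groups: int = 3, group_size: int = 4) -> str:
    """Return a code like SAVE-XXXX-XXXX-XXXX with random Base32 groups."""
    parts = [prefix]
    for _ in range(groups):
        parts.append("".join(secrets.choice(ALPHABET) for _ in range(group_size)))
    return "-".join(parts)


code = coupon_code()
```

With three groups of four characters the random part carries 12 × 5 = 60 bits, so codes are hard to guess even though they stay easy to read aloud.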

Summary

Encoding schemes are fundamental building blocks for many system design problems. Choose based on:

  • Density requirements: How short does it need to be?
  • Character constraints: URL-safe? Case-sensitive?
  • Human factors: Will people type it? Say it over phone?
  • Use case: Binary data? Numeric IDs? Random tokens?

💡 Pro Tip: For distributed ID generation, Base62 offers the best balance of density and usability. For binary data transmission, Base64 is the standard. For human communication, consider Base32 variants.