Step 3: Core Architecture
Step 3 of 6: C - Core High-Level Design
Build the fundamental system architecture with file chunking and sync algorithms
🏗️ High-Level System Architecture
🔗 API Endpoint → Architecture Mapping
POST /api/files/upload
Client → API Gateway → File Service → S3 (presigned URL) + Metadata DB
GET /api/files/{file_id}
Client → API Gateway → Metadata Service → Redis Cache → CDN/S3
GET /api/sync/changes
Client → API Gateway → Sync Service → Metadata DB → Redis Cache
POST /api/sharing/create-link
Client → API Gateway → Metadata Service → Metadata DB → Notification Service
GET /api/folders/{id}/contents
Client → API Gateway → Metadata Service → Redis Cache → Metadata DB
GET /api/search
Client → API Gateway → Metadata Service → Elasticsearch → Redis Cache
WebSocket: Real-time sync
Client ↔ Notification Service → Kafka → Sync Service
Authentication
All requests → API Gateway → Auth Service → JWT validation
📱 Client Layer
Multiple client types with automatic chunking and sync capabilities
⚙️ Service Layer
Microservices handling specific responsibilities, communicating via message queue
💾 Data Layer
Hybrid storage: SQL for metadata, Redis for cache, S3 for files, Elasticsearch for search
🎯 Component Responsibilities
📁 File Service
APIs: POST /api/files/upload, GET /api/files/{id}
Responsibilities:
- Generate presigned S3 URLs
- Chunk deduplication logic
- File metadata validation
- Trigger async processing
🔄 Sync Service
APIs: GET /api/sync/changes, POST /api/sync/register-device
Responsibilities:
- Delta change detection
- Conflict resolution
- Device sync coordination
- Version management
📊 Metadata Service
APIs: GET /api/folders/{id}/contents, GET /api/search
Responsibilities:
- File/folder metadata CRUD
- Permission management
- Search queries
- Cache management
🔔 Notification Service
APIs: WebSocket connections, POST /api/sharing/create-link
Responsibilities:
- Real-time WebSocket connections
- Push notifications
- Event broadcasting
- Presence management
🔐 Auth Service
APIs: All authenticated endpoints
Responsibilities:
- JWT token validation
- OAuth 2.0 flow
- Permission checks
- Rate limiting
⚡ API Gateway
APIs: All external APIs
Responsibilities:
- Request routing
- Load balancing
- API versioning
- Request/response logging
📤 File Upload Flow
Client Chunks File
Desktop app detects new file, chunks it into 4MB pieces, calculates checksums
Initialize Upload
Client calls File Service to create file record, gets file_id and upload URLs
Deduplication Check
File Service checks if chunks already exist (by checksum), skips duplicate uploads
Get Presigned URLs
File Service generates S3 presigned URLs for each chunk, returns to client
Direct Upload to S3
Client uploads chunks directly to S3 using presigned URLs, bypassing our servers
Update Metadata
Metadata Service updates PostgreSQL with file info, chunk locations, versions
Trigger Sync
Sync Service publishes change event to Kafka, notifies other devices via WebSocket
Update Search Index
Async job updates Elasticsearch with file name and metadata for quick search
🔑 Deep Dive: Presigned URLs vs Proxy Upload
There are two approaches for uploading files to cloud storage. Let's compare them:
❌ Proxy Upload (Not Recommended)
Flow:
- Client uploads to our server
- Server receives entire chunk
- Server uploads to S3
- Server responds to client
Drawbacks:
- 2x bandwidth cost (in + out)
- Server becomes bottleneck
- Higher latency
- Need more servers for scale
- Server memory/disk usage
✅ Presigned URLs (Recommended)
Flow:
- Client requests upload permission
- Server generates presigned URL
- Client uploads directly to S3
- S3 notifies server via webhook
Benefits:
- No bandwidth through servers
- Parallel uploads to S3
- Lower latency
- Reduced server costs
- S3 handles retries/resumption
Sample Presigned URL Generation:
// Server-side: generate a presigned URL for the client
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

async function generatePresignedUrl(userId, fileId, chunkNumber) {
  const params = {
    Bucket: 'dropbox-chunks',
    Key: `${userId}/${fileId}/chunk-${chunkNumber}`,
    Expires: 3600, // URL valid for 1 hour
    ContentType: 'application/octet-stream'
    // Security note: a size limit (content-length-range, e.g. max 4MB)
    // can only be enforced with a presigned POST policy via
    // s3.createPresignedPost(); a presigned PUT URL cannot restrict size.
  };
  return s3.getSignedUrlPromise('putObject', params);
}

// Client-side: use the presigned URL to upload directly
async function uploadChunk(presignedUrl, chunkData) {
  const response = await fetch(presignedUrl, {
    method: 'PUT',
    body: chunkData,
    headers: {
      'Content-Type': 'application/octet-stream'
    }
  });
  if (!response.ok) throw new Error(`Chunk upload failed: ${response.status}`);
  // Our servers are never involved in the data transfer!
}
💰 Cost Comparison (1TB daily upload)
Proxy Upload:
- Bandwidth IN: $90/day
- Bandwidth OUT to S3: $90/day
- Extra servers: ~20 instances
- Total: ~$500/day
Presigned URLs:
- Bandwidth: $0 (direct to S3)
- S3 requests: $5/day
- Minimal server overhead
- Total: ~$10/day
🔄 Scenario: Network Failure During Upload (Chunk 3/5)
Let's trace through what happens when a user's network disconnects while uploading chunk 3 of 5:
📉 Initial Upload State (Before Failure)
Chunk 1: ✅ Complete
Chunk 2: ✅ Complete
Chunk 3: ❌ Failed at 60%
Chunk 4: ⏳ Pending
Chunk 5: ⏳ Pending
• Client state: Has upload_session_id and file_id
• S3 state: Chunks 1 & 2 fully uploaded, Chunk 3 partially uploaded
• Database state: Metadata record exists with status "uploading"
🔍 Recovery Detection (When Connection Restored)
1. Client Startup Check:
• Desktop app detects incomplete upload in local SQLite DB
• Finds upload_session_id and file_id in local storage
• Network connectivity restored
2. Server State Verification:
• Client calls: GET /api/uploads/session_xyz789/status
• Server responds with current upload progress
🚀 Resume Upload Process
Server Response:
{
"upload_session_id": "session_xyz789",
"file_id": "file_a1b2c3d4e5",
"status": "partial_upload",
"completed_chunks": [1, 2],
"failed_chunks": [3],
"pending_chunks": [4, 5],
"resume_presigned_urls": [
{
"chunk_number": 3,
"upload_url": "<s3 presigned url>",
"resume_from_byte": 0 // S3 multipart failed, restart chunk
},
{
"chunk_number": 4,
"upload_url": "<s3 presigned url>"
},
{
"chunk_number": 5,
"upload_url": "<s3 presigned url>"
}
]
}
⚡ Smart Resume Strategy
Option A: Simple Restart
- Re-upload entire chunk 3 from beginning
- Simpler implementation
- Works with standard S3 multipart
- Wastes some bandwidth
Option B: Byte-level Resume
- Resume from exact byte offset
- HTTP Range headers: Range: bytes=2621440-
- More complex but bandwidth efficient
- Requires chunked transfer encoding
🔄 Final Upload Flow
- Client resumes: Uploads chunk 3, 4, 5 in parallel
- Progress tracking: Updates local DB and shows progress to user
- Completion: Calls POST /api/uploads/session_xyz789/complete
- Finalization: Server updates file status to "available"
- Sync trigger: Notifies other devices via WebSocket
- User experience: File appears as "synced" across all devices
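The first resume step — deciding which chunks still need uploading from the server's status response — can be sketched as below. The response shape mirrors the sample JSON above; the helper name `planResume` is an assumption for illustration.

```javascript
// Sketch: turn the resume status response into an upload work list.
// Chunks already completed are skipped; failed and pending chunks are
// (re-)uploaded using the fresh presigned URLs the server returned.
function planResume(status) {
  const done = new Set(status.completed_chunks);
  return status.resume_presigned_urls
    .filter((u) => !done.has(u.chunk_number))
    .map((u) => ({ chunk: u.chunk_number, url: u.upload_url }));
}

// Shape taken from the server response above
const serverResponse = {
  completed_chunks: [1, 2],
  failed_chunks: [3],
  pending_chunks: [4, 5],
  resume_presigned_urls: [
    { chunk_number: 3, upload_url: '<s3 presigned url>' },
    { chunk_number: 4, upload_url: '<s3 presigned url>' },
    { chunk_number: 5, upload_url: '<s3 presigned url>' },
  ],
};
const todo = planResume(serverResponse);
```

The resulting work list (chunks 3, 4, 5) can then be uploaded in parallel, as in the flow above.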
✅ Production Implementation Tips
- Exponential backoff: Wait 1s, 2s, 4s, 8s before retry attempts
- Heartbeat mechanism: Periodic ping to detect connection loss early
- Background uploads: Continue uploading even when app is minimized
- Bandwidth adaptation: Reduce chunk size on slow/unstable connections
- User feedback: Show "Resuming upload..." with clear progress indication
- Timeout handling: Upload sessions expire after 24 hours for security
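The exponential backoff tip can be sketched as follows — a minimal retry wrapper, with the 1s base and the 60s cap being illustrative choices (production clients typically also add random jitter to avoid thundering herds).

```javascript
// Sketch: exponential backoff (1s, 2s, 4s, 8s, ...) with an upper cap.
function backoffDelay(attempt, baseMs = 1000, capMs = 60000) {
  return Math.min(baseMs * 2 ** attempt, capMs);
}

// Retry an async operation (e.g. a chunk upload), backing off between tries
async function withRetry(fn, maxAttempts = 4) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt === maxAttempts - 1) throw err; // give up after last attempt
      await new Promise((resolve) => setTimeout(resolve, backoffDelay(attempt)));
    }
  }
}
```

A chunk upload would then be invoked as `withRetry(() => uploadChunk(url, data))`, so transient network failures are absorbed without user intervention.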
💾 Client-Side Persistence Requirement
🔑 Key Insight: The upload recovery scenario above is only possible because client applications maintain local persistent storage to track upload sessions and chunk states.
📱 Local Storage (SQLite)
Client apps maintain local database with:
- Upload sessions: session_id, file_id, progress
- Chunk status: completed, failed, pending chunks
- File metadata: checksums, modification times
- Sync cursors: last sync timestamp, change tokens
🔄 Resume Capabilities Enabled
Local persistence enables:
- Upload recovery: Resume after network failures
- App restart resilience: Continue from where left off
- Offline editing: Track changes while disconnected
- Progress tracking: Show accurate upload/download status
Architecture Impact: Every production file sync client (Dropbox, Google Drive, OneDrive) implements local persistence. Without it, users would lose all progress on app crashes or network interruptions, making the service unreliable for large files.
🔄 Delta Sync Algorithm
📊 Change Detection
1. Client maintains local metadata DB (SQLite)
2. File watcher detects changes in real-time
3. Compare local vs server checksums
4. Identify modified chunks only
5. Use Merkle trees for efficient comparison
🎯 Smart Sync
1. Sync only changed chunks (not entire file)
2. Binary diff for small text changes
3. Prioritize small files and recent changes
4. Background sync for large files
5. Pause/resume capability
Example: 1GB Video File Edit
• User edits last 10 seconds of video
• Only 3 chunks (12MB) are modified
• Delta sync uploads only 12MB instead of 1GB
• Other devices download only changed chunks
• 98.8% bandwidth saved!
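The video example can be made concrete with a sketch of the changed-chunk detection (steps 3–4 of change detection): compare the local and server checksum lists and return only the chunk numbers that differ. The checksum values are placeholders.

```javascript
// Sketch: identify modified chunks by comparing checksum lists.
// Returns 1-based chunk numbers that need to be re-uploaded.
function changedChunks(localChecksums, serverChecksums) {
  const changed = [];
  const n = Math.max(localChecksums.length, serverChecksums.length);
  for (let i = 0; i < n; i++) {
    if (localChecksums[i] !== serverChecksums[i]) changed.push(i + 1);
  }
  return changed;
}

// A 1GB file is 256 chunks of 4MB; editing the last 10 seconds of the
// video touches only the final 3 chunks
const server = Array.from({ length: 256 }, (_, i) => `hash-${i}`);
const local = server.slice();
local[253] = 'new-a';
local[254] = 'new-b';
local[255] = 'new-c';
const toUpload = changedChunks(local, server);
```

Only 3 of 256 chunks (12MB of 1GB) get uploaded — the 98.8% bandwidth saving claimed above.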
🔍 Remote Change Detection (Server → Client)
🤔 The Challenge: Local file changes are easy to detect with file watchers, but how does a client know when files change on the server from other devices?
✅ Local → Remote (Easy)
File System Watcher:
- OS notifies app of file changes
- Immediate detection (milliseconds)
- Built-in OS APIs (inotify, ReadDirectoryChangesW)
- No network polling required
🔄 Remote → Local (Complex)
Network-based Detection:
- No direct server → client file notifications
- Must actively check or listen for changes
- Network latency and reliability issues
- Multiple strategies needed
🔄 Remote Change Detection Strategies
1. 🔴 Real-time: WebSocket Notifications (Primary)
How it works:
- Client opens persistent WebSocket connection to Notification Service
- When Device B uploads a file, server publishes change event to Kafka
- Notification Service consumes event and pushes to Device A via WebSocket
- Device A immediately knows to download the changed file
WebSocket Message Example:
{
"type": "file_changed",
"file_id": "file_abc123",
"change_type": "modified",
"changed_by_device": "device_456",
"timestamp": 1234567890,
"new_version": 5
}
Pros: Instant notifications, efficient bandwidth
Cons: Connection can drop, doesn't work offline
2. 📡 Server-Sent Events (SSE) (Alternative)
How it works:
- Client opens persistent HTTP connection: GET /api/sync/events
- Server keeps connection open and sends events as they occur
- When Device B uploads file, server sends SSE to all connected devices
- Client receives event and processes file change
SSE Stream Example:
GET /api/sync/events
Accept: text/event-stream
data: {"type":"file_changed","file_id":"abc123","version":5}
data: {"type":"file_deleted","file_id":"def456"}
data: {"type":"folder_shared","folder_id":"ghi789","shared_with":"user123"}
Pros: Simpler than WebSocket, automatic reconnection, works through firewalls
Cons: Server → Client only (no client messages), less efficient than WebSocket
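In browsers, the built-in EventSource API parses this stream automatically; a minimal sketch of what that parsing does to the `data:` lines above makes the wire format concrete (simplified: it ignores event ids, retry fields, and multi-line data).

```javascript
// Sketch: extract JSON events from a text/event-stream payload.
function parseSse(stream) {
  return stream
    .split('\n')
    .filter((line) => line.startsWith('data: '))
    .map((line) => JSON.parse(line.slice('data: '.length)));
}

const events = parseSse(
  'data: {"type":"file_changed","file_id":"abc123","version":5}\n' +
  'data: {"type":"file_deleted","file_id":"def456"}\n'
);
```

Each parsed event is then dispatched the same way as a WebSocket message — e.g. a `file_changed` event triggers a download of the new version.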
3. 🔄 Polling: Regular Sync Checks (Backup)
How it works:
- Client periodically calls GET /api/sync/changes?since=last_sync_timestamp
- Server returns list of files changed since last check
- Client downloads metadata for changed files
- Client updates local SQLite with new information
Polling Response Example:
{
"changes": [
{
"file_id": "file_abc123",
"change_type": "modified",
"new_version": 5,
"timestamp": 1234567890
},
{
"file_id": "file_def456",
"change_type": "deleted",
"timestamp": 1234567891
}
],
"next_cursor": "cursor_xyz789"
}
Frequency: Every 30 seconds to 5 minutes
Pros: Works when WebSocket fails, catches missed notifications
Cons: Higher latency, more server load
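Applying a polling response and carrying the cursor forward can be sketched as below. The response shape mirrors the example above; the in-memory `localDb` object stands in for the client's local SQLite database.

```javascript
// Sketch: apply a GET /api/sync/changes response to local state and
// return the cursor to persist for the next poll.
function applyChanges(localDb, response) {
  for (const change of response.changes) {
    if (change.change_type === 'deleted') {
      delete localDb[change.file_id];
    } else {
      localDb[change.file_id] = { version: change.new_version };
    }
  }
  return response.next_cursor; // send as ?since=... on the next poll
}

const localDb = { file_abc123: { version: 4 }, file_def456: { version: 1 } };
const response = {
  changes: [
    { file_id: 'file_abc123', change_type: 'modified', new_version: 5 },
    { file_id: 'file_def456', change_type: 'deleted' },
  ],
  next_cursor: 'cursor_xyz789',
};
const cursor = applyChanges(localDb, response);
```

Persisting the cursor is what makes polling a reliable backup: after a crash or outage, the next poll picks up exactly where the last successful one left off.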
4. 📱 Push Notifications (Mobile)
How it works:
- Mobile app registers for push notifications (APNs/FCM)
- Server sends push notification when files change
- OS wakes up app even if backgrounded/closed
- App performs sync check and downloads changes
Pros: Works when app is closed, low battery impact
Cons: Mobile-only, delivery not guaranteed
5. 🚀 Startup Sync (App Launch)
How it works:
- When app starts, compare local SQLite vs server state
- Call GET /api/sync/changes for all missed changes
- Download any files modified while app was closed
- Update local file system to match server
Pros: Catches everything missed while offline
Cons: Only works on app startup
🏗️ Layered Detection Strategy
Production systems use all strategies together for reliability:
- WebSocket or SSE for instant notifications (when connected)
- Polling every 2-5 minutes as backup
- Push notifications for mobile apps
- Startup sync to catch anything missed
This ensures users get changes quickly when online, and everything syncs correctly even after network outages or app crashes.
🎯 Key Architecture Decisions
✅ Microservices
Independent scaling, fault isolation, and team ownership of services
✅ Direct S3 Upload
Clients upload directly to S3 with presigned URLs, bypassing servers
✅ Event-Driven Sync
Kafka for async processing, WebSocket for real-time notifications
⚔️ Deep Dive: Conflict Resolution Strategies
When the same file is modified on multiple devices while offline, we need conflict resolution:
📝 Text Files (Docs, Code)
- Use Operational Transforms (OT)
- Three-way merge (like Git)
- Show conflict markers if auto-merge fails
- Keep both versions as separate files
🎬 Binary Files (Images, Videos)
- Last-write-wins with timestamp
- Create conflict copy: "file (conflicted copy).mp4"
- Let user manually choose version
- Keep version history for rollback
Vector Clock Example
Device A: Edit at 10:00 → Vector: [A:1, B:0, C:0]
Device B: Edit at 10:05 → Vector: [A:0, B:1, C:0]
Conflict detected! Neither vector dominates.
Resolution: Create two versions for the user to choose.
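The dominance check in the example can be sketched as below: one clock dominates another if it is greater than or equal in every component and strictly greater in at least one; if each clock is ahead somewhere, the edits are concurrent and a real conflict exists.

```javascript
// Sketch: vector-clock comparison for conflict detection.
function compareClocks(a, b) {
  const keys = new Set([...Object.keys(a), ...Object.keys(b)]);
  let aAhead = false;
  let bAhead = false;
  for (const k of keys) {
    const va = a[k] || 0;
    const vb = b[k] || 0;
    if (va > vb) aAhead = true;
    if (vb > va) bAhead = true;
  }
  if (aAhead && bAhead) return 'concurrent'; // conflict: keep both versions
  if (aAhead) return 'a_dominates';          // A's edit already saw B's state
  if (bAhead) return 'b_dominates';
  return 'equal';
}

// The example above: neither vector dominates → concurrent edits
const deviceA = { A: 1, B: 0, C: 0 };
const deviceB = { A: 0, B: 1, C: 0 };
```

`compareClocks(deviceA, deviceB)` reports the edits as concurrent, so the sync service creates two versions for the user to choose between.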
🔀 Deep Dive: Three-Way Merge Algorithm (Like Git)
Three-way merge is the gold standard for automatically resolving text file conflicts. It uses three versions:
Base Version
Last common ancestor
Version A
Changes from Device A
Version B
Changes from Device B
Example Scenario: README.md file
Base (v1):
# Project
Description here
## Setup
npm install

Device A (v2):
# Project
Better description
## Setup
npm install
## Testing
npm test

Device B (v3):
# MyProject
Description here
## Setup
npm install
yarn install

✅ Automatic Merge Result:
# MyProject          ← Device B changed title (no conflict)
Better description   ← Device A changed description (no conflict)
## Setup
npm install
yarn install         ← Device B added yarn (no conflict)
## Testing           ← Device A added testing section (no conflict)
npm test
✅ Success! All changes merged automatically because they modified different parts
❌ Conflict Example:
If both devices modified the same line:
Device A: "Better description"
Device B: "Amazing description"
Conflict markers added:
# MyProject
<<<<<<< Device A
Better description
=======
Amazing description
>>>>>>> Device B
## Setup
User must manually choose which description to keep
🧠 Why Three-Way Merge Works
- Distinguishes intent: Knows what changed vs what stayed same
- Non-conflicting changes merge: Changes to different sections combine
- Detects real conflicts: Only when same content modified differently
- Preserves both changes: No data loss with conflict markers
- Used everywhere: Git, SVN, Google Docs, Office 365
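The decision rule behind three-way merge can be sketched with a deliberately simplified line-based version. Real diff3 first aligns lines with an edit-distance diff so insertions and deletions work; this sketch assumes base and both edits have the same line count, which is enough to show the core rule: take whichever side changed, and flag a conflict only when both sides changed the same line differently.

```javascript
// Sketch: simplified line-based three-way merge (assumes equal line counts).
function merge3(base, a, b) {
  const out = [];
  for (let i = 0; i < base.length; i++) {
    const o = base[i];
    const x = a[i];
    const y = b[i];
    if (x === y) out.push(x);       // both sides agree (or neither changed)
    else if (x === o) out.push(y);  // only B changed this line → take B
    else if (y === o) out.push(x);  // only A changed this line → take A
    else out.push(`<<<<<<< A\n${x}\n=======\n${y}\n>>>>>>> B`); // real conflict
  }
  return out;
}

// The README example: A changes the description, B changes the title —
// different lines, so both changes merge automatically
const base = ['# Project', 'Description here'];
const deviceA = ['# Project', 'Better description'];
const deviceB = ['# MyProject', 'Description here'];
const merged = merge3(base, deviceA, deviceB);

// The conflict example: both sides change the same line differently
const conflicted = merge3(['Description here'], ['Better description'], ['Amazing description']);
```

`merged` comes out as `['# MyProject', 'Better description']` with no conflicts, while `conflicted[0]` carries Git-style conflict markers for the user to resolve — the same behavior described above.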