Step 3: Core Architecture

Step 3 of 6: C - Core High-Level Design

Build the fundamental system architecture with file chunking and sync algorithms

🏗️ High-Level System Architecture

  • Client Applications: Desktop App, Mobile App, Web Browser, Third-party Apps
  • CDN (CloudFront): serves static content & chunks (pulls from S3)
  • API Gateway (Load Balancer): single entry point for all clients
  • Microservices:
    • File Service: upload/download, chunking, deduplication
    • Sync Service: change detection, conflict resolution, delta sync
    • Metadata Service: file metadata, permissions, versions
    • Notification Service: WebSocket, push notifications, event streaming
    • Auth Service: OAuth 2.0, JWT tokens, permissions
  • Message Queue (Kafka / RabbitMQ): async communication between services
  • Data Layer:
    • Metadata DB (PostgreSQL): files, users, permissions
    • Cache Layer (Redis Cluster): hot metadata, session data
    • Blob Storage (S3 / GCS): actual file chunks, 3x replication
    • Search Index (Elasticsearch): file names, content search
    • Analytics DB (ClickHouse): usage metrics, access logs

🔗 API Endpoint → Architecture Mapping

POST /api/files/upload

Client → API Gateway → File Service → S3 (presigned URL) + Metadata DB

GET /api/files/{file_id}

Client → API Gateway → Metadata Service → Redis Cache → CDN/S3

GET /api/sync/changes

Client → API Gateway → Sync Service → Metadata DB → Redis Cache

POST /api/sharing/create-link

Client → API Gateway → Metadata Service → Metadata DB → Notification Service

GET /api/folders/{id}/contents

Client → API Gateway → Metadata Service → Redis Cache → Metadata DB

GET /api/search

Client → API Gateway → Metadata Service → Elasticsearch → Redis Cache

WebSocket: Real-time sync

Client ↔ Notification Service → Kafka → Sync Service

Authentication

All requests → API Gateway → Auth Service → JWT validation

📱 Client Layer

Multiple client types with automatic chunking and sync capabilities

⚙️ Service Layer

Microservices handling specific responsibilities, communicating via message queue

💾 Data Layer

Hybrid storage: SQL for metadata, Redis for cache, S3 for files, Elasticsearch for search

🎯 Component Responsibilities

📁 File Service

APIs: POST /api/files/upload, GET /api/files/{id}

Responsibilities:

  • Generate presigned S3 URLs
  • Chunk deduplication logic
  • File metadata validation
  • Trigger async processing

🔄 Sync Service

APIs: GET /api/sync/changes, POST /api/sync/register-device

Responsibilities:

  • Delta change detection
  • Conflict resolution
  • Device sync coordination
  • Version management

📊 Metadata Service

APIs: GET /api/folders/{id}/contents, GET /api/search

Responsibilities:

  • File/folder metadata CRUD
  • Permission management
  • Search queries
  • Cache management

🔔 Notification Service

APIs: WebSocket connections, POST /api/sharing/create-link

Responsibilities:

  • Real-time WebSocket connections
  • Push notifications
  • Event broadcasting
  • Presence management

🔐 Auth Service

APIs: All authenticated endpoints

Responsibilities:

  • JWT token validation
  • OAuth 2.0 flow
  • Permission checks
  • Rate limiting

⚡ API Gateway

APIs: All external APIs

Responsibilities:

  • Request routing
  • Load balancing
  • API versioning
  • Request/response logging

📤 File Upload Flow

  1. Client Chunks File: The desktop app detects the new file, splits it into 4MB pieces, and calculates a checksum for each chunk (see the sketch after this list).
  2. Initialize Upload: The client calls the File Service to create a file record and receives a file_id and upload URLs.
  3. Deduplication Check: The File Service checks whether chunks already exist (by checksum) and skips duplicate uploads.
  4. Get Presigned URLs: The File Service generates S3 presigned URLs for each chunk and returns them to the client.
  5. Direct Upload to S3: The client uploads chunks directly to S3 using the presigned URLs, bypassing our servers.
  6. Update Metadata: The Metadata Service updates PostgreSQL with file info, chunk locations, and versions.
  7. Trigger Sync: The Sync Service publishes a change event to Kafka and notifies other devices via WebSocket.
  8. Update Search Index: An async job updates Elasticsearch with the file name and metadata for fast search.
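
Step 1 is the basis for everything downstream (deduplication, delta sync). Here is a minimal Node.js sketch of client-side chunking; the 4MB chunk size matches the flow above, while SHA-256 and the field names are illustrative choices:

// Client-side chunking sketch: split a file into 4MB chunks and
// compute a SHA-256 checksum per chunk (hash choice is illustrative).
const fs = require('fs');
const crypto = require('crypto');

const CHUNK_SIZE = 4 * 1024 * 1024; // 4MB

async function chunkFile(filePath) {
  const chunks = [];
  const fd = await fs.promises.open(filePath, 'r');
  const { size } = await fd.stat();

  for (let offset = 0, n = 1; offset < size; offset += CHUNK_SIZE, n++) {
    const length = Math.min(CHUNK_SIZE, size - offset);
    const buffer = Buffer.alloc(length);
    await fd.read(buffer, 0, length, offset);

    chunks.push({
      chunkNumber: n,
      offset,
      length,
      checksum: crypto.createHash('sha256').update(buffer).digest('hex')
    });
  }

  await fd.close();
  return chunks; // sent to the File Service in step 2 for the dedup check
}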

🔑 Deep Dive: Presigned URLs vs Proxy Upload

There are two approaches for uploading files to cloud storage. Let's compare them:

❌ Proxy Upload (Not Recommended)

Flow:

  1. Client uploads to our server
  2. Server receives entire chunk
  3. Server uploads to S3
  4. Server responds to client

Drawbacks:

  • 2x bandwidth cost (in + out)
  • Server becomes bottleneck
  • Higher latency
  • Need more servers for scale
  • Server memory/disk usage

✅ Presigned URLs (Recommended)

Flow:

  1. Client requests upload permission
  2. Server generates presigned URL
  3. Client uploads directly to S3
  4. S3 notifies server via webhook

Benefits:

  • No bandwidth through servers
  • Parallel uploads to S3
  • Lower latency
  • Reduced server costs
  • S3 handles retries/resumption

Sample Presigned URL Generation:

// Server-side: Generate presigned URL for client
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

async function generatePresignedUrl(userId, fileId, chunkNumber) {
  const params = {
    Bucket: 'dropbox-chunks',
    Key: `${userId}/${fileId}/chunk-${chunkNumber}`,
    Expires: 3600, // URL valid for 1 hour
    ContentType: 'application/octet-stream'
    // Note: a presigned PUT URL cannot enforce an upload size limit.
    // To cap chunk size at 4MB, use s3.createPresignedPost with a
    // ['content-length-range', 0, 4194304] condition, or validate the
    // object size after S3's event notification arrives.
  };

  return s3.getSignedUrlPromise('putObject', params);
}

// Client-side: Use presigned URL to upload directly
async function uploadChunk(presignedUrl, chunkData) {
  const response = await fetch(presignedUrl, {
    method: 'PUT',
    body: chunkData,
    headers: {
      'Content-Type': 'application/octet-stream'
    }
  });
  if (!response.ok) {
    throw new Error(`Chunk upload failed: ${response.status}`);
  }
  // Our servers never touch the file bytes!
}
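
Step 4 of this flow ("S3 notifies server via webhook") deserves a note: S3 delivers object-created events through SNS, SQS, or Lambda rather than a plain HTTP webhook. A Lambda-style handler sketch, where markChunkUploaded is a hypothetical helper that marks the chunk complete in the metadata DB:

// Lambda-style handler for S3 "object created" event notifications.
// markChunkUploaded is a hypothetical helper that updates chunk status
// in the metadata DB.
exports.handler = async (event) => {
  for (const record of event.Records) {
    // Key layout from generatePresignedUrl: userId/fileId/chunk-N
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));
    const [userId, fileId, chunkPart] = key.split('/');
    const chunkNumber = Number(chunkPart.replace('chunk-', ''));

    await markChunkUploaded(userId, fileId, chunkNumber, {
      size: record.s3.object.size,
      etag: record.s3.object.eTag
    });
  }
};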

💰 Cost Comparison (1TB daily upload)

Proxy Upload:

  • Bandwidth IN: $90/day
  • Bandwidth OUT to S3: $90/day
  • Extra servers: ~20 instances
  • Total: ~$500/day

Presigned URLs:

  • Bandwidth: $0 (direct to S3)
  • S3 requests: $5/day
  • Minimal server overhead
  • Total: ~$10/day

🔄 Scenario: Network Failure During Upload (Chunk 3/5)

Let's trace through what happens when a user's network disconnects while uploading chunk 3 of 5:

📉 Initial Upload State (Before Failure)

  • Chunk 1: ✅ Complete
  • Chunk 2: ✅ Complete
  • Chunk 3: ❌ Failed at 60%
  • Chunk 4: ⏳ Pending
  • Chunk 5: ⏳ Pending

Client state: Has upload_session_id and file_id

S3 state: Chunks 1 & 2 fully uploaded, Chunk 3 partially uploaded

Database state: Metadata record exists with status "uploading"

🔍 Recovery Detection (When Connection Restored)

1. Client Startup Check:

• Desktop app detects incomplete upload in local SQLite DB

• Finds upload_session_id and file_id in local storage

• Network connectivity restored

2. Server State Verification:

• Client calls: GET /api/uploads/session_xyz789/status

• Server responds with current upload progress

🚀 Resume Upload Process

Server Response:

{
  "upload_session_id": "session_xyz789",
  "file_id": "file_a1b2c3d4e5",
  "status": "partial_upload",
  "completed_chunks": [1, 2],
  "failed_chunks": [3],
  "pending_chunks": [4, 5],
  "resume_presigned_urls": [
    {
      "chunk_number": 3,
      "upload_url": "<s3 presigned url>",
      "resume_from_byte": 0  // S3 multipart failed, restart chunk
    },
    {
      "chunk_number": 4,
      "upload_url": "<s3 presigned url>"
    },
    {
      "chunk_number": 5,
      "upload_url": "<s3 presigned url>"
    }
  ]
}

⚡ Smart Resume Strategy

Option A: Simple Restart

  • Re-upload entire chunk 3 from beginning
  • Simpler implementation
  • Works with standard S3 multipart
  • Wastes some bandwidth

Option B: Byte-level Resume

  • Resume from the exact byte offset (e.g., Range: bytes=2621440-)
  • More complex but bandwidth efficient
  • Requires a server-side proxy or storage that supports ranged/append writes; a plain S3 presigned PUT cannot resume mid-object

🔄 Final Upload Flow

  1. Client resumes: Uploads chunks 3, 4, and 5 in parallel (sketched after this list)
  2. Progress tracking: Updates local DB and shows progress to user
  3. Completion: Calls POST /api/uploads/session_xyz789/complete
  4. Finalization: Server updates file status to "available"
  5. Sync trigger: Notifies other devices via WebSocket
  6. User experience: File appears as "synced" across all devices
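
A minimal sketch of steps 1 and 3, reusing uploadChunk from the presigned-URL example; readChunk is a hypothetical helper that returns the bytes for a given chunk number:

// Resume sketch: re-upload failed and pending chunks in parallel using
// the status response above, then ask the server to finalize.
async function resumeUpload(filePath, status) {
  await Promise.all(
    status.resume_presigned_urls.map(async ({ chunk_number, upload_url }) => {
      const data = await readChunk(filePath, chunk_number); // hypothetical helper
      await uploadChunk(upload_url, data);
    })
  );

  await fetch(`/api/uploads/${status.upload_session_id}/complete`, {
    method: 'POST'
  });
}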

✅ Production Implementation Tips

  • Exponential backoff: Wait 1s, 2s, 4s, 8s before retry attempts (see the sketch after this list)
  • Heartbeat mechanism: Periodic ping to detect connection loss early
  • Background uploads: Continue uploading even when app is minimized
  • Bandwidth adaptation: Reduce chunk size on slow/unstable connections
  • User feedback: Show "Resuming upload..." with clear progress indication
  • Timeout handling: Upload sessions expire after 24 hours for security
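
The exponential backoff tip is simple to implement; a sketch (delays and attempt count are illustrative):

// Retry helper with exponential backoff: waits 1s, 2s, 4s, 8s between
// attempts, then gives up and rethrows the last error.
async function withRetry(fn, maxAttempts = 5) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt === maxAttempts - 1) throw err;
      const delayMs = 1000 * 2 ** attempt; // 1s, 2s, 4s, 8s
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// Usage: await withRetry(() => uploadChunk(url, data));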

💾 Client-Side Persistence Requirement

🔑 Key Insight: The upload recovery scenario above is only possible because client applications maintain local persistent storage to track upload sessions and chunk states.

📱 Local Storage (SQLite)

Client apps maintain a local database with the following (a schema sketch follows this list):

  • Upload sessions: session_id, file_id, progress
  • Chunk status: completed, failed, pending chunks
  • File metadata: checksums, modification times
  • Sync cursors: last sync timestamp, change tokens
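
A schema sketch for this local store, using the better-sqlite3 package; table and column names are illustrative, not any vendor's actual schema:

// Local client store sketch (better-sqlite3). Tracks upload sessions,
// per-chunk status, and sync cursors so uploads survive crashes.
const Database = require('better-sqlite3');
const db = new Database('sync-client.db');

db.exec(`
  CREATE TABLE IF NOT EXISTS upload_sessions (
    session_id TEXT PRIMARY KEY,
    file_id    TEXT NOT NULL,
    file_path  TEXT NOT NULL,
    created_at INTEGER NOT NULL
  );

  CREATE TABLE IF NOT EXISTS chunks (
    session_id   TEXT NOT NULL REFERENCES upload_sessions(session_id),
    chunk_number INTEGER NOT NULL,
    checksum     TEXT NOT NULL,
    status       TEXT NOT NULL DEFAULT 'pending', -- pending | completed | failed
    PRIMARY KEY (session_id, chunk_number)
  );

  CREATE TABLE IF NOT EXISTS sync_state (
    key   TEXT PRIMARY KEY, -- e.g. 'last_sync_cursor'
    value TEXT NOT NULL
  );
`);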

🔄 Resume Capabilities Enabled

Local persistence enables:

  • Upload recovery: Resume after network failures
  • App restart resilience: Continue from where left off
  • Offline editing: Track changes while disconnected
  • Progress tracking: Show accurate upload/download status

Architecture Impact: Every production file sync client (Dropbox, Google Drive, OneDrive) implements local persistence. Without it, users would lose all progress on app crashes or network interruptions, making the service unreliable for large files.

🔄 Delta Sync Algorithm

📊 Change Detection

1. Client maintains local metadata DB (SQLite)

2. File watcher detects changes in real-time

3. Compare local vs server checksums

4. Identify modified chunks only

5. Use Merkle trees for efficient comparison (see the sketch below)
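
A toy Merkle-root sketch over per-chunk checksums: equal roots mean the file is unchanged, and walking down mismatched subtrees locates changed chunks in O(log n) comparisons. Assumes hex checksums like those from the chunking sketch earlier:

// Build a Merkle root from chunk checksums by hashing pairs level by
// level; duplicate the last hash when a level has an odd count.
const crypto = require('crypto');

const sha256 = (s) => crypto.createHash('sha256').update(s).digest('hex');

function merkleRoot(chunkChecksums) {
  if (chunkChecksums.length === 0) return sha256('');
  let level = chunkChecksums;
  while (level.length > 1) {
    const next = [];
    for (let i = 0; i < level.length; i += 2) {
      const left = level[i];
      const right = level[i + 1] ?? left; // odd count: pair with itself
      next.push(sha256(left + right));
    }
    level = next;
  }
  return level[0];
}

// Client and server compare roots; equal roots mean nothing to sync.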

🎯 Smart Sync

1. Sync only changed chunks (not entire file)

2. Binary diff for small text changes

3. Prioritize small files and recent changes

4. Background sync for large files

5. Pause/resume capability

Example: 1GB Video File Edit

• User edits last 10 seconds of video

• Only 3 chunks (12MB) are modified

• Delta sync uploads only 12MB instead of 1GB

• Other devices download only changed chunks

98.8% bandwidth saved!

🔍 Remote Change Detection (Server → Client)

🤔 The Challenge: Local file changes are easy to detect with file watchers, but how does a client know when files change on the server from other devices?

✅ Local → Remote (Easy)

File System Watcher:

  • OS notifies app of file changes
  • Immediate detection (milliseconds)
  • Built-in OS APIs (inotify, ReadDirectoryChangesW)
  • No network polling required

🔄 Remote → Local (Complex)

Network-based Detection:

  • No direct server → client file notifications
  • Must actively check or listen for changes
  • Network latency and reliability issues
  • Multiple strategies needed

🔄 Remote Change Detection Strategies

1. 🔴 Real-time: WebSocket Notifications (Primary)

How it works:

  1. Client opens persistent WebSocket connection to Notification Service
  2. When Device B uploads a file, server publishes change event to Kafka
  3. Notification Service consumes event and pushes to Device A via WebSocket
  4. Device A immediately knows to download the changed file

WebSocket Message Example:

{
  "type": "file_changed",
  "file_id": "file_abc123",
  "change_type": "modified",
  "changed_by_device": "device_456",
  "timestamp": 1234567890,
  "new_version": 5
}

Pros: Instant notifications, efficient bandwidth

Cons: Connection can drop, doesn't work offline
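
A browser-side sketch of consuming the message above; the endpoint URL is illustrative, and downloadChangedFile / scheduleReconnect are hypothetical helpers:

// WebSocket client sketch: react to file_changed events and fall back
// to polling (strategy 3) if the connection drops.
const socket = new WebSocket('wss://notifications.example.com/sync');

socket.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  if (msg.type === 'file_changed') {
    downloadChangedFile(msg.file_id, msg.new_version); // hypothetical
  }
};

socket.onclose = () => scheduleReconnect(); // hypothetical backoff reconnect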

2. 📡 Server-Sent Events (SSE) (Alternative)

How it works:

  1. Client opens persistent HTTP connection: GET /api/sync/events
  2. Server keeps connection open and sends events as they occur
  3. When Device B uploads file, server sends SSE to all connected devices
  4. Client receives event and processes file change

SSE Stream Example:

GET /api/sync/events
Accept: text/event-stream

data: {"type":"file_changed","file_id":"abc123","version":5}

data: {"type":"file_deleted","file_id":"def456"}

data: {"type":"folder_shared","folder_id":"ghi789","shared_with":"user123"}

Pros: Simpler than WebSocket, automatic reconnection, works through firewalls

Cons: Server → Client only (no client messages), less efficient than WebSocket
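
A client sketch using the browser's built-in EventSource, which reconnects automatically; handleChange is a hypothetical dispatcher over the event types shown above:

// SSE client sketch: one long-lived HTTP connection, server pushes
// events as they happen.
const events = new EventSource('/api/sync/events');

events.onmessage = (event) => {
  handleChange(JSON.parse(event.data)); // hypothetical dispatcher
};

events.onerror = () => {
  // EventSource retries on its own; just log for diagnostics
  console.warn('SSE connection lost, retrying...');
};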

3. 🔄 Polling: Regular Sync Checks (Backup)

How it works:

  1. Client periodically calls GET /api/sync/changes?since=last_sync_timestamp
  2. Server returns list of files changed since last check
  3. Client downloads metadata for changed files
  4. Client updates local SQLite with new information

Polling Response Example:

{
  "changes": [
    {
      "file_id": "file_abc123",
      "change_type": "modified",
      "new_version": 5,
      "timestamp": 1234567890
    },
    {
      "file_id": "file_def456",
      "change_type": "deleted",
      "timestamp": 1234567891
    }
  ],
  "next_cursor": "cursor_xyz789"
}

Frequency: Every 30 seconds to 5 minutes

Pros: Works when WebSocket fails, catches missed notifications

Cons: Higher latency, more server load
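
A polling-loop sketch; the cursor parameter shape is illustrative (the endpoint above shows a since= variant), and loadCursor / saveCursor / applyChange are hypothetical helpers backed by the client's local store:

// Polling backup: fetch changes since the last cursor, apply them,
// and persist the new cursor for the next round.
async function pollChanges() {
  const cursor = loadCursor(); // hypothetical, e.g. from local SQLite
  const res = await fetch(`/api/sync/changes?cursor=${encodeURIComponent(cursor ?? '')}`);
  const { changes, next_cursor } = await res.json();

  for (const change of changes) {
    await applyChange(change); // hypothetical: download, delete, or update metadata
  }
  saveCursor(next_cursor); // hypothetical
}

setInterval(pollChanges, 3 * 60 * 1000); // every 3 minutes as a backup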

4. 📱 Push Notifications (Mobile)

How it works:

  1. Mobile app registers for push notifications (APNs/FCM)
  2. Server sends push notification when files change
  3. OS wakes up app even if backgrounded/closed
  4. App performs sync check and downloads changes

Pros: Works when app is closed, low battery impact

Cons: Mobile-only, delivery not guaranteed

5. 🚀 Startup Sync (App Launch)

How it works:

  1. When app starts, compare local SQLite vs server state
  2. Call GET /api/sync/changes for all missed changes
  3. Download any files modified while app was closed
  4. Update local file system to match server

Pros: Catches everything missed while offline

Cons: Only works on app startup

🏗️ Layered Detection Strategy

Production systems use all strategies together for reliability:

  1. WebSocket or SSE for instant notifications (when connected)
  2. Polling every 2-5 minutes as backup
  3. Push notifications for mobile apps
  4. Startup sync to catch anything missed

This ensures users get changes quickly when online, and everything syncs correctly even after network outages or app crashes.

🎯 Key Architecture Decisions

✅ Microservices

Independent scaling, fault isolation, and team ownership of services

✅ Direct S3 Upload

Clients upload directly to S3 with presigned URLs, bypassing servers

✅ Event-Driven Sync

Kafka for async processing, WebSocket for real-time notifications

⚔️ Deep Dive: Conflict Resolution Strategies

When the same file is modified on multiple devices while offline, we need conflict resolution:

📝 Text Files (Docs, Code)

  • Use Operational Transforms (OT)
  • Three-way merge (like Git)
  • Show conflict markers if auto-merge fails
  • Keep both versions as separate files

🎬 Binary Files (Images, Videos)

  • Last-write-wins with timestamp
  • Create conflict copy: "file (conflicted copy).mp4"
  • Let user manually choose version
  • Keep version history for rollback

Vector Clock Example

Device A: Edit at 10:00 → Vector: [A:1, B:0, C:0]
Device B: Edit at 10:05 → Vector: [A:0, B:1, C:0]
Conflict detected! Neither vector dominates.
Resolution: Create two versions for user to choose.
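
In code, "neither vector dominates" is a simple element-wise comparison. A sketch (device IDs and clock shapes are illustrative):

// Clock a dominates clock b if a >= b on every device and a > b on at
// least one. If neither dominates, the edits were concurrent: conflict.
function dominates(a, b) {
  let strictlyGreater = false;
  for (const device of new Set([...Object.keys(a), ...Object.keys(b)])) {
    const va = a[device] ?? 0;
    const vb = b[device] ?? 0;
    if (va < vb) return false;
    if (va > vb) strictlyGreater = true;
  }
  return strictlyGreater;
}

const clockA = { A: 1, B: 0, C: 0 };
const clockB = { A: 0, B: 1, C: 0 };

if (!dominates(clockA, clockB) && !dominates(clockB, clockA)) {
  console.log('Conflict detected! Keep both versions for the user.');
}
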
🔀 Deep Dive: Three-Way Merge Algorithm (Like Git)

Three-way merge is the gold standard for automatically resolving text file conflicts. It uses three versions:

  • Base Version: the last common ancestor
  • Version A: changes from Device A
  • Version B: changes from Device B

Example Scenario: README.md file

Base (v1):

# Project
Description here
## Setup
npm install

Device A (v2):

# Project
Better description
## Setup
npm install
## Testing
npm test

Device B (v3):

# MyProject
Description here
## Setup
npm install
yarn install

✅ Automatic Merge Result:
# MyProject          ← Device B changed title (no conflict)
Better description   ← Device A changed description (no conflict)
## Setup
npm install
yarn install        ← Device B added yarn (no conflict)
## Testing
npm test            ← Device A added testing section (no conflict)

Success! All changes merged automatically because they modified different parts

❌ Conflict Example:

If both devices modified the same line:

Device A: "Better description"

Device B: "Amazing description"

Conflict markers added:

# MyProject
<<<<<<< Device A
Better description
=======
Amazing description
>>>>>>> Device B
## Setup

User must manually choose which description to keep
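
The per-line decision rule is compact. A toy sketch: real tools (diff3, git merge) first run a diff to align insertions and deletions, so this version assumes the three inputs are already aligned line-by-line:

// Three-way merge decision rule per aligned line:
//   both same        -> keep it
//   only A changed   -> take A's line
//   only B changed   -> take B's line
//   both changed     -> conflict, emit markers like the example above
function threeWayMerge(baseLines, aLines, bLines) {
  return baseLines.map((base, i) => {
    const a = aLines[i];
    const b = bLines[i];
    if (a === b) return a;
    if (a === base) return b;
    if (b === base) return a;
    return ['<<<<<<< Device A', a, '=======', b, '>>>>>>> Device B'].join('\n');
  });
}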

🧠 Why Three-Way Merge Works

  • Distinguishes intent: Knows what changed vs what stayed same
  • Non-conflicting changes merge: Changes to different sections combine
  • Detects real conflicts: Only when same content modified differently
  • Preserves both changes: No data loss with conflict markers
  • Used everywhere: Git, SVN, Google Docs, Office 365