Step 3: Core Architecture

Step 3 of 6: C - Core High-Level Design

Build the fundamental system architecture with file chunking and sync algorithms

🏗️ High-Level System Architecture

  • Client Applications: Desktop App, Mobile App, Web Browser, Third-party Apps
  • CDN (CloudFront): serves static content & chunks (pulls from S3)
  • API Gateway (Load Balancer): single entry point for all clients
  • Microservices:
    • File Service: upload/download, chunking, deduplication
    • Sync Service: change detection, conflict resolution, delta sync
    • Metadata Service: file metadata, permissions, versions
    • Notification Service: WebSocket, push notifications, event streaming
    • Auth Service: OAuth 2.0, JWT tokens, permissions
  • Message Queue (Kafka / RabbitMQ): async communication between services
  • Data Layer:
    • Metadata DB (PostgreSQL): files, users, permissions
    • Cache Layer (Redis Cluster): hot metadata, session data
    • Blob Storage (S3 / GCS): actual file chunks, 3x replication
    • Search Index (Elasticsearch): file names, content search
    • Analytics DB (ClickHouse): usage metrics, access logs

🔗 API Endpoint → Architecture Mapping

POST /api/files/upload

Client → API Gateway → File Service → S3 (presigned URL) + Metadata DB

GET /api/files/{file_id}

Client → API Gateway → Metadata Service → Redis Cache → CDN/S3

GET /api/sync/changes

Client → API Gateway → Sync Service → Metadata DB → Redis Cache

POST /api/sharing/create-link

Client → API Gateway → Metadata Service → Metadata DB → Notification Service

GET /api/folders/{id}/contents

Client → API Gateway → Metadata Service → Redis Cache → Metadata DB

GET /api/search

Client → API Gateway → Metadata Service → Elasticsearch → Redis Cache

WebSocket: Real-time sync

Client ↔ Notification Service → Kafka → Sync Service

Authentication

All requests → API Gateway → Auth Service → JWT validation

📱 Client Layer

Multiple client types with automatic chunking and sync capabilities

⚙️ Service Layer

Microservices handling specific responsibilities, communicating via message queue

💾 Data Layer

Hybrid storage: SQL for metadata, Redis for cache, S3 for files, Elasticsearch for search

🎯 Component Responsibilities

📁 File Service

APIs: POST /api/files/upload, GET /api/files/{id}

Responsibilities:

  • Generate presigned S3 URLs
  • Chunk deduplication logic
  • File metadata validation
  • Trigger async processing

🔄 Sync Service

APIs: GET /api/sync/changes, POST /api/sync/register-device

Responsibilities:

  • Delta change detection
  • Conflict resolution
  • Device sync coordination
  • Version management

📊 Metadata Service

APIs: GET /api/folders/{id}/contents, GET /api/search

Responsibilities:

  • File/folder metadata CRUD
  • Permission management
  • Search queries
  • Cache management

🔔 Notification Service

APIs: WebSocket connections, POST /api/sharing/create-link

Responsibilities:

  • Real-time WebSocket connections
  • Push notifications
  • Event broadcasting
  • Presence management

🔐 Auth Service

APIs: All authenticated endpoints

Responsibilities:

  • JWT token validation
  • OAuth 2.0 flow
  • Permission checks
  • Rate limiting

⚡ API Gateway

APIs: All external APIs

Responsibilities:

  • Request routing
  • Load balancing
  • API versioning
  • Request/response logging

📤 File Upload Flow

  1. Client Chunks File: The desktop app detects the new file, splits it into 4MB pieces, and calculates a checksum for each chunk (see the sketch after this list).
  2. Initialize Upload: The client calls the File Service to create a file record and receives a file_id and upload URLs.
  3. Deduplication Check: The File Service checks whether chunks already exist (by checksum) and skips duplicate uploads.
  4. Get Presigned URLs: The File Service generates S3 presigned URLs for each chunk and returns them to the client.
  5. Direct Upload to S3: The client uploads chunks directly to S3 using the presigned URLs, bypassing our servers.
  6. Update Metadata: The Metadata Service updates PostgreSQL with file info, chunk locations, and versions.
  7. Trigger Sync: The Sync Service publishes a change event to Kafka and notifies other devices via WebSocket.
  8. Update Search Index: An async job updates Elasticsearch with the file name and metadata for fast search.
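
Step 1 is the basis for everything downstream (deduplication, delta sync). Here is a minimal Node.js sketch of client-side chunking; the 4MB chunk size matches the flow above, while SHA-256 and the field names are illustrative choices:

// Client-side chunking sketch: split a file into 4MB chunks and
// compute a SHA-256 checksum per chunk (hash choice is illustrative).
const fs = require('fs');
const crypto = require('crypto');

const CHUNK_SIZE = 4 * 1024 * 1024; // 4MB

async function chunkFile(filePath) {
  const chunks = [];
  const fd = await fs.promises.open(filePath, 'r');
  const { size } = await fd.stat();

  for (let offset = 0, n = 1; offset < size; offset += CHUNK_SIZE, n++) {
    const length = Math.min(CHUNK_SIZE, size - offset);
    const buffer = Buffer.alloc(length);
    await fd.read(buffer, 0, length, offset);

    chunks.push({
      chunkNumber: n,
      offset,
      length,
      checksum: crypto.createHash('sha256').update(buffer).digest('hex')
    });
  }

  await fd.close();
  return chunks; // sent to the File Service in step 2 for the dedup check
}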

🔑 Deep Dive: Presigned URLs vs Proxy Upload

There are two approaches for uploading files to cloud storage. Let's compare them:

❌ Proxy Upload (Not Recommended)

Flow:

  1. Client uploads to our server
  2. Server receives entire chunk
  3. Server uploads to S3
  4. Server responds to client

Drawbacks:

  • 2x bandwidth cost (in + out)
  • Server becomes bottleneck
  • Higher latency
  • Need more servers for scale
  • Server memory/disk usage

✅ Presigned URLs (Recommended)

Flow:

  1. Client requests upload permission
  2. Server generates presigned URL
  3. Client uploads directly to S3
  4. S3 notifies server via webhook

Benefits:

  • No bandwidth through servers
  • Parallel uploads to S3
  • Lower latency
  • Reduced server costs
  • S3 handles retries/resumption

Sample Presigned URL Generation:

// Server-side: Generate presigned URL for client
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

async function generatePresignedUrl(userId, fileId, chunkNumber) {
  const params = {
    Bucket: 'dropbox-chunks',
    Key: `${userId}/${fileId}/chunk-${chunkNumber}`,
    Expires: 3600, // URL valid for 1 hour
    ContentType: 'application/octet-stream'
    // Note: a presigned PUT URL cannot enforce an upload size limit.
    // To cap chunk size at 4MB, use s3.createPresignedPost with a
    // ['content-length-range', 0, 4194304] condition, or validate the
    // object size after S3's event notification arrives.
  };

  return s3.getSignedUrlPromise('putObject', params);
}

// Client-side: Use presigned URL to upload directly
async function uploadChunk(presignedUrl, chunkData) {
  const response = await fetch(presignedUrl, {
    method: 'PUT',
    body: chunkData,
    headers: {
      'Content-Type': 'application/octet-stream'
    }
  });
  if (!response.ok) {
    throw new Error(`Chunk upload failed: ${response.status}`);
  }
  // Our servers never touch the file bytes!
}
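
Step 4 of this flow ("S3 notifies server via webhook") deserves a note: S3 delivers object-created events through SNS, SQS, or Lambda rather than a plain HTTP webhook. A Lambda-style handler sketch, where markChunkUploaded is a hypothetical helper that marks the chunk complete in the metadata DB:

// Lambda-style handler for S3 "object created" event notifications.
// markChunkUploaded is a hypothetical helper that updates chunk status
// in the metadata DB.
exports.handler = async (event) => {
  for (const record of event.Records) {
    // Key layout from generatePresignedUrl: userId/fileId/chunk-N
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));
    const [userId, fileId, chunkPart] = key.split('/');
    const chunkNumber = Number(chunkPart.replace('chunk-', ''));

    await markChunkUploaded(userId, fileId, chunkNumber, {
      size: record.s3.object.size,
      etag: record.s3.object.eTag
    });
  }
};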

💰 Cost Comparison (1TB daily upload)

Proxy Upload:

  • Bandwidth IN: $90/day
  • Bandwidth OUT to S3: $90/day
  • Extra servers: ~20 instances
  • Total: ~$500/day

Presigned URLs:

  • Bandwidth: $0 (direct to S3)
  • S3 requests: $5/day
  • Minimal server overhead
  • Total: ~$10/day

🔄 Scenario: Network Failure During Upload (Chunk 3/5)

Let's trace through what happens when a user's network disconnects while uploading chunk 3 of 5:

📉 Initial Upload State (Before Failure)

  • Chunk 1: ✅ Complete
  • Chunk 2: ✅ Complete
  • Chunk 3: ❌ Failed at 60%
  • Chunk 4: ⏳ Pending
  • Chunk 5: ⏳ Pending

Client state: Has upload_session_id and file_id

S3 state: Chunks 1 & 2 fully uploaded, Chunk 3 partially uploaded

Database state: Metadata record exists with status "uploading"

🔍 Recovery Detection (When Connection Restored)

1. Client Startup Check:

• Desktop app detects incomplete upload in local SQLite DB

• Finds upload_session_id and file_id in local storage

• Network connectivity restored

2. Server State Verification:

• Client calls: GET /api/uploads/session_xyz789/status

• Server responds with current upload progress

🚀 Resume Upload Process

Server Response:

{
  "upload_session_id": "session_xyz789",
  "file_id": "file_a1b2c3d4e5",
  "status": "partial_upload",
  "completed_chunks": [1, 2],
  "failed_chunks": [3],
  "pending_chunks": [4, 5],
  "resume_presigned_urls": [
    {
      "chunk_number": 3,
      "upload_url": "<s3 presigned url>",
      "resume_from_byte": 0  // S3 multipart failed, restart chunk
    },
    {
      "chunk_number": 4,
      "upload_url": "<s3 presigned url>"
    },
    {
      "chunk_number": 5,
      "upload_url": "<s3 presigned url>"
    }
  ]
}

⚡ Smart Resume Strategy

Option A: Simple Restart

  • Re-upload entire chunk 3 from beginning
  • Simpler implementation
  • Works with standard S3 multipart
  • Wastes some bandwidth

Option B: Byte-level Resume

  • Resume from the exact byte offset (e.g., Range: bytes=2621440-)
  • More complex but bandwidth efficient
  • Requires a server-side proxy or storage that supports ranged/append writes; a plain S3 presigned PUT cannot resume mid-object

🔄 Final Upload Flow

  1. Client resumes: Uploads chunks 3, 4, and 5 in parallel (sketched after this list)
  2. Progress tracking: Updates local DB and shows progress to user
  3. Completion: Calls POST /api/uploads/session_xyz789/complete
  4. Finalization: Server updates file status to "available"
  5. Sync trigger: Notifies other devices via WebSocket
  6. User experience: File appears as "synced" across all devices
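
A minimal sketch of steps 1 and 3, reusing uploadChunk from the presigned-URL example; readChunk is a hypothetical helper that returns the bytes for a given chunk number:

// Resume sketch: re-upload failed and pending chunks in parallel using
// the status response above, then ask the server to finalize.
async function resumeUpload(filePath, status) {
  await Promise.all(
    status.resume_presigned_urls.map(async ({ chunk_number, upload_url }) => {
      const data = await readChunk(filePath, chunk_number); // hypothetical helper
      await uploadChunk(upload_url, data);
    })
  );

  await fetch(`/api/uploads/${status.upload_session_id}/complete`, {
    method: 'POST'
  });
}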

✅ Production Implementation Tips

  • Exponential backoff: Wait 1s, 2s, 4s, 8s before retry attempts (see the sketch after this list)
  • Heartbeat mechanism: Periodic ping to detect connection loss early
  • Background uploads: Continue uploading even when app is minimized
  • Bandwidth adaptation: Reduce chunk size on slow/unstable connections
  • User feedback: Show "Resuming upload..." with clear progress indication
  • Timeout handling: Upload sessions expire after 24 hours for security
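
The exponential backoff tip is simple to implement; a sketch (delays and attempt count are illustrative):

// Retry helper with exponential backoff: waits 1s, 2s, 4s, 8s between
// attempts, then gives up and rethrows the last error.
async function withRetry(fn, maxAttempts = 5) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt === maxAttempts - 1) throw err;
      const delayMs = 1000 * 2 ** attempt; // 1s, 2s, 4s, 8s
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// Usage: await withRetry(() => uploadChunk(url, data));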

💾 Client-Side Persistence Requirement

🔑 Key Insight: The upload recovery scenario above is only possible because client applications maintain local persistent storage to track upload sessions and chunk states.

📱 Local Storage (SQLite)

Client apps maintain a local database with the following (a schema sketch follows this list):

  • Upload sessions: session_id, file_id, progress
  • Chunk status: completed, failed, pending chunks
  • File metadata: checksums, modification times
  • Sync cursors: last sync timestamp, change tokens
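
A schema sketch for this local store, using the better-sqlite3 package; table and column names are illustrative, not any vendor's actual schema:

// Local client store sketch (better-sqlite3). Tracks upload sessions,
// per-chunk status, and sync cursors so uploads survive crashes.
const Database = require('better-sqlite3');
const db = new Database('sync-client.db');

db.exec(`
  CREATE TABLE IF NOT EXISTS upload_sessions (
    session_id TEXT PRIMARY KEY,
    file_id    TEXT NOT NULL,
    file_path  TEXT NOT NULL,
    created_at INTEGER NOT NULL
  );

  CREATE TABLE IF NOT EXISTS chunks (
    session_id   TEXT NOT NULL REFERENCES upload_sessions(session_id),
    chunk_number INTEGER NOT NULL,
    checksum     TEXT NOT NULL,
    status       TEXT NOT NULL DEFAULT 'pending', -- pending | completed | failed
    PRIMARY KEY (session_id, chunk_number)
  );

  CREATE TABLE IF NOT EXISTS sync_state (
    key   TEXT PRIMARY KEY, -- e.g. 'last_sync_cursor'
    value TEXT NOT NULL
  );
`);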

🔄 Resume Capabilities Enabled

Local persistence enables:

  • Upload recovery: Resume after network failures
  • App restart resilience: Continue from where left off
  • Offline editing: Track changes while disconnected
  • Progress tracking: Show accurate upload/download status

Architecture Impact: Every production file sync client (Dropbox, Google Drive, OneDrive) implements local persistence. Without it, users would lose all progress on app crashes or network interruptions, making the service unreliable for large files.

🔄 Delta Sync Algorithm

📊 Change Detection

1. Client maintains local metadata DB (SQLite)

2. File watcher detects changes in real-time

3. Compare local vs server checksums

4. Identify modified chunks only

5. Use Merkle trees for efficient comparison (see the sketch below)
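
A toy Merkle-root sketch over per-chunk checksums: equal roots mean the file is unchanged, and walking down mismatched subtrees locates changed chunks in O(log n) comparisons. Assumes hex checksums like those from the chunking sketch earlier:

// Build a Merkle root from chunk checksums by hashing pairs level by
// level; duplicate the last hash when a level has an odd count.
const crypto = require('crypto');

const sha256 = (s) => crypto.createHash('sha256').update(s).digest('hex');

function merkleRoot(chunkChecksums) {
  if (chunkChecksums.length === 0) return sha256('');
  let level = chunkChecksums;
  while (level.length > 1) {
    const next = [];
    for (let i = 0; i < level.length; i += 2) {
      const left = level[i];
      const right = level[i + 1] ?? left; // odd count: pair with itself
      next.push(sha256(left + right));
    }
    level = next;
  }
  return level[0];
}

// Client and server compare roots; equal roots mean nothing to sync.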

🎯 Smart Sync

1. Sync only changed chunks (not entire file)

2. Binary diff for small text changes

3. Prioritize small files and recent changes

4. Background sync for large files

5. Pause/resume capability

Example: 1GB Video File Edit

• User edits last 10 seconds of video

• Only 3 chunks (12MB) are modified

• Delta sync uploads only 12MB instead of 1GB

• Other devices download only changed chunks

98.8% bandwidth saved!

🔍 Remote Change Detection (Server → Client)

🤔 The Challenge: Local file changes are easy to detect with file watchers, but how does a client know when files change on the server from other devices?

✅ Local → Remote (Easy)

File System Watcher:

  • OS notifies app of file changes
  • Immediate detection (milliseconds)
  • Built-in OS APIs (inotify, ReadDirectoryChangesW)
  • No network polling required

🔄 Remote → Local (Complex)

Network-based Detection:

  • No direct server → client file notifications
  • Must actively check or listen for changes
  • Network latency and reliability issues
  • Multiple strategies needed

🔄 Remote Change Detection Strategies

1. 🔴 Real-time: WebSocket Notifications (Primary)

How it works:

  1. Client opens persistent WebSocket connection to Notification Service
  2. When Device B uploads a file, server publishes change event to Kafka
  3. Notification Service consumes event and pushes to Device A via WebSocket
  4. Device A immediately knows to download the changed file

WebSocket Message Example:

{
  "type": "file_changed",
  "file_id": "file_abc123",
  "change_type": "modified",
  "changed_by_device": "device_456",
  "timestamp": 1234567890,
  "new_version": 5
}

Pros: Instant notifications, efficient bandwidth

Cons: Connection can drop, doesn't work offline
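
A browser-side sketch of consuming the message above; the endpoint URL is illustrative, and downloadChangedFile / scheduleReconnect are hypothetical helpers:

// WebSocket client sketch: react to file_changed events and fall back
// to polling (strategy 3) if the connection drops.
const socket = new WebSocket('wss://notifications.example.com/sync');

socket.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  if (msg.type === 'file_changed') {
    downloadChangedFile(msg.file_id, msg.new_version); // hypothetical
  }
};

socket.onclose = () => scheduleReconnect(); // hypothetical backoff reconnect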

2. 📡 Server-Sent Events (SSE) (Alternative)

How it works:

  1. Client opens persistent HTTP connection: GET /api/sync/events
  2. Server keeps connection open and sends events as they occur
  3. When Device B uploads file, server sends SSE to all connected devices
  4. Client receives event and processes file change

SSE Stream Example:

GET /api/sync/events
Accept: text/event-stream

data: {"type":"file_changed","file_id":"abc123","version":5}

data: {"type":"file_deleted","file_id":"def456"}

data: {"type":"folder_shared","folder_id":"ghi789","shared_with":"user123"}

Pros: Simpler than WebSocket, automatic reconnection, works through firewalls

Cons: Server → Client only (no client messages), less efficient than WebSocket
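
A client sketch using the browser's built-in EventSource, which reconnects automatically; handleChange is a hypothetical dispatcher over the event types shown above:

// SSE client sketch: one long-lived HTTP connection, server pushes
// events as they happen.
const events = new EventSource('/api/sync/events');

events.onmessage = (event) => {
  handleChange(JSON.parse(event.data)); // hypothetical dispatcher
};

events.onerror = () => {
  // EventSource retries on its own; just log for diagnostics
  console.warn('SSE connection lost, retrying...');
};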

3. 🔄 Polling: Regular Sync Checks (Backup)

How it works:

  1. Client periodically calls GET /api/sync/changes?since=last_sync_timestamp
  2. Server returns list of files changed since last check
  3. Client downloads metadata for changed files
  4. Client updates local SQLite with new information

Polling Response Example:

{
  "changes": [
    {
      "file_id": "file_abc123",
      "change_type": "modified",
      "new_version": 5,
      "timestamp": 1234567890
    },
    {
      "file_id": "file_def456",
      "change_type": "deleted",
      "timestamp": 1234567891
    }
  ],
  "next_cursor": "cursor_xyz789"
}

Frequency: Every 30 seconds to 5 minutes

Pros: Works when WebSocket fails, catches missed notifications

Cons: Higher latency, more server load
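
A polling-loop sketch; the cursor parameter shape is illustrative (the endpoint above shows a since= variant), and loadCursor / saveCursor / applyChange are hypothetical helpers backed by the client's local store:

// Polling backup: fetch changes since the last cursor, apply them,
// and persist the new cursor for the next round.
async function pollChanges() {
  const cursor = loadCursor(); // hypothetical, e.g. from local SQLite
  const res = await fetch(`/api/sync/changes?cursor=${encodeURIComponent(cursor ?? '')}`);
  const { changes, next_cursor } = await res.json();

  for (const change of changes) {
    await applyChange(change); // hypothetical: download, delete, or update metadata
  }
  saveCursor(next_cursor); // hypothetical
}

setInterval(pollChanges, 3 * 60 * 1000); // every 3 minutes as a backup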

4. 📱 Push Notifications (Mobile)

How it works:

  1. Mobile app registers for push notifications (APNs/FCM)
  2. Server sends push notification when files change
  3. OS wakes up app even if backgrounded/closed
  4. App performs sync check and downloads changes

Pros: Works when app is closed, low battery impact

Cons: Mobile-only, delivery not guaranteed

5. 🚀 Startup Sync (App Launch)

How it works:

  1. When app starts, compare local SQLite vs server state
  2. Call GET /api/sync/changes for all missed changes
  3. Download any files modified while app was closed
  4. Update local file system to match server

Pros: Catches everything missed while offline

Cons: Only works on app startup

🏗️ Layered Detection Strategy

Production systems use all strategies together for reliability:

  1. WebSocket or SSE for instant notifications (when connected)
  2. Polling every 2-5 minutes as backup
  3. Push notifications for mobile apps
  4. Startup sync to catch anything missed

This ensures users get changes quickly when online, and everything syncs correctly even after network outages or app crashes.

🎯 Key Architecture Decisions

✅ Microservices

Independent scaling, fault isolation, and team ownership of services

✅ Direct S3 Upload

Clients upload directly to S3 with presigned URLs, bypassing servers

✅ Event-Driven Sync

Kafka for async processing, WebSocket for real-time notifications

⚔️ Deep Dive: Conflict Resolution Strategies

When the same file is modified on multiple devices while offline, we need conflict resolution:

📝 Text Files (Docs, Code)

  • Use Operational Transforms (OT)
  • Three-way merge (like Git)
  • Show conflict markers if auto-merge fails
  • Keep both versions as separate files

🎬 Binary Files (Images, Videos)

  • Last-write-wins with timestamp
  • Create conflict copy: "file (conflicted copy).mp4"
  • Let user manually choose version
  • Keep version history for rollback

Vector Clock Example

Device A: Edit at 10:00 → Vector: [A:1, B:0, C:0]
Device B: Edit at 10:05 → Vector: [A:0, B:1, C:0]
Conflict detected! Neither vector dominates.
Resolution: Create two versions for user to choose.
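
In code, "neither vector dominates" is a simple element-wise comparison. A sketch (device IDs and clock shapes are illustrative):

// Clock a dominates clock b if a >= b on every device and a > b on at
// least one. If neither dominates, the edits were concurrent: conflict.
function dominates(a, b) {
  let strictlyGreater = false;
  for (const device of new Set([...Object.keys(a), ...Object.keys(b)])) {
    const va = a[device] ?? 0;
    const vb = b[device] ?? 0;
    if (va < vb) return false;
    if (va > vb) strictlyGreater = true;
  }
  return strictlyGreater;
}

const clockA = { A: 1, B: 0, C: 0 };
const clockB = { A: 0, B: 1, C: 0 };

if (!dominates(clockA, clockB) && !dominates(clockB, clockA)) {
  console.log('Conflict detected! Keep both versions for the user.');
}
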
🔀 Deep Dive: Three-Way Merge Algorithm (Like Git)

Three-way merge is the gold standard for automatically resolving text file conflicts. It uses three versions:

  • Base Version: the last common ancestor
  • Version A: changes from Device A
  • Version B: changes from Device B

Example Scenario: README.md file

Base (v1):

# Project
Description here
## Setup
npm install

Device A (v2):

# Project
Better description
## Setup
npm install
## Testing
npm test

Device B (v3):

# MyProject
Description here
## Setup
npm install
yarn install

✅ Automatic Merge Result:
# MyProject          ← Device B changed title (no conflict)
Better description   ← Device A changed description (no conflict)
## Setup
npm install
yarn install        ← Device B added yarn (no conflict)
## Testing
npm test            ← Device A added testing section (no conflict)

Success! All changes merged automatically because they modified different parts

❌ Conflict Example:

If both devices modified the same line:

Device A: "Better description"

Device B: "Amazing description"

Conflict markers added:

# MyProject
<<<<<<< Device A
Better description
=======
Amazing description
>>>>>>> Device B
## Setup

User must manually choose which description to keep
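
The per-line decision rule is compact. A toy sketch: real tools (diff3, git merge) first run a diff to align insertions and deletions, so this version assumes the three inputs are already aligned line-by-line:

// Three-way merge decision rule per aligned line:
//   both same        -> keep it
//   only A changed   -> take A's line
//   only B changed   -> take B's line
//   both changed     -> conflict, emit markers like the example above
function threeWayMerge(baseLines, aLines, bLines) {
  return baseLines.map((base, i) => {
    const a = aLines[i];
    const b = bLines[i];
    if (a === b) return a;
    if (a === base) return b;
    if (b === base) return a;
    return ['<<<<<<< Device A', a, '=======', b, '>>>>>>> Device B'].join('\n');
  });
}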

🧠 Why Three-Way Merge Works

  • Distinguishes intent: Knows what changed vs what stayed same
  • Non-conflicting changes merge: Changes to different sections combine
  • Detects real conflicts: Only when same content modified differently
  • Preserves both changes: No data loss with conflict markers
  • Used everywhere: Git, SVN, Google Docs, Office 365