Core Principles
Key Properties of a System
Scalability
Handle growing load by adding resources. Vertical (bigger box) or horizontal (more boxes).
Availability
% of time the system is operational. 99.9% = 8.76 hrs/yr downtime. 99.99% = 52.6 min/yr.
Reliability
System functions correctly even when components fail. Achieved via redundancy and fault tolerance.
Consistency
All nodes see the same data at the same time. Strong vs eventual consistency trade-off.
Latency
Time to serve a single request. P50, P95, P99 matter more than averages.
Throughput
Number of operations per second. Measured as RPS (requests) or TPS (transactions).
CAP Theorem
During a network partition, a distributed system can preserve consistency or availability, but not both. Since partitions are unavoidable, every distributed store effectively chooses CP (e.g., ZooKeeper, etcd) or AP (e.g., Cassandra, DynamoDB). PACELC extends the model: even without a partition (Else), you still trade Latency against Consistency during normal operation — a critical nuance when choosing a data store.
Latency Numbers Every Engineer Should Know
These hardware and network latency figures are order-of-magnitude benchmarks that inform every architectural decision around caching, replication, and data locality. Keeping them in mind prevents over-engineering fast paths and under-engineering slow ones. The commonly cited figures (popularised by Jeff Dean) are approximately:

| Operation | Latency |
|---|---|
| L1 cache reference | 0.5 ns |
| Main memory reference | 100 ns |
| Compress 1 KB (Snappy) | 3 µs |
| Send 1 KB over 1 Gbps network | 10 µs |
| Random 4 KB read from SSD | 150 µs |
| Read 1 MB sequentially from memory | 250 µs |
| Round trip within same datacenter | 500 µs |
| Read 1 MB sequentially from SSD | 1 ms |
| HDD disk seek | 10 ms |
| Read 1 MB sequentially from HDD | 20 ms |
| Round trip, cross-continent | 150 ms |
Key insight: memory is 100× faster than SSD, SSD is 100× faster than HDD, and in-DC networking is 300× faster than cross-continent. This drives every caching and replication decision.
Back-of-the-Envelope Estimation
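A worked example makes the technique concrete. All numbers below are illustrative assumptions for a hypothetical photo-sharing service, not real product figures:

```python
# Back-of-the-envelope sizing for a hypothetical photo-sharing service.
# Every input here is an assumption chosen for round-number arithmetic.

SECONDS_PER_DAY = 86_400

daily_active_users = 10_000_000
reads_per_user_per_day = 20
uploads_per_user_per_day = 0.5
avg_photo_bytes = 2 * 1024 * 1024  # assume ~2 MB per photo

read_qps = daily_active_users * reads_per_user_per_day / SECONDS_PER_DAY
peak_read_qps = read_qps * 3  # rule of thumb: peak is ~2-3x average

daily_storage_gb = (daily_active_users * uploads_per_user_per_day
                    * avg_photo_bytes) / 1024**3
five_year_storage_tb = daily_storage_gb * 365 * 5 / 1024

print(f"avg read QPS:  {read_qps:,.0f}")
print(f"peak read QPS: {peak_read_qps:,.0f}")
print(f"storage/day:   {daily_storage_gb:,.0f} GB")
print(f"5-year total:  {five_year_storage_tb:,.0f} TB")
```

The point is not precision but scale: knowing whether you need thousands or millions of QPS, and gigabytes or petabytes, changes the architecture entirely.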
Architectural Patterns
Monolith → Microservices Spectrum
Monolith
Single deployable unit. Shared database. Simple to develop, test, and deploy initially. Coupling grows with team size. Fine for most startups and small teams.
Use when: <10 developers, single deployment cadence, tight coupling is acceptable.
Modular Monolith
Single deployment, but internally structured into bounded modules with clear APIs between them. Each module owns its data. Can be split into services later if needed. The sweet spot for most teams.
Use when: 5–30 developers, want service boundaries without operational overhead.
Microservices
Independently deployable services, each owning its data store. Enables team autonomy, independent scaling, and polyglot tech. Comes with massive operational overhead: service mesh, distributed tracing, eventual consistency.
Use when: 50+ developers, different parts need independent scaling/deployment.
Key Architectural Styles
Layered (N-Tier)
Presentation → Business Logic → Data Access → Database. Each layer depends only on the one below. Simple, well-understood, the default for most CRUD APIs. Risk: layers become pass-throughs.
Clean / Hexagonal
Domain at the centre, infrastructure at the edges. Dependencies point inward — domain never knows about HTTP, databases, or frameworks. Ports (interfaces) and adapters (implementations). Enables testability and framework independence.
Event-Driven
Components communicate via events (messages). Producers don't know consumers. Loose coupling, natural audit trail, supports eventual consistency. Essential for CQRS, saga patterns, and reactive systems.
CQRS
Separate models for reads and writes. Write model: normalised, optimised for consistency. Read model: denormalised, optimised for queries. Often paired with event sourcing. Use when read/write patterns differ drastically.
Event Sourcing
Instead of overwriting a balance, every financial operation is recorded as an immutable event, and current state is derived by replaying the log. This makes auditing trivial and allows reconstructing state at any point in time — critical for banking, e-commerce, and compliance-heavy domains.
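A minimal sketch of the idea (illustrative account model and event names, not a production framework):

```python
# Event-sourcing sketch: state is never stored directly — it is derived
# by folding over an append-only log of immutable events.

from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    kind: str     # "Deposited" or "Withdrew" (illustrative event names)
    amount: int   # cents, to avoid floating-point rounding

def balance(events: list[Event]) -> int:
    """Replay the event log to compute current state."""
    total = 0
    for e in events:
        total += e.amount if e.kind == "Deposited" else -e.amount
    return total

log = [Event("Deposited", 10_000), Event("Withdrew", 2_500), Event("Deposited", 500)]
print(balance(log))       # current balance from full history
print(balance(log[:2]))   # replay a prefix: state as of any past point in time
```

Replaying a prefix of the log is exactly the "reconstruct state at any point in time" property — no extra bookkeeping is required.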
Domain-Driven Design (DDD) — Key Concepts
Bounded Context
A clear boundary around a domain model. Each context has its own ubiquitous language and can use different technology.
Aggregate
A cluster of domain objects treated as a single unit for data changes. One aggregate root enforces invariants.
Entity
An object with identity that persists over time (e.g., Order, User). Equality by ID, not by attributes.
Value Object
An immutable object defined by its attributes (e.g., Money, Address). No identity. Equality by value.
Domain Event
Something significant that happened in the domain. OrderPlaced, PaymentReceived. Drives inter-context communication.
Anti-Corruption Layer
A translation layer between your domain and an external system's model. Prevents foreign concepts from leaking in.
Data Layer
Database Selection
| Type | Examples | Best For | Trade-off |
|---|---|---|---|
| Relational (SQL) | PostgreSQL, SQL Server, MySQL | ACID transactions, complex joins, structured data | Scaling writes horizontally is hard |
| Document | MongoDB, Cosmos DB, Firestore | Flexible schemas, nested data, rapid prototyping | No joins; denormalise or multiple queries |
| Key-Value | Redis, DynamoDB, Memcached | Caching, sessions, leaderboards, high-speed lookups | No complex queries or relationships |
| Wide-Column | Cassandra, HBase, Bigtable | Time-series, IoT, write-heavy at massive scale | Limited query patterns; must model around access |
| Graph | Neo4j, Amazon Neptune | Relationships: social graphs, fraud detection, recommendations | Niche; poor for non-graph workloads |
| Search | Elasticsearch, Meilisearch | Full-text search, log analytics, faceted filtering | Not a primary store; eventual consistency |
| Time-Series | InfluxDB, TimescaleDB, Prometheus | Metrics, monitoring, IoT sensor data | Optimised for append; poor for random updates |
Caching Strategies
Cache-Aside (Lazy Loading)
App checks cache first. On miss, reads from DB, writes result to cache. Most common pattern. Risk: stale data until TTL expires or explicit invalidation.
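A sketch of the pattern (an in-memory dict stands in for Redis, and `load_user_from_db` is a hypothetical query, not a real API):

```python
# Cache-aside: the application owns the caching logic. On a miss, read the
# source of truth and lazily populate the cache with a TTL.

import time

cache: dict[int, tuple[float, dict]] = {}   # user_id -> (stored_at, value)
TTL_SECONDS = 60.0

def load_user_from_db(user_id: int) -> dict:
    # Stand-in for a real SELECT against the primary store.
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id: int) -> dict:
    entry = cache.get(user_id)
    if entry and time.monotonic() - entry[0] < TTL_SECONDS:
        return entry[1]                         # cache hit
    user = load_user_from_db(user_id)           # miss: go to the database
    cache[user_id] = (time.monotonic(), user)   # populate for next time
    return user
```

Note the staleness window: until the TTL expires (or the key is explicitly invalidated), readers may see an outdated value.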
Write-Through
Write to cache and DB synchronously on every write. Cache always has fresh data. Higher write latency. Use when read-after-write consistency matters.
Write-Behind (Write-Back)
Write to cache immediately, asynchronously flush to DB in batches. Fastest writes but risk of data loss on cache failure. Use for analytics counters, view counts.
Cache Invalidation
Famously one of the two hard problems in computer science. Options: TTL-based expiry (simple, some staleness), event-driven invalidation (publish on DB change), versioned keys (append a version to the cache key).
Replication & Partitioning
Replication
Copies of the same data on multiple nodes. Leader-follower (one writes, many read) or multi-leader. Improves read throughput and availability. Risk: replication lag causes stale reads.
Partitioning (Sharding)
Split data across nodes by a partition key (user_id, region, hash). Each shard holds a subset. Improves write throughput and storage capacity. Risk: hot partitions, cross-shard queries, rebalancing.
Consistency Models
Different distributed systems offer different guarantees about when a write becomes visible to subsequent reads. Choosing the right model is a direct trade-off between correctness, latency, and availability — there is no universally correct answer.
Scaling Strategies
Load Balancing
A load balancer distributes incoming traffic across multiple servers to prevent any single instance from becoming a bottleneck. The choice of layer (L4 vs L7) and routing algorithm directly impacts session handling, performance, and operational complexity.
Horizontal Scaling Playbook
Stateless Services
Store no session data in the server. Externalise state to Redis/DB. Any instance can handle any request.
Database Read Replicas
Route reads to replicas, writes to the primary. With a typical 10:1 read/write ratio, each added replica multiplies capacity exactly where the load is.
Sharding
Partition data by key. Each shard is an independent database. Range-based, hash-based, or directory-based.
CDN
Cache static/dynamic content at edge. Reduces origin load and latency. Cloudflare, CloudFront, Fastly.
Connection Pooling
Reuse DB connections instead of opening new ones per request. PgBouncer, HikariCP.
Auto-Scaling
Add/remove instances based on metrics (CPU, queue depth, request count). Horizontal pod autoscaler in K8s.
Rate Limiting
Rate limiting controls how many requests a client can make in a given time window, protecting services from abuse and overload. Each algorithm makes different trade-offs between memory usage, precision, and burst tolerance — the right choice depends on your traffic shape and enforcement strictness requirements.
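As one example, the token bucket permits bursts up to a fixed capacity while enforcing a sustained rate. A sketch with an injected clock for determinism (parameter names are illustrative):

```python
# Token bucket: capacity bounds the burst size, refill_rate bounds the
# sustained request rate. Time is passed in explicitly for testability.

class TokenBucket:
    def __init__(self, capacity: float, refill_rate: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_rate = refill_rate   # tokens added per second
        self.tokens = capacity           # start full: allow an initial burst
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=3, refill_rate=1.0)
print([bucket.allow(0.0) for _ in range(4)])  # burst of 3 allowed, 4th rejected
print(bucket.allow(2.0))                      # 2s later: tokens refilled, allowed
```

In a distributed deployment the same state typically lives in Redis, updated atomically via a Lua script so that refill and decrement happen as one operation.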
Messaging & Async
Message Queue vs Event Stream
Message Queue
Point-to-point. Message consumed by one consumer then deleted. Work distribution. RabbitMQ, SQS, Azure Service Bus.
Use for: task queues, job processing, command dispatch.
Event Stream
Pub-sub with retention. Multiple consumers replay independently. Append-only log. Kafka, Kinesis, Pulsar, EventHub.
Use for: event sourcing, CDC, analytics pipelines, fan-out.
Messaging Guarantees
In practice, delivery guarantees come in three flavours: at-most-once (fast, may drop messages), at-least-once (safe, may duplicate), and exactly-once (expensive, usually approximated as at-least-once plus idempotent consumers). Understanding these semantics is essential for designing consumers that remain correct even when messages are retried or reordered.
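At-least-once delivery — the common default — means consumers will see duplicates. A sketch of an idempotent consumer (hypothetical message shape; an in-memory set stands in for a persistent dedup store):

```python
# Idempotent consumer: track processed message IDs so a redelivered message
# is acknowledged without repeating its side effect. In production the seen-ID
# set lives in Redis or the database, inside the same transaction as the work.

processed_ids: set[str] = set()
counter = {"value": 0}   # stand-in for the real side effect

def handle(message_id: str, amount: int) -> bool:
    """Returns True if the message did work, False if it was a duplicate."""
    if message_id in processed_ids:
        return False            # duplicate redelivery: safe to ack and drop
    counter["value"] += amount  # the actual side effect
    processed_ids.add(message_id)
    return True

handle("msg-1", 10)
handle("msg-1", 10)       # redelivered by the broker after a timeout
print(counter["value"])   # side effect applied once, not twice
```

The key design point: the dedup check and the side effect must be atomic, otherwise a crash between them reintroduces the duplicate problem.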
Saga Pattern (Distributed Transactions)
Choreography
Each service listens for events and decides what to do. No central coordinator. Simple for 2–3 steps. Becomes a tangled mess at 5+ steps — hard to track overall state.
Orchestration
A central orchestrator (saga coordinator) tells each service what to do and handles failures. Easier to reason about, debug, and monitor. Single point of failure — make it resilient.
API Design
REST vs GraphQL vs gRPC
| Aspect | REST | GraphQL | gRPC |
|---|---|---|---|
| Protocol | HTTP + JSON | HTTP + JSON | HTTP/2 + Protobuf |
| Contract | OpenAPI / Swagger | Schema (SDL) | .proto files |
| Data Shape | Server defines | Client defines | Server defines |
| Strengths | Simple, cacheable, ubiquitous | Flexible queries, no over/under-fetching | Fastest, streaming, strong typing |
| Weaknesses | Over-fetching, N+1 endpoints | Complexity, no HTTP caching | Not browser-native, less tooling |
| Best For | Public APIs, CRUD, general-purpose | Mobile clients, dashboard aggregation | Service-to-service, real-time, internal |
REST Best Practices
RESTful APIs use HTTP methods and resource-based URLs to provide a uniform, stateless interface. Following these conventions makes APIs predictable and interoperable — correct use of verbs, status codes, and pagination patterns is what separates a professional API from a brittle one.
API Versioning
URL Path
/api/v1/ → /api/v2/. Most explicit. Easy to route. Widely adopted.
Header
Accept: application/vnd.api.v2+json. Clean URLs but harder to test.
Query Param
?version=2. Simple but pollutes query string. Good for internal APIs.
Reliability & Resilience
Failure Patterns & Mitigations
Circuit Breaker
Track failure rate of downstream calls. When threshold exceeded, trip the circuit — fail fast instead of waiting for timeouts. After a cooldown, allow probe requests. States: Closed (normal) → Open (failing) → Half-Open (testing). Polly (.NET), Resilience4j (Java), opossum (Node).
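A minimal sketch of the state machine (illustrative, not the Polly or Resilience4j API; consecutive-failure counting and an injected clock keep it simple and testable):

```python
# Circuit breaker: Closed -> Open after N consecutive failures; Open rejects
# calls instantly; after a cooldown, one probe is allowed (Half-Open).

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, cooldown: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at: float | None = None   # None means Closed

    def call(self, fn, now: float):
        if self.opened_at is not None:
            if now - self.opened_at < self.cooldown:
                raise CircuitOpenError("failing fast")  # Open: reject instantly
            self.opened_at = None                       # Half-Open: allow a probe
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = now                    # trip back to Open
            raise
        self.failures = 0                               # success closes the circuit
        return result
```

Production implementations typically track a failure *rate* over a sliding window rather than a consecutive count, but the state transitions are the same.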
Retry with Backoff
Retry failed requests with exponential delay + jitter. Prevents thundering herd. Cap retries (3–5). Only retry on transient errors (5xx, network timeout) — never on 4xx.
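A sketch of the pattern using "full jitter" (the retryable exception types, base delay, and cap are illustrative choices):

```python
# Retry with exponential backoff + full jitter: sleep a uniform random amount
# in [0, min(cap, base * 2**attempt)], so synchronized clients don't all retry
# at the same instant (thundering herd).

import random
import time

TRANSIENT = (TimeoutError, ConnectionError)   # never retry client (4xx) errors

def retry(fn, retries: int = 4, base: float = 0.5, cap: float = 30.0):
    """Call fn, retrying only transient failures, with a bounded retry budget."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except TRANSIENT:
            if attempt == retries:
                raise                         # budget exhausted: surface the error
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

Full jitter deliberately randomises the whole delay window rather than adding a small offset, which spreads a synchronized burst of retries most evenly.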
Bulkhead
Isolate components so one failure doesn't cascade. Separate thread pools, connection pools, or service instances per dependency. A slow DB query shouldn't starve the API of threads for fast requests.
Timeout
Always set timeouts on every external call. No timeout = potential thread leak. Cascade: if service A calls B calls C, A's timeout must be > B's + network overhead.
Health Checks & Graceful Degradation
Health checks tell the orchestration layer whether an instance is running and ready to serve traffic. Distinguishing liveness from readiness prevents bad deploys from receiving requests, and graceful degradation keeps core functionality alive even when supporting systems fail.
Disaster Recovery
RTO
Recovery Time Objective — max acceptable downtime. How fast can you recover?
RPO
Recovery Point Objective — max data loss window. How much data can you afford to lose?
Multi-Region
Active-passive (failover) or active-active (both serve traffic). Active-active is harder but gives lower latency.
Observability
Three Pillars
Metrics
Numeric measurements over time. Counters (request count), gauges (CPU %), histograms (latency distribution). Prometheus, Datadog, CloudWatch. Alert on SLOs, not individual metrics.
Logs
Structured JSON logs with correlation IDs. Aggregate centrally. ELK stack (Elasticsearch + Logstash + Kibana), Loki + Grafana, Datadog Logs. Log at the right level — INFO for business events, ERROR for failures.
Distributed Tracing
Follow a request across service boundaries. Each service adds a span with timing. Jaeger, Zipkin, Datadog APM, OpenTelemetry. Essential for debugging latency in microservices.
SLIs, SLOs, and SLAs
SLIs, SLOs, and SLAs form a hierarchy that connects raw metrics to engineering targets to business contracts. Getting this hierarchy right lets teams make data-driven decisions about reliability investments and deployment risk using the concept of an error budget.
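The error-budget arithmetic is simple but worth making explicit; a sketch assuming a 30-day month:

```python
# Error budget: the downtime an SLO permits. A 99.9% monthly availability SLO
# leaves 0.1% of the month as budget the team may "spend" on risky deploys.

slo = 0.999
minutes_per_month = 30 * 24 * 60                # 43,200 minutes in a 30-day month
budget_minutes = minutes_per_month * (1 - slo)  # allowed downtime

print(f"{budget_minutes:.1f} minutes/month")    # 43.2 minutes/month
```

When the budget is exhausted, feature deploys pause and reliability work takes priority — that is the data-driven decision the hierarchy enables.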
The RED and USE Methods
RED (for services)
Rate — requests/sec
Errors — failed requests/sec
Duration — latency histogram
Monitors the user experience. Dashboard every service with these three.
USE (for resources)
Utilisation — % capacity used
Saturation — work queued
Errors — error count
Monitors infrastructure: CPU, disk, network, DB connections.
Security Architecture
Authentication & Authorisation
Authentication proves identity; authorisation controls what an identity can do. The dominant production protocols and models range from user-facing OAuth 2.0 / OpenID Connect flows to fine-grained, relationship-based access control (as in Google's Zanzibar, which authorises sharing at the scale of Google Drive).
Defence in Depth
Transport
TLS 1.3 everywhere. HSTS headers. Certificate pinning for mobile.
API Gateway
Rate limiting, auth validation, input sanitisation at the edge.
Network
VPC, private subnets, security groups. Zero-trust: verify every request.
Data at Rest
AES-256 encryption. Key rotation via KMS. Separate keys per tenant.
Secrets
Never in code or env vars. Use Vault, AWS Secrets Manager, Azure Key Vault.
Supply Chain
SCA scanning (Snyk, Dependabot). Lock dependencies. Audit transitive deps.
Cloud-Native Patterns
Containerisation & Orchestration
Containers package an application and its dependencies into a portable, immutable unit, eliminating environment drift. Kubernetes extends this by automating scheduling, scaling, self-healing, and service discovery across a cluster — understanding its core primitives is essential for operating any modern production workload.
Infrastructure as Code
Terraform
Declarative, cloud-agnostic. State file tracks resources. Plan → Apply workflow. The standard for multi-cloud.
Pulumi
IaC in real programming languages (TS, Python, C#). Full IDE support. Good for teams that prefer code over HCL.
CloudFormation
AWS-native. Deep integration. YAML/JSON. Use CDK (TypeScript) for a better authoring experience.
CI/CD Pipeline
A CI/CD pipeline automates the path from a code commit to a running production deployment, enforcing quality gates at each stage. The deployment strategy you choose — rolling, blue/green, or canary — determines how much risk you take on with each release and how quickly you can roll back.
System Design Interview
The Framework (45 Minutes)
A five-step framework maps directly to a 45-minute interview slot: clarify functional and non-functional requirements, do back-of-the-envelope estimation, sketch the high-level design, deep-dive into one or two components, and wrap up. Each step serves a specific purpose: requirements gathering prevents wasted effort, estimation grounds your design decisions in real numbers, and the explicit wrap-up is where many candidates lose points by failing to articulate trade-offs.
Common Questions & Key Components
| System | Core Challenge | Key Components |
|---|---|---|
| URL Shortener | Unique ID generation at scale | Base62 encoding, distributed ID (Snowflake), redirect cache |
| Rate Limiter | Distributed counting | Redis + Lua scripts, token bucket, sliding window |
| Chat System | Real-time delivery, presence | WebSockets, Kafka, Cassandra, pub-sub channels |
| News Feed | Fan-out: push vs pull | Push (write-time fan-out) for most users, pull for celebrities |
| Search Engine | Inverted index, ranking | Web crawler → indexer → Elasticsearch → ranking algorithm |
| Notification System | Multi-channel, rate limiting | Priority queue, templating, per-user preferences, dedup |
| Payment System | Exactly-once semantics | Idempotency keys, two-phase commit, event sourcing, ledger |
| File Storage (S3) | Durability, chunking | Erasure coding, metadata DB, content-addressable storage |
| Video Streaming | Transcoding, adaptive bitrate | DAG-based pipeline, CDN, HLS/DASH, pre-signed URLs |
| Distributed Cache | Consistent hashing, eviction | Consistent hash ring, LRU/LFU, replication, write-through |
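The URL-shortener row above mentions Base62 encoding; a minimal sketch turning a numeric ID (e.g., from a Snowflake-style generator) into a short, URL-safe code (the alphabet ordering is a convention, not a standard):

```python
# Base62: encode an integer ID using [0-9a-zA-Z], giving compact URL-safe
# codes — a 64-bit ID fits in at most 11 characters.

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode(n: int) -> str:
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, rem = divmod(n, 62)     # peel off base-62 digits, least significant first
        out.append(ALPHABET[rem])
    return "".join(reversed(out))

def decode(s: str) -> int:
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    return n

print(encode(125))   # "21"  (125 = 2*62 + 1)
```

Encoding a pre-generated unique ID avoids the collision handling that hashing the URL would require.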
Trade-Off Cheat Sheet
Every system design decision involves giving something up: consistency vs availability, latency vs durability, normalised vs denormalised data, monolith simplicity vs microservice autonomy. Naming the trade-off explicitly — and explaining why this workload lands on this side of it — is what distinguishes a senior-level answer, in interviews and in production alike.