Core Principles
Key Properties of a System
Scalability
Handle growing load by adding resources. Vertical (bigger box) or horizontal (more boxes).
Availability
% of time the system is operational. 99.9% = 8.76 hrs/yr downtime. 99.99% = 52.6 min/yr.
Reliability
System functions correctly even when components fail. Achieved via redundancy and fault tolerance.
Consistency
All nodes see the same data at the same time. Strong vs eventual consistency trade-off.
Latency
Time to serve a single request. P50, P95, P99 matter more than averages.
Throughput
Number of operations per second. Measured as RPS (requests) or TPS (transactions).
CAP Theorem
During a network partition, a distributed system can preserve consistency or availability, but not both. Since partitions are unavoidable, every distributed store effectively chooses CP (e.g., ZooKeeper, etcd) or AP (e.g., Cassandra, DynamoDB). PACELC extends the model: even without a partition (Else), you still trade Latency against Consistency during normal operation — a critical nuance when choosing a data store.
Latency Numbers Every Engineer Should Know
These hardware and network latency figures are order-of-magnitude benchmarks that inform every architectural decision around caching, replication, and data locality. Keeping them in mind prevents over-engineering fast paths and under-engineering slow ones. The commonly cited figures (popularised by Jeff Dean) are approximately:

| Operation | Latency |
|---|---|
| L1 cache reference | 0.5 ns |
| Main memory reference | 100 ns |
| Compress 1 KB (Snappy) | 3 µs |
| Send 1 KB over 1 Gbps network | 10 µs |
| Random 4 KB read from SSD | 150 µs |
| Read 1 MB sequentially from memory | 250 µs |
| Round trip within same datacenter | 500 µs |
| Read 1 MB sequentially from SSD | 1 ms |
| HDD disk seek | 10 ms |
| Read 1 MB sequentially from HDD | 20 ms |
| Round trip, cross-continent | 150 ms |
Key insight: memory is 100× faster than SSD, SSD is 100× faster than HDD, and in-DC networking is 300× faster than cross-continent. This drives every caching and replication decision.
Back-of-the-Envelope Estimation
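A worked example makes the technique concrete. All numbers below are illustrative assumptions for a hypothetical photo-sharing service, not real product figures:

```python
# Back-of-the-envelope sizing for a hypothetical photo-sharing service.
# Every input here is an assumption chosen for round-number arithmetic.

SECONDS_PER_DAY = 86_400

daily_active_users = 10_000_000
reads_per_user_per_day = 20
uploads_per_user_per_day = 0.5
avg_photo_bytes = 2 * 1024 * 1024  # assume ~2 MB per photo

read_qps = daily_active_users * reads_per_user_per_day / SECONDS_PER_DAY
peak_read_qps = read_qps * 3  # rule of thumb: peak is ~2-3x average

daily_storage_gb = (daily_active_users * uploads_per_user_per_day
                    * avg_photo_bytes) / 1024**3
five_year_storage_tb = daily_storage_gb * 365 * 5 / 1024

print(f"avg read QPS:  {read_qps:,.0f}")
print(f"peak read QPS: {peak_read_qps:,.0f}")
print(f"storage/day:   {daily_storage_gb:,.0f} GB")
print(f"5-year total:  {five_year_storage_tb:,.0f} TB")
```

The point is not precision but scale: knowing whether you need thousands or millions of QPS, and gigabytes or petabytes, changes the architecture entirely.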
Architectural Patterns
Monolith → Microservices Spectrum
Monolith
Single deployable unit. Shared database. Simple to develop, test, and deploy initially. Coupling grows with team size. Fine for most startups and small teams.
Use when: <10 developers, single deployment cadence, tight coupling is acceptable.
Modular Monolith
Single deployment, but internally structured into bounded modules with clear APIs between them. Each module owns its data. Can be split into services later if needed. The sweet spot for most teams.
Use when: 5–30 developers, want service boundaries without operational overhead.
Microservices
Independently deployable services, each owning its data store. Enables team autonomy, independent scaling, and polyglot tech. Comes with massive operational overhead: service mesh, distributed tracing, eventual consistency.
Use when: 50+ developers, different parts need independent scaling/deployment.
Key Architectural Styles
Layered (N-Tier)
Presentation → Business Logic → Data Access → Database. Each layer depends only on the one below. Simple, well-understood, the default for most CRUD APIs. Risk: layers become pass-throughs.
Clean / Hexagonal
Domain at the centre, infrastructure at the edges. Dependencies point inward — domain never knows about HTTP, databases, or frameworks. Ports (interfaces) and adapters (implementations). Enables testability and framework independence.
Event-Driven
Components communicate via events (messages). Producers don't know consumers. Loose coupling, natural audit trail, supports eventual consistency. Essential for CQRS, saga patterns, and reactive systems.
CQRS
Separate models for reads and writes. Write model: normalised, optimised for consistency. Read model: denormalised, optimised for queries. Often paired with event sourcing. Use when read/write patterns differ drastically.
Event Sourcing
Instead of overwriting a balance, every financial operation is recorded as an immutable event, and current state is derived by replaying the log. This makes auditing trivial and allows reconstructing state at any point in time — critical for banking, e-commerce, and compliance-heavy domains.
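A minimal sketch of the idea (illustrative account model and event names, not a production framework):

```python
# Event-sourcing sketch: state is never stored directly — it is derived
# by folding over an append-only log of immutable events.

from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    kind: str     # "Deposited" or "Withdrew" (illustrative event names)
    amount: int   # cents, to avoid floating-point rounding

def balance(events: list[Event]) -> int:
    """Replay the event log to compute current state."""
    total = 0
    for e in events:
        total += e.amount if e.kind == "Deposited" else -e.amount
    return total

log = [Event("Deposited", 10_000), Event("Withdrew", 2_500), Event("Deposited", 500)]
print(balance(log))       # current balance from full history
print(balance(log[:2]))   # replay a prefix: state as of any past point in time
```

Replaying a prefix of the log is exactly the "reconstruct state at any point in time" property — no extra bookkeeping is required.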
Domain-Driven Design (DDD) — Key Concepts
Bounded Context
A clear boundary around a domain model. Each context has its own ubiquitous language and can use different technology.
Aggregate
A cluster of domain objects treated as a single unit for data changes. One aggregate root enforces invariants.
Entity
An object with identity that persists over time (e.g., Order, User). Equality by ID, not by attributes.
Value Object
An immutable object defined by its attributes (e.g., Money, Address). No identity. Equality by value.
Domain Event
Something significant that happened in the domain. OrderPlaced, PaymentReceived. Drives inter-context communication.
Anti-Corruption Layer
A translation layer between your domain and an external system's model. Prevents foreign concepts from leaking in.
Data Layer
Database Selection
| Type | Examples | Best For | Trade-off |
|---|---|---|---|
| Relational (SQL) | PostgreSQL, SQL Server, MySQL | ACID transactions, complex joins, structured data | Scaling writes horizontally is hard |
| Document | MongoDB, Cosmos DB, Firestore | Flexible schemas, nested data, rapid prototyping | No joins; denormalise or multiple queries |
| Key-Value | Redis, DynamoDB, Memcached | Caching, sessions, leaderboards, high-speed lookups | No complex queries or relationships |
| Wide-Column | Cassandra, HBase, Bigtable | Time-series, IoT, write-heavy at massive scale | Limited query patterns; must model around access |
| Graph | Neo4j, Amazon Neptune | Relationships: social graphs, fraud detection, recommendations | Niche; poor for non-graph workloads |
| Search | Elasticsearch, Meilisearch | Full-text search, log analytics, faceted filtering | Not a primary store; eventual consistency |
| Time-Series | InfluxDB, TimescaleDB, Prometheus | Metrics, monitoring, IoT sensor data | Optimised for append; poor for random updates |
Caching Strategies
Cache-Aside (Lazy Loading)
App checks cache first. On miss, reads from DB, writes result to cache. Most common pattern. Risk: stale data until TTL expires or explicit invalidation.
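A sketch of the pattern (an in-memory dict stands in for Redis, and `load_user_from_db` is a hypothetical query, not a real API):

```python
# Cache-aside: the application owns the caching logic. On a miss, read the
# source of truth and lazily populate the cache with a TTL.

import time

cache: dict[int, tuple[float, dict]] = {}   # user_id -> (stored_at, value)
TTL_SECONDS = 60.0

def load_user_from_db(user_id: int) -> dict:
    # Stand-in for a real SELECT against the primary store.
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id: int) -> dict:
    entry = cache.get(user_id)
    if entry and time.monotonic() - entry[0] < TTL_SECONDS:
        return entry[1]                         # cache hit
    user = load_user_from_db(user_id)           # miss: go to the database
    cache[user_id] = (time.monotonic(), user)   # populate for next time
    return user
```

Note the staleness window: until the TTL expires (or the key is explicitly invalidated), readers may see an outdated value.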
Write-Through
Write to cache and DB synchronously on every write. Cache always has fresh data. Higher write latency. Use when read-after-write consistency matters.
Write-Behind (Write-Back)
Write to cache immediately, asynchronously flush to DB in batches. Fastest writes but risk of data loss on cache failure. Use for analytics counters, view counts.
Cache Invalidation
Famously one of the two hard problems in computer science. Options: TTL-based expiry (simple, some staleness), event-driven invalidation (publish on DB change), versioned keys (append a version to the cache key).
Replication & Partitioning
Replication
Copies of the same data on multiple nodes. Leader-follower (one writes, many read) or multi-leader. Improves read throughput and availability. Risk: replication lag causes stale reads.
Partitioning (Sharding)
Split data across nodes by a partition key (user_id, region, hash). Each shard holds a subset. Improves write throughput and storage capacity. Risk: hot partitions, cross-shard queries, rebalancing.
Consistency Models
Different distributed systems offer different guarantees about when a write becomes visible to subsequent reads. Choosing the right model is a direct trade-off between correctness, latency, and availability — there is no universally correct answer.
Scaling Strategies
Load Balancing
A load balancer distributes incoming traffic across multiple servers to prevent any single instance from becoming a bottleneck. The choice of layer (L4 vs L7) and routing algorithm directly impacts session handling, performance, and operational complexity.
Horizontal Scaling Playbook
Stateless Services
Store no session data in the server. Externalise state to Redis/DB. Any instance can handle any request.
Database Read Replicas
Route reads to replicas, writes to the primary. With a typical 10:1 read/write ratio, each added replica multiplies capacity exactly where the load is.
Sharding
Partition data by key. Each shard is an independent database. Range-based, hash-based, or directory-based.
CDN
Cache static/dynamic content at edge. Reduces origin load and latency. Cloudflare, CloudFront, Fastly.
Connection Pooling
Reuse DB connections instead of opening new ones per request. PgBouncer, HikariCP.
Auto-Scaling
Add/remove instances based on metrics (CPU, queue depth, request count). Horizontal pod autoscaler in K8s.
Rate Limiting
Rate limiting controls how many requests a client can make in a given time window, protecting services from abuse and overload. Each algorithm makes different trade-offs between memory usage, precision, and burst tolerance — the right choice depends on your traffic shape and enforcement strictness requirements.
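As one example, the token bucket permits bursts up to a fixed capacity while enforcing a sustained rate. A sketch with an injected clock for determinism (parameter names are illustrative):

```python
# Token bucket: capacity bounds the burst size, refill_rate bounds the
# sustained request rate. Time is passed in explicitly for testability.

class TokenBucket:
    def __init__(self, capacity: float, refill_rate: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_rate = refill_rate   # tokens added per second
        self.tokens = capacity           # start full: allow an initial burst
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=3, refill_rate=1.0)
print([bucket.allow(0.0) for _ in range(4)])  # burst of 3 allowed, 4th rejected
print(bucket.allow(2.0))                      # 2s later: tokens refilled, allowed
```

In a distributed deployment the same state typically lives in Redis, updated atomically via a Lua script so that refill and decrement happen as one operation.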
Messaging & Async
Message Queue vs Event Stream
Message Queue
Point-to-point. Message consumed by one consumer then deleted. Work distribution. RabbitMQ, SQS, Azure Service Bus.
Use for: task queues, job processing, command dispatch.
Event Stream
Pub-sub with retention. Multiple consumers replay independently. Append-only log. Kafka, Kinesis, Pulsar, EventHub.
Use for: event sourcing, CDC, analytics pipelines, fan-out.
Messaging Guarantees
In practice, delivery guarantees come in three flavours: at-most-once (fast, may drop messages), at-least-once (safe, may duplicate), and exactly-once (expensive, usually approximated as at-least-once plus idempotent consumers). Understanding these semantics is essential for designing consumers that remain correct even when messages are retried or reordered.
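At-least-once delivery — the common default — means consumers will see duplicates. A sketch of an idempotent consumer (hypothetical message shape; an in-memory set stands in for a persistent dedup store):

```python
# Idempotent consumer: track processed message IDs so a redelivered message
# is acknowledged without repeating its side effect. In production the seen-ID
# set lives in Redis or the database, inside the same transaction as the work.

processed_ids: set[str] = set()
counter = {"value": 0}   # stand-in for the real side effect

def handle(message_id: str, amount: int) -> bool:
    """Returns True if the message did work, False if it was a duplicate."""
    if message_id in processed_ids:
        return False            # duplicate redelivery: safe to ack and drop
    counter["value"] += amount  # the actual side effect
    processed_ids.add(message_id)
    return True

handle("msg-1", 10)
handle("msg-1", 10)       # redelivered by the broker after a timeout
print(counter["value"])   # side effect applied once, not twice
```

The key design point: the dedup check and the side effect must be atomic, otherwise a crash between them reintroduces the duplicate problem.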
Saga Pattern (Distributed Transactions)
Choreography
Each service listens for events and decides what to do. No central coordinator. Simple for 2–3 steps. Becomes a tangled mess at 5+ steps — hard to track overall state.
Orchestration
A central orchestrator (saga coordinator) tells each service what to do and handles failures. Easier to reason about, debug, and monitor. Single point of failure — make it resilient.
API Design
REST vs GraphQL vs gRPC
| Aspect | REST | GraphQL | gRPC |
|---|---|---|---|
| Protocol | HTTP + JSON | HTTP + JSON | HTTP/2 + Protobuf |
| Contract | OpenAPI / Swagger | Schema (SDL) | .proto files |
| Data Shape | Server defines | Client defines | Server defines |
| Strengths | Simple, cacheable, ubiquitous | Flexible queries, no over/under-fetching | Fastest, streaming, strong typing |
| Weaknesses | Over-fetching, N+1 endpoints | Complexity, no HTTP caching | Not browser-native, less tooling |
| Best For | Public APIs, CRUD, general-purpose | Mobile clients, dashboard aggregation | Service-to-service, real-time, internal |
REST Best Practices
RESTful APIs use HTTP methods and resource-based URLs to provide a uniform, stateless interface. Following these conventions makes APIs predictable and interoperable — correct use of verbs, status codes, and pagination patterns is what separates a professional API from a brittle one.
API Versioning
URL Path
/api/v1/ → /api/v2/. Most explicit. Easy to route. Widely adopted.
Header
Accept: application/vnd.api.v2+json. Clean URLs but harder to test.
Query Param
?version=2. Simple but pollutes query string. Good for internal APIs.
Reliability & Resilience
Failure Patterns & Mitigations
Circuit Breaker
Track failure rate of downstream calls. When threshold exceeded, trip the circuit — fail fast instead of waiting for timeouts. After a cooldown, allow probe requests. States: Closed (normal) → Open (failing) → Half-Open (testing). Polly (.NET), Resilience4j (Java), opossum (Node).
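A minimal sketch of the state machine (illustrative, not the Polly or Resilience4j API; consecutive-failure counting and an injected clock keep it simple and testable):

```python
# Circuit breaker: Closed -> Open after N consecutive failures; Open rejects
# calls instantly; after a cooldown, one probe is allowed (Half-Open).

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, cooldown: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at: float | None = None   # None means Closed

    def call(self, fn, now: float):
        if self.opened_at is not None:
            if now - self.opened_at < self.cooldown:
                raise CircuitOpenError("failing fast")  # Open: reject instantly
            self.opened_at = None                       # Half-Open: allow a probe
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = now                    # trip back to Open
            raise
        self.failures = 0                               # success closes the circuit
        return result
```

Production implementations typically track a failure *rate* over a sliding window rather than a consecutive count, but the state transitions are the same.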
Retry with Backoff
Retry failed requests with exponential delay + jitter. Prevents thundering herd. Cap retries (3–5). Only retry on transient errors (5xx, network timeout) — never on 4xx.
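A sketch of the pattern using "full jitter" (the retryable exception types, base delay, and cap are illustrative choices):

```python
# Retry with exponential backoff + full jitter: sleep a uniform random amount
# in [0, min(cap, base * 2**attempt)], so synchronized clients don't all retry
# at the same instant (thundering herd).

import random
import time

TRANSIENT = (TimeoutError, ConnectionError)   # never retry client (4xx) errors

def retry(fn, retries: int = 4, base: float = 0.5, cap: float = 30.0):
    """Call fn, retrying only transient failures, with a bounded retry budget."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except TRANSIENT:
            if attempt == retries:
                raise                         # budget exhausted: surface the error
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

Full jitter deliberately randomises the whole delay window rather than adding a small offset, which spreads a synchronized burst of retries most evenly.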
Bulkhead
Isolate components so one failure doesn't cascade. Separate thread pools, connection pools, or service instances per dependency. A slow DB query shouldn't starve the API of threads for fast requests.
Timeout
Always set timeouts on every external call. No timeout = potential thread leak. Cascade: if service A calls B calls C, A's timeout must be > B's + network overhead.
Health Checks & Graceful Degradation
Health checks tell the orchestration layer whether an instance is running and ready to serve traffic. Distinguishing liveness from readiness prevents bad deploys from receiving requests, and graceful degradation keeps core functionality alive even when supporting systems fail.
Disaster Recovery
RTO
Recovery Time Objective — max acceptable downtime. How fast can you recover?
RPO
Recovery Point Objective — max data loss window. How much data can you afford to lose?
Multi-Region
Active-passive (failover) or active-active (both serve traffic). Active-active is harder but gives lower latency.
Observability
Three Pillars
Metrics
Numeric measurements over time. Counters (request count), gauges (CPU %), histograms (latency distribution). Prometheus, Datadog, CloudWatch. Alert on SLOs, not individual metrics.
Logs
Structured JSON logs with correlation IDs. Aggregate centrally. ELK stack (Elasticsearch + Logstash + Kibana), Loki + Grafana, Datadog Logs. Log at the right level — INFO for business events, ERROR for failures.
Distributed Tracing
Follow a request across service boundaries. Each service adds a span with timing. Jaeger, Zipkin, Datadog APM, OpenTelemetry. Essential for debugging latency in microservices.
SLIs, SLOs, and SLAs
SLIs, SLOs, and SLAs form a hierarchy that connects raw metrics to engineering targets to business contracts. Getting this hierarchy right lets teams make data-driven decisions about reliability investments and deployment risk using the concept of an error budget.
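The error-budget arithmetic is simple but worth making explicit; a sketch assuming a 30-day month:

```python
# Error budget: the downtime an SLO permits. A 99.9% monthly availability SLO
# leaves 0.1% of the month as budget the team may "spend" on risky deploys.

slo = 0.999
minutes_per_month = 30 * 24 * 60                # 43,200 minutes in a 30-day month
budget_minutes = minutes_per_month * (1 - slo)  # allowed downtime

print(f"{budget_minutes:.1f} minutes/month")    # 43.2 minutes/month
```

When the budget is exhausted, feature deploys pause and reliability work takes priority — that is the data-driven decision the hierarchy enables.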
The RED and USE Methods
RED (for services)
Rate — requests/sec
Errors — failed requests/sec
Duration — latency histogram
Monitors the user experience. Dashboard every service with these three.
USE (for resources)
Utilisation — % capacity used
Saturation — work queued
Errors — error count
Monitors infrastructure: CPU, disk, network, DB connections.
Security Architecture
Authentication & Authorisation
Authentication proves identity; authorisation controls what an identity can do. The dominant production protocols and models range from user-facing OAuth 2.0 / OpenID Connect flows to fine-grained, relationship-based access control (as in Google's Zanzibar, which authorises sharing at the scale of Google Drive).
Defence in Depth
Transport
TLS 1.3 everywhere. HSTS headers. Certificate pinning for mobile.
API Gateway
Rate limiting, auth validation, input sanitisation at the edge.
Network
VPC, private subnets, security groups. Zero-trust: verify every request.
Data at Rest
AES-256 encryption. Key rotation via KMS. Separate keys per tenant.
Secrets
Never in code or env vars. Use Vault, AWS Secrets Manager, Azure Key Vault.
Supply Chain
SCA scanning (Snyk, Dependabot). Lock dependencies. Audit transitive deps.
Cloud-Native Patterns
Containerisation & Orchestration
Containers package an application and its dependencies into a portable, immutable unit, eliminating environment drift. Kubernetes extends this by automating scheduling, scaling, self-healing, and service discovery across a cluster — understanding its core primitives is essential for operating any modern production workload.
Infrastructure as Code
Terraform
Declarative, cloud-agnostic. State file tracks resources. Plan → Apply workflow. The standard for multi-cloud.
Pulumi
IaC in real programming languages (TS, Python, C#). Full IDE support. Good for teams that prefer code over HCL.
CloudFormation
AWS-native. Deep integration. YAML/JSON. Use CDK (TypeScript) for a better authoring experience.
CI/CD Pipeline
A CI/CD pipeline automates the path from a code commit to a running production deployment, enforcing quality gates at each stage. The deployment strategy you choose — rolling, blue/green, or canary — determines how much risk you take on with each release and how quickly you can roll back.
System Design Interview
The Framework (45 Minutes)
A five-step framework maps directly to a 45-minute interview slot: clarify functional and non-functional requirements, do back-of-the-envelope estimation, sketch the high-level design, deep-dive into one or two components, and wrap up. Each step serves a specific purpose: requirements gathering prevents wasted effort, estimation grounds your design decisions in real numbers, and the explicit wrap-up is where many candidates lose points by failing to articulate trade-offs.
Common Questions & Key Components
| System | Core Challenge | Key Components |
|---|---|---|
| URL Shortener | Unique ID generation at scale | Base62 encoding, distributed ID (Snowflake), redirect cache |
| Rate Limiter | Distributed counting | Redis + Lua scripts, token bucket, sliding window |
| Chat System | Real-time delivery, presence | WebSockets, Kafka, Cassandra, pub-sub channels |
| News Feed | Fan-out: push vs pull | Push (write-time fan-out) for most users, pull for celebrities |
| Search Engine | Inverted index, ranking | Web crawler → indexer → Elasticsearch → ranking algorithm |
| Notification System | Multi-channel, rate limiting | Priority queue, templating, per-user preferences, dedup |
| Payment System | Exactly-once semantics | Idempotency keys, two-phase commit, event sourcing, ledger |
| File Storage (S3) | Durability, chunking | Erasure coding, metadata DB, content-addressable storage |
| Video Streaming | Transcoding, adaptive bitrate | DAG-based pipeline, CDN, HLS/DASH, pre-signed URLs |
| Distributed Cache | Consistent hashing, eviction | Consistent hash ring, LRU/LFU, replication, write-through |
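The URL-shortener row above mentions Base62 encoding; a minimal sketch turning a numeric ID (e.g., from a Snowflake-style generator) into a short, URL-safe code (the alphabet ordering is a convention, not a standard):

```python
# Base62: encode an integer ID using [0-9a-zA-Z], giving compact URL-safe
# codes — a 64-bit ID fits in at most 11 characters.

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode(n: int) -> str:
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, rem = divmod(n, 62)     # peel off base-62 digits, least significant first
        out.append(ALPHABET[rem])
    return "".join(reversed(out))

def decode(s: str) -> int:
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    return n

print(encode(125))   # "21"  (125 = 2*62 + 1)
```

Encoding a pre-generated unique ID avoids the collision handling that hashing the URL would require.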
Trade-Off Cheat Sheet
Every system design decision involves giving something up: consistency vs availability, latency vs durability, normalised vs denormalised data, monolith simplicity vs microservice autonomy. Naming the trade-off explicitly — and explaining why this workload lands on this side of it — is what distinguishes a senior-level answer, in interviews and in production alike.