How to find problems in a real system design
To find problems in a real system design, work from the outside in:
- Start with the goal. What is the system supposed to do, for whom, and at what scale?
- Trace the main user flow end to end. Follow one request through every service, database, queue, cache, and dependency.
- Check the weak spots in each layer:
  - latency and throughput
  - single points of failure
  - data consistency
  - retries and idempotency
  - timeouts and backpressure
  - scaling limits
  - security and privacy
  - observability and alerting
- Look at edge cases: bad network, duplicate requests, partial outages, empty data, large spikes, stale cache, and failed jobs.
- Compare expected load to actual capacity. Many problems show up only when traffic grows or a dependency slows down.
- Ask “what breaks first?” for every component.
- Review logs, metrics, traces, and incidents. Real problems usually appear there before they appear in architecture diagrams.
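The load-versus-capacity and "what breaks first?" checks above can be done as rough arithmetic. Here is a minimal sketch, assuming each component has a measured capacity in requests per second and a known number of calls per user request. The `Component` class, the helper names, and all the numbers are hypothetical and only illustrate the calculation.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    capacity_rps: float       # assumed: sustainable calls/second, e.g. from a load test
    calls_per_request: float  # assumed: calls one user request makes to this component


def utilization(component: Component, peak_rps: float) -> float:
    """Fraction of the component's capacity used at the given user-facing peak load."""
    return peak_rps * component.calls_per_request / component.capacity_rps


def breaks_first(components: list[Component], peak_rps: float) -> Component:
    """The component with the highest utilization saturates first as traffic grows."""
    return max(components, key=lambda c: utilization(c, peak_rps))


def max_supported_rps(components: list[Component]) -> float:
    """Highest user-facing request rate before any component exceeds its capacity."""
    return min(c.capacity_rps / c.calls_per_request for c in components)


# Hypothetical numbers for illustration only.
components = [
    Component("api", capacity_rps=5000, calls_per_request=1),
    Component("cache", capacity_rps=20000, calls_per_request=3),
    Component("database", capacity_rps=1500, calls_per_request=0.5),
]

if __name__ == "__main__":
    peak = 2000  # assumed expected peak, in user requests per second
    for c in components:
        print(f"{c.name}: {utilization(c, peak):.0%} of capacity")
    print("breaks first:", breaks_first(components, peak).name)
    print("max supported rps:", max_supported_rps(components))
```

With these made-up numbers the database is the first bottleneck, even though it sees the fewest calls per request. A real review would replace the constants with measured values and repeat the calculation for a slowed-down dependency, not only for peak traffic.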
A simple way to run this whole review is to ask four questions about every box in the design:
- What can fail?
- What happens when it fails?
- How do we detect it?
- How do we recover?
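One way to keep those four questions honest is to record the answers per component and list the ones still unanswered. The sketch below is only an illustration. `BoxReview`, `open_questions`, and the example answers are hypothetical names and data, not part of any standard tool.

```python
from dataclasses import dataclass

# Maps each answer field to the question it answers, in review order.
QUESTIONS = {
    "what_can_fail": "What can fail?",
    "on_failure": "What happens when it fails?",
    "detection": "How do we detect it?",
    "recovery": "How do we recover?",
}


@dataclass
class BoxReview:
    """Answers to the four questions for one box in the design."""
    name: str
    what_can_fail: str = ""
    on_failure: str = ""
    detection: str = ""
    recovery: str = ""


def open_questions(review: BoxReview) -> list[str]:
    """Return the questions that have no answer yet (empty or whitespace only)."""
    return [
        question
        for field_name, question in QUESTIONS.items()
        if not getattr(review, field_name).strip()
    ]


# Hypothetical example: a job queue with only two of the four answers filled in.
queue = BoxReview(
    name="job queue",
    what_can_fail="broker node loss",
    on_failure="jobs are delayed and producers block",
)

if __name__ == "__main__":
    for question in open_questions(queue):
        print(f"{queue.name}: unanswered -> {question}")
```

Any box that still has open questions after the review is a candidate problem area, especially when the missing answers are about detection or recovery.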
May 13, 2026