How to find problems in a real system design

To find problems in a real system design, work from the outside in:

  1. Start with the goal. What is the system supposed to do, for whom, and at what scale?
  2. Trace the main user flow end to end. Follow one request through every service, database, queue, cache, and dependency.
  3. Check the weak spots in each layer:
    • latency and throughput
    • single points of failure
    • data consistency
    • retries and idempotency
    • timeouts and backpressure
    • scaling limits
    • security and privacy
    • observability and alerting
  4. Look at edge cases: bad network conditions, duplicate requests, partial outages, empty data, large traffic spikes, stale caches, and failed jobs.
  5. Compare expected load to actual capacity. Many problems show up only when traffic grows or a dependency slows down.
  6. Ask “what breaks first?” for every component.
  7. Review logs, metrics, traces, and incidents. Real problems usually appear there before they appear in architecture diagrams.
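
Steps 5 and 6 can be turned into simple arithmetic. The sketch below is one possible way to do it, not a standard tool: it assumes you can estimate, for each component, a sustainable capacity in requests per second and how many calls one user request makes to it. The `Component` class, the function names, and every number in `design` are hypothetical and only illustrate the calculation.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    capacity_rps: float       # assumed max sustainable requests per second
    calls_per_request: float  # assumed calls made to it per user request


def utilization(components, user_rps):
    """Return {name: fraction of capacity used} at the given user load."""
    return {
        c.name: (user_rps * c.calls_per_request) / c.capacity_rps
        for c in components
    }


def breaks_first(components):
    """Return the component that saturates at the lowest user load."""
    return max(components, key=lambda c: c.calls_per_request / c.capacity_rps)


def max_user_rps(components):
    """Highest user load before any component exceeds its capacity."""
    return min(
        c.capacity_rps / c.calls_per_request
        for c in components
        if c.calls_per_request > 0
    )


# Hypothetical numbers for illustration only.
design = [
    Component("api", capacity_rps=5000, calls_per_request=1),
    Component("cache", capacity_rps=20000, calls_per_request=3),
    Component("database", capacity_rps=800, calls_per_request=0.5),
    Component("queue", capacity_rps=2000, calls_per_request=1),
]

if __name__ == "__main__":
    print("breaks first:", breaks_first(design).name)
    print("max user load (rps):", max_user_rps(design))
    print("utilization at 1000 rps:", utilization(design, 1000))
```

With these made-up figures the database is the first bottleneck even though it receives the fewest calls per request, because its capacity is so much lower than the rest. A model like this ignores latency, burstiness, and slow dependencies, so treat the result as a starting point for load testing rather than a prediction.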

A simple way to do this is to ask four questions about every box in the design:

  • What can fail?
  • What happens when it fails?
  • How do we detect it?
  • How do we recover?
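
The four questions can also be kept as a small checklist so that unanswered ones stand out during a review. This is a minimal sketch under assumptions: the `FailureReview` record, its field names, and the example entries are all hypothetical, chosen only to mirror the four questions above.

```python
from dataclasses import dataclass, fields


@dataclass
class FailureReview:
    component: str
    what_can_fail: str = ""
    what_happens: str = ""
    how_detected: str = ""
    how_recovered: str = ""

    def unanswered(self):
        """Names of the questions that still have no answer."""
        return [
            f.name
            for f in fields(self)
            if f.name != "component" and not getattr(self, f.name).strip()
        ]


def gaps(reviews):
    """Map each component to its unanswered questions, skipping complete ones."""
    return {r.component: r.unanswered() for r in reviews if r.unanswered()}


# Hypothetical review entries for illustration only.
reviews = [
    FailureReview(
        component="database",
        what_can_fail="primary node goes down",
        what_happens="writes fail until failover completes",
        how_detected="health check alert",
        how_recovered="automatic failover to a replica",
    ),
    FailureReview(
        component="cache",
        what_can_fail="serves stale entries",
        what_happens="users see outdated data",
    ),
]

if __name__ == "__main__":
    for component, missing in gaps(reviews).items():
        print(component, "is missing:", ", ".join(missing))
```

Here the cache entry is flagged because nobody has written down how a stale entry would be detected or recovered from, which is the kind of gap the four questions are meant to expose.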
May 13, 2026