Error Messages Are Written for the Next Person
I have a habit. Before I write an error message or a log line, I run a quick mental simulation: if I’m the one getting paged about this in three months, at 2am, is this message going to be enough? The first draft almost never is.
“failed to process request” is probably the most-committed log message in the industry. It tells you something went wrong, and nothing else. Which request? Which user? Which layer? Where should you look next? You end up spending the first half-hour cross-referencing timestamps, spelunking through stack traces, and guessing which upstream service decided to have a bad night. That’s not debugging; that’s archaeology. A useful error message tells the next person three things: what happened, where the boundary was, and which direction to start digging. Leave out any one of those and you’ve just made someone’s evening worse for no reason.
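As a rough illustration of those three parts, here is a minimal sketch of a message builder. The function name `describe_failure`, the service names, and the context fields are all hypothetical, chosen only for the example; the point is the shape of the message, not this particular helper.

```python
def describe_failure(what: str, boundary: str, next_step: str, **context: object) -> str:
    """Build one error line: what happened, where the boundary was, where to dig."""
    parts = [what, f"boundary: {boundary}", f"look next at: {next_step}"]
    if context:
        # Sort the context fields so the same failure always produces the same line.
        parts.append(" ".join(f"{key}={value}" for key, value in sorted(context.items())))
    return "; ".join(parts)


# Hypothetical failure in a checkout flow (all names are made up for illustration).
message = describe_failure(
    "payment authorization timed out after 5s",
    boundary="checkout-api -> payments-gateway",
    next_step="payments-gateway latency dashboard",
    request_id="req-123",
    user_id="u-42",
)
print(message)
```

Compared with “failed to process request”, the resulting line names the request, the user, the failing hop, and a first place to look.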
Logs, Metrics, Traces: Each Carries Its Own Weight
These three are not interchangeable, but they get conflated constantly or collapsed into whichever one the team learned first.
Logs are for detail. They answer “what was happening at that exact moment” — and they’re most useful when they carry context: request ID, user ID, the raw error the upstream service returned. What logs are not good for is spotting trends, because at any real volume, nobody is reading them line by line.
Metrics are for trends. A graph showing your third-party API timeout rate climbing over the past hour is immediately actionable in a way that no grep over log files can match. But metrics have no detail. They tell you something is wrong; they can’t tell you which specific request triggered it or why.
Traces give you the full journey of a single request across service boundaries — how long each hop took, where it got stuck, where it died. In a microservices setup or anywhere you have a BFF sitting in front of multiple backends, traces are what turn a vague “it’s slow” into “it’s slow because service B is taking 900ms on this particular query path”.
You need all three: metrics to notice the anomaly, traces to isolate which service is responsible, and logs to understand what that service was seeing at the time. Missing any one of them leaves a blind spot that will surface at the worst possible moment.
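To make the division of labour concrete, here is a minimal sketch in which one failing call emits all three signals, tied together by a request ID. It uses only the standard library: the `metrics` counter and the `spans` list are stand-ins for a real metrics client and tracer, and `call_inventory`, `handle_request`, and the service names are hypothetical.

```python
import logging
import time
from collections import Counter
from contextlib import contextmanager

logger = logging.getLogger("checkout")

# Stand-ins for a real metrics client and tracer (assumptions for illustration).
metrics = Counter()
spans = []  # each entry: (request_id, span_name, duration_seconds)


@contextmanager
def span(request_id, name):
    """Record how long one hop took, tagged with the request it belongs to."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append((request_id, name, time.perf_counter() - start))


def call_inventory(sku):
    """Hypothetical upstream call that always times out in this sketch."""
    raise TimeoutError("inventory service did not answer within 2s")


def handle_request(request_id, sku):
    try:
        with span(request_id, "inventory.lookup"):  # trace: which hop, how long
            return call_inventory(sku)
    except TimeoutError as exc:
        metrics["inventory.timeout"] += 1  # metric: feeds the trend graph
        logger.error(  # log: the detail for this one request
            "inventory lookup failed request_id=%s sku=%s upstream_error=%r",
            request_id, sku, exc,
        )
        return None


result = handle_request("req-1", "sku-9")
```

The counter is what an alert would watch, the span says which hop failed and how long it took, and the log line carries the raw upstream error for that specific request.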
Design for Failure as the Default
The mental model matters here. The question isn’t “what do we do if this breaks?” — it’s “this will break eventually, what does recovery look like?” Third-party services go down. Deploys introduce regressions. APIs time out. These are not edge cases. They are the steady state of production software.
I was paged once at midnight over a batch report job that appeared to complete normally but was silently dropping thousands of rows. No error in the output, just a cheerful “done”. It took an hour to trace back to an upstream API that had started returning 5xx errors after a certain pagination offset, errors that our code was catching, swallowing, and continuing past. The final result looked plausible enough that nobody noticed until a downstream consumer complained about missing data.
If that retry-exhausted, give-up path had logged “giving up after 3 retries, batch ID X, page 47, upstream returned 503”, I would have been back asleep in ten minutes. Instead I lost an hour to log archaeology and had to explain the data gap to a stakeholder the next morning.
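A give-up path along those lines might look like the sketch below. This is not the code from the incident: `UpstreamError`, `PageFetchFailed`, `fetch_page_with_retries`, and the `fetch_page` callable are hypothetical names, and the sketch counts total attempts rather than retries. What matters is that the final failure is logged with its context and then raised instead of swallowed.

```python
import logging
import time

logger = logging.getLogger("batch_report")


class UpstreamError(Exception):
    """Hypothetical error raised when the upstream API returns a 5xx status."""

    def __init__(self, status: int) -> None:
        super().__init__(f"upstream returned {status}")
        self.status = status


class PageFetchFailed(Exception):
    """Raised when a page still cannot be fetched after the last attempt."""


def fetch_page_with_retries(fetch_page, batch_id, page, max_attempts=3, backoff_seconds=0.0):
    """Call fetch_page(page); on the final failure, log the context and raise."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch_page(page)
        except UpstreamError as exc:
            if attempt == max_attempts:
                # The give-up path: say what was abandoned and why, then fail loudly.
                logger.error(
                    "giving up after %d attempts, batch ID %s, page %d, upstream returned %d",
                    max_attempts, batch_id, page, exc.status,
                )
                raise PageFetchFailed(
                    f"batch {batch_id} page {page}: upstream returned {exc.status}"
                ) from exc
            logger.warning(
                "retrying batch ID %s, page %d, attempt %d/%d, upstream returned %d",
                batch_id, page, attempt, max_attempts, exc.status,
            )
            time.sleep(backoff_seconds * attempt)  # simple linear backoff
```

Whether the batch job then aborts or records the page as missing is a separate decision; either way the gap is visible in the logs and in the job’s outcome rather than hidden behind a normal-looking result.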
When you’re writing error handling, think about two things simultaneously: can the program continue safely, and can the next person understand what happened within fifteen minutes? Both are your responsibility as the author of that code.
Key Takeaways
- A useful error message covers three things: what happened, where the boundary was, and which direction to look next.
- Logs, metrics, and traces cover different questions — missing any one of them leaves a blind spot that will hurt you at exactly the wrong time.
- Designing for failure means accepting failure as the steady state and building the recovery path before things go wrong, not after.
- Silent failures are harder to debug than noisy crashes; before you swallow an error, ask whether the next person will be able to see it.
- Observability isn’t a monitoring layer you bolt on later — it’s designed alongside the error handling from the start.
Sheng’s take, drafted with Claude · part of the 2026-06-13 blog renovation, paint still drying.