Good Performance for Bad Days
The core of what I tried to communicate is that, in my view, much of the performance evaluation community is overly focused on happy-case performance (throughput, latency, scalability), and not focused enough on performance under saturation and overload.
Hypothesis 1: In large-scale systems, system performance is the single largest contributor to system availability
Hypothesis 2: In large-scale systems, unpredictable system performance is the single largest contributor to system unavailability
Closed benchmarks, where each client waits for a response before sending its next request, are too kind to realistically reflect how performance changes with load, for the simple reason that they slow their load down when latency goes up.
The real world isn’t that kind to systems. In most cases, if you slow down, you just have more work to be done later.
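To make the difference concrete, here is a minimal sketch of a toy model, not a real benchmark harness. It assumes a single FIFO server with a fixed service time, closed-loop clients with zero think time, and open-loop arrivals on a fixed schedule. The function names and all parameter values are hypothetical, chosen only for illustration.

```python
from collections import deque


def open_loop(arrival_rate, service_time, duration):
    """Requests arrive on a fixed schedule, however slow the server gets."""
    latencies = []
    server_free_at = 0.0
    for i in range(int(arrival_rate * duration)):
        arrival = i / arrival_rate
        # A request starts when it has arrived AND the server is free.
        start = max(arrival, server_free_at)
        server_free_at = start + service_time
        latencies.append(server_free_at - arrival)
    return latencies


def closed_loop(clients, service_time, duration):
    """Each client sends its next request only after the previous one completes."""
    latencies = []
    server_free_at = 0.0
    # Issue times of outstanding requests, in FIFO order.
    pending = deque([0.0] * clients)
    while pending:
        issued = pending.popleft()
        start = max(issued, server_free_at)
        server_free_at = start + service_time
        latencies.append(server_free_at - issued)
        if server_free_at < duration:
            # The client re-issues immediately (zero think time, an assumption).
            pending.append(server_free_at)
    return latencies


if __name__ == "__main__":
    # Assumed numbers: 10 ms service time, so capacity is 100 requests/second.
    closed = closed_loop(clients=20, service_time=0.01, duration=10.0)
    opened = open_loop(arrival_rate=200.0, service_time=0.01, duration=10.0)
    print(f"closed loop, worst latency: {max(closed):.2f}s")
    print(f"open loop,   worst latency: {max(opened):.2f}s")
```

In this model the closed-loop clients can never have more than 20 requests outstanding, so worst-case latency settles at roughly clients × service_time (about 0.2 seconds) and the benchmark reports a system that looks healthy. The open-loop arrivals at twice the server's capacity keep coming regardless, so the backlog and the latency grow for as long as the overload lasts, reaching roughly 10 seconds by the end of the run.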