Chapter 2 - defining nonfunctional requirements

last updated: Apr 20, 2026

The authors define a functional requirement as "the functionality that the application must offer", and the nonfunctional requirements as everything else you need to do.

Right off the bat, I struggle with this definition! They give performance as an example of a nonfunctional requirement, but obviously that's a matter of degree. Let's see if they clarify, or just allow the definition to be very fuzzy at the edges.

Twitter example

They start off with a motivating example, of a common interview question: let's design a service like twitter. Given three tables:

users
  id | screen_name | profile_image
  12 | jack        | 123.png

posts
  id | sender_id fk(users) | timestamp | text
  20 | 12                  | 123456    | just setting up my twttr

follows
  follower_id fk(users) | followee_id fk(users)
  9923882               | 12

They give the first pass SQL query to create a timeline page:

SELECT posts.*, users.*
  FROM posts
  JOIN follows ON posts.sender_id = follows.followee_id
  JOIN users   ON posts.sender_id = users.id
 WHERE follows.follower_id = current_user
 ORDER BY posts.timestamp DESC
 LIMIT 1000

This query will be expensive, so we will in practice need to materialize the timeline. For each user, we store their home timeline; when a user posts, we look up all their followers and insert that post into their timeline.

Fan-out means the factor by which the number of requests increases in such a scenario: one incoming post turns into one timeline write per follower.
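The fan-out-on-write approach can be sketched in a few lines; this is a toy in-memory version (the dicts stand in for real tables or caches, and all ids are the example values above):

```python
from collections import defaultdict

# Hypothetical in-memory stores standing in for the real tables/caches.
followers = defaultdict(set)   # user_id -> set of follower ids
timelines = defaultdict(list)  # user_id -> list of (timestamp, sender_id, text)

def follow(follower_id, followee_id):
    followers[followee_id].add(follower_id)

def post(sender_id, timestamp, text):
    # Fan-out on write: one post becomes len(followers[sender_id])
    # timeline inserts, one per follower.
    for follower_id in followers[sender_id]:
        timelines[follower_id].append((timestamp, sender_id, text))

follow(9923882, 12)  # user 9923882 follows jack (id 12)
follow(555, 12)
post(12, 123456, "just setting up my twttr")
# fan-out factor here is 2: one write became two timeline inserts
```

Reading a timeline is then just `timelines[user_id]`, already materialized, at the cost of doing the joins at write time.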

describing performance

Two main types of performance metric: response time and throughput.

Generally, response time increases as throughput increases.

They have a brief sidebar about using jitter and exponential backoff plus circuit breakers to avoid thundering herd issues, but don't dive into it.
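Since they don't dive into it, here's a minimal sketch of the jitter-plus-backoff part, using the "full jitter" variant (sleep a uniformly random amount up to the exponential ceiling); the base, cap, and attempt count are made-up parameters:

```python
import random

def backoff_delays(base=0.1, cap=10.0, attempts=6):
    """Full-jitter exponential backoff: each retry sleeps a random amount
    between 0 and min(cap, base * 2**attempt), so retries from many
    clients spread out instead of arriving in synchronized waves."""
    return [random.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]

for n, delay in enumerate(backoff_delays()):
    print(f"retry {n}: sleep {delay:.3f}s")
    # time.sleep(delay)  # then actually retry the request
```

Without the jitter, every client that saw the same failure retries at the same exponentially spaced instants, recreating the thundering herd on each retry wave.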

response time is usually what users care about the most, whereas the throughput determines the required computing resources

There's a brief discussion of median/mean/percentiles.

Amazon describes response time requirements for internal services in terms of the 99.9th percentile... Optimizing the 99.99th percentile (the slowest 1 in 10,000 requests) was deemed too expensive and found to not yield enough benefit
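To make the statistics concrete, here's a sketch computing them over a simulated latency sample (nearest-rank percentile; the distribution and all numbers are made up):

```python
import math
import random

random.seed(0)
# Simulated response times: mostly fast, with a long tail of slow requests.
latencies_ms = [random.expovariate(1 / 50) for _ in range(10_000)]

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of
    samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

mean = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
p999 = percentile(latencies_ms, 99.9)
print(f"mean={mean:.1f}ms  p50={p50:.1f}ms  p99={p99:.1f}ms  p99.9={p999:.1f}ms")
```

On a skewed distribution like this, the mean sits well above the median because the tail drags it up, which is exactly why latency SLOs are stated in percentiles rather than averages.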

The authors point to several sources for the "performance == money" idea, and suggest that they're all insufficient and somewhat contradictory, and that we don't really know how much performance is functional, in the sense of making more money from users.

They point to the tail at scale to describe tail latency amplification, whereby as a single service request makes more and more requests to other endpoints, it becomes more likely that one of them suffers a tail latency spike.
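The amplification effect falls out of basic probability: if backend calls are independent and each is slow with probability p, a request that fans out to n of them is slow whenever any one is.

```python
def prob_any_slow(p, n):
    """Probability that at least one of n independent backend calls
    (each slow with probability p) hits the latency tail."""
    return 1 - (1 - p) ** n

print(prob_any_slow(0.01, 1))    # ~1% at a single backend
print(prob_any_slow(0.01, 100))  # but ~63% across 100 backends
```

The independence assumption is a simplification (real tail events are often correlated), but it shows why even rare slowness at individual services dominates end-to-end latency at high fan-out.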

There's a brief discussion of how SLOs and SLAs may use percentiles to define their expectations, and an important sidebar that averaging percentiles is meaningless. I've seen people make that mistake many times in practice, and it's possible I have too (I certainly have; sorry).
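A quick way to see why averaging percentiles goes wrong: two hypothetical servers with very different traffic volumes and latency profiles (all numbers simulated):

```python
import random

random.seed(1)

def p99(values):
    ordered = sorted(values)
    return ordered[int(0.99 * len(ordered))]

# Server A: 9,900 requests, ~10ms typical latency.
fast = [random.expovariate(1 / 10) for _ in range(9_900)]
# Server B: only 100 requests, but ~200ms typical latency.
slow = [random.expovariate(1 / 200) for _ in range(100)]

avg_of_p99s = (p99(fast) + p99(slow)) / 2
true_p99 = p99(fast + slow)
print(f"average of per-server p99s: {avg_of_p99s:.0f}ms")
print(f"p99 of the combined traffic: {true_p99:.0f}ms")
# The average weights both servers equally, ignoring that one handled
# 99x more requests. Aggregate the raw values (or histograms), not the
# precomputed percentiles.
```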

reliability and fault tolerance

reliability is, roughly, continuing to work correctly, even when things go wrong

They distinguish between faults and failures:

They are the same thing at different levels

I'm glad they said that, it was my immediate objection to that definition!

We call a system fault-tolerant if it continues providing the required service to users in spite of faults occurring, and a part is a single point of failure if its failure brings down the whole system.

Counterintuitively, in fault-tolerant systems, it can make sense to increase the rate of faults by triggering them deliberately -- for example, by randomly killing individual processes without warning. This is called fault injection... by deliberately inducing faults, you ensure that the fault-tolerance machinery is continually exercised and tested
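A toy sketch of the idea, assuming injected faults at the client-wrapper level rather than killed processes (the names, fault rate, and retry count are all made up):

```python
import random

random.seed(42)

class Flaky:
    """A service wrapper that deliberately injects faults: each call
    fails with probability `rate`, so the retry/failover machinery gets
    exercised constantly instead of only during real outages."""
    def __init__(self, func, rate=0.2):
        self.func, self.rate = func, rate

    def __call__(self, *args):
        if random.random() < self.rate:
            raise ConnectionError("injected fault")
        return self.func(*args)

def call_with_retries(service, *args, attempts=5):
    for _ in range(attempts):
        try:
            return service(*args)
        except ConnectionError:
            continue
    raise RuntimeError("all retries exhausted")

lookup = Flaky(lambda user_id: {"id": user_id, "screen_name": "jack"})
print(call_with_retries(lookup, 12))
```

Production fault injection (killing whole processes or nodes, chaos-engineering style) works at a coarser level, but the payoff is the same: the recovery path runs often enough that you find out quickly when it's broken.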

Hardware and software faults

Hardware redundancy increases the uptime of a single machine... using a distributed system has advantages, such as being able to tolerate a complete outage of one datacenter. For this reason, cloud systems tend to focus less on the reliability of individual machines and instead aim to make services highly available by tolerating faulty nodes at the software level. Cloud providers use availability zones to identify which resources are physically co-located

hardware failures are often less correlated than software faults, because it is common for many nodes to run the same software and thus have the same bugs

Human beings

One study of large internet services found that configuration changes by operators were the leading cause of outages, whereas hardware faults played a role in only 10-25% of cases [72]

ed: that study is from 2003, I wonder if there's anything more recent

What we call "human error" is... a symptom of a problem with the sociotechnical system in which people are trying their best to do their jobs

They cite "the field guide to understanding human error", which I had for a while but failed to read

Scalability

Scalability is the term we use to describe a system's ability to cope with increased load
