Chapter 2 - defining nonfunctional requirements
The authors define a functional requirement as "the functionality that the application must offer", and the nonfunctional requirements as everything else you need to do.
Right off the bat, I struggle with this definition! They give performance as an example of a nonfunctional requirement, but obviously that's a matter of degree. Let's see if they clarify, or just allow the definition to be very fuzzy at the edges.
Twitter example
They start off with a motivating example, of a common interview question: let's design a service like twitter. Given three tables:
users:

| id | screen_name | profile_image |
|---|---|---|
| 12 | jack | 123.png |

posts:

| id | sender_id fk(users) | timestamp | text |
|---|---|---|---|
| 20 | 12 | 123456 | just setting up my twttr |

follows:

| follower_id fk(users) | followee_id fk(users) |
|---|---|
| 9923882 | 12 |
They give the first-pass SQL query to render a user's home timeline (note the filter to posts from people the current user follows):

```sql
SELECT posts.*, users.*
FROM posts
JOIN follows ON posts.sender_id = follows.followee_id
JOIN users ON posts.sender_id = users.id
WHERE follows.follower_id = current_user
ORDER BY posts.timestamp DESC
LIMIT 1000
```
This query will be expensive, so we will in practice need to materialize the timeline. For each user, we store their home timeline; when a user posts, we look up all their followers and insert that post into their timeline.
fan-out means the factor by which the number of requests increases in such a scenario -- here, the number of followers whose timelines must be updated for each new post.
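The materialized-timeline idea (fan-out on write) can be sketched like this. This is my own toy in-memory version, with dicts standing in for the real tables and caches:

```python
from collections import defaultdict

# Hypothetical in-memory stores standing in for real tables/caches.
followers = defaultdict(set)   # followee_id -> set of follower ids
timelines = defaultdict(list)  # user_id -> list of (timestamp, sender_id, text)

def follow(follower_id, followee_id):
    followers[followee_id].add(follower_id)

def post(sender_id, timestamp, text):
    """Fan-out on write: push the new post into every follower's
    materialized home timeline instead of computing it at read time."""
    entry = (timestamp, sender_id, text)
    for follower_id in followers[sender_id]:
        timelines[follower_id].append(entry)

def home_timeline(user_id, limit=1000):
    # Reads are now cheap: return the precomputed list, newest first.
    return sorted(timelines[user_id], reverse=True)[:limit]

follow(follower_id=9923882, followee_id=12)
post(sender_id=12, timestamp=123456, text="just setting up my twttr")
```

The cost moves from read time to write time: one post triggers as many timeline inserts as the sender has followers, which is exactly the fan-out factor.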
describing performance
Two main types of performance metric:
- response time - the elapsed time from a request to a response
- throughput - the requests per second the system is processing
Generally, response time increases as throughput increases: as load approaches the system's capacity, requests spend more and more time waiting in queues.
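A quick way to see this relationship is the classic M/M/1 queueing formula (my addition, not from the chapter): mean response time is 1/(mu - lambda), where mu is the service rate and lambda the arrival rate.

```python
def mm1_mean_response_time(service_rate, arrival_rate):
    """Mean response time of an M/M/1 queue: W = 1 / (mu - lambda).
    Grows without bound as arrival rate approaches service capacity."""
    assert arrival_rate < service_rate, "system is unstable at/above capacity"
    return 1.0 / (service_rate - arrival_rate)

# With capacity for 100 req/s, response time grows nonlinearly with load:
for load in (10, 50, 90, 99):
    print(load, "req/s ->", round(mm1_mean_response_time(100, load) * 1000, 1), "ms")
```

At 10 req/s the mean response time is ~11 ms; at 99 req/s it is a full second, even though the server is "only" at 99% utilization.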
They have a brief sidebar about using jitter and exponential backoff plus circuit breakers to avoid thundering herd issues, but don't dive into it.
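Since they don't dive in, here's a minimal sketch of the retry-delay part: "full jitter" exponential backoff, where each retry waits a random duration up to an exponentially growing (and capped) bound, so clients retrying after an outage don't stampede in lockstep. Parameter values are my own illustrative choices.

```python
import random

def backoff_delays(base=0.1, cap=30.0, attempts=6):
    """Full-jitter exponential backoff: attempt n waits a uniformly random
    duration in [0, min(cap, base * 2**n)] seconds. The randomness spreads
    out a thundering herd; the cap bounds the worst-case wait."""
    return [random.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]
```

In practice this is combined with a circuit breaker, which stops sending requests entirely once the failure rate crosses a threshold, rather than retrying forever.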
response time is usually what users care about the most, whereas the throughput determines the required computing resources
- response time is what the client sees; includes all delays anywhere in a system
- service time is the duration for which the service is actively processing the client's request
- queueing delays, oddly, they don't define; the term is just set in italics with its meaning assumed to be understood
- latency is a catchall term for time during which a request is not being actively processed (i.e. during which it is latent)
- This is interesting to me because they make latency a relativistic measurement; it matters if, from your perspective, you know that work is being done on it
- We can call response time "request latency" because, from the client's perspective, it is all time during which the client can't observe the request being actively processed (or, equivalently, time during which it is not being processed by our own system)
- the author gives "network latency" as a particular example: the time that a request and response spend traveling through the network. From the network stack's perspective, though, there are several different times within that span when the request is latent, and others when the stack is actively working on it
There's a brief discussion of median/mean/percentiles
Amazon describes response time requirements for internal services in terms of the 99.9th percentile... Optimizing the 99.99th percentile (the slowest 1 in 10,000 requests) was deemed too expensive and found to not yield enough benefit
The authors point to several sources for the "performance == money" idea, and suggest that they're all insufficient and somewhat contradictory, and that we don't really know how much performance is functional, in the sense of making more money from users.
They point to the tail at scale to describe tail latency amplification, whereby as a single service request makes more and more requests to other endpoints, it becomes more likely that one of them suffers a tail latency spike.
There's a brief discussion of how SLOs and SLAs may use percentiles to define their expectations, and an important sidebar that averaging percentiles is meaningless. I've seen people make that mistake many times in practice, and it's possible I have (I certainly have (sorry))
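To make the "averaging percentiles is meaningless" point concrete, here's a toy example of my own (made-up numbers): averaging two per-shard p99s versus computing the true p99 from the merged measurements.

```python
def percentile(samples, p):
    """p-th percentile by rank: the value below which ~p% of samples fall."""
    ordered = sorted(samples)
    return ordered[min(len(ordered) - 1, int(len(ordered) * p / 100))]

# Two shards with very different tails (made-up response times, in ms):
shard_a = [10] * 99 + [20]        # p99 = 20ms
shard_b = [10] * 50 + [500] * 50  # p99 = 500ms

avg_of_p99s = (percentile(shard_a, 99) + percentile(shard_b, 99)) / 2  # 260ms
true_p99 = percentile(shard_a + shard_b, 99)                           # 500ms
```

The average of the two p99s (260ms) bears no relation to the real combined p99 (500ms). The correct approach is to merge the underlying measurements (or histograms) and recompute the percentile.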
reliability and fault tolerance
reliability is, roughly, continuing to work correctly, even when things go wrong
They distinguish between faults and failures:
- A fault occurs when a particular part of a system stops working correctly
- A failure occurs when the system as a whole stops providing the required service to the user
They are the same thing at different levels
I'm glad they said that, it was my immediate objection to that definition!
We call a system fault-tolerant if it continues providing the required service to users in spite of faults occurring, and a part is a single point of failure if its failure brings down the whole system.
Counterintuitively, in fault-tolerant systems, it can make sense to increase the rate of faults by triggering them deliberately -- for example, by randomly killing individual processes without warning. This is called fault injection... by deliberately inducing faults, you ensure that the fault-tolerance machinery is continually exercised and tested
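The shape of fault injection can be sketched in a few lines. This is my own toy version (not the book's, and far simpler than killing real processes): wrap a call so it fails with some probability, which forces the caller's fault-tolerance path (here, retries) to actually get exercised.

```python
import random

class InjectedFault(Exception):
    """Raised by the wrapper to simulate a crashed/unreachable dependency."""

def with_fault_injection(fn, fault_rate=0.1, rng=random.random):
    """Return a version of fn that fails with probability fault_rate."""
    def wrapped(*args, **kwargs):
        if rng() < fault_rate:
            raise InjectedFault("deliberately induced fault")
        return fn(*args, **kwargs)
    return wrapped

def fetch_profile(user_id):
    return {"id": user_id, "screen_name": "jack"}

flaky_fetch = with_fault_injection(fetch_profile, fault_rate=0.1)

def fetch_with_retries(user_id, attempts=5):
    """Caller code must tolerate the injected faults, e.g. by retrying."""
    for _ in range(attempts):
        try:
            return flaky_fetch(user_id)
        except InjectedFault:
            continue
    raise RuntimeError("all attempts failed")
```

Running with injection enabled in production-like environments is what keeps the retry/fallback machinery from bit-rotting until a real outage finds it.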
Hardware and software faults
Hardware redundancy increases the uptime of a single machine... using a distributed system has advantages, such as being able to tolerate a complete outage of one datacenter. For this reason, cloud systems tend to focus less on the reliability of individual machines and instead aim to make services highly available by tolerating faulty nodes at the software level. Cloud providers use availability zones to identify which resources are physically co-located
hardware faults are often less correlated than software faults, because it is common for many nodes to run the same software and thus share the same bugs
Human beings
One study of large internet services found that configuration changes by operators were the leading cause of outages, whereas hardware faults played a role in only 10-25% of cases [72]
ed: that study is from 2003, I wonder if there's anything more recent
What we call "human error" is... a symptom of a problem with the sociotechnical system in which people are trying their best to do their jobs
They cite "the field guide to understanding human error", which I had for a while but failed to read
Scalability
Scalability is the term we use to describe a system's ability to cope with increased load