DDIA-Chapter1-Note

Published on April 10, 2024

Chapter1 - Reliable, Scalable and Maintainable application

This chapter introduces the terminology and approach we used in the book; and gives examples in a high level of these terminology(what and how to achieve)

Data-intensive application is build from standard building blocks :
- Databases -> store data that can be used(read) again
- Caches -> speed up reads (remember the result of expensive operation )
- Search index -> allow users to filter or search by keyword
- Stream processing -> handle data asynchronously
- Batch processing -> periodically crunch a large amount of accumulated data
These blocks are highly attracted and application engines don't need to implement a storage engine, they just use them.
But databases have different characters, and we need to choose them based on the requirements. -> find the proper one.
1. Things in common, what are the difference and how they achieve the characters.

Thinking about data systems

Many tools for data storage and processing have emerged in recent years.
1. Kafka as a message queue with database-like durability guarantees
2. Redis as a datastore can be used as message queues.
Many systems has increasing demands on a single tool with many functions
1. Switching between tools by applications cold is less efficiently than one tools handle multiple tasks together.
In this book, three common concerns will be discussed :
- Reliability : system continue work correctly(performing the correct function at the desired level of performance) even in the face of adversity( hardware or software faults, and even human errors)
- Scalability: system grows(in data volume, traffic volume or complexity), there should be reasonably ways of dealing with that growth.
- Maintainability: New people can maintain and adopt the system productively.
Reliability
1. Expected behaviors:
  - Perform the function as user expected (functional reliability)
  - Tolerate user making mistakes or use softwares in an unexpected way(bad behavior prevention reliability )
  - Performance are good under an expected data load(performance reliability)
  - Prevents unauthorized access and abuse (security reliability )
2. What can effect reliability
  1. Fault -> system can cope with fault called fault tolerant or resilient
  2. 100% fault tolerate is impossible in reality
  3. Fault vs failure
    - Fault is components diverting from its spec
    - Failure: system stops providing required service
3. Trigger faults deliberately -> can increase fault tolerance
  1. Netflix Chaos Monkey : randomly kill individual process without warning -> to test the system
4. Better prevent than cure
  1. Security : if system are hacked, system are controlled or data are lost, this can be undone and cure can not be done.
  2. But in this book most are faults that can be cured.
Hardware Faults
1. Examples in a datacenter
  1. Disk crash
  2. RAM faulty
  3. Power grid has a blockout
  4. Someone unplugs the wrong network cable
2. Option1: redundancy -> good but cannot prevent totally
  1. Disk set up a RAID configuration(no fear of loss data in some degree)
  2. Server have dual power supplies and hot swappable CPUs
  3. Batteries and diesel generators for backup power
3. Option2: fault tolerance techniques (as substitution or in addition to redundancy) -> operational advantage -> e.g. rolling upgrades
System Errors
1. System errors are more like to be related compared with hardware errors.
2. Examples
  - A software bug that cause every instance of an application server to crash when given a bad input.
  - A runaway process that uses up some shared resources.(A runaway process is a process that enters an infinite loop and spawns new processes.)
  - A service that the system depends on that slows down/ or unresponsive or starts return corrupted responses.
  - Cascading failures
3. Reasons: The softwares are making some assumptions are true and these assumptions becomes false.
4. Solution
  1. Thinking carefully when designing
  2. Testing
  3. Process isolation,
  4. Allow process crashing and restart
  5. Measuring and monitoring
Human Errors
1. Approaches :
  1. Design carefully. e.g. abstractions, APIs and interfaces...
  2. Decouple the places that people making most mistake. e.g. sandbox env
  3. Test thoroughly at all levels; e.g. automation tests
  4. Allow quick and easy recovery: quick rollback and rollout new code gradually; provide tools that recompute data
  5. Detailed and clear monitoring. e.g. performance metrics and error rates -> telemetry
  6. Management practices and training
How important of Reliability?
1. Cost, reputation and trust
2. There are situations that we want to cut cost by sacrifice reliability
Scalability
1. Attribute of how system coping with increasing load.
Describing load
1. Load has different meaning in terms of architecture
  1. Request per second to a web service
  2. Ratio of reads to write in a database
  3. The number of simultaneously active users in a chat room
  4. Hit rate on a cache
  5. ...
  Something that's bottleneck
2. An example : twitter
  1. post tweet : a user can publish a new message to their followers (4.6k request/sec on average, over 12k request/sec at peak)
  2. Home timeline: a user can view tweets posted by the people they follow (300k request / sec)
3. Fan-out: one user follows many people and also is followed by many people.
4. Two ways of implementing the operations of twitter:
  1. Posting tweet will insert a new tweet into a global collection of tweets. When a user request their home time, lok up all the people they follow and find all the tweets for each of those users, merge them(sorted by time).
    2. sql SELECT tweets.*, users.* FROM tweets JOIN users ON tweets.sender_id = users.id JOIN follows ON follows.followee_id = users.id WHERE follows.follower_id = current_user
  2. Maintain a cache for each users's home timeline. When a user posts a tweet, look up all the followers and insert their caches with the new tweet into their timeline cace. The read to home timeline is cheaper since all the result is pre- calculated.
5. Tweeter switch from approach 1 to 2 as the data load increasing -> since read is much more than write, so do some additional work in writing time is ok.
6. Approach 2 has a problem is that for celebrities, they have much more followers then others and if they post a tweets, there will be huge amount of writes -> whose overhead cannot be ignored(considering they are aiming for 5s target to finish home timeline load)
7. In the end they do a hybrid way: normal people approach 2 and celebrities are adopting approach 1 to avoid too much writes.
Describing performance
1. With data load increasing, how is the performance affected ? Or how many system resources needed if you want to keep the performance not affected.
2. Batch processing (Hadoop): throughput is important (number of request can be processed per sec); total time for a database to run a job in a certain size; service response time in an online system
  1. Response vs latency: sometimes used as synonymously but have slight difference: response is for client , which include the time of processing and the time of transmission in the network; while the latency is the duration of a request to be handled.(in server side)
3. Single response in a system has no use; we need to think about the distribution of the response times overall.
  1. Same request to server can have different processing time due to data size;
  2. Even we assume they are equal, there are still variation due to random latency like: a GC, a page fault forcing read from disk; mechanical vibration in the server rack, context switch to a background process ...
  3. Average / mean is not good because it's not typical, you don't know how many sets actually experiencing the delay.
  4. Percentiles: is much more useful: e.g. median(50th percentile) lets you know how long half of users typically wait.
    1. P99, p95, p999 are common , they are referred as tail latency
    2. Tail latency effects user experiences.
    3. There is a balance of cost and optimizing the last 1 or 0.1 percentile of response time. Besides, those are sometimes effected by random/unexpected response which cannot be controlled
  5. SLO: service level object vs SLA: service level agreement : contact that defines the expected performance and availability of a service.
  6. Queueing delay account largely for the response time ate high percentiles due to server(CPU limitation)
    1. Head-of-line blocking: head of line/queue are slow and following are fast but in the view of client, those all are slow. -> measure response at client side.
    2. When doing load test, send request independently to avoid head of line blocking.
  7. Percentile in practice :
    1. Tail latency amplification : small number of slow requests make the whole requests looks slow.
    2. Calculate the latency on a ongoing basis : rolling window for dashboards.
    3. Keep a list of response time for all requests within time window and sort the list every minute / or there there are efficient algorithms: forward decay, t-digist, HdrHistogram.
    4. Averaging percentile is meaningless-> use histograms.
Approaches for coping with Load
1. Think about the system architecture often when service is growing fast.(data load as well)
2. Scaling up(vertical scaling) vs scaling out(horizontal scaling)
3. Scaling up: moving to a more powerful machine
4. Scaling out: distributing the load across multiple smaller machines.
  1. Also know as shared-nothing architecture.
5. Two ways are used mixed in terms of condition.
6. Elastic : add or reduce resource based on the loads detected.
  1. Vs manually scaling : human analyzes the capacity and decide to add more resources.
  2. Elastic is good if system load is unpredictable but manual scale is simpler and has fewer operational surprises.
7. Systems varies so no magic scaling sauce(one-size-fit-all scalable architecture)
8. Good design is based around assumptions : which operations are common and which are rare.
Maintainability

Operability
1. Make it easy to operations team to keep the system running smoothly
2. Works that are good operability
Simplicity
1. Make it easy for new engineers to understand the system, by removing as much as complexity as possible from the system.
2. Complexity leads to more cost and more bugs
3. Accidental complexity : not from the problem the software solves but from implementation
  1. Use abstraction to solve this issue. Improve code quality
  2. e.g. Java hide the machine code , CPU register stuff , SQL hide on-disk and in-mem data structures
Evolvability/Extensibility: Making change easy
1. Make it easy for engineers to make changes to the system in the future, adapting it for unanticipated use cases as requirements change.
2. Agility : to make change quickly and reliability

DDIA-Chapter1-Note

Chapter1 - Reliable, Scalable and Maintainable application

Thinking about data systems

Reliability

Hardware Faults

System Errors

Human Errors

How important of Reliability?

Scalability

Describing load

Describing performance

Approaches for coping with Load

Maintainability

Operability

Simplicity

Evolvability/Extensibility: Making change easy