
DDIA-Chapter8-Note

Chapter 8 The Trouble with Distributed Systems

Faults and Partial Failures

  1. On a single machine, good software/systems are either fully functional or entirely broken, not something in between -> we don't want faulty results, which cause wasted work and confusion
  2. The nondeterminism and the possibility of partial failure are what make distributed systems hard to work with -> sometimes things work and sometimes they don't, and you may not even know about a failure because of nondeterministic network behavior

Cloud Computing and Supercomputing

  1. Large scale systems
    1. At one end is the field of high-performance computing (HPC) -> large numbers of CPUs doing computationally intensive tasks
    2. At the other extreme is cloud computing: multi-tenant datacenters, commodity computers connected with an IP network, elastic resource allocation, and metered billing
    3. Traditional enterprise datacenters lie somewhere between these extremes.
  2. Different systems handle faults in different ways
    1. Supercomputers escalate a single-node failure into a total system failure: stop the whole job and fix it
    2. We focus on internet-service systems, which differ in several ways:
      1. Many systems are online systems -> availability is needed
      2. Supercomputer nodes are more reliable than commodity machines
      3. Large datacenter networks are based on IP and Ethernet, arranged in Clos topologies to provide high bisection bandwidth, while supercomputers use specialized topologies
      4. The bigger the system, the more likely it is that some node has failed, so constantly giving up (treating it as total failure) is bad
      5. Fault tolerance is useful
      6. In a geographically distributed deployment, communication goes over the internet, which is slow compared with local networks. Supercomputers assume all their nodes are close together.
    3. Accept partial failure and build in fault tolerance: build a reliable system from unreliable components

Unreliable Networks

  1. In this book we talk about shared-nothing systems -> they communicate over an asynchronous packet network -> messages/packets are not guaranteed to be delivered on time, or delivered at all. Things can go wrong:

    1. The request may be lost -> e.g., someone unplugged a network cable
    2. The request may be waiting in a queue and get delivered later -> network congestion
    3. The remote node may have failed (power down)
    4. The remote node may have temporarily stopped responding (a GC pause, ...)
    5. The remote node processed your message, but the response was delayed or lost

    (figure: if you send a request and don't get a response, any of the above could have happened)

  2. You don't know whether a response is lost or just delayed, so in practice we use a timeout

  3. However, when the timeout fires, you still don't know whether the remote node processed your request

Network Faults in Practice

  1. Network issues are common
  2. Handling network problems doesn't necessarily mean tolerating them: if your network is normally reliable, it can be acceptable to simply show an error to users while it is down.

Detecting Faults

  1. Many systems automatically detect faulty nodes
    1. A load balancer stops sending requests to a dead node
    2. In single-leader replication, a new leader can be elected when the leader fails
  2. The uncertainty of the network makes it hard to know what's wrong
    1. You can reach the machine where the node should be running, but no process is listening on the port; the OS closes or refuses TCP connections by sending a RST or FIN packet in reply -> however, if the node crashed while it was handling the message, you don't know how much it processed
    2. If a node process crashed but the OS is still running, a script can notify other nodes about the crash so another node can take over quickly without having to wait for a timeout.
    3. If you can access the management interface of the datacenter's network switches, you can query them to detect link failures at the hardware level (not possible if you're connecting over the internet)
    4. A router may or may not know that the destination has failed -> it depends on the router's features (e.g., it may reply with an ICMP Destination Unreachable packet)
  3. You cannot rely on lower-level feedback alone -> even if TCP acknowledges that a packet was delivered, the node may crash while processing it, so you need a positive response from the application itself
  4. In general, you have to assume you may get no response at all: retry at the application level, wait for a timeout, and eventually declare the node dead if you hear nothing.
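
A minimal sketch of this retry-then-declare-dead approach in Java (the request callable, retry count, and timeout values are hypothetical, just for illustration):

```java
import java.util.concurrent.*;

// Hypothetical sketch: retry a request a few times with a timeout, and only
// declare the node dead after all attempts fail to get a response.
public class FailureDetectionSketch {
    static final int MAX_RETRIES = 3;     // assumed values, tune per system
    static final long TIMEOUT_MS = 2000;

    static String callWithRetries(ExecutorService executor, Callable<String> request)
            throws Exception {
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            Future<String> future = executor.submit(request);
            try {
                return future.get(TIMEOUT_MS, TimeUnit.MILLISECONDS); // wait for a response
            } catch (TimeoutException e) {
                future.cancel(true);       // no response within the timeout: retry
            }
        }
        throw new Exception("node presumed dead after " + MAX_RETRIES + " timed-out attempts");
    }
}
```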

Timeouts and Unbounded Delays

  1. Determining the timeout is hard
    1. Too short may wrongly declare dead a node that is merely slow (e.g., working on a heavy task)
    2. Too long means a long wait before a dead node is actually declared dead
  2. Declaring a node dead prematurely is problematic: if it's actually still alive and performing some task, and another node takes over, the task may be performed twice
  3. Transferring a node's responsibilities to another node has overheads -> if the system is already overloaded, this makes things worse -> if the node was only responding slowly because of overload, the transfer adds even more load and can take the system down.
  4. Unbounded delays: an asynchronous network gives no guarantees on delivery time or processing time

Network congestion and queueing

  1. Queueing happens when
    1. Several nodes send packets to the same destination simultaneously (queueing at the network switch)
    2. The destination node's CPUs are busy (queueing in the OS)
    3. In a virtualized environment, a running OS is often paused while another VM uses the CPU core -> during that time it cannot process any data (queueing in the hypervisor)
    4. TCP performs flow control -> queueing at the sender
  2. TCP retransmits a packet when it considers it lost (timed out) -> to the application, the retransmission delay looks just like queueing delay
  3. In a multi-tenant datacenter, resources (network links, switches, CPUs) are shared; since you have no control over other tenants' resource usage -> there may be delays caused by a "noisy neighbor" consuming a lot of resources
  4. Timeouts can be chosen experimentally: measure the distribution of network round-trip times to determine the expected delay, then take the application's characteristics into account
  5. Even better -> continuously measure response times and their variability (jitter), and automatically adjust timeouts according to the observed response time distribution. TCP retransmission timeouts work similarly; the Phi Accrual failure detector, used in Akka and Cassandra, takes this approach.
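
A simplified sketch of that adaptive idea (this is not the actual Phi Accrual detector; the smoothing factor and the multiplier are arbitrary assumed values):

```java
// Simplified sketch of an adaptive timeout: track the mean and variability of
// observed round-trip times and set the timeout a few standard deviations
// above the mean, so it follows the observed response time distribution.
public class AdaptiveTimeout {
    private double meanMs = 100.0;            // running estimate of mean RTT
    private double varMs = 25.0;              // running estimate of RTT variance
    private static final double ALPHA = 0.1;  // smoothing factor (assumed value)

    public void observeRtt(double rttMs) {
        double diff = rttMs - meanMs;
        meanMs += ALPHA * diff;
        varMs = (1 - ALPHA) * (varMs + ALPHA * diff * diff);
    }

    public double currentTimeoutMs() {
        // e.g., mean + 4 standard deviations; the factor 4 is an arbitrary choice
        return meanMs + 4 * Math.sqrt(varMs);
    }
}
```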

Synchronous vs asynchronous networks

  1. In a synchronous, circuit-switched network (like a traditional telephone network), a connection reserves a fixed amount of bandwidth along its route, so there is no queueing -> the delay is fixed -> bounded delay

Can we not simply make network delays predictable?

  1. TCP will transmit data of any size, using as much bandwidth as is available -> so it's different from a circuit
  2. Datacenter networks and the internet are packet-switched networks -> optimized for bursty traffic (the size of a transfer varies)
  3. If you used circuits you would need to predict the bandwidth in advance -> guess too low and the transfer is slow, guess too high and capacity is wasted -> packet switching uses the available bandwidth more efficiently
  4. You can regard variable delays as a consequence of dynamic resource partitioning -> it's a trade-off (latency guarantees vs. resource utilization)

Unreliable Clocks

  1. In distributed systems, time is tricky because communication is not instantaneous -> it takes time for a message to travel across the network from one machine to another.
  2. This makes it hard to determine the order of events when multiple machines are involved.
  3. Each machine has its own clock, and they are not necessarily in agreement, so some synchronization is needed -> NTP (Network Time Protocol) -> allows the computer clock to be adjusted according to the time reported by a group of servers -> those servers in turn get their time from a more accurate source such as a GPS receiver.

Monotonic vs Time-of-Day Clocks

Two kinds of clocks, with different purposes

Time-of-Day Clocks

  1. It returns the current date and time according to some calendar, e.g., System.currentTimeMillis()
  2. It's usually synchronized with NTP -> ideally the time on one machine is the same as on another, since they share the same source of truth

Monotonic clocks

  1. It's suitable for measuring a duration, such as a timeout or a service's response time.
  2. You can check the value of the monotonic clock at one point in time, do something, and check it again; the difference tells you how much time elapsed between the two checks -> the absolute value of the clock is meaningless (it might be the number of nanoseconds since the computer started); see the sketch after this list
  3. On a machine with multiple CPU sockets there may be a separate timer per CPU; the OS compensates for the discrepancy and tries to present a monotonic view of the clock
  4. NTP may adjust the frequency at which the monotonic clock moves forward if it detects that the computer's local quartz clock is running faster or slower than the NTP server, but it never makes the clock jump
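
In Java, for example, the time-of-day clock is System.currentTimeMillis() and the monotonic clock is System.nanoTime(); a minimal example of measuring a duration with the monotonic clock:

```java
public class ElapsedTimeExample {
    public static void main(String[] args) throws InterruptedException {
        // Use the monotonic clock (System.nanoTime()) to measure a duration;
        // its absolute value is meaningless, but differences are reliable.
        // System.currentTimeMillis() (time-of-day) may be stepped by NTP.
        long start = System.nanoTime();
        Thread.sleep(50);                                  // the work being timed
        long elapsedNanos = System.nanoTime() - start;
        System.out.println("elapsed ms: " + elapsedNanos / 1_000_000);
    }
}
```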

Clock Synchronization and Accuracy

  1. Hardware clocks and NTP can misbehave:
    1. Clocks drift, e.g., due to temperature
    2. If the local clock differs too much from the NTP server, it may refuse to synchronize or be forcibly reset -> applications relying on the time see it jump backward or forward
    3. A firewall may accidentally block the NTP servers
    4. NTP synchronization accuracy is limited by network delay
    5. NTP servers can be misconfigured and report the wrong time
    6. Leap seconds result in a minute that is 59 or 61 seconds long -> causing trouble in systems that don't handle them
    7. In a virtual machine, the hardware clock is virtualized; when the VM is suddenly paused (because the CPU is switched to another VM), the clock on the VM appears to jump forward
    8. The clock on a device you don't fully control (e.g., a user's phone) cannot be trusted -> the date and time may be deliberately set wrong.
  2. Higher accuracy is needed in some scenarios, such as high-frequency trading, and can be achieved with GPS receivers, the Precision Time Protocol (PTP), and careful deployment and monitoring

Relying on Synchronized Clocks

  1. Incorrect clocks easily go unnoticed, and the resulting system issues can be subtle
  2. Monitoring of clock offsets between machines is needed

Timestamps for ordering events

(figure: writes ordered by timestamp across nodes with skewed clocks)

  1. The write by client B is causally later than the write by client A, but B's write gets an earlier timestamp -> node 2 thinks x = 1 is the more recent value -> B's change is silently lost
  2. LWW (last write wins) is the conflict-resolution strategy at work here; it's used in multi-leader and leaderless databases such as Cassandra and Riak.
  3. Some implementations generate timestamps on the client rather than the server, but this doesn't change the fundamental problems with LWW:
    1. Database writes can be mysteriously lost (due to clock skew)
    2. LWW cannot distinguish writes that occurred sequentially in quick succession from writes that were truly concurrent
    3. Two clients may generate writes with the same timestamp
  4. It is important to be aware that the definition of "recent" depends on the local time-of-day clock, which may be inaccurate
  5. NTP synchronization has its own limits -> network delays
  6. Logical clocks, which are based on incrementing counters rather than an oscillating quartz crystal, are a safer alternative for ordering events (see the sketch below).
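
A minimal sketch of such a logical clock (a Lamport-style counter; the class and method names are just illustrative):

```java
// Minimal Lamport-style logical clock: a counter that is incremented on each
// local event and merged with the counter carried by incoming messages, so
// causally later events always get larger timestamps.
public class LamportClock {
    private long counter = 0;

    public synchronized long tick() {                     // local event or send
        return ++counter;
    }

    public synchronized long receive(long remoteTimestamp) { // on receiving a message
        counter = Math.max(counter, remoteTimestamp) + 1;
        return counter;
    }
}
```

Each node keeps one of these; timestamps generated this way respect causality, though they say nothing about actual wall-clock time.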

Clock readings have a confidence interval

  1. A clock reading should be considered as a range of time, with a confidence interval -> the time read from an NTP server is subject to network delay and drift.
  2. However, the uncertainty is not exposed by most systems and clock APIs

Synchronized clocks for global snapshots

  1. To implement snapshot isolation in a distributed database, we need a globally, monotonically increasing transaction ID -> this ID needs to reflect causality: if transaction B reads a value that was written by A, then B must have a higher transaction ID than A -> generating such IDs is hard and can become a bottleneck
  2. In order to ensure that transaction timestamps reflect causality, Spanner deliberately waits for the length of the confidence interval before committing a read-write transaction -> this ensures that any transaction that may read the data is at a sufficiently later time, so their confidence intervals don't overlap -> the interval needs to be as small as possible -> Google deploys a GPS receiver or atomic clock in each datacenter. A conceptual sketch of this commit wait is below.
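
Conceptually, the commit wait looks something like the sketch below; the TrueTime interface here is paraphrased for illustration, not Google's actual API:

```java
// Conceptual sketch of Spanner-style commit wait (paraphrased, not Google's API).
// The clock returns an interval [earliest, latest] guaranteed to contain the
// true current time.
interface TrueTime {
    long earliest();
    long latest();
}

class CommitWaitSketch {
    static long commit(TrueTime tt) throws InterruptedException {
        long commitTimestamp = tt.latest();   // pick a timestamp at the upper bound
        // Wait until the commit timestamp is definitely in the past, so any
        // transaction that later reads this data gets a strictly larger timestamp.
        while (tt.earliest() <= commitTimestamp) {
            Thread.sleep(1);
        }
        return commitTimestamp;
    }
}
```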

Process Pauses

  1. In single-leader replication, how does a node know that it is still the leader? Typically by holding a lease from the other nodes and periodically renewing it, e.g.:

    ```java
    while (true) {
        request = getIncomingRequest();
        // Ensure that the lease always has at least 10 seconds remaining
        if (lease.expiryTimeMillis - System.currentTimeMillis() < 10000) {
            lease = lease.renew();
        }
        if (lease.isValid()) {
            process(request);
        }
    }
    ```

  2. Problems with this logic:

    1. It relies on synchronized clocks: the expiry time on the lease is set by a different machine, and it is being compared against the local clock.
    2. Even if we change the logic to use only the local monotonic clock, there is another problem: the code assumes very little time passes between checking the lease expiry and processing the request -> nothing tells the thread that another node has taken over in the meantime -> long pauses can really happen because of:
      1. GC pauses
      2. The VM being suspended
      3. The device being suspended (the user closes the lid of a laptop)
      4. An OS context switch, or the hypervisor switching to a different VM
      5. Synchronous disk I/O that makes the thread wait
      6. On Unix systems, a SIGSTOP signal
    3. All of this is like writing multi-threaded code on a single machine, but on a single machine threads share memory and have tools to coordinate: mutexes, semaphores, atomic counters, ...
    4. In a distributed system there is no shared memory, and a node may be paused at any time while the rest of the system keeps moving.

Response time guarantees

  1. "Real-time" here doesn't mean fast; it means the system is designed and tested to respond within specified, guaranteed timing deadlines
  2. Making a system real-time requires restrictions on programming languages, tools, and libraries across the whole stack -> it's very expensive
  3. So most server-side applications simply accept occasional pauses and timing instability

Limiting the impact of garbage collection

  1. Treat GC pauses like planned node outages and warn the other nodes in advance -> they know not to send requests to this node while it collects garbage.
  2. Use the garbage collector only for short-lived objects and restart the process periodically -> also a planned, rolling restart

Knowledge, Truth and Lies

The Truth Is Defined by the Majority

  1. Quorum: decisions are made by voting among the nodes
  2. Typically an absolute majority, i.e., more than half of the nodes.

The leader and the lock

  1. Frequently a system requires that there is only one of something:

    1. Only one node is allowed to be the leader for a database partition, to avoid split brain
    2. Only one transaction or client is allowed to hold the lock for a particular resource or object, to avoid concurrently writing to it and corrupting it.
    3. Only one user is allowed to register a given username, so that usernames uniquely identify users.
  2. The chosen one: be careful; when the quorum decides a node is no longer "the one", it is deposed even if it still considers itself the one.

  3. Example: you want to ensure that a file in a storage service is only written by one client at a time, to avoid corrupting it ->

    (figure: incorrect implementation of a distributed lock; a client's lease expires during a pause)

    1. If the client holding the lease is paused for too long, its lease expires.
    2. Another client can obtain a lease for the same file and start writing to it.
    3. When the paused client comes back, it believes (incorrectly) that it still has a valid lease and proceeds to also write to the file -> the two writes clash and corrupt the file

Fencing tokens

  1. To ensure that client 1 (whose lease has expired) cannot disrupt the rest of the system -> fencing tokens

    (figure: the storage service accepts a write only if its fencing token is newer than any it has already seen)

  2. Every time the lock service grants a lock or lease, it also returns a fencing token: a number that increases every time a lock is granted -> every write to the storage service must include the current token -> the storage service remembers the highest token it has seen and rejects writes with an older (expired) token, as in the sketch below

  3. This mechanism requires the resource itself to check the tokens -> it's wise not to assume that clients are always well behaved.
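
A sketch of the check on the storage side (class and method names are made up; the point is only that writes carrying a token older than one already seen are rejected):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: the storage service remembers the highest fencing token
// it has seen for each file and rejects writes that carry an older token.
public class FencedStorage {
    private final Map<String, Long> highestTokenSeen = new ConcurrentHashMap<>();

    public synchronized boolean write(String file, long fencingToken, byte[] data) {
        long highest = highestTokenSeen.getOrDefault(file, -1L);
        if (fencingToken <= highest) {
            return false;              // stale token: the client's lease has expired
        }
        highestTokenSeen.put(file, fencingToken);
        // ... actually write `data` to `file` here ...
        return true;
    }
}
```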

Byzantine Faults

  1. What if a client sends a fake token, bypassing the fencing?

  2. Byzantine fault: a node may "lie", e.g., claim it received a message when in fact it didn't, or send corrupted or malicious responses

    Byzantine fault tolerance: the system continues to work correctly even when some nodes are dishonest

  3. In this book we can assume you control all the nodes of the system -> we don't need Byzantine fault tolerance algorithms to deal with untrusted user input -> the server can simply validate and authenticate it

  4. Byzantine algorithms are more important in decentralized systems with no central authority, where participants may not trust each other

  5. Byzantine fault tolerance algorithms require most nodes to be working correctly, so if an attacker can compromise most of your nodes, they don't help.

Weak forms of lying

  1. Even if we assume nodes are honest, it's worth adding mechanisms to guard against weak forms of "lying":
    1. Packets can get corrupted for many reasons -> use checksums
    2. A publicly accessible service needs to sanitize and validate user input
    3. An NTP client can be configured with multiple server addresses -> outliers can be detected, making synchronization more robust

System Model and Reality

  1. System models for timing assumptions
    1. Synchronous: bounded network delay, bounded process pauses, and bounded clock error
    2. Partially synchronous: the system behaves like a synchronous model most of the time, but sometimes exceeds the bounds on delay, pauses, or clock error -> the most realistic model for real systems
    3. Asynchronous: no timing assumptions are allowed at all
  2. System models for node failures
    1. Crash-stop faults: a node fails only by crashing; once down, it never comes back
    2. Crash-recovery faults: a node may crash and later recover, possibly with stable storage that survives the crash
    3. Byzantine faults: nodes may do anything, including lying

Correctness of an algorithm

  1. To define what it means for an algorithm to be correct, we describe its properties; e.g., for a fencing-token algorithm:
  2. Uniqueness: no two requests for a fencing token return the same value
  3. Monotonic sequence: if request x returned token tx, and request y returned token ty, and x completed before y began, then tx < ty
  4. Availability: a node that requests a fencing token and does not crash eventually receives a response

Safety and liveness

  1. What if all nodes crash, or all network delays suddenly become infinitely long? Then no algorithm can get anything done. To reason about this we distinguish two kinds of properties:

  2. Uniqueness and monotonic sequence are safety properties; availability is a liveness property

  3. Safety means nothing bad happens

  4. Liveness: something good eventually happens -> liveness properties typically contain the word "eventually"

    (My own analogy: safety is like never falling ill, nothing bad is ever allowed to happen; liveness is like your life going on, you eventually recover and stay alive.)

Mapping system models to the real world

  1. In the real world there are many things that break the model's assumptions and make an algorithm's theoretical guarantees fail
  2. But abstract models help us simplify and focus on the key problems for analysis.