DDIA-Chapter9-Note

Chapter 9: Consistency and Consensus

  1. We need to build systems that are fault tolerant -> they can keep working even when faults occur
  2. The best way to build a fault-tolerant system is to find general-purpose abstractions with useful guarantees, implement them once, and let applications rely on those guarantees.
  3. Consensus: all nodes agree on something.
    1. e.g. in single-leader replication, all nodes need consensus that there should be only one leader in the system

Consistency Guarantees

  1. In replication, data reaches different nodes at different times -> at a given point in time you may read different data from different nodes
  2. Eventual consistency: the replicas finally become the same -> a better name: convergence
  3. This is a weak guarantee -> "eventual" says nothing about when -> how long does it take for all nodes to converge on a consistent value?
  4. This chapter looks at stronger guarantees -> they come at a price: lower performance or less fault tolerance
  5. Distributed consistency models and the hierarchy of transaction isolation levels
    1. They are similar, with some overlap, but they cannot be mapped onto each other
    2. Transaction isolation is about avoiding race conditions due to concurrently executing transactions
    3. Distributed consistency is about coordinating the state of replicas in the face of delays and faults
  6. Topics:
    1. Linearizability, and its pros and cons
    2. Ordering events, particularly around causality and total ordering
    3. How to atomically commit a distributed transaction -> which leads to solutions for the consensus problem

Linearizability

  1. Make the system appear as if there were only one copy of the data, and all operations on it are atomic
  2. A recency guarantee: a read must see the most recently written value

What Makes a System Linearizable?

  1. See an example first:

    image-20240507081047099

  2. In the above graph,

    1. The first read by A completes before the write x = 1 begins, so it must return 0;
    2. The last read by A starts after the write x = 1 completes, so it must return 1;
    3. B's read overlaps with C's write, so we don't know whether the write has taken effect at the time the read is processed -> they are concurrent, and the read may return either 0 or 1.
  3. This alone is not enough for linearizability

  4. This is what we want for full linearizability:

    image-20240507081443786

  5. Once x = 1 has been read, all subsequent reads must return 1, even if the write hasn't completed yet.
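
    A minimal sketch of this recency rule (illustrative Python; the event list and helper names are not from the book):

    ```python
    # Recency rule for a single register x: once any read has returned the
    # new value, no later read may return the old one.
    def violates_recency(reads, old, new):
        """reads: list of (time, value) pairs for completed reads of x."""
        seen_new = False
        for _, value in sorted(reads):        # process reads in time order
            if value == new:
                seen_new = True
            elif value == old and seen_new:
                return True                   # stale read after the new value was seen
        return False

    # A reads 1 at t=3, then B reads 0 at t=4 -> not linearizable.
    print(violates_recency([(1, 0), (3, 1), (4, 0)], old=0, new=1))  # True
    ```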

  6. Another example with compare and set

    image-20240507081832559

    Compare-and-set: the operation checks that the register still holds the old value you expect; only if it matches does it write the new value, otherwise it fails (sketched in code at the end of this example)

    image-20240507081938713

    1. The timeline matters: each operation has a single point in time at which it atomically takes effect (reads or flips the value in the db)
    2. The requests of B's read, D's write, and A's write do not return in the order they were started; this is ok, since there may be processing delays
    3. B's read returns 1 while A's write hasn't finished; this is ok: the actual flip in A takes effect before B's read -> it only means A's ok response is delayed
    4. B and D are concurrent, so D's cas fails: by the time it is processed, the old value is no longer 0 (it's 2)
    5. B's final read violates linearizability: A has already read 4 (the 4 was set by C's cas), so an older value cannot be returned after a newer one in a linearizable system
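
    A minimal sketch of compare-and-set on a single register (illustrative Python, using a lock so the check and the write happen atomically):

    ```python
    import threading

    class Register:
        """A single register supporting atomic compare-and-set."""
        def __init__(self, value):
            self._value = value
            self._lock = threading.Lock()

        def read(self):
            with self._lock:
                return self._value

        def cas(self, expected, new):
            """Write `new` only if the current value equals `expected`."""
            with self._lock:                # check and write are atomic
                if self._value == expected:
                    self._value = new
                    return True             # cas ok
                return False                # cas error: value changed concurrently

    x = Register(0)
    print(x.cas(0, 2))   # True: x was 0, now 2
    print(x.cas(0, 3))   # False: x is 2, not 0 (like D's failed cas)
    ```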
  7. Linearizable vs serializable

    1. Serializability is an isolation property of transactions, where every transaction may read and write multiple objects -> it guarantees that concurrent transactions behave as if they had executed in some serial order
    2. Linearizability: a recency guarantee on reads and writes of a register (an individual object)
    3. Actual serial execution, and implementations of serializability based on 2PL, are typically linearizable
    4. Serializable snapshot isolation is not linearizable: it reads from a consistent snapshot to avoid locking between readers and writers -> the snapshot does not include writes more recent than itself

Relying on Linearizability

Application areas

Locking and leader election

  1. In single-leader election, a lock is used to ensure only one node gets elected, avoiding split brain.
  2. The lock must be linearizable -> we use a coordination service -> ZooKeeper -> a linearizable storage service for coordination tasks
  3. Distributed locking is also used at a more granular level -> Oracle Real Application Clusters (RAC) uses a lock per disk page, with multiple nodes sharing access to the same disk storage system.

Constraints and uniqueness guarantees

  1. You need linearizability to enforce a uniqueness constraint at the moment the data is written.
    1. It's like a compare-and-set: all nodes must agree on the current, up-to-date value.
    2. e.g. bank account balances in a transfer, seat booking, username registration
    3. In practice, the constraint can often be loosened -> if a flight is overbooked, we can move the customer to a different flight.
    4. Hard uniqueness constraints require linearizability; other kinds of constraints, like foreign keys, can do without it.

Cross-channel timing dependencies

  1. Channels -> the problem shows up when there is more than one communication channel: a stale read on one channel can race with a newer message on another

    1. You have a website where users can upload a photo, and a background process resizes the photos to lower resolution for faster download (thumbnails)

    2. The image resizer needs to be explicitly instructed to perform a resizing job, via a message queue.

      image-20240508073123195

    3. If the file storage service is not linearizable, there is a risk of a race condition: the message queue might be faster than the internal replication inside the storage service -> the resizer may see an old version of the file, or nothing at all

    4. Two different channels -> the file storage and the message queue (illustrated in the sketch below)
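
A toy illustration of the race between the two channels (illustrative Python; the lagging follower replica and all names here are assumptions, not from the book):

```python
import queue

# Two channels between the web server and the resizer: a replicated file
# store (channel 1) and a message queue (channel 2).
storage_leader = {}      # replica the web server writes to
storage_follower = {}    # lagging replica the resizer happens to read from
jobs = queue.Queue()

# Web server: write the photo, then enqueue the resize job.
storage_leader["photo.jpg"] = "full-resolution bytes"
jobs.put("resize photo.jpg")

# Resizer: the queue message can outrun the storage-internal replication.
job = jobs.get()
photo = storage_follower.get("photo.jpg")   # None: replication hasn't caught up
print(job, "->", photo)  # the resizer sees an old (here: missing) version
```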

Implementing Linearizable Systems

  1. Single-leader replication (potentially linearizable): if you read from the leader, it can be linearizable; however, if a node thinks it is the leader when it isn't, reads from it may violate linearizability
    1. With async replication, failover may even lose committed writes, violating both durability and linearizability
  2. Consensus algorithms (linearizable) -> ZooKeeper and etcd work like this
  3. Multi-leader replication (not linearizable): multiple nodes concurrently process writes and asynchronously replicate them to the other nodes -> they can produce conflicting writes -> there is no single copy of the data
  4. Leaderless replication (probably not linearizable):
    1. LWW conflict resolution based on time-of-day clocks (as in Cassandra) is almost certainly nonlinearizable -> clock timestamps cannot be guaranteed to be consistent with the actual ordering of events, due to clock skew

Linearizability and quorums

  1. A strict quorum in a Dynamo-style model may seem linearizable -> why only "may" -> variable network delays

    image-20240508074455443

    1. A writer client is updating x to 1 by sending the write to all three replicas (n=3, w=3). Concurrently, A reads from a quorum of two nodes (r=2) and sees the new value 1 on one of the nodes. Also concurrently, B reads from the other two nodes and gets back the old value 0 from both.
    2. The quorum condition is met (w + r > n), but this is not linearizable -> B's read starts after A's read completes, yet returns the old value (see the sketch below)
  2. It can be made linearizable at the cost of performance: a reader must perform read repair synchronously before returning the result to the application, and a writer must read the latest state of a quorum of nodes before sending its writes.

    1. Riak doesn't perform synchronous read repair
    2. Cassandra does wait for read repair, but loses linearizability when there are concurrent writes to the same key (due to LWW)
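
A toy replay of the figure (illustrative Python): even though w + r > n, B can read the old value after A has already seen the new one, because the write has so far reached only one replica.

```python
# n=3 replicas, write goes to all three (w=3), reads use any two (r=2).
# Each replica holds a (version, value) pair for x.
replicas = [(0, 0), (0, 0), (0, 0)]   # x = 0 everywhere

def read_quorum(nodes):
    """Return the value with the highest version among the chosen replicas."""
    return max(nodes)[1]

# The writer is setting x=1 on all replicas, but so far only replica 0
# has applied it; the other two writes are still in flight.
replicas[0] = (1, 1)

print(read_quorum([replicas[0], replicas[1]]))  # A's read sees the new value 1
print(read_quorum([replicas[1], replicas[2]]))  # B's later read still sees 0
```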

The Cost of Linearizability

image-20240508075704741

  1. If the network is interrupted between two datacenters
    1. With single-leader replication -> if you read from the DC that has no leader, you may get stale values
    2. If all reads must go to the leader DC to preserve linearizability, clients in the other DC become unavailable
    3. With multi-leader replication
      1. The sync between DCs cannot happen, but each DC can keep working -> they exchange their changes asynchronously when the network is back

The CAP theorem

  1. If your application requires linearizability, and some replicas are disconnected from the other replicas due to a network problem, then those replicas cannot process requests while they are disconnected: they must either wait until the network problem is fixed or return an error (unavailable)
  2. If your application doesn't require linearizability, it can be written in a way that each replica processes requests independently, even if it is disconnected from other replicas. In this case the application can remain available in the face of a network problem, but its behavior is not linearizable
  3. This is the CAP theorem

Linearizability and network delays

  1. Linearizable systems are surprisingly rare in practice
  2. Even RAM on a multi-core CPU is not linearizable: each core has its own memory cache and store buffer -> memory accesses go to the cache by default, and changes are asynchronously written out to main memory (the cache is used for performance)
    1. So we have several copies of the data -> in main memory, in various caches -> and they are asynchronously updated -> linearizability is lost
  3. This is fine because we don't expect a single CPU core to keep working independently of the rest -> here linearizability is dropped for performance, not for fault tolerance

Ordering Guarantees

  1. In single-leader replication, the leader determines the order of writes in the replication log -> this avoids the write conflicts that arise with multiple leaders
  2. Serializability is about ensuring transactions behave as if they had executed in some sequential order
  3. The use of timestamps and clocks is another attempt to keep things in order

Ordering and Causality

  1. Ordering helps preserve causality
  2. e.g.
    1. Causal dependency: people see the question before its answer
    2. In multi-leader replication, a leader may receive an update to a row that does not exist yet -> causality: a row must be created before it can be updated
    3. "Happened before" is another expression of causality: A happened before B means B may know about, or depend on, A
    4. In snapshot isolation, a transaction reads from a consistent snapshot -> consistent means causally consistent: if it reads an answer, it must also be able to read the question that the answer responds to
    5. Write skew: the on-call shift issue -> SSI (serializable snapshot isolation) tracks causal dependencies between transactions
    6. ...
  3. Causally consistent: a system obeys the ordering imposed by causality

The causal order is not a total order

  1. Linearizability gives a total order -> a strict timeline: one operation happens after the other, with no concurrency
  2. Causality gives only a partial order: two operations are concurrent if neither happened before the other -> this does happen in practice

Linearizability is stronger than causal consistency

  1. Linearizability implies causal consistency -> the strict timeline ensures that things happen in causal order
  2. But linearizability has a price: performance and availability
  3. Middle ground: a system can be causally consistent without incurring the cost of linearizability, and keep the performance

Capturing causal dependencies

  1. To maintain causal consistency, you need to know the happened-before relationships
  2. This is a partial order: if a happened-before relationship holds on one replica, it must be respected on all replicas
  3. Version vectors can be generalized across the whole database to track happened-before relationships
  4. Similar to what serializable snapshot isolation does to track causal dependencies

Sequence Number Ordering

  1. A practical way to order events: sequence numbers or timestamps (from a logical clock)
  2. We rely on the property: if A causally happened before B, then A occurs before B in the total order (A has a lower sequence number)
    1. Concurrent operations may be ordered arbitrarily
  3. In single-leader replication, the leader can assign a monotonically increasing counter to each operation.

Noncausal sequence number generators

  1. If there is no single leader doing the allocation, in practice we can:
    1. Have each node generate its own independent set of sequence numbers. With some strategy you can avoid sequence number conflicts: with two nodes, one generates odd numbers and the other even numbers
    2. Attach a timestamp from a time-of-day clock to each operation -> not sequential, but with high enough resolution it approximates a total order
    3. Like method 1, preallocate blocks of sequence numbers per node: A gets 1-1000, B gets 1001-2000, ...
  2. None of these methods is consistent with causality (a small sketch follows at the end of this section):
    1. Nodes process operations at different rates: if one node hands out 1, 3, 5, 7, 9 while the other has only reached 2, the numbers don't tell you the causality
    2. Timestamps from physical clocks are subject to clock skew
    3. With block allocation, sequence numbers 1001 and 1 tell you nothing about which operation came first
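
A minimal sketch of the odd/even scheme from method 1, showing why the numbers say nothing about causality (illustrative Python):

```python
import itertools

# Each node hands out its own sequence numbers: node 0 odd, node 1 even.
# No conflicts, but the numbers don't reflect causal order.
def make_generator(node_id, num_nodes=2):
    return itertools.count(start=node_id + 1, step=num_nodes)

node_a = make_generator(0)   # 1, 3, 5, 7, ...
node_b = make_generator(1)   # 2, 4, 6, 8, ...

print([next(node_a) for _ in range(4)])  # [1, 3, 5, 7]
print(next(node_b))  # 2, even though this operation may causally
                     # depend on operation 7 from node A
```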

Lamport timestamps

  1. Lamport timestamps:
    1. Each node has a unique identifier, and each node keeps a counter of the number of operations it has processed
    2. The Lamport timestamp is a pair of (counter, node ID).
    3. Two nodes may have the same counter value, but the node ID makes every timestamp unique.
    4. image-20240512082930657
    5. Given two timestamps, the one with the greater counter is the greater timestamp; if the counter values are the same, the one with the greater node ID is the greater timestamp
  2. The key difference from the odd/even counters in the last section is that every node and every client keeps track of the maximum counter value it has seen so far, and includes that maximum on every request -> when a node receives a request or response with a maximum counter value greater than its own, it immediately increases its own counter to that maximum.
  3. Difference from version vectors:
    1. Version vectors can distinguish whether two operations are concurrent or whether one is causally dependent on the other
    2. Lamport timestamps always enforce a total order -> you can't tell whether two operations are concurrent or causally dependent
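
A minimal Lamport clock sketch (illustrative Python; it assumes requests carry just the counter part of the timestamp):

```python
class LamportClock:
    """Logical clock: timestamps are (counter, node_id) pairs, totally ordered."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def tick(self):
        """Called for every operation the node processes; returns its timestamp."""
        self.counter += 1
        return (self.counter, self.node_id)

    def observe(self, counter):
        """On any request/response carrying a counter, jump to the max seen so far."""
        self.counter = max(self.counter, counter)

a, b = LamportClock(1), LamportClock(2)
t1 = a.tick()     # (1, 1)
b.observe(t1[0])  # B sees A's counter on a request
t2 = b.tick()     # (2, 2): causally after t1
print(t1 < t2)    # True: tuple comparison gives the total order
```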

Timestamp ordering is not sufficient

  1. e.g. to enforce that a username is unique -> a node must decide right now whether a registration succeeds, but the total order only emerges after you have collected all of the operations
  2. It is not sufficient to have a total ordering of operations; you also need to know when that order is finalized -> total order broadcast (next section)

Total Order Broadcast

  1. How to scale the system if the throughput is greater than a single leader can handle; and how to fail over if the leader fails -> this is the total order broadcast problem
  2. It requires two safety properties:
    1. Reliable delivery: no messages are lost; if a message is delivered to one node, it is delivered to all nodes
    2. Totally ordered delivery: messages are delivered to every node in the same order

Using total order broadcast

  1. State machine replication: each replica processes the same writes in the same order -> the replicas remain consistent with each other -> an application of total order broadcast (sketched after this list)
  2. Implementing serializable transactions: if every message represents a deterministic transaction and every node processes them in log order, the replicas stay consistent
  3. The order is fixed at the time the messages are delivered -> a message cannot be retroactively inserted into an earlier position
  4. It's like an append-only log: every node can read it and sees the same sequence.
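
A toy state machine replication sketch (illustrative Python): because every replica applies the same totally ordered log deterministically, all replicas end up in the same state.

```python
# The totally ordered log, delivered identically to every node.
log = [("set", "x", 1), ("set", "y", 2), ("set", "x", 4)]

def apply_log(log):
    """Deterministically apply the log to a toy key-value store."""
    state = {}
    for op, key, value in log:   # same log, same order -> same state
        if op == "set":
            state[key] = value
    return state

replica_1 = apply_log(log)
replica_2 = apply_log(log)
print(replica_1 == replica_2)  # True: the replicas remain consistent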

Implementing linearizable storage using total order broadcast

  1. Steps (to claim a username, for example; a toy version follows after this list):
    1. Append a message to the log, tentatively claiming the username
    2. Read the log, and wait for the message you appended to be delivered back to you
    3. Check for any messages claiming the username that you want. If the first message for your desired username is your own message, you are successful; otherwise you abort the operation.
  2. Reads might still be inconsistent -> you may read from a node that is asynchronously updated from the log. Options:
    1. Sequence reads through the log, like writes: append a read message and perform the read when it is delivered back to you
    2. Read from a replica that is synchronously updated from the log
    3. Read after fetching the position of the latest log message in a linearizable way -> wait for everything up to that position, then read
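
A toy version of the username-claim steps (illustrative Python; a plain list stands in for the totally ordered log, so step 2's "wait for delivery" is collapsed away):

```python
import uuid

log = []  # stand-in for a total order broadcast log: same order on every node

def claim_username(log, username):
    my_id = uuid.uuid4()
    log.append((username, my_id))   # 1. append a message claiming the username
    for name, claimant in log:      # 2./3. read the log back; first claim wins
        if name == username:
            return claimant == my_id
    return False

log.append(("alice", uuid.uuid4()))  # someone else's claim arrived first
print(claim_username(log, "alice"))  # False: abort
print(claim_username(log, "bob"))    # True: our message is the first claim
```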

Implementing total order broadcast using linearizable storage

  1. Steps:
    1. You have a linearizable register storing an integer, with an atomic increment-and-get operation
    2. For every message you want to send through total order broadcast, you increment-and-get the linearizable integer, and then attach the value you got from the register as a sequence number to the message. You can then send the message to all nodes (resending any lost messages), and the recipients will deliver the messages consecutively by sequence number.
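
A minimal sketch of this construction (illustrative Python; a locked counter stands in for the linearizable register, and node inboxes are plain lists):

```python
import threading

class LinearizableCounter:
    """Stand-in for a linearizable register with atomic increment-and-get."""
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment_and_get(self):
        with self._lock:
            self._value += 1
            return self._value

counter = LinearizableCounter()

def broadcast(message, nodes):
    seq = counter.increment_and_get()   # gap-free, monotonically increasing
    for inbox in nodes:
        inbox.append((seq, message))    # recipients deliver consecutively by seq

nodes = [[], []]
broadcast("m1", nodes)
broadcast("m2", nodes)
print(sorted(nodes[0]))  # [(1, 'm1'), (2, 'm2')] -> same order on every node
```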

Distributed Transactions and Consensus

Atomic Commit and Two-Phase Commit

Distributed Transactions in Practice

Fault Tolerant Consensus

Membership and Coordination Services
