Deliver data reliably
What else do we need to know to build reliable, scalable, and fast systems?
List of common problems in distributed systems
List of system design concepts that can help solve these problems
Three-tier architecture
-
Problems in distributed systems
-
Transient network issues
-
Long-lasting network issues
-
Consumer application failures -> the broker may keep the messages for a while, until its disk is full
- Timeouts, retries, idempotency, backoff with jitter, failover/fallback, message delivery guarantees, consumer offsets
-
If messages come faster than the consumers can process them (due to a traffic burst or natural growth of load), we need to increase the throughput or reduce the latency:
- Batching, compression, scaling, partitioning
-
Load shedding, rate limiting...
-
Circuit breaker and bulkhead when some consumers fail
-
Three tier architecture:
- Presentation tier: displays data
- Application tier: business logic (functionality)
- Data tier: data persistence
-
What if a request fails to pass through a tier? What if one tier becomes slow? ...
-
Timeouts
Fast failures and slow failures
Connection and request timeouts
- Everything eventually fails;
- Fast failure: we learn about the failure quickly
- Slow failure: we don't learn about the failure immediately; it still appears to be working -> it occupies resources
- So we use timeouts to turn slow failures into fast failures
- Timeout: how much time the client is willing to wait for a response from the server
- Connection timeout: how much time to wait for a connection to be established (10-100 ms)
- Request timeout: how much time to wait for a request to complete
- Trick for choosing a timeout -> analyze latency metrics (percentiles) and set it around p99
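A minimal sketch of connection vs. request timeouts, using Python's requests library; the endpoint and the timeout values here are made-up assumptions:

```python
import requests

# Connection timeout: time to establish the TCP connection (short, e.g. ~100 ms).
# Read (request) timeout: time to wait for the response, e.g. derived from the p99 latency.
CONNECT_TIMEOUT_S = 0.1
READ_TIMEOUT_S = 2.0

try:
    response = requests.get(
        "https://example.com/api/orders",               # hypothetical endpoint
        timeout=(CONNECT_TIMEOUT_S, READ_TIMEOUT_S),    # (connection timeout, request timeout)
    )
    response.raise_for_status()
except requests.exceptions.ConnectTimeout:
    print("connection timeout: could not establish a connection in time")
except requests.exceptions.ReadTimeout:
    print("request timeout: the server did not respond within the read timeout")
```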
What do we do with failed requests?
Strategies for handling failed requests (cancel, retry, failover, fallback)
-
Cancel: when a request fails, cancel it and possibly throw an exception (try, fail, log) -> useful when the failure is not transient and resending won't help (e.g., an authentication failure)
-
Retry: when a request fails, resend the same request -> use when the failure is transient (e.g., a network issue)
-
Failover: switch to a redundant or standby system of the same type
-
Databases
-
Load balancers
-
Leader and followers
-
Failover in microservices
-
Fallback: switch to an alternative system (of a different type)
- e.g., a message queue is a totally different type of system from service B
-
Fallback increases reliability but also increases system complexity
- It introduces new components
- It is hard to test
- A rarely used fallback system may not be ready to handle the failed requests
-
These can be used in combination: when a request fails, we retry 3 times and, if it still fails, we use a failover or fallback path (sketched below).
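A rough sketch of combining the strategies (retry a few times, then fall back); call_service_b, enqueue_for_later, and TransientError are hypothetical placeholders, not a specific library:

```python
import time

MAX_RETRIES = 3

class TransientError(Exception):
    """Placeholder for retryable failures raised by the primary dependency."""

def call_with_fallback(request, call_service_b, enqueue_for_later):
    # Primary path with a few retries for transient failures.
    for attempt in range(MAX_RETRIES):
        try:
            return call_service_b(request)
        except TransientError:
            time.sleep(0.1 * (attempt + 1))   # simple delay between attempts
    # Still failing after the retries: switch to the fallback path
    # (e.g., an alternative system such as a message queue).
    enqueue_for_later(request)
    return None
```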
When to retry
Idempotency
Quiz: which AWS API failures are safe to retry
-
Idempotency: no matter how many times the operation is performed, the result is always the same.
- x = 5 is idempotent
- x++ is not
-
Idempotent API: making multiple identical requests to that API has the same effect as making a single request
- Reads are typically idempotent
- Writes are typically not (e.g., creating a new record); what about PUT updates?
-
Can we retry a non-idempotent API? -> yes, if we know that:
- The original request never hit the API (nothing happened on the server side), or
- The API supports idempotency keys (client tokens)
- Service A sends a request with a token to service B
- If service B receives and processes the request, it stores the token
- When A sends the retry request, B finds the token, recognizes the duplicate, and does nothing with the request
- If the original request failed, there is no record of the token in B
- On retry, the request will be accepted and processed
-
The tokens are kept in temporary storage -> the idempotency key has a TTL (see the sketch below)
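A minimal server-side sketch of an idempotency key with a TTL; the in-memory dict stands in for the temporary storage, and process is a hypothetical business-logic function:

```python
import time

IDEMPOTENCY_TTL_S = 15 * 60      # how long a key is remembered (an assumption)
_seen_tokens = {}                # token -> time it was stored

def handle_request(token, payload, process):
    now = time.time()
    stored_at = _seen_tokens.get(token)
    if stored_at is not None and now - stored_at < IDEMPOTENCY_TTL_S:
        # Retry of a request that already succeeded: do nothing.
        return "duplicate-ignored"
    result = process(payload)    # hypothetical business logic
    _seen_tokens[token] = now    # record the token only after successful processing
    return result
```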
-
How do we know whether an API is idempotent? -> the service provider tells us, the developers
-
HTTP status codes and whether to retry:
- N/A (connection not established): retry yes; the request was never sent
- 400 Bad Request: no; the request itself is bad, so retrying won't help
- 403 Access Denied: no; the credentials are not right
- 404 Not Found: no; the requested resource does not exist
- 429 Too Many Requests: yes, with a delay; too many requests in a period of time, so wait a bit
- 500 Internal Server Error: yes, if the API is idempotent; unexpected internal error
- 503 Service Unavailable: yes; the service is not available and the request never hit the server
- 504 Gateway Timeout: yes, if the API is idempotent; the request hit the server but couldn't complete within the timeout period
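The table can be encoded as a small retry-decision helper; this is a rough sketch, not the actual logic of any AWS SDK:

```python
from typing import Optional

def should_retry(status_code: Optional[int], idempotent: bool) -> bool:
    if status_code is None:            # connection never established; the request was never sent
        return True
    if status_code in (429, 503):      # throttled or service unavailable: retry (429 with a delay)
        return True
    if status_code in (500, 504):      # server error / gateway timeout: only safe if idempotent
        return idempotent
    return False                       # 400, 403, 404, ...: retrying won't help
```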
How to retry
-
If service B is under pressure, retries will make it worse -> retry storm
-
Introduce delays between retry attempts, and increase the delay (wait time) after each retry (exponential backoff)
-
Limit the number of retries -> failover/fallback after several retries
-
Retries from different clients at the same time put pressure on the server -> force clients to retry at different times -> jitter (random intervals); see the sketch below
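A sketch of capped exponential backoff with full jitter; send_request is a hypothetical callable and the delay values are assumptions:

```python
import random
import time

BASE_DELAY_S = 0.1
MAX_DELAY_S = 10.0
MAX_RETRIES = 5

def retry_with_backoff(send_request):
    for attempt in range(MAX_RETRIES):
        try:
            return send_request()
        except Exception:
            if attempt == MAX_RETRIES - 1:
                raise                                    # give up -> failover/fallback next
            ceiling = min(MAX_DELAY_S, BASE_DELAY_S * (2 ** attempt))
            time.sleep(random.uniform(0, ceiling))       # jitter: random wait up to the ceiling
```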
-
Jitter
- Easy to implement
-
Data aggregation scenarios, e.g., application monitoring: if all clients send their metrics at the start of every minute, this puts load pressure on the monitoring system -> jitter:
-
In general: workloads that aggregate time series and flush them to a system, and any workload that runs periodically
-
Use jitter to avoid starting all the tasks at exactly the same time (see the sketch below)
-
Jitter can also help with resource contention when many clients wake up after waiting for the same event
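A sketch of the periodic-flush case above: each client waits a full period plus a random jitter, so they don't all hit the monitoring system at the same instant; flush_metrics and the period values are assumptions:

```python
import random
import time

FLUSH_PERIOD_S = 60
MAX_JITTER_S = 10

def run_periodic_flush(flush_metrics):
    while True:
        time.sleep(FLUSH_PERIOD_S + random.uniform(0, MAX_JITTER_S))
        flush_metrics()   # hypothetical: aggregate and send the collected metrics
```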
Message delivery guarantees
At most once guarantee
At least once guarantee
Exactly once guarantee
- Push: when the broker sends a message to a consumer, it waits for the ack to come back; if the broker was unable to deliver the message or the ack didn't come back, the broker may retry immediately or fall back
- Fallback option -> a dead-letter queue that holds poison pills (messages that failed to be delivered)
- Pull: the consumer sends a pull request; if it fails, the consumer retries. After the consumer receives the message, it sends a deletion request and the broker deletes the message (the ack)
- Competing consumers: speed up processing
- Push: the broker randomly selects one consumer and sends the message; if that fails, it retries, and if it still fails, it selects another consumer and sends the message there (failover)
- Pull: all consumers pull messages; when one consumer has received a message, the message should become invisible to the other consumers -> AWS SQS
- Visibility timeout -> while it lasts, the message cannot be pulled by other consumers; it should be long enough for one consumer to process one message
- Message delivery guarantees
- At most once: the message will never be delivered more than once; we try, we fail, and we give up -> the message might be lost
- At least once: retry the message until it is delivered -> messages might be delivered multiple times
- Exactly once: the message will be delivered and processed exactly once
- Why do we have these different guarantees? Isn't exactly once what we always want?
- Failures will happen -> the way failures are handled determines the guarantee
- The broker doesn't care whether the message was delivered; it sends the message and regards it as successfully delivered -> the message may be lost -> at most once -> good throughput but poor message safety
- The broker waits for the message ack to come back -> at least once -> the consumer may get the message multiple times; the ack should be sent back to the broker only after the message is processed
- If the ack is sent back before the message is processed and processing then fails, the message has already been deleted (because the ack was received), so there is no way to process it again
- Exactly once: Kafka, SQS FIFO queues
- They rely on either idempotency or distributed transactions to achieve exactly-once semantics
- If relying on idempotency, both the broker and the consumer must be idempotent: the broker detects and eliminates duplicate messages when they are published; the consumer avoids processing a message more than once
- Such systems are hard to build in practice; most systems are at least once (see the sketch below)
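A sketch of an at-least-once pull consumer on AWS SQS (boto3): the message is deleted (acked) only after it is processed, so a failure before the delete leads to redelivery rather than message loss; the queue URL and the process handler are assumptions:

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders"   # hypothetical queue

def consume_forever(process):
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=10,       # long polling
            VisibilityTimeout=30,     # long enough for one consumer to process one message
        )
        for msg in resp.get("Messages", []):
            process(msg["Body"])                  # process first (ideally idempotently)...
            sqs.delete_message(                   # ...then ack by deleting the message
                QueueUrl=QUEUE_URL,
                ReceiptHandle=msg["ReceiptHandle"],
            )
```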
Consumer offsets
Log-based messaging system
Checkpointing
- Two ways of processing messages
- First way: take the message from the head of the queue
- The consumer processes it
- The message is deleted from the head once it is processed
- Second way: retrieve a message from any position in the queue
- The consumer retrieves the message at position 0
- The consumer processes it, but the message is not deleted immediately after processing
- The consumer moves its index (offset) to the next position
- The broker is busy in the first option (ActiveMQ, RabbitMQ, SQS):
- It tracks the status of every consumer
- Marks messages as processed
- Waits for the acks
- The broker is free in the second option (Kafka) -> better performance because of this
- It only receives the position of the requested message and returns the message (the position is maintained by the consumers)
- More on the second option:
- Messages are stored both in memory and on disk (cheap and large)
- Messages are saved in segments; a segment is deleted when all messages in it are processed
- The broker assigns an increasing sequence number to each message to identify it within a segment
- Messaging systems that follow this approach are referred to as log-based messaging systems
- The consumer persists the offset (where and how is up to the consumer)
- The process of persisting the offset is called checkpointing -> the offset is written often and read rarely -> it is used to rewind or find the right position to reprocess messages when there is a bug, etc.
- Kinesis -> DynamoDB; Kafka -> ZooKeeper (ZooKeeper is not good at high-load writes, so offsets are now stored in the broker)
- Checkpointing and delivery guarantees:
- Checkpoint after the message is processed -> checkpointing may fail -> at least once
- Checkpoint before the message is processed -> if processing fails, we lose the message -> at most once
- Option 1 is reliable, and option 2 (log-based) is even more reliable, because in option 2 all the messages are still there even if something goes wrong (see the sketch below)
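A sketch of checkpointing after processing with kafka-python: auto-commit is disabled and the offset is committed only after the message is handled, which gives at-least-once behavior (committing before handling would give at most once); the topic, broker address, and handle function are assumptions:

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "orders",                          # hypothetical topic
    bootstrap_servers="localhost:9092",
    group_id="order-processors",
    enable_auto_commit=False,          # we checkpoint (commit offsets) manually
)

for message in consumer:
    handle(message.value)              # hypothetical processing step
    consumer.commit()                  # checkpoint: persist the offset after processing
```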