
System Design Interview and Beyond, Note 8 - Deliver Data Reliably

Deliver data reliably

What else to know to build reliable, scalable and fast systems?

List of common problems in distributed systems

List of system design concepts that can help solve these problems

Three-tier architecture

  1. Problems in distributed systems

    1. Transient network issues

    2. Long-lasting network issues

    3. Consumer application failures -> the broker may keep the messages for a while, until its disk is full;

      1. Timeouts, retries, idempotency, backoff with jitter, failover, fallback, message delivery guarantees, consumer offsets
    4. If messages come faster than the consumer can process them (due to a traffic burst or natural growth of load), we need to increase the throughput or reduce the latency:

      1. Batching, compression, scaling, partitioning
    5. Load shedding, rate limiting...

    6. Circuit breaker and bulkhead when some consumers fail (a rough circuit-breaker sketch follows this list)


    7. Three-tier architecture:

      1. Presentation tier: displaying data
      2. Application tier: business logic (functionality)
      3. Data tier: data persistence
    8. What if a request fails to pass through a tier? What if one tier becomes slow? ...
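The circuit breaker mentioned above can be sketched roughly as follows; this is a minimal illustration, and the class name, thresholds, and cool-down value are my own assumptions, not from the course:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after N consecutive failures it 'opens'
    and fails fast, protecting the struggling downstream service."""

    def __init__(self, failure_threshold=5, reset_timeout_sec=30):
        self.failure_threshold = failure_threshold
        self.reset_timeout_sec = reset_timeout_sec
        self.failure_count = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout_sec:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # cool-down elapsed: allow a trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.time()  # open the circuit
            raise
        self.failure_count = 0  # a success closes the circuit again
        return result
```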

Timeouts

Fast failures and slow failures

Connection and request timeouts

  1. Everything eventually fails;
  2. Fast failure: we learn about the failure quickly
  3. Slow failure: we don't learn about the failure immediately; the call appears to still be working and keeps occupying resources
  4. So we use timeouts to turn slow failures into fast failures
  5. Timeout: how much time the client is willing to wait for a response from the server
    1. Connection timeout: how much time to wait for a connection to be established (10-100 ms)
    2. Request timeout: how much time to wait for a request to complete
      1. Trick for choosing a timeout: analyze latency metrics (percentiles) and set the timeout around p99
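As a rough illustration (the endpoint URL and the exact values are assumptions), connection and request timeouts map directly onto the timeout tuple of Python's `requests` library:

```python
import requests

# timeout=(connection timeout, request/read timeout):
# ~100 ms to establish the connection, 1 s for the response,
# roughly a p99-based choice as described above.
try:
    resp = requests.get("https://service-b.example.com/orders/42",
                        timeout=(0.1, 1.0))
    resp.raise_for_status()
except requests.exceptions.ConnectTimeout:
    print("connection timeout: could not establish a connection in time")
except requests.exceptions.ReadTimeout:
    print("request timeout: server did not respond within the deadline")
```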

What do we do with failed requests ?

Strategies for handling failed requests (cancel, retry, failover, fallback)

  1. Cancel: when a request fails, cancel it and possibly raise an exception (try, fail, log). Useful when the failure is not transient, so resending won't help (e.g. an authentication failure).

  2. Retry: when a request fails, resend the same request. Use when the failure is transient (e.g. a network issue).

  3. Failover: switch to a redundant or standby system of the same type.

    1. Databases

    2. Load balancers

    3. Leaders and followers


  4. Failover in microservices


  5. Fallback: switch to an alternative system (of a different type)


    1. For example, a message queue is a totally different type of system than service B
  6. Fallback increases reliability, but also increases system complexity

    1. It introduces new components
    2. It is hard to test
    3. A fallback system that is rarely used may not be ready to handle the failed requests
  7. These can be combined: when a request fails, we retry it 3 times, and if it still fails, we take a failover or fallback path (a rough sketch follows this list).
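A minimal sketch of combining these strategies; the function names, exception type, and retry count are illustrative assumptions, and the fallback could be, for example, publishing the request to a message queue:

```python
import time

class TransientError(Exception):
    """Stand-in for errors worth retrying (network blips, 503s, ...)."""

def call_with_retry_then_fallback(primary, fallback,
                                  max_retries=3, base_delay_sec=0.1):
    """Retry the primary call a few times; if it keeps failing,
    take the fallback (or failover) path instead."""
    for attempt in range(max_retries):
        try:
            return primary()
        except TransientError:
            time.sleep(base_delay_sec * (2 ** attempt))  # back off before retrying
    return fallback()  # e.g. write the request to a message queue
```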

When to retry

Idempotency

Quiz: which AWS API failures are safe to retry

  1. Idempotency: no matter how many times the operation is performed, the result is always the same.

    1. x = 5 is idempotent
    2. x++ is not
  2. Idempotent API: making multiple identical requests to the API has the same effect as making a single request

    1. READ is typically idempotent
    2. WRITE is typically not (e.g. it creates a new record) (what about PUT/update?)
  3. Can we retry a non-idempotent API? Yes, if we know that:

    1. The original request never hit the API (nothing happened on the server side), or
    2. The API supports idempotency keys (client tokens); see the sketch after the table below
      1. Service A sends the request with a token to service B;
      2. If service B receives and processes the request, it stores the token;
      3. When A sends the retry request, B finds it is a duplicate and does nothing with it;
      4. If the original request failed, there is no record of the token in B;
      5. So when A retries, the request will be accepted and processed
  4. Idempotency keys are kept in temporary storage with a TTL

  5. How do we know whether an API is idempotent? The service provider tells us (the developers)

  6. Retry decision by HTTP status code:

     | HTTP status code | Description | Retry? | Comments |
     | --- | --- | --- | --- |
     | N/A | Connection not established | Yes | The request was never sent |
     | 400 | Bad request | No | The request itself is bad; retrying won't help |
     | 403 | Access denied | No | The credentials are not right |
     | 404 | Not found | No | The requested resource doesn't exist |
     | 429 | Too many requests | Yes, with a delay | Too many requests within a period of time, so wait a bit |
     | 500 | Internal server error | Yes, if idempotent | Internal unexpected error |
     | 503 | Service unavailable | Yes | The service is not available; the request did not hit the server |
     | 504 | Gateway timeout | Yes, if idempotent | The request hit the server but couldn't complete within the timeout period |
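A minimal sketch of server-side idempotency keys with a TTL; the in-memory dict and the function names are assumptions for illustration, and a real service would keep the keys in a dedicated store such as Redis or DynamoDB:

```python
import time

IDEMPOTENCY_TTL_SEC = 24 * 3600
seen_keys = {}  # idempotency key -> (expires_at, cached response)

def process_payment(request):
    # Stand-in for the real, non-idempotent business logic.
    return {"status": "charged", "amount": request["amount"]}

def handle_payment(idempotency_key, request):
    now = time.time()
    entry = seen_keys.get(idempotency_key)
    if entry and entry[0] > now:
        # Duplicate retry: do nothing, return the original result.
        return entry[1]
    response = process_payment(request)
    seen_keys[idempotency_key] = (now + IDEMPOTENCY_TTL_SEC, response)
    return response

# Retrying with the same key is safe: the payment is charged only once.
print(handle_payment("key-123", {"amount": 10}))
print(handle_payment("key-123", {"amount": 10}))
```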

How to retry

  1. If service B is under pressure, retries will make it worse -> retry storm

    1. Introduce delays between retry attempts, and increase the delay (wait time) after each retry (exponential backoff)

    2. Limit the number of retries -> fail over or fall back after several retries

    3. Retries from different clients at the same time put pressure on the server -> force clients to retry at different times -> jitter (random intervals); see the backoff-with-jitter sketch after this list


  2. Jitter

    1. Easy to implement


    2. Data aggregation scenarios, e.g. application monitoring: if all agents send monitoring data exactly at the start of each minute, this causes load pressure on the monitoring system -> add jitter:


    3. In general: workloads that aggregate time series and flush them to a system, or any workload that runs periodically

      1. Use jitter to avoid starting all the tasks at exactly the same time


    4. Jitter can also help with resource contention when many processes wake up after waiting for the same event
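A sketch of exponential backoff with "full" jitter; the parameter values are illustrative and this follows the well-known backoff-and-jitter pattern rather than any particular library:

```python
import random
import time

def retry_with_backoff_and_jitter(fn, max_retries=5,
                                  base_delay_sec=0.1, max_delay_sec=5.0):
    """Retry fn, sleeping a random time between 0 and an exponentially
    growing cap, so that many clients don't retry in lockstep."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: let the caller fail over or fall back
            cap = min(max_delay_sec, base_delay_sec * (2 ** attempt))
            time.sleep(random.uniform(0, cap))  # the jitter part
```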

Message delivery guarantees

At most once guarantee

At least once guarantee

Exactly once guarantee

  1. Push: when the broker sends a message to a consumer, it waits for the ack to come back; if the broker was unable to deliver the message, or the ack didn't come back, the broker may retry immediately or fall back
    1. Fallback option -> a dead-letter queue that holds poison pills (messages that failed to be delivered)
  2. Pull: the consumer sends a pull request and retries if it fails; after the consumer receives and processes the message, it sends a deletion request and the broker deletes the message (the ack);
  3. Competing consumers: speed up processing
    1. Push: the broker randomly selects one consumer and sends the message; if that fails, it retries, and if it still fails, it selects another consumer to send the message to (failover)
    2. Pull: all consumers pull messages; while one consumer is handling a message, it should be invisible to the others -> AWS SQS
      1. Visibility timeout -> the message cannot be pulled by other consumers; it should be long enough for one consumer to process one message
  4. Message delivery guarantees
  5. At most once: the message will never be delivered more than once; we try, we fail, and we give up... -> the message might be lost
  6. At least once: retry the message until it is delivered -> messages might be delivered multiple times
  7. Exactly once: the message will be delivered and processed exactly once
  8. Why do we have these different guarantees? Isn't exactly once what we want?
  9. Failures will happen -> the way failures are handled determines the guarantee (see the consumer sketch after this list)
    1. The broker doesn't care whether the message was delivered; it sends the message and regards it as successfully delivered -> messages may be lost -> at most once -> good throughput but poor message safety
    2. The broker waits for the ack to come back -> at least once -> the consumer may get the message multiple times -> the ack should be sent back to the broker only after the message has been processed
      1. If the ack is sent back before the message is processed and processing then fails, the message has already been deleted (because the ack was received), so there is no way to process it again
    3. Exactly once: Kafka, SQS FIFO queues
      1. They rely on either idempotency or distributed transactions to achieve exactly-once semantics
        1. If relying on idempotency, both the broker and the consumer must be idempotent: the broker detects and eliminates duplicate messages when they are published; the consumer avoids processing a message more than once
        2. Such systems are hard to build in practice; most systems are at least once
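A toy sketch of how the placement of the ack (the delete) decides the guarantee; `SimpleQueue`, `receive`, and `ack` are made-up stand-ins for a real broker API:

```python
class SimpleQueue:
    """Toy stand-in for a broker: receive() returns the next message,
    ack() deletes it from the queue."""
    def __init__(self, messages):
        self._messages = list(messages)
    def receive(self):
        return self._messages[0] if self._messages else None
    def ack(self, msg):
        self._messages.remove(msg)

def process(msg):
    print("processing", msg)

def consume_at_most_once(q):
    while (msg := q.receive()) is not None:
        q.ack(msg)      # delete first: a crash in process() loses the message
        process(msg)

def consume_at_least_once(q):
    while (msg := q.receive()) is not None:
        process(msg)    # process first: if the ack is lost, the message is
        q.ack(msg)      # redelivered and processed again (needs idempotency)

consume_at_least_once(SimpleQueue(["m1", "m2"]))
```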

Consumer offsets

Log-based messaging system

Checkpointing

  1. Two ways of processing messages; the first way:
    1. Take the message from the head of the queue;
    2. The consumer processes it;
    3. The message is deleted from the head once it has been processed;
  2. The second way:
    1. Retrieve a message from any position in the queue
    2. The consumer retrieves the message at position 0,
    3. processes it, but the message is not deleted immediately after processing;
    4. the consumer then moves its index to the next position;
  3. The broker is busy in the first option (ActiveMQ, RabbitMQ, SQS):
    1. It tracks the status for every consumer
    2. Marks messages as processed
    3. Waits for the acks
  4. The broker is free in the second option (Kafka) -> better performance because of this
    1. It receives the position of the requested message and returns the message (the position is maintained by the consumers)
  5. More on option 2:
    1. Messages are stored both in memory and on disk (cheap and large)
    2. Messages are saved in segments; a segment is deleted when all messages in it have been processed
    3. The broker assigns an increasing sequence number (offset) to each message to identify messages within segments
  6. Messaging systems that follow this approach are referred to as log-based messaging systems.
  7. The consumer persists the offset (where and how is up to the consumer)
  8. The process of persisting the offset is called checkpointing -> the offset is written often and read rarely -> it is used to rewind or find the right position to reprocess messages when there is a bug...
  9. Kinesis stores offsets in DynamoDB; Kafka used ZooKeeper (but it is not good at high-load writes, so offsets are now stored in the broker itself)
  10. Checkpointing and delivery guarantees (see the sketch below):
    1. Checkpoint after the message is processed -> checkpointing may fail and the message gets reprocessed -> at least once
    2. Checkpoint before the message is processed -> if processing fails, we lose the message -> at most once
  11. Option 1 is reliable, but option 2 is even more reliable, because in option 2 all messages are still in the log even if something goes wrong
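A toy sketch of offset checkpointing against a log; the list-based log and dict-based checkpoint store are illustrative stand-ins for a real broker and offset store:

```python
log = ["m0", "m1", "m2", "m3"]     # append-only log kept by the broker
checkpoint_store = {"offset": 0}   # e.g. ZooKeeper, an internal topic, or DynamoDB

def process(msg):
    print("processing", msg)

def consume_at_least_once():
    offset = checkpoint_store["offset"]
    while offset < len(log):
        process(log[offset])                 # process first
        offset += 1
        checkpoint_store["offset"] = offset  # checkpoint after processing:
                                             # a crash before this line => reprocessing

def consume_at_most_once():
    offset = checkpoint_store["offset"]
    while offset < len(log):
        checkpoint_store["offset"] = offset + 1  # checkpoint before processing:
        process(log[offset])                     # a crash here => the message is skipped
        offset += 1

consume_at_least_once()
```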