
System Design Interview and beyond Note11 - Protect server from client

Protect Servers from clients

System Overload

Why it's important to protect a system from overload


Auto scaling

Scaling policies (metric-based, schedule-based, predictive)

  1. Overload happens when the existing computational resources are not enough to process all the load;

  2. -> add more servers manually?

    1. Doing it often is operationally expensive
    2. Infrequent changes increase the likelihood of overload
    3. Leaving excess capacity beyond the actual need is a waste of money
  3. We should find ways to add and remove servers automatically -> auto scaling

    1. Improves availability -> the system adds or removes machines in response to unpredictable traffic spikes
    2. Reduces cost
    3. Improves performance for most requests
  4. Types / policies

    1. Metric-based: performance metrics are used to decide whether to scale -> scale out when average CPU utilization is over 80% and scale in when CPU utilization is less than 40% (see the policy sketch after this list)
      • CPU utilization
      • Memory utilization
      • Disk utilization
      • Request count
      • Active threads count
    2. Schedule-based: define a schedule for the autoscaling system to follow: scale out for business hours and scale in for non-business hours (nights and weekends)
    3. Predictive: machine learning models are used to predict expected traffic -> scale out when average CPU is expected to increase and scale in when average CPU is expected to decrease -> uses historical data for prediction
    4. Can be used together
  5. Applying this to messaging systems

    1. Monitor the depth of the queue: when it exceeds a limit, we add consumers; when the queue shrinks, we reduce the number of consumer instances;

    2. This only works for the competing consumers model (RabbitMQ & SQS)

    3. Kafka and Kinesis implement auto scaling differently -> consumers track per-partition offsets -> they scale out by adding partitions

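A minimal sketch of the metric-based policy in Python, assuming hypothetical scaling hooks; the 80%/40% thresholds come from the example above, while the function name and the min/max bounds are illustrative assumptions.

```python
# Sketch of a metric-based scaling decision (hypothetical hooks, assumed bounds).
SCALE_OUT_THRESHOLD = 0.80  # scale out when average CPU utilization is above 80%
SCALE_IN_THRESHOLD = 0.40   # scale in when average CPU utilization is below 40%

def desired_instance_count(avg_cpu: float, current: int, minimum: int, maximum: int) -> int:
    """Return the desired number of instances for the observed average CPU utilization."""
    if avg_cpu > SCALE_OUT_THRESHOLD and current < maximum:
        return current + 1   # add an instance
    if avg_cpu < SCALE_IN_THRESHOLD and current > minimum:
        return current - 1   # remove an instance
    return current           # utilization is in the comfortable band, do nothing

# Example: 85% average CPU on a 4-instance cluster bounded by [2, 10] -> grow to 5.
print(desired_instance_count(0.85, current=4, minimum=2, maximum=10))
```

The same shape works for the messaging case: swap the CPU metric for the queue depth and add or remove consumer instances instead.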

Autoscaling system design

How to design an autoscaling system

  1. We have a web service and want to scale out and in based on its metrics;
  2. So we add a monitoring system to the service;
  3. On the other hand, we need to configure the rules that drive autoscaling for the service -> configuration service
  4. For the configuration service, we need storage for the rules;
  5. For such rules, we need a coordinator/rule checker that periodically fetches the rules and the metrics to check whether we need to scale (see the rule-checker sketch after this list)
    1. The rules may be large in number, so they can be split into chunks;
    2. One way to implement this is to elect a leader that acts as a coordinator, deciding how to partition the rules and how to assign rule partitions to workers for processing
    3. Another way is to split the rules into a fixed number of partitions (say 10) and run one worker per partition; with distributed locks, each machine acquires a different partition;
  6. The rule checker can then publish scaling messages (tasks) to a message queue;
  7. For consuming the queue messages we can use the competing consumers model to increase performance;
  8. Then we need a consumer, called the provisioning service: when a message is consumed, new machines are added to, or existing machines removed from, a particular cluster.
    1. The provisioning service has to find the web service to scale -> service discovery problem
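
A rough sketch of the rule checker from step 5, assuming hypothetical `fetch_metric` and `publish_task` hooks for the monitoring system and the task queue; rule storage, leader election, and partition assignment are left out.

```python
import json
import time

def run_rule_checker(rules, fetch_metric, publish_task, interval_seconds=60):
    """Periodically evaluate autoscaling rules and publish scaling tasks to the queue.

    Each rule is assumed to look like:
    {"service": "web", "metric": "cpu", "scale_out_above": 0.8, "scale_in_below": 0.4}
    """
    while True:
        for rule in rules:
            value = fetch_metric(rule["service"], rule["metric"])  # ask the monitoring system
            if value > rule["scale_out_above"]:
                publish_task(json.dumps({"service": rule["service"], "action": "scale_out"}))
            elif value < rule["scale_in_below"]:
                publish_task(json.dumps({"service": rule["service"], "action": "scale_in"}))
        time.sleep(interval_seconds)  # wait for the next check interval
```

On the other side of the queue, the provisioning service would act as a competing consumer and translate each task into adding or removing machines.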

Load Shedding

How to implement

Important considerations

  1. Auto scaling adds capacity, but sometimes that is not enough: the auto scaling system may not scale in time, so we need additional protections.
    1. The monitoring system needs time to collect metric data
    2. Provisioning new machines takes time;
    3. Deploying services and bootstrapping on new machines takes time;
  2. Load shedding: drop incoming requests when some metric reaches a threshold (until more capacity is available)
  3. How to determine the threshold (CPU, storage, ...) -> run load testing
  4. How to implement it (a load-shedding sketch follows this list):
    1. Thread per connection
      1. Limit the number of connections -> limit the number of clients
      2. Limit the number of requests -> limit the number of processing threads
      3. Monitor system performance and start dropping requests when a limit is reached
    2. Thread per request with non-blocking I/O
      1. Limit the number of connections -> limit the number of clients
      2. Limit the number of requests -> limit the number of processing threads
      3. Monitor system performance and start dropping requests when a limit is reached
    3. Event loop
      1. Limit the number of connections -> limit the number of clients
      2. Limit the number of requests -> limit the size of the processing queue
      3. Monitor system performance and start dropping requests when a limit is reached
  5. Tuning the thread pool size is hard: too big puts pressure on the server, and too small may leave CPU/memory resources underused -> load testing helps find the right size
  6. e.g. in a messaging system -> RabbitMQ -> consumer prefetch -> helps avoid overloading a consumer with too many messages (see the pika sketch after this list)
    1. The consumer side configures a prefetch buffer that defines the max number of messages the consumer can hold unacknowledged;
    2. When the buffer is full, it means the consumer is still processing messages;
    3. The broker doesn't push more messages to the consumer
    4. When the buffer has space again, the broker continues pushing messages
  7. Similar to TCP flow control
    1. The TCP sender sends packets to the receiver and waits for acks;
    2. The ack contains how much buffer space the receiver is willing to accept;
  8. Difference from backpressure -> whether the other side (broker/clients) is involved in the procedure
    1. Prefetch doesn't signal the overload back to the broker or drop messages -> it is the consumer's own business and is called load shedding (self-defense)
    2. If the consumer responds to the broker on overload -> rejection -> backpressure
  9. Considerations
    1. Request priority -> e.g. health-check requests from the LB, requests from people over robots
    2. Request cost: fetching data from a cache vs heavy calculations
    3. Request duration: LIFO -> if the service is busy processing requests, it's better to focus on the most recent ones (e.g. a user tired of waiting for a response refreshed the page in the browser); what's more, older requests may already have timed out -> but how does the server know a request has timed out? -> a timeout hint sent along with the request
    4. Using load shedding together with autoscaling: avoid load shedding blocking autoscaling -> when you have both, configure a lower threshold for autoscaling so that scaling triggers before shedding starts.
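
A minimal load-shedding sketch for the thread-per-request case, assuming hypothetical `handle` and `reject` callables; a counting semaphore caps in-flight requests, and anything beyond the cap is dropped immediately instead of piling up.

```python
import threading

MAX_IN_FLIGHT = 100                              # threshold found via load testing (assumed value)
_slots = threading.Semaphore(MAX_IN_FLIGHT)      # counts available processing slots

def serve(request, handle, reject):
    """Process the request if capacity allows; otherwise shed it right away."""
    if not _slots.acquire(blocking=False):       # no free slot -> shed load
        return reject(request)                   # e.g. respond with 503 Service Unavailable
    try:
        return handle(request)
    finally:
        _slots.release()                         # free the slot for the next request
```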
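
For the RabbitMQ prefetch point in item 6, a small consumer sketch using the pika client: `basic_qos(prefetch_count=...)` caps the number of unacknowledged messages delivered to this consumer, so the broker stops pushing once the buffer is full. The queue name and processing logic are assumptions.

```python
import pika  # RabbitMQ client library

def process(body: bytes) -> None:
    print("processing", body)  # placeholder for real message handling

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="tasks", durable=True)  # hypothetical queue name

# At most 10 unacknowledged messages are in flight to this consumer at any time;
# the broker pushes more only after earlier messages are acked.
channel.basic_qos(prefetch_count=10)

def on_message(ch, method, properties, body):
    process(body)
    ch.basic_ack(delivery_tag=method.delivery_tag)  # ack frees a slot in the prefetch buffer

channel.basic_consume(queue="tasks", on_message_callback=on_message)
channel.start_consuming()
```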

Rate limiting

How to use the knowledge gained in the course to solve the problem of rate limiting (step by step)

  1. Unfair resource usage -> noisy neighbors
  2. How to solve it? -> we need to limit the total number of requests coming into the system; on the other hand, we need to ensure each client is not impacted by others (they share the total resources)
    1. Share resources evenly -> split the limit across clients -> quota
  3. Rate limiting (throttling): controls the rate of requests sent or received on the network (on a per-client basis)
  4. How long should the interval be / how do we account for heterogeneous requests?
    1. Assumption: 1 second (configurable) and all requests are equal
  5. Start with a single server:
    1. Where to store the counter: memory or disk -> an in-memory cache, since the data is short-lived -> a unique clientID is the key and the request count is the value -> the time interval also needs to be part of the key (let's do this first; see the counter sketch after this list)
    2. Two problems: concurrency and cache size
      1. Concurrency is a problem if we use a thread-per-request model and two different threads try to modify the same cache entry -> locks or atomic values
      2. Too many keys -> LRU eviction or time-based expiration, or both -> active expiration is more effective
  6. Add another server:
    1. Split the quota and route requests via the LB -> how to share changes across the cluster in a timely manner when machines are added or removed
      1. Configuration tools to share and deploy to servers
      2. Database and pull data periodically
      3. Gossip protocol
    2. Issues:
      1. The LB doesn't guarantee a uniform distribution of each client's requests
        1. If the quota is low, some servers will start throttling requests long before the total quota is reached.
        2. The LB balances connections, not requests; some persistent connections may carry more requests than others
  7. Servers share the number of requests they received
    1. Gossiping
    2. Service registry or seed nodes advertised by DNS
    3. Use TCP and UDP
  8. This doesn't scale well -> communication increases with the number of servers in the cluster
  9. Use a shared cache; servers report their counts to the cache
  10. What clients should do with throttled requests
    1. Retries
    2. Fallback
    3. Exponential backoff with jitter (see the retry sketch after this list)
    4. Batching
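
A single-server sketch of the counter from step 5, assuming a per-client limit value: the key combines the clientID with the current one-second window, and the counters live in an in-memory dict standing in for the cache.

```python
import threading
import time
from collections import defaultdict

REQUESTS_PER_SECOND = 50         # assumed per-client quota on this server
_counters = defaultdict(int)     # in-memory cache: (clientID, window) -> request count
_lock = threading.Lock()         # guards concurrent updates in a thread-per-request model

def allow_request(client_id: str) -> bool:
    """Fixed-window check: allow at most REQUESTS_PER_SECOND requests per client per second."""
    window = int(time.time())    # the current one-second interval is part of the key
    key = (client_id, window)
    with _lock:
        # Time-based expiration: drop counters from earlier windows to bound the cache size.
        for old_key in [k for k in _counters if k[1] < window]:
            del _counters[old_key]
        _counters[key] += 1
        return _counters[key] <= REQUESTS_PER_SECOND
```

In the multi-server variants, the same counter is either split per server (quota divided across the cluster) or kept in the shared cache that all servers report to.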
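
For the retry options above, a client-side sketch of exponential backoff with full jitter; `send` is an assumed callable that returns an HTTP-style status code.

```python
import random
import time

def call_with_backoff(send, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry throttled (429) responses with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        status = send()                                # assumed callable returning a status code
        if status != 429:                              # not throttled -> done
            return status
        backoff = min(max_delay, base_delay * 2 ** attempt)
        time.sleep(random.uniform(0, backoff))         # full jitter spreads retries apart
    return 429                                         # give up; the caller may fall back or batch
```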