System Design Interview and Beyond Note 1 - Requirements

Goal: reinforce the understanding of frequently used system design concepts and demonstrate how to apply them to solve problems.

How to Define System Requirements

System Requirements

  1. What functional and non-functional requirements are

    • Functional requirements define the behavior of a system: what it is supposed to do. E.g. the system must allow applications to exchange messages.

    • Non-functional requirements define the qualities of a system: how it is supposed to be. E.g. scalable, highly available, fast.

  2. Why they are important

    • Interviewers tend to keep problems vague to see how candidates approach them. Non-functional requirements help guide us in the right direction when the question is open-ended.
    • They also drive day-to-day technical design work.
  3. An example of the thinking process when designing a scalable, highly available, and fast message queue:

    1. Start with the scalability requirement -> Do we need to scale for reads or writes? -> Probably both, since messages are both written and consumed.
      1. To scale writes -> partition messages across multiple queues -> which partitioning strategy? Hash-based?
      2. Where to store the messages? Memory or disk? -> If disk, an append-only log or an embedded database? -> If a database, B-tree or LSM-tree based? LSM-tree based, since they are generally faster for writes.
      3. Partitioning helps scale reads as well, since each partition has its own consumers.
      4. Reads: push or pull? If pull, the system might use long polling to decrease the number of read requests.
    2. High availability -> replicate messages -> leader-based or leaderless? -> most likely leader-based -> leader election -> a coordination service or a database that ensures strong consistency.
    3. Reliability -> protection -> rate limiting or load shedding.
    4. Shuffle sharding?
    5. Reverse proxy?
    6. Batch and compress messages to make it fast.
    7. Internal network -> maybe use TCP rather than HTTP.
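
The hash-based partitioning step in the walkthrough above can be sketched as follows. This is a minimal illustration, not a production design: the function names `partition_for` and `route`, the use of SHA-256, and the (key, payload) message shape are all assumptions made for the example.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition index with a stable hash."""
    # Python's built-in hash() is randomized per process for strings, so a
    # stable digest is used to get the same partition on every producer.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def route(messages, num_partitions):
    """Group (key, payload) pairs into per-partition lists, preserving order."""
    partitions = [[] for _ in range(num_partitions)]
    for key, payload in messages:
        partitions[partition_for(key, num_partitions)].append(payload)
    return partitions
```

Messages with the same key always land in the same partition, so per-key ordering is preserved. The trade-off is that changing the number of partitions remaps most keys.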

Functional requirements

  1. How to define functional requirements
    1. Define who is using the system and how
      1. Systems where the user is easy to identify: Gmail, YouTube...
      2. Systems where it is hard -> rate limiting system, fraud prevention system, CDN...; it is challenging to know the input and output.
    2. How to start when the user or usage is unclear: start with a user (people, devices, ...) and work backwards
      1. Backwards: once you define the user, you know how they use the system (behaviors).
      2. E.g. YouTube: creators and viewers (users)
        1. Creators mostly upload and edit videos. (how)
        2. Viewers mostly watch, like, and comment on videos. (how)
    3. Sometimes the requirements are clear and we are asked to define the API (following REST/RPC guidelines):
      1. Upload a video to the channel -> POST /channels/{channel_id}/videos
      2. Return a list of videos for the channel, ordered by popularity -> GET /channels/{channel_id}/videos?sort_by=view_desc
      3. Search for videos and channels -> GET /search?q={keywords}
      4. Watch the specific video -> GET /videos/{video_id}
      5. Delete the specific video -> DELETE /videos/{video_id}
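
One way to check that such an API definition is consistent is to write it down as a route table and match requests against it. The sketch below is an assumption-laden illustration: the handler names are hypothetical, single-video paths are written in the conventional `/videos/{video_id}` form, and only the path is matched (query strings such as `?sort_by=...` are not parsed).

```python
import re

# (HTTP method, path template, hypothetical handler name)
ROUTES = [
    ("POST", "/channels/{channel_id}/videos", "upload_video"),
    ("GET", "/channels/{channel_id}/videos", "list_channel_videos"),
    ("GET", "/search", "search"),
    ("GET", "/videos/{video_id}", "get_video"),
    ("DELETE", "/videos/{video_id}", "delete_video"),
]


def _compile(template: str) -> re.Pattern:
    # Turn "{name}" placeholders into named groups matching one path segment.
    pattern = re.sub(r"\{(\w+)\}", r"(?P<\1>[^/]+)", template)
    return re.compile("^" + pattern + "$")


COMPILED = [(method, _compile(template), handler)
            for method, template, handler in ROUTES]


def match(method: str, path: str):
    """Return (handler_name, path_params), or None when no route matches."""
    for route_method, pattern, handler in COMPILED:
        found = pattern.match(path)
        if route_method == method and found:
            return handler, found.groupdict()
    return None
```

Writing the table out makes it easy to spot that the same resource path serves different operations depending on the HTTP method.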

High Availability

  1. Time-based and count-based availability (What)

    1. Time-based: uptime, the percentage of time the system has been working and available.
    2. Count-based: success ratio, the percentage of requests that succeeded.

    100% availability is impossible in reality. High availability is less about a specific uptime or success number and more about the process and architecture behind it.

  2. Design principles behind high availability

    1. Build redundancy to eliminate single points of failure (regions, availability zones, fallback, data replication, high availability pairs...)
    2. Switch from one server to another without losing data (DNS, load balancing, reverse proxy, API gateway, peer discovery, service discovery...)
    3. Protect the system from atypical client behavior (load shedding, rate limiting, shuffle sharding, cell-based architecture...)
    4. Protect system from failures and performance degradation of its dependencies(timeouts, circuit breaker, retries, bulkhead, idempotency, ...)
    5. Detect failures as they occur(monitoring at all levels,...)
  3. Process behind high availability (How)

    1. Change management: All code and configuration changes are reviewed and approved
    2. QA process: regularly exercise tests to validate that newly introduced changes meet functional and non-functional requirements
    3. Deployment: deploy changes to production frequently, quickly, and safely; roll back automatically.
    4. Capacity planning: monitor system utilization and add resources to meet growing demand.
    5. Disaster recovery: recover system quickly in the event of disaster; regularly test failover to disaster recovery.
    6. Root cause analysis: establish the root cause of the failure and identify preventive measures.
    7. Operational readiness review: evaluate system's operational state and identify gaps in operation; define actions to remediate risks.
    8. Game day: simulate a failure or event and test the system and team response.
    9. Team culture: good culture promotes process discipline.
  4. SLO & SLA

    1. Service level objective (SLO): the availability goal the team sets for the service.
    2. Service level agreement (SLA): the agreement with customers about that availability goal.
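
The two availability definitions and an availability goal translate into simple arithmetic. The sketch below is illustrative only; the function names and the 30-day default period are assumptions.

```python
def time_based_availability(uptime_seconds: float, total_seconds: float) -> float:
    """Fraction of time the system was up."""
    return uptime_seconds / total_seconds


def count_based_availability(successful_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded."""
    return successful_requests / total_requests


def allowed_downtime_minutes(slo: float, period_days: float = 30) -> float:
    """Downtime budget implied by an availability goal over a period."""
    # Example: a 99.9% goal over 30 days leaves roughly 43.2 minutes.
    return (1 - slo) * period_days * 24 * 60
```

Each extra "nine" in the goal cuts the downtime budget by a factor of ten, which is why higher goals demand more redundancy and stricter processes.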

Fault tolerance, resilience, reliability

  1. Fault tolerance = high availability ?

    1. Fault tolerance (from Wikipedia): the property that enables a system to continue operating properly in the event of one or more faults within some of its components.
      1. Fault: can be introduced by the system or by an engineer.
    2. A fault-tolerant system aims for zero downtime.
    3. A highly available system accepts some downtime and tries to minimize it (no 100% availability).
    4. Fault tolerance is a higher level of availability (with more redundancy and higher cost).
  2. Error, fault, failure

    1. Error: a mistake a developer makes when writing a program; it leads to bugs (faults) -> one or several faults result in a system failure. A failure is the inability of a system to perform its required function.
  3. Resilience is almost equivalent to fault tolerance.
  4. Game day vs chaos engineering
    1. A game day helps the team respond quickly and properly. (Team behavior)
    2. Chaos engineering is like randomly killing a server to test the system's ability to deal with the situation automatically. (System behavior)
  5. Reliability = high availability + correctness + time
    1. Correctness: system returns the correct result
    2. Time: system replies back in time
  6. Reliability vs high availability vs fault tolerance vs resilience
    1. Reliability: the system always performs correctly and in time
    2. High availability: small downtime
    3. Fault tolerance: close to zero downtime
    4. Resilience: quick recovery from failures.
  7. Expected and unexpected failures
    1. Reliability and high availability: the system can handle expected failures:
      1. Server crash
      2. Power outage
      3. Network problem
    2. Fault tolerance: the system knows how to handle unexpected issues quickly:
      1. Load spike
      2. Dependency failures

Scalability

  1. The ability to handle growing load (# of requests, volume of incoming/outgoing data, # of concurrent connections...)
  2. Vertical scaling: a more powerful machine. Simple, but as machines become more powerful they also become more expensive, and there is a limit to how powerful a single machine can be.
  3. Horizontal scaling: more machines. Nearly unlimited potential, but the system becomes more complicated:
    1. Service discovery: requests must be able to find all the machines
    2. Load balancing: distribute load evenly
    3. Request routing (some requests must go to the same machine to maintain state)
    4. Maintenance of multiple machines
  4. Relational database: single DB -> scale up to a limit -> sharding to scale writes and replication to scale reads...
  5. It is a trade-off: don't assume horizontal scaling -> clarify the requirements
  6. Horizontal and vertical scaling can be used in conjunction -> keep the # of machines small -> upgrade machines -> becomes expensive at some point -> scale horizontally
  7. Elasticity: the ability of a system to acquire resources when needed and release them when it no longer needs them.
  8. Compared with scalability -> elasticity addresses short-term, tactical needs, while scalability addresses long-term, strategic needs.
    1. A service has a higher volume of requests during the day, so it uses more machines then -> elasticity.
    2. The service/business becomes popular over time and the overall request volume grows several times over, so the system needs to be scaled -> scalability.
  9. Scaling can be automatic (autoscaling) or manual.
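
Elasticity and autoscaling come down to recomputing the fleet size from the current load. The sketch below is a rough illustration; the function name, the 80% target utilization, and the minimum and maximum bounds are assumptions chosen for the example.

```python
import math


def desired_instances(current_rps, rps_per_instance, target_utilization=0.8,
                      min_instances=2, max_instances=50):
    """Return how many machines to run for the current load."""
    # Size the fleet so each machine runs at roughly target_utilization.
    needed = math.ceil(current_rps / (rps_per_instance * target_utilization))
    # Clamp: keep a floor for redundancy and a ceiling for cost control.
    return max(min_instances, min(max_instances, needed))
```

With machines that each handle 100 requests per second, a daytime load of 1000 requests per second needs 13 machines and a night-time load of 200 needs 3; releasing the other 10 at night is the elasticity part.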

Performance

  1. The time required to process something (latency) or the rate at which something is processed (throughput); e.g. a network disk download rate of 10 MB/s (throughput) or downloading a 100 MB file in 5 seconds (latency).
  2. Response time: network delay + server processing time
  3. Latency: depending on context, it can mean network delay, server processing time, or the full response time.
    1. Network: protocols / OSI model
    2. Server-side latency: faster algorithms, memory vs disk, local cache, thread pools and parallel processing...
    3. Client-side: blocking vs non-blocking I/O, message format, data compression, CDN, external cache...
  4. Average latency
  5. Percentiles: P99, P75, P50...: the latency within which that percentage of requests is processed
    1. What is the goal of reducing latency?
    2. Commitment to customer (SLO/SLA)
  6. Throughput (rate)
    1. Decrease latency
    2. Scale system
      1. File transfer -> chunk the file -> send chunks to many workers (MapReduce-style batch processing);
      2. Message queue -> more consumers -> or even more queues
      3. Increase write throughput: sharding (partitioning) -> write different parts of the data to different database partitions;
      4. Increase read throughput: replication -> serve reads from replicas
  7. Bandwidth: the maximum rate of data transfer across a given path (bps)
  8. Increasing network transfer throughput may require increasing bandwidth.
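
Average latency, percentiles, and throughput can be computed from a list of per-request latencies. The sketch below uses the nearest-rank percentile definition, which is one of several common definitions; the function names are assumptions.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def average(samples):
    """Arithmetic mean of the samples."""
    return sum(samples) / len(samples)


def throughput(request_count, elapsed_seconds):
    """Requests processed per second."""
    return request_count / elapsed_seconds
```

For the latencies `[10, 12, 11, 13, 12, 11, 10, 14, 12, 500]` (in ms), the average is 60.5, P50 is 12, and P99 is 500: a single slow request distorts the average, while percentiles show both the typical and the worst-case experience.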

Durability

  1. Once data is successfully submitted to the system, it is not lost, even if faults occur.
  2. How to achieve it? -> keep redundant copies of the data
  3. Backup: periodically copy data to separate non-volatile storage (disk).
    1. Full backup: store a full copy of the data at each backup. Restoration time is short, but backup time may be long.
    2. Differential backup: only save the changes since the last full backup. Smaller size and shorter creation time, but longer restoration time compared to a full backup.
    3. Incremental backup: only save the changes since the preceding backup of any type. Even smaller size and shorter creation time, but a longer and more complex restoration (the full backup and every increment must be applied in order).
    4. There is a gap between data changes and the latest backup, so some data may not be backed up; restoration also takes time and a new drive (during which the data is not available!).
  4. RAID (redundant array of independent disks): a combination of multiple physical disks into one logical unit.
    1. From the application's perspective, it looks like a single device. (Internal disk redundancy)
    2. It can also increase performance because several disks can serve reads at the same time -> RAID 1 (mirroring)
    3. RAID 0: no replication; like sharding, blocks are striped across different disks -> increases performance.
  5. Replication
    1. Servers, applications, and storage are all duplicated -> they sync writes to each other.
    2. Increases availability
  6. Cassandra: replication helps durability, and RAID can provide additional protection.
  7. Other ways to provide durability:
    1. Versioning: like a backup, but stores versions of specific objects rather than everything
    2. Safeguards against accidental deletion...
  8. It's not only about copying data; we also need to ensure the copies are not corrupted and stay healthy.
    1. Regular checks may be needed. Checksum: when storing data, calculate a checksum; when retrieving it, calculate the checksum again and compare. If they differ, create a new copy from a clean replica and remove the corrupted data. (-> HDFS)
  9. What's the durability goal?
  10. Availability -> system uptime -> can I access my data now?
  11. Durability -> storing data without loss -> will I be able to fetch my data in the future?
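
The checksum check described above (store a checksum on write, verify on read, repair from a clean copy) can be sketched as follows. This is a simplified in-memory illustration, not how HDFS implements it: the dictionary-based replica format, the function names, and the choice of SHA-256 are assumptions.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Digest stored next to the data when it is written."""
    return hashlib.sha256(data).hexdigest()


def store(data: bytes) -> dict:
    """Create one replica: the data plus the checksum computed at write time."""
    return {"data": data, "checksum": checksum(data)}


def is_healthy(replica: dict) -> bool:
    """Recompute the checksum and compare it with the stored one."""
    return checksum(replica["data"]) == replica["checksum"]


def read_with_repair(replicas: list) -> bytes:
    """Return data from a clean replica and overwrite corrupted ones."""
    healthy = next((r for r in replicas if is_healthy(r)), None)
    if healthy is None:
        raise IOError("all replicas are corrupted")
    for index, replica in enumerate(replicas):
        if not is_healthy(replica):
            # Replace the corrupted copy with a fresh copy of a clean one.
            replicas[index] = dict(healthy)
    return healthy["data"]
```

The repair only works while at least one replica is still clean, which is why such checks are run regularly rather than only when data is read.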

Consistency

  1. C in ACID: (relational) database transactions don't violate data consistency.
  2. E in BASE (NoSQL databases) / tunable consistency / CAP: E stands for eventual consistency.
  3. In the old days a single database could handle the traffic; reads and writes were simple and consistent.
  4. Nowadays, replication is needed for better scalability. (Also better availability and durability)
    1. The system always returns the single most recent value to the customer -> strong consistency
    2. The system may return outdated data for a certain period of time -> weak consistency
  5. Consistency model: a set of rules (consistency levels) that define the order of updates in the system and when these updates become visible
  6. Linearizability: the strongest consistency that can be implemented in practice: after an update completes, all clients that read the data get back the updated value. (Strict consistency is even stronger, but it exists only in theory.)
    1. Typical usage: banking, e-commerce, booking systems, distributed locks
    2. It's slow
    3. C in CAP stands for linearizability; we only need to choose between consistency and availability in case of a network partition.
  7. Eventual consistency: if no additional updates are made to an object, eventually all reads will return the latest written value of that object.
    1. The inconsistency window is typically small (sub-second)
    2. Can be much faster than linearizability (no need to finish syncing immediately)
    3. No need to sacrifice availability
    4. DNS is the most popular example
  8. Eventual consistency may cause confusion:
    1. You leave a comment and see it; after a refresh, the read goes to another replica that hasn't synced the comment yet, and the comment is gone.
    2. You post a reply to a previous comment to explain it. You refresh the page and don't find your reply -> your comment hasn't propagated to all replicas yet.
    3. After a while you find it, but the order of the comments is wrong! -> the database is partitioned.
      1. The first comment goes to the first shard and the second comment goes to the second shard;
      2. Each shard is replicated;
      3. The second comment is synced to its shard's replica before the first comment is synced to its replica;
      4. A read served by the replicas sees the second comment before the first, so the system shows them in the wrong order.
  9. Consistency models that address these issues
    1. The first issue (disappearing comment) can be solved by implementing monotonic reads.
    2. The second issue can be solved by implementing read-your-writes (read-after-write).
    3. The third issue (out-of-order comments) can be solved by consistent prefix reads.
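
Monotonic reads and read-your-writes can both be approximated by having each client remember the newest version it has written or seen, and refusing to read from replicas that are behind it. The sketch below is a toy model under stated assumptions: the `Replica` and `Session` classes, the single integer version, and the manual `apply` replication step are all hypothetical simplifications. It does not cover consistent prefix reads.

```python
class Replica:
    """A copy of the data with a version that grows with each applied write."""

    def __init__(self):
        self.version = 0
        self.comments = []

    def apply(self, version, comment):
        self.version = version
        self.comments.append(comment)


class Session:
    """Tracks the minimum version this client may be served."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.min_version = 0

    def write(self, leader, comment):
        leader.apply(leader.version + 1, comment)
        # Read-your-writes: never read state older than our own write.
        self.min_version = max(self.min_version, leader.version)

    def read(self):
        for replica in self.replicas:
            if replica.version >= self.min_version:
                # Monotonic reads: remember how far we have already seen.
                self.min_version = replica.version
                return list(replica.comments)
        raise RuntimeError("no replica is fresh enough yet")
```

A stale replica is simply skipped for this client, so a comment the client has already written or seen can never disappear on refresh.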

Maintainability, security, cost

Maintainability

  1. After product launch -> best practices
    1. Bug fixes
    2. Adding new features
    3. Improving performance
    4. Increasing test coverage
    5. Documentation....

Interview questions:

  1. Failure modes and mitigations
    1. If some components fail, what happens to the rest of the system?
    2. How does the system handle network partitions?
    3. How do we want the system to handle network partitions?
  2. Monitoring
    1. How to monitor health
    2. How do I know which part is broken
  3. Testing
    1. Test each individual component
    2. How to do E2E test
  4. Deployment
    1. How to do CD safely
    2. How to roll back quickly and safely

Security

  1. CIA:
    1. Confidentiality: data is protected from unauthorized users
    2. Integrity: data is not corrupted or lost, and only authorized users can modify it
    3. Availability: authorized users have access to resources when needed

Interview questions:

  1. Identity and permission management
    1. Who can access the system
    2. Who can access what in the system
    3. How to implement authentication and authorization in the system
  2. Infrastructure protection:
    1. DDoS
    2. SQL injection
    3. Firewall or API gateway
  3. Data protection
    1. Protect data at rest
    2. Protect data in transit

Cost

  1. Reduce total cost
    1. Engineering cost: design, implementation, testing, and deployment...
    2. Maintenance cost: automation of monitoring, testing, and deployment
    3. Resource cost:
      1. hardware : machines , load balancers , network devices
      2. software : cloud service
        1. Storage
        2. Data transfer
        3. Request count
  2. How system design affects cost
    1. Availability: redundant hardware -> hardware cost up
    2. Durability: replicas -> storage cost up
    3. Elasticity: hardware cost down
    4. Long polling, request batching: request count down
    5. Compression: bytes transferred down
    6. Hot and cold storage: storage cost down (cold -> not used frequently)
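
The effect of batching and compression on request count and bytes transferred can be sketched with the standard library. The `pack` and `unpack` names and the JSON message format are assumptions made for the example.

```python
import json
import zlib


def pack(messages: list) -> bytes:
    """Batch many messages into one payload and compress it."""
    return zlib.compress(json.dumps(messages).encode("utf-8"))


def unpack(blob: bytes) -> list:
    """Reverse of pack: decompress and decode the batch."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))
```

Sending one packed batch instead of many individual messages reduces the request count, and repetitive payloads such as similar JSON events usually compress well, which reduces the bytes transferred. The cost is extra CPU time and added latency while a batch fills up.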

Summary of System requirements

Process:

  1. Identify both functional and non-functional requirements
  2. Write them down on the board
  3. Use non-functional requirements when evaluating different options
  4. To identify functional requirements, start with the customers and work backwards (the interviewer should define some of them)
  5. Be ready to convert requirements into APIs

Don't over-engineer. Think out loud, keep generating and sharing ideas, enumerate concepts, and discuss trade-offs.
