10k

System Design Case Study 5 - Design A URL shortener

1. Understand the problem and establish design scope

Ask questions and clarifications

  1. An example?

    You have a url https://www.systeminterview.com/q=chatsystem&c=loggedin&v=v3&l=long,and your service create a short url: https://tinyurl.com/y7keocwj, when you click the url, it redirect you to the original url.

  2. What is the traffic volume?-> 100 million per day

  3. How long is the shorten url? -> as short as possible

  4. What characters are allowed in the shorten url? -> a-z, A-z, 0-9

  5. Are we allowed to update and delete the short urls? -> no

Basic use case:

  1. URL shortening: given the long url, return the short one
  2. URL redirecting: click the short url, redirect to the original url
  3. Availability, scalability, fault tolerance

Estimation

Write per day: 100 million requests per day

Write per second: 10*8 / 24/3600~= 1160

Assume read/write is 10:1, we got read per sec is 11600

Assume this service will run for 10 year, the record number would be 100 million*365*10=365 billion records

Assume average Len of url us 100 byte, storage over 10 years = 365 billion*100Byte = 36.5 billon KB = 36.5 million MB = 36500 GB = 36.5 TB

2. High level design

API Endpoint

  1. Use restful API
    1. POST api/v1/shorten, parameter would be {url: <longURL>}, return short url
    2. GET api/v1/<shortURL>, redirect to corresponding long url.

URL redirecting

Figure 1 shows what happens when you enter a tinyurl onto the browser. Once the server receives a tinyurl request, it changes the short URL to the long URL with 301 redirect.

Figure 1

The detailed communication between clients and servers is shown in Figure 2.

Figure 2

301 redirect - the requests permanently moved to the long url

The browser caches the long url in the first response (301), and later will request the long url directly

302 redirect -the requests temporarily moved to the long url

Each time the browser send a request, it will go to the short url first and them get 301, and then request the long url.

Use 301 redirect makes sense for there would be too many unnecessary requests to the tiny url server, however, if analytics is important, 302 redirect is a better choice as it can track click rate and source of the click more easily.

URL shortening

Hash table is a good choice for storing such thing: {shortUrl, longUrl}. And for the function we could implement a hash function.

Figure 3

3. Deep dive

Data Model

We could store the data in a hash table in memory, but the resource is limited and it's hard to scale.

We could store the data in a relational database

Figure 4

Hash function

Hash value length

We are going to fun this service for 10 year, 365 billion records generated. We have a-z,A-Z, 0-9 62 characters available in total. To support 365 billion records, 65^n >= 36 billion. After calculation it's 7.

Hash function and collision
  1. Some well known hash functions: CRC32, MD5, or SHA-1.

    Hash function Hash value (Hexadecimal)
    CRC32 5cb54054
    MD5 5a62509a84df9ee03fe1230b9df8b84e
    SHA-1 0eeae7916c06853901d9ccbefbfcaf4de57ed85b
  2. No of them are suitable. CRC has 8 digits which is greater than 7. Remember, we need to keep the short url as short as possible.

    1. If we use CRC32 8 digits and remove the last digits, that would be 7 but we got more chances of collision.

    2. We could append some predefined strings to the url and rehash until we got a non-repeated key.

      Figure 5

    3. It's expensive to check if a key exists in database, so we can use bloom filter which is a effective technique to check is an element is a member of a set,

Base62 conversion
  1. Base conversion is also commonly used for shorten url.

  2. For unitizing this base conversion technique, we need to introduce unique ID for each records.

  3. This is how it works:

    convert 1115710 to base 62 representation (1115710 represents 11157 in a base 10 system).

    • From its name, base 62 is a way of using 62 characters for encoding. The mappings are: 0-0, ..., 9-9, 10-a, 11-b, ..., 35-z, 36-A, ..., 61-Z, where ‘a’ stands for 10, ‘Z’ stands for 61, etc.
    • 1115710 = 2 x 622 + 55 x 621 + 59 x 620 = [2, 55, 59] -> [2, T, X] in base 62 representation. Figure 6 shows the conversation process.

    Figure 6

    • Thus, the short URL is https://tinyurl.com/2TX
Comparison
Hash + collision resolution Base 62 conversion
Fixed length Not fixed, goes up with ID
No need of UUID Depends on unique ID(I think it's because of the conversion is number things and original long url contains characters)
Collision exists and needs to be handled No collision because of unique ID
No order Can be guessed next feasible url if ID increate be 1 for a new entry-> security

URL shorten deep dive

Figure 7

URL redirecting deep dive

Figure 8

(My original design)

无标题绘图

4. Wrap up

If there are extra time:

  • Rate limiter for the service
  • Server Scaling
  • DB scaling
  • Analitics
  • Availability, consistency and reliability.
Thoughts? Leave a comment