Advertisement
Guest User

Untitled

a guest
Sep 20th, 2018
111
0
Never
Not a member of Pastebin yet? Sign Up, it unlocks many cool features!
text 11.29 KB | None | 0 0
  1. Skip to content
  2. Features
  3. Business
  4. Explore
  5. Marketplace
  6. Pricing
  7.  
  8. Search
  9.  
  10. Sign in or Sign up
  11. 2,973 48,801 6,527 donnemartin/system-design-primer
  12. Code Issues 37 Pull requests 17 Projects 0 Insights
  13. Join GitHub today
  14. GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.
  15.  
  16. system-design-primer/solutions/system_design/pastebin/README.md
  17. 4c37b06 13 days ago
  18. @fabriziocucci fabriziocucci Fix typo in Design Pastebin.com exercise (#210)
  19. @donnemartin @Tarrasch @fabriziocucci @balki-server @baldassarreFe
  20.  
  21. 333 lines (240 sloc) 15.3 KB
  22. Design Pastebin.com (or Bit.ly)
  23. Note: This document links directly to relevant areas found in the system design topics to avoid duplication. Refer to the linked content for general talking points, tradeoffs, and alternatives.
  24.  
  25. Design Bit.ly - is a similar question, except pastebin requires storing the paste contents instead of the original unshortened url.
  26.  
  27. Step 1: Outline use cases and constraints
  28. Gather requirements and scope the problem. Ask questions to clarify use cases and constraints. Discuss assumptions.
  29.  
  30. Without an interviewer to address clarifying questions, we'll define some use cases and constraints.
  31.  
  32. Use cases
  33. We'll scope the problem to handle only the following use cases
  34. User enters a block of text and gets a randomly generated link
  35. Expiration
  36. Default setting does not expire
  37. Can optionally set a timed expiration
  38. User enters a paste's url and views the contents
  39. User is anonymous
  40. Service tracks analytics of pages
  41. Monthly visit stats
  42. Service deletes expired pastes
  43. Service has high availability
  44. Out of scope
  45. User registers for an account
  46. User verifies email
  47. User logs into a registered account
  48. User edits the document
  49. User can set visibility
  50. User can set the shortlink
  51. Constraints and assumptions
  52. State assumptions
  53. Traffic is not evenly distributed
  54. Following a short link should be fast
  55. Pastes are text only
  56. Page view analytics do not need to be realtime
  57. 10 million users
  58. 10 million paste writes per month
  59. 100 million paste reads per month
  60. 10:1 read to write ratio
  61. Calculate usage
  62. Clarify with your interviewer if you should run back-of-the-envelope usage calculations.
  63.  
  64. Size per paste
  65. 1 KB content per paste
  66. shortlink - 7 bytes
  67. expiration_length_in_minutes - 4 bytes
  68. created_at - 5 bytes
  69. paste_path - 255 bytes
  70. total = ~1.27 KB
  71. 12.7 GB of new paste content per month
  72. 1.27 KB per paste * 10 million pastes per month
  73. ~450 GB of new paste content in 3 years
  74. 360 million shortlinks in 3 years
  75. Assume most are new pastes instead of updates to existing ones
  76. 4 paste writes per second on average
  77. 40 read requests per second on average
  78. Handy conversion guide:
  79.  
  80. 2.5 million seconds per month
  81. 1 request per second = 2.5 million requests per month
  82. 40 requests per second = 100 million requests per month
  83. 400 requests per second = 1 billion requests per month
  84. Step 2: Create a high level design
  85. Outline a high level design with all important components.
  86.  
  87. Imgur
  88.  
  89. Step 3: Design core components
  90. Dive into details for each core component.
  91.  
  92. Use case: User enters a block of text and gets a randomly generated link
  93. We could use a relational database as a large hash table, mapping the generated url to a file server and path containing the paste file.
  94.  
  95. Instead of managing a file server, we could use a managed Object Store such as Amazon S3 or a NoSQL document store.
  96.  
  97. An alternative to a relational database acting as a large hash table, we could use a NoSQL key-value store. We should discuss the tradeoffs between choosing SQL or NoSQL. The following discussion uses the relational database approach.
  98.  
  99. The Client sends a create paste request to the Web Server, running as a reverse proxy
  100. The Web Server forwards the request to the Write API server
  101. The Write API server does the following:
  102. Generates a unique url
  103. Checks if the url is unique by looking at the SQL Database for a duplicate
  104. If the url is not unique, it generates another url
  105. If we supported a custom url, we could use the user-supplied (also check for a duplicate)
  106. Saves to the SQL Database pastes table
  107. Saves the paste data to the Object Store
  108. Returns the url
  109. Clarify with your interviewer how much code you are expected to write.
  110.  
  111. The pastes table could have the following structure:
  112.  
  113. shortlink char(7) NOT NULL
  114. expiration_length_in_minutes int NOT NULL
  115. created_at datetime NOT NULL
  116. paste_path varchar(255) NOT NULL
  117. PRIMARY KEY(shortlink)
  118. We'll create an index on shortlink and created_at to speed up lookups (log-time instead of scanning the entire table) and to keep the data in memory. Reading 1 MB sequentially from memory takes about 250 microseconds, while reading from SSD takes 4x and from disk takes 80x longer.1
  119.  
  120. To generate the unique url, we could:
  121.  
  122. Take the MD5 hash of the user's ip_address + timestamp
  123. MD5 is a widely used hashing function that produces a 128-bit hash value
  124. MD5 is uniformly distributed
  125. Alternatively, we could also take the MD5 hash of randomly-generated data
  126. Base 62 encode the MD5 hash
  127. Base 62 encodes to [a-zA-Z0-9] which works well for urls, eliminating the need for escaping special characters
  128. There is only one hash result for the original input and Base 62 is deterministic (no randomness involved)
  129. Base 64 is another popular encoding but provides issues for urls because of the additional + and / characters
  130. The following Base 62 pseudocode runs in O(k) time where k is the number of digits = 7:
  131. def base_encode(num, base=62):
  132. digits = []
  133. while num > 0
  134. remainder = modulo(num, base)
  135. digits.push(remainder)
  136. num = divide(num, base)
  137. digits = digits.reverse
  138. Take the first 7 characters of the output, which results in 62^7 possible values and should be sufficient to handle our constraint of 360 million shortlinks in 3 years:
  139. url = base_encode(md5(ip_address+timestamp))[:URL_LENGTH]
  140. We'll use a public REST API:
  141.  
  142. $ curl -X POST --data '{ "expiration_length_in_minutes": "60", \
  143. "paste_contents": "Hello World!" }' https://pastebin.com/api/v1/paste
  144. Response:
  145.  
  146. {
  147. "shortlink": "foobar"
  148. }
  149. For internal communications, we could use Remote Procedure Calls.
  150.  
  151. Use case: User enters a paste's url and views the contents
  152. The Client sends a get paste request to the Web Server
  153. The Web Server forwards the request to the Read API server
  154. The Read API server does the following:
  155. Checks the SQL Database for the generated url
  156. If the url is in the SQL Database, fetch the paste contents from the Object Store
  157. Else, return an error message for the user
  158. REST API:
  159.  
  160. $ curl https://pastebin.com/api/v1/paste?shortlink=foobar
  161. Response:
  162.  
  163. {
  164. "paste_contents": "Hello World"
  165. "created_at": "YYYY-MM-DD HH:MM:SS"
  166. "expiration_length_in_minutes": "60"
  167. }
  168. Use case: Service tracks analytics of pages
  169. Since realtime analytics are not a requirement, we could simply MapReduce the Web Server logs to generate hit counts.
  170.  
  171. Clarify with your interviewer how much code you are expected to write.
  172.  
  173. class HitCounts(MRJob):
  174.  
  175. def extract_url(self, line):
  176. """Extract the generated url from the log line."""
  177. ...
  178.  
  179. def extract_year_month(self, line):
  180. """Return the year and month portions of the timestamp."""
  181. ...
  182.  
  183. def mapper(self, _, line):
  184. """Parse each log line, extract and transform relevant lines.
  185.  
  186. Emit key value pairs of the form:
  187.  
  188. (2016-01, url0), 1
  189. (2016-01, url0), 1
  190. (2016-01, url1), 1
  191. """
  192. url = self.extract_url(line)
  193. period = self.extract_year_month(line)
  194. yield (period, url), 1
  195.  
  196. def reducer(self, key, values):
  197. """Sum values for each key.
  198.  
  199. (2016-01, url0), 2
  200. (2016-01, url1), 1
  201. """
  202. yield key, sum(values)
  203. Use case: Service deletes expired pastes
  204. To delete expired pastes, we could just scan the SQL Database for all entries whose expiration timestamp are older than the current timestamp. All expired entries would then be deleted (or marked as expired) from the table.
  205.  
  206. Step 4: Scale the design
  207. Identify and address bottlenecks, given the constraints.
  208.  
  209. Imgur
  210.  
  211. Important: Do not simply jump right into the final design from the initial design!
  212.  
  213. State you would do this iteratively: 1) Benchmark/Load Test, 2) Profile for bottlenecks 3) address bottlenecks while evaluating alternatives and trade-offs, and 4) repeat. See Design a system that scales to millions of users on AWS as a sample on how to iteratively scale the initial design.
  214.  
  215. It's important to discuss what bottlenecks you might encounter with the initial design and how you might address each of them. For example, what issues are addressed by adding a Load Balancer with multiple Web Servers? CDN? Master-Slave Replicas? What are the alternatives and Trade-Offs for each?
  216.  
  217. We'll introduce some components to complete the design and to address scalability issues. Internal load balancers are not shown to reduce clutter.
  218.  
  219. To avoid repeating discussions, refer to the following system design topics for main talking points, tradeoffs, and alternatives:
  220.  
  221. DNS
  222. CDN
  223. Load balancer
  224. Horizontal scaling
  225. Web server (reverse proxy)
  226. API server (application layer)
  227. Cache
  228. Relational database management system (RDBMS)
  229. SQL write master-slave failover
  230. Master-slave replication
  231. Consistency patterns
  232. Availability patterns
  233. The Analytics Database could use a data warehousing solution such as Amazon Redshift or Google BigQuery.
  234.  
  235. An Object Store such as Amazon S3 can comfortably handle the constraint of 12.7 GB of new content per month.
  236.  
  237. To address the 40 average read requests per second (higher at peak), traffic for popular content should be handled by the Memory Cache instead of the database. The Memory Cache is also useful for handling the unevenly distributed traffic and traffic spikes. The SQL Read Replicas should be able to handle the cache misses, as long as the replicas are not bogged down with replicating writes.
  238.  
  239. 4 average paste writes per second (with higher at peak) should be do-able for a single SQL Write Master-Slave. Otherwise, we'll need to employ additional SQL scaling patterns:
  240.  
  241. Federation
  242. Sharding
  243. Denormalization
  244. SQL Tuning
  245. We should also consider moving some data to a NoSQL Database.
  246.  
  247. Additional talking points
  248. Additional topics to dive into, depending on the problem scope and time remaining.
  249.  
  250. NoSQL
  251. Key-value store
  252. Document store
  253. Wide column store
  254. Graph database
  255. SQL vs NoSQL
  256. Caching
  257. Where to cache
  258. Client caching
  259. CDN caching
  260. Web server caching
  261. Database caching
  262. Application caching
  263. What to cache
  264. Caching at the database query level
  265. Caching at the object level
  266. When to update the cache
  267. Cache-aside
  268. Write-through
  269. Write-behind (write-back)
  270. Refresh ahead
  271. Asynchronism and microservices
  272. Message queues
  273. Task queues
  274. Back pressure
  275. Microservices
  276. Communications
  277. Discuss tradeoffs:
  278. External communication with clients - HTTP APIs following REST
  279. Internal communications - RPC
  280. Service discovery
  281. Security
  282. Refer to the security section.
  283.  
  284. Latency numbers
  285. See Latency numbers every programmer should know.
  286.  
  287. Ongoing
  288. Continue benchmarking and monitoring your system to address bottlenecks as they come up
  289. Scaling is an iterative process
  290. © 2018 GitHub, Inc.
  291. Terms
  292. Privacy
  293. Security
  294. Status
  295. Help
  296. Contact GitHub
  297. Pricing
  298. API
  299. Training
  300. Blog
  301. About
  302. Press h to open a hovercard with more details.
Advertisement
Add Comment
Please, Sign In to add comment
Advertisement