System design interview

  CUSTOMER PERSPECTIVE
    Problem
      Definition.
      Who is the customer?
      Pain points.
      Use cases
      Scenarios that will not be covered
    Functional requirements
      Entities and verbs.
      High-level contract (API)
      Make several iterations if possible
    Non-functional requirements
      Performance
        P99 latency for read/write queries?
        Write-to-read data delay?
      Scalability
        Usage patterns, e.g. reads vs writes.
        How many users?
        How many read queries per second?
        How much data is queried per request?
        How many video views are processed per second?
        Can there be spikes in traffic?
      Cost
        Minimize cost of {development, time-to-market, maintenance}
      Availability vs Consistency
      Durability
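The P99 latency question above is easier to discuss with a concrete definition in hand. The following is a minimal sketch using the nearest-rank method; the function name percentile and the sample latencies are made up for illustration, and real monitoring systems often use interpolated or approximate percentiles instead.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds.
latencies_ms = [12, 15, 11, 14, 250, 13, 16, 12, 18, 900]

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
mean = sum(latencies_ms) / len(latencies_ms)

print(f"p50={p50} ms, p99={p99} ms, mean={mean} ms")
```

With these samples the median is 14 ms while the P99 is 900 ms, which is why tail percentiles, not averages, are usually the number to agree on with the interviewer.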
  ESTIMATIONS [5 min]
    Throughput (QPS for read and write queries)
    Latency expected from the system (for read and write queries)
    Read/Write ratio
    Traffic estimates
      Write (QPS, Volume of data)
      Read  (QPS, Volume of data)
    Storage estimates
    Memory estimates
      If we are using a cache, what kind of data do we want to store in it?
      How much RAM and how many machines do we need to achieve this?
      Amount of data to store on disk/SSD
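The estimates listed above are plain arithmetic, and writing them down once makes the 5-minute budget realistic. Below is a back-of-the-envelope sketch; every input (user count, requests per user, record size, retention, cache fraction, RAM per machine) is a made-up assumption to be replaced with numbers agreed during the interview.

```python
SECONDS_PER_DAY = 86_400

# Assumed inputs (hypothetical, for illustration only).
daily_active_users = 10_000_000
writes_per_user_per_day = 2
reads_per_user_per_day = 20
bytes_per_record = 1_000
retention_years = 5
cache_hot_percent = 20               # cache roughly the hottest 20% of daily reads
ram_per_cache_machine = 16_000_000_000

# Traffic estimates.
writes_per_day = daily_active_users * writes_per_user_per_day
reads_per_day = daily_active_users * reads_per_user_per_day
write_qps = writes_per_day / SECONDS_PER_DAY
read_qps = reads_per_day / SECONDS_PER_DAY
read_write_ratio = reads_per_day / writes_per_day

# Storage estimates.
write_bytes_per_day = writes_per_day * bytes_per_record
storage_bytes = write_bytes_per_day * 365 * retention_years

# Memory estimates.
cache_bytes = reads_per_day * bytes_per_record * cache_hot_percent // 100
cache_machines = -(-cache_bytes // ram_per_cache_machine)  # ceiling division

print(f"write QPS ~ {write_qps:.0f}, read QPS ~ {read_qps:.0f}")
print(f"read/write ratio = {read_write_ratio:.0f}:1")
print(f"storage over {retention_years} years ~ {storage_bytes / 1e12:.1f} TB")
print(f"cache ~ {cache_bytes / 1e9:.0f} GB on {cache_machines} machines")
```

These are average rates; if traffic spikes are expected, multiply the QPS figures by an agreed peak factor before sizing anything.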
  HIGH LEVEL DESIGN [5-10 min]
    APIs for Read/Write scenarios for crucial components
    Database schema
    Basic algorithm
    High level design for Read/Write scenario
  DEEP DIVE [15-20 min]
    Scaling the algorithm
    Scaling individual components
      Availability, Consistency and Scale story for each component
      Consistency and availability patterns
    Think about the following components, how they would fit in, and how they would help
      DNS
      CDN [Push vs Pull]
      Load Balancers [Active-Passive, Active-Active, Layer 4, Layer 7]
      Reverse Proxy
      Application layer scaling [Microservices, Service Discovery]
      DB [RDBMS, NoSQL]
        RDBMS
          Master-slave, Master-master, Federation, Sharding, Denormalization, SQL Tuning, Indexing
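Of the RDBMS techniques listed above, sharding is the one most often asked to be made concrete. The following is a minimal sketch of hash-based shard routing; the function name shard_for and the key format are hypothetical, and this is only one of several possible schemes (range-based and directory-based sharding are others).

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index with a hash that is stable across processes.

    Python's built-in hash() is randomized per process for strings,
    so a cryptographic digest is used here purely for stability.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Hypothetical user keys routed across 4 shards.
for user_key in ("user:1", "user:2", "user:3"):
    print(user_key, "-> shard", shard_for(user_key, 4))
```

A caveat worth raising in the interview: with plain modulo routing, changing the number of shards remaps most keys, which is the problem consistent hashing is meant to reduce.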
        NoSQL (in general - Denormalized data + no-joins)
          Key-Value, Wide-Column, Graph, Document
          Fast-lookups:
            RAM [Bounded size] => Redis, Memcached
            Availability [Unbounded size] => Cassandra, Riak, Voldemort
            Consistency  [Unbounded size] => HBase, MongoDB, Couchbase, DynamoDB
        Caches
          Client caching, CDN caching, Webserver caching, Database caching, Application caching, Cache @Query level, Cache @Object level
          Eviction policies:
            LRU, LFU, FIFO
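As an illustration of the eviction policies above, here is a minimal LRU sketch built on the standard library's OrderedDict. The class name LRUCache and its method names are illustrative choices, not a reference to any particular cache product.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used


lru = LRUCache(capacity=2)
lru.put("a", 1)
lru.put("b", 2)
lru.get("a")      # touch "a" so that "b" becomes the eviction candidate
lru.put("c", 3)   # capacity exceeded: "b" is evicted
print(lru.get("a"), lru.get("b"), lru.get("c"))
```

FIFO would differ only in never reordering on access, while LFU needs per-key hit counters and is noticeably more involved.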
          Caching patterns:
            Cache aside
            Write through
            Write behind
            Refresh ahead
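Two of the caching patterns above, cache aside and write through, can be sketched in a few lines. In this sketch plain dictionaries stand in for the cache and the database, and the function names are hypothetical; a real system would also need TTLs, invalidation and error handling.

```python
# Stand-ins for a real database and cache (assumptions for illustration).
database = {"user:1": "Ada"}
cache = {}


def read_cache_aside(key):
    """Cache aside: the application checks the cache first and,
    on a miss, loads from the database and populates the cache itself."""
    if key in cache:
        return cache[key]
    value = database.get(key)
    if value is not None:
        cache[key] = value
    return value


def write_through(key, value):
    """Write through: the write completes only after both the backing
    store and the cache have been updated synchronously."""
    database[key] = value
    cache[key] = value


first_read = read_cache_aside("user:1")    # miss: loaded from the database
second_read = read_cache_aside("user:1")   # hit: served from the cache
write_through("user:2", "Grace")
print(first_read, second_read, cache)
```

Write behind would instead acknowledge the write after updating the cache and persist to the database asynchronously, trading durability for write latency.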
        Asynchronism
          Message queues
          Task queues
          Back pressure - resistance or force opposing the desired flow of data through software ("pipes") - buffering vs. dropping
        Communication
          TCP
          UDP
          REST
          RPC
          Binary protocols - Apache Avro (a later alternative to Protocol Buffers and Thrift)
        Security
          Encryption: during transfer/at rest
          Government compliance (EU/China/US)
          Authentication/authorization
          Firewalls
          Payment data storage/handling/compliance
          High level threat modeling (obvious ones)
        Telemetry/monitoring/logs aggregation/Dashboards
          Host level metrics: CPU, Memory, Threads, Disk I/O, Garbage Collection runs
          Fleet - avg. time to first byte, surge queue on LB, VIP spillover, database pressure, cache tier
          Alarms/setting up thresholds/canaries
          Actions feed/Key business metrics: daily active users, retention, revenue, etc. - Business Intelligence
        Back pressure handling options (see Asynchronism above)
          Control the producer (the consumer decides whether to slow down or speed up)
          Buffer (temporarily accumulate incoming data spikes)
          Drop (sample a percentage of the incoming data)
          Ignore the back pressure: technically a fourth option, and not a bad one if the back pressure isn’t causing critical issues. Introducing more complexity comes at a cost too.
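The buffer and drop options for back pressure can be sketched with a bounded queue from the standard library. The class name BoundedBuffer and its offer/poll methods are hypothetical names chosen for this sketch.

```python
import queue


class BoundedBuffer:
    """Buffers up to `capacity` items and drops (while counting) the rest."""

    def __init__(self, capacity: int):
        self._queue = queue.Queue(maxsize=capacity)
        self.dropped = 0

    def offer(self, item) -> bool:
        """Try to enqueue without blocking; return False if the item was dropped."""
        try:
            self._queue.put_nowait(item)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def poll(self):
        """Return the next buffered item, or None if the buffer is empty."""
        try:
            return self._queue.get_nowait()
        except queue.Empty:
            return None


buffer = BoundedBuffer(capacity=3)
accepted = [buffer.offer(i) for i in range(5)]  # a spike of 5 items, room for 3
print(accepted, "dropped:", buffer.dropped)
print("consumer sees:", buffer.poll(), buffer.poll(), buffer.poll(), buffer.poll())
```

Replacing put_nowait with a blocking put would turn the same queue into the "control the producer" option, since the producer then waits until the consumer frees a slot.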
        Costs/optimizations. When using cloud services, it’s important to keep a lid on your costs.
          Autoscaling/adding "elasticity" (discuss traffic patterns: regional/seasonal), failovers
          SSD vs. HDD
          Commodity hardware vs. specialized ("optimized" for Memory, Disk I/O, CPU, GPU)
          Open-source vs Paid vs Built in-house
        Experimentation capability: large-scale products need it sooner or later
        Testing capability/testing tools and hooks/Gremlins/Hogs/Gameday outage exercises - introducing chaos into the system
        Deployments/rollbacks/canaries/soak-times/etc.
        Pluggable instrumentation
  JUSTIFY [5 min]
    Throughput of each layer
    Latency caused between each layer
    Overall latency justification
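The justification step above boils down to two checks: latencies add up along the request path, and throughput is capped by the slowest layer. The sketch below assumes a strictly serial path with a cache miss on every request; the layer names and all numbers are invented for illustration.

```python
# (layer name, added latency in ms, sustainable requests per second)
# All figures are hypothetical placeholders.
layers = [
    ("load balancer", 2, 50_000),
    ("application", 20, 8_000),
    ("cache", 1, 100_000),
    ("database", 15, 5_000),
]

# Serial path: every layer's latency contributes to the total.
total_latency_ms = sum(latency for _, latency, _ in layers)

# The layer with the lowest throughput limits the whole path.
bottleneck_name, _, bottleneck_qps = min(layers, key=lambda layer: layer[2])

target_latency_ms = 50   # assumed requirement
target_qps = 4_000       # assumed requirement

print(f"total latency: {total_latency_ms} ms (target {target_latency_ms} ms)")
print(f"bottleneck: {bottleneck_name} at {bottleneck_qps} QPS (target {target_qps} QPS)")
print("meets targets:", total_latency_ms <= target_latency_ms and bottleneck_qps >= target_qps)
```

If a target is missed, the bottleneck layer is the one to revisit in the deep dive, for example by adding replicas, caching more aggressively, or sharding.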