>>15367027

I'm not doing that, but I'll tell you how I would do it. First, let's take care of the performance analysis.

According to your specifications, every day you'd have to make 10^6 writes and about 5x10^6 x 10 x (average records viewed per session) reads against the database. Assume the average session reads 1k records... which I sincerely doubt: that's 5x10^10 reads per day. Big problem.

The writes are easily doable, but most production databases only support 100-1000 concurrent connections (assume 1000), and an average db query takes about 20ms (going off experience). That caps a single database at 1000 / 0.02s = 50k queries per second, or about 4.32x10^9 per day. That's only about 8.64% of the 5x10^10 reads this app would require.

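To make the assumptions explicit, here's that back-of-envelope math as a quick script (every input is just a number assumed above):

# Back-of-envelope capacity check, using only the numbers assumed above.
import math

SECONDS_PER_DAY = 86_400

writes_per_day = 10**6
reads_per_day = 5 * 10**6 * 10 * 1_000      # 5x10^10, assuming 1k records/session

conns_per_db = 1_000                        # concurrent connections per database
query_seconds = 0.020                       # 20ms average query time

qps_per_db = conns_per_db / query_seconds            # 50,000 queries/sec
reads_per_db_per_day = qps_per_db * SECONDS_PER_DAY  # 4.32x10^9

coverage = reads_per_db_per_day / reads_per_day      # 0.0864 -> 8.64%
min_db_hosts = math.ceil(reads_per_day / reads_per_db_per_day)  # 12

print(f"{qps_per_db:,.0f} qps per db, {coverage:.2%} coverage, "
      f"{min_db_hosts} db hosts minimum at perfectly even load")

So the floor is about a dozen database hosts, before any headroom.
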
Therefore, we have to use a network of (ideally geographically distributed) database hosts, each supplying 1000 persistent connections. That's still a big problem, though: we also have to account for bursts of traffic, and the traffic any one app gets generally depends on the time of day. We'd need far more hosts than that steady-state minimum to make this anywhere near feasible.

Then we'd have another problem: how do you ensure writes to a geographically distributed set of databases are synchronized? Maybe I shouldn't harp on the feasibility aspect and just go on. Either way, 1 million new records per day should not be too difficult to synchronize across all databases, provided a small tolerance is allowed for the latency of reading new records. Writing new records is fine, since there are many ways to ensure you don't doubly write the same record to the db, such as using a hash or UUID derived from the data as the key.

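As a concrete example, here's a minimal sketch of that content-derived-key idea (the table, column, and namespace names are all made up):

# A sketch of idempotent writes: derive the record's primary key from its
# contents, so applying the same record twice is a no-op.
import json
import uuid

# Hypothetical namespace for this app's records.
RECORD_NS = uuid.uuid5(uuid.NAMESPACE_DNS, "records.example.com")

def record_key(record: dict) -> str:
    # Canonicalize first so the same data always yields the same UUID.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return str(uuid.uuid5(RECORD_NS, canonical))

def write_record(cur, record: dict) -> None:
    # cur is any DB-API cursor; ON CONFLICT ... DO NOTHING (upsert syntax
    # supported by Postgres and SQLite; placeholder style varies by driver)
    # makes replaying the same record across hosts harmless.
    cur.execute(
        "INSERT INTO records (id, data) VALUES (?, ?) ON CONFLICT (id) DO NOTHING",
        (record_key(record), json.dumps(record)),
    )
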
You'd need a LOT of connections to the server too, and pretty much no single machine can currently handle connections on the order of 10^5. Therefore, we need an API gateway in front of the server, and we need to distribute the server itself across many hosts. Finally, each server should maintain connections to a subset of the databases; it's not required for every server to connect to every database. Generally, a server can hold significantly more connections than a database, which is why there's a one-to-many relationship between them.

Under this scheme, customer requests are directed to the gateway, and the gateway routes each request to an appropriate server using an algorithm that considers geographic proximity (i.e., round-trip ping) and current server load.

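A minimal sketch of that routing decision, assuming the gateway already tracks per-region ping and per-server load (the names and the weighting are hypothetical):

# Pick the best server by combining round-trip ping and current load.
from dataclasses import dataclass

@dataclass
class Server:
    host: str
    rtt_ms: float   # round-trip ping from the client's region, in ms
    load: float     # current load as a fraction of capacity, 0.0-1.0

def pick_server(servers: list[Server], load_weight: float = 100.0) -> Server:
    # Lower score wins: ping in ms plus a penalty for busy servers. A real
    # gateway would tune the weight; 100 here is arbitrary.
    return min(servers, key=lambda s: s.rtt_ms + load_weight * s.load)

# A nearby but overloaded server loses to a slightly farther idle one:
candidates = [Server("us-east-1", 12.0, 0.95), Server("eu-west-1", 25.0, 0.10)]
print(pick_server(candidates).host)  # -> eu-west-1
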
That same request lands at a (hopefully) low-ping, low-load server, which queries its (nearly) synchronized database for the results. Voila.

The client side is relatively easy: either fire off a request whenever a pagination event is emitted, or simply use a virtual scroller that fetches records on demand.

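A sketch of that client-side fetching, with a made-up endpoint, cursor parameter, and response shape:

# One request per pagination event, or a lazy iterator a virtual scroller
# can pull from as the user scrolls.
import json
import urllib.request

API = "https://api.example.com/records"   # hypothetical endpoint

def fetch_page(cursor=None, limit=100):
    url = f"{API}?limit={limit}" + (f"&cursor={cursor}" if cursor else "")
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def iter_records():
    # Yields records lazily; a virtual scroller would drain this only as the
    # user actually scrolls, instead of loading everything up front.
    cursor = None
    while True:
        page = fetch_page(cursor)
        yield from page["records"]
        cursor = page.get("next_cursor")
        if not cursor:
            return
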
TL;DR - a Kafka (or some other message platform) queue of new records keeps the distributed databases synchronized, and a load-balancing gateway routes client requests to any server in a network of distributed servers.

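Since the TL;DR names Kafka: a rough sketch of the producer side of that queue, assuming the kafka-python client and a topic name of my own invention. Each regional database would run a consumer that applies messages with the idempotent insert sketched earlier.

# Publish each new record once; consumers at every database apply it.
import json
import uuid
from kafka import KafkaProducer   # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",   # placeholder broker address
    key_serializer=lambda k: k.encode(),
    value_serializer=lambda v: json.dumps(v).encode(),
)

def publish_record(record: dict) -> None:
    # Key by a content-derived UUID (same idea as record_key above) so
    # duplicates share a partition and consumers can dedupe in order.
    key = str(uuid.uuid5(uuid.NAMESPACE_DNS, json.dumps(record, sort_keys=True)))
    producer.send("new-records", key=key, value=record)
    producer.flush()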