- Don't use MongoDB
- =================
- I've kept quiet for a while for various political reasons, but I now
- feel a kind of social responsibility to deter people from banking
- their business on MongoDB.
- Our team ran serious load on MongoDB with a large (10s of millions
- of users, high profile company) userbase, expecting, from early good
- experiences, that the long-term scalability benefits touted by 10gen
- would pan out. We were wrong, and this rant serves to deter you
- from believing those benefits and making the same mistake
- we did. If one person avoids the trap, it will have been
- worth writing. Hopefully, many more do.
- Note that, in our experiences with 10gen, they were nearly always
- helpful and cordial, and often extremely so. But at the same
- time, that cannot be reason alone to suppress information about
- the failings of their product.
- Why this matters
- ----------------
- Databases must be right, or as-right-as-possible, b/c database
- mistakes are so much more severe than almost every other variation
- of mistake. Not only do they have the largest impact on uptime,
- performance, expense, and value (the inherent value of the data),
- but data has *inertia*. Migrating TBs of data on-the-fly is
- a massive undertaking compared to changing drcses or fixing the
- average logic error in your code. Recovering TBs of data while
- down, limited by what spindles can do for you, is a helpless
- feeling.
- Databases are also complex systems that are effectively black
- boxes to the end developer. By adopting a database system,
- you place absolute trust in their ability to do the right thing
- with your data to keep it consistent and available.
- Why is MongoDB popular?
- -----------------------
- To be fair, it must be acknowledged that MongoDB is popular,
- and that there are valid reasons for its popularity.
- * It is remarkably easy to get running
- * Schema-free models that map to JSON-like structures
- have great appeal to developers (they fit our brains),
- and a developer is almost always the individual who
- makes the platform decisions when a project is in
- its infancy
- * Maturity and robustness, track record, tested real-world
- use cases, etc, are typically more important to sysadmin
- types or operations specialists, who often inherit the
- platform long after the initial decisions are made
- * Its single-system, low concurrency read performance benchmarks
- are impressive, and for the inexperienced evaluator, this
- is often The Most Important Thing
- Now, if you're writing a toy site, or a prototype, something
- where developer productivity trumps all other considerations,
- it basically doesn't matter *what* you use. Use whatever
- gets the job done.
- But if you're intending to really run a large scale system
- on Mongo, one that a business might depend on, simply put:
- Don't.
- Why not?
- --------
- **1. MongoDB issues writes in unsafe ways *by default* in order to
- win benchmarks**
- If you don't issue getLastError(), MongoDB doesn't wait for any
- confirmation from the database that the command was processed.
- This introduces at least two classes of problems:
- * In a concurrent environment (connection pools, etc), you may
- have a subsequent read fail after a write has "finished";
- there is no barrier condition to know at what point the
- database will recognize a write commitment
- * An unknown number of save operations can be dropped on the floor
- due to queueing in various places, things outstanding in the TCP
- buffer, etc, if your connection drops or the db is KILL'd,
- segfaults, suffers a hardware crash, you name it
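
To make the second failure mode concrete, here is a minimal sketch in Python. It is a toy in-memory model, not MongoDB or any real driver: `ToyServer`, `fire_and_forget`, and `acknowledged_write` are hypothetical names invented for illustration. It only shows why a write that returns before the server confirms it can vanish when the server dies.

```python
from collections import deque


class ToyServer:
    """In-memory stand-in for a database server (not MongoDB itself)."""

    def __init__(self):
        self.pending = deque()  # writes received but not yet applied
        self.applied = []       # writes that are actually stored

    def receive(self, doc):
        self.pending.append(doc)

    def apply_one(self):
        """Apply the oldest pending write and return it."""
        doc = self.pending.popleft()
        self.applied.append(doc)
        return doc

    def crash(self):
        """Simulate a hard kill: everything still queued is lost."""
        lost = len(self.pending)
        self.pending.clear()
        return lost


def fire_and_forget(server, doc):
    # Returns as soon as the write is handed off; nothing is confirmed.
    server.receive(doc)


def acknowledged_write(server, doc):
    # Hands the write off, then waits until the server has applied it.
    server.receive(doc)
    while server.apply_one() is not doc:
        pass
    return True


server = ToyServer()
for i in range(5):
    fire_and_forget(server, {"_id": i})
unsafe_lost = server.crash()   # all five queued writes disappear

for i in range(5, 10):
    acknowledged_write(server, {"_id": i})
safe_lost = server.crash()     # nothing was left waiting in the queue

print(unsafe_lost, safe_lost, len(server.applied))
```

The client in the first loop believes it saved five records; the server stored none of them. That is the gap a getLastError() call is meant to close.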
- **2. MongoDB can lose data in many startling ways**
- Here is a list of ways we personally experienced records go missing:
- 1. They just disappeared sometimes. Cause unknown.
- 2. Recovery on corrupt database was not successful,
- pre transaction log.
- 3. Replication between master and slave had *gaps* in the oplogs,
- causing slaves to be missing records the master had. Yes,
- there is no checksum, and yes, the replication status showed the
- slaves as current
- 4. Replication just stops sometimes, without error. Monitor
- your replication status!
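
Since the reported replication status could not be trusted, one way to catch gaps like those in item 3 is to compare the actual records yourself. The following is a rough sketch, assuming you can pull the same set of documents from both nodes as plain dicts with an `_id` key; `doc_digest` and `find_divergence` are hypothetical helpers, not part of any MongoDB tooling.

```python
import hashlib
import json


def doc_digest(doc):
    """Stable SHA-256 of one document (keys sorted so order does not matter)."""
    payload = json.dumps(doc, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def find_divergence(master_docs, slave_docs):
    """Return (missing_ids, mismatched_ids) of the slave relative to the master."""
    master = {d["_id"]: doc_digest(d) for d in master_docs}
    slave = {d["_id"]: doc_digest(d) for d in slave_docs}
    missing = sorted(i for i in master if i not in slave)
    mismatched = sorted(i for i in master if i in slave and master[i] != slave[i])
    return missing, mismatched


# Made-up sample data: the slave lacks record 2 and holds a stale record 3.
master_docs = [{"_id": 1, "n": "a"}, {"_id": 2, "n": "b"}, {"_id": 3, "n": "c"}]
slave_docs = [{"_id": 1, "n": "a"}, {"_id": 3, "n": "stale"}]

missing, mismatched = find_divergence(master_docs, slave_docs)
print(missing, mismatched)
```

On a large dataset you would run this over sampled ranges rather than everything at once, but even a sampled check beats trusting a status flag.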
- **3. MongoDB requires a global write lock to issue any write**
- Under a write-heavy load, this will kill you. If you run a blog,
- you maybe don't care b/c your R:W ratio is so high.
- **4. MongoDB's sharding doesn't work that well under load**
- Adding a shard under heavy load is a nightmare.
- Mongo either moves chunks between shards so quickly it DOSes
- the production traffic, or refuses to move chunks altogether.
- This pretty much makes it a non-starter for high-traffic
- sites with heavy write volume.
- **5. mongos is unreliable**
- The mongod/config server/mongos architecture is actually pretty
- reasonable and clever. Unfortunately, mongos is complete
- garbage. Under load, it crashed anywhere from every few hours
- to every few days. Restart supervision didn't always help b/c
- sometimes it would throw some assertion that would bail out a
- critical thread, but the process would stay running. Double
- fail.
- It got so bad the only usable way we found to run mongos was
- to run haproxy in front of dozens of mongos instances, and
- to have a job that slowly rotated through them and killed them
- to keep fresh/live ones in the pool. No joke.
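
For the curious, the rotation job looked roughly like the sketch below. This is a reconstruction of the idea, not our actual script: `restart` and `is_healthy` are hypothetical hooks you would wire to your process supervisor and health probe, and the `min_healthy` guard is an added assumption so the job never shrinks the pool too far.

```python
import time


def rolling_restart(instances, restart, is_healthy, min_healthy, pause_s=0.0):
    """Restart instances one at a time, never dropping below min_healthy.

    `restart` and `is_healthy` are caller-supplied callables (hypothetical
    hooks). Returns the list of instances actually restarted.
    """
    restarted = []
    for name in instances:
        others_up = sum(
            1 for other in instances if other != name and is_healthy(other)
        )
        if others_up < min_healthy:
            continue  # too few live peers; skip rather than shrink the pool
        restart(name)
        restarted.append(name)
        time.sleep(pause_s)  # spread restarts out so the proxy can re-add the node
    return restarted


# Toy in-memory pool standing in for real mongos processes.
health = {"mongos-1": True, "mongos-2": False, "mongos-3": True}


def fake_restart(name):
    health[name] = True  # a fresh process is assumed to come back healthy


restarted = rolling_restart(list(health), fake_restart, health.get, min_healthy=2)
print(restarted)
```

In this toy run the first instance is skipped because only one peer is healthy at that moment; the wedged one is then replaced, and the pass continues from there.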
- **6. MongoDB actually once deleted the entire dataset**
- MongoDB, 1.6, in replica set configuration, would sometimes
- determine the wrong node (often an empty node) was the freshest
- copy of the data available. It would then DELETE ALL THE DATA
- ON THE REPLICA (which may have been the 700GB of good data)
- AND REPLICATE THE EMPTY SET. The database should never never
- never do this. Faced with a situation like that, the database
- should throw an error and make the admin disambiguate by
- wiping/resetting data, or forcing the correct configuration.
- NEVER DELETE ALL THE DATA. (This was a bad day.)
- They fixed this in 1.8, thank god.
- **7. Things were shipped that should have never been shipped**
- Things with known, embarrassing bugs that could cause data
- problems were in "stable" releases--and often we weren't told
- about these issues until after they bit us, and then only b/c
- we had a super duper crazy platinum support contract with 10gen.
- The response was to send us a hot patch that they were
- calling an RC internally, and then run that on our data.
- **8. Replication was lackluster on busy servers**
- Replication would often, again, either DOS the master, or
- replicate so slowly that it would take far too long and
- the oplog would be exhausted (even with a 50G oplog).
- We had a busy, large dataset that we simply could
- not replicate b/c of this dynamic. It was a harrowing month
- or two of finger crossing before we got it onto a different
- database system.
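
The oplog exhaustion dynamic is simple arithmetic: a new slave has to finish its initial copy before the master's oplog wraps around, or it can never catch up. Here is a back-of-the-envelope sketch. Only the 50 GB oplog figure comes from our experience above; the dataset size, write rate, and copy rate are illustrative assumptions, and both function names are hypothetical.

```python
def oplog_window_hours(oplog_bytes, write_bytes_per_s):
    """Hours of history the oplog holds at a steady write rate."""
    return oplog_bytes / write_bytes_per_s / 3600


def can_finish_sync(dataset_bytes, copy_bytes_per_s, oplog_bytes, write_bytes_per_s):
    """True if the initial copy completes before the oplog wraps around."""
    copy_hours = dataset_bytes / copy_bytes_per_s / 3600
    return copy_hours < oplog_window_hours(oplog_bytes, write_bytes_per_s)


GB = 10**9
MB = 10**6

# Assumed rates: 5 MB/s of oplog traffic, 20 MB/s sustained copy throughput.
window = oplog_window_hours(50 * GB, 5 * MB)
ok = can_finish_sync(700 * GB, 20 * MB, 50 * GB, 5 * MB)

print(round(window, 2), ok)
```

With those assumed numbers the oplog covers under three hours of writes while the copy needs more than nine, so the sync is doomed before it starts. Throttling the copy to spare the master only makes the gap wider.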
- **But, the real problem:**
- You might object, my information is out of date; they've
- fixed these problems or intend to fix them in the next version;
- problem X can be mitigated by optional practice Y.
- Unfortunately, it doesn't matter.
- The real problem is that so many of these problems existed
- in the first place.
- Database developers must be held to a higher standard than
- your average developer. Namely, your priority list should
- typically be something like:
- 1. Don't lose data, be very deterministic with data
- 2. Employ practices to stay available
- 3. Multi-node scalability
- 4. Minimize latency at the 99th and 95th percentiles
- 5. Raw req/s per resource
- 10gen's order seems to be, #5, then everything else in some
- order. #1 ain't in the top 3.
- These failings, and the implied priorities of the company,
- indicate a basic cultural problem, irrespective of whatever
- problems exist in any single release: a lack of the requisite
- discipline to design database systems businesses should bet on.
- Please take this warning seriously.