Don't Use MongoDB

a guest
Nov 5th, 2011
Don't use MongoDB
=================

I've kept quiet for a while for various political reasons, but I now
feel a kind of social responsibility to deter people from banking
their business on MongoDB.

Our team ran serious load on MongoDB with a large userbase (tens of
millions of users, high-profile company), expecting, from early good
experiences, that the long-term scalability benefits touted by 10gen
would pan out. We were wrong, and this rant serves to deter you
from believing in those benefits and making the same mistake
we did. If one person avoids the trap, it will have been
worth writing. Hopefully, many more do.

Note that, in our experiences with 10gen, they were nearly always
helpful and cordial, and often extremely so. But at the same
time, that alone cannot be reason to suppress information about
the failings of their product.

Why this matters
----------------

Databases must be right, or as-right-as-possible, b/c database
mistakes are so much more severe than almost any other kind
of mistake. Not only do they have the largest impact on uptime,
performance, expense, and value (the inherent value of the data),
but data has *inertia*. Migrating TBs of data on-the-fly is
a massive undertaking compared to changing drcses or fixing the
average logic error in your code. Recovering TBs of data while
down, limited by what spindles can do for you, is a helpless
feeling.

Databases are also complex systems that are effectively black
boxes to the end developer. By adopting a database system,
you place absolute trust in its ability to do the right thing
with your data to keep it consistent and available.

Why is MongoDB popular?
-----------------------

To be fair, it must be acknowledged that MongoDB is popular,
and that there are valid reasons for its popularity.

* It is remarkably easy to get running
* Schema-free models that map to JSON-like structures
  have great appeal to developers (they fit our brains),
  and a developer is almost always the individual who
  makes the platform decisions when a project is in
  its infancy
* Maturity and robustness, track record, tested real-world
  use cases, etc., are typically more important to sysadmin
  types or operations specialists, who often inherit the
  platform long after the initial decisions are made
* Its single-system, low-concurrency read performance benchmarks
  are impressive, and for the inexperienced evaluator, this
  is often The Most Important Thing

Now, if you're writing a toy site, or a prototype, something
where developer productivity trumps all other considerations,
it basically doesn't matter *what* you use. Use whatever
gets the job done.

But if you're intending to really run a large scale system
on Mongo, one that a business might depend on, simply put:

Don't.

Why not?
--------

**1. MongoDB issues writes in unsafe ways *by default* in order to
win benchmarks**

If you don't issue getLastError(), MongoDB doesn't wait for any
confirmation from the database that the command was processed.
This introduces at least two classes of problems:

* In a concurrent environment (connection pools, etc.), you may
  have a subsequent read fail after a write has "finished";
  there is no barrier condition to know at what point the
  database will recognize a write commitment
* An unknown number of save operations can be dropped on the floor
  due to queueing in various places, things outstanding in the TCP
  buffer, etc., when your connection drops, or if the db is KILL'd,
  segfaults, suffers a hardware crash, you name it
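
Both failure classes above can be shown with a toy model. This is only a sketch: `ToyServer` and `ToyConnection` are hypothetical stand-ins written for this illustration, not MongoDB's or any driver's real code. The client-side buffer plays the role of data still queued or sitting in the TCP buffer.

```python
class ToyServer:
    """Hypothetical stand-in for a database server (not MongoDB's real behaviour)."""

    def __init__(self):
        self.stored = []

    def apply(self, doc):
        self.stored.append(doc)
        return True  # the acknowledgement a careful client waits for


class ToyConnection:
    """Buffers writes client-side, like data still sitting in a send buffer."""

    def __init__(self, server):
        self.server = server
        self.buffer = []
        self.alive = True

    def write_unacknowledged(self, doc):
        # Fire-and-forget: returns as soon as the write is queued locally.
        self.buffer.append(doc)

    def flush(self):
        # Deliver everything queued so far, in order.
        while self.alive and self.buffer:
            self.server.apply(self.buffer.pop(0))

    def write_acknowledged(self, doc):
        # Deliver earlier writes, then wait for the server's reply to this one.
        if not self.alive:
            raise ConnectionError("connection lost before acknowledgement")
        self.flush()
        return self.server.apply(doc)

    def drop(self):
        # The connection dies: anything still buffered is silently lost.
        self.alive = False
        lost = len(self.buffer)
        self.buffer.clear()
        return lost


def demo():
    server = ToyServer()
    conn = ToyConnection(server)
    for i in range(3):
        conn.write_unacknowledged({"n": i})  # the caller believes these "finished"
    visible = len(server.stored)  # what a reader on another connection would see
    lost = conn.drop()
    return visible, lost, len(server.stored)
```

In this model `demo()` reports that nothing was visible to other readers after the writes "finished", and that all three writes vanished when the connection dropped, with no error raised anywhere. The acknowledged path at least fails loudly.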

**2. MongoDB can lose data in many startling ways**

Here is a list of ways we personally experienced records go missing:

1. They just disappeared sometimes. Cause unknown.
2. Recovery on a corrupt database was not successful
   (this was pre transaction log).
3. Replication between master and slave had *gaps* in the oplogs,
   causing slaves to be missing records the master had. Yes,
   there is no checksum, and yes, the replication status reported
   the slaves as current
4. Replication just stops sometimes, without error. Monitor
   your replication status!
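
Since the replication status could not be trusted, one independent check is to compare per-record digests between master and slave yourself. The sketch below assumes you can dump both sides as lists of dicts keyed by `_id`; the function names are hypothetical, and how you fetch the records is left out.

```python
import hashlib
import json


def record_digest(record):
    # Serialize with sorted keys so equal records always hash the same way.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def collection_digests(records, key="_id"):
    # Map each record's key to the digest of its full contents.
    return {record[key]: record_digest(record) for record in records}


def find_replication_gaps(master_records, slave_records, key="_id"):
    """Return (missing, mismatched) keys on the slave relative to the master."""
    master = collection_digests(master_records, key)
    slave = collection_digests(slave_records, key)
    missing = sorted(k for k in master if k not in slave)
    mismatched = sorted(k for k in master if k in slave and master[k] != slave[k])
    return missing, mismatched


master = [{"_id": 1, "v": "a"}, {"_id": 2, "v": "b"}, {"_id": 3, "v": "c"}]
slave = [{"_id": 1, "v": "a"}, {"_id": 3, "v": "x"}]
missing, mismatched = find_replication_gaps(master, slave)
```

Here record 2 is missing on the slave and record 3 differs. On a live system the two dumps would have to be taken at a consistent point, which this sketch does not attempt.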

**3. MongoDB requires a global write lock to issue any write**

Under a write-heavy load, this will kill you. If you run a blog,
you may not care b/c your R:W ratio is so high.
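
A back-of-the-envelope model shows why the R:W ratio matters so much here. It assumes every write holds the one global lock for a fixed time; the rates and the 1 ms hold time below are made-up numbers for illustration, not measurements.

```python
def write_lock_utilization(writes_per_sec, lock_hold_sec):
    """Fraction of each second the global lock is held by writers."""
    return writes_per_sec * lock_hold_sec


def max_write_rate(lock_hold_sec):
    """Upper bound on writes/sec when every write serializes on one lock."""
    return 1.0 / lock_hold_sec


HOLD = 0.001  # assumed: each write holds the lock for 1 ms

blog = write_lock_utilization(5, HOLD)         # a few writes per second
busy_site = write_lock_utilization(2000, HOLD)  # write-heavy workload
ceiling = max_write_rate(HOLD)
```

Under these assumptions the blog keeps the lock busy about 0.5% of the time, while the write-heavy site asks for twice what a single lock can deliver (a utilization of 2.0 against a ceiling of roughly 1000 writes per second), no matter how many cores the machine has.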

**4. MongoDB's sharding doesn't work that well under load**

Adding a shard under heavy load is a nightmare.
Mongo either moves chunks between shards so quickly it DOSes
the production traffic, or refuses to move chunks altogether.

This pretty much makes it a non-starter for high-traffic
sites with heavy write volume.

**5. mongos is unreliable**

The mongod/config server/mongos architecture is actually pretty
reasonable and clever. Unfortunately, mongos is complete
garbage. Under load, it crashed anywhere from every few hours
to every few days. Restart supervision didn't always help b/c
sometimes it would throw some assertion that would bail out a
critical thread, but the process would stay running. Double
fail.

It got so bad the only usable way we found to run mongos was
to run haproxy in front of dozens of mongos instances, and
to have a job that slowly rotated through them and killed them
to keep fresh/live ones in the pool. No joke.
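
The rotation job amounts to a round-robin restart loop that never takes down the last live instance. This is a sketch of the idea, not the job we actually ran: the instance names, the `healthy` map and the `restart` callback are all hypothetical, and in practice the proxy's health checks would do the routing.

```python
import itertools


def rotation_schedule(instances):
    """Yield instance names one at a time, forever, in round-robin order."""
    return itertools.cycle(instances)


def rotate_once(schedule, healthy, restart):
    """Restart the next instance in the schedule if another one is still up.

    `healthy` maps instance name -> bool; `restart` is a caller-supplied
    function that kills and respawns the named process.
    """
    target = next(schedule)
    others_up = [name for name, up in healthy.items() if up and name != target]
    if not others_up:
        return None  # never take down the last live instance
    healthy[target] = False  # the proxy should route around it meanwhile
    restart(target)
    healthy[target] = True
    return target


pool = ["mongos-1", "mongos-2", "mongos-3"]
healthy = {name: True for name in pool}
restarted = []
schedule = rotation_schedule(pool)
for _ in range(4):
    rotate_once(schedule, healthy, restarted.append)
```

A real job would sleep between rotations so that restarts stay slow relative to the crash rate, and would confirm the respawned process answers queries before marking it healthy again.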

**6. MongoDB actually once deleted the entire dataset**

MongoDB 1.6, in replica set configuration, would sometimes
determine the wrong node (often an empty node) was the freshest
copy of the data available. It would then DELETE ALL THE DATA
ON THE REPLICA (which may have been the 700GB of good data)
AND REPLICATE THE EMPTY SET. The database should never never
never do this. Faced with a situation like that, the database
should throw an error and make the admin disambiguate by
wiping/resetting data, or forcing the correct configuration.
NEVER DELETE ALL THE DATA. (This was a bad day.)

They fixed this in 1.8, thank god.
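
The fail-safe behaviour argued for here can be sketched as a guard that runs before a node discards its own data. This is an illustration of the principle only: the function, its parameters and the integer "optime" are hypothetical, and it is not how MongoDB's election or sync logic is implemented.

```python
class AmbiguousSyncError(Exception):
    """Raised when syncing would throw away data; an admin must decide."""


def choose_sync_action(local_docs, local_optime, source_docs, source_optime):
    """Decide whether this node may replace its data with the source's.

    Counts are document counts; optimes are monotonically increasing
    integers standing in for "how fresh is this copy".
    """
    if local_docs > 0 and source_docs == 0:
        raise AmbiguousSyncError(
            "source is empty but this node holds data; refusing to wipe"
        )
    if source_optime < local_optime:
        raise AmbiguousSyncError(
            "source is older than local data; refusing to roll back"
        )
    return "sync"
```

The point is the asymmetry: an empty node syncing from a full one proceeds without ceremony, while a full node asked to follow an empty or stale one stops and waits for a human to wipe it or force the configuration.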

**7. Things were shipped that should have never been shipped**

Things with known, embarrassing bugs that could cause data
problems were in "stable" releases--and often we weren't told
about these issues until after they bit us, and then only b/c
we had a super duper crazy platinum support contract with 10gen.

The response was to send us a hot patch, something they were
calling an RC internally, and have us run that on our data.

**8. Replication was lackluster on busy servers**

Replication would often, again, either DOS the master, or
run so slowly that the oplog would be exhausted before it
finished (even with a 50G oplog).

We had a busy, large dataset that we simply could
not replicate b/c of this dynamic. It was a harrowing month
or two of finger crossing before we got it onto a different
database system.
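
The dynamic is simple arithmetic: the copy has to finish before the oplog wraps around. The sketch below uses the 50G oplog mentioned above; the 700 GB dataset size, the write rate and the copy rates are assumed numbers chosen only to illustrate the squeeze.

```python
def oplog_window_hours(oplog_gb, write_gb_per_hour):
    """How long the oplog can absorb writes before old entries are overwritten."""
    return oplog_gb / write_gb_per_hour


def initial_sync_hours(dataset_gb, copy_gb_per_hour):
    """How long copying the full dataset to a new replica takes."""
    return dataset_gb / copy_gb_per_hour


def sync_can_finish(oplog_gb, write_gb_per_hour, dataset_gb, copy_gb_per_hour):
    """True if the copy completes before the oplog wraps."""
    return initial_sync_hours(dataset_gb, copy_gb_per_hour) < oplog_window_hours(
        oplog_gb, write_gb_per_hour
    )


OPLOG_GB = 50        # from the text
WRITE_RATE = 10      # assumed: GB of oplog generated per hour
DATASET_GB = 700     # assumed dataset size

gentle = sync_can_finish(OPLOG_GB, WRITE_RATE, DATASET_GB, copy_gb_per_hour=100)
aggressive = sync_can_finish(OPLOG_GB, WRITE_RATE, DATASET_GB, copy_gb_per_hour=200)
```

With these numbers the oplog covers 5 hours. Copying at 100 GB/hour takes 7 hours and can never catch up; copying at 200 GB/hour takes 3.5 hours and would finish, but that is exactly the read load that DOSes a busy master.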

**But, the real problem:**

You might object that my information is out of date; that they've
fixed these problems or intend to fix them in the next version;
that problem X can be mitigated by optional practice Y.

Unfortunately, it doesn't matter.

The real problem is that so many of these problems existed
in the first place.

Database developers must be held to a higher standard than
your average developer. Namely, your priority list should
typically be something like:

1. Don't lose data, be very deterministic with data
2. Employ practices to stay available
3. Multi-node scalability
4. Minimize latency at 99% and 95%
5. Raw req/s per resource

10gen's order seems to be #5, then everything else in some
order. #1 ain't in the top 3.

These failings, and the implied priorities of the company,
indicate a basic cultural problem, irrespective of whatever
problems exist in any single release: a lack of the requisite
discipline to design database systems businesses should bet on.

Please take this warning seriously.