- Schema Design in Mongo
- ======================
- table => collection
- row => json doc
- embed vs. link
- --------------
- * "contains" relationship: embed (pre-joined, in a sense)
- * 4MB limit on document size, embedded documents included (the limit at the time of these notes; later MongoDB versions allow 16MB)
- order = {
-   _id: ...,
-   ...
-   'lineitems': [],
-   'shippingaddress': {},
-   'total': 0,
-   'tax': 0,
-   'subtotal': 0
- }
- db.orders.ensureIndex({})
- * reach into objects via expressions
- db.factories.insert( { name: "xyz", metro: { city: "New York", state: "NY" } } )
- db.factories.ensureIndex({'metro.city': 1, 'metro.state':1})
- db.factories.find({'metro.state':'NY'})
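The dotted keys in the index and query above address fields inside the embedded `metro` document. A minimal Python sketch of that lookup follows, using plain dicts rather than a live server. `get_path` and `find` are hypothetical helpers, the second factory is made-up sample data, and real MongoDB matching also handles arrays and query operators, which this ignores.

```python
# Hypothetical in-memory stand-in for db.factories; this is not the MongoDB driver.
factories = [
    {"name": "xyz", "metro": {"city": "New York", "state": "NY"}},
    {"name": "abc", "metro": {"city": "Austin", "state": "TX"}},  # made-up sample
]

_MISSING = object()  # sentinel for "field not present"


def get_path(doc, dotted_key):
    """Follow a dotted key such as 'metro.state' through nested dicts."""
    current = doc
    for part in dotted_key.split("."):
        if not isinstance(current, dict) or part not in current:
            return _MISSING
        current = current[part]
    return current


def find(collection, query):
    """Return documents whose dotted-key fields equal the query values."""
    return [
        doc for doc in collection
        if all(get_path(doc, key) == value for key, value in query.items())
    ]


ny_factories = find(factories, {"metro.state": "NY"})
print([f["name"] for f in ny_factories])
```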
- Map/Reduce
- ----------
- map
- _id: ...
- ----------
- * automatically indexed
- * unique
- * invariant (cannot be changed once the document is inserted)
- * or use ObjectId (best for sharded collection)
- Atomic
- compare and swap
- db.inventory.update({_id:n._id, qty:qty_old}, n)
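The update above puts the previously read `qty` into the match criteria, so the write only applies if nobody changed the quantity in between. Below is a rough Python sketch of that compare-and-swap pattern against an in-memory dict. The `inventory` store, `compare_and_swap`, and `decrement_qty` are hypothetical names for illustration. On a real server the match and replace happen atomically inside the update, whereas this single-threaded sketch only models the logic.

```python
# Hypothetical in-memory stand-in for db.inventory, keyed by _id.
inventory = {1: {"_id": 1, "item": "widget", "qty": 10}}


def compare_and_swap(store, new_doc, qty_old):
    """Replace the stored doc only if its qty still equals qty_old.

    Mirrors update({_id: n._id, qty: qty_old}, n): the old qty is part of
    the match, so a concurrent change makes the update match nothing.
    """
    current = store.get(new_doc["_id"])
    if current is None or current["qty"] != qty_old:
        return False  # qty changed since we read it; caller should re-read
    store[new_doc["_id"]] = dict(new_doc)
    return True


def decrement_qty(store, _id, amount, max_retries=5):
    """Read-modify-write loop built on compare_and_swap."""
    for _ in range(max_retries):
        snapshot = dict(store[_id])           # read the current document
        if snapshot["qty"] < amount:
            return False                      # not enough stock
        updated = dict(snapshot, qty=snapshot["qty"] - amount)
        if compare_and_swap(store, updated, snapshot["qty"]):
            return True                       # swap succeeded
    return False                              # gave up after repeated conflicts
```

A caller with a stale view of `qty` gets `False` back and is expected to re-read the document before trying again.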
- MySQL to MongoDB
- ================
- physical HW or dedicated VM
- disk + memory = happy mongo
- migrating data
- --------------
- * read from old storage, write to the new storage
- * moved 5 billion rows from MySQL
- - 100,000 inserts/sec
- - cpu-bound
- wordnik reads + creates java objects @ 250/sec (!)
- disk space
- ----------
- CMS and MongoDB
- ===============
- gridfs - rich media storage, binary objects
- Event Logging
- =============
- * who does what/how? -- funnels
- * how valuable are groups of users? -- virality
- * are our changes working? -- retention, funnel conversion
- backend dreams
- --------------
- * flexible
- * scalable
- * queryable
- * easy to work with
- Enter Mongo
- ===========
- * schemaless
- * rich data manipulation/access
- * at home in web-centric toolchain
- Event example
- ==============
- [{
- name: 'front_page/broadcast_link',
- date: '',
- unique_id: 'sfsadfas',
- bucket: 'big_red_button'
- },
- {
- name: 'front_page/broadcast_link',
- date: '',
- unique_id: 'sfsadfas',
- bucket: 'small_blue_button'
- }]
- Processing Data
- ---------------
- Python - 1 process, 1 machine
- Map/Reduce - hadoop
- Config Docs - name of events to track, realtime?, unique?
- Generate/Apply MongoDB operations
- example:
- how many times each event occurred per bucket
- ( for small collections, use collection.group() )
- ( for large collections, use collection.mapReduce() )
- for every event that comes in, increment
- each event builds up into a heap (group)
- for event in events:
-     key = event['name']
-     if key in matchers:
-         count_key = ""
-         db.event_counts.update()
- ...
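The loop above leaves the counter key and the update arguments elided. One way the per-bucket counting could look is sketched below in plain Python with an in-memory counter. The `(name, bucket)` key layout, the contents of `matchers`, and the last two sample events are assumptions for illustration. Against a live server the increment would typically be an upsert using the `$inc` operator, but the notes do not show the actual call.

```python
from collections import defaultdict

# Hypothetical config: names of the events we want to count.
matchers = {"front_page/broadcast_link"}

# The first two events follow the example in the notes; the rest are made up.
events = [
    {"name": "front_page/broadcast_link", "unique_id": "sfsadfas", "bucket": "big_red_button"},
    {"name": "front_page/broadcast_link", "unique_id": "sfsadfas", "bucket": "small_blue_button"},
    {"name": "front_page/broadcast_link", "unique_id": "qwerty", "bucket": "big_red_button"},
    {"name": "untracked/event", "unique_id": "qwerty", "bucket": "big_red_button"},
]


def count_events(events, matchers):
    """Count how many times each tracked event occurred per bucket."""
    counts = defaultdict(int)
    for event in events:
        key = event["name"]
        if key in matchers:
            # Assumed key layout: one counter per (event name, bucket) pair.
            count_key = (key, event["bucket"])
            counts[count_key] += 1
    return dict(counts)


event_counts = count_events(events, matchers)
for (name, bucket), n in sorted(event_counts.items()):
    print(name, bucket, n)
```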
- event: someone clicks the broadcast button
- (auth)
- click "allow" or "disallow" box
- (share with friends)
- start broadcasting
- tracking 36 different events
- * counts
- * periodic map/reduce: map/reduce every 15min for more complex analysis
- generated javascript map/reduce
- Funnel Calculation
- ------------------
- * per user rollup:
- - for each user, which steps in the funnel have they been at with constraints applied
- - a map to get unique users, reduce to count which unique events triggered
- * per bucket rollup:
- - for each bucket, count of users at each step (abandoned/completed)
- calculations done in batch
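The two rollups above can be sketched in plain Python as a map step that groups events by user and a reduce step that counts users per funnel step in each bucket. The step names in `FUNNEL`, the sample events, and the rule that a step only counts when all earlier steps were reached are assumptions, since the notes do not spell out the constraints. Each user is also assumed to belong to a single bucket.

```python
from collections import defaultdict

# Hypothetical funnel steps, in order; the event names are illustrative only.
FUNNEL = ["broadcast_click", "auth_allow", "share", "start_broadcast"]

# Made-up sample events in the shape used by the event example.
events = [
    {"name": "broadcast_click", "unique_id": "u1", "bucket": "big_red_button"},
    {"name": "auth_allow", "unique_id": "u1", "bucket": "big_red_button"},
    {"name": "broadcast_click", "unique_id": "u1", "bucket": "big_red_button"},  # repeat click
    {"name": "broadcast_click", "unique_id": "u2", "bucket": "big_red_button"},
    {"name": "broadcast_click", "unique_id": "u3", "bucket": "small_blue_button"},
    {"name": "auth_allow", "unique_id": "u3", "bucket": "small_blue_button"},
    {"name": "share", "unique_id": "u3", "bucket": "small_blue_button"},
    {"name": "start_broadcast", "unique_id": "u3", "bucket": "small_blue_button"},
]


def per_user_rollup(events):
    """Group events by user and keep the set of distinct funnel steps each user hit."""
    rollup = {}
    for event in events:
        if event["name"] not in FUNNEL:
            continue
        user = rollup.setdefault(
            event["unique_id"], {"bucket": event["bucket"], "steps": set()}
        )
        user["steps"].add(event["name"])  # the set de-duplicates repeat events
    return rollup


def per_bucket_rollup(user_rollup):
    """For each bucket, count the users who reached each funnel step."""
    counts = defaultdict(lambda: dict.fromkeys(FUNNEL, 0))
    for user in user_rollup.values():
        for step in FUNNEL:
            # Assumed constraint: stop at the first step the user never reached.
            if step not in user["steps"]:
                break
            counts[user["bucket"]][step] += 1
    return dict(counts)


users = per_user_rollup(events)
buckets = per_bucket_rollup(users)
```

The drop between consecutive step counts in a bucket gives the number of users who abandoned at that point.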
- Future
- ------
- * migrate postgres stuff to mongo
- * batch jobs for funnel, retention, virality
- Observations
- ------------
- * big deletes seem to slow things down... so capped collection might be a good idea
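A capped collection has a fixed size and overwrites its oldest documents as new ones arrive, so old events age out without any explicit delete. The behavior can be sketched with a bounded deque, as below. `CappedLog` is a hypothetical class, and real capped collections are sized in bytes (with an optional maximum document count), which this simplifies to a document count.

```python
from collections import deque


class CappedLog:
    """Fixed-size, insertion-ordered log: the oldest entry is dropped when full."""

    def __init__(self, max_docs):
        self._docs = deque(maxlen=max_docs)

    def insert(self, doc):
        # Appending to a full deque silently evicts the oldest document.
        self._docs.append(doc)

    def find(self):
        """Return the retained documents in insertion order."""
        return list(self._docs)


log = CappedLog(max_docs=3)
for i in range(5):
    log.insert({"seq": i})

print(log.find())
```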