a guest
Jun 24th, 2017
Today we had two outages caused by two separate errors, one in implementation and one in design. We have begun the work necessary to prevent these failures in the future, but we understand that what happened today is unacceptable. Here are the technical post-mortems for both events.

Post-mortem #1:

We pushed rooms live at 1:00pm today, and at 1:01pm we began to receive reports of errors. At 1:10pm we identified the problem and pushed a fix to our staging server. That fix has been tested and works. We spoke with Take3's management apparatus and agreed that running the sale at 2:30pm was the right call.
At 1:15pm we changed the on-sale timer on the rooms from 1:00pm to 2:30pm.

So what happened:
We made a typo in the code. Take3 requested that we cap fees on rooms at $25, which seemed like a totally reasonable request, so we added it. I personally tested this feature, but I didn't test it with expensive enough rooms, which was a mistake. On the $600 room, the only one I tested, the fees were less than $25, so the cap never applied and the bug did not trigger. On every other room type, the bug caused the cart to reduce the cost of the room to $0. This, obviously, does not match the price of a room, so it failed our revenue assurance check and our checkout system, rightfully, blocked the sale. We have since corrected this bug in the latest version of the system, which is now live.
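
To illustrate the kind of test that catches this class of bug before launch, here is a minimal Python sketch of a capped fee plus a revenue assurance comparison. It is an illustration, not the production code: the 3% fee rate and the names capped_fee, cart_total, and passes_revenue_check are assumptions made up for the example. Only the $25 cap comes from the request described above.

```python
from decimal import Decimal

FEE_CAP = Decimal("25.00")   # cap requested by Take3 (from the post-mortem)
FEE_RATE = Decimal("0.03")   # ASSUMPTION: the real fee schedule is not stated


def capped_fee(room_price: Decimal) -> Decimal:
    """Fee for one room. The cap limits the fee only, never the room price."""
    raw_fee = (room_price * FEE_RATE).quantize(Decimal("0.01"))
    return min(raw_fee, FEE_CAP)


def cart_total(room_price: Decimal) -> Decimal:
    """Amount the customer should be charged: full room price plus capped fee."""
    return room_price + capped_fee(room_price)


def passes_revenue_check(room_price: Decimal, charged_total: Decimal) -> bool:
    """Hypothetical revenue assurance check: the charge must match the expected total."""
    return charged_total == cart_total(room_price)


if __name__ == "__main__":
    # Exercise prices on both sides of the point where the cap starts to apply.
    for price in (Decimal("600"), Decimal("833.33"), Decimal("1000"), Decimal("2500")):
        print(price, capped_fee(price), cart_total(price))
```

At the assumed 3% rate the cap starts to apply at roughly $833, so a single $600 test room never reaches the capped branch. Testing at least one price on each side of that boundary exercises both paths.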

What are we doing to fix this:

We have corrected the configuration error and are expanding our automated and manual testing to cover the whole system, with a particular focus on testing new product features. We plan to bring on additional engineers to work on this problem.

We love Take3 and we love events, and we are super sorry to have caused any drama around their experience. We know how hard it is to build a relationship with an audience, and we know those relationships require trust. We are sincerely sorry for violating your trust, and we will work to do better in the future.
___

Post-mortem #2:

At 2:30pm we set the rooms back live. Every room was checked out and sold at the correct price.

At 2:40pm, after all of the rooms were gone, we began to receive reports of checkout failures. We investigated and found that a very small number of customers who added a room to their cart at almost exactly the same time as another customer were each given what we call a "cart-lock" on the same room. When they tried to check out, whoever won the race to enter credit card details got the room.
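
For readers unfamiliar with the idea, a cart-lock is meant to be exclusive: only one customer at a time should hold a given room. The Python sketch below shows one way to make the "check, then claim" step atomic so that two near-simultaneous requests cannot both succeed. It is a single-process illustration under stated assumptions, not a description of our actual locking system: the class name CartLocks, its methods, and the hold duration are all hypothetical, and a real multi-server deployment would need the same guarantee from a shared store such as a database constraint.

```python
import threading
import time


class CartLocks:
    """Hypothetical in-memory table of exclusive, expiring holds on rooms."""

    def __init__(self, hold_seconds=600.0, clock=time.monotonic):
        self._mutex = threading.Lock()   # guards the check-then-claim step
        self._holds = {}                 # room_id -> (customer_id, expires_at)
        self._hold_seconds = hold_seconds
        self._clock = clock

    def try_lock(self, room_id, customer_id):
        """Return True if this customer now holds the room, False otherwise."""
        now = self._clock()
        with self._mutex:
            hold = self._holds.get(room_id)
            held_by_other = (
                hold is not None and hold[1] > now and hold[0] != customer_id
            )
            if held_by_other:
                return False
            # Free, expired, or already ours: claim (or refresh) the hold.
            self._holds[room_id] = (customer_id, now + self._hold_seconds)
            return True

    def release(self, room_id, customer_id):
        """Drop the hold, but only if this customer is the one holding it."""
        with self._mutex:
            hold = self._holds.get(room_id)
            if hold is not None and hold[0] == customer_id:
                del self._holds[room_id]
```

Because the lookup and the claim happen inside the same mutex, the second customer to ask for a room is told immediately that it is taken, rather than finding out at payment time.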

What went wrong:
This was exactly the experience we were trying to engineer against by designing a locking system for our ticket sales. For the first version, we had specced the system to handle rooms selling at a 20% faster pace than last year. For reference, rooms sold out last year in approximately 30 minutes. This year rooms sold out in 45 seconds, roughly 40 times faster.

As a part of the design process, we built a queueing system with a cap on the number of participants allowed to buy products at the same time. We set this cap to 50 people to maximize throughput on the website. Because so many people were on the website at once, all trying to buy tickets, the likelihood of a cart-lock collision increased dramatically. The revenue assurance systems did block double sales of rooms, which is the correct behavior.

What we've learned:

This was primarily a design failure. We significantly underestimated the speed at which people would purchase rooms for this event. We also did not take into account the scarcity of resources when designing the queueing implementation for this event.

For future events, when resources are particularly scarce, we will limit the number of participants allowed into the lobby at any given time. We will also test a redesign of our locking system.
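
One possible shape for that limit, sketched in Python, is to admit buyers from a first-come queue only while the number of active buyers is below both a configured cap and the number of rooms still unsold. This is an assumption about how such a limit could work, not a committed design: the Lobby class, its method names, and the rule of tying admissions to remaining inventory are all hypothetical.

```python
from collections import deque


class Lobby:
    """Hypothetical first-come queue that admits buyers in limited numbers."""

    def __init__(self, max_buyers):
        self.max_buyers = max_buyers   # configured cap, e.g. the 50 used this year
        self.waiting = deque()         # customers queued, in arrival order
        self.active = set()            # customers currently allowed to buy

    def join(self, customer_id):
        """Put a customer at the back of the waiting queue."""
        self.waiting.append(customer_id)

    def admit(self, rooms_remaining):
        """Admit waiting customers while active buyers stay within the limit."""
        # ASSUMPTION: never admit more buyers than there are rooms left.
        limit = min(self.max_buyers, rooms_remaining)
        admitted = []
        while self.waiting and len(self.active) < limit:
            customer_id = self.waiting.popleft()
            self.active.add(customer_id)
            admitted.append(customer_id)
        return admitted

    def leave(self, customer_id):
        """Remove a buyer who has finished checkout or abandoned their cart."""
        self.active.discard(customer_id)
```

With 10 rooms left and 200 people waiting, this sketch admits only the first 10, so every admitted buyer has a room available to lock. The trade-off is lower throughput, which matters less when inventory is the bottleneck.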

Again, please accept our sincerest apologies. Although the system worked almost as intended, we know that almost doesn't count. If you were impacted by this error, please reach out to me or the Take3 management apparatus.

Apologetically yours,
Joshua and the rest of the Secret Party Team