June 25th Disruption in Aisles Application and API service.

On Monday the 24th we deployed changes to how our hourly Shopify Catalog Sync behaved. We decided to move the routine from a synchronous task to background jobs, which in theory would let us process catalog syncs for all partners in a much shorter time. The change was originally proposed because of the increased load that adding automatic metadata updates to the sync routine would bring. It was tested the week prior and ran on our QA servers without issue on Friday.

The change was pushed into production at 10:38am on Monday. At 12:40pm we deployed a further change that prevented additional syncs from being added to the queue if a sync request for that partner was already slated to be worked on. This ensured that only one sync per partner was ever queued at a time. We saw that jobs were being worked on and did not see any immediate issues for the rest of the day.
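To illustrate the one-sync-per-partner rule described above, here is a minimal sketch in Python. It is not the production implementation: the class name PartnerSyncQueue, its methods, and the in-memory queue are all hypothetical, and a real job queue would track pending partners in a shared store rather than in process memory.

```python
from collections import deque


class PartnerSyncQueue:
    """Hypothetical FIFO queue that holds at most one pending sync per partner."""

    def __init__(self):
        self._jobs = deque()   # partner ids in the order they were queued
        self._pending = set()  # partner ids that currently have a queued sync

    def enqueue(self, partner_id):
        """Queue a sync unless one is already pending; return True if queued."""
        if partner_id in self._pending:
            return False  # a sync for this partner is already waiting
        self._pending.add(partner_id)
        self._jobs.append(partner_id)
        return True

    def next_job(self):
        """Pop the next partner to sync, or return None when the queue is empty."""
        if not self._jobs:
            return None
        partner_id = self._jobs.popleft()
        # Once a worker picks the job up, the partner may be queued again.
        self._pending.discard(partner_id)
        return partner_id
```

In this sketch a second enqueue call for the same partner is a no-op until a worker has taken the first one off the queue, so the hourly scheduler cannot pile up duplicate work for a slow partner.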

On Tuesday the 25th at around 12:00pm we received a message that Aisles Admin was returning 503 errors to people trying to access it. At 12:25pm we received the first alert from PagerDuty that the Aisles Application had failed its scheduled uptime check. We also started to receive more frequent 503 error alerts in our Slack channel. At 12:21pm we deployed a revert that re-added logic to skip the vast majority of API calls made during the sync process. We had originally removed this logic because it was found to be the reason metadata was not being automatically synced to Shopify. This revert did not seem to alleviate the issues, and by 3:15pm we had deployed a full revert of how we process Shopify syncs. We went back to running the entire sync process synchronously and off the queue, which returned us to the original (and current) implementation of syncing one partner's catalog at a time.

We noticed no immediate change, and at this point we started to investigate whether some other cause could be affecting our servers' performance. Sometime between 4:00 and 4:30pm we noticed a significant increase in the number of active database connections to the Aisles DB. The number of connections usually rarely exceeded 20, whereas we were holding roughly 40-50 active connections during this period of disruption. We began to look into where the connections were coming from and what they were attempting. An unusually large number of connections were being used to run this query:

Select * from order_items where order_items.order_id = $1 and order_items.order_id is not null

The originating requests came from the Aisles servers themselves. At about 5pm Cory ran a command to reload the PHP process manager, which allowed the hanging connections to drop. We noticed an almost immediate improvement in performance and continued to monitor until the end of the day without issue.
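The check we did by hand during the investigation (count active connections, compare against the usual level, see which query dominates) could be sketched roughly as follows. This is an illustration only: the function name summarize_connections and its input, a plain list of the query text for each active connection, are assumptions, and the default baseline of 20 comes from the usual connection count mentioned above.

```python
from collections import Counter


def summarize_connections(queries, baseline=20):
    """Summarize active connections given the query text of each one.

    Returns (over_baseline, top_queries): over_baseline is True when the
    number of active connections exceeds the baseline, and top_queries lists
    up to three (query, count) pairs, most frequent first.
    """
    # Normalize case and whitespace so the same query groups together.
    counts = Counter(" ".join(q.lower().split()) for q in queries)
    return len(queries) > baseline, counts.most_common(3)
```

Fed with a snapshot like the one we saw on Tuesday, a helper of this kind would flag the connection count as above baseline and put the order_items query at the top of the list.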

We've added an additional monitoring tool to our Aisles stack to help us stay ahead of any issues. We've used New Relic on other applications we maintain and have now added it to our Aisles Application and API. This allows us to trace our database interactions between Phido and Aisles, greatly increasing the visibility we have into potential bottlenecks before they become issues of this scale. Additionally, we plan to split what we thought was the original offender (the updated sync plus automatic metadata updates) into separate routines. The original sync logic will continue to run as is, while we launch a separate metadata sync routine that will run alongside it. The new metadata routine will require significantly fewer requests to the Aisles API and can run at a lower frequency, so it should pose less risk when it is rolled out.