Salesforce Technology Service Delivery
Service Disruption on Salesforce Instances in Multiple Clouds on October 1-2, 2024
Preliminary Root Cause Analysis
Published on October 3, 2024

We sincerely apologize for any impact this incident may have caused you and your business. At Salesforce, Trust is our #1 value, and security is our top priority. We value transparency and want to take this opportunity to outline the facts regarding a recent service disruption that may have impacted your ability to use multiple Salesforce services, based on our current understanding. Our investigation is ongoing, and we will provide you with updated information as it becomes available.
Executive Summary - What Happened?
The Salesforce infrastructure was disrupted from October 1 to October 2, 2024, when a missing time-specific configuration prevented Core Application (core app) servers from starting up beginning October 1, 2024, at midnight UTC. The missing configuration is intended to allow log data to be encrypted using different keys during different time periods to meet compliance requirements. The Salesforce Technology team immediately engaged to resolve the issue, which required an emergency release. Live production app servers perform restarts as part of routine operations. During this incident, the missing configuration prevented the app servers from starting up, which reduced the capacity of the fleet and resulted in a performance degradation starting at approximately 07:00 UTC. Further environments were impacted as additional app server restarts took place.
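Salesforce has not published the format of this configuration, so the failure mode can only be illustrated with a hypothetical sketch. In the Python example below, each encryption key entry is assumed to carry an "effective_from" timestamp marking the period in which it may be used; if a provisioned key is missing that time metadata, startup validation fails, which is analogous to the behavior described in this incident. All names, fields, and structures here are assumptions for illustration only, not Salesforce's actual implementation.

```python
from datetime import datetime, timezone

# Hypothetical key configuration: each entry should carry an
# "effective_from" timestamp so log data can be encrypted with
# different keys in different time periods (illustration only).
KEY_CONFIG = [
    {"key_id": "log-key-2024-q3", "effective_from": "2024-07-01T00:00:00+00:00"},
    # The new key was provisioned, but its time metadata was omitted:
    {"key_id": "log-key-2024-q4"},  # missing "effective_from"
]

def active_key(config, now):
    """Return the key whose effective period covers `now`, or raise on bad config."""
    dated = []
    for entry in config:
        if "effective_from" not in entry:
            # Startup validation fails: a provisioned key lacks the
            # time-specific metadata needed to select keys by period.
            raise RuntimeError(f"key {entry['key_id']!r} has no effective_from timestamp")
        dated.append((datetime.fromisoformat(entry["effective_from"]), entry))
    dated.sort(key=lambda item: item[0])
    current = None
    for start, entry in dated:
        if start <= now:
            current = entry
    if current is None:
        raise RuntimeError("no key is effective at startup time")
    return current

if __name__ == "__main__":
    # After midnight UTC on October 1, 2024, startup would abort here.
    active_key(KEY_CONFIG, datetime(2024, 10, 1, 0, 5, tzinfo=timezone.utc))
```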
The Technology team performed an emergency release to add the configuration metadata timestamp, enabling the app servers to restart. The full rollout of the emergency release took 14 hours due to capacity limits on the number of cells that could be upgraded in parallel. While the rollout was in progress, manual efforts to suppress restarts and add the missing metadata mitigated the impact. Further updates to this root cause analysis will be added to this document over time.
How did this issue impact Salesforce services?
During the impact period, users may not have been able to access Salesforce services, while a subset of users could log in but experienced poor performance. Users may have received a “We are down for maintenance” error message during the disruption.
Status.Salesforce.com also experienced intermittent periods of service disruption, with users unable to access the Trust site. However, updates were still posted to Trust, and the Trust post provides details on the impacted instances.
Technical Details
Detection and Initial Impact
On October 1, 2024, at 02:40 UTC, the Technology team became aware of an incident impacting seven sandbox instances running at 50% capacity. Some application servers did not restart successfully, and there was no known customer impact at that time.
On October 1, 2024, at 06:45 UTC, Salesforce received the first cases from APAC customers, and the incident was upgraded to a Sev0. Some users could not access Salesforce services, and others experienced performance degradation. Users may have received a “We are down for maintenance” error message during the outage period.
Following these early alerts and customer case escalations, the Technology team engaged multiple swimlanes of Subject Matter Experts (SMEs) to address the issues.
Remediation
An early investigation by the SMEs pointed to a configuration issue that required an emergency release to fix the problem. The rollout of the emergency release took up to 14 hours due to capacity limits on the number of cells that could be fixed in parallel. While the rollout was in progress, manual efforts to suppress restarts and add the missing metadata were used to mitigate the impact.
To speed up the rollout of the emergency break-fix across all instances, the Technology team added capacity to the instances. The incident was declared resolved at 04:31 UTC on October 2, 2024.
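The 14-hour rollout time follows from the parallelism constraint: when only a fixed number of cells can be upgraded at once, the total duration grows with the number of waves required. The Python sketch below illustrates that relationship with a simple wave-based rollout calculation; the cell count, wave size, and per-cell upgrade time are made-up values for illustration only, not figures from this incident.

```python
import math

def rollout_duration_hours(total_cells, max_parallel, hours_per_cell):
    """Estimate wall-clock time for a wave-based rollout.

    Cells are upgraded in waves of at most `max_parallel` at a time, so the
    duration is the number of waves times the per-cell upgrade time.
    """
    waves = math.ceil(total_cells / max_parallel)
    return waves * hours_per_cell

def plan_waves(cells, max_parallel):
    """Split the cell list into upgrade waves (illustrative only)."""
    return [cells[i:i + max_parallel] for i in range(0, len(cells), max_parallel)]

if __name__ == "__main__":
    # Hypothetical numbers: 140 cells, 10 upgraded in parallel, 1 hour each
    # -> 14 waves -> roughly 14 hours end to end.
    cells = [f"cell-{n:03d}" for n in range(140)]
    print(len(plan_waves(cells, max_parallel=10)), "waves")
    print(rollout_duration_hours(len(cells), max_parallel=10, hours_per_cell=1.0), "hours")
```

Under a model like this, raising the parallel limit or shrinking the per-wave work is what shortens the rollout, which is consistent with the team's decision to add capacity while the break-fix was still rolling out.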
Root Cause Analysis and Next Steps
The Technology team’s remediation actions and early post-incident investigation identified a missing configuration affecting the core application. The missing encryption key configuration is designed to fulfill compliance requirements, with new keys required starting October 1, 2024. The Technology team provisioned additional keys before the due date but did not include the necessary configuration metadata, causing the app startup to fail beginning on October 1. The key provisioning process lacks automated safeguards that would catch this kind of omission.
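The RCA notes only that automated safeguards were absent; it does not describe what such a safeguard would look like. As a minimal sketch, assuming the same hypothetical "effective_from" field used in the earlier example, a provisioning-time check could reject any key delivered without the metadata that app startup depends on. This is an illustration of the general idea, not Salesforce's process or planned fix.

```python
REQUIRED_METADATA = ("key_id", "effective_from")  # hypothetical required fields

def validate_provisioned_key(entry):
    """Reject a key entry that is missing metadata required at app startup.

    A guard like this, run as part of the provisioning workflow, would flag a
    key delivered without its time-specific configuration before the app
    servers ever tried to consume it (illustration only).
    """
    missing = [field for field in REQUIRED_METADATA if not entry.get(field)]
    if missing:
        raise ValueError(f"provisioned key rejected, missing metadata: {missing}")
    return entry

if __name__ == "__main__":
    validate_provisioned_key({"key_id": "log-key-2024-q4",
                              "effective_from": "2024-10-01T00:00:00+00:00"})  # accepted
    validate_provisioned_key({"key_id": "log-key-2025-q1"})  # raises ValueError
```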
The Technology team is currently reviewing how to harden app server startup resiliency and is looking for ways to speed up the deployment of emergency releases.
While the team has identified the trigger of the incident, a deep investigation into the root cause, contributing factors, and further actions is underway. An updated RCA will be provided when available.
We sincerely apologize for the impact this incident may have caused you and your business; Salesforce is fully committed to minimizing downtime when incidents do occur. We also continually assess and improve our tools, processes, and architecture to provide you with the best service possible.
The information contained in this document is provided by Salesforce for general information purposes only and is based on information as of the date of distribution and is subject to change. This document and the information contained herein is for the benefit of the intended recipients only and may not be reproduced, disseminated further, or disclosed to any third party except as specifically permitted by the intended recipients' agreement(s) with Salesforce. The incident timeline displayed within this Salesforce Technology Service Delivery Root Cause Analysis reflects the incident investigation timeline (inclusive of remediation and monitoring). While it includes any period during which the Service was unavailable, the investigation, remediation, and monitoring period is generally much longer than the period of unavailability, if any, caused by the disruption.