I don't have a full RFO yet. Here is what I can tell you so far.

One of the blades in our redundant storage switch stack went offline and caused MCLAG to fail on that stack. These stacks are designed to remain online in the event of a single switch/blade failure. In this case, that didn't happen, and the entire stack rebooted itself. That stack is responsible for all storage connections from your hypervisors to the SAN. Because the stack went down, your LUNs went offline, which took down all of your VMs. This also affected many other customers, so the job of getting everything restored this morning was extensive.

I am waiting to hear from my networking team about why this switch stack didn't perform as designed. Generally speaking, the network team can't turn these answers around in a few hours. It requires hours of extensive log research, both on our side and on the manufacturer's.

When I have a solid answer, I will let you know. At that point I will be able to make the decisions needed to remedy the situation so this can be prevented in the future.