The AFS fileserver GUAVA went down at approximately 8:15pm. I noticed the problem at approximately 11:45pm and rebooted the machine around midnight. Service was restored shortly thereafter.
Although this problem occurred soon after the recent BANANA outage, the two are mostly unrelated.
At the end of second 19:59:59, the kernel stepped the clock backwards by one second, consistent with the scheduled leap second. It appears that after this happened, one or more threads began spinning out of control, as evidenced by repeated throttling of two CPUs for thermal management purposes.
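To make the clock step concrete, here is a minimal sketch in Python of what a one-second backward step looks like to anything reading the wall clock once per second. It is a simulation only, not the kernel's timekeeping code: the helper names (fmt, wall_clock_readings) and the choice to start two seconds before 20:00:00 are assumptions made for illustration. It also does not explain why the threads began spinning, which the logs do not establish.

```python
def fmt(seconds_since_midnight):
    """Format a count of seconds since midnight as HH:MM:SS."""
    hours, rem = divmod(seconds_since_midnight, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def wall_clock_readings(start, true_seconds, step_at):
    """Simulate wall-clock readings taken once per real second.

    start        -- first reading, in seconds since midnight
    true_seconds -- how many real seconds to simulate
    step_at      -- real-time offset at which the clock is stepped back by 1s
    (All three parameters are hypothetical, chosen for this illustration.)
    """
    readings = []
    for elapsed in range(true_seconds + 1):
        # Once the step has happened, every reading is one second behind.
        offset = -1 if elapsed >= step_at else 0
        readings.append(start + elapsed + offset)
    return readings


start = 19 * 3600 + 59 * 60 + 58          # 19:59:58
readings = wall_clock_readings(start, true_seconds=3, step_at=2)
labels = [fmt(r) for r in readings]
wall_elapsed = readings[-1] - readings[0]  # what the wall clock claims passed

print(labels)
print("real seconds elapsed: 3, wall-clock seconds elapsed:", wall_elapsed)
```

The second 19:59:59 appears twice in the readings, and an interval measured with the wall clock comes out one second shorter than the real time that passed. Code that computes deadlines or timeouts from wall-clock time can misbehave around such a step.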
At approximately 20:15, as a result of the heavy CPU load, a bug in the kernel scheduler was triggered, resulting in an attempted division by zero in the code responsible for balancing CPU load across cores. The affected process was the kernel softirq daemon, which is involved in interrupt processing. The resulting failure wedged the machine until it could be restarted.
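As a rough illustration of this class of bug, the Python sketch below shows a load-balancing style calculation that divides by a value which can legitimately become zero, alongside a guarded variant. This is not the kernel's code, and which operand was actually zero in this crash is not known; the names SCALE, average_load and average_load_guarded are hypothetical. In Python the failure surfaces as an exception, whereas in the kernel an integer divide by zero is a fatal fault in whatever context hits it.

```python
SCALE = 1024  # hypothetical fixed-point scaling factor


def average_load(group_load, group_capacity):
    """Unguarded: scaled load per unit of capacity.

    Raises ZeroDivisionError when group_capacity is 0.
    """
    return group_load * SCALE // group_capacity


def average_load_guarded(group_load, group_capacity):
    """Guarded variant: treat a zero-capacity group as carrying no load."""
    if group_capacity == 0:
        return 0
    return group_load * SCALE // group_capacity


# Normal case: both versions agree.
normal = average_load(300, 1024)
guarded_normal = average_load_guarded(300, 1024)

# Degenerate case: capacity has been computed as zero.
try:
    average_load(300, 0)
    unguarded_failed = False
except ZeroDivisionError:
    unguarded_failed = True

guarded_zero = average_load_guarded(300, 0)
```

Whether returning 0 is the right fallback depends on what the caller does with the result; the point is only that the denominator has to be checked before the division.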
While the divide-by-zero bug seen here is similar to problems we saw last year, it is not the same. Unfortunately, the CPU load balancing code in the particular kernel version used on F14 seems to be riddled with this sort of problem.