Wired picked up some of the details of the Prius software bug that I mentioned this past Sunday. It looks like several major news outlets carried a story on this as far back as May 2005. Wired mentioned the Prius troubles in an article called “History’s Worst Software Bugs“. I am disappointed that they didn’t bring up the fact that dealers are still selling the buggy version of the car.
One of Wired’s “worst” is the Arianne 5 flight 501 disaster. Since I am personally familiar with the event (from work at UIowa Dept of Physics and Astronomy) I might be biased, but I must say that while I’m not sure it was one of history’s worst, it certainly makes a great case study. I use it regularly in presentations on risk management. For example, the backup code was the exact same rev as the primary, and thus the bug (floating point error) that caused a failure in the primary…yup, you guessed it…oops.
Wired suggests a Wikipedia version of events, but their second link points to the original European Space Agency ESA) report “ane5rep.html” (also found hosted at MIT). The ESA provided a very clear analysis of the source of the problem:
“The reason why the active SRI 2 did not send correct attitude data was that the unit had declared a failure due to a software exception. The OBC could not switch to the back-up SRI 1 because that unit had already ceased to function during the previous data cycle (72 milliseconds period) for the same reason as SRI 2.”
But even more interesting is that the floating point error itself could have been handled many ways, or the trajectory tested more accurately, but “It was the decision to cease the processor operation which finally proved fatal. […] The reason behind this drastic action lies in the culture within the Ariane programme of only addressing random hardware failures.”
Dare I say, the risk of software bugs was mis-managed?