OIT replaces data arrays after outage

The cause of the computer system outage during the week of March 2 remains unexplained, and the Office of Information Technology has acquired two new storage arrays to prevent a recurrence.

The outage, which began the evening of March 2 and lasted until March 8, was caused by failures of a controller inside a Hewlett-Packard storage array, which held 27 terabytes, or 27 trillion bytes, of various University data. In response, OIT replaced the faulty array, but did not explain why the full replacement was necessary if only one of the controllers was needed to keep services online.

"At this point, we still don't know the root cause of the problem with the array," Klara Jelinkova, senior director of shared services and infrastructure, wrote in an e-mail. "HP is reviewing the logs and replicating our environment in their labs, and I expect them to come back to us with a root [cause] next week."

Although Duke does not have a mission-critical service contract with HP, the company dispatched a support team to Duke, Michael Herrera, manager at HP Worldwide Public Relations, wrote in an e-mail.

"HP rushed a loaner [Enterprise Virtual Array] and additional parts... to the site to replace the one that suffered the outage," Herrera said. He declined to comment on the root cause of the array's failure.

Although the wireless service and the University's Web site were down during the first outage March 2, OIT worked during the second outage March 4 to keep critical services online by running them from the alternative storage arrays, Jelinkova said.

"One controller on the array in question started to reboot on the evening of Monday, March 2," she said. "The easiest way to understand a disk array controller reboot is to think of a PC reboot. If you were working on your PC and it rebooted independently, you'd think, 'I've got a problem here.'"

Jelinkova said OIT contacted HP for an entirely new array immediately, expecting problems to arise again because the source of the problem had not been determined. Despite this early notice, OIT was unable to have the new storage array in place until March 6 and did not fully restore service until March 8.

"These large amounts of data cannot be switched over quickly," Jelinkova said. "[Andrew File System], for example, consists of nearly 13 terabytes of data, with the smallest allocation over 1 terabyte."

AFS is a storage space available to University students, faculty and staff, and it permits users to store up to 5 gigabytes of personal files on the Duke network.

John Board, assistant chief information officer, said systems at the Pratt School of Engineering had similar failures previously, but OIT had handled this incident well because the security of the data was never compromised.

"Duke's system is nearly bullet-proof in that we pay to duplicate everything," Board said.

OIT's failed array, an HP StorageWorks 8000 EVA, was initially deployed in November 2006 and had been stable prior to this incident.

It is one of seven storage arrays holding roughly 230 TB of University data, Jelinkova said, adding that, beyond acquiring the new arrays, Duke is seeking other strategies to improve the arrays' reliability.

"We are also looking at a strategy to combine tape back-up with back-up on array," Jelinkova said.
