The outage of a computer system used by airline pilots to file flight plans in the U.S. will likely prompt a closer look at a US$2.4 billion telecommunications system that has grappled with numerous problems in the past.
The U.S. Federal Aviation Administration (FAA) offered few details Thursday about the exact nature of the glitch, which caused major delays and flight cancellations at airports across the country. But in a statement, the agency blamed a "software configuration problem" within the FAA Telecommunications Infrastructure (FTI) in Salt Lake City.
That problem brought down a system used mainly for traffic flow and flight planning services for about four hours this morning. The flight management system -- it's called the National Airspace Data Interchange Network (NADIN) -- was affected because it relies on FTI services to operate, the FAA said. There was no indication that the disruption was the result of a cyberattack, the FAA said.
FAA experts were investigating the outage and meeting with Harris Corp., the company that manages FTI, to "discuss system corrections to prevent similar outages," the agency said.
In an e-mailed statement, a Harris spokesman said the company is working to "evaluate the interruption" to prevent future outages. "FTI has proven to be one of the most reliable and secure communications networks operating within the civilian government. Safety and security is our highest priority," the company said.
A spokeswoman for the Professional Airways Systems Specialists (PASS) union, which represents more than 11,000 FAA employees, told Computerworld the problem arose when scheduled maintenance on FTI in Los Angeles corrupted a router in Salt Lake City. A backup router that should have kicked in when the primary router went down failed to do so, resulting in the widespread outage, she said.
The $2.4 billion FTI program was introduced by the FAA in 2002 to replace seven FAA-owned and leased telecommunications networks. It provides a range of voice, data and video communication services for operations and mission support at more than 4,000 FAA and Defense Department facilities, according to Harris. The FTI network provides switching and routing services, as well as centralized infrastructure security monitoring services for the FAA.
An audit of the program released by the FAA's Inspector General last September cited concerns over delays in the project's implementation and doubts about the promised cost-benefits the network was supposed to yield. The report also noted several technical problems that had caused unscheduled outages to air traffic control operations.
"In some cases, these outages have involved simultaneous loss of both primary and back-up FTI services, which not only disrupts air travel but also creates potential safety risks," the inspector general report warned, pointing to several incidents in recent years.
On Sept. 25, 2007, for instance, all FTI services were lost at the Memphis Air Route Traffic Control Center (ARTCC), disrupting air traffic control for several hours and causing 566 flight delays, the report said. The problem stemmed from a "catastrophic failure" of an optical network ring that was supposed to offer built-in fault tolerance. The FAA was vulnerable to the same issues at Atlanta and in Jacksonville.
Another incident occurred on Nov. 9, 2007, when all primary and alternate FTI services were lost at Jacksonville, resulting in 85 flight delays. "We also found that when FTI outages occur, the services are not always restored within contractual timeframes," the inspector general's report said. In some cases, where services are supposed to be restored within three hours, Harris took twice as long to fix the problem. "Several areas remain critical watch items for decision makers as FAA moves forward with FTI," the report said.
Others have been critical of the program as well. PASS, for instance, has in the past voiced concern over safety and efficiency issues related to FTI.
PASS spokeswoman Kori Blalock Keller said the FAA needs to hold Harris accountable for the problems. "If they are going to provide service, we need to make sure they are reliable and they are quick" to respond to outages, she said. Although several FAA technicians were on hand in Salt Lake City today, they couldn't do much to help out because the FTI system is managed by Harris, she said.
According to Keller, the incident will likely prompt Congress to ask the FAA Inspector General for another review of the system.
The National Air Traffic Controllers Association (NATCA) has also expressed frustration over FTI. After the failure in Memphis, the organization blasted the network as "unreliable [and] lacking suitable backup" and called it a source of "great frustration and deep concern" for FAA technicians and air traffic controllers.
Bill Curtis, chief scientist at CAST Software and co-author of the Capability Maturity Model used in software development today, said the outage highlights the havoc that can be created when something goes wrong in large, highly interconnected systems such as the FAA air traffic control system.
"It's not just one system, but a system of systems," he said. "If one of them starts behaving in a funny way, it starts propagating out and causes problems in other systems."