Troubleshooters.Com Presents

Troubleshooting Professional Magazine

 
Volume 9 Issue 3, Summer, 2005
  NASA's Intermittent
Copyright (C) 2005 by Steve Litt. All rights reserved. Materials from guest authors copyrighted by them and licensed for perpetual use to Troubleshooting Professional Magazine. All rights reserved to the copyright holder, except for items specifically marked otherwise (certain free software source code, GNU/GPL, etc.). All material herein provided "As-Is". User assumes all risk and responsibility for any outcome.


Steve Litt is the author of the Universal Troubleshooting Process Courseware,
which can be presented either by Steve or by your own trainers.

He is also the author of Troubleshooting Techniques of the Successful Technologist,
Rapid Learning: Secret Weapon of the Successful Technologist, and Samba Unleashed.




 
“We now have an intermittent, transient kind of failure, which is the worst kind of thing to troubleshoot.” -- Wayne Hale (NASA deputy manager of the shuttle program)


Editor's Desk

By Steve Litt
You are not alone!

All those customers, bosses and co-workers who unloaded on you because you were slow to fix a problem -- you are not alone.

The frustration, the uncertainty, the no-win decisions when faced with an intermittent -- you are not alone.

As I write this article on 7/24/2005, NASA is hoping to launch Space Shuttle Discovery on 7/26. But an intermittent says otherwise.  At 1:30pm on 7/13/2005, with launch scheduled for 3:51pm that day, the mission was scrubbed because liquid hydrogen sensor No. 2 indicated "full" when an empty condition was simulated. A "full" reading during flight, when the tank is really empty, could cause an engine rupture followed by tail damage and disaster.

What made this really ugly is they'd seen it before. A similar problem occurred during a routine fueling test in April 2005. Subsequent attempts to reproduce the symptom were unsuccessful. The fuel tank was replaced with a newer version, and some other possible causes were addressed. Some engineers wanted another fueling test, but top management opted to test during the launch sequence.

At 1:30pm during the 7/13/2005 launch sequence, the intermittent once again reared its ugly head.

In the time since 7/13, fourteen teams of engineers have constructed a fault tree and determined a troubleshooting strategy. Wires were wiggled, and three areas of bad grounds were sanded and made tight. The electronic box that processes the fuel gauges' signals has been investigated. Fuel sensors 2 and 4 were exchanged -- a classic doubleswap.

There's so much to investigate:
As I write this article on 7/24/2005, NASA's plan is to launch the morning of 7/26. The fuel gauges will be tested full and simulated empty. My understanding is that if the fuel gauges test good, the launch will proceed, and if the same fuel gauge tests bad in the same way as on 7/13, and if the mode of failure is understood, they might launch anyway.

The stakes are high. If the launch doesn't occur in July, the next launch window occurs in early September, which is the height of hurricane season in a season that produced (so far) 7 named storms in June and July. Because there's an Atlantis launch planned for September, Atlantis would be pushed back. If the launch does occur, and the problem recurs, it could cause a fatal malfunction killing the astronauts and jeopardizing the shuttle program.

I'm glad I don't have to make the decision. And NASA seems a little more human to me.

I somehow assumed that the NASA rocket scientists were so brilliant that they didn't require the Universal Troubleshooting Process. Now, reading about this classic intermittent's savage battle with 14 teams of rocket scientists and the full financial and manpower resources of NASA, it becomes clear that NASA is using the same intermittent busting techniques I teach in the Universal Troubleshooting Process class, and facing the same tough decisions about when the intermittent is declared solved.

If you're a good Troubleshooter, you'll be assigned many intermittent problems, with all the attendant difficulties. At some point you'll probably doubt yourself. If so, remember you're not alone -- NASA has the same problems. So kick back, relax, and remember -- if you're a Troubleshooter, this is your magazine.
Steve Litt is the author of "Troubleshooting Techniques of the Successful Technologist".  Steve can be reached at Steve Litt's email address.

What Actually Happened (Written 4pm 7/24-12:45pm 7/25)

By Steve Litt
This article attempts to chronologically summarize the timeline leading to and including the discovery and handling of Space Shuttle Discovery's fuel gauge intermittent.

The Situation

The Soviet Union launched Sputnik I on October 4, 1957. The 183 pound satellite orbited the earth roughly once every hour and a half. The military uses were obvious, so the United States moved its space program to the highest priority. For the next 12 years, no expense was spared in the U.S. space program, culminating with the Apollo 11 moon landing on July 20, 1969. Five moon landings followed, the last of which was Apollo 17 in December of 1972. As of today, no man has set foot on the moon since December 1972.

American priorities changed in the 1970's, with many questioning why we spent so much on space when many Americans were poor and uneducated. Space exploration became less frequent and unmanned. Then, in the 1980's, the U.S. began the shuttle missions. Columbia, our first shuttle, became operational in November 1982. Challenger flew in 1983, Discovery in 1984, Atlantis in 1985, and Endeavour in 1992. The Challenger blew up January 28, 1986, slowing space exploration. The next shuttle launch after the Challenger disaster was September 29, 1988. The shuttles had been grounded for more than 2.5 years.

Shuttles continued flying. These shuttles added enormously to our knowledge, and the satellites they launched are responsible for our electronic way of life today.

Space Shuttle Columbia shook apart high over Texas on February 1, 2003. The shuttle program was put on hold while NASA explored how to reduce the likelihood of such disasters. Another factor was the economic meltdown of the early 2000's, which put every federal dollar under increased competition. More than ever, the space program was expected to be cost effective.

Another factor was the age of the Space Shuttle fleet. Even though the oldest two shuttles had blown up, the average shuttle age in 2005 is 18 years. Discovery is now 21 years old. When you read about "transistors" in Discovery's fuel gauge electronics box, keep in mind that Discovery was built in 1984.

Such was the situation in 2005, as NASA attempted to restart the Shuttle program.

The Discovery Timeline

Space Shuttle Discovery was tapped to be the first shuttle into space since the Columbia disaster. The following is a timeline as it relates to the fuel gauge problem:

Discussion

I have no contacts at NASA. All info here was gleaned off the Internet. Most of this information was corroborated on several websites, and also seems to agree with what I've heard on radio and TV. The timeline mentions that the problems were traced to electrical interference and grounding problems. I've found no info on how such "tracing" took place -- was it traced by valid troubleshooting, or was it traced simply by following a cascade of possible faults?

Summary

Fuel gauge anomalies were found during an April routine fueling test. Some subsequent tests found the fuel gauges to be accurate, pointing to an intermittent problem. Several possible causes were explored and fixed, including cables, the electronic box for the gauge, and the fuel tank. This is classic corrective maintenance (general maintenance).

During the July 13 launch sequence similar problems occurred, so the launch was scrubbed. Some later fuel gauge tests showed the problem still existed, but still later the symptom went away.

A deep exploration of possible causes was performed, and possible causes were proactively repaired, such as faulty ground connections, of which three were actually found.

It has now been decided to launch on July 26, and during that launch to retest the fuel gauges. There has been some talk of launching with a defective gauge reading if such defect is in the same gauge as the July 13 problem, and if the defect is well understood.

My investigations on the web, especially when reading between the lines, tell me that there has been no positive identification of a root cause, which of course is not uncommon in intermittent problems.
Steve Litt is the creator of the Universal Troubleshooting Process.  Steve can be reached at Steve Litt's email address.

How NASA Coped (Written 4pm 7/24-12:45pm 7/25)

By Steve Litt
The sparsest of all intermittents is an event, and that's just what happened during an April 2005 fueling test. The gauge read full during an empty tank simulation, but later (correctly) read empty. By definition this was an intermittent, in that NASA knew of no way to reproduce the symptom.

Corrective maintenance is a powerful weapon against intermittence. NASA performed corrective maintenance by repairing or replacing the fuel tank, some wiring, and the electronic box that handles the sensor's signal.

Classic Universal Troubleshooting Process theory maintains that one does not attempt corrective maintenance in safety critical situations because it eliminates the opportunity to find the root cause. In an ideal world with infinite funding for NASA, other intermittent busting tactics would have been used.

However, the reality of the world is that there is always a tradeoff between safety and economics. NASA felt that, after corrective maintenance, they could postpone final testing until an actual launch sequence on July 13.

On July 13, during routine fuel gauge testing 2½ hours before launch, the problem recurred. This is very fortunate, because if it had occurred during flight instead of before launch, it might (or might not) have been fatal.

They scrubbed the launch and began looking for the problem. NASA had two choices:
  1. Go into full troubleshooting mode
  2. Stay in pre-launch mode
Going into full troubleshooting mode would have enabled more detailed testing for the root cause, but also would have involved more disassembly and foreclosed any possibility of launching in July or August, which in turn would have impinged on Atlantis' launch, which is scheduled for the September window. Staying in pre-launch mode would reduce the likelihood of finding the root cause, but would keep open the possibility of launching in July. NASA chose to stay in pre-launch mode while troubleshooting.

Direct from my troubleshooting course, here is a list of intermittent busting techniques:
NASA is famous for preventive maintenance, but in this case the intermittent slipped through.

Corrective maintenance was exploited --  the fuel tank, electronic box and wiring were addressed between April and July. Three defective electronic grounds were found and corrected after July 13.

They certainly tried to turn the intermittent against itself. Here is a quote from Shuttle program deputy manager Wayne Hale: "The repair that might get us to Sunday would be if we go in and wiggle some of the wires and find a loose connection". In that same news conference Hale said "You laugh" ... "That probably is the first step in any troubleshooting plan. Some technician is going to put his hand on the wires and the connectors ... and start wiggling them."

Ignoring it is not an option in a safety critical situation, so of course NASA didn't ignore it.

My investigation hasn't turned up any evidence of their trying specifically to find a reproduction sequence (convert to a reproducible), but I'm certain they did that.

I know of no use of logs, strip chart recorders or other instrumentation that looks back in time, but then again, I wasn't working there, so my information comes from news sources.

What really impressed me about NASA is a tactic not listed above -- fault tree analysis. Fault tree analysis is very expensive, so it would never be used on consumer computers or the like. But in a situation that's both safety critical and cost critical, creation of a fault tree through cause and effect analysis of the system can provide an exhaustive list of components on which to perform corrective maintenance, thereby making the corrective maintenance much more likely to be effective. If the corrective maintenance is truly effective, the lack of identification of a root cause is less of a problem, although it's still a problem.
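
The idea of a fault tree feeding corrective maintenance can be sketched in a few lines of Python. This is only an illustration: the component names come from this article, but the tree structure, the `Node` class and the `leaves` function are invented for the example, every branch is treated as a simple "any of these could cause it" relationship, and NASA's real fault tree is surely far larger and more detailed.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One node in a fault tree: the symptom, an intermediate fault, or a component."""
    name: str
    children: List["Node"] = field(default_factory=list)


def leaves(node: Node) -> List[str]:
    """Return every leaf (checkable component) under node, depth first, no duplicates."""
    if not node.children:
        return [node.name]
    found: List[str] = []
    for child in node.children:
        for name in leaves(child):
            if name not in found:
                found.append(name)
    return found


# Hypothetical tree: names follow the article, the structure is an assumption.
tree = Node("gauge reads full when empty is simulated", children=[
    Node("sensor fault", children=[Node("fuel sensor")]),
    Node("signal path fault", children=[
        Node("wiring and connectors"),
        Node("ground connections"),
    ]),
    Node("signal processing fault", children=[Node("electronic box")]),
])

# The leaves form the checklist on which to perform corrective maintenance.
checklist = leaves(tree)
for item in checklist:
    print("inspect, repair or replace:", item)
```

Walking the tree down to its leaves yields the list of components to inspect, repair or replace, which is what makes the corrective maintenance exhaustive rather than a matter of guesswork.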

Now that everything in the fault tree has been addressed, the plan is to launch on July 26, and thoroughly test the fuel gauges during pre-launch. If the symptom does not appear, the launch will take place. There is some discussion of launching even in the face of symptom occurrence, if the symptom is identical, affecting the identical gauge, and it is understood.

NOTE

As of noon on 7/25/2005, it appears that NASA definitely plans to launch with only 3 sensors if the same sensor malfunctions in the same way and they thoroughly understand the mode by which this malfunction takes place.


Critique of NASA's Handling of this Intermittent (Written 4pm 7/24-12:45pm 7/25)

By Steve Litt
EDITOR'S NOTE

This article was written between 4pm on July 24, 2005, and 12:45pm on July 25, 2005, well before the launch at 10:39am on July 26. I've deliberately stopped writing this article before the launch to prevent myself from Monday morning quarterbacking NASA. Hindsight is always 20/20, and for that reason useless.

After the launch I'll write a separate article in which perhaps I'll look with hindsight at not only NASA's actions but also my writings in this article.

Hey, this is NASA. Every one of their hundreds of engineers is smarter than I am. These guys truly are rocket scientists.

They are also under two tremendously conflicting pressures -- safety and economics. The politics of the situation is momentous. It would be silly to second guess the NASA engineers.

Monday Morning Quarterbacking is never appropriate, so I am rushing this TPM issue to press before launch, so that by definition I cannot be Monday Morning Quarterbacking (unless I have a working crystal ball :-).

The above being said, I'd like to analyze my understanding of NASA's actions from a Troubleshooting viewpoint.

First let me start with my one point of disagreement. Some have mentioned that if the malfunction occurs on July 26, but it happens to the same gauge in the same way and is thoroughly understood, the launch should occur anyway. I STRONGLY disagree. There is currently a safety policy that you do not launch without all 4 gauges working perfectly. Safety policies should never be changed to accommodate an intermittent.

I fully applaud NASA's decision to scrub the July 13 launch upon seeing this problem. You don't ignore intermittents in safety critical situations. Another point of admiration is their use of a fault tree to reveal, check, maintain and if need be correct possible root causes. It's this kind of behavior that makes them true rocket scientists. Against a brutal intermittent, in a safety critical situation, under extreme time and budgetary pressure, they made a plan and carried it out. They displayed The Attitude.

If it were my call, I'd have done more troubleshooting between April and July 13. Ideally, I'd have pursued troubleshooting methods not destructive of the root cause. With the frozen fuel still in the tanks, I'd have pursued manipulation (wiggling etc), tried to find a reproduction sequence, tried to do some detective work back in time, perhaps involving logs, journals or strip chart recorders, and possibly used a method such as Root Cause Analysis.

Slightly less ideally, I'd have pursued the fault tree in April, corrected/maintained everything revealed, and then done at least one full cryogenic load in May to try to verify the fix, so that we wouldn't arrive in July with an intermittent and only one chance before space to reproduce it.

The preceding two paragraphs outline what I'd do with endless resources. I have no idea of the time and money constraints of NASA, nor how many other events (unexplained anomalies) they regularly run into. Although I'd have tried to handle it a little differently, I have no beef with the way they handled it.

I'm concerned with the prospect of launching if the symptom doesn't appear at 10:39 on July 26. If the intermittent has not been fixed by the corrective maintenance, and chooses to rear its ugly head in space rather than on the launchpad, things could get ugly. I don't know how practical this would be, but I'd prefer perhaps a mock launch on July 25, followed by a real launch on July 27. This would give the symptom two chances to occur on the ground instead of one. Here again, I'm not privy to the economic, political and safety pressures on NASA, nor do I have information on the practicality of performing a launch two days after a trial launch.
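
A rough sketch of why a second ground test helps, written in Python. It assumes the symptom of an unfixed intermittent appears with some fixed chance in each test and that tests are independent of each other. Both are simplifying assumptions, the 0.5 figure is a pure guess for illustration, and the function name `miss_probability` is made up for this example.

```python
def miss_probability(p_show: float, ground_tests: int) -> float:
    """Chance that an unfixed intermittent stays hidden through every ground test.

    p_show is the assumed chance that the symptom appears in any one test;
    tests are assumed to be independent of each other.
    """
    if not 0.0 <= p_show <= 1.0:
        raise ValueError("p_show must be between 0 and 1")
    if ground_tests < 0:
        raise ValueError("ground_tests must not be negative")
    return (1.0 - p_show) ** ground_tests


# Illustrative guess only: suppose the symptom shows up in half of all tests.
one_test = miss_probability(0.5, 1)    # real launch countdown only
two_tests = miss_probability(0.5, 2)   # mock launch plus real countdown
print(one_test, two_tests)
```

Under those assumptions the chance of the symptom slipping past the ground tests and first appearing in flight drops from one half to one quarter. The sparser the intermittent, the smaller p_show is and the less any single clean test proves.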

I have some suggestions for the future. The fact that three areas of bad grounding were found indicates a weakness in NASA's preventive maintenance up to this point. Going forward, I'd like to see procedures and policies for maintaining all ground connections at intervals commensurate with the ease of such maintenance and its safety ramifications. I'd then like to see an engineering group reformulate all preventive maintenance procedures and policies for the maintenance of this fleet, which now includes craft that are 21, 20 and 13 years old.

Message to the Press: It's Not a "Glitch" (Written 11:45am 7/25-12:30pm 7/25)

By Steve Litt
The press missed an opportunity to help the public understand the significance of intermittent problems. Had the press taken advantage of this opportunity, John Q. Average could have understood why it takes so long for his mechanic to fix a problem where "every few days the car bucks for a few minutes and then goes back to normal".

They could have called it an intermittent problem. Instead they called it a "glitch". They could have explained that in diagnosing intermittent problems, one seeks to make the symptom reappear. Instead, they glossed over it.

Launching rockets might be rocket science, but understanding intermittent problems is not. An intermittent is simply an on again, off again problem for which there is no known way to make it happen. Therefore, diagnostic tests are of limited value, because you don't know whether the symptom went away because of the diagnostic test, as opposed to random chance.

If the press had spoken to me, I could have explained this.

Instead, they called it a "glitch".

The best I can fathom from dictionary definitions is that "glitch" means a sudden, unexpected change, often with the connotation of being minor. There's nothing minor about a problem that could rupture a shuttle's engine, and nowhere in that definition is it stated that the glitch will probably reappear. It may seem a single, random event, but given enough time, it will happen again unless fixed.

If you work for the press, please interview me. You owe it to every car driver and computer user in the country.
Steve Litt is the author of the Universal Troubleshooting Process courseware.   Steve can be reached at Steve Litt's email address.

No Further Symptoms (Written 10:00pm 7/28-10:30pm 7/28)

Discovery launched at 10:39am on July 26, 2005. I saw the launch from 60 miles away -- it was beautiful. Extensive tests on launch day failed to reproduce the fuel gauge symptom, so either this is a very sparse intermittent or NASA's fault tree driven corrective maintenance fixed the root cause. Although I would be skittish about launching with an intermittent not thoroughly understood, the fact is that many times that's just what we have to do.

I'm very glad they did NOT alter their launch policies and launch with a known bad fuel gauge system. That, in my opinion, would have been the wrong decision -- one does not alter a safety policy to accommodate an intermittent, or for any reason other than proof that the safety policy was unnecessary.

Every Troubleshooter in the world should take pride in the troubleshooting job done by NASA's Engineers. Forced upon them was a sparse intermittent on one of the world's most complex and technically challenging systems, in a politically charged situation that had brewed for 2½ years (actually much longer).

Time constraints made non-destructive troubleshooting methods impractical, so they went with the quickest effective weapon in the war on intermittents -- corrective maintenance. But not just any corrective maintenance -- they drove that corrective maintenance with a fault tree derived from a mental model of the system. This is not easy. It really is rocket science.

NASA -- you're outstanding!
Steve Litt is the author of Troubleshooting Techniques of the Successful Technologist.  Steve can be reached at Steve Litt's email address.

There's a Reason They Call it Rocket Science (Written 10:15pm 7/28-11:00pm 7/28)

By Steve Litt
Discovery launched at 10:39am on July 26, 2005, ending a 2½ year post-Columbia hiatus. The fuel gauge intermittent problem was addressed, extensively tested for, and most likely fixed.

Some foam and insulation fell off during launch, creating a possible safety problem on reentry. This is similar to what destroyed the Columbia, and the thrust of the last 2½ years was to prevent future occurrence of falling insulation. It obviously didn't work. What now?

For starters, future Shuttle launches are on hold, and some question the future of our space program. That's not good.

During the last 2½ years, many contingencies have been put in place to address such an event. First, we launch only in daylight so we can see it happen. We now photograph the launched craft from many angles, including high altitude jets. If it happens, we're much more likely to know about it.

Once we know about it, the Astronauts have been given materials and training to fix many types of problems caused by falling insulation. If that can't work, the Astronauts stay at the space station until a rescue craft can be sent.

My point is this: When Columbia shook to pieces over Texas, we didn't see it coming, and for several days we didn't know the cause. This time we know it happened, know what to look for, know how to fix it in space if it can be fixed in space, and have a plan if it can't be fixed. Anybody saying NASA had 2½ years to fix this problem and failed doesn't understand the magnitude of what NASA has accomplished.

Our shuttles are hugely complex because breaking free of Earth's gravity is a monumental task. It's a challenge, and with challenge comes failure. Asking for six sigma might be reasonable when manufacturing ball bearings, but not when maintaining a space shuttle. Space shuttles are extreme, so there are injuries.

I skateboard from point A to point B, and never leave the ground. My worst injury was a little lost skin and a few bruises. Tony Hawk goes yards in the air, skates vertical, goes 360 in pipes. Would you hold him to the same safety record as me? Never.

There's a reason they call it rocket science!

Commitment

We expect so much from our space program, but do we as a nation have the commitment to support those expectations? Hiring and keeping the best brains in the world isn't cheap. The preventive maintenance, research and development necessary to consistently get up and come down safely and successfully are expensive.

The fact that the Engineers found three bad grounds during their work on the fuel gauge problem means they must vastly improve their preventive maintenance. But is America willing to pay for it? Or are we looking for performance on the cheap?

If we want a successful, safe space program, we need to pay for it, even though it's very expensive. We need to get the money. Whether we cut social programs, cut the Iraq war, cut the military in general, break medical monopolies, raise taxes, or start selling our national forests, we must pay for the performance we expect.

One could respond that NASA could work smarter and cheaper. That rhetoric might work in some sectors, but few are smarter than those employed by NASA.

Bottom line, we can either pay for a safe and successful space program, or cede space leadership to China, Russia or the European Union.

Letters to the Editor

All letters become the property of the publisher (Steve Litt), and may be edited for clarity or brevity. We especially welcome additions, clarifications, corrections or flames from vendors whose products have been reviewed in this magazine. We reserve the right to not publish letters we deem in bad taste (bad language, obscenity, hate, lewdness, violence, etc.).
Submit letters to the editor to Steve Litt's email address, and be sure the subject reads "Letter to the Editor". We regret that we cannot return your letter, so please make a copy of it for future reference.

How to Submit an Article

We anticipate two to five articles per issue, with issues coming out monthly. We look for articles that pertain to the Troubleshooting Process, or articles on tools, equipment or systems with a Troubleshooting slant. This can be done as an essay, with humor, with a case study, or some other literary device. A Troubleshooting poem would be nice. Submissions may mention a specific product, but must be useful without the purchase of that product. Content must greatly overpower advertising. Submissions should be between 250 and 2000 words long.

Any article submitted to Troubleshooting Professional Magazine must be licensed with the Open Publication License, which you can view at http://opencontent.org/openpub/. At your option you may elect the option to prohibit substantive modifications. However, in order to publish your article in Troubleshooting Professional Magazine, you must decline the option to prohibit commercial use, because Troubleshooting Professional Magazine is a commercial publication.

Obviously, you must be the copyright holder and must be legally able to so license the article. We do not currently pay for articles.

Troubleshooters.Com reserves the right to edit any submission for clarity or brevity, within the scope of the Open Publication License. If you elect to prohibit substantive modifications, we may elect to place editor's notes outside of your material, or reject the submission, or send it back for modification. Any published article will include a two sentence description of the author, a hypertext link to his or her email, and a phone number if desired. Upon request, we will include a hypertext link, at the end of the magazine issue, to the author's website, provided that website meets the Troubleshooters.Com criteria for links and that the author's website first links to Troubleshooters.Com. Authors: please understand we can't place hyperlinks inside articles. If we did, only the first article would be read, and we can't place every article first.

Submissions should be emailed to Steve Litt's email address, with subject line Article Submission. The first paragraph of your message should read as follows (unless other arrangements are previously made in writing):

Copyright (c) 2001 by <your name>. This material may be distributed only subject to the terms and conditions set forth in the Open Publication License, version  Draft v1.0, 8 June 1999 (Available at http://www.troubleshooters.com/openpub04.txt/ (wordwrapped for readability at http://www.troubleshooters.com/openpub04_wrapped.txt). The latest version is presently available at  http://www.opencontent.org/openpub/).

Open Publication License Option A [ is | is not] elected, so this document [may | may not] be modified. Option B is not elected, so this material may be published for commercial purposes.

After that paragraph, write the title, text of the article, and a two sentence description of the author.

Why not Draft v1.0, 8 June 1999 OR LATER

The Open Publication License recommends using the words "or later" to describe the version of the license. That is unacceptable for Troubleshooting Professional Magazine because we do not know the provisions of that newer version, so it makes no sense to commit to it. We all hope later versions will be better, but there's always a chance that leadership will change. We cannot take the chance that the disclaimer of warranty will be dropped in a later version.
 

Trademarks

All trademarks are the property of their respective owners. Troubleshooters.Com (R) is a registered trademark of Steve Litt.
 
