Troubleshooters.Com® and the UTP subsite present:
Intermittents and Reproducibles
Copyright (C) 1996, 2001 by Steve Litt
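Intermittents and reproducibles are two kinds of symptoms. The way I define them, they're opposite and mutually exclusive. Here are the definitions:
Reproducible: A symptom that the troubleshooter can consistently reproduce with a known procedure.
Intermittent: A symptom that the troubleshooter cannot consistently reproduce, because no known procedure reproduces it consistently.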
Notice the following about these definitions:
An intermittent can be reproduced. However, the troubleshooter cannot cause its reproduction, because there is no known procedure to consistently reproduce it. The best the troubleshooter can do is create an environment that increases the odds of the symptom occurring, and wait. When the symptom occurs, the troubleshooter didn't reproduce it; it reproduced itself.
An intermittent can become a reproducible. This happens when the troubleshooter finds a procedure to consistently reproduce the symptom. In other words, these terms are defined from the troubleshooter's frame of reference, not the physical world's. In the physical world, everything is reproducible if viewed in enough (molecular, atomic, memory bit, etc.) detail.
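The definitions above can be turned into a rough check. The sketch below is a hypothetical helper and not a procedure from this article. `reproduces_consistently` and the `procedure` callable are assumptions made for illustration. The helper runs a candidate reproduction procedure several times and reports whether the symptom appeared on every run. A finite number of runs can only build confidence and can never prove consistency, so the `trials` count is a judgment call.

```python
from typing import Callable


def reproduces_consistently(procedure: Callable[[], bool], trials: int = 10) -> bool:
    """Return True only if the symptom appears on every one of `trials` runs.

    `procedure` is a hypothetical callable that performs the candidate
    reproduction steps and returns True when the symptom shows up.
    """
    if trials < 1:
        raise ValueError("trials must be at least 1")
    for _ in range(trials):
        if not procedure():
            # A single miss means this procedure does not consistently
            # reproduce the symptom, so it does not make it a reproducible.
            return False
    return True


# Usage: a simulated flaky procedure that triggers the symptom only on every third call.
runs = {"count": 0}


def flaky_procedure() -> bool:
    runs["count"] += 1
    return runs["count"] % 3 == 0


print(reproduces_consistently(lambda: True))     # True: symptom on every run
print(reproduces_consistently(flaky_procedure))  # False: the first run misses
```

Stopping at the first miss is deliberate. One failure is enough to show that the procedure, as written, leaves chance in control.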
"The picture goes blank within an hour of turning it on" is not a reproducible symptom. The word "within" means that sometimes it's an hour, sometimes 45 minutes, and so on; the exact time is governed by chance. It's probable that sometimes it will take more than an hour to occur, maybe much more.
As per the definition, the troubleshooter can reproduce the symptom at will. If the troubleshooter performs a test that stops the known procedure from reproducing the symptom, he or she has clearly ruled out part of the search area. After a number of such tests, the troubleshooter will have narrowed the cause to a single component, which can then be repaired or replaced. There are two requirements for mathematical certainty of solution in reproducibles:
The troubleshooter uses a systematic approach.
The troubleshooter has sufficient knowledge of the system to devise and interpret conclusive troubleshooting tests.
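The narrowing process described above is often done by half-splitting. You test in the middle of the remaining search area, so each conclusive result rules out about half of it. The sketch below makes two assumptions for illustration. First, the suspect components form a single chain, like a signal path. Second, a hypothetical test `bad_at(i)` reports whether the symptom is visible at the output of component `i`. Real systems are rarely this linear. Because the symptom is reproducible, `bad_at` gives the same answer every time for the same `i`, and that is what lets the search converge.

```python
from typing import Callable, Sequence


def half_split(components: Sequence[str], bad_at: Callable[[int], bool]) -> str:
    """Return the first component in the chain whose output shows the symptom.

    Assumes bad_at(i) is False for every component before the fault and
    True at the faulty component and everything after it.
    """
    if not components:
        raise ValueError("no components to search")
    lo, hi = 0, len(components) - 1
    if not bad_at(hi):
        raise ValueError("symptom not observed at the end of the chain")
    while lo < hi:
        mid = (lo + hi) // 2
        if bad_at(mid):
            hi = mid      # fault is at mid or earlier: rule out everything after mid
        else:
            lo = mid + 1  # output is good through mid: rule out everything up to mid
    return components[lo]


# Usage with hypothetical stage names; the fault is simulated at index 3.
stages = ["input", "filter", "amplifier", "detector", "driver", "output"]
print(half_split(stages, lambda i: i >= 3))  # detector
```

Each test halves the remaining search area, so a chain of n components needs only about log2(n) tests after the initial check at the end of the chain.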
With intermittents, there's no mathematical certainty of solution. Indeed, many intermittents are never solved. Here's why. Since reproduction of the symptom isn't in the troubleshooter's hands, there's no way of knowing whether a symptom went away because of a test the troubleshooter performed or because of random chance. Since no conclusive test can rule out part of the search area, the underlying cause can't be traced. Instead, the troubleshooter uses a combination of general maintenance, statistical analysis, intuition, and trial and error. These four tools often lead to a solution, but not always.
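A short calculation shows why a quiet spell after a test proves little. The sketch below models the symptom as striking at random with a constant average rate (a Poisson process). That model is an assumption made for illustration, not a claim from this article. Under the model, the chance that an unfixed system shows no symptom for `watch_hours` is exp(-watch_hours / mean_hours_between). The function names are hypothetical.

```python
import math


def chance_of_quiet_window(mean_hours_between: float, watch_hours: float) -> float:
    """Probability that an UNFIXED intermittent stays quiet for watch_hours.

    Assumes occurrences are independent with a constant average rate
    (a Poisson process), so the time between occurrences is exponential.
    """
    if mean_hours_between <= 0 or watch_hours < 0:
        raise ValueError("mean interval must be positive; watch time non-negative")
    return math.exp(-watch_hours / mean_hours_between)


def watch_hours_needed(mean_hours_between: float, max_chance: float) -> float:
    """Quiet hours needed before an unfixed system would stay that quiet
    with probability no greater than max_chance (same Poisson assumption)."""
    if mean_hours_between <= 0 or not 0 < max_chance < 1:
        raise ValueError("need a positive mean interval and 0 < max_chance < 1")
    return -mean_hours_between * math.log(max_chance)


# A symptom averaging one occurrence every 8 hours, watched for 2 hours after a test:
print(round(chance_of_quiet_window(8, 2), 2))  # 0.78
# Quiet time needed before luck alone would explain it at most 5% of the time:
print(round(watch_hours_needed(8, 0.05), 1))   # 24.0
```

Under this model, a test that did nothing at all would still appear to "fix" such a symptom about 78% of the time if judged after two hours.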
If the troubleshooter can't reproduce the symptom at all, all he or she is left with is general maintenance and guesswork. In this case, the probability of solution is low.
General maintenance: Since intermittents are so tough to troubleshoot, general maintenance starts looking a lot easier. Cleaning every connector in a computer might seem like too much work for a reproducible, but compared to the hassle of troubleshooting an intermittent, it's downright easy. It's often the best policy to perform general maintenance on an intermittent, then either test it or give it back to the user or customer to test. Be sure the customer or user is informed of what you did and what he or she can expect.
Turn the intermittent against itself: I knew a tech who had little electronics knowledge, but he was a ninja with freeze spray and a heat gun. When he got an intermittent, he'd apply the freeze spray and heat gun to various sections of the machine to toggle the symptom and narrow it down physically. If that didn't work, he'd bend circuit boards and wiggle things looking for bad connections. By turning the intermittent against itself, he actually had an easier time with intermittents than with reproducibles.
Convert the intermittent into a reproducible: A program I wrote used to crash "once in a while", driving the customer crazy. Several attempts to fix it using general maintenance and intuition failed, and the problem persisted. After a week, I went on site and found a sequence of input files that, when played in a specific order, always caused the crash. I put those files in their own directory, converting the intermittent into a reproducible. Fifteen minutes with a source code debugger then narrowed it down to a single line of C code. Always try to find a procedure to consistently reproduce the symptom.
Statistical analysis: We all use a human, subjective style of statistical analysis when dealing with intermittents. "It seems to happen more when...", "It seems to happen less when I...", and "It seems to happen about once every..." are examples. The real breakthrough will come when a diagnostic machine can exercise the system in several ways, record each instance of the symptom, correlate those instances with the exercises, and statistically evaluate the correlation. When an exercise's symptom rate is three or more standard deviations above the norm, bingo, you've got your reproduction procedure.
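The kind of correlation such a machine would perform can be sketched in a few lines. This is an illustration under stated assumptions, not an existing tool. Each exercise is run a number of times and its symptom count is recorded, and a baseline rate comes from ordinary operation. Each exercise's count is then compared with the baseline using the normal approximation to the binomial distribution. The exercise names and counts are made up for illustration. The approximation is rough when an exercise has few runs, and treating the baseline rate as exact ignores that rate's own uncertainty.

```python
import math
from typing import Dict, Tuple


def flag_exercises(baseline: Tuple[int, int],
                   exercises: Dict[str, Tuple[int, int]],
                   threshold: float = 3.0) -> Dict[str, float]:
    """Return {exercise: z} for exercises whose symptom count is `threshold`
    or more standard deviations above what the baseline rate predicts.

    baseline and each exercise are (runs, symptom_count) pairs.
    """
    base_runs, base_hits = baseline
    if base_runs <= 0:
        raise ValueError("baseline needs at least one run")
    p0 = base_hits / base_runs
    if not 0 < p0 < 1:
        raise ValueError("baseline rate must be strictly between 0 and 1")
    flagged = {}
    for name, (runs, hits) in exercises.items():
        if runs <= 0:
            continue  # nothing recorded for this exercise
        expected = runs * p0                     # binomial mean
        sd = math.sqrt(runs * p0 * (1 - p0))     # binomial standard deviation
        z = (hits - expected) / sd
        if z >= threshold:
            flagged[name] = z
    return flagged


# Usage with invented numbers: 1000 ordinary runs with 20 symptoms (a 2% rate).
results = flag_exercises(
    (1000, 20),
    {"disk stress": (200, 19), "network load": (200, 5), "idle": (200, 3)},
)
for name, z in results.items():
    print(f"{name}: {z:.1f} standard deviations above baseline")
```

With these numbers only "disk stress" is flagged, at about 7.6 standard deviations. That exercise would be the first candidate for a reproduction procedure.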
Intermittents and reproducibles are opposites. Reproducibles can be consistently reproduced by a known procedure; intermittents cannot. It is a mathematical certainty that reproducibles will be traced to their root cause by a person who uses a systematic approach and has sufficient knowledge of the system to devise and interpret conclusive troubleshooting tests. This is not true of intermittents. When confronted with an intermittent, use one or more of these approaches: Ignore it, General maintenance, Turn the intermittent against itself, Convert the intermittent into a reproducible, and Statistical analysis.