Troubleshooters.Com Presents
Troubleshooting Professional Magazine
Volume 9 Issue 2, Spring, 2005
My Favorite Intermittent Stories
|
Copyright (C) 2005 by Steve Litt. All rights
reserved.
Materials from guest authors copyrighted by them and licensed for
perpetual
use to Troubleshooting Professional Magazine. All rights reserved to
the
copyright holder, except for items specifically marked otherwise
(certain
free software source code, GNU/GPL, etc.). All material herein provided
"As-Is".
User assumes all risk and responsibility for any outcome.
|
It is best to do things systematically, since we are only human, and disorder is our worst enemy. -- Hesiod
|
Editor's Desk
By Steve Litt
Want a worthy adversary? Tackle an intermittent. Reproducible problems
are pretty much a no-brainer for a properly trained Troubleshooter, but
intermittents always provide plenty of challenge.
This issue of Troubleshooting Professional tells of some of my most
memorable battles against intermittents. They were memorable
because they were tough. In many cases they made me look bad. In one
case I couldn't solve it -- someone else did.
The more frustrating the intermittent, the more pleasing the solution.
That should make this issue a true pleasure. So kick back, relax,
and remember -- if you're a Troubleshooter, this is your magazine.
It Only Crashes Twice a Day
By Steve Litt
On July 28, 1994 I solved one of the most frustrating and tenacious
intermittents that ever squared off against me. You remember the summer
of 1994 -- All-4-One's "I Swear" and Lisa Loeb's "Stay" topped the
charts. We were gearing up for what would become one of the most boring
and lopsided presidential elections ever staged as Bob Dole tried in
vain to upset Bill Clinton. Of course, all of that was drowned out by
news of OJ Simpson.
For a couple weeks before July 28 I'd heard anecdotes referring to my
application crashing. My application used a 16 port Digiboard to query
16 different title company databases. With no TCP/IP and no hardware or
software flow control, my app had to run around the ports, pick up
messages, send commands, and circulate back through soon enough that the
replies wouldn't fall off the end of a very small ring buffer. The
information obtained by the Digiboard was placed in small intermediate
files, which were processed by a continuously running app that read
them, incorporated them into a database, and deleted them. It worked
beautifully. 99.99% of the time.
A couple weeks before, the company sales manager brought me onsite to
fix this intermittent. We stayed there 4 hours, it happened once, and I
tried a couple things to make it less likely to crash. But as time went
on, it was clear I hadn't fixed it.
On July 28 I was sent on site with a mandate
-- don't leave til it's fixed. Upon arrival, I told the site manager,
who oversaw 30 keypunchers, to stop all work the instant the app
crashed. Sure enough, two hours later it crashed, and the site manager
immediately stopped all work.
It was a Novell network, so I used NetWare's salvage command to bring back
the last 10 intermediate files, marked them read-only so they could not
be deleted, and started up the collection app. Sure enough, it crashed.
Tried it again, it crashed again. I had just converted the intermittent
into a reproducible.
The next step was to isolate which file caused the crash. I fed in 5 files
instead of 10, then 2, and finally isolated it to a single file.
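The halving described above (10 files, then 5, then 2, then 1) can be sketched as a small binary search. This is only an illustration under stated assumptions: exactly one file in the batch triggers the crash, the crash is reproducible once that file is present, and the names isolate_bad_file, crash_test_fn and demo_batch_crashes are hypothetical stand-ins, not anything from the original application.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback: returns nonzero if feeding this batch of files
 * to the collection app reproduces the crash. */
typedef int (*crash_test_fn)(const char *const *files, size_t count);

/* Returns the index of the one file that triggers the crash.
 * Assumes exactly one culprit exists somewhere in files[0..n). */
size_t isolate_bad_file(const char *const *files, size_t n, crash_test_fn crashes)
{
    size_t lo = 0, hi = n;              /* culprit is somewhere in [lo, hi) */
    assert(n > 0);
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (crashes(files + lo, mid - lo))
            hi = mid;                   /* first half reproduces the crash */
        else
            lo = mid;                   /* so the culprit is in the second half */
    }
    return lo;
}

/* Stand-in for a real test run: "crashes" if the batch contains a file
 * named bad.dat. A real test would restart the app on the batch. */
int demo_batch_crashes(const char *const *files, size_t count)
{
    size_t i;
    for (i = 0; i < count; i++) {
        if (strcmp(files[i], "bad.dat") == 0)
            return 1;
    }
    return 0;
}
```

The search never runs the last remaining file by itself, so it is worth feeding that single file to the app once more to confirm the crash, as the story does.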
I ran the collection app in a debugger,
and found that a certain variable was "magically" changing value after
it had been set by the subroutine designed to set it. Even though the
change happened outside the subroutine, I traced it through the
subroutine.
After a couple minutes, I saw that I was passing back a pointer to something
that went out of scope as it exited the subroutine. Yes, the pointer
still pointed to the same stack location containing the information,
but that stack location was now fair game for any other subroutine's
local variables. Only when a record contained an extraordinarily
long string would that part of the stack get overwritten, at which point
the app would crash.
I placed the word static
in front of the char array declaration, so the array no longer lived on
the stack and stayed valid after the subroutine returned. The app never crashed again.
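The pattern behind this bug, and the one-word fix, might look something like the sketch below. The original source isn't shown in this article, so the function names, the field size and the copying logic are all hypothetical. The broken version sits inside #if 0 so it is never compiled. The third function is an alternative that the story did not use, where the caller owns the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FIELD_MAX 64   /* hypothetical field size */

#if 0
/* BROKEN: field lives on the stack and dies when the function returns.
 * The returned pointer still points at the old stack slot, which the
 * next subroutine call is free to reuse for its own locals. */
const char *copy_field_broken(const char *record)
{
    char field[FIELD_MAX];
    strncpy(field, record, FIELD_MAX - 1);
    field[FIELD_MAX - 1] = '\0';
    return field;
}
#endif

/* The fix from the story: static keeps the array alive for the whole
 * run of the program, so the pointer remains valid after return.
 * The trade-off is that every call shares this one buffer. */
const char *copy_field_static(const char *record)
{
    static char field[FIELD_MAX];
    strncpy(field, record, FIELD_MAX - 1);
    field[FIELD_MAX - 1] = '\0';
    return field;
}

/* Alternative (an assumption, not what the story did): the caller
 * supplies the buffer, so nothing is shared between calls. */
char *copy_field_into(char *out, size_t out_size, const char *record)
{
    assert(out != NULL && out_size > 0);
    strncpy(out, record, out_size - 1);
    out[out_size - 1] = '\0';
    return out;
}
```

With the static version, a second call overwrites the result of the first, so a caller that needs both values has to copy the first one out before calling again.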
|
FINAL NOTE:
This intermittent was cured by converting the intermittent to a
reproducible -- a well known anti-intermittent tactic. As for me, I've
made a lot of programming mistakes since then, but I NEVER AGAIN passed
back a pointer to a locally declared array.
|
The Keyboard Assaulted the Computer?
By Steve Litt
"Dad -- my computer won't boot".
So began my gruelling bout with a relentless intermittent.
The cause seemed obvious -- the words "keyboard failure" were on the
screen,
and the computer beeped continuously. The operating system was ruled
out
because it didn't have a chance to load. I reseated the keyboard cable
connector,
rebooted, and the symptom was gone. The keyboard cable connector had
probably
come loose over time -- maybe Brett had kicked it. The problem was fixed and
time marched on.
"Dad -- my computer won't boot".
Days had passed and the symptom cropped up again. This time reseating
it didn't fix it -- every third or fourth reboot the problem cropped up
again.
Looking at the keyboard cable connector, I saw that the large DIN
keyboard
was adapted by a solid DIN to PS/2 adapter, and that adapter and the
mouse
cable connector pushed against each other. I replaced his keyboard with
a genuine PS/2 keyboard and rebooted several times to verify that the
symptom had vanished. Done! I made a note not to use DIN to PS/2 adapters
anymore,
unless they included flexible cable so they wouldn't push on the mouse
cable
connector. Brett's computer was fixed, I went on to other things, and
time
marched on again.
"Dad -- my computer won't boot".
Different keyboard, same symptom. I called the vendor who sold me the
computer.
They
told
me it was probably a bad keyboard connector on the motherboard, and to
bring
it in. Because I couldn't bring it in on that particular day,
I gave Brett a new keyboard,
once
again fixing the problem. Two weeks went by.
"Dad -- my computer won't boot".
That's it. I brought the computer, and two of the offending keyboards
(one
DIN and one PS/2) to my vendor. Amazingly, the symptom happened
instantly
when I booted using my PS/2 keyboard. I cheered. Symptom reproduction
saved
me from withering looks and sympathetic words reserved for those
reporting
strange symptoms that can't be reproduced.
The vendor then began to experiment. He plugged in one of his keyboards
--
no symptom. Not to be outdone, I plugged in my other keyboard, and the
symptom
occurred. He plugged in another one of his keyboards, no symptom.
My vendor then plugged my keyboard into another computer, also sporting
an
Asus board, and the symptom occurred on that computer. Common sense
would
say "defective keyboard", except I had two similarly defective
keyboards
and assured him I had a third at home. I mentioned that each keyboard
took
about a week or 2 to "go bad".
I then said the words that would change the whole focus of the hunt:
"it's
almost like the computer is damaging the keyboards somehow".
We looked at each other and chuckled. That was just too weird to
contemplate.
We agreed the vendor would not change out the motherboard until we
found
the root cause, and left it cooking with one of his keyboards, to see
if
it would "damage" his keyboard. I drove home.
But a question, once vocalized, works on your subconscious. Mix that
with
some opportunities, and the plot can thicken...
The Plot Thickens
My son's computer had an Asus motherboard I bought from my vendor. I
had also
bought a complete computer with a similar Asus board from them a month
earlier.
Both computers displayed a symptom which I would have missed had my
attention
not been drawn by my son's no-boot situation -- a very short beep on
bootup.
Interestingly, the beep was sometimes shorter than others. Sometimes it
was
little more than a click, and once in a while it was a double or triple
click.
The sound reminded me of a stereo with a dirty volume control or tape
monitor
switch. Out from my subconscious came stereo repair general maintenance
--
clean all switches and controls with contact cleaner. Occasionally when
I
ran out of contact cleaner I'd use WD40 -- it worked great and the
switch
or control really performed smoothly. Could my keyboard problem be an
oxidized
keyboard cable connector? Maybe there was galvanic action (electrical
current
caused by dissimilar metals) between the motherboard's keyboard
connector
and the keyboard cable connector, and the galvanic action was causing
corrosion.
If only I hadn't left my computer at the shop, I could have tried WD40
in
the keyboard cable connector.
Then my other Asus equipped computer started exhibiting the same
keyboard
failure noboot symptom, and I jumped for joy.
I sprayed WD40 into my keyboard cable connector, inserted and removed
it
30 times to clean it, and turned on the machine. The symptom still
occurred,
but subjectively it occurred less. Remembering a long ago problem where
a
seemingly defective keyboard turned out to be a mouse problem, I
sprayed
WD40 into the mouse connector, inserted and removed it 30 times, and
turned
on the machine. The machine booted solidly many times in a row.
If this had been a reproducible problem, my work would have been done.
But
with an intermittent, you can't be sure whether the problem disappeared
because
of the WD40, or whether it disappeared just by the luck of the draw. I
needed
info. I described the problem on the mailing list of my local Linux group,
and crossed my fingers. This sounded so crazy I
wouldn't
be surprised if people called me nuts.
A guy named Ozz responded with the most reassuring phrase in the
English
language: "This is actually a known problem". He went on to describe
something
called "fretting corrosion" occurring on tin connectors, and mentioned
that
the AMP connector website contained two white papers, one called "The
Tin
Commandments" and one called "The Golden Rules", describing design and
maintenance
of tin plated and gold plated connectors. URLs for these two papers
are in this magazine's URLs section. Intriguingly, Ozz mentioned that this
is
a problem especially with memory modules. I read both white papers, and
the
pieces began falling into place.
The Wisdom of the Experts
AMP's white papers describe something called "fretting corrosion",
which
happens to all tin plated connectors. It's worst when one of the mating
connectors
is gold and the other is tin, but occurs even when both are tin.
Tin reacts with oxygen and other materials to form a thin layer of
oxide.
This thin layer is not enough to significantly reduce connectivity, but
it's
more than enough to prevent further oxidation or corrosion. That's why
tin
can remain shiny even though it combines quite willingly with the
oxygen
in our atmosphere.
However, when the tin is "plugged into" another connector, any movement
and
vibration chafes away that protective oxide, leaving bare metal which
itself
oxidizes. Over time more and more oxide forms and is chafed away, and
this
excessive oxidation product starts to separate the two connectors.
Resistance
rises, and eventually functional conductivity is lost. "Eventually" can
be
as little as a few hours with excessive vibration.
The AMP white paper went on to say that fretting corrosion can be
minimized
by placing a lubricant between the mating surfaces. The
lubricant
will minimize chafing off of the protective oxide, and thus minimize
production
of new oxide.
IT MADE SENSE!!! Now I understood why this was happening, and why WD 40
stopped
the problem, and why merely reseating the connector produced only a
very
temporary improvement, if any. Now I strongly suspected I had a legitimate
answer to my remark that "something must be damaging the keyboards". That
something
was fretting corrosion.
The WD 40 allowed the machine to boot perfectly for a couple weeks,
then things
went bad again. Subsequent investigation revealed that some, but not
all,
of the keyboards I'd tried had fewer pins than normal keyboard
connectors,
which would certainly make the connection less stable and more subject
to
fretting corrosion. Lubrication plus a connector with the normal number
of
pins corrected the problem long term.
|
FINAL NOTE:
It's been well over a year, and my son's computer boots perfectly,
every time. Following this long, drawn out battle, I now incorporate
electronic contact lubrication in my preventive maintenance techniques.
Over time I abandoned WD 40 and settled on Lube Job Electronics
Lubricant from blowoff.com. Lube Job is designed specifically for
electronic contact lubrication, and unlike some other electronics
lubricants, it's very economical.
|
Relapse: The Persistent Intermittent
By Steve Litt
This intermittent first appeared around the start of November, 2004. It
was solved for good on February 26, 2005. I solved it twice. The first
solution was on November 19, 2004, after which the symptom hibernated
until early February, 2005. This intermittent was costly in terms of
computer productivity, and in terms of my time troubleshooting it.
It happened on my main desktop computer, which is by no means a simple
machine. My desktop has a rather complete Mandrake 10 Linux
installation, including Kmail and Mozilla, neither of which has
impressed me as especially stable.
|
EDITOR'S NOTE
Kmail and Mozilla don't seem stable by Linux standards. In comparison
with the programs I ran under Windows 98 several years ago, they're
rock solid. It should be noted that I rebooted my Windows 98 machine a
half dozen times per day.
|
Hardware wise, this machine had two 200GB hard disks, a DVD reader and
a CD rewriter, an AMD XP 2600+ processor (they run hot), and 1.5GB of 400MHz
DDR memory (the AMD website says I should be using 333MHz memory). It
has one inblowing and one outblowing fan besides the fans on the power
supply, yet it still runs fairly hot. Its onboard LAN has been disabled
in bios, and it's hooked to the local area network through an add-in NIC.
It accesses a scanner and camera through USB, and an HP4050 via a
parallel port.
|
EDITOR'S NOTE
During the 11/2004-2/2005 period the machine actually evolved. By the
end, the 200GB system drive had been replaced with a 250GB.
Somewhere during this time period, the DVD reader and CD rewriter were
replaced by a single DVD+RW drive.
|
In other words, it's not a simple machine.
Somewhere in late October or early November 2004 I noticed that it
would lock up. Sometimes for a few seconds, sometimes for minutes or
permanently. At first these lockups were rare, but as mid November
approached, it happened several times per day. Some of these hangs were
accompanied by a stopping of the clock on the taskbar. These hangs
seemed to happen more frequently when my computer was under stress,
like doing a disk backup or running my prime number generator program.
After a couple days I noticed that these hangs were preceded by a
single click. There would be a click, and then 1-30 seconds later there
would be a hang. At the time it sounded to me like the click came from
the UPS I bought at Sam's Club. I arranged to return the UPS to Sam's
Club. They ran a purchase history for me and gave me 7 days to return
the UPS.
During those 7 days I ran with a different UPS, and indeed, the
frequency of such hangs dropped precipitously. But there were a couple
hangs during those 7 days. On the 7th day I returned the UPS. Yes, I
knew the UPS wasn't the whole story, but subjectively it made the
problem more frequent. I was up against a "use it or lose it" deadline
with the return -- I returned it.
Within a few days I knew the problem was nowhere near solved. It came
back slowly, and by November 15 it was as bad as it had been before the
UPS swap.
I should mention at this point that I had not performed a thorough
investigation on this problem, because to do so would have greatly
impacted my work. I kept hoping either that a lucky guess like the UPS
would fix it, or that the problem would become reproducible. But on
November 19 I finally admitted that this problem was preventing me from
doing my work. I opened the case and went on in.
If you've read my books or taken my course you know that one of my
favorite intermittent busting tactics is "turning the intermittent
against itself". An intermittent's power comes from its continual state
changes. If you can correlate any factor to those state changes,
you've found the root cause -- often without any in-depth knowledge of
the underlying system or technology. You look for such correlations
with manipulation -- physical, thermal, whatever. I went in, and with
the machine running, I wiggled cables and cards.
Some folks say it's irresponsible to wiggle things in a running
computer. You can break something. That's true, but look at my
situation...
I could replace the entire computer, new and fully loaded with
hardware, for less than maybe $800.00. I'd be unlikely to break the
whole computer -- a $100 motherboard or a $150 disk would be more
like it. So then the question arises -- how much extra troubleshooting
time should I spend protecting against a possible $150.00 loss,
especially given that I've NEVER broken a computer with on-line
physical manipulation. Remember, if the computer weren't already
broken, the physical manipulation could not have harmed anything.
I went in and wiggled. One IDE cable seemed to affect the symptom, so I
replaced it. Wiggling some more, the symptom recurred. Finally I saw
that I could trigger the symptom almost at will by wiggling the power
connector to one of the drives. I replaced the power connector (with
the power off), turned it back on, and wiggled some more. Nothing.
Nada. I banged everything really hard. Nothing. I ran my prime number
program to stress the system. Nothing. I performed a disk backup, which
really stresses the system. Nothing. It was fixed. I taped the bad
connector shut, labeled it BAD, and buttoned everything up. It was
fixed. Time marched on...
Must have been late January or early February I heard a click. A little
sound -- a harmless sound -- most would have missed it. But to me it
was the most ominous sound in the world -- the sound my hard disk made
back in the days of intermittence. Nothing happened, time went on. I
started hearing the click more often. Then one day the computer hung,
just like the bad old days. Suspecting hard disks, I sought to find
which of the two was bad. A hard disk test utility called smartctl
indicated that my system disk had problems, but my data disk was OK.
Booting the Knoppix Linux-on-a-disk, the data disk's partitions could
be mounted, but not the system disk's partitions. I bought a new disk
and started a backup. THE SYSTEM COULD NOT BACK UP!
I booted Knoppix, archived the data partitions, and used sftp to
transfer the newly made archives to a different computer. I replaced
the system disk on 2/26/2005, fired it up, and it worked perfectly.
It's been working perfectly since then (this is being written about a
month later). This computer is now one of the most stable I've worked
with -- Kmail and Mozilla never hang.
What happened?
What happened? It's a fair question. How could an intermittent be cured,
and then pop up again two months later?
Obviously this disk was fitted with a bad power supply connector back in
October/November. The bad power supply connector turned the disk off and on
several times a day, without benefit of any kind of hardware or software
"orderly shutdown". It's likely that the power cycles to the disk were not
clean, but instead were the spiky type of power cycles you get with a loose
connection. My theory is that the power cycling of the disk in November did
irreversible damage to the drive and gravely shortened its life.
Now ready to fail, the hard drive could not long sustain the constant
demands of a daily driver computer. In February the disk started
cutting in and out. It was proven bad and replaced. There's no reason
to anticipate further consequential damage.
|
FINAL NOTE:
In both cases of intermittence, my initial intermittent busting tactic
was to ignore the problem, because on systems where safety isn't an
issue, living with an infrequent intermittent is more cost effective
than troubleshooting it. Once it got more frequent, I used physical
manipulation to turn the intermittent against itself.
With the second occurrence I once again ignored it until it could be
ignored no longer, then I used tools (smartctl, Knoppix) to verify the
bad part, and replaced it.
|
Bios Defeats Litt
By Steve Litt
1990. My super-wonderful timesheet program was used by a large law
firm. A third of a billion dollars passed through that program
annually. But sometimes the data input facility hung, locking up the
whole computer. I managed to find out that the hang always occurred when the
program tried to turn numlocks on during input to a numeric field.
I managed to narrow it down to certain computers. Some computers
displayed this problem, some didn't. It didn't depend on who was logged in
or whose data was being input -- certain computers displayed the
tendency to hang and the rest never hung.
Days became weeks became months and I could never find a root cause.
Finally my co-worker Ken examined the "bad" computers, found they all
had the same bios version, and none of the "good" computers had that
bios version. He did a little research on that bios version and
discovered that it did not allow numlocks to be toggled from
software. I disabled the auto numlocks feature on the timesheet front
end and the problem stopped. A year later the old computers with the
bad bios were sent to the glue factory, and I reenabled the numlocks
feature. No further problems occurred.
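Ken's approach boils down to a question: is there one attribute that every "bad" machine has and no "good" machine has? A minimal sketch of that check follows, assuming each machine is recorded with a bios version label and a flag for whether it hangs. The struct, the function name and the version labels used with it are hypothetical, since the article does not give the real ones.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct machine {
    const char *bios_version;  /* hypothetical version label */
    int hangs;                 /* nonzero if this machine shows the symptom */
};

/* Returns 1 if `version` separates the fleet perfectly: every machine
 * with that version hangs, and no machine with another version does.
 * Returns 0 on any counterexample, or if no machine has that version. */
int version_explains_hangs(const struct machine *fleet, size_t n,
                           const char *version)
{
    size_t i;
    int seen = 0;
    assert(version != NULL);
    for (i = 0; i < n; i++) {
        int has_version = strcmp(fleet[i].bios_version, version) == 0;
        int hangs = fleet[i].hangs != 0;
        if (has_version)
            seen = 1;
        if (has_version != hangs)
            return 0;          /* one counterexample breaks the correlation */
    }
    return seen;
}
```

A perfect split like this is a strong lead rather than proof. In the story, the research into what that bios version could not do is what confirmed it.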
|
FINAL NOTE:
Nobody's perfect -- not even the author of the 1990 classic,
"Troubleshooting: Tools, Tips and Techniques". Even great
troubleshooters fail.
In that instance, and in many others, Ken displayed uncommon
troubleshooting ability, often doing so through concise process and
research.
|
The Switch that Would Not Stay Rebooted
By Steve Litt
My main computer kept losing its connection to the Internet, and in
fact to all of my LAN. I rebooted the switch it was hooked to (the main
switch next to my IPCop gateway), and BANG, the connection was
restored. An hour later I lost my connection again, and again rebooting
the main switch fixed it. Because I was researching the grub bootloader at the time, I
rebooted my computer quite often. Every time I rebooted it, I had to
reboot the main switch too.
After an hour this became too much. I stopped working and began
troubleshooting. My first step was to look at other computers on the
LAN. My experimental computer plugs into a Linksys switch whose uplink
port connects to the main switch. The experimental computer couldn't
see the LAN either. I rebooted the Linksys switch and both
the experimental computer and the main computer saw the network. Hours
passed with no further problems. Days passed with no further problems.
It was fixed.
I'm no network expert, but here's what I think happened. The Linksys
switch got itself into a bad state and was sending out bad packets.
These bad packets would in turn put the main switch into a bad state,
either as a function of time, or when my main computer was rebooted --
whichever came first. I cannot imagine why rebooting the main computer
would trigger the symptom, but it did.
Rebooting the Linksys switch got it out of its bad state and eliminated
its broadcast of bad packets, thus solving the problem.
|
FINAL NOTE:
Why is this story in a magazine about intermittents? It was a hard
failure of the Linksys switch!
The definition of an intermittent is a problem for which there is no known reproduction procedure.
When I first encountered this problem, all I knew was that my computer
dropped its LAN connection "every once in a while". There was no known
way to make it happen. Later I found it could be reproduced by booting
the computer, but by that time I had already applied corrective
(general) maintenance -- rebooting switches and reconnecting network
cables with electronic lubricant.
The breakthrough came when I realized the switch I rebooted was more of
a symptom than a cause -- the cause was the Linksys switch.
|
Sylvia's Bucking Buick
By Steve Litt
Sylvia told me that her car was bucking. On further questioning, the
symptom sounded a lot like what would happen on a car with a dirty fuel
filter, but when I tried to reproduce it, I couldn't. Finally, while
driving home from a restaurant one night a few weeks later, I felt it
bucking. It bucked at 45 miles per hour. It was the same sort of
temporary power loss you'd experience from a clogged fuel filter.
I asked Sylvia how long it had been since she had her fuel filter
changed, how long since her last tuneup, and how old her air filter
was. It was over 2 years on all three. I took it to a tire-n-tune to
get it tuned up and the filters replaced. After the service, I picked
up the car, smug in the belief that the problem was over. Within a mile
it started bucking.
Now I had a problem. The root cause could be almost anything. Well,
anything except the three most likely suspects -- fuel filter, air
filter and tuneup. I had someone check the fuel pressure. It was on the
low end of normal -- shouldn't be a problem.
I went to one auto tech I use sometimes and had him drive the car. He
reproduced the symptom, and felt the problem was in the transmission,
which was something he didn't service.
I took the car to Cool Shift transmission, where Fred drove the car and
declared the transmission in fine working order.
I had a perfectly
serviced car that just happened to buck at 45 miles per hour. Ughhh.
You might wonder at this point why I didn't just bring it to a shop and
have them figure out what was wrong. The problem is, shops are even
worse at intermittents than I am. They don't have the time to reproduce
the symptom. Bringing it to a shop at this point probably would have
resulted in lots of replaced parts and no symptom cessation. I was
determined to wait until either the problem became constant, or I found
a reproduction sequence.
We continued to drive the car, hoping it wouldn't stall on the freeway.
One cold morning I drove to Valencia College. After signing up for a
course, I started pulling out of their parking lot. Shivering, I turned
on the heater. The car started bucking. Hmmm!
I turned off the heater, and the car quit bucking. On -- bucked. Off --
didn't buck.
Was it drain on the electrical system, or something else? I turned on
the lights, and it bucked slightly. Turned em off and it stopped. On
-- bucked. Off -- stopped. I turned on both the heater and the lights,
and turned on the brights for good measure. The car bucked so badly it
almost stalled. Turned them all off, the car rode smoothly.
G o t c h a S u c k e r !
I had a theory it was the electrical system. But more importantly I had a
reproduction procedure. Cool Shift transmission was right around the
corner so I went there and had Fred measure the voltage. 14.0 volts --
a little lower than I'd like to see it, but nowhere near indicating a
problem. At home I re-measured it with my own voltmeter -- 14.0. I took
it to Batteries Plus, whose tests showed both the battery and the
alternator to be functioning properly.
I started formulating all sorts of weird theories. Perhaps there was a
resistive connection between the alternator and the ignition system
such that even though the battery voltage read 14.0, where the ignition
got its power it was more like 10 volts when the lights and heater were
turned on.
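The hypothesis amounts to an Ohm's law voltage drop across the suspect connection. As a worked sketch, take the 14.0 volt battery reading and the "more like 10 volts" guess from the text. The current I through the connection is an assumed figure chosen only for illustration, not a measurement.

```latex
V_{\text{ignition}} = V_{\text{battery}} - I\,R_{\text{connection}}
\quad\Longrightarrow\quad
R_{\text{connection}}
  = \frac{V_{\text{battery}} - V_{\text{ignition}}}{I}
  = \frac{14.0\,\text{V} - 10\,\text{V}}{20\,\text{A}}
  = 0.2\,\Omega
```

Under that assumption, a fraction of an ohm would be enough to cause the drop, and the drop would grow as the lights and heater pulled more current.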

My resistive connection hypothesis would have explained it, but as it
turns out, the real root cause was a whole lot weirder. Read on...
I looked in the car in search of a loose electrical connection and
found none. It's not surprising -- a 1987 Buick Century with a 3.8
liter engine is packed pretty tight. Armed with a reproduction
procedure, I headed to the best diagnosticians I know:
Zych's Certified Auto Services
on 436 in Altamonte Springs, Florida.
Arriving at Zych's, I gave them a full, written symptom description,
told them verbally about turning on the lights and heater, and then
insisted they reproduce the symptom while I was still there. They
reproduced it easily; I left the car.
Zych's is a lightning fast shop, but the car was there for two full
days -- a record. Every time I called they were still diagnosing it --
it wasn't an easy problem. Finally they called me and said it was fixed
-- a bad alternator!
Jim Zych, the owner, said he initially ruled out the alternator based
on the nice 14.0 volt reading at the battery, but after he had tried
absolutely everything else that could cause it, he swapped in a known
good alternator, and sure enough, the symptom vanished. There was
something about the old alternator, having nothing to do with the DC
output, that was causing the ignition system to malfunction. I paid him
for alternator, installation, and a diagnostic fee, and went home
knowing I'd gotten great service for a very reasonable price.
I think I know what was wrong with that alternator, and how I could
have diagnosed it myself. I'll bet you dollars to donuts the alternator
was putting out all sorts of spiky AC that was interfering with the
electronic ignition system or the computer system that drove it. I'll
bet if I'd placed an oscilloscope across the battery instead of a
voltmeter, the problem would have been immediately obvious. I'll bet if
I'd placed a capacitor across the alternator the problem would have
vanished. But of course hindsight is 20/20. The real point of the story
is that one of the trickiest problems I've ever seen in any machine or
system got solved.
|
FINAL NOTE:
Of all the possible outcomes, this outcome was the best that could be
hoped for. A typical, and much less desirable outcome, would go
something like this:
Upon discovery of the hesitation, the driver would immediately bring
the car to his "local mechanic", who, having absolutely no idea how to
reproduce the symptom, would begin a long, costly course of diagnosis
by serial replacement. A couple weeks and a couple thousand dollars
later the problem might or might not be fixed.
In my case, the driver understood that the chance of solution was slim
to none unless he found a symptom reproduction procedure. He spent a
week or two driving the car and looking for a reproduction procedure.
Once he found it, he understood the difficulty of the problem and
brought it not just to anyone, but to the best diagnosticians available.
Armed with a consistent reproduction procedure, the shop found
the root cause and fixed the problem. Everyone played their part just
right.
|
The Power Supply Event
By Steve Litt
This happened about a week ago. I was merrily working away when my
computer shut off. Powered down. At first I figured the power went out
-- our local power company is surprisingly unreliable, at least for a
power company in a developed nation. But then I remembered my computer
was on an uninterruptible power supply. The lights in the house were
still on. Oh Oh!
Repeated presses of the power button did nothing. I plugged a
known good power supply into the mobo, and the machine counted memory.
Good, probably my power supply had gone bad.
Just to be sure, I plugged the old power supply back into the
motherboard. It counted memory and booted up. Oh Oh!
When you can toggle a symptom by repeatedly replacing and restoring a
part, it's a reproducible problem. When you replace the part and it
works, and then restore the original and it still works, you have an
event -- the sparsest type of intermittent.
Leaving the old power supply plugged in, I buttoned the computer back
up. If the same symptom happens even one more time, I'll perform the
following tests:
1. See how hot the power supply is, and if hot, blow on it with the
blowing part of a shop vac for several minutes
2. Wiggle the power supply to mobo connection, and see if the
symptom goes away
3. Make absolutely sure the computer is getting power
4. Turn off and then back on the switch on the power supply itself
5. Disconnect and reconnect the power supply to mobo connection
6. Swap in a known good power supply
If #1 fixes it, the power supply is overheating and shutting down.
Investigate why. Do the fans turn ok?
If #2 fixes it, there's a loose connection in the power supply lead or
the mobo, and I'll need to find out which.
If #3 fixes it, find out why the 120V connection to the computer wasn't
working.
If #4 fixes it, either the computer or the power supply got itself into
some illegal state requiring a power cycle.
If #5 fixes it, it's either a loose connection or a weird illegal state.
If #6 fixes it when the others didn't, I'll assume the power supply is
bad and replace it.
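The test plan and its interpretations amount to an ordered lookup: run the tests in sequence and stop at the first one that clears the symptom. A rough Python sketch follows. The names `TEST_PLAN` and `interpret`, and the `symptom_cleared_by` callback, are hypothetical, and the conclusions are paraphrased from the list above.

```python
# (test, conclusion) pairs in the order the tests are to be tried.
TEST_PLAN = [
    ("cool the power supply with a shop vac",
     "power supply overheats and shuts down; check the fans"),
    ("wiggle the power supply to mobo connection",
     "loose connection in the power supply lead or the mobo"),
    ("make sure the computer is getting power",
     "120V connection to the computer wasn't working"),
    ("toggle the switch on the power supply",
     "computer or power supply was in an illegal state"),
    ("disconnect and reconnect the power supply to mobo connection",
     "loose connection or illegal state"),
    ("swap in a known good power supply",
     "assume the power supply is bad and replace it"),
]


def interpret(symptom_cleared_by):
    """Return (test number, conclusion) for the first test that fixes it.

    symptom_cleared_by is a hypothetical callback: given a 1-based test
    number, it returns True if the symptom went away after that test.
    """
    for number, (_test, conclusion) in enumerate(TEST_PLAN, start=1):
        if symptom_cleared_by(number):
            return number, conclusion
    return None, "none of the tests cleared the symptom"


# Example: pretend only test #4 brings the machine back.
number, conclusion = interpret(lambda n: n == 4)
print(number, conclusion)
```

Stopping at the first fix matters here: #6 only points at the power supply "when the others didn't", which is why the swap comes last in the plan.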
|
FINAL NOTE:
Extensive experience made me retest the original power supply. At each
stage of the troubleshoot, I constructed diagnostic tests to be quick.
Therefore, when it came time to "swap" the power supply, I merely
unplugged the power supply's connector from the motherboard, and
replaced it with that of a known good power supply. I neither unmounted
the old power supply from the case nor mounted the new one on it.
When it counted memory after the temporary replacement, it looked
mightily like a simple case of bad power supply. But strange things
happen in troubleshooting, so just to make sure I restored the original
power supply connection. The symptom did NOT reappear, meaning there
was something odd going on.
There is not, at this moment, sufficient evidence to replace the power
supply. The root cause of this problem could be outside the power
supply, in which case the symptom would recur later (after I'd spent
$100.00 on a new high quality, 2 fan power supply).
So far, this mishap was an event -- the sparsest of all sparse
intermittents. It happened once and never again. If this were a safety
critical system, I'd need to take it offline and perform extensive
tests to ascertain the root cause. However, because it's not safety
critical, by far the best course of action is to continue using it.
Either the event will never recur, or it will recur (hopefully more
often) and I'll be able to troubleshoot it to a root cause.
|
Are We Having Fun Yet?
By Steve Litt
The stories in this Troubleshooting Professional Magazine don't exactly
portray me in the best possible light, do they?
There's an interesting paradox in the life of a troubleshooter. By
honing our divide-and-conquer skills, which work perfectly on
reproducible problems, we gain stellar reputations as troubleshooters.
Based on those stellar reputations, we're assigned the toughest
problems -- intermittents -- for which our divide-and-conquer skills
are only marginally effective.
In other words, the longer we've been in the business, the more we
understand how nasty intermittent problems can be. Intermittents
occasionally make even the best troubleshooter look like a buffoon. The
thing that separates the ninja troubleshooter from the hack is the
customer/user perception of the handling:
- Is the customer made aware that it's an intermittent, and kept in
the loop?
- Must the customer/user put up with and pay for diagnosis by
serial replacement?
- Is good logic used in determining when to ignore the
intermittent, and when to troubleshoot it?
- Does the troubleshooter endanger people? Equipment? Business
productivity?
- Is the troubleshooter aware of the full range of intermittent
busting tactics?
- Does the troubleshooter logically pick the safest, most accurate
and most cost effective intermittent busting tactic?
Every troubleshooter needs to read and think about intermittents. Not
just once, but frequently. Here are some resources explaining
intermittent busting tactics:
Letters to the Editor
All letters become the property of the publisher (Steve Litt), and may
be edited for clarity or brevity. We especially welcome additions,
clarifications, corrections or flames from vendors whose products have
been reviewed in this magazine. We reserve the right to not publish
letters we deem in bad taste (bad language, obscenity, hate, lewd,
violence, etc.).
Submit letters to the editor to Steve Litt's email address, and be sure
the subject reads "Letter to the Editor". We regret that we cannot
return your letter, so please make a copy of it for future reference.
How to Submit an Article
We anticipate two to five articles per issue, with issues coming out
monthly. We look for articles that pertain to the Troubleshooting
Process, or articles on tools, equipment or systems with a
Troubleshooting slant. This can be done as an essay, with humor, with a
case study, or some other literary device. A Troubleshooting poem would
be nice. Submissions may mention a specific product, but must be useful
without the purchase of that product. Content must greatly overpower
advertising. Submissions should be between 250 and 2000 words long.
Any article submitted to Troubleshooting Professional Magazine must be
licensed with the Open Publication License, which you can view at
http://opencontent.org/openpub/. At your option you may elect the
option to prohibit substantive modifications. However, in order to
publish your article in Troubleshooting Professional Magazine, you must
decline the option to prohibit commercial use, because Troubleshooting
Professional Magazine is a commercial publication. Obviously, you must
be the copyright holder and must be legally able to so license the
article. We do not currently pay for articles.
Troubleshooters.Com reserves the right to edit any submission for
clarity or brevity, within the scope of the Open Publication License.
If you elect to prohibit substantive modifications, we may elect to
place editor's notes outside of your material, or reject the
submission, or send it back for modification.
Any published article will include a two sentence description of the
author, a hypertext link to his or her email, and a phone number if
desired. Upon request, we will include a hypertext link, at the end of
the magazine issue, to the author's website, providing that website
meets the Troubleshooters.Com criteria for links and that the author's
website first links to Troubleshooters.Com. Authors: please understand
we can't place hyperlinks inside articles. If we did, only the first
article would be read, and we can't place every article first.
Submissions should be emailed to Steve Litt's email address, with
subject line Article Submission. The first paragraph of your message
should read as follows (unless other arrangements are previously made
in writing):
Copyright (c) 2001 by <your name>. This material may be distributed
only subject to the terms and conditions set forth in the Open
Publication License, version Draft v1.0, 8 June 1999 (Available at
http://www.troubleshooters.com/openpub04.txt/ (wordwrapped for
readability at http://www.troubleshooters.com/openpub04_wrapped.txt).
The latest version is presently available at
http://www.opencontent.org/openpub/).
Open Publication License Option A [ is | is not] elected, so this
document [may | may not] be modified. Option B is not elected, so this
material may be published for commercial purposes.
After that paragraph, write the title, text of the article, and a two
sentence description of the author.
Why not Draft v1.0, 8 June 1999 OR LATER
The Open Publication License recommends using the words "or later" to
describe the version of the license. That is unacceptable for
Troubleshooting Professional Magazine because we do not know the
provisions of that newer version, so it makes no sense to commit to it.
We all hope later versions will be better, but there's always a chance
that leadership will change. We cannot take the chance that the
disclaimer of warranty will be dropped in a later version.
Trademarks
All trademarks are the property of their respective owners.
Troubleshooters.Com (R) is a registered trademark of Steve Litt.
URLs Mentioned in this Issue