Troubleshooters.Com Presents

Troubleshooting Professional Magazine

 
Volume 9 Issue 2, Spring, 2005
My favorite Intermittent Stories
Copyright (C) 2005 by Steve Litt. All rights reserved. Materials from guest authors copyrighted by them and licensed for perpetual use to Troubleshooting Professional Magazine. All rights reserved to the copyright holder, except for items specifically marked otherwise (certain free software source code, GNU/GPL, etc.). All material herein provided "As-Is". User assumes all risk and responsibility for any outcome.


Steve Litt is the author of the Universal Troubleshooting Process Courseware,
which can be presented either by Steve or by your own trainers.

He is also the author of Troubleshooting Techniques of the Successful Technologist,
Rapid Learning: Secret Weapon of the Successful Technologist, and Samba Unleashed.




 
It is best to do things systematically, since we are only human, and disorder is our worst enemy. -- Hesiod


Editor's Desk

By Steve Litt
Want a worthy adversary? Tackle an intermittent. Reproducible problems are pretty much a no-brainer for a properly trained Troubleshooter, but intermittents always provide plenty of challenge.

This issue of Troubleshooting Professional tells of some of my most memorable battles against intermittents. They were memorable because they were tough. In many cases they made me look bad. In one case I couldn't solve it -- someone else did.

The more frustrating the intermittent, the more pleasing the solution. That should make this issue a true pleasure. So kick back, relax, and remember -- if you're a Troubleshooter, this is your magazine.
Steve Litt is the author of "Troubleshooting Techniques of the Successful Technologist".  Steve can be reached at Steve Litt's email address.

It Only Crashes Twice a Day

By Steve Litt
On July 28, 1994 I solved one of the most frustrating and tenacious intermittents that ever squared off against me. You remember the summer of 1994 -- All-4-One's "I Swear" and Lisa Loeb's "Stay" topped the charts. We were gearing up for what would become one of the most boring and lopsided presidential elections ever staged as Bob Dole tried in vain to upset Bill Clinton. Of course, all of that was drowned out by news of OJ Simpson.

For a couple weeks before July 28 I'd heard anecdotes referring to my application crashing. My application used a 16 port Digiboard to query 16 different title company databases. With no TCP/IP and no hardware or software flow control, my app had to run around the ports, pick up messages, send commands, and circulate back through before the replies fell off the end of a very small ring buffer. The information obtained by the Digiboard was placed in small intermediate files, which were processed by a continuously running app that read them, incorporated them into a database, and deleted them. It worked beautifully. 99.99% of the time.

A couple weeks before, the company sales manager brought me onsite to fix this intermittent. We stayed there 4 hours, it happened once, and I tried a couple things to make it less likely to crash. But as time went on, it was clear I hadn't fixed it.

On July 28 I was sent on site with a mandate -- don't leave til it's fixed. Upon arrival, I told the site manager, who oversaw 30 keypunchers, to stop all work the instant the app crashed. Sure enough, two hours later it crashed, and the site manager immediately stopped all work.

It was a Novell network, so I used Netware's salvage command to bring back the last 10 intermediate files, marked them read-only so they could not be deleted, and started up the collection app. Sure enough, it crashed. Tried it again, it crashed again. I had just converted the intermittent into a reproducible.

The next step was to isolate which file caused the crash. I fed in 5 files instead of 10, then 2, and finally isolated it to a single file. I ran the collection app in a debugger, and found that a certain variable was "magically" changing value after it had been set by the subroutine designed to set it. Even though the change happened outside the subroutine, I traced it through the subroutine.
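
The narrowing-down step can be sketched in C as a halving search. This is an illustration only, not the tooling used on site: the function names, the file names, and the demo_crashes stand-in (which takes the place of actually running the collection app on a batch) are all hypothetical, and the sketch assumes that a single file triggers the crash by itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical hook: returns nonzero if feeding this batch of files
 * to the app makes it crash. */
typedef int (*crash_test_fn)(const char *const *files, size_t count);

/* Halve the batch until one file remains.
 * Assumption: exactly one file in the batch causes the crash on its own. */
const char *isolate_bad_file(const char *const *files, size_t count,
                             crash_test_fn crashes)
{
    assert(files != NULL && count > 0 && crashes != NULL);
    while (count > 1) {
        size_t half = count / 2;
        if (crashes(files, half)) {
            count = half;       /* culprit is in the first half */
        } else {
            files += half;      /* otherwise it is in the second half */
            count -= half;
        }
    }
    return files[0];
}

/* Stand-in for running the real app (assumption for the demo only):
 * a batch "crashes" if it contains the file named in demo_culprit. */
static const char *demo_culprit = "rec07.tmp";

static int demo_crashes(const char *const *files, size_t count)
{
    size_t i;
    for (i = 0; i < count; i++)
        if (strcmp(files[i], demo_culprit) == 0)
            return 1;
    return 0;
}
```

With ten salvaged files, the search needs at most four trial runs, which is about what the 10, 5, 2, 1 sequence in the story took. If more than one file can crash the app, the search still lands on a bad file, but the remaining files need a second pass.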

After a couple minutes, I saw that I was passing a pointer to something that went out of scope as it exited the subroutine. Yes, the pointer still pointed to the same stack location containing the information, but that stack location was now fair game for any other subroutine's local variables. Only when a record contained an extraordinarily long string would that part of the stack be overwritten, at which time the app would crash.

I placed the word static in front of the char array declaration to prevent it from going out of scope. The app never crashed again.
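
The bug and the one-word fix look roughly like the following C sketch. The function names, the comma-delimited record layout, and the buffer size are hypothetical; only the pattern (returning a pointer to a local array, then adding static) comes from the story. The broken version sits inside #if 0 so the block compiles cleanly.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FIELD_MAX 64   /* hypothetical buffer size */

#if 0
/* BROKEN: buf is an automatic (stack) array. The returned pointer
 * dangles as soon as the function returns, and any later call may
 * reuse that stack space for its own locals. Undefined behavior. */
const char *first_field_broken(const char *record)
{
    char buf[FIELD_MAX];
    size_t n = strcspn(record, ",");
    if (n >= sizeof buf)
        n = sizeof buf - 1;
    memcpy(buf, record, n);
    buf[n] = '\0';
    return buf;
}
#endif

/* FIXED as in the story: static gives buf a lifetime that lasts for
 * the whole program run, so the returned pointer stays valid. */
const char *first_field(const char *record)
{
    static char buf[FIELD_MAX];
    size_t n;

    assert(record != NULL);
    n = strcspn(record, ",");      /* length of the first field */
    if (n >= sizeof buf)
        n = sizeof buf - 1;        /* truncate instead of overflowing */
    memcpy(buf, record, n);
    buf[n] = '\0';
    return buf;
}
```

The trade-off is that every call shares one buffer: a second call overwrites the result of the first, and the function is not safe to call from several threads at once. Having the caller pass in its own buffer avoids both limits.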

FINAL NOTE:

This intermittent was cured by converting the intermittent to a reproducible -- a well known anti-intermittent tactic. As for me, I've made a lot of programming mistakes since then, but I NEVER AGAIN passed back a pointer to a locally declared array.

Steve Litt is the creator of the Universal Troubleshooting Process.  Steve can be reached at Steve Litt's email address.

The Keyboard Assaulted the Computer?

By Steve Litt
"Dad -- my computer won't boot".

So began my gruelling bout with a relentless intermittent.

The cause seemed obvious -- the words "keyboard failure" were on the screen, and the computer beeped continuously. The operating system was ruled out because it didn't have a chance to load. I reseated the keyboard cable connector, rebooted, and the symptom was gone. The keyboard cable connector had probably come loose over time -- maybe Brett had kicked it. The problem was fixed and time marched on.

"Dad -- my computer won't boot".

Days had passed and the symptom cropped up again. This time reseating it didn't fix it -- every third or fourth reboot the problem cropped up again. Looking at the keyboard cable connector, I saw that the large DIN keyboard connector was adapted by a solid DIN to PS/2 adapter, and that adapter and the mouse cable connector pushed against each other. I replaced his keyboard with a genuine PS/2 keyboard and rebooted several times to verify that the symptom had vanished. Done! I made a note not to use DIN to PS/2 adapters anymore, unless they included flexible cable so they wouldn't push on the mouse cable connector. Brett's computer was fixed, I went on to other things, and time marched on again.

"Dad -- my computer won't boot".

Different keyboard, same symptom. I called the vendor who sold me the computer. They told me it was probably a bad keyboard connector on the motherboard, and to bring it in. Because I couldn't bring it in on that particular day, I gave Brett a new keyboard, once again fixing the problem. Two weeks went by.

"Dad -- my computer won't boot".

That's it. I brought the computer, and two of the offending keyboards (one DIN and one PS/2) to my vendor. Amazingly, the symptom happened instantly when I booted using my PS/2 keyboard. I cheered. Symptom reproduction saved me from withering looks and sympathetic words reserved for those reporting strange symptoms that can't be reproduced.

The vendor then began to experiment. He plugged in one of his keyboards -- no symptom. Not to be outdone, I plugged in my other keyboard, and the symptom occurred. He plugged in another one of his keyboards, no symptom.

My vendor then plugged my keyboard into another computer, also sporting an Asus board, and the symptom occurred on that computer. Common sense would say "defective keyboard", except I had two similarly defective keyboards and assured him I had a third at home. I mentioned that each keyboard took about a week or 2 to "go bad".

I then said the words that would change the whole focus of the hunt: "it's almost like the computer is damaging the keyboards somehow".

We looked at each other and chuckled. That was just too weird to contemplate.

We agreed the vendor would not change out the motherboard until we found the root cause, and left it cooking with one of his keyboards, to see if it would "damage" his keyboard. I drove home.

But a question, once vocalized, works on your subconscious. Mix that with some opportunities, and the plot can thicken...

The Plot Thickens

My son's computer had an Asus motherboard I bought from my vendor. I had also bought a complete computer with a similar Asus board from them a month earlier. Both computers displayed a symptom which I would have missed had my attention not been drawn by my son's no-boot situation -- a very short beep on bootup. Interestingly, the beep was sometimes shorter than others. Sometimes it was little more than a click, and once in a while it was a double or triple click.

The sound reminded me of a stereo with a dirty volume control or tape monitor switch. Out from my subconscious came stereo repair general maintenance -- clean all switches and controls with contact cleaner. Occasionally when I ran out of contact cleaner I'd use WD40 -- it worked great and the switch or control really performed smoothly. Could my keyboard problem be an oxidized keyboard cable connector? Maybe there was galvanic action (electrical current caused by dissimilar metals) between the motherboard's keyboard connector and the keyboard cable connector, and the galvanic action was causing corrosion. If only I hadn't left my computer at the shop, I could have tried WD40 in the keyboard cable connector.

Then my other Asus equipped computer started exhibiting the same keyboard failure no-boot symptom, and I jumped for joy.

I sprayed WD40 into my keyboard cable connector, inserted and removed it 30 times to clean it, and turned on the machine. The symptom still occurred, but subjectively it occurred less. Remembering a long ago problem where a seemingly defective keyboard turned out to be a mouse problem, I sprayed WD40 into the mouse connector, inserted and removed it 30 times, and turned on the machine. The machine booted solidly many times in a row.

If this had been a reproducible problem, my work would have been done. But with an intermittent, you can't be sure whether the problem disappeared because of the WD40, or whether it disappeared just by the luck of the draw. I needed info. I described the problem on the mailing list of my local Linux group, and crossed my fingers. This sounded so crazy I wouldn't have been surprised if people called me nuts.

A guy named Ozz responded with the most reassuring phrase in the English language: "This is actually a known problem". He went on to describe something called "fretting corrosion" occurring on tin connectors, and mentioned that the AMP connector website contained two white papers, one called "The Tin Commandments" and one called "The Golden Rules", describing design and maintenance of tin plated and gold plated connectors. URL's for these two papers are in this magazine's URL's section. Intriguingly, Ozz mentioned that this is a problem especially with memory modules. I read both white papers, and the pieces began falling into place.

The Wisdom of the Experts

AMP's white papers describe something called "fretting corrosion", which happens to all tin plated connectors. It's worst when one of the mating connectors is gold and the other is tin, but occurs even when both are tin.

Tin reacts with oxygen and other materials to form a thin layer of oxide. This thin layer is not enough to significantly reduce connectivity, but it's more than enough to prevent further oxidation or corrosion. That's why tin can remain shiny even though it combines quite willingly with the oxygen in our atmosphere.

However, when the tin is "plugged into" another connector, any movement and vibration chafes away that protective oxide, leaving bare metal which itself oxidizes. Over time more and more oxide forms and is chafed away, and this excessive oxidation product starts to separate the two connectors. Resistance rises, and eventually functional conductivity is lost. "Eventually" can be as little as a few hours with excessive vibration.

The AMP white paper went on to say that fretting corrosion can be minimized by placing a lubricant between the mating surfaces. The lubricant will minimize chafing off of the protective oxide, and thus minimize production of new oxide.

IT MADE SENSE!!! Now I understood why this was happening, why WD 40 stopped the problem, and why merely reseating the connector produced only a very temporary improvement, if any. Now I strongly suspected I had a legitimate answer to my remark that "something must be damaging the keyboards". That something was fretting corrosion.

The WD 40 allowed the machine to boot perfectly for a couple weeks, then things went bad again. Subsequent investigation revealed that some, but not all, of the keyboards I'd tried had fewer pins than normal keyboard connectors, which would certainly make the connection less stable and more subject to fretting corrosion. Lubrication plus a connector with the normal number of pins corrected the problem long term.



FINAL NOTE:

It's been well over a year, and my son's computer boots perfectly, every time. Following this long, drawn out battle, I now incorporate electronic contact lubrication in my preventive maintenance techniques. Over time I abandoned WD 40 and settled on Lube Job Electronics Lubricant from blowoff.com. Lube Job is designed specifically for electronic contact lubrication, and unlike some other electronics lubricants, it's very economical.

Steve Litt is the author of the Universal Troubleshooting Process courseware.  Steve can be reached at Steve Litt's email address.

Relapse: The Persistent Intermittent

By Steve Litt
This intermittent first appeared around the start of November, 2004. It was solved for good on February 26, 2005. I solved it twice. The first solution was on November 19, 2004, after which the symptom hibernated until early February, 2005. This intermittent was costly in terms of computer productivity, and in terms of my time troubleshooting it.

It happened on my main desktop computer, which is by no means a simple machine. My desktop has a rather complete Mandrake 10 Linux installation, including Kmail and Mozilla, neither of which has impressed me as especially stable.

EDITOR'S NOTE

Kmail and Mozilla don't seem stable by Linux standards. In comparison with the programs I ran under Windows 98 several years ago, they're rock solid. It should be noted that I rebooted my Windows 98 machine a half dozen times per day.

Hardware wise, this machine had two 200GB hard disks, a DVD reader and a CD rewriter, an AMD XP 2600+ processor (they run hot), and 1.5GB of 400MHz DDR memory (the AMD website says I should be using 333MHz memory). It has one inblowing and one outblowing fan besides the fans on the power supply, yet it still runs fairly hot. Its onboard LAN has been disabled in bios, and it's hooked to the local area network through an IDE NIC. It accesses a scanner and camera through USB, and an HP4050 via a parallel port.

EDITOR'S NOTE

During the 11/2004-2/2005 period the machine actually evolved. By the end, the 200GB system drive had been replaced with a 250GB. Sometime during this time period, the DVD reader and CD rewriter were replaced by a single DVD+RW drive.

In other words, it's not a simple machine.

Sometime in late October or early November 2004 I noticed that it would lock up. Sometimes for a few seconds, sometimes for minutes or permanently. At first these lockups were rare, but as mid November approached, they happened several times per day. Some of these hangs were accompanied by a stopping of the clock on the taskbar. These hangs seemed to happen more frequently when my computer was under stress, like doing a disk backup or running my prime number generator program.

After a couple days I noticed that these hangs were preceded by a single click. There would be a click, and then 1-30 seconds later there would be a hang. At the time it sounded to me like the click came from the UPS I bought at Sam's Club. I arranged to return the UPS to Sam's Club. They ran a purchase history for me and gave me 7 days to return the UPS.

During those 7 days I ran with a different UPS, and indeed, the frequency of such hangs dropped precipitously. But there were a couple hangs during those 7 days. On the 7th day I returned the UPS. Yes, I knew the UPS wasn't the whole story, but subjectively it made the problem more frequent. I was up against a "use it or lose it" deadline with the return -- I returned it.

Within a few days I knew the problem was nowhere near solved. It came back slowly, and by November 15 it was as bad as it had been before the UPS swap.

I should mention at this point that I had not performed a thorough investigation on this problem, because to do so would have greatly impacted my work. I kept hoping either that a lucky guess like the UPS would fix it, or that the problem would become reproducible. But on November 19 I finally admitted that this problem was preventing me from doing my work. I opened the case and went on in.

If you've read my books or taken my course you know that one of my favorite intermittent busting tactics is "turning the intermittent against itself". An intermittent's power comes from its continual state changes. If you can correlate any factor to those state changes, you've found the root cause -- often without any in-depth knowledge of the underlying system or technology. You look for such correlations with manipulation -- physical, thermal, whatever. I went in, and with the machine running, I wiggled cables and cards.

Some folks say it's irresponsible to wiggle things in a running computer. You can break something. That's true, but look at my situation...

I could replace the entire computer, new and fully loaded with hardware, for less than maybe $800.00. I'd be unlikely to break the whole computer -- a $100 motherboard or a $150 disk would be more like it. So then the question arises -- how much extra troubleshooting time should I spend protecting against a possible $150.00 loss, especially given that I've NEVER broken a computer with on-line physical manipulation. Remember, if the computer weren't already broken, the physical manipulation could not have harmed anything.

I went in and wiggled. One IDE cable seemed to affect the symptom, so I replaced it. Wiggling some more, the symptom recurred. Finally I saw that I could trigger the symptom almost at will by wiggling the power connector to one of the drives. I replaced the power connector (with the power off), turned it back on, and wiggled some more. Nothing. Nada. I banged everything really hard. Nothing. I ran my prime number program to stress the system. Nothing. I performed a disk backup, which really stresses the system. Nothing. It was fixed. I taped the bad connector shut, labeled it BAD, and buttoned everything up. It was fixed. Time marched on...

Must have been late January or early February I heard a click. A little sound -- a harmless sound -- most would have missed it. But to me it was the most ominous sound in the world -- the sound my hard disk made back in the days of intermittence. Nothing happened, time went on. I started hearing the click more often. Then one day the computer hung, just like the bad old days. Suspecting hard disks, I sought to find which of the two was bad. A hard disk test utility called smartctl indicated that my system disk had problems, but my data disk was OK. Booting the Knoppix Linux-on-a-disk, the data disk's partitions could be mounted, but not the system disk's partitions. I bought a new disk and started a backup. THE SYSTEM COULD NOT BACK UP!

I booted Knoppix, archived the data partitions, and used sftp to transfer the newly made archives to a different computer. I replaced the system disk on 2/26/2005, fired it up, and it worked perfectly. It's been working perfectly since then (this is being written about a month later). This computer is now one of the most stable I've worked with -- Kmail and Mozilla never hang.

What happened?

What happened? It's a fair question. How could an intermittent be cured, then pop up again two months later?

Obviously this disk was fitted with a bad power supply connector back in October/November. The bad power supply connector turned the disk off and on several times a day, without benefit of any kind of hardware or software "orderly shutdown". It's likely that the power cycles to the disk were not clean, but instead were the spiky type of power cycles you get with a loose connection. My theory is that the power cycling of the disk in November did irreversible damage to the drive and gravely shortened its life.

Now ready to fail, the hard drive could not long sustain the constant demands of a daily driver computer. In February the disk started cutting in and out. It was proven bad and replaced. There's no reason to anticipate further consequential damage.


FINAL NOTE:

In both cases of intermittence, my initial intermittent busting tactic was to ignore the problem, because on systems where safety isn't an issue, living with an infrequent intermittent is more cost effective than troubleshooting it. Once it got more frequent, I used physical manipulation to turn the intermittent against itself.

With the second occurrence I once again ignored it until it could be ignored no longer, then I used tools (smartctl, Knoppix) to verify the bad part, and replaced it.

Steve Litt is the author of "Rapid Learning: Secret Weapon of the Successful Technologist".  Steve can be reached at Steve Litt's email address.

Bios Defeats Litt

By Steve Litt
1990. My super-wonderful timesheet program was used by a large law firm. A third of a billion dollars passed through that program annually. But sometimes the data input facility hung, locking up the whole computer. I managed to find out that the hang always occurred when the program tried to turn numlocks on during input to a numeric field.

I managed to narrow it down to certain computers. Some computers displayed this problem, some didn't. It didn't depend who was logged in or whose data was being input -- certain computers displayed the tendency to hang and the rest never hung.

Days became weeks became months and I could never find a root cause. Finally my co-worker Ken examined the "bad" computers, found they all had the same bios version, and none of the "good" computers had that bios version. He did a little research on that bios version and discovered that bios version does not allow you to toggle numlocks from software. I disabled the auto numlocks feature on the timesheet front end and the problem stopped. A year later the old computers with the bad bios were sent to the glue factory, and I reenabled the numlocks feature. No further problems occurred.
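
Ken's method amounts to looking for one attribute value that every failing machine shares and no healthy machine has. A minimal C sketch of that check follows. The struct, the function name, and the machine names and BIOS version strings in the usage note are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct machine {
    const char *name;   /* hypothetical inventory label */
    const char *bios;   /* BIOS version string */
    int hangs;          /* nonzero if this machine shows the symptom */
};

/* Return the BIOS version found on every hanging machine and on no
 * healthy machine, or NULL if no single version separates the groups. */
const char *bios_unique_to_bad(const struct machine *m, size_t n)
{
    const char *candidate = NULL;
    size_t i;

    assert(m != NULL || n == 0);

    /* All hanging machines must agree on one version. */
    for (i = 0; i < n; i++) {
        if (!m[i].hangs)
            continue;
        if (candidate == NULL)
            candidate = m[i].bios;
        else if (strcmp(candidate, m[i].bios) != 0)
            return NULL;
    }
    if (candidate == NULL)
        return NULL;            /* no hanging machines at all */

    /* No healthy machine may carry that version. */
    for (i = 0; i < n; i++)
        if (!m[i].hangs && strcmp(candidate, m[i].bios) == 0)
            return NULL;

    return candidate;
}
```

Given an inventory such as ws1 (A1.02, hangs), ws2 (B2.10, fine), ws3 (A1.02, hangs), the function returns "A1.02". A match like that is a correlation, not proof; it tells you which version to go research, which is the step that confirmed the cause in the story.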

FINAL NOTE:

Nobody's perfect -- not even the author of the 1990 classic, "Troubleshooting: Tools, Tips and Techniques". Even great troubleshooters fail.

In that instance, and in many others, Ken displayed uncommon troubleshooting ability, often doing so through concise process and research.
 
Steve Litt is the author of the Universal Troubleshooting Process courseware.   Steve can be reached at Steve Litt's email address.

The Switch that Would Not Stay Rebooted

By Steve Litt
My main computer kept losing its connection to the Internet, and in fact to all of my LAN. I rebooted the switch it was hooked to (the main switch next to my IPCop gateway), and BANG, the connection was restored. An hour later I lost my connection again, and again rebooting the main switch fixed it. Because I was researching the grub bootloader at the time, I rebooted my computer quite often. Every time I rebooted it, I had to reboot the main switch too.

After an hour this became too much. I stopped working and began troubleshooting. My first step was to look at other computers on the LAN. My experimental computer plugs into a Linksys switch whose uplink port connects to the main switch. The experimental computer couldn't see the LAN either. I rebooted the Linksys switch and both the experimental computer and the main computer saw the network. Hours passed with no further problems. Days passed with no further problems. It was fixed.

I'm no network expert, but here's what I think happened. The Linksys switch got itself into a bad state and was sending out bad packets. These bad packets would in turn put the main switch into a bad state, either as a function of time, or when my main computer was rebooted -- whichever came first. I cannot imagine why rebooting the main computer would trigger the symptom, but it did.

Rebooting the Linksys switch got it out of its bad state and eliminated its broadcast of bad packets, thus solving the problem.


FINAL NOTE:

Why is this story in a magazine about intermittents? It was a hard failure of the Linksys switch!

The definition of an intermittent is a problem for which there is no known reproduction procedure. When I first encountered this problem, all I knew was that my computer dropped its LAN connection "every once in a while". There was no known way to make it happen. Later I found it could be reproduced by booting the computer, but by that time I had already applied corrective (general) maintenance -- rebooting switches and reconnecting network cables with electronic lubricant.

The breakthrough came when I realized the switch I rebooted was more of a symptom than a cause -- the cause was the Linksys switch.
 
Steve Litt is the author of "Troubleshooting Techniques of the Successful Technologist".  Steve can be reached at Steve Litt's email address.

Sylvia's Bucking Buick

By Steve Litt
Sylvia told me that her car was bucking. On further questioning, the symptom sounded a lot like what would happen on a car with a dirty fuel filter, but when I tried to reproduce it, I couldn't. Finally, while driving home from a restaurant one night a few weeks later, I felt it bucking. It bucked at 45 miles per hour. It was the same sort of temporary power loss you'd experience from a clogged fuel filter.

I questioned Sylvia how long it had been since she had her fuel filter changed, how long since her last tuneup, and how old her air filter was. It was over 2 years on all three. I took it to a tire-n-tune to get it tuned up and the filters replaced. After the service, I picked up the car, smug in the belief that the problem was over. Within a mile it started bucking.

Now I had a problem. The root cause could be almost anything. Well, anything except the three most likely suspects -- fuel filter, air filter and tuneup. I had someone check the fuel pressure. It was on the low end of normal -- shouldn't be a problem.

I went to one auto tech I use sometimes and had him drive the car. He reproduced the symptom, and felt the problem was in the transmission, which was something he didn't service.

I took the car to Cool Shift transmission, where Fred drove the car and declared the transmission in fine working order. I had a perfectly serviced car that just happened to buck at 45 miles per hour. Ughhh.

You might wonder at this point why I didn't just bring it to a shop and have them figure out what was wrong. The problem is, shops are even worse at intermittents than I am. They don't have the time to reproduce the symptom. Bringing it to a shop at this point probably would have resulted in lots of replaced parts and no symptom cessation. I was determined to wait until either the problem became constant, or I found a reproduction sequence.

We continued to drive the car, hoping it wouldn't stall on the freeway.

One cold morning I drove to Valencia College. After signing up for a course, I started pulling out of their parking lot. Shivering, I turned on the heater. The car started bucking. Hmmm!

I turned off the heater, and the car quit bucking. On -- bucked. Off -- didn't buck.

Was it drain on the electrical system, or something else? I turned on the lights, and it bucked slightly. Turned 'em off and it stopped. On -- bucked. Off -- stopped. I turned on both the heater and the lights, and turned on the brights for good measure. The car bucked so badly it almost stalled. Turned them all off, the car rode smoothly.

G o t c h a   S u c k e r !

I had a theory it was the electrical system. But more importantly I had a reproduction procedure. Cool-Shift transmission was right around the corner so I went there and had Fred measure the voltage. 14.0 volts -- a little lower than I'd like to see it, but nowhere near indicating a problem. At home I re-measured it with my own voltmeter -- 14.0. I took it to Batteries Plus, whose tests showed both the battery and the alternator to be functioning properly.

I started formulating all sorts of weird theories. Perhaps there was a resistive connection between the alternator and the ignition system such that even though the battery voltage read 14.0, where the ignition got its power it was more like 10 volts when the lights and heater were turned on:

[Figure: Electrical system with resistive connection]
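
The arithmetic behind this hypothesis is Ohm's law. The C sketch below is only an illustration: the 20 amp load is an assumed figure (the story gives no current measurement), and the function names are hypothetical. Under that assumption, a connection with 0.2 ohms of resistance would be enough to pull 14.0 volts down to 10.

```c
#include <assert.h>

/* Voltage left at the load after the drop across a resistive
 * connection: V_load = V_source - I * R. */
double voltage_at_load(double v_source, double amps, double ohms)
{
    assert(amps >= 0.0 && ohms >= 0.0);
    return v_source - amps * ohms;
}

/* Connection resistance that would explain an observed drop:
 * R = (V_source - V_load) / I. */
double implied_resistance(double v_source, double v_load, double amps)
{
    assert(amps > 0.0);
    return (v_source - v_load) / amps;
}
```

For example, voltage_at_load(14.0, 20.0, 0.2) comes out at about 10 volts. A meter across the battery would not show such a drop; it would have to be read on the ignition side of the suspect connection while the lights and heater were on.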

My resistive connection hypothesis would have explained it, but as it turns out, the real root cause was a whole lot weirder. Read on...

I looked in the car in search of a loose electrical connection and found none. It's not surprising -- a 1987 Buick Century with a 3.8 liter engine is packed pretty tight. Armed with a reproduction procedure, I headed to the best diagnosticians I know: Zych's Certified Auto Services on 436 in Altamonte Springs, Florida.

Arriving at Zych's, I gave them a full, written symptom description, told them verbally about turning on the lights and heater, and then insisted they reproduce the symptom while I was still there. They reproduced it easily; I left the car.

Zych's is a lightning fast shop, but the car was there for two full days -- a record. Every time I called they were still diagnosing it -- it wasn't an easy problem. Finally they called me and said it was fixed -- a bad alternator!

Jim Zych, the owner, said he initially ruled out the alternator based on the nice 14.0 volt reading at the battery, but after he had tried absolutely everything else that could cause it, he swapped in a known good alternator, and sure enough, the symptom vanished. There was something about the old alternator, having nothing to do with the DC output, that was causing the ignition system to malfunction. I paid him for alternator, installation, and a diagnostic fee, and went home knowing I'd gotten great service for a very reasonable price.

I think I know what was wrong with that alternator, and how I could have diagnosed it myself. I'll bet you dollars to donuts the alternator was putting out all sorts of spiky AC that was interfering with the electronic ignition system or the computer system that drove it. I'll bet if I'd placed an oscilloscope across the battery instead of a voltmeter, the problem would have been immediately obvious. I'll bet if I'd placed a capacitor across the alternator the problem would have vanished. But of course hindsight is 20/20. The real point of the story is that one of the trickiest problems I've ever seen in any machine or system got solved.


FINAL NOTE:

Of all the possible outcomes, this outcome was the best that could be hoped for. A typical, and much less desirable outcome, would go something like this:

Upon discovery of the hesitation, the driver would immediately bring the car to his "local mechanic", who, having absolutely no idea how to reproduce the symptom, would begin a long, costly course of diagnosis by serial replacement. A couple weeks and a couple thousand dollars later the problem might or might not be fixed.

In my case, the driver understood that the chance of solution was slim to none unless he found a symptom reproduction procedure. He spent a week or two driving the car and looking for a reproduction procedure. Once he found it, he understood the difficulty of the problem and brought it not just to anyone, but to the best diagnosticians available. Armed with a consistent reproduction procedure, the shop found the root cause and fixed the problem. Everyone played their part just right.

Steve Litt is the creator of the Universal Troubleshooting Process.  Steve can be reached at Steve Litt's email address.

The Power Supply Event

By Steve Litt
This happened about a week ago. I was merrily working away when my computer shut off. Powered down. At first I figured the power went out -- our local power company is surprisingly unreliable, at least for a power company in a developed nation. But then I remembered my computer was on an uninterruptible power supply. The lights in the house were still on. Oh Oh!

Repeatedly pressing the power button did nothing. I plugged a known good power supply into the mobo, and the machine counted memory. Good, my power supply had probably gone bad.

Just to be sure, I plugged the old power supply back into the motherboard. It counted memory and booted up. Oh Oh!

When you can toggle a symptom by repeatedly replacing and restoring a part, it's a reproducible problem. When you replace the part and it works, and then restore the original and it still works, you have an event -- the sparsest type of intermittent.
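
The distinction above can be written down as a tiny decision function. This is only a sketch: the function name and return strings are made up for illustration, it looks at a single replace-and-restore pass rather than repeated toggling, and the third outcome (the swap did not help at all) is not discussed in the article.

```python
def classify_swap_test(works_with_known_good, works_with_original_restored):
    """Classify the outcome of a replace-then-restore test on one part."""
    if works_with_known_good and not works_with_original_restored:
        # Symptom follows the original part: it can be toggled at will.
        return "reproducible"
    if works_with_known_good and works_with_original_restored:
        # Symptom vanished and stayed gone even with the original part back.
        return "event"
    # Not covered by the article: the swap itself did not clear the symptom.
    return "inconclusive"


# The power supply story: worked with the spare, and again with the original.
print(classify_swap_test(True, True))
```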

Leaving the old power supply plugged in, I buttoned it back up. If the same symptom happens even one more time, I'll perform the following tests:

  1. See how hot the power supply is, and if hot, blow on it with the blowing part of a shop vac for several minutes
  2. Wiggle the power supply to mobo connection, and see if the symptom goes away
  3. Make absolutely sure the computer is getting power
  4. Turn off and then back on the switch on the power supply itself
  5. Disconnect and reconnect the power supply to mobo connection
  6. Swap in a known good power supply

If #1 fixes it, the power supply is overheating and shutting down. Investigate why. Do the fans turn ok?

If #2 fixes it, there's a loose connection in the power supply lead or the mobo, and I'll need to find out which.

If #3 fixes it, find out why the 120V connection to the computer wasn't working.

If #4 fixes it, either the computer or the power supply got itself into some illegal state requiring a power cycle.

If #5 fixes it, it's either a loose connection or a weird illegal state.

If #6 fixes it when the others didn't, I'll assume the power supply is bad and replace it.
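
The test plan above amounts to a small lookup table: run the tests in order and act on the first one that clears the symptom. A minimal Python sketch of that idea follows. The names DIAGNOSES and diagnose are hypothetical, and the conclusions are paraphrased from the list above.

```python
# Hypothetical sketch of the test plan above: each test number maps to the
# conclusion drawn if that test is the first one to clear the symptom.
DIAGNOSES = {
    1: "Power supply overheating and shutting down; check the fans.",
    2: "Loose connection in the power supply lead or the mobo.",
    3: "Computer was not getting 120V; find out why.",
    4: "Computer or power supply was in an illegal state needing a power cycle.",
    5: "Loose connection or a weird illegal state.",
    6: "Assume the power supply is bad and replace it.",
}


def diagnose(results):
    """Return (test_number, conclusion) for the first test that fixed it.

    `results` maps test number -> True if the symptom went away after that
    test. Tests are considered in numeric order, as in the list above.
    Returns None if no test cleared the symptom.
    """
    for test in sorted(DIAGNOSES):
        if results.get(test, False):
            return test, DIAGNOSES[test]
    return None


# Example: test 1 did nothing, wiggling the connector (test 2) fixed it.
print(diagnose({1: False, 2: True}))
```

Walking the tests in numeric order matters because the last test, the swap, only points at the power supply when the earlier tests have already failed to clear the symptom.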


FINAL NOTE:

Extensive experience made me retest the original power supply. At each stage of the troubleshoot, I constructed diagnostic tests to be quick. Therefore, when it came time to "swap" the power supply, I merely unplugged the original power supply's connector from the motherboard and plugged in the connector of a known good power supply. I neither unmounted the old power supply nor mounted the new one in the case.

When it counted memory after the temporary replacement, it looked mightily like a simple case of bad power supply. But strange things happen in troubleshooting, so just to make sure I restored the original power supply connection. The symptom did NOT reappear, meaning there was something odd going on.

There is not, at this moment, sufficient evidence to replace the power supply. The root cause of this problem could be outside the power supply, in which case the symptom would recur later, after I'd spent $100.00 on a new high quality, 2 fan power supply.

So far, this mishap was an event -- the sparsest of all sparse intermittents. It happened once and never again. If this were a safety critical system, I'd need to take it offline and perform extensive tests to ascertain the root cause. However, because it's not safety critical, by far the best course of action is to continue using it. Either the event will never recur, or it will recur (hopefully more often) and I'll be able to troubleshoot it to a root cause.

Steve Litt is the author of the Universal Troubleshooting Process courseware.   Steve can be reached at Steve Litt's email address.

Are We Having Fun Yet?

By Steve Litt
The stories in this Troubleshooting Professional Magazine don't exactly portray me in the best possible light, do they?

There's an interesting paradox in the life of a troubleshooter. By honing our divide-and-conquer skills, which work perfectly on reproducible problems, we gain stellar reputations as troubleshooters. Based on those stellar reputations, we're assigned the toughest problems -- intermittents -- for which our divide-and-conquer skills are only marginally effective.

In other words, the longer we've been in the business, the more we understand how nasty intermittent problems can be. Intermittents occasionally make even the best troubleshooter look like a buffoon. The thing that separates the ninja troubleshooter from the hack is the customer/user perception of the handling:

Every troubleshooter needs to read and think about intermittents. Not just once, but frequently. Here are some resources explaining intermittent busting tactics:

Steve Litt is the author of "Rapid Learning: Secret Weapon of the Successful Technologist".  Steve can be reached at Steve Litt's email address.

Letters to the Editor

All letters become the property of the publisher (Steve Litt), and may be edited for clarity or brevity. We especially welcome additions, clarifications, corrections or flames from vendors whose products have been reviewed in this magazine. We reserve the right to not publish letters we deem in bad taste (bad language, obscenity, hate, lewdness, violence, etc.).
Submit letters to the editor to Steve Litt's email address, and be sure the subject reads "Letter to the Editor". We regret that we cannot return your letter, so please make a copy of it for future reference.

How to Submit an Article

We anticipate two to five articles per issue, with issues coming out monthly. We look for articles that pertain to the Troubleshooting Process, or articles on tools, equipment or systems with a Troubleshooting slant. This can be done as an essay, with humor, with a case study, or some other literary device. A Troubleshooting poem would be nice. Submissions may mention a specific product, but must be useful without the purchase of that product. Content must greatly overpower advertising. Submissions should be between 250 and 2000 words long.

Any article submitted to Troubleshooting Professional Magazine must be licensed with the Open Publication License, which you can view at http://opencontent.org/openpub/. You may, if you wish, elect the option to prohibit substantive modifications. However, in order to publish your article in Troubleshooting Professional Magazine, you must decline the option to prohibit commercial use, because Troubleshooting Professional Magazine is a commercial publication.

Obviously, you must be the copyright holder and must be legally able to so license the article. We do not currently pay for articles.

Troubleshooters.Com reserves the right to edit any submission for clarity or brevity, within the scope of the Open Publication License. If you elect to prohibit substantive modifications, we may elect to place editor's notes outside of your material, or reject the submission, or send it back for modification. Any published article will include a two-sentence description of the author, a hypertext link to his or her email, and a phone number if desired. Upon request, we will include a hypertext link, at the end of the magazine issue, to the author's website, provided that the website meets the Troubleshooters.Com criteria for links and that the author's website first links to Troubleshooters.Com. Authors: please understand we can't place hyperlinks inside articles. If we did, only the first article would be read, and we can't place every article first.

Submissions should be emailed to Steve Litt's email address, with subject line Article Submission. The first paragraph of your message should read as follows (unless other arrangements are previously made in writing):

Copyright (c) 2001 by <your name>. This material may be distributed only subject to the terms and conditions set forth in the Open Publication License, version  Draft v1.0, 8 June 1999 (Available at http://www.troubleshooters.com/openpub04.txt/ (wordwrapped for readability at http://www.troubleshooters.com/openpub04_wrapped.txt). The latest version is presently available at  http://www.opencontent.org/openpub/).

Open Publication License Option A [ is | is not] elected, so this document [may | may not] be modified. Option B is not elected, so this material may be published for commercial purposes.

After that paragraph, write the title, text of the article, and a two-sentence description of the author.

Why not Draft v1.0, 8 June 1999 OR LATER

The Open Publication License recommends using the phrase "or later" to describe the version of the license. That is unacceptable for Troubleshooting Professional Magazine because we do not know the provisions of that newer version, so it makes no sense to commit to it. We all hope later versions will be better, but there's always a chance that leadership will change. We cannot take the chance that the disclaimer of warranty will be dropped in a later version.
 

Trademarks

All trademarks are the property of their respective owners. Troubleshooters.Com (R) is a registered trademark of Steve Litt.
 

URLs Mentioned in this Issue