Troubleshooting Professional Magazine
The Universal Troubleshooting Process
Rapid Learning: Secret Weapon of the Successful Technologist.
There are other aspects of the use of the UTP (Universal Troubleshooting Process) that need clarification. When is the UTP appropriate? How must it be changed in extremely safety critical situations? These are some of the subjects explored in this month's magazine.
So kick back, relax, and read this magazine. And remember, if you're a Troubleshooter or Technologist, this is your magazine. Enjoy!
It's not as huge a change as you might imagine, because "Prepare" would
look something like this:
So the question is, shall I change it?
As you remember from the December 2000 mag, malfunctioning well defined systems require only restoration to as-designed state and behavior, which is accomplished by finding and fixing the root cause. Contrast that with malfunctioning fuzzily defined systems, which require finding of the root cause (analyzing the problem state), followed by an additional step of analyzing the solved state in order to design a new system with optimum benefit. In December of 2000 I stated this as the major decision point in choosing the Universal Troubleshooting Process over a more generic problem solving strategy such as Kepner-Tregoe or Root Cause Analysis. It's a decision point, but it's not the major one.
In troubleshooting computers, we board swap constantly. It's cheap, easy, fast and safe, and we have abundant spares for everything the computer contains. Board swapping is lightning quick -- usually quicker than mentally deducing what could be the root cause and what could not be. Computers are well defined systems.
Now consider a fuzzily defined system -- a business. Unlike the computer, there are no interchangeable parts. Every person is different, and except in the most basic shop floor positions, you can't swap people. People have no specification or service manual.
And even if two "equivalent" people could be found, swapping them would yield no information until the replacement person was retrained in the new job. And retraining time is unknown, so such a swap creates a situation resembling an intermittent.
Imagine a corporation whose symptom is "our revenue is down 42% from 3 years ago". Imagine further that there's a belief that the root cause is the advertising agency. Certainly if this were a computer we'd simply swap the part and see if the symptom gets toggled. But with an advertising agency, it's not so simple. Imagine swapping advertising agencies:
Therefore, do not use the Universal Troubleshooting Process on systems whose diagnostic tests are not abundant, cheap, easy, fast and safe.
On the other hand, recognize that the Universal Troubleshooting Process has been optimized for systems whose diagnostic tests are abundant, cheap, easy, fast and safe. That's what it does. That's its life's calling. In less time than it takes to go half way through a Kepner-Tregoe problem specification (the famous is/is not distinction revealing tool), the Universal Troubleshooting Process lets you use repeated quick and cheap tests to pinpoint the root cause and replace the defective component.
When available, take advantage of abundant, cheap, easy, fast and safe diagnostic tests. Think of it this way: The fact that hiking boots can carry you through any terrain doesn't make them optimal for all terrains. In a 10K race on smooth pavement, you'd choose running shoes instead. That smooth pavement is analogous to abundant, cheap, easy, fast and safe diagnostic tests.
Factory floors are hybrids. The individual machines are well defined systems amenable to the Universal Troubleshooting Process. The work and product flow between those machines is fuzzily defined and not appropriate for the Universal Troubleshooting Process. That work and product flow is typically troubleshot by something called the Theory of Constraints.
An office is a similar hybrid, with individual computers, the software running them and the networking connecting them being a well defined system best troubleshot with the Universal Troubleshooting Process. The work and information flow going through those computerized systems is a fuzzily defined system best troubleshot with something like Kepner-Tregoe, which treats the entire organization as a system.
Fuzzily defined systems lack abundant, cheap, easy, fast and safe diagnostic tests. The reason is simple enough. Without a defined as-designed set of states and behaviors, tests cannot instantly evaluate the effect of a diagnostic change. This is why the Universal Troubleshooting Process is not appropriate for fuzzily defined systems.
Well defined systems usually offer abundant, cheap, easy, fast and safe diagnostic tests. They are typically manufactured or created with many test points. They're typically modular (design, engineering and manufacturing are easier that way). Testing equipment and software are typically widely available, and if not cheap, at least not expensive when their cost is divided over the myriad of system problems they help diagnose.
Additionally, in well defined systems finding the root cause is sufficient to solve, because analyzing the solved state degenerates into "restore to as-designed state and behavior", whereas in fuzzily defined systems finding the root cause is not sufficient, because one still needs to do design work (called analyzing the solved state) to determine how best to modify the system. Naturally, such work would be a waste of time solving a problem in a well defined system.
Occasionally we encounter complex, one of a kind systems with insufficient test points and few diagnostic tools on the open market. Even in such cases, if the system is well defined, it's usually cost efficient to have Engineering design and create custom made diagnostic tools.
In other words, almost all well defined systems yield sufficiently abundant, cheap, easy, fast and safe diagnostic tests to make use of the Universal Troubleshooting Process by far the most efficient choice.
But there's another factor...
Take a look at the Universal Troubleshooting Process:
#4 is out. You certainly would not do anything to reproduce a symptom possibly leading to a few million deaths.
#5 is out. You would want to know the exact root cause so you could prevent all occurrences of that root cause. If Thursday's countdown was due to a loose screw, you wouldn't want to tighten all screws (general maintenance) before identifying the root cause. Imagine if the symptom went away and you didn't know what fixed it (and therefore when it might happen again). Also, you need to know why the screw loosened. Only then can you prevent future occurrence. Otherwise, your alternative would be to stand by helplessly knowing that sooner or later the problem (or a similar problem) would occur again.
#6, #7 and #8 are greatly changed. The system is so safety critical that there's a predefined procedure for diagnostic tests, replacements, and post-fix tests. Indeed, it's likely you'd need approval to take a voltmeter reading. The downside risk of a slipped probe is so great that only tests agreed to be necessary and revealing are appropriate.
Additionally, the definition of root cause is different in extremely
safety critical situations. In a computer, if a blown modem prevents booting,
the root cause is said to be the modem. But look at the missile system:
By "drilling down" to causes in people and procedures, we not only prevented future occurrence, but also likely prevented many other similar problems with parts stored in storerooms.
The bottom line is that extremely safety critical systems require such extensive modification to the Universal Troubleshooting Process that you're better off with Root Cause Analysis. Some organizations might choose to augment root cause analysis with the Universal Troubleshooting Process, but clearly the dominant methodology in extremely safety critical systems is Root Cause Analysis.
In the preceding diagram it's important to note that "extremely safety critical" means danger to life and limb, typically danger to many people. It would typically not mean the brakes on a single car, simply because the consumer would not pay the price of the other analysis types.
Note also that there's a tiny minority of well defined systems not possessing abundant, cheap, easy, fast and safe diagnostic tests. For those systems the choice is either to create better diagnostic tests, or use one of the other problem solving methodologies.
Finally, sometimes well defined systems need to be fixed beyond "as designed". In such cases an effective method is to use the Universal Troubleshooting Process and its included Bottleneck Analysis to find bottlenecks, and then if necessary use technological design methodologies to design the better system. However, if the newly designed system involves significant changes to work and information flow, work and product flow, or other human interactions, it may be best to augment the Universal Troubleshooting Process with something like the Theory of Constraints or Kepner-Tregoe.
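The decision points discussed so far can be sketched as a small shell function. This is only an illustration: the function name choose_method and its yes/no arguments are hypothetical, and a real decision involves more judgment than three flags can capture.

```shell
# Hypothetical helper: choose_method is not part of the Universal
# Troubleshooting Process. It only encodes the decision points described
# in the text. Each argument is "yes" or "no".
#   $1  extremely safety critical (danger to life and limb)?
#   $2  well defined system (has an as-designed state and behavior)?
#   $3  diagnostic tests abundant, cheap, easy, fast and safe?
choose_method() {
    safety_critical=$1
    well_defined=$2
    good_tests=$3

    if [ "$safety_critical" = "yes" ]; then
        echo "Root Cause Analysis"
    elif [ "$well_defined" != "yes" ]; then
        echo "Kepner-Tregoe or Theory of Constraints"
    elif [ "$good_tests" != "yes" ]; then
        echo "Create better diagnostic tests, or use another methodology"
    else
        echo "Universal Troubleshooting Process"
    fi
}
```

For an ordinary desktop computer you would call it as choose_method no yes yes. The safety question is asked first because, as described above, it overrides everything else.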
Speaking of Kepner-Tregoe, you can incorporate Kepner-Tregoe type techniques right into your use of the Universal Troubleshooting Process. Read on...
The problem specification isn't rocket science. It's a simple matrix
analysis. A simple example of such a matrix analysis might look like this:
|Question|Is|Is not|Distinction|
|What is the deviation?|It is the fact that email can't be accessed.|It is not anything else that has been observed.|n/a|
|Where does it happen?|It is occurring on John's computer and Tiffany's computer.|It is not occurring (can't be reproduced) on my computer.| |
|When did it start?|Friday at about 2:00 pm|OK before then|Partial restore done on accounting department server Friday between 1:45 and 2:15 pm|
As you can see, the preceding matrix analysis points an accusing finger at the data restore on the accounting server Friday afternoon, thereby pointing out what would likely be a fruitful path for investigation. This makes it a powerful tool for those times when you get stuck and can't figure how to further narrow the scope of the problem.
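If you keep such a matrix in a plain text file, a few lines of shell can pull out the rows that produced a distinction. The following is a minimal sketch, not part of Kepner-Tregoe itself: the function name show_distinctions and the four-field, pipe-separated file layout (question|is|is not|distinction) are assumptions made for illustration.

```shell
# Hypothetical helper: list the rows of a problem specification that
# produced a distinction.
# Input file ($1): one row per line, four fields separated by "|":
#   question|is|is not|distinction
# Rows whose distinction field is empty or "n/a" are skipped.
show_distinctions() {
    awk -F'|' '$4 != "" && $4 != "n/a" { print $1 " -> " $4 }' "$1"
}
```

Run against the three rows above, only the "When did it start?" row would be printed, which is the row pointing at the partial restore.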
The purpose of "is" and "is not" is to highlight distinctions. Therefore:
Don't be concerned if you can't discern a distinction from a question. Not every question produces a distinction.
Many questions produce consequent distinctions. For instance, Bill and Fred are in the accounting department. And as a consequent distinction, the accounting department has its own MS Exchange server (which may have a problem).
Balance the time spent on matrix analysis against time saved in troubleshooting. Matrix analysis is an excellent tool when you're stuck and don't know how to narrow the scope further, but it's time consuming and therefore a waste of time when you know of quick, easy, cheap and safe diagnostic tests yet to be done.
When you get the book, turn to chapter 2, "Problem Analysis". The chapter is about 25 pages long, and once you read it you'll thoroughly understand what I call matrix analysis (and what Messrs. Kepner and Tregoe call "Problem Specification").
Now that you have the whole book, read it. Not only will it give you a vital tool to solve fuzzily defined system problems, but it will also give you insight into generic problem solving which will help you hone your skills in the Universal Troubleshooting Process.
The book is very inexpensive, typically less than $20.00. Buy it. Now! It's very understandable. You may even want to take Kepner-Tregoe courses.
But don't let a slick talking salesman convince you to substitute Kepner-Tregoe courses for Universal Troubleshooting Process courses. The two are optimized for totally different uses. If you were to choose Kepner-Tregoe techniques to solve problems in well defined systems with abundant, cheap, easy, fast and safe diagnostic tests (in other words, technological problems), your competitors using the Universal Troubleshooting Process would run circles around you.
Troubleshooters.Com is still here. We're still small, still debt free, still in the black, and still a beacon of best practices in the Troubleshooting arena.
Tell your friends and co-workers about Troubleshooters.Com. T.C visitors get the best Troubleshooting information available at any price. And if they need to train employees, the material is available in a course format for a surprisingly low cost.
So happy birthday T.C. And we know there will be many, many more.
That's not good for productivity.
As best I can tell, this is a result of upgrading from Mandrake 7.2 to Mandrake 8.0. It appears that some Mandrake 8 installations have such fonts, and some don't. The basic problem is that if a website declares its own fonts, and those fonts include arial, the resulting print is extremely pixelated. By making the fonts super big you change them from unreadable to supremely ugly, but there's no doubt it's a problem.
So I fixed it.
In XF86Config-4 I commented out this line:
    FontPath "unix/:-1"
and replaced it with these lines:
    FontPath "/usr/X11R6/lib/X11/fonts/misc:unscaled"
    FontPath "/usr/X11R6/lib/X11/fonts/75dpi:unscaled"
    FontPath "/usr/X11R6/lib/X11/fonts/100dpi:unscaled"
    FontPath "/usr/X11R6/lib/X11/fonts/mozilla-fonts"
You can read the details and the reasons at http://www.troubleshooters.com/linux/cookiecrumbfonts.htm (link in URL's section).
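For anyone who would rather script the change than edit by hand, here is a minimal sketch. The function name rewrite_fontpath is hypothetical. It prints the edited configuration to stdout instead of touching the original, so you can inspect the result before installing it.

```shell
# Sketch only: rewrite_fontpath is a hypothetical helper. It reads an
# XF86Config-4 style file ($1) and prints the edited version to stdout,
# so the original file is never modified in place.
rewrite_fontpath() {
    awk '
    /^[ \t]*FontPath[ \t]+"unix\/:-1"/ {
        print "#" $0    # comment out the font server entry
        print "    FontPath \"/usr/X11R6/lib/X11/fonts/misc:unscaled\""
        print "    FontPath \"/usr/X11R6/lib/X11/fonts/75dpi:unscaled\""
        print "    FontPath \"/usr/X11R6/lib/X11/fonts/100dpi:unscaled\""
        print "    FontPath \"/usr/X11R6/lib/X11/fonts/mozilla-fonts\""
        next
    }
    { print }    # every other line passes through unchanged
    ' "$1"
}
```

Assuming your configuration lives at /etc/X11/XF86Config-4 (check your own system), you might run rewrite_fontpath /etc/X11/XF86Config-4 > /tmp/XF86Config-4.new, compare the two files with diff, back up the original, and only then copy the new file into place.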
My fix isn't pretty. But Linux is so configurable that you can make your fonts as beautiful as you want. You can even copy your Windows fonts to your Linux box and set Linux to render Windows fonts. I chose not to do that because I don't want to depend on proprietary fonts. But it's doable. Several web pages tell you exactly how to make your fonts as pretty and accurate as you want. Most of those pages are howto's documented on linuxdoc.org.
Linux isn't perfect. There are some hassles. But because Linux's configuration is entirely accessible, you can fix any quirk that comes your way. Windows does fonts excellently. Linux does not. But you can make Linux fonts perform as well as Windows fonts. Putting this in perspective, does Windows let you eliminate its weak point, bluescreens and hangs, with a couple of simple configuration changes?
But all troubleshooting content remained in Troubleshooters.Com's root, and over time the site's root directory became huge and convoluted. So in May I did what should have been done (with 20/20 hindsight) five years ago -- I made a ./utp directory (utp stands for Universal Troubleshooting Process) to contain future troubleshooting related pages. Existing high traffic pages couldn't be moved -- it would have broken too many links. But hesley.htm, tcourses.htm and utpfaq.htm were moved to the new ./utp directory, which meant every Troubleshooters.Com link to those files had to be changed. Ughhh!
Linux made it easier:
    find . -type f | xargs grep -l "tcourses" > fixit.sh
Now I ran these VI commands on fixit.sh:
    ivi <esc>
In other words, I joined all the separate lines into one line containing all the files, and then preceded it with vi, so a single vi session had 66 buffers each containing a file to be changed. On the first buffer I did this:
    :%s/tcourses\.htm/utp\/tcourses.htm/gc
That fixed each link, giving me the opportunity to confirm. I also checked for other occurrences of tcourses, hesley and utpfaq. Once the file was fixed I wrote it and deleted its buffer and went on to the next one. The substitute command and the searches were still on the history list, so subsequent files were quick. I did a substantial reorganization of a 468 file website in less than an hour.
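If you trust the substitution enough to skip the per-link confirmation, the same change can be sketched non-interactively. This is not the method used above, just a possible alternative. The function name fix_links is hypothetical, and the sketch assumes GNU grep and GNU sed (for the -r and -i options).

```shell
# Sketch only: fix_links is a hypothetical helper, and it assumes GNU grep
# and GNU sed. Unlike the interactive :s///gc session, it does NOT ask for
# confirmation, so try it on a copy of the site first.
#   $1  root directory of the site
#   $2  file name that moved, e.g. tcourses.htm
#   $3  subdirectory it moved into, e.g. utp
fix_links() {
    root=$1
    page=$2
    subdir=$3
    # Escape the dots so "." matches only a literal dot.
    pattern=$(printf '%s' "$page" | sed 's/\./\\./g')
    # List every file that mentions the page, then rewrite it in place.
    grep -rl -- "$pattern" "$root" | while IFS= read -r file; do
        sed -i "s|$pattern|$subdir/$page|g" "$file"
    done
}
```

A call might look like fix_links ./site-copy tcourses.htm utp. The function is not idempotent: running it twice would turn utp/tcourses.htm into utp/utp/tcourses.htm, and it would also rewrite links inside the new directory itself, which is one reason the confirm-each-change approach is safer on a live site.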
Back in my Windows days it would have taken me most of a day, and I would have been finding dead links for weeks afterward. Oh, I suppose I could have gone out and bought proprietary versions of web lint tools, and they might have worked. But Linux makes it easy right out of the box.
Kind of like the productivity gains you get from evaluating log files.
Not all operating systems give you log files. Windows 9x doesn't. But all UNIX and UNIX workalike systems, including Linux and BSD, give you extensive logs.
A log file is like a strip chart recorder. It constantly records events on the system.
This is a lifesaver with intermittent problems. If you find a blown process, and determine a file created or written by that blown process (like a core file or partially written output file), you now have the timestamp of the event. By going back in your log files you can see the process's error message, as well as information about other processes running at that same time. Often the log files yield enough info to deduce a root cause.
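The timestamp lookup can be sketched in a few lines of shell. The function name log_near_file is hypothetical, and the sketch assumes GNU date (for the -r FILE option) and a syslog-style log whose lines begin with a timestamp such as "Jul 30 13:10:49".

```shell
# Sketch only: log_near_file is a hypothetical helper. It assumes GNU date
# and a log whose lines begin "Mon DD HH:MM:SS".
#   $1  file left behind by the failed process (core file, partial output)
#   $2  log file to search, e.g. /var/log/messages
log_near_file() {
    # Modification time of the file, truncated to the minute, in the same
    # format syslog uses (%e pads single-digit days with a space).
    stamp=$(LC_ALL=C date -r "$1" '+%b %e %H:%M')
    # Print every log line written during that minute.
    grep -- "^$stamp" "$2"
}
```

You might call it as log_near_file ./core /var/log/messages. It only looks at the exact minute the file was last written, so an event that straddles a minute boundary will need a wider search by hand.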
Logs are also excellent tools for diagnosing reproducible problems. Of course you could diagnose a reproducible by pure divide and conquer, but with available log files, it's much faster to augment divide and conquer with perusal of the log files. Once again, log files yield a snapshot of the processes at the time, error messages, and which process failed.
One of the coolest ways of using log files is real time:
    [root@mydesk /root]# tail -f -n0 /var/log/messages
    Jul 30 13:10:00 mydesk CROND: (root) CMD ( /sbin/rmmod -as)
    Jul 30 13:10:49 mydesk kernel: cdrom: open failed.
The preceding command runs a tail -f (tail follow) on the main log file (/var/log/messages). The -n0 means print nothing that's already happened. The first line of output you see below it occurred when a CD was placed in the CD drive (this is Mandrake, which has automounting). The second line occurred when a user tried to access /mnt/cdrom with an ls command, and failed because the CD isn't formatted.
In this trivial example it's easier to look at the error message returned by the command, but many times a log file's ability to issue real time messages makes troubleshooting much easier.
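On a busy machine the real time log can scroll by too fast to read. One possible refinement, sketched here, is to pipe it through a filter. The function name filter_errors and the choice of the words "fail" and "error" are assumptions for illustration; adjust the pattern to the messages you care about.

```shell
# Sketch only: filter_errors is a hypothetical helper. It reads log lines on
# stdin and passes through only those mentioning "fail" or "error",
# ignoring case. --line-buffered (GNU grep) makes each match appear
# immediately instead of waiting for a full output buffer.
filter_errors() {
    grep --line-buffered -i -E 'fail|error'
}

# Typical use (runs until interrupted with Ctrl-C):
#   tail -f -n0 /var/log/messages | filter_errors
```

Given the two log lines shown earlier, only the "cdrom: open failed." line would get through the filter.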
If you use most Linux variants, your main log file is /var/log/messages. Many other interesting log files are contained in the /var/log directory -- get to know them.
By submitting content, you give Troubleshooters.Com the non-exclusive, perpetual right to publish it on Troubleshooters.Com or any A3B3 website. Other than that, you retain the copyright and sole right to sell or give it away elsewhere. Troubleshooters.Com will acknowledge you as the author and, if you request, will display your copyright notice and/or a "reprinted by permission of author" notice. Obviously, you must be the copyright holder and must be legally able to grant us this perpetual right. We do not currently pay for articles.
Troubleshooters.Com reserves the right to edit any submission for clarity or brevity. Any published article will include a two sentence description of the author, a hypertext link to his or her email, and a phone number if desired. Upon request, we will include a hypertext link, at the end of the magazine issue, to the author's website, providing that website meets the Troubleshooters.Com criteria for links and that the author's website first links to Troubleshooters.Com. Authors: please understand we can't place hyperlinks inside articles. If we did, only the first article would be read, and we can't place every article first.
Submissions should be emailed to Steve Litt's email address, with subject line Article Submission. The first paragraph of your message should read as follows (unless other arrangements are previously made in writing):