The most exciting thing about being an editor is that when I find something great, I can tell the world. Eliyahu M. Goldratt's new book, Critical Chain, is a perfect example.
For those of you not familiar with Mr. Goldratt (and every Troubleshooter should be), he's an author who has taken Bottleneck Analysis where no man (or woman) has dared go before. For most of the population, it's a fuzzy concept. For Troubleshooters like us, it's a vital Troubleshooting Tool. For Goldratt, it's nothing less than a universal mathematical law, a piece of THE TRUTH.
Using the "business novel" genre he's made famous, Goldratt creates a "can't put it down" story showing the exact use of Bottleneck Analysis to create miracles in project management, financial analysis, and other business problems. The reader is led down the path to discovery, making all the same mistakes and discoveries he would make if learning from experience.
This issue of Troubleshooting Professional Magazine is devoted to Bottleneck Analysis. Mr. Goldratt describes and defines it with authority. My contribution in this issue will be to apply his thought processes to the repair of well-defined systems. And to bring up the tantalizing possibility that maybe, just maybe, the guy repairing desktop computers today can be the guy repairing corporations tomorrow.
When the symptom description includes relative words like "too", "not enough", "slow", "we need more", "not as fast" or "insufficient", you know you'll be taking Bottleneck Analysis out of your mental tool-belt:
These complaints have something in common: They DO NOT say the system doesn't work.
Nobody said the printer doesn't print, the car doesn't move, the software produces wrong results, the music is inaudible, the bike doesn't move, or the network doesn't transfer information. All they're saying is that the speed, throughput or degree of the system is wrong, and it's creating a problem for the user.
Most such problems involve a subsystem or process that's too slow. But note the complaint "the car idles too fast". That's every bit as much a bottleneck analysis problem. Too slow, too fast, doesn't matter. Whenever the speed or throughput of a system is wrong, it's time for Bottleneck Analysis.
So far we've discussed cases where the system is under-performing the design specifications. But it's equally applicable when you want to improve its performance beyond the design specifications. Anyone who's "souped up" a car or "overclocked" a computer is familiar with the concept.
So that's it. Whenever you want to change the throughput or speed of a system or subsystem, Bottleneck Analysis is a vital tool.
Steps 6 and 7 of the Universal Troubleshooting Process are:
This is perfectly correct for a bottleneck situation. As an intellectual exercise, you might want to re-phrase it to get the "feel" of bottleneck analysis:
Therefore, bottleneck analysis is completely compatible with the Universal Troubleshooting Process.
Note that sometimes the bottleneck is the root cause, and sometimes it isn't. The root cause is often a signal input to the bottleneck. For instance, your car overheats when idling. Your car stalls when cold. You note that the automatic choke is open a little too much when cold. You verify that if you force the choke closed more, the symptom disappears. The bottleneck is definitely the automatic choke. But why is it open too much? Further investigation might reveal that a sensor to a computer controlling the choke is bad, in which case the root cause is a signal input to the bottleneck.
Suppose, instead, further investigation revealed that the choke's inputs were just fine, and the choke's spring had gotten weak. In this case the bottleneck is also the root cause (as long as you consider the spring to be a part of the choke).
Fill a 16 oz water glass and a 16 oz soft drink bottle with water. Turn them upside down. The bottle empties much more slowly because the bottle's neck restricts flow. Hence the name. The point is that the slowest part of a flow path determines the speed of the entire flow path. Actually, that statement is too restrictive, because there are in fact many different situations that resemble bottlenecks, each with its own grounding in the practical world and in mathematics. Let's examine an ultra-simple example:
Every day, Joe walks 2 miles to school. The first mile is on a concrete road, where he can go 4 miles an hour with reasonable effort. The next is through a field of 3 foot deep snow, where he can go 1 mile an hour with reasonable effort. The trip will take him 1.25 hours: 1 through the snow and .25 on the concrete. Where's the bottleneck?
Give him a bicycle so he can go 16 miles an hour on the concrete (but can't use his bicycle on the snow). He'll ride the first mile in .0625 hours, then walk through the snow in 1 hour. The trip will take him 1.0625 hours -- a 17.6 percent speed improvement over walking, using a 300% improvement in resource on the concrete. The concrete isn't the bottleneck.
Now take away the bicycle and give him snowshoes. Imagine he can do 2 miles an hour on snow with reasonable effort with the snowshoes. He walks the first mile in .25 hours, and snowshoes the next mile in .5 hours. Elapsed time: .75 hours. Here a mere 100% improvement in snow performance yields a 66.7 percent speed improvement. Now that's more like it. The snow is the bottleneck.
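Joe's arithmetic can be sketched in a few lines of Python. The function names are mine, purely for illustration:

```python
# Joe's 2-mile trip to school: 1 mile of concrete, 1 mile of snow.
# Trip time is the sum of each segment's distance divided by its speed.

def trip_hours(concrete_mph, snow_mph, miles_each=1.0):
    """Total hours for one concrete segment plus one snow segment."""
    return miles_each / concrete_mph + miles_each / snow_mph

walking   = trip_hours(concrete_mph=4,  snow_mph=1)   # 1.25 hours
bicycle   = trip_hours(concrete_mph=16, snow_mph=1)   # 1.0625 hours
snowshoes = trip_hours(concrete_mph=4,  snow_mph=2)   # 0.75 hours

def speedup_pct(before_hours, after_hours):
    """Percent increase in average speed (distance is fixed)."""
    return (before_hours / after_hours - 1) * 100

print(f"walking:   {walking} hours")
print(f"bicycle:   {bicycle} hours ({speedup_pct(walking, bicycle):.1f}% faster)")
print(f"snowshoes: {snowshoes} hours ({speedup_pct(walking, snowshoes):.1f}% faster)")
```

Quadrupling the concrete speed buys less than an 18% speedup; merely doubling the snow speed buys nearly 67%. That's the bottleneck at work.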
This simple example shows some principles of bottleneck theory. First, "fixing" non-bottlenecks is almost useless. Joe could have done the concrete mile at the speed of light, and it still would have taken slightly more than an hour to get to school.
The second illustrated principle is that a good way of finding the bottleneck is to see what has a material effect on the system as a whole. In this case, a twofold increase in bottleneck speed produced an almost twofold increase in total system performance.
Perhaps the easiest introduction to the concept of bottlenecks is geometry, in which lengthening of one component (the bottleneck) produces a commensurate lengthening of the whole, while lengthening of the other fails to produce a significant increase in the whole.
Sometimes the bottleneck and non-bottleneck are additive. In these cases the best representation is two line segments, as shown below:
(Long part = 4, short part = 1)
(Double the non-bottleneck, increase total by 20%)
(Double bottleneck, increase total by 80%)
In other cases the bottleneck and non-bottleneck are not additive, in which case it can often better be represented by a right triangle. Note that in these cases there's an even stronger bias toward fixing the bottleneck and leaving the non-bottleneck alone.
(Long=4, Short=1, Hypotenuse=4.12)
(Double Short side, Hypotenuse increases 8.5%)
(Double Long side, Hyp increases 95%)
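Both geometric cases are easy to check numerically. Here's a brief sketch (the variable names are mine):

```python
import math

# Additive case: total = long part + short part.
long_part, short_part = 4.0, 1.0
base         = long_part + short_part          # 5.0
double_short = long_part + 2 * short_part      # 6.0 -> +20%
double_long  = 2 * long_part + short_part      # 9.0 -> +80%

# Non-additive case: total = hypotenuse of a right triangle.
hyp_base         = math.hypot(long_part, short_part)       # ~4.123
hyp_double_short = math.hypot(long_part, 2 * short_part)   # ~4.472 -> +8.5%
hyp_double_long  = math.hypot(2 * long_part, short_part)   # ~8.062 -> +95.5%

def pct_increase(before, after):
    return (after / before - 1) * 100

print(pct_increase(base, double_short))                     # 20.0
print(pct_increase(base, double_long))                      # 80.0
print(round(pct_increase(hyp_base, hyp_double_short), 1))   # 8.5
print(round(pct_increase(hyp_base, hyp_double_long), 1))    # 95.5
```

Note how the right-triangle case punishes non-bottleneck improvement even more harshly than the additive case: doubling the short side moves the hypotenuse only 8.5%.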
This article will be old hat to you Quality Movement people, but please be patient. A fundamental teaching of the Quality Movement is that you never optimize the parts; you optimize the whole. Goldratt spends a substantial portion of his books driving home that same lesson. But is this also true for machines? You bet it is!
Grab the manual that came with your computer's motherboard. Turn to the chapter on BIOS settings, and note the default settings. Note the default settings for CPU internal cache and CPU external (L2) cache. Both on. Cache is an extra bunch of (very fast) memory which duplicates main memory. In order to use cache the CPU needs to synchronize the cache memory with the main memory by running a sophisticated (extra) program. Doesn't that slow down the CPU? You bet it does. Doesn't it create more work for the computer as a whole? Absolutely. Doesn't it therefore slow down the whole computer? No, it speeds it up!
With Cache off, the CPU is bottlenecked by the need to access slow (60ns for EDO) memory. Not only is the memory slow, but it must traverse a really slow system bus to get to the CPU. The CPU spends a lot of time just waiting for memory.
Now turn on the cache. The CPU's cache handling software keeps copies of the most recently accessed memory locations in the ultra-fast (up to 10 times faster) cache memory. Since most computer programs access the same memory locations repeatedly, up to 90% of all memory accesses are to fast cache (called "cache hits") rather than the slow EDO.
Let's look at the math. Assume a program that, with cache disabled, takes 1 minute to run: 40 seconds are memory accesses, while 20 seconds are pure CPU calculations (no disk access).
Now enable the cache, which contains enough memory so that the program gets 90% cache hits. Assume the cache memory is 5 times faster than system memory. Assume that the CPU cache handling software makes the CPU 20% slower.
First, we know we have 20 seconds of pure CPU calculations. When we turn on cache, the 20% CPU performance penalty goes into effect, so the program spends 24 seconds on CPU calculations.
Since the cache hit percentage is 90%, we know that 10% of memory access is the same old slow speed. Since non-cached memory access took 40 seconds, non-cache-hit memory access in the cached system will take 10% of 40 seconds, or 4 seconds.
Cache-hit memory accesses are 90%. If cache memory were the same speed as system memory, that would be 90% of 40 seconds or 36 seconds. However, since cache is five times faster than system memory, cache-hit accesses would be 36/5, or 7.2 seconds.
So the program would take the following:
| | No Cache | With Cache |
|---|---|---|
| Pure CPU calculation: | 20 seconds | 24 seconds |
| Non-cache-hit memory access: | 40 seconds | 4 seconds |
| Cache-hit memory access: | 0 seconds | 7.2 seconds |
| Total: | 60 seconds | 35.2 seconds |
| % Saving ((60-35.2)/60): | n/a | 41.3 percent |
In other words, by de-optimizing the CPU by 20%, we cut the computer's run time by 41%. What we really did, of course, was offload some of the work from the bottleneck, the system memory, and give it to the CPU, which had plenty of time to spare.
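The cache arithmetic above fits in a short sketch. The function and its parameters are mine, for illustration only, with the assumptions straight from the text: 20 s pure CPU, 40 s memory access, a 90% hit rate, cache 5x faster than main memory, and a 20% CPU penalty for cache handling:

```python
def cached_runtime(cpu_s, mem_s, hit_rate, cache_speedup, cpu_penalty):
    """Run time in seconds once the cache is enabled."""
    cpu    = cpu_s * (1 + cpu_penalty)          # 20 s -> 24 s
    misses = mem_s * (1 - hit_rate)             # 40 s -> 4 s
    hits   = mem_s * hit_rate / cache_speedup   # 36 s -> 7.2 s
    return cpu + misses + hits

no_cache   = 20 + 40                                   # 60 s
with_cache = cached_runtime(20, 40, 0.90, 5, 0.20)     # 35.2 s
saving     = (no_cache - with_cache) / no_cache * 100  # ~41.3%
print(round(with_cache, 1), round(saving, 1))
```

Try varying the hit rate: the whole benefit hinges on it, which is why cache-unfriendly access patterns can make cache nearly worthless.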
Disks operate about 100 times slower than the slowest system memory, so they'll be a solid bottleneck in any program doing much disk access. A disk cache works the same way as a memory cache: it cuts slow disk accesses by a significant factor by burdening the under-worked CPU and memory with extra work. The result is a large performance improvement.
There's an urban myth that using Disk Compression on your computer slows it down. Don't believe everything you hear. I have a Pentium II, 300 MHz, with 128 meg of 10ns SDRAM and a Western Digital 8.2 gig disk. When I first built the system, I used a well known benchmarking program to test various configurations.
To see how badly my disk compression was slowing performance, I removed disk compression. Much to my surprise, my total system performance slowed even more. I reran the test several times, and was able to consistently toggle faster performance by turning on disk compression. The (total system) performance improvement with disk compression was 3% to 10%, depending on interpretation.
Upon a minute's consideration, it's not surprising. My disk is 1000 times slower than my SDRAM and CPU. Even with disk caching, it's still a stone bottleneck. Every single sector I write to that disk comes right off the system performance.
Now enable disk compression. Imagine, for the sake of argument, that my disk is 100 times slower than the rest of my system after enabling disk cache. Imagine that only 10% of my program uses the disk. Imagine further that disk compression cuts non-disk system performance in half, and cuts the size of files in half. Represent a unit of non-disk (without compression) work as X. Imagine a unit of disk work is Y, and, as stated above, Y is 100 times greater than X.
Total = 90 * X + 10 * Y
Total = 90 * X + 10 * 100 * X
Total = 1090 * X
Define X1 as a unit of non-disk work WITH COMPRESSION, and Y1 as a unit of disk work WITH COMPRESSION.
Total = 90 * X1 + 10 * Y1
Remember that compression cuts the performance of the CPU in half (so X1 = 2 * X) and cuts the amount of disk work necessary in half (so Y1 = Y * 1/2). Therefore:
Total = 90 * X * 2 + 10 * Y * 1/2
Total = 180 * X + 5 * Y
We stated at the beginning that the disk is 100 times slower than the rest of the system. Therefore:
Total = 180 * X + 5 * X * 100
Total = 680 * X
Savings from compression is (1090-680)/1090 = 37.6%
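The compression arithmetic can be verified in a few lines, with X and Y defined as above (the variable names follow the text; the numbers are the contrived assumptions, not measurements):

```python
# Disk-compression arithmetic from the contrived example.
# Assumptions: 90 units of non-disk work, 10 units of disk work, disk
# 100x slower than the rest; compression doubles non-disk cost and
# halves the disk work.

X = 1.0          # one unit of non-disk work, without compression
Y = 100.0 * X    # one unit of disk work

uncompressed = 90 * X + 10 * Y                # 90 + 1000 = 1090
compressed   = 90 * (2 * X) + 10 * (Y / 2)    # 180 + 500 = 680
saving = (uncompressed - compressed) / uncompressed * 100
print(round(saving, 1))   # 37.6
```

The striking part is that compressed mode wins even though every non-disk instruction runs at half speed: the disk term dominates so thoroughly that halving it swamps everything else.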
Obviously, my 3% to 10% improvement was much less than the 37.6% predicted by the previous (contrived) example. I'd attribute the difference to these real-life facts:
If you've read much on Troubleshooters.Com and Troubleshooting Professional Magazine, you know I define Troubleshooting as restoring the system to its as-designed state. That's a special case, with a guaranteed single "right" answer. You're simply using Bottleneck Analysis to help find THE Root Cause.
Most Bottleneck Analysis occurs while trying to improve, "soup up" or design a system. In a previous example we reviewed how we could put a little extra burden on the CPU to take significant burden off the RAM. It seemed a no-brainer. But what if CPUs cost 100 times more than they do? What if a 386/16 chip cost $2000? Would it be smart to burden your processor with cache overhead? Or would it make more sense to brute force it and fill your entire 32 meg with 5ns memory on a special fast bus?
Everyone knows modern software is horribly inefficient. In 1984 I had a Z80 8-bit processor with 64K of RAM (that's right, K) and a 170K (that's right, K) floppy that ran CP/M WordStar a little faster than I can type. Today I have a Pentium II 300 with 128M of 10ns SDRAM and an 8.4 gig, 9.5ms access hard disk. Sometimes when I use Netscape Gold 3 in editor mode, typing into a table inside a table, I can type ahead of the cursor.
Netscape Gold is my best webpage editing tool. And obviously it's doing a lot more than WordStar, but you get the idea. I'll bet you 10 to 1 that if you gave me the source code to the Netscape Gold editor and a month or two, I could improve its speed by a factor of two.
But would that make sense? Programmer hours are scarce and expensive. Would they be better off using the programmer hours to add needed features, especially when in a few months the speed of computers will double, brute forcing the performance?
In design and improvement, sometimes the real bottlenecks are time and money.
To view such ambiguities, it's interesting to use an electrical analogy. You'd have a driving force as an input, resistors and/or "current sources" as processes, and an output:
Of course, the output will probably have some back pressure:
And, as we all learned in our electrical courses, every voltage source (like the battery shown) has a series resistance. That's what keeps a flashlight battery from pumping 400 amps into a short circuit:
Now, to illustrate a point, lets add a second process (resistor) to the system, and put values on everything:
Note the driving force minus the back pressure is 9 volts. The total resistance is 2.2 + 100 + 1200 + 22, or 1324.2. Among the resistors, proc2 is the bottleneck. A change to any other resistor would be insignificant. The only resistor that could double the output current is proc2.
But there's another way to double the output current. You could increase the driving force to 21 volts. Note however, that reducing the output's backpressure even to zero would nowhere near double the output current. This seems to typify many bottleneck situations, where there's one driving force bottleneck, and one force consumer bottleneck, either of which can increase performance. Note the following:
Note that the decision of driving force over resistance (or even a combination of both) is economic.
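Ohm's law makes these trade-offs easy to play with. Here's a sketch using the resistor values above; the 538-ohm figure and the alternate voltages are my own back-of-envelope illustrations, not values from the schematic:

```python
# Series circuit from the example: net driving force (driving force minus
# back pressure) of 9 V across 2.2 (source) + 100 (proc1) + 1200 (proc2)
# + 22 (output) ohms.

def current_ma(net_volts, resistances_ohms):
    """Series-circuit current in milliamps."""
    return net_volts / sum(resistances_ohms) * 1000

base = current_ma(9, [2.2, 100, 1200, 22])        # ~6.80 mA
# Cutting the bottleneck (proc2) to 538 ohms doubles the current...
less_proc2 = current_ma(9, [2.2, 100, 538, 22])   # ~13.59 mA
# ...while zeroing a non-bottleneck barely matters.
no_proc1 = current_ma(9, [2.2, 0, 1200, 22])      # ~7.35 mA
# Doubling the net driving force also doubles the current.
more_volts = current_ma(18, [2.2, 100, 1200, 22]) # ~13.59 mA
print(round(base, 2), round(less_proc2, 2), round(no_proc1, 2), round(more_volts, 2))
```

Two distinct levers double the output: shrink the resistance bottleneck, or raise the driving force. Which one you pull is, as the text says, an economic decision.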
NOTE: In Eliyahu M. Goldratt's books, The Goal and Critical Chain, increasing the driving force (raw materials or the original deadline) does nothing to increase the throughput of the individual processes. Does this rule out the electrical analogy?
No. Mr. Goldratt's examples come from the world of factory floors and project management. In such worlds, resources such as machines usually produce at a specific rate, regardless of upstream pressure. In other words, a punch press produces X per hour whether it has a ten foot stockpile or whether it has a 1 foot stockpile. As long as there's enough input for one last cycle, it produces the same. Once the input flow drops below the punch press's capacity, the punch press produces at the input rate.
There's an electrical equivalent of this -- a current limiter.
(Academically simple, real-world inefficient)
This particular regulator will pass all current below 1 amp, but as the rest of the circuit becomes capable of passing more than 1 amp, this regulator will limit it to 1 amp. Increasing the input voltage will not increase the current above 1 amp. In fact, it will limit current under any voltage pressure up to the point where the voltage or power consumption causes a catastrophic failure of one of its components (could we somehow relate that to job burnout in today's work environment?).
Here's a simple electronic equivalent of Goldratt's factory floor, with fixed-throughput machines and stockpiles (the capacitors are the floor space devoted to stockpiles, and the voltage across each capacitor is the stockpile).
It's just like a very simple factory floor. The 1 amp current limiter is the bottleneck, so the capacitor at its left begins to fill up (stockpile). When that capacitor charges to the point where its back pressure limits the 2 amp current limiter from conducting 2 amps, the cap to the left of the 2 amp begins to charge. This is equivalent to a factory floor where a stockpile becomes so unmanageable that upstream processes shut down and develop their own stockpiles.
The throughput of the circuit above can't be improved by changes to the driving force, the output, or the 2 or 4 amp current limiters. Bigger capacitors won't do the trick. The only way to improve the throughput of this circuit is to increase the capacity of the 1 amp current limiter. The bottleneck.
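Goldratt-style fixed-rate stages can be modeled with nothing more than min(). This sketch is mine, keyed to the 2 A, 1 A and 4 A limiters above:

```python
# In the fixed-rate world, each stage passes at most its rated amperage,
# so steady-state throughput is simply the minimum stage rate.

def throughput(stage_limits_amps):
    """Steady-state flow through a chain of current limiters."""
    return min(stage_limits_amps)

stages = [2.0, 1.0, 4.0]        # the 2 A, 1 A and 4 A current limiters
print(throughput(stages))       # 1.0 -- the 1 A limiter rules

stages[1] = 3.0                 # upgrade the old bottleneck past its neighbors
print(throughput(stages))       # 2.0 -- now the 2 A limiter is the bottleneck
```

Note the second print: over-improving a bottleneck just hands the crown to the next-slowest stage, which is why bottleneck hunting is iterative.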
The "law of diminishing returns" is well known in the general population. It refers to the fact that continued improvement of a specific resource or factor yields progressively smaller gains in overall performance.
For instance, imagine you're a transportational bicycle rider who is in fairly good shape. You define "performance" as your commute time, 12 miles each way, over moderate hills. You want to buy a brand new bicycle. Where do you get the best bang for your buck? Here are your choices:
| Price | Bike | Commute time (hours) | Improvement |
|---|---|---|---|
| $75 | A one-speed coaster brake job. One gear means your legs will often be mismatched to the terrain and wind conditions. You may have to walk up some hills. Worse yet, you'll rev out on tailwind downhill conditions, thus failing to make up the time you lost on the hills. Upright riding position increases wind resistance. | 2.0 | n/a |
| $150 | An inexpensive "mountain bike". Multiple gears mean reasonable speeds on all terrains and wind conditions. Mountain bike riding position is a reasonably low one for reasonable wind resistance. Tire friction from cheap tires, friction from slightly mis-adjusted brakes, missed shifts from an inexpensive derailleur, chain friction from an inexpensive chain, and other minor factors reduce speed, but not to the same extent as wind resistance. | 1.6 | 20% |
| $300 | A reasonable "road bike", with better riding position for less wind resistance. Various mis-adjustments from less expensive components reduce speed, but not to the same extent as wind resistance. | 1.33 | 16.9% |
| $600 | A "road bike" with good components. Frame flex and a slight weight disadvantage reduce speed on acceleration and hills. | 1.26 | 5.3% |
| $1200 | A lightweight "road bike" with good components, and a fairly stiff frame for better acceleration. | 1.22 | 3.2% |
As you can see, the bottleneck at $75 is riding position and gears. At $150 it's riding position. At $300, it's wind resistance (without possibility of further improvement from riding position). At $600 on up you're improving factors that are not the bottleneck, as the results bear out. The average transportational rider wouldn't pay double for a 5.3% improvement, especially when he can probably do better than that by improving his strength, maintaining his bike a little better, or wearing less porous clothing.
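The marginal gains above are simple arithmetic on the table's numbers. A quick sketch (data straight from the table, loop structure mine):

```python
# Commuter's choices: price versus commute time, from the table above.
prices = [75, 150, 300, 600, 1200]
hours  = [2.0, 1.6, 1.33, 1.26, 1.22]

gains = []  # percent commute-time reduction at each price step
for i in range(1, len(prices)):
    extra_dollars = prices[i] - prices[i - 1]
    pct_faster = (hours[i - 1] - hours[i]) / hours[i - 1] * 100
    gains.append(pct_faster)
    print(f"${prices[i]:>5}: {pct_faster:4.1f}% time cut for ${extra_dollars} more")
```

Each doubling of price costs more dollars and buys a smaller percentage: diminishing returns in four lines of output.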
Now imagine you're a world-class bike racer. You're in superb shape. If you train any harder you'll injure yourself. Unlike the transportational rider above, you define performance as the number of seconds you're off the winning time in a 60 mile race. Most of the race is in a draft line, so wind resistance is less of a factor. To keep the draft, you frequently must accelerate quickly, so weight and frame stiffness are definitely factors.
| Price | Bike | Seconds off winning time | Improvement |
|---|---|---|---|
| $1200 | A lightweight "road bike" with good components, and a fairly stiff frame. The components won't stand up to the pounding of the 60 mile race, and the frame flex will inhibit acceleration enough for the rider to lose the draft. | 1800 | n/a |
| $2400 | A production racing bike with a stiff frame and top-notch components. The rider can keep the draft but may need to expend additional energy that may hurt him in the final sprint. | 12 | 15000% |
| $4800 | A custom-made racing bike. Geometry and configuration match the rider's legs, habits and preferences, giving an additional advantage. | 9 | 25% |
| $9600 | Custom made, with everything absolutely top of the line. The differences here are as much psychological as mechanical, but can still yield an advantage. | 8 | 11% |
Here the definition of performance justifies a much more expensive bike. In fact, everything under $2400 is useless. At $1200, the bottleneck is the bike's inability to support adequate acceleration. At $2400 it becomes an issue of the bike's geometry not exactly matching the rider. At $4800, the remaining bottleneck seems to be psychological. And of course, on everything $2400 and over, the rider's strength, endurance and ability are a major factor.
The above examples illustrate several concepts. First, the common concept of "diminishing returns" really expresses the point that once you've improved a bottleneck to the point where it's a non-bottleneck, it's not cost effective to improve it further.
The second point is that "bang for your buck" can mean different things depending on your definition of performance. Once again, the ultimate decision is economic.
And of course, once again we've seen bottlenecks come in driving-force/resistance pairs, and the economic decisions they create (see also Electrical Analogies). Note that past $600, the transportational rider finds it easier to increase his speed by building muscle rather than buying a better bike, while the world-class racer has already developed his legs pretty much to the maximum, so he must exploit every last bit of bicycle technology.
The meeting included rank and file programmers, project managers, the MIS director, and a high-level partner in the organization, who chaired the meeting. One item on the agenda was the Word-Processing Department's HP2000 printer, which was printing at only 1/3 its 20 page per minute specification. The time was early 1989.
You remember 1989. You could buy your favorite song on 45 RPM vinyl. George Bush had just assumed the presidency. The economy-busting Gulf War was still a year in the future. And a 20 page per minute printer was the size of a kitchen stove and cost a king's ransom.
I spoke up, claiming I could find out why the printer was so slow. The high level partner, who hardly knew me, thought for a second, then told me to leave the meeting immediately and begin solving the problem.
I quickly reproduced the symptom -- 7 pages per minute. Printing several pages, I noticed the problem wasn't as bad on partial pages. In fact, using a file containing ten formfeeds and nothing more, it printed at its rated 20 pages per minute. The printer mechanism supported 20, but the per-character rate was slow. To confirm, I printed a page with a sizable graphic. Several minutes!
Why so slow on a byte basis? While trying to swap in an HP LaserJet II to confirm that these files should print faster, I noticed the printer was attached via a serial cable. AHAH! That's probably the problem.
A call to HP confirmed that our printer came with a serial port. To obtain a parallel port, we needed to buy an expensive card for the printer, and have an HP tech install it. Big bucks. I had to be absolutely sure this was the problem before spending the money to upgrade.
The answer was simple and elegant. I configured the computer's serial port to transmit at 4800, instead of its usual 9600 baud. If the serial connection were indeed the bottleneck, the printer's speed would be cut almost in half on large files. Indeed, pages with large graphics did take almost twice as long, ruling out the printer itself. I reported back to the partner, who authorized installation of a parallel port.
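The logic of the 4800-baud test can be sketched as follows. The byte count and per-page overhead here are hypothetical numbers of my own, not measurements from the HP2000:

```python
# If the serial link is the bottleneck, page time is dominated by
# bytes / (baud / 10), so halving the baud rate should roughly double
# the time for byte-heavy pages.

def page_seconds(page_bytes, baud, printer_overhead_s=0.5):
    bits_per_byte = 10   # 8 data bits + start + stop bit (typical async serial)
    return page_bytes * bits_per_byte / baud + printer_overhead_s

graphic_page = 60_000    # hypothetical byte count for a page with a big graphic
t_9600 = page_seconds(graphic_page, 9600)   # ~63 s
t_4800 = page_seconds(graphic_page, 4800)   # ~125.5 s
print(round(t_9600, 1), round(t_4800, 1))
```

Had the printer's engine been the bottleneck instead, the per-page overhead would dominate and the baud change would barely register. The near-doubling is what fingered the serial link.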
This illustrates that often the best way to prove a bottleneck is to REDUCE its throughput. There are two advantages:
The work was ordered. Serious money was on the line. Would I look like a hero, or a bum?
A few days later, as Communism crumbled throughout Europe, the Word-Processing Department's HP 2000 assumed its rightful role as a 20 page per minute printer.
If you're like me, this month's Troubleshooting Professional Magazine has left you with more questions than answers. I feel especially humbled, because up until now, if you had asked me "do you know Bottleneck Analysis", I would have answered "Of course -- I do it every day". And if you've read many of my writings, you know how much scorn I heap on those who give that answer to the question "Can you Troubleshoot".
Only after reading Critical Chain and writing this issue did I realize there are resistance type bottlenecks, constant-current types, and driving force types. I've just now realized that in the absence of a constant-current type, a system can have one resistance type and one driving force type, either of which can improve performance.
Perhaps most important, look what an important tool bottleneck analysis can be in the improvement of incompletely-defined systems, such as business entities, biological entities, and "souped-up" systems. It's a major bridge between Troubleshooting and Problem Solving (the distinction is explained on the website ProblemSolving.Com). Could mastery of Bottleneck Analysis triple our salaries?
For the reasons above, you can consider this issue part 1. Part 2 will run in a near-future Troubleshooting Professional issue. In the meantime, please email me with any and all observations on Bottleneck Analysis. Believe me, I need the help.
If you're like me, it's humbling to have more questions than answers. I have such a long way to go. It makes me feel less like an authority. At least that's one way to look at it.
The other viewpoint is that third-graders have all the answers, but research scientists have more questions than answers. Their job is to answer those questions one at a time, and to generate more questions.
So maybe you and I are in good company.
We anticipate two to five articles per issue, with issues coming out monthly. We look for articles that pertain to the Troubleshooting Process. This can be done as an essay, with humor, with a case study, or some other literary device. A Troubleshooting poem would be nice. Submissions may mention a specific product, but must be useful without the purchase of that product. Content must greatly overpower advertising. Submissions should be between 250 and 2000 words long.
All submissions become the property of the publisher (Steve Litt), unless other arrangements are previously made in writing. We do not currently pay for articles. Troubleshooters.Com reserves the right to edit any submission for clarity or brevity. Any published article will include a two sentence description of the author, a hypertext link to his or her email, and a phone number if desired. Upon request, we will include a hypertext link, at the end of the magazine issue, to the author's website, providing that website meets the Troubleshooters.Com criteria for links and that the author's website first links to Troubleshooters.Com.
Submissions should be emailed to Steve Litt's email address, with subject line Article Submission. The first paragraph of your message should read as follows (unless other arrangements are previously made in writing):
After that paragraph, write the title, text of the article, and a two sentence description of the author.