Troubleshooters.Com and T.C Linux Library and djbdns Intro Present

Troubleshooting djbdns

Copyright (C) 2011 by Steve Litt, All rights reserved. Material provided as-is, use at your own risk. 







Introduction

A lot of us are using djbdns instead of BIND for DNS. Many (including me) consider djbdns simpler and more secure.

But djbdns can get bunged up by a bad install or a bad distro package, and when it does, it can seem like a huge black box. The fact that it's run by daemontools only makes the black box seem bigger. A bad djbdns install, whether via distro package or compile, can seem overwhelming. This document is written to give you:
  1. An understandable mental model of djbdns and the daemontools application that runs it.
  2. A list of quick and handy diagnostic tests that use that mental model to narrow the problem down to the root cause.
In other words, the purpose of this document is to help the average Linux/Unix/BSD user troubleshoot djbdns.

About Troubleshooting: A Summary

Troubleshooting is the act of restoring a sub-performing system back to its as-designed state. It's performed using the Universal Troubleshooting Process (abbreviated UTP):
  1. Get the Attitude
  2. Make damage control plan
  3. Get a complete and accurate symptom description
  4. Reproduce the symptom
  5. Do the appropriate general maintenance
  6. Narrow it down to the root cause
  7. Repair or replace the defective component
  8. Test
  9. Take pride in your solution
  10. Prevent future occurrence of this problem
The preceding process is used against a Mental Model of the system being repaired. The Mental Model is a block diagram of the system, and is used to decide the optimal diagnostic tests to perform in order to continue narrowing down the root cause scope (Step 6). The decision of which diagnostic to perform next is based on the Quadruple Tradeoff of likelihood, ease, even divisions, and safety.
All other things being equal, even divisions (splitting the remaining possible root cause scope exactly in half) lead to the quickest solution. Likelihood and ease often cause optimal splits to be other than even. You should never do unsafe tests. Either find a way to make a test safe, or don't do it at all.

In order to do everything discussed so far in this article, you need to operate your brain optimally for troubleshooting. Here is a summary of optimal brain operation for troubleshooting:
This article is a short summary of the Universal Troubleshooting Process and related mental state. To learn more, purchase "Twenty Eight Tales of Troubleshooting" from Troubleshooters.Com at http://www.troubleshooters.com/bookstore/, or see our troubleshooting content at http://www.troubleshooters.com/tuni.htm.

djbdns Mental Models

This article contains the mental models necessary to quickly, accurately and systematically find the root cause of a djbdns problem.

djbdns Overview Mental Model

The following diagram is a mental model of a running, or should-be-running, djbdns system. The major components are daemontools, dnscache, tinydns, and the three special user accounts.



You can determine which of these four major components is at fault in a couple minutes. First, take twenty seconds to test all three usernames with the id command. If one or more don't exist, create them and see if the problem went away.
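The account check described above can be scripted. What follows is a minimal sketch, not part of the original procedure. The helper name missing_users is hypothetical, and the account names Gdnscache, Gtinydns and Gdnslog are an assumption based on the names commonly used in stock djbdns installs; substitute whatever names your installation uses.

```shell
# Print each account name that `id` cannot find.
# Prints nothing when every account exists.
missing_users() {
    for user in "$@"; do
        id "$user" >/dev/null 2>&1 || printf '%s\n' "$user"
    done
}

# Example call. The account names are an assumption; use your own:
# missing_users Gdnscache Gtinydns Gdnslog
```

Any name the function prints is an account you need to create before testing further.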

Beyond that, you can test dnscache and tinydns using dig @listenerip domain. If that command yields an answer, the service is acting correctly and can be ruled out, at least for now.

Using various ps commands you can see what's running: whether each process is running too many times, not running, or running once. Last but not least, you can run either dnscache or tinydns without daemontools by getting into the tinydns or dnscache directory (the one you would link to the /service directory) and performing the following command:
./run; log/run
You can run and test daemontools on its own with a test service, but this is covered later in this document. Anyway, the point is that within a couple minutes you've narrowed the root cause scope down to one major component, making for much easier troubleshooting.

Mental Models For a Minimal Service

In the djb world, dnscache and tinydns are both called services. They are both run in the background as daemons, they both are intended to be started and supervised by daemontools, and they both have a mandatory directory structure with certain mandatory files. Qmail is another software package that runs as a djb service supervised by daemontools.
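A quick way to sanity-check a service directory is to confirm that its scripts exist and are executable. The sketch below assumes the two scripts in question are run and log/run, the two this article runs by hand; the function name check_service_dir is hypothetical, and your service may need other files that this does not check.

```shell
# Report scripts that are missing or not executable in a service directory.
# $1 is the service directory, for example /service/dnscache.
# Prints one line per problem and nothing when both scripts look sane.
check_service_dir() {
    dir=$1
    for script in run log/run; do
        if [ ! -f "$dir/$script" ]; then
            printf 'missing: %s\n' "$dir/$script"
        elif [ ! -x "$dir/$script" ]; then
            printf 'not executable: %s\n' "$dir/$script"
        fi
    done
}

# Example call:
# check_service_dir /service/tinydns
```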




Directory structure of a minimalist djb daemon

Minimal djb application log directory tree





Minimal Service: Process and data flow

Tree for the dnscache directory

Block diagram of dnscache


The tinydns tree

Block diagram of tinydns

Diagnostic Tests

Diagnostic tests don't occur in a vacuum. For one thing, they're run on a system with an architecture (layout, Mental Model, whatever you want to call it). This architecture includes specific directories, IP addresses, etc. This is addressed in the Assumptions part of this article.

Then there's the fact that diagnostic tests are run from the highest level, represented by the Preliminary High Level Tests part of this article. From there, based on results, you go to lower level diagnostic tests, represented by the If dnscache Didn't Work and If tinydns Didn't Work parts of the article, and still lower levels such as the If Dnscache Can't Query Local Domains and If Dnscache Cannot Be Queried By Other Hosts parts of this article. I chose to do it as several discrete sub-articles rather than a predefined diagnostic or script because the former is easier to write, easier to use, and easier to mold to the situation.

Once again, here's a list of this article's sub-parts:



Assumptions

In the tests that follow, assume the following:

Preliminary High Level Tests

This set of diagnostic tests should take less than two minutes, and will substantially increase your understanding of the situation. One thing you should notice is all the dig commands specify which

TEST
MEANING
ping 8.8.8.8
See if you can hit the Internet by pinging the main Google public DNS. If you can't ping by IP address, you have deeper problems and no local DNS resolver will work. If this doesn't ping, fix it before going on.
dig @8.8.8.8 -x 8.8.8.8
Make sure that the Google public DNS at 8.8.8.8 reverse-resolves itself. If not, you have a very weird problem: investigate.
dig @192.168.100.2 -x 8.8.8.8
Try to reverse-resolve Google's public DNS server at 8.8.8.8, using your dnscache at 192.168.100.2. The answer should be something like google-public-dns-a.google.com. If this doesn't work, your resolver (dnscache) is not working.
dig @192.168.100.2 google.com
Try to resolve google.com using your dnscache. The answer should be a list of several IP addresses. If this doesn't work, your resolver (dnscache) is not running or not working.


dig @127.0.0.1 wincli.domain.cxm
Query for wincli.domain.cxm on your authoritative DNS server (tinydns) at 127.0.0.1. The answer should be 192.168.100.5. If this doesn't work, your authoritative server (tinydns) is not working or is misconfigured.
dig @127.0.0.1 -x 192.168.100.5
Reverse-query for 192.168.100.5 on your authoritative DNS server (tinydns) at 127.0.0.1. The answer should be wincli.domain.cxm. If this doesn't work, your authoritative server (tinydns) is not working or is misconfigured.

NOTE: If the forward authoritative query works but not the reverse, or vice versa, it's probably a misconfigured tinydns rather than a nonfunctional one. Fix the /service/tinydns/root/data file, and then from within the /service/tinydns/root directory, run make.
ps ax | grep svscan
On standard installations this command should produce a line of output looking like this:
 1031 ?        S      0:21 svscan /service
If the directory after the word svscan is anything other than /service, then you have a non-standard installation, and you need to carefully evaluate everything. In this document, everywhere I use the directory /service, it's because that's the directory listed in the svscan process. If that process lists another directory, either all symlinks must be to that directory, or you must change your bootup to run svscan on directory /service.
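The ping and dig tests in the table above can be run as one batch. The following is a rough sketch under stated assumptions: the function names has_answer, report and preliminary_tests are hypothetical, the addresses and host name are the ones used in this article, and a dig test counts as passing only if dig +short prints at least one line that isn't a comment (dig writes its own diagnostics, such as timeouts, as lines starting with a semicolon).

```shell
# Succeed only if the command prints at least one line that is not a
# dig comment. Comment lines begin with ';'.
has_answer() {
    "$@" 2>/dev/null | grep -v '^;' | grep -q .
}

# Run one named check quietly and print PASS or FAIL for it.
report() {
    label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        printf 'PASS  %s\n' "$label"
    else
        printf 'FAIL  %s\n' "$label"
    fi
}

# The ping and dig tests, in the order given in the table above.
preliminary_tests() {
    report 'ping Internet by IP'        ping -c 1 8.8.8.8
    report 'Google reverse via Google'  has_answer dig +short @8.8.8.8 -x 8.8.8.8
    report 'dnscache reverse lookup'    has_answer dig +short @192.168.100.2 -x 8.8.8.8
    report 'dnscache forward lookup'    has_answer dig +short @192.168.100.2 google.com
    report 'tinydns forward lookup'     has_answer dig +short @127.0.0.1 wincli.domain.cxm
    report 'tinydns reverse lookup'     has_answer dig +short @127.0.0.1 -x 192.168.100.5
}

# To run the whole batch:
# preliminary_tests
```

The first FAIL line tells you which of the lower level sections to go to next. This batch only checks that an answer came back; you still need to eyeball the answers themselves when something looks off.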

If dnscache Didn't Work

If dnscache didn't work, you can safely assume you have some sort of dnscache problem. The problem might be as simple as somebody turned off dnscache, but one way or another you need to find the problem.

Note that the problem might be that the daemontools software isn't doing its job and running dnscache. In that case, some of these diagnostics run dnscache without daemontools, thus either ruling out dnscache itself as a cause, or giving you a way to troubleshoot dnscache without daemontools.

SITUATION
TEST
MEANING
If dnscache didn't work
ps ax | grep dnscache
This shows what dnscache software is running. If everything's correct, it should look like this:
slitt@mydesk:~$ ps ax | grep dnscache
  568 pts/22   S+     0:00 grep dnscache
30977 ?        S      0:00 supervise dnscache
30979 ?        S      0:19 /usr/local/bin/dnscache
slitt@mydesk:
If you don't see the line referring to /usr/local/bin/dnscache, your dnscache software is not running and that's why you cannot forward or backward resolve. If you also don't see the line referring to supervise dnscache, that's an indication that the root cause resides in the daemontools software, and there may be nothing wrong with your dnscache.


If the supervise dnscache line is there but not the /usr/local/bin/dnscache line
svc -u /var/djb/service/dnscache
sleep 5
ps ax | grep dnscache
dig @192.168.100.2 google.com

Dnscache might have just been turned off, in which case the first command turns it back on. It might take as long as 5 seconds to turn it back on, hence the sleep command. The final two commands test to see whether dnscache now works.
If the preceding test failed to fix the problem, or if there was no supervise dnscache process
cd /var/djb/service/dnscache
./run
## CHECK FOR ERROR MESSAGES
ps ax | grep /usr/local/bin/dnscache
dig @192.168.100.2 google.com
Here you're running dnscache without using daemontools. You are NOT running the dnscache logging facility. You're running dnscache in the foreground, so you can see and correct all error messages. Also, whenever you query this dnscache, you'll see what would normally go into the log. The final two lines check whether dnscache can work on its own, without daemontools.

If you need to shut off this foreground process, just type Ctrl+C to stop it.
If dnscache still doesn't work
ps ax | grep dnscache
If you see multiple instances of supervise dnscache or /usr/local/bin/dnscache, that could be causing dnscache to malfunction. See the section titled Killing Multiple Instances.
If you got dnscache working

If you got it working with the svc -u command, all is well. If you had to do the ./run command, that means that dnscache works OK on its own, but it's not being reliably started by daemontools. See the section titled Troubleshooting Daemontools Problems.
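The reading of the ps output described in this table can be automated. Here is a minimal sketch, assuming the daemon lives in /usr/local/bin as it does in the sample ps output above; the function name classify_service and the wording of the result strings are hypothetical. It reads ps ax output on standard input and counts the supervise line and the daemon line for one service.

```shell
# Read `ps ax` output on stdin and classify the named service.
# $1 is the service name, for example dnscache or tinydns.
# Assumes the daemon binary is /usr/local/bin/<name>.
classify_service() {
    svc_name=$1
    awk -v name="$svc_name" '
        BEGIN { sup = 0; daemon = 0 }
        # Ignore a grep process if the caller filtered the listing first.
        / grep /                          { next }
        index($0, "supervise " name)      { sup++ }
        index($0, "/usr/local/bin/" name) { daemon++ }
        END {
            if (sup > 1 || daemon > 1) {
                print "multiple instances"
            } else if (sup == 1 && daemon == 1) {
                print "ok"
            } else if (sup == 1) {
                print "supervised but down"
            } else if (daemon == 1) {
                print "running unsupervised"
            } else {
                print "not running"
            }
        }'
}

# Example call:
# ps ax | classify_service dnscache
```

A result of "supervised but down" corresponds to the svc -u row of the table, "not running" to the ./run row, and "multiple instances" to the section titled Killing Multiple Instances.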

If tinydns Didn't Work

If tinydns didn't work, you can safely assume you have some sort of tinydns problem. The problem might be as simple as somebody turned off tinydns, but one way or another you need to find the problem.

Note that the problem might be that the daemontools software isn't doing its job and running tinydns. In that case, some of these diagnostics run tinydns without daemontools, thus either ruling out tinydns itself as a cause, or giving you a way to troubleshoot tinydns without daemontools.

SITUATION
TEST
MEANING
If tinydns didn't work
ps ax | grep tinydns
This shows what tinydns software is running. If everything's correct, it should look like this:
slitt@mydesk:~$ ps ax | grep tinydns
21550 pts/22   S+     0:00 grep tinydns
30932 ?        S      0:00 supervise tinydns
30935 ?        S      0:00 /usr/local/bin/tinydns
slitt@mydesk:
If you don't see the line referring to /usr/local/bin/tinydns, your tinydns software is not running and that's why authoritative DNS isn't working. If you also don't see the line referring to supervise tinydns, that's an indication that the root cause resides in the daemontools software, and there may be nothing wrong with your tinydns. If you see more than one of either, see the section titled Killing Multiple Instances. If you see exactly one of each, that's an indication that your tinydns would work just fine if only it were configured correctly, in which case you should see the section titled Configuring Your data File.
If the supervise tinydns line is there but not the /usr/local/bin/tinydns line
svc -u /var/djb/service/tinydns
sleep 5
ps ax | grep tinydns
dig @127.0.0.1 wincli.domain.cxm
Tinydns might have just been turned off, in which case the first command turns it back on. It might take as long as 5 seconds to turn it back on, hence the sleep command. The final two commands test to see whether tinydns now works.
If the preceding test failed to fix the problem, or if there was no supervise tinydns process
cd /var/djb/service/tinydns
./run
## CHECK FOR ERROR MESSAGES
ps ax | grep /usr/local/bin/tinydns
dig @127.0.0.1 wincli.domain.cxm
Here you're running tinydns without using daemontools. You are NOT running the tinydns logging facility. You're running tinydns in the foreground, so you can see and correct all error messages. Also, whenever you query this tinydns, you'll see what would normally go into the log. The final two lines check whether tinydns can work on its own, without daemontools.

If you need to shut off this foreground process, just type Ctrl+C to stop it.
If tinydns still doesn't work
ps ax | grep tinydns
If you see multiple instances of supervise tinydns or /usr/local/bin/tinydns, that could be causing tinydns to malfunction. See the section titled Killing Multiple Instances.
If you got tinydns working
If you got it working with the svc -u command, all is well. If you had to do the ./run command, that means that tinydns works OK on its own, but it's not being reliably started by daemontools. See the section titled Troubleshooting Daemontools Problems.
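The svc -u tests above use a fixed sleep 5 because the service can take a few seconds to come back. An alternative is to poll until the service responds. The following is a minimal sketch; the function name wait_for is hypothetical, and it relies on dig exiting with a nonzero status when it gets no reply from the server, which is my understanding of dig's documented exit codes.

```shell
# Retry a command once per second until it succeeds or attempts run out.
# $1 is the maximum number of attempts; the rest is the command to run.
wait_for() {
    attempts=$1
    shift
    while [ "$attempts" -gt 0 ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        attempts=$((attempts - 1))
        sleep 1
    done
    return 1
}

# Example: bring tinydns up, then wait up to 5 seconds for any reply.
# svc -u /var/djb/service/tinydns
# wait_for 5 dig +time=1 +tries=1 @127.0.0.1 wincli.domain.cxm
```

Polling this way only tells you that the server replied. You should still run the dig command by hand afterward to confirm the answer is the one you expect.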

If Dnscache Can't Query Local Domains

SITUATION
TEST MEANING

If Dnscache Cannot Be Queried By Other Hosts


SITUATION
TEST MEANING

Killing Multiple Instances

Troubleshooting daemontools Problems

Configuring Your data File





