Troubleshooters.Com Presents

Linux Productivity Magazine

September 2006

Making Backups Easier

Copyright (C) 2006 by Steve Litt. All rights reserved. Materials from guest authors copyrighted by them and licensed for perpetual use to Linux Productivity Magazine. All rights reserved to the copyright holder, except for items specifically marked otherwise (certain free software source code, GNU/GPL, etc.). All material herein provided "As-Is". User assumes all risk and responsibility for any outcome.

See also Troubleshooting Techniques of the Successful Technologist
and Rapid Learning: Secret Weapon of the Successful Technologist
by Steve Litt




We're entering a new world in which data may be more important than software.  -- Tim O'Reilly

CONTENTS

Editor's Desk
Is Linux Productivity Magazine Back for Good?
My Backup Philosophy
Activities in Rsync Assisted Backup Systems
Passwords and Password Alternatives
Musings on Rsync
Life After Windows: What does this have to do with Linux?
GNU/Linux, open source and free software
Letters to the Editor
How to Submit an Article
URLs Mentioned in this Issue

Editor's Desk

By Steve Litt
It was a tough day. I was frazzled. Then things went bad.

My wife asked me to find an urgently needed email. My automated email search swamped the processor. I deleted all the emails from what I thought was the trash folder. A couple minutes later I looked at my inbox and saw nothing there. Two and a half years of email gone!

No problem. I'd restore from backup. Except that my latest backup was a month and three days old. All email between 7/3 and 8/10/2006 that wasn't associated with a mailing list or directly inquiring about my troubleshooting course was gone. Irretrievably!

Step 9 of the Universal Troubleshooting Process is "Take Pride". In that step not only do you take pride in your accomplishment (losing a month of email isn't an especially proud accomplishment), but you ask how you were brilliant (again, that's laughable) and how you can do better next time.

How could I do better next time? I needed to identify the root cause of the data loss and fix that root cause to prevent future occurrence.

One thing's for sure -- it wasn't for lack of knowledge. My basic backup strategy was formed in the 1980's, recorded in the July 1998 Troubleshooting Professional Magazine, and revised for Linux in the August 2002 Linux Productivity Magazine.

What was obvious is that I didn't follow my own procedures. My procedures call for weekly backups. The backup on the first of the month is to write-once media, while all the rest are to rewriteables. If I'd followed my procedures, I'd have lost only a single week. Why didn't I follow my procedures?

Because they were immensely inconvenient. That nice little script that worked so well with 200 MB backups was hours-long drudgery with 7GB backups. During backup I couldn't work, first because working would cause a verification failure, and then because working would cause a DVD write buffer overrun. The difficulty and inconvenience of these huge backups caused me to forego my own backup procedures and policies, and my departure from those procedures and policies caused me to lose a month of general purpose email.

An easier backup method was needed.

My thoughts drifted back to an August 2005 GoLUG presentation by Kevin Korb, entitled Backups using rsync (URL in URLs section of this magazine). Kevin had demonstrated how you could use rsync to back up every single day, and it would make a complete backup, but only transfer the files that had been changed. In other words, I could back up my computer in 5 minutes per day.

Kevin also mentioned that of course you need backups on tape or CD or DVD or some other cheap removable media, but if you use the Rsync method you can back up the Rsync mirror rather than your live machine, which means you can continue working full speed ahead while your backup is being burned to DVD.

Last but not least, Kevin demonstrated how you could use Unix type hard links to create a system of incremental backups whose disk space used was only the space needed for the changed files, but each looked like a full backup and restored like one.

So Kevin's backup techniques yield the best of all worlds:
  1. Complete daily backups that take only minutes, because only changed files are transferred.
  2. Incremental backups that consume only the disk space of the changed files, yet each looks like a full backup and restores like one.
  3. Removable media burned from the mirror on the backup server, so you keep working full speed while the burn happens.
In August 2005 I'd made a mental note to investigate Kevin's methods, but you know how things go -- there were always other priorities. When I lost a month's emails out of one mailbox a year later, priorities changed. I stopped all work and used Kevin's techniques to completely revamp my backup system.

My August 10, 2006 loss of a month's email in a single mailbox was one of the most fortunate things that ever happened to me. It was disastrous enough to get my attention, but nowhere near as disastrous as a loss of a month's worth of ALL my data would have been. Had I lost my troubleshooting course mailbox, or my records of customer purchases, or the month's worth of work on my new book, it could have set me back months or even endangered my business.

This Linux Productivity Magazine updates the backup philosophy espoused in the July 1998 Troubleshooting Professional Magazine, and the Linux-centric backup techniques discussed in the August 2002 Linux Productivity Magazine. It explains how to implement Kevin Korb's backup techniques for a "power user" Linux desktop. So kick back, relax, enjoy, and remember that if you use GNU/Linux, this is your magazine.
Steve Litt is the author of Troubleshooting Techniques of the Successful Technologist.   Steve can be reached at his email address.

Is Linux Productivity Magazine Back for Good?

By Steve Litt
Linux Productivity Magazine ceased publication in the summer of 2004, after hurricanes Charley and Frances wiped out our roof and temporarily stopped all Troubleshooters.Com business activities. This is the first Linux Productivity Magazine since then. From now on, Linux Productivity Magazine will be an occasional publication. For that reason the magazines will no longer have volume and issue numbers. From now on, new issues of Linux Productivity Magazine will appear only when I have suitable content and the time to write it.

So check back from time to time. It will be worth it.
Steve Litt is the author of Manager's Guide to Technical Troubleshooting.   Steve can be reached at his email address.

My Backup Philosophy

By Steve Litt
My backup philosophy was completely described in the July 1998 Troubleshooting Professional Magazine. The August 2002 Linux Productivity Magazine did not change that philosophy at all, but simply adapted it for Linux, and newer hardware and free software archiving solutions. My backup philosophy didn't change appreciably between 1998 and 2006:
Upon backup, comparison must be successfully made between the burned backup and the material on the hard disk. Then a CRC (or md5sum in Linux) of the backed up data must be included on the media so that years from now, the validity of the backup can be ascertained.

The backup must back up the directories you need backed up, and this must be verifiable. This can be done by performing a tree command or a tar tzvf command.
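Here's a minimal sketch of both checks, assuming a hypothetical tarball named bup.tgz in the current directory:

md5sum bup.tgz > bup.md5    # record the checksum alongside the backup
md5sum -c bup.md5           # later, even years later: verify the tarball still matches
tar tzvf bup.tgz | less     # list the archive's contents to confirm the right trees went in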

Restorability implies restorable soon after backup, restorable years after backup, and restorable after disaster (hurricane, earthquake, terrorist act). This means reliable media for the short term, and ubiquitously common and accepted media and file format for the long term, offsite backups, and even out of state backups.

Easy to use means configurability, minimum of human intervention, minimum thought, and minimum time off work. We'll discuss that later in this article.

As for what makes a good backup system, it must include a proper mix of daily, weekly, monthly, quarterly and yearly backups. It must have provision for offsite and even out of state backups.

Ease of Use

Although my backup philosophy was first written in 1998, I've been conscious of it since the late 80's or early 90's. Watching a coworker try and fail to restore three different backups convinced me that restorability must never be taken for granted. The sight of floppy failures, and even more so frequent tape failures, taught me the need for good media. Failure to restore proprietary formats taught me the value of ubiquitous hardware and software formats. And of course my May 1987 disk crash sans backup taught me that backups must be frequent and regular.

In the late 80's and early 90's I backed up to floppy disk -- the only format available to a man of modest means. I used Fastback and Central Point Backup to write the floppies. Even though I had less than 100MB of data back in those days, the backup process was a gruelling, error prone exercise in human intervention. Basically you took most of the day off and backed up your computer. Unfortunately, with the exception of expensive and error prone tape, floppies were the only choice.

I backed up to tape briefly between 1993 and 1995. Windows 95 could not accommodate the tape drive, so back to floppies I went after installing Win95. Now saddled with much more data than in the early 1990's, backups were a miserable all-day affair, but they had to be done monthly, and so they were. Backups spanned 40 floppies, and oftentimes one bad floppy meant starting the whole process over again. Media for a single backup set cost over $40.00. "Hot" projects were backed up to other parts of the hard disk on a daily basis.

In 1996 the IOMega Zip Drive changed my life. At 100MB per disk, I could set the backup process running, leave for an hour or two, and come back to a backed up computer. I couldn't work on my computer during that time, but I could do other work or exercise or chores. Life was easy!

But my data grew, and soon a backup required two or even three Zip disks. Sure, life was easier than in the floppy days, but it now required significant user intervention to switch the Zip disks.

Life got easier again in late 1998 when I bought a CD writer. Once again, a backup consumed a single piece of media, and was accomplishable with a "set it, forget it" process. My first CD writer was an incredibly slow 2x, but subsequent ones got faster, until whole backups could be done in well less than an hour.

It didn't last. By early 2001 I needed two CDs for a backup, by mid 2002 I needed three. 2003 brought four CD backups, and by the time I had to make backups during the 2004 hurricane assault on Florida, I was up to five. In January 2005 I bought a DVD writer, and once again I had a one media backup, powered with a set it/forget it script.

And once again I grew out of it. I joined lots of mailing lists whose messages required backup. My wife bought a digital camera whose images were easier to process and store on my computer than hers, but needed backup. In August 2005 my backups started using two DVDs, breaking my carefully crafted backup scripts. As I write this in August 2006, it's getting harder and harder to fit my backups on two DVDs. I've thought of getting double-sided DVDs, but how could you handle those without getting fingerprints on a recording surface?

Data Grows

The point is this: Data grows! Throughout your lifetime you accumulate more data than you purge (absent a catastrophic disk crash sans backup), and as time goes on the number and size of your files keep growing. As backup media grow bigger and faster, they barely keep up with your increased backup needs. Today, backing up to DVD on my desktop computer takes almost as much time and effort as my 1989 CPBackup and floppy backups.

There's a Better Way

There was no choice in 1989. There was no such thing as a network for 1989's Average Joe. Who had the money for their own copy of Novell Netware? Who had the money for network cards, and who had the expertise to get it all running? If you wanted to back up in 1989, and didn't have a king's ransom for a tape drive, you backed up directly from your desktop to floppy.

Windows 95 democratized networking by including TCP/IP in the operating system. So did GNU/Linux. By the mid 1990's you could have backed up over a network, and then written the result to Zip drive or tape. Except that back then few had the money for multiple computers.

And don't forget the cost of hard disks. I think it was 1994 when Egghead Software advertised a 1GB drive for only $899.95. I called to see if it was in stock, rushed right in and bought it, immediately left the store before they could "realize their mistake", drove a block away, parked in a parking lot, and did a victory dance. That night I bragged to all my friends that I'd bought a disk for less than a buck a Megabyte.

Now Sams Club has a 200GB hard drive for $99.95. That's fifty cents a Gigabyte. I won't buy it. I think I can get a better deal.

Somewhere between 1998 and 2003, the average Joe Geek came to own two network-equipped computers, and could afford the RAM and drive space to make them both useful. The concept of a backup server was now possible. The one remaining problem was this: Fast as 100 megabit networks can be, it still takes a long time to transfer several gigabytes.

Of course that's no problem at all -- Rsync has been around for a long time -- at least since February 1999, when its original author, Andrew Tridgell, completed his PhD thesis.

NOTE
If Andrew Tridgell's name sounds familiar, it's probably because of another free software project he originated: Samba.

So once Geeks equipped with the proper equipment learned that Rsync was a great backup solution, and learned how to use Rsync, the "better way" was born.

What Rsync does is compare every file on the acquiring side with the corresponding file on the side being backed up. If the size or modification time differs, the file is transferred. Otherwise, the files are assumed identical, and the file is not transferred. In practice, this means only changed files are transferred -- a huge improvement.

Of course, you and I know the files could be different even though the file's name, size and modification time are identical. But think of what it would take to silently change the file without disturbing the size or mod date. Limited disk corruption could do it. A malicious program could do it through OS level calls: storing the original date and size, modifying the file, truncating to the same size, and setting the mod date back to the original one. Or a malicious program could do it through BIOS level writes to a certain head, cylinder and sector.

Notice that in all of these eventualities, the version you'd probably want to keep is the version on the backup machine, because the version on the desktop being backed up is probably corrupt. If you're really paranoid about the possibility of silently changed files, my suggestion would be to generate a monthly report on them. I haven't investigated it, but I think you could create such a report with the -I and --only-write-batch=FILE options. You could also use the -c (--checksum) option to force checksum comparison even when sizes and mod dates match, but as mentioned in the preceding paragraph, the most likely explanation for silent file changes is disk corruption or malware.
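Here's a minimal sketch of such a monthly report, using the workstation address and mirror path that appear later in this article. The -c option forces checksum comparison, -n makes it a dry run so nothing is transferred, and -i itemizes each file that differs:

#!/bin/bash
# Hypothetical monthly audit, run on the backup server: list files whose
# contents differ from the mirror even though size and date match.
rsync -cni -aHx --numeric-ids \
    slitt@192.168.100.2:/scratch/pictures/ /stevebup/rsync/pictures/ \
    > "$HOME/silent_change_report.txt"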

So basically, the better way is to transfer only changed files to the backup server. This saves network traffic. Because the process runs on the backup server, it conserves your workstation's CPU power. In the better way, you can have a collection of backups that use only the space of incremental backups, but contain all files, like a full backup. Last but not least, when it's time to burn your weekly or monthly copy to removable media, this is done on the little used backup server rather than your heavily used workstation.

You needn't buy a new computer for use as your backup server. If you have a coworker or family member who doesn't come near using the capacity of his or her computer, you can use that computer as your backup server. You'll ssh in under your own username, which he or she can't touch. Because it's your own account, you also can't accidentally delete or change his or her data.

My Philosophical Change

For the most part, my backup philosophy hasn't changed in 15 years. However, the definition of "easy to use" has changed to accommodate both the multigigabyte size of today's backups, and the availability of other computers, cheap networking, and the Rsync program.

This definition change creates a tactical change in which I now back up in two stages:
  1. Incrementally back up over the network to update a "mirror"
  2. Burn to DVD from the "mirror"
So it's not really a philosophical change. It's simply that over the years, what's considered "easy" has changed.
Steve Litt is the author of the Twenty Eight Tales of Troubleshooting.   Steve can be reached at his email address.

Activities in Rsync Assisted Backup Systems

By Steve Litt
Rsync assisted backup systems accommodate the following activities:
  1. Pulling changed files from the workstation to the backup server.
  2. Using hardlinks to create "incremental backups".
  3. Creating tarballs from the backup server's directory mirror.
  4. Recording backup metadata.
  5. Burning the tarballs onto removable media.
  6. After-burn verification.
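Each of these activities is covered in detail below. As an overview, here's a hypothetical sketch of how the first four might be chained on the backup server. The script names are illustrative, and burning and verification stay manual because they require swapping discs:

#!/bin/bash
# Hypothetical nightly driver, run on the backup server.
./metabup             || exit 1   # record timestamp, copy backup scripts (activity 4)
./pull_all_rsyncs.sh  || exit 1   # umbrella script running every rsync pull (activity 1)
./make_incremental.sh || exit 1   # hardlink snapshot of the fresh mirror (activity 2)
./makeTgzs.rb                     # build, verify and checksum the tarballs (activity 3)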

Pulling changed files from the workstation to the backup server.

This is typically a 5 minute process that should be done every day. By doing it every day, a disk crash on a single computer costs you only one day's worth of data.

This is simple enough. Let's start by viewing the script that backs up my digital photos:

RSYNC_RSH="ssh -c arcfour -o Compression=no -x"
rsync -vaHx --progress --numeric-ids --delete \
--exclude-from=pbup_backup.excludes --delete-excluded \
slitt@192.168.100.2:/scratch/pictures/ /stevebup/rsync/pictures/

The preceding is based on Kevin Korb's "Backups using rsync" presentation notes (URL in URLs section). The script is two commands. The first sets an environment variable to facilitate Rsync transfer over an ssh session spawned by Rsync itself. The second command, which is written in three lines to prevent walking off the screen, pulls all changed files from /scratch/pictures/ on the workstation (192.168.100.2) to the mirror directory at /stevebup/rsync/pictures on the backup server, where this command is being run.

Here's an explanation of the options for the first command:
-c arcfour: Use the arcfour cipher for encryption. It's fast.
-o Compression=no: Don't compress the data before sending it over the network wire. The idea here is that the time it takes to compress and decompress the data will exceed the time saved by transmitting fewer bytes over the network, which may or may not be true depending on network load and on processor capability and load. You might want to experiment.
-x: Disables X11 forwarding. Rsync has no GUI component, so why incur the performance penalties and security risks of allowing GUI traffic through the ssh connection?


Here's an explanation of the options for the second command:
-v: Verbose.
-a: Archive mode -- preserve all times, ownership, permissions and the like.
-H: Preserve hard links.
-x: One file system -- don't cross filesystem boundaries. For instance, if you were backing up /usr/, and /usr/local/ had its own partition, /usr/local/ wouldn't be included. This prevents the backup from wandering onto every filesystem mounted under the tree being backed up. If you really need the separate partition backed up, you can do it with a subsequent command.
--progress: Shows a progress meter. Not essential, but lets the user know the process isn't hung when large files are transferred.
--numeric-ids: Don't map uid/gid values by user/group name. Mapping by name would create havoc if the backup server used different numeric IDs for the user being backed up.
--delete: If the file is no longer on the workstation, delete it from the mirror too. This sounds like an opportunity to lose data, but in fact, because you'll be implementing a hardlink based series of incremental backups, and because you'll frequently be burning the mirror to removable media, deleted files will be available for restoral if needed. If you didn't delete from the mirror the files deleted from the workstation, it would create a housekeeping nightmare.
--exclude-from=pbup_backup.excludes: pbup_backup.excludes contains a list of trees or files you want excluded from the backup. Trees should be terminated with a forward slash (a sample excludes file appears after this list). If you don't want to exclude anything at this time, the file can be blank.
--delete-excluded: Imagine for a moment that, after you've been backing up for awhile, you decide to exclude the temp tree. In order for the mirror to be a true mirror, the temp tree would need to be deleted from the mirror as well. That's what this option facilitates.
slitt@192.168.100.2:/scratch/pictures/: The source directory. Because we're using a "pull to the backup server" rather than "push from the workstation" approach, the source directory is remote and needs the username and IP address (or hostname). The trailing forward slash is mandatory: with it, rsync transfers the directory's contents; without it, rsync would create a pictures directory inside the destination.
/stevebup/rsync/pictures/: The destination directory, which in this case is on the mirror on the backup server. The trailing slash is mandatory here too. Lacking the trailing slash, you might not get what you think you'd get.
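Here's a hypothetical pbup_backup.excludes, just to illustrate the format; the entries are examples, not the ones I actually use:

tmp/
cache/
lost+found/
*.iso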

I have several similar scripts to Rsync several directories, and an umbrella script to run them all.
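The umbrella script can be as simple as this sketch; the per-tree script names are hypothetical, and yours will differ:

#!/bin/bash
# Hypothetical umbrella script (the pull_all_rsyncs.sh of the earlier sketch):
# run each per-tree rsync pull in turn, stopping the chain if one fails.
set -e
./rsync_pictures.sh
./rsync_slitt.sh
./rsync_d.sh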

For more detailed information on Rsync and backup, see Kevin Korb's page (URL in URLs section).

Using hardlinks to create "incremental backups"

Hardlinks are different names for the same file, and are very different from symbolic links. Unlike symbolic links, there's no filename that's "the real filename" while the other(s) is/are "synonyms". With hardlinks, every name is equally important, and equally dispensable. The fact that the file originated under one name doesn't make it any more important or less dispensable once other hardlinks are made to the file. The following shellscript demonstrates these concepts:

#!/bin/bash

function showfile()
{
echo -n $1
echo -n ': '
cat $1
echo
}

rm -f first.txt
rm -f second.txt

echo -n first > first.txt
ln first.txt second.txt
showfile first.txt
showfile second.txt
ls -i -1 first.txt second.txt
echo
echo -n ': append to first' >> first.txt
showfile first.txt
showfile second.txt
ls -i -1 first.txt second.txt
echo
echo -n '-- append to second' >> second.txt
showfile first.txt
showfile second.txt
ls -i -1 first.txt second.txt
echo
echo -n ', redirect to first' > first.txt
showfile first.txt
showfile second.txt
ls -i -1 first.txt second.txt
echo
rm -f first.txt
echo -n 'Brand new first' > first.txt
showfile first.txt
showfile second.txt
ls -i -1 first.txt second.txt
echo

rm -f first.txt
rm -f second.txt
In the code above, the showfile() function simply shows the contents of the file in a clear way.

First we delete both first.txt and second.txt. Next we create first.txt, link in the name second.txt, and prove they contain the same thing. Then we append to first.txt, and show that the appended text shows up in second.txt, because second.txt and first.txt are just different names for the same file. Next we append to second.txt, and once again show the file contents are the same.

Next we redirect an entirely new content into first.txt and display both files, and they're both the same. What this proves is that although the redirect (the single right angle bracket) truncates the file, it doesn't delete and recreate the file.

Finally, we delete the name first.txt and then redirect new text into a new first.txt. This time the two files contain different info, because now they're not two names for the same file. At this point, second.txt is the name of the file that was the original first.txt, while the current first.txt is a brand new file.

Throughout these experiments, the ls -i command demonstrates when the two names refer to the same file content, and when they refer to different file contents.

Hardlinks are multiple names referring to the same file. If the file's contents are changed while referenced from one filename, those changes are visible from the other hardlinks. A simple ls -i command proves that.

Things change when one of the filenames is deleted, and then a new file is created under that filename. Now they refer to two separate files.

This is how the incremental system, recommended by Kevin Korb, works. The cp -al command copies a tree by making another tree whose every file and directory is a hard link to the original. Immediately after the copy, the two trees are identical, and in fact indistinguishable in every respect except the name of the top directory.

Then you perform an Rsync whose destination is the original tree. If a file on the workstation being backed up is changed, what Rsync does on the backup server is delete and recreate the file. This breaks the "link" to the original version of the file in the primary mirror tree, but in the incremental tree created with the cp -al command, the original file is still alive and well.

What this means is with Kevin's ingenious increment method, with 20 increments plus the mirror, an unchanged file uses space only once, and has 20 names pointing to it (they're actually the same filename, but in different directory trees). Only files that have changed consume extra space.
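You can see for yourself how little extra space the increments consume, because within a single invocation du counts each hardlinked file only once. This is just a sketch, assuming the increments live beside the mirror under /stevebup/:

du -shc /stevebup/rsync /stevebup/inc_*
# The mirror's line shows the full size; each inc_* line shows only the space
# unique to that increment (its changed files), and -c adds a grand total line.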

So I've created a shellscript called ./make_incremental.sh:

#!/bin/bash
datestamp=$(head -n1 /stevebup/rsync/meta/timestamp.txt)
incname=inc_$datestamp

cd /stevebup/
echo "PLEASE WAIT, MAKING INCREMENTAL COPY, cp -al rsync $incname..."
cp -al rsync $incname

The incrementation is accomplished by the cp -al command. The timestamp.txt file was created earlier in the process, so that every backup has a record of when it was started, and also so that the backup can be given an intelligent name when it becomes an incremental backup.
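Sooner or later you'll want to prune old increments before the backup server's disk fills. Because each increment is just a tree of hardlinks, deleting one never disturbs files still referenced by the mirror or by other increments. Here's a hypothetical pruning sketch that keeps the 20 newest increments; the number and path are illustrative:

#!/bin/bash
# Hypothetical pruning script: keep only the 20 newest inc_* trees.
# The inc_ names embed the timestamp, so a plain sort is chronological.
cd /stevebup/ || exit 1
ls -d inc_* | sort | head -n -20 | while read old
do
    echo "Removing old increment $old"
    rm -rf "$old"
done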

Creating tarballs from the backup server's directory mirror

I've had 35 Megaton flame wars about my belief in backing up to an archive (in this case, a .tgz file). All sorts of people have called me wrong. My detractors bring up all sorts of excellent reasons, such as the fact that a single blown archive bit invalidates the whole backup (or at least that .tgz file). It's much harder to restore individual files with a single file backup. It's difficult and time consuming to determine what files are contained (this argument is invalidated by my creation of .lst files).

All I can say is I still back up to archives, and have no plan to change that fact. My reasons haven't changed since 1990:
With archive file backups, I know whether my backup is good. And what I've found over the years is that with Zip disks and CDs, shelf life is years and years and years, with 100% accuracy almost every time, making the "one bad bit destroys the backup" argument moot. In fact, I've never found a Zip disk or write-once CD or write-once DVD that started out good but went bad after time.

Now that I've told you why I do it the way I do, let me tell you how.

I could have used a simple shellscript, but I wanted some additional capabilities:
  1. Verification (diff) of each tarball against the data it was created from.
  2. An md5sum (.md5) file for each tarball.
  3. A listing (.lst) of the files contained in each tarball.
  4. Logging of each step's success or failure.
Inserting those features would have made for an ugly shellscript, so I did it as a Ruby program. It's called makeTgzs.rb. Here's what it looks like:

#!/usr/bin/ruby

require 'date'

$tgzdir = "/stevebup/tgz/"
$rsyncdir = "/stevebup/rsync/"


$tgzs = [
["d", "d/"],
["sl", "slitt/"],
["p", "pictures/"],
["a", "a3b3/"],
["ca", "classic/a"],
["i", "inst/"],
["l", "tclogs/"],
["me", "meta/"],
]


class Logger
attr_reader :logfname
attr_accessor :logfile
attr_accessor :stagename
attr_accessor :errors
def initialize(logfname)
@logfname = logfname
@errors = 0
@logfile = File.new(@logfname, "w")
end

def begins()
puts "Begin #{stagename}..."
@logfile.puts "Begin #{stagename}..."
end

def success()
puts "#{stagename} completed successfully!"
@logfile.puts "#{stagename} completed successfully!"
end

def failure(errmsg)
puts "#{stagename} failed: #{errmsg}"
@logfile.puts "#{stagename} failed: #{errmsg}"
@errors += 1
end

def skipline(msg)
puts
@logfile.puts
if (msg != nil) and (msg != "")
puts msg
logfile.puts msg
end
end

end

def todaystring()
d = Date.today()
return zerofill(d.year - 2000) + zerofill(d.month) + zerofill(d.mday)
end


def zerofill(number)
number += 10000
string = number.to_s
twodigit = string[-2, 2]
return twodigit
end

def tar_cre_string(abbrev, dir)

command = "tar czvf #{$tgzdir}#{abbrev}#{$datestring}.tgz #{dir}"
return command
end

def tar_diff_string(abbrev, dir)

command = "tar dzvf #{$tgzdir}#{abbrev}#{$datestring}.tgz"
return command
end

def tar_md5_string(abbrev)
command = "md5sum #{$tgzdir}#{abbrev}#{$datestring}.tgz > #{$tgzdir}#{abbrev}#{$datestring}.md5"
return command
end

def tar_lst_string(abbrev)
command = "tar tzvf #{$tgzdir}#{abbrev}#{$datestring}.tgz "
command = command + "| sed -e \"s/^.* //\" | "
command = command + "sort > #{$tgzdir}#{abbrev}#{$datestring}.lst"
puts command
return command
end

def do1tgz_string_only(commands, abbrev, dir)
commands.push(tar_cre_string(abbrev, dir))
commands.push(tar_diff_string(abbrev, dir))
commands.push(tar_md5_string(abbrev))
commands.push(tar_lst_string(abbrev))
end

def do1tgz(tgzlogger, tasklogger, abbrev, dir)
this_tgz_errors = 0
tgzlogger.stagename ="Directory #{dir} (#{abbrev}{$datestring}.tgz)"
tgzlogger.skipline("")
tgzlogger.begins()


tasklogger.skipline("Directory #{dir} as #{abbrev}{$datestring}.tgz")
tasklogger.stagename = ("Creating #{$tgzdir}#{abbrev}#{$datestring}.tgz")
tasklogger.begins()
cre_return = system(tar_cre_string(abbrev, dir))
if cre_return then
tasklogger.success()
else
tasklogger.failure("")
this_tgz_errors += 1
end

tasklogger.stagename = ("Diffing #{$tgzdir}#{abbrev}#{$datestring}.tgz")
tasklogger.begins()
diff_return = system(tar_diff_string(abbrev, dir))
if diff_return then
tasklogger.success()
else
tasklogger.failure("")
this_tgz_errors += 2
end


tasklogger.stagename = ("Creating md5 #{$tgzdir}#{abbrev}#{$datestring}.md5")
tasklogger.begins()
md5_return = system(tar_md5_string(abbrev))
if md5_return then
tasklogger.success()
else
tasklogger.failure("")
this_tgz_errors += 4
end

tasklogger.stagename = ("Creating lst #{$tgzdir}#{abbrev}#{$datestring}.lst")
tasklogger.begins()
lst_return = system(tar_lst_string(abbrev))
if lst_return then
tasklogger.success()
else
tasklogger.failure("")
this_tgz_errors += 8
end

if this_tgz_errors == 0 then
tgzlogger.success()
else
errmsg = "failed on step(s) "
if this_tgz_errors % 2 == 1 then
errmsg += "(CREATE) "
end

this_tgz_errors /= 2

if this_tgz_errors % 2 == 1 then
errmsg += "(DIFF) "
end

this_tgz_errors /= 2

if this_tgz_errors % 2 == 1 then
errmsg += "(MD5) "
end

this_tgz_errors /= 2

if this_tgz_errors % 2 == 1 then
errmsg += "(LST) "
end

tgzlogger.failure(errmsg)
end

end

def main()
$datestring = todaystring()
Dir.chdir($rsyncdir)
system("pwd")
system("sleep 1")
tasklogger = Logger.new(ENV['HOME'] +"/maketgz_task.log")
tgzlogger = Logger.new(ENV['HOME'] + "/maketgz_tgz.log")
$tgzs.each do |tgz|
do1tgz(tgzlogger, tasklogger, tgz[0], tgz[1])
end
end

main()

At the top, global variables are defined for the tgz root and the root of the directories to be tarred. Also at the top is a global array of abbreviation/directory pairs, each implemented as a 2 element array. I didn't do it as a hash, in order to allow the backup tarballs to be created in a specific order. Personally, I like doing the shortest ones first. The program creates all the tarballs enumerated in the $tgzs array.

The rest of the program facilitates the tarring itself, creating a .md5 file, creating a listing of the files backed up (.lst), verifying the tarball (.tgz) against the data with which it was created, as well as implementing a logging capability. The logger should probably write and flush on each entry, and eventually I'll change it so it does.

Recording backup metadata

Some data about the backup itself needs to be recorded. I put such data in the mirror tree's meta directory. The obvious piece of metadata is the timestamp for the backup. That's meta/timestamp.txt. It's used not only to identify the time of backup, but also to drive incrementation.

The other thing that's needed is the backup programs themselves. I copy all the programs that make up my backup system to meta/backup/, using a script called metabup. Here's what that shellscript looks like:

#!/bin/bash
rm -rf /stevebup/rsync/meta/*
date +%Y%m%d_%H%M%S > /stevebup/rsync/meta/timestamp.txt
cp -Rp /d/bats/backup /stevebup/rsync/meta/

At a later step, the meta directory is packaged as a (very small) tarball, which can be included on every DVD in the set. The meta directory can also be burned as a plain text tree.  In addition, timestamp.txt can be burned to the DVD in plain text, so it's obvious when the backup was made.

Once again, meta/timestamp.txt is later used to create an incremental snapshot of the backup, and that incremental snapshot remains unchanged in spite of changes to the actual Rsync destination directory.

Burning the tarballs onto removable media

I'm a big believer in minimum user intervention, but I burn DVDs using K3b. Here's why:

When you have several tarballs, and exceed the limits of one DVD, you need to spread them out over multiple DVDs. Which tarballs go on which DVDs is not only a space conservation thing, but also a personal preference. Any script I write today would be obsolete in a few months. Besides that, once you have more than 1 DVD to burn, you need to insert DVDs during the burn, so there's no possibility of a "set it forget it" solution.

One could conceivably create a backtracking algorithm that would optimally pack tarballs onto multiple DVDs, but because a human must remove and insert DVDs, you can't get around the human intervention. So my preference is K3b.
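If a backup set fits on a single DVD, you can skip K3b and burn from the command line with growisofs (mentioned again later in this article). This is only a sketch; the device name and volume label are assumptions, and once the set spans multiple DVDs you're back to deciding by hand which tarballs go where:

growisofs -Z /dev/dvd -R -J -V "stevebup_$(date +%Y%m%d)" /stevebup/tgz/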

After-burn verification

Earlier I said I never had an error on .tgz creation. The same cannot be said for CD or DVD burning. Those error out a lot! To facilitate after-burn verification, each created .tgz file is augmented with a .md5 file listing its md5sum. Therefore, after burn, even years after burn, one can compare the md5sum of the .tgz file with the md5 value listed in the .md5 file.

Because my backups have many tarballs on each DVD, I created a script to md5 compare all tarballs, after mounting the CD or DVD as /mnt/cdrom:
#!/bin/bash

echo
echo

mountpoint=/mnt/cdrom

echo Mountpoint =$mountpoint=

logfile=$HOME/verifycd.tmp
rm -f $logfile

for tgz in $mountpoint/*.tgz
do
md5=${tgz//\.tgz/\.md5}
echo -n "Comparing Checksums for $tgz & $md5, please wait... "
md5val=`cut -d " " -f 1 $md5`
tgzval=`md5sum $tgz | cut -d " " -f 1`
echo " finished."
echo "$md5 value==>$md5val<=="
echo "$tgz value==>$tgzval<=="
if (test "$md5val" = "$tgzval"); then
echo $tgz is good. >> $logfile
else
echo ""
echo $tgz MD5SUM MISMATCH! ERROR ERROR ERROR ERROR ERROR! >> $logfile
fi
echo
done
cat $logfile
Steve Litt is the author of the Universal Troubleshooting Process Courseware.   Steve can be reached at his email address.

Passwords and Password Alternatives

By Steve Litt
In an ideal world, you'd institute your daily backup, it would ask you for the password of the workstation, and then do the backup. Trouble is, if you're grabbing several directories with several different Rsync commands, you'll be asked for your password each time, and not at the beginning, but as each one finishes and the next one starts. Remember that this magazine started as a desire to make backups easy?

I considered grabbing the password with a master script, and then calling each rsync command with its password. Umm, no. First, my script echoed my password, so anyone standing over my shoulder could see it. Also, if I used the password on the command line of each rsync call, those calls would be visible in a ps command.

I tried setting the RSYNC_PASSWORD environment variable to the password, but it had no effect. Apparently that works only when you connect to an Rsync daemon. Likewise, the --password-file option works only when accessing a daemon, not using a remote shell like ssh. Besides that, I cannot imagine having a password in open text in a file.

The "solution" is to use public and private keys created by ssh-keygen, and do so without specifying a passphrase. If you specify a passphrase, you'll be asked for that passphrase every time rsync runs, so you've gained nothing.

Having a private key without a passphrase is extremely risky, but the risks can be somewhat ameliorated. There's an excellent document on eliminating passwords from Rsync while minimizing the risk, at http://www.jdmz.net/ssh/. Read it!
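Here's a minimal sketch of the key setup, assuming the backup server pulls from the workstation at 192.168.100.2 as user slitt; the key filename is just an example. Run it on the backup server, then read the jdmz.net document for the restrictions you should place on that key:

# Generate a key with an empty passphrase (-N "").
ssh-keygen -t rsa -f ~/.ssh/backup_key -N ""
# Install the public half on the workstation being backed up
# (or append backup_key.pub to ~/.ssh/authorized_keys there by hand).
ssh-copy-id -i ~/.ssh/backup_key.pub slitt@192.168.100.2
# Point the rsync scripts' ssh at that key:
export RSYNC_RSH="ssh -i $HOME/.ssh/backup_key -c arcfour -o Compression=no -x"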
Steve Litt is the author of Rapid Learning: Secret Weapon of the Successful Technologist.   Steve can be reached at his email address.

Musings on Rsync

By Steve Litt
A trip to the Rsync project website (URL in URLs section) is very revealing. For instance, Andrew Tridgell originated the Rsync project. Does that name sound familiar? He's much better known for originating another free software project -- Samba.

The Rsync project website is a treasure trove of information and help on Rsync, and even contains Andrew's Ph.D. thesis, which itself devotes three chapters to Rsync and is a fascinating read. Read it!
Steve Litt is the author of Samba Unleashed.   Steve can be reached at his email address.

Life After Windows: What does this have to do with Linux?

Life After Windows is a regular Linux Productivity Magazine column, by Steve Litt, bringing you observations and tips subsequent to Troubleshooters.Com's Windows to Linux conversion.
By Steve Litt
Backing up with Rsync is pretty cool, isn't it? Is there something like it in the Windows world?

I wouldn't know. How would I find out? I Googled the words Windows and backup. There's something called ZipBackup, which appears to be a front end to PKZip (or Winzip, etc). Great idea, but I didn't see anything about network backups, or moving the process to another computer. There was something called GRBackPro, which was cool because it does provide for network backup and it can back up to .zip files. It also has incremental backups, so theoretically you could transfer only changed files. However, those changed files appear to contribute to a chain of incremental backups following a full backup, not a mirror like Rsync gives you.

My quick web research indicated that there are many nice Windows backup programs costing $29 to $99. Most write to a wide variety of media. A few seem capable of backing up over a network. These programs are a nice compromise for a non-technical user, but they're monolithic. I've gotten used to the fact that if I can get a mirror on another computer, I can use tar to compress it into an archive, and K3b or growisofs or cdrecord to get it onto media. If a part of my backup doesn't work, I can replace just that part. You can't do that with a Windows backup program.

Then there's the cost factor. The Windows backup programs are cheap, but to try a bunch of them you'd need to spend a fortune. So you'll probably settle.

There's the knowledge accumulation factor. With Rsync, I learned Rsync and hardlinks, but I already had prodigious knowledge of creating tarballs complete with verification and logs, and DVD burning. Now, going forward, I'll have good knowledge of Rsync and hardlinks with which to accomplish other tasks. In other words, with Linux, your knowledge builds and builds, whereas with Windows you just learn a succession of different programs, often with radically different user interfaces and data formats.

There's also a huge difference in quality. The Windows backup programs are adequate for personal or very small business backup, but they're not scalable. As the business grows, and data becomes more voluminous, and the computer housing the data is used ever more continuously, the Windows backup programs can't handle the load.

Contrast that with Kevin Korb's Rsync backup system that runs on Linux. It's enterprise quality, at least for a midsized enterprise. One backup computer could poll various other computers, pulling Rsync'ed data throughout the day. Tarball creation and disk burning would require a couple hours of complete attention by the backup server (in other words, no Rsyncing while burning), but not the machines being backed up. One way to scale up would be to have a dedicated DVD burning machine with an NFS link to the backup server.

Indeed, if you needed to scale further, you could split computers to be backed up between several backup servers, each with its own DVD or tape drive, or with a single DVD or tape drive attached via NFS.

The relationship between Windows and Linux backup programs is representative of all software. Linux uses the Unix philosophy of lots of little programs that do one thing and do it well. You glue them together with Ruby or shellscripts or UMENU or whatever, and get a product exactly matching your needs. With Windows you try out several commercial programs, pick the best, and work around its idiosyncrasies and inconveniences. With Linux, every new functionality you build increases your knowledge, whereas with Windows you keep starting from scratch with every new functionality. Available retail Windows applications tend to be good for a moderately loaded desktop, but group or enterprise class applications cost a fortune. Oftentimes, group or enterprise class applications, or the major building blocks to build them, come free on your Linux install CD.

This isn't to say there aren't advantages to the Windows way. For the totally nontechnical, a monolithic program set up to perform the functionality right out of the box is what is needed. But for those of us who are "power users", I believe the Linux way is better.

And it's not that difficult. Each of my Rsync scripts was less than 10 lines of shellscript code. My metadata script was a shellscript of less than 10 lines. My script to create an incremental hardlink copy of the new backup was less than 10 lines. Indeed, the only sizeable piece of code was the Ruby program to create the tarballs, and truth be told, that could be replaced by a series of per-tree shellscripts writing to a common log file. I like the Ruby solution better, but a nonprogrammer could easily implement a shellscript solution.

Beyond all of that is the difference in empowerment. As a Windows guy back in the old days, when I needed something, I asked around, collected voluminous information and opinions on alternative proprietary programs, wrote a check for one, and hoped it would work. Now, as a Linux guy, when I need something I ask around, assemble the pieces, and get the thing done. When my needs increase, I modify my previous solution, or find a better one available on the Internet. I expect to be able to do that.

If you're willing to spend a little time, Linux gives you the power to accomplish anything.
Steve Litt is the founder and acting president of Greater Orlando Linux User Group (GoLUG).   Steve can be reached at his email address.

GNU/Linux, open source and free software

By Steve Litt
Linux is a kernel. The operating system often described as "Linux" is that kernel combined with software from many different sources. One of the most prominent, and oldest of those sources, is the GNU project.

"GNU/Linux" is probably the most accurate moniker one can give to this operating system. Please be aware that in all of Troubleshooters.Com, when I say "Linux" I really mean "GNU/Linux". I completely believe that without the GNU project, without the GNU Manifesto and the GNU/GPL license it spawned, the operating system the press calls "Linux" never would have happened.

I'm part of the press and there are times when it's easier to say "Linux" than explain to certain audiences that "GNU/Linux" is the same as what the press calls "Linux". So I abbreviate. Additionally, I abbreviate in the same way one might abbreviate the name of a multi-partner law firm. But make no mistake about it. In any article in Troubleshooting Professional Magazine, in the whole of Troubleshooters.Com, and even in the technical books I write, when I say "Linux", I mean "GNU/Linux".

There are those who think the FSF is making too big a deal of this. Nothing could be farther from the truth. Richard Stallman's GNU Manifesto, and the GNU General Public License it spawned, are the only reason we can enjoy this wonderful alternative to proprietary operating systems, and the only reason proprietary operating systems aren't even more flaky than they are now.

For practical purposes, the license requirements of "free software" and "open source" are almost identical. Generally speaking, a license that complies with one complies with the other. The difference between these two is a difference in philosophy. The "free software" crowd believes the most important aspect is freedom. The "open source" crowd believes the most important aspect is the practical marketplace advantage that freedom produces.

I think they're both right. I wouldn't use the software without the freedom guaranteeing me the right to improve the software, and the guarantee that my improvements will not later be withheld from me. Freedom is essential. And so are the practical benefits. Because tens of thousands of programmers feel the way I do, huge amounts of free software/open source is available, and its quality exceeds that of most proprietary software.

In summary, I use the terms "Linux" and "GNU/Linux" interchangeably, with the former being an abbreviation for the latter. I usually use the terms "free software" and "open source" interchangeably, as from a licensing perspective they're very similar. Occasionally I'll prefer one or the other, depending on whether I'm writing about freedom or business advantage.
Steve Litt has used GNU/Linux since 1998, and written about it since 1999. Steve can be reached at his email address.

Letters to the Editor

All letters become the property of the publisher (Steve Litt), and may be edited for clarity or brevity. We especially welcome additions, clarifications, corrections or flames from vendors whose products have been reviewed in this magazine. We reserve the right to not publish letters we deem in bad taste (bad language, obscenity, hate, lewd, violence, etc.).


Submit letters to the editor to Steve Litt's email address, and be sure the subject reads "Letter to the Editor". We regret that we cannot return your letter, so please make a copy of it for future reference.

How to Submit an Article

We anticipate two to five articles per issue. We look for articles that pertain to GNU/Linux or open source software. This can be done as an essay, with humor, with a case study, or with some other literary device. A Troubleshooting poem would be nice. Submissions may mention a specific product, but must be useful without the purchase of that product. Content must greatly overpower advertising. Submissions should be between 250 and 2000 words long.

Any article submitted to Linux Productivity Magazine must be licensed with the Open Publication License, which you can view at http://opencontent.org/openpub/. At your option you may elect to prohibit substantive modifications. However, in order to publish your article in Linux Productivity Magazine, you must decline the option to prohibit commercial use, because Linux Productivity Magazine is a commercial publication.

Obviously, you must be the copyright holder and must be legally able to so license the article. We do not currently pay for articles.

Troubleshooters.Com reserves the right to edit any submission for clarity or brevity, within the scope of the Open Publication License. If you elect to prohibit substantive modifications, we may elect to place editors notes outside of your material, or reject the submission, or send it back for modification. Any published article will include a two sentence description of the author, a hypertext link to his or her email, and a phone number if desired. Upon request, we will include a hypertext link, at the end of the magazine issue, to the author's website, providing that website meets the Troubleshooters.Com criteria for links and that the author's website first links to Troubleshooters.Com. Authors: please understand we can't place hyperlinks inside articles. If we did, only the first article would be read, and we can't place every article first.

Submissions should be emailed to Steve Litt's email address, with subject line Article Submission. The first paragraph of your message should read as follows (unless other arrangements are previously made in writing):

Copyright (c) 2003 by <your name>. This material may be distributed only subject to the terms and conditions set forth in the Open Publication License, version  Draft v1.0, 8 June 1999 (Available at http://www.troubleshooters.com/openpub04.txt/ (wordwrapped for readability at http://www.troubleshooters.com/openpub04_wrapped.txt). The latest version is presently available at  http://www.opencontent.org/openpub/).

Open Publication License Option A [ is | is not] elected, so this document [may | may not] be modified. Option B is not elected, so this material may be published for commercial purposes.

After that paragraph, write the title, text of the article, and a two sentence description of the author.

Why not Draft v1.0, 8 June 1999 OR LATER

The Open Publication License recommends using the word "or later" to describe the version of the license. That is unacceptable for Troubleshooting Professional Magazine because we do not know the provisions of that newer version, so it makes no sense to commit to it. We all hope later versions will be better, but there's always a chance that leadership will change. We cannot take the chance that the disclaimer of warranty will be dropped in a later version. 

Trademarks

All trademarks are the property of their respective owners. Troubleshooters.Com(R) is a registered trademark of Steve Litt.

URLs Mentioned in this Issue

