Troubleshooters.Com Presents

Linux Productivity Magazine

September 2006

Making Backups Easier
Copyright (C) 2006 by Steve Litt. All rights reserved. Materials from guest authors copyrighted by them and licensed for perpetual use to Linux Productivity Magazine. All rights reserved to the copyright holder, except for items specifically marked otherwise (certain free software source code, GNU/GPL, etc.). All material herein provided "As-Is". User assumes all risk and responsibility for any outcome.
We're entering a new world in which data may be more important than software.
 -- Tim O'Reilly
Editor's Desk
By Steve Litt
It was a tough day. I was frazzled. Then things went bad.
My wife asked me to find an urgently needed email. My automated email
search swamped the processor. I deleted all the emails from what I thought was the trash
folder. A couple minutes later I looked at my inbox and saw nothing
there. Two and a half years of email gone!
No problem. I'd restore from backup. Except that my latest backup was a
month and three days old. All email between 7/3/2006 and 8/10/2006 that wasn't
associated with a mailing list or directly inquiring about my
troubleshooting course was gone. Irretrievably!
Step 9 of the Universal Troubleshooting Process is "Take Pride". In
that step not only do you take pride in your accomplishment (losing a
month of email isn't an especially proud accomplishment), but you ask
how you were brilliant (again, that's laughable) and how you can do
better next time.
How could I do better next time? I needed to identify the root cause of
the data loss and fix that root cause to prevent future occurrence.
One thing's for sure -- it wasn't for lack of knowledge. My basic backup strategy was formed in the 1980's, recorded in the July 1998 Troubleshooting Professional Magazine, and revised for Linux in the August 2002 Linux Productivity Magazine.
What was obvious is that I didn't follow my own procedures. My
procedures call for weekly backups. The backup on the first of the
month is to write-once media, while all the rest are to rewriteables.
If I'd followed my procedures, I'd have lost only a single week. Why
didn't I follow my procedures?
Because they were immensely inconvenient. That nice little script
that worked so well with 200 MB backups was
hours-long drudgery with 7GB backups. During backup I
couldn't work, first because working would cause a verification
failure, and then because working would cause a DVD write buffer
overrun. The difficulty and inconvenience of these huge backups caused
me to forego my own backup procedures and policies, and my departure
from those procedures and policies caused me to lose a month of general
purpose email.
An easier backup method was needed.
My thoughts drifted back to an August 2005 GoLUG presentation by Kevin
Korb, entitled Backups using rsync (URL in URLs section of this
magazine). Kevin had demonstrated how you could use rsync
to back up every single day, and it would make a complete backup, but
only transfer the files that had been changed. In other words, I could
back up my computer in 5 minutes per day.
Kevin also mentioned that of course
you need backups on tape or CD or DVD or some other cheap removable
media, but if you use the Rsync method you can back up the Rsync mirror
rather than your live machine, which means you can continue working
full speed ahead while your backup is being burned to DVD.
Last but not least, Kevin demonstrated how you could use Unix type hard
links to create a system of incremental backups whose disk space used
was only the space needed for the changed files, but each looked like a
full backup and restored like one.
So Kevin's backup techniques yield the best of all worlds:
- Quick daily backups
- Quick daily increments that can be restored
- Burn to removable media on the backup server instead of the work machine
In August 2005 I'd made a mental note to investigate Kevin's methods,
but you know how things go -- there were always other priorities. When
I lost a month's emails out of one mailbox a year later,
priorities changed. I stopped all work and used Kevin's
techniques to completely revamp my backup system.
My August 10, 2006 loss of a month's email in a single mailbox was one
of the most fortunate things that ever happened to me. It was
disastrous enough to get my attention, but nowhere near as disastrous
as a loss of a month's worth of ALL my data would have been. Had I lost
my troubleshooting course mailbox, or my records of customer purchases,
or the month's worth of work on my new book, it could have set me back
months or even endangered my business.
This Linux Productivity Magazine updates the backup philosophy espoused in the July 1998 Troubleshooting Professional Magazine, and the Linux-centric backup techniques discussed in the August 2002 Linux Productivity Magazine.
It explains how to implement Kevin Korb's backup techniques for a
"power user" Linux desktop. So kick back, relax, enjoy, and remember
that if you use GNU/Linux, this is your magazine.
Is Linux Productivity Magazine Back for Good?
By Steve Litt
Linux Productivity Magazine ceased publication in the summer of 2004,
after hurricanes Charley and Frances wiped out our roof and temporarily
stopped all Troubleshooters.Com business activities. This is the first
Linux Productivity Magazine since then. From now on, Linux Productivity
Magazine will be an occasional publication.
For that reason the magazines will no longer have volume and issue
numbers. From now on, new issues of Linux Productivity Magazine will appear
only when I have suitable content and the time to write it.
So check back from time to time. It will be worth it.
My Backup Philosophy
By Steve Litt
My backup philosophy was completely described in the July 1998 Troubleshooting Professional Magazine. The August 2002 Linux Productivity Magazine
did not change that philosophy at all, but simply adapted it for Linux,
and newer hardware and free software archiving solutions. My backup
philosophy didn't change appreciably between 1998 and 2006:
- Predictable and Trustworthy
- Accurate and Complete
- Restorable
- Easy to use
- Part of a good backup system
Upon backup, comparison must be successfully made between the burned
backup and the material on the hard disk. Then a CRC (or md5sum in Linux) of the backed up
data must be included on the media so that years from now, the validity
of the backup can be ascertained.
The backup must back up the directories you need backed up, and this
must be verifiable. This can be done by performing a tree command or a tar tzvf command, as sketched below.
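For instance, here's a hedged sketch of both checks -- the completeness listing and the years-later checksum comparison. The archive name sl060810.tgz and the mount point /mnt/cdrom are hypothetical stand-ins for whatever your own backup produces:

# List every file that went into the archive, to confirm the right trees are there:
tar tzvf /mnt/cdrom/sl060810.tgz | less

# Years later, confirm the archive still matches the checksum recorded at backup time:
md5sum /mnt/cdrom/sl060810.tgz     # compare this value...
cat /mnt/cdrom/sl060810.md5        # ...against the one stored on the media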
Restorability implies restorable soon after backup, restorable years
after backup, and restorable after disaster (hurricane, earthquake,
terrorist act). This means reliable media for the short term, and
ubiquitously common and accepted media and file format for the long
term, offsite backups, and even out of state backups.
Easy to use means configurability, minimum of human intervention,
minimum thought, and minimum time off work. We'll discuss that later in
this article.
As far as a good backup system goes, it must include a proper mix of daily,
weekly, monthly, quarterly and yearly backups. It must have provision
for offsite and even out of state backups.
Ease of Use
Although my backup philosophy was first written in 1998, I've been
conscious of it since the late 80's or early 90's. Watching a coworker
try and fail
to restore three different backups convinced me that restorability must
never be
taken for granted. The sight of floppy failures, and even more
so frequent tape failures, taught me the need for good media.
Failure to
restore proprietary formats taught me the value of ubiquitous hardware
and software formats. And of course my May 1987 disk crash sans backup
taught me that backups must be frequent and regular.
In the late 80's and early 90's I backed up to floppy disk -- the only
format available to a man of modest means. I used Fastback and Central
Point Backup to write the floppies. Even though I had less than 100MB
of data back in those days, the backup process was a gruelling, error
prone exercise in human intervention. Basically you took most of the
day off and backed up your computer. Unfortunately, with the exception
of expensive and error prone tape, floppies were the only choice.
I backed up to tape briefly between 1993 and 1995. Windows 95 could not
accommodate the tape drive, so back to floppies I went after installing
Win95. Now saddled with much more data than in the early 1990's,
backups were a miserable all-day affair, but they had to be done
monthly, and so they were. Backups spanned 40 floppies, and oftentimes
one bad floppy meant starting the whole process over again. Media
for a single backup set cost over $40.00. "Hot" projects were backed up
to other parts of the hard disk on a daily basis.
In 1996 the IOMega Zip Drive changed my life. At 100MB per disk, I
could set the backup process running, leave for an hour or two, and
come back to a backed up computer. I couldn't work on my computer
during that time, but I could do other work or exercise or chores. Life
was easy!
But my data grew, and soon a backup required two or even three Zip
disks. Sure, life was easier than in the floppy days, but it now
required significant user intervention to switch the Zip disks.
Life got easier again in late 1998 when I bought a CD writer. Once
again, a backup consumed a single piece of media, and was
accomplishable with a "set it, forget it" process. My first CD writer
was an incredibly slow 2x, but subsequent ones got faster, until whole
backups could be done in well less than an hour.
It didn't last. By early 2001 I needed two CDs for a backup, by mid
2002 I needed three. 2003 brought four CD backups, and by the time I
had to make backups during the 2004 hurricane assault on Florida, I was
up to five. In January 2005 I bought a DVD writer, and once again I had
a one media backup, powered with a set it/forget it script.
And once again I grew out of it. I joined lots of mailing lists whose
messages required backup. My wife bought a digital camera whose images
were easier to process and store on my computer than hers, but needed backup. In August
2005 my backups started using two DVDs, breaking my carefully crafted
backup scripts. As I write this in August 2006, it's getting harder and
harder to fit my backups on two DVDs. I've thought of getting
double-sided DVDs, but how could you handle those without getting
fingerprints on a recording surface?
Data Grows
The point is this: Data grows! Throughout your lifetime you accumulate
more data than you purge (absent a catastrophic disk crash sans
backup), and as time goes on the number and size of the files get
bigger. As backup media grow bigger and faster, they barely keep up
with your increased backup needs. Today, backing up to DVD on my
desktop computer takes almost as much time and effort as my 1989
CPBackup and floppy backups.
There's a Better Way
There was no choice in 1989. There was no such thing as a network for
1989's Average Joe. Who had the money for their own copy of Novell
Netware? Who had the money for network cards, and who had the expertise
to get it all running? If you wanted to back up in 1989, and didn't
have a king's ransom for a tape drive, you backed up directly from your
desktop to floppy.
Windows 95 democratized networking by including TCP/IP in the operating
system. So did GNU/Linux. By the mid 1990's you could have backed up
over a network, and then written the result to Zip drive or tape.
Except that back then few had the money for multiple computers.
And don't forget the cost of hard disks. I think it was 1994 when
Egghead Software advertised a 1GB drive for only $899.95. I called to
see if it was in stock, rushed right in and bought it, immediately left
the store before they could "realize their mistake", drove a block
away, parked in a parking lot, and did a victory dance. That night I
bragged to all my friends that I'd bought a disk for less than a buck a
Megabyte.
Now Sams Club has a 200GB hard drive for $99.95. That's fifty cents a Gigabyte. I won't buy it. I think I can get a better deal.
Somewhere between 1998 and 2003, the average Joe Geek came to own two
network-equipped computers, and could afford the RAM and drive space to
make them both useful. The concept of a backup server was now possible.
The one remaining problem was this: Fast as 100 megabit networks can
be, it still takes a long time to transfer several gigabytes.
Of course that's no problem at all -- Rsync has been around for a long
time -- at least since February 1999 when its original author, Andrew
Tridgell, completed his PhD thesis.
NOTE
If Andrew Tridgell's name sounds familiar, it's probably because of another free software project he originated: Samba.
So once Geeks equipped with the proper equipment learned that Rsync was
a great backup solution, and learned how to use Rsync, the "better way"
was born.
What Rsync does is compare each file on the acquiring side with its
counterpart on the side being backed up. If the size or modification time
differs, the file is transferred. Otherwise, the files are
assumed identical, and the file is not transferred. In practice, this
means only changed files are transferred -- a huge improvement.
Of course, you and I know the files could be different even though the
file's name, size and modification time are identical. But think of
what it would take to silently change the file without disturbing
the size or mod date. Limited disk corruption could do it. A malicious
program could do it through OS level calls storing the original date
and size, modifying the file, truncating to the same size, and setting
the mod date back to the original one. Or a malicious program could do
it through BIOS level writes to a certain head, cylinder and sector.
Notice that in all of these eventualities, probably the version you'd
want to keep is the version on the backup machine, because the version
on the desktop being backed up is probably corrupt. If you're really
paranoid about the possibility of silently changed files, my suggestion
would be once a month to generate a report on them. I haven't
investigated it, but I think you could create such a report with the -I and --only-write-batch=FILE options. You could also use the -I
option alone to force checksumming even with unchanged file dates, but
as mentioned in the preceding paragraph, the most likely explanation
for silent file changes is disk corruption or malware.
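If you wanted such a monthly report without transferring anything, here's one possible sketch, untested and not necessarily the options I'd settle on: rsync's --dry-run, --checksum and --itemize-changes flags, pointed at the same pictures directories used in the backup script later in this article, should list files whose checksums differ even though size and date match:

RSYNC_RSH="ssh -c arcfour -o Compression=no -x" \
rsync -naHxic --numeric-ids \
   slitt@192.168.100.2:/scratch/pictures/ /stevebup/rsync/pictures/ \
   > $HOME/silent_change_report.txt
# -n = dry run (change nothing), -c = compare full checksums, -i = itemize what differs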
So basically, the better way is to transfer only changed files to the
backup server. This saves network traffic. Because the process runs on
the backup server, it conserves your workstation's CPU power. In the
better way, you can have a collection of backups that use only the
space of incremental backups, but contain all files, like a full
backup. Last but not least, when it's time to burn your weekly or
monthly copy to removable media, this is done on the little used backup
server rather than your heavily used workstation.
You needn't buy a new computer for use as your backup server. If you
have a coworker or family member who doesn't come near using the
capacity of his or her computer, you can use that computer as your
backup server. You'll ssh
in under your own username, which he or she can't touch. Because it's
your own account, you also can't accidentally delete or change his or
her data.
My Philosophical Change
For the most part, my backup philosophy hasn't changed in 15 years.
However, the definition of "easy to use" has changed to accommodate
both the multigigabyte size of today's backups, and the availability of
other computers, cheap networking, and the Rsync program.
This definition change creates a tactical change in which I now back up in two stages:
- Incrementally back up over the network to update a "mirror"
- Burn to DVD from the "mirror"
So it's not really a philosophical change. It's simply that over the years, what's considered "easy" has changed.
Activities in Rsync Assisted Backup Systems
By Steve Litt
Rsync assisted backup systems accommodate the following activities:
- Pulling changed files from the workstation to the backup server.
- Using hardlinks to create "incremental backups".
- Creating tarballs from the backup server's directory mirror.
- Recording backup metadata.
- Burning the tarballs onto removable media.
- After-burn verification
Pulling changed files from the workstation to the backup server.
This is typically a 5 minute process that should be done every day. By
doing it every day, a disk crash on a single computer costs you only
one day's worth of data.
This is simple enough. Let's start by viewing the script that backs up my digital photos:
RSYNC_RSH="ssh -c arcfour -o Compression=no -x" rsync -vaHx --progress --numeric-ids --delete \ --exclude-from=pbup_backup.excludes --delete-excluded \ slitt@192.168.100.2:/scratch/pictures/ /stevebup/rsync/pictures/
|
The preceding is based on Kevin Korb's "Backups using rsync"
presentation notes (URL in URLs section). The script is two commands. The first sets an
environment variable to facilitate Rsync transfer over an ssh session
spawned by Rsync itself. The second command, which is written in three
lines to prevent walking off the screen, pulls all changed files from /scratch/pictures/ on the workstation (192.168.100.2) to the mirror directory at /stevebup/rsync/pictures on the backup server, where this command is being run.
Here's an explanation of the options for the first command:
-c arcfour
    Use the arcfour cipher for encryption. It's fast.

-o Compression=no
    Don't compress the data before sending it over the network
    wire. The idea here is that the time it takes to compress and
    decompress the data will exceed the added time transmitting more bytes
    over the network line, which may or may not be true depending on
    network load, processor capability and load. You might want to
    experiment.

-x
    Disables X11 forwarding. Rsync has no GUI component, so why
    incur performance penalties and security risks by allowing GUI through
    the ssh connection?
Here's an explanation of the options for the second command:
-v
    Verbose.

-a
    Archive mode -- preserve all times, ownership, permissions and the like.

-H
    Preserve hard links.

-x
    One file system -- don't cross filesystem boundaries. For
    instance, if you were backing up /usr/, and /usr/local/ had its own
    partition, /usr/local/ wouldn't be included. This prevents the backup
    from following symlinks all over your hard disk. If you really need the
    separate partition backed up, you can do it with a subsequent command.

--progress
    Shows a progress meter. Not essential, but lets the user know the
    process isn't hung when large files are transferred.

--numeric-ids
    Don't map uid/gid values by user/group name. Doing so would
    create havoc if the backup server used different numeric IDs for the
    user being backed up.

--delete
    If the file is no longer on the workstation, delete it from
    the mirror too. This sounds like an opportunity to lose data, but in
    fact, because you'll be implementing a hardlinks based series of
    incremental backups, and because you'll frequently be burning the
    mirror to removable media, deleted files will be available for restoral
    if needed. If one didn't delete from the mirror the files deleted from
    the workstation, it would create a housekeeping nightmare.

--exclude-from=pbup_backup.excludes
    pbup_backup.excludes contains a list of trees or files you
    want excluded from the backup. Trees should be terminated with a
    forward slash. If you don't want to exclude anything at this time, the
    file can be blank. (A sample excludes file is sketched after this list.)

--delete-excluded
    Imagine for a moment that, after you've been backing up for awhile,
    you decide to exclude the temp tree. In order for the mirror to be a
    true mirror, the temp tree would need to be deleted from the mirror as
    well. That's what this option facilitates.

slitt@192.168.100.2:/scratch/pictures/
    The source directory. Because we're using a "pull to the
    backup server" rather than "push from the workstation" approach, the
    source directory is remote and needs the username and IP address (or
    hostname or URL). The trailing forward slash is mandatory. Lacking the
    trailing slash, you wouldn't get what you think you'd get.

/stevebup/rsync/pictures/
    The destination directory, which in this case is on the
    mirror on the backup server. The trailing slash is
    mandatory. Lacking the trailing slash, you wouldn't get what you
    think you'd get.
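As promised in the --exclude-from entry above, here's the sort of thing pbup_backup.excludes might contain. These particular entries are made up; list whatever you don't want mirrored, with trees ending in a forward slash (rsync ignores blank lines and lines starting with # or ;):

# Hypothetical exclusions -- substitute your own.
temp/
cache/
*.iso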
I have several similar scripts to Rsync several directories, and an umbrella script to run them all.
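The umbrella script itself isn't reproduced in this article, but a minimal sketch might look like the following. The per-directory script names and their location are hypothetical; substitute your own:

#!/bin/bash
# Run each per-directory Rsync pull in turn; stop loudly if any of them fails.
cd /stevebup/bats || exit 1
for script in bup_pictures.sh bup_slitt.sh bup_d.sh; do
        echo "Running $script..."
        ./$script || { echo "$script FAILED -- backup incomplete!" ; exit 1; }
done
echo "All Rsync pulls finished."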
For more detailed information on Rsync and backup, see Kevin Korb's page (URL in URLs section).
Using hardlinks to create "incremental backups"
Hardlinks are different names for the same file, and are very different
from symbolic links. Unlike symbolic links, there's no filename that's
"the real filename" while the other(s) is/are "synonyms". With
hardlinks, every name is equally important, and equally dispensable.
The fact that the file originated under one name doesn't make it any
more important or less dispensable once other hardlinks are made to the
file. The following shellscript demonstrates these concepts:
#!/bin/bash

function showfile() {
        echo -n $1
        echo -n ': '
        cat $1
        echo
}

rm -f first.txt
rm -f second.txt

echo -n first > first.txt
ln first.txt second.txt
showfile first.txt
showfile second.txt
ls -i -1 first.txt second.txt
echo

echo -n ': append to first' >> first.txt
showfile first.txt
showfile second.txt
ls -i -1 first.txt second.txt
echo

echo -n '-- append to second' >> second.txt
showfile first.txt
showfile second.txt
ls -i -1 first.txt second.txt
echo

echo -n ', redirect to first' > first.txt
showfile first.txt
showfile second.txt
ls -i -1 first.txt second.txt
echo

rm -f first.txt
echo -n 'Brand new first' > first.txt
showfile first.txt
showfile second.txt
ls -i -1 first.txt second.txt
echo

rm -f first.txt
rm -f second.txt
In the code above, the showfile() function simply shows the contents of the file in a clear way.
First we delete both first.txt and second.txt. Next we create
first.txt, link in the name second.txt, and prove they contain the same
thing. Then we append to first.txt, and show that the appended text
shows up in second.txt, because second.txt and first.txt are just
different names for the same file. Next we append to second.txt, and
once again show the file contents are the same.
Next we redirect an entirely new content into first.txt and display
both files, and they're both the same. What this proves is that
although the redirect (the single right angle bracket) truncates the
file, it doesn't delete and recreate the file.
Finally, we delete the file and then redirect new text into it. Now
the two files contain different info, because they're no longer two names
for the same file. At this point, second.txt is the name of the file
that was the original first.txt, while the current first.txt is a brand
new file.
Throughout these experiments, the ls -i command demonstrates when the two names refer to the same file content, and when they refer to different file contents.
Hardlinks are multiple names referring to the same file. If the file's
contents are changed while referenced from one filename, those changes
are visible from the other hardlinks. A simple ls -i command proves that.
Things change when one of the filenames is deleted, and then a new file is
created under that filename. Now they refer to two separate files.
This is how the incremental system, recommended by Kevin Korb, works. The cp -al
command copies a tree by making another tree whose every file and
directory is a hard link to the original. Immediately after the copy,
the two trees are identical in every respect and are in fact
indistinguishable in every respect except the name of the top directory.
Then you perform an Rsync whose destination is the original tree. If a file on the
workstation being backed up is changed, what Rsync does on the backup
server is delete and recreate the file. This breaks the "link" to the
original version of the file in the primary mirror tree, but in the
incremental tree created with the cp -al command, the original file is still alive and well.
What this means is with Kevin's ingenious increment method, with 20
increments plus the mirror, an unchanged file uses space only once, and
has 20 names pointing to it (they're actually the same filename, but in
different directory trees). Only files that have changed consume extra space.
So I've created a shellscript called ./make_incremental.sh:
#!/bin/bash
datestamp=$(head -n1 /stevebup/rsync/meta/timestamp.txt)
incname=inc_$datestamp

cd /stevebup/
echo "PLEASE WAIT, MAKING INCREMENTAL COPY, cp -al rsync $incname..."
cp -al rsync $incname
The incrementation is accomplished by the cp -al command. The timestamp.txt
file was created earlier in the process, so that every backup has a
record of when it was started, and also so that backup can have an
intelligent name when it becomes an incremental backup.
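If you want to see the space sharing for yourself, here's a hedged sketch. The increment name and the file path are hypothetical, but any file that hasn't changed since the increment was made will show the same inode number and a link count greater than 1 in both trees:

# Same inode number in both trees means one copy of the data on disk.
ls -li /stevebup/rsync/slitt/somefile.txt /stevebup/inc_20060810_143000/slitt/somefile.txt

# GNU du counts each hardlinked file only once per invocation, so the
# increment adds almost nothing to the total:
du -sh /stevebup/rsync /stevebup/inc_20060810_143000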
Creating tarballs from the backup server's directory mirror
I've had 35 Megaton flame wars about my belief in backing up to an archive (in this case, a .tgz
file). All sorts of people have called me wrong. My detractors bring up
all sorts of excellent reasons, such as the fact that a single blown
archive bit invalidates the whole backup (or at least that .tgz
file). It's much harder to restore individual files with a single file
backup. It's difficult and time consuming to determine what files are
contained (this argument is invalidated by my creation of .lst files).
All I can say is I still back up to archives, and have no plan to change that fact. My reasons haven't changed since 1990:
- By placing filenames in the archive file instead of the media directory, I get around any OS dependencies on filenames.
- Slightly better compression than compressing every single file.
- Ability to compare the archive against a .md5 file, to verify the integrity of the backup years after creation.
- Eliminate problems with Linux' "read ahead bug" affecting CD and DVD writes.
With archive file backups, I know
whether my backup is good. And what I've found over the years is that
with Zip disks and CDs, shelf life is years and years and years, with 100% accuracy almost
every time, making the "one bad bit destroys the backup" argument moot.
In fact, I've never found a Zip disk or write-once CD or write-once DVD
that started out good but went bad after time.
Now that I've told you why I do it the way I do, let me tell you how.
I could have used a simple shellscript, but I wanted some additional capabilities:
- Ability to create .md5 and .lst files (.lst files list the files in the archive).
- Ability to log the results.
- Ability to compare the resulting tgz against its hard disk
source, and error out on failure. I've never had a failure on .tgz
creation, but there's always a first time. The error mechanism prevents unknowing recording of bad
backups.
- Intelligent error handling.
Inserting those features would have rendered an un-pretty shellscript, so I did it as a Ruby program. It's called makeTgzs.rb. Here's what it looks like:
#!/usr/bin/ruby
require 'date'
$tgzdir = "/stevebup/tgz/" $rsyncdir = "/stevebup/rsync/"
$tgzs = [ ["d", "d/"], ["sl", "slitt/"], ["p", "pictures/"], ["a", "a3b3/"], ["ca", "classic/a"], ["i", "inst/"], ["l", "tclogs/"], ["me", "meta/"], ]
class Logger attr_reader :logfname attr_accessor :logfile attr_accessor :stagename attr_accessor :errors def initialize(logfname) @logfname = logfname @errors = 0 @logfile = File.new(@logfname, "w") end
def begins() puts "Begin #{stagename}..." @logfile.puts "Begin #{stagename}..." end
def success() puts "#{stagename} completed successfully!" @logfile.puts "#{stagename} completed successfully!" end
def failure(errmsg) puts "#{stagename} failed: #{errmsg}" @logfile.puts "#{stagename} failed: #{errmsg}" errors += 1 end
def skipline(msg) puts @logfile.puts if (msg != nil) and (msg != "") puts msg logfile.puts msg end end
end
def todaystring() d = Date.today() return zerofill(d.year - 2000) + zerofill(d.month) + zerofill(d.mday) end
def zerofill(number) number += 10000 string = number.to_s twodigit = string[-2, 2] return twodigit end
def tar_cre_string(abbrev, dir)
command = "tar czvf #{$tgzdir}#{abbrev}#{$datestring}.tgz #{dir}" return command end
def tar_diff_string(abbrev, dir)
command = "tar dzvf #{$tgzdir}#{abbrev}#{$datestring}.tgz" return command end
def tar_md5_string(abbrev) command = "md5sum #{$tgzdir}#{abbrev}#{$datestring}.tgz > #{$tgzdir}#{abbrev}#{$datestring}.md5" return command end
def tar_lst_string(abbrev) command = "tar tzvf #{$tgzdir}#{abbrev}#{$datestring}.tgz " command = command + "| sed -e \"s/^.* //\" | " command = command + "sort > #{$tgzdir}#{abbrev}#{$datestring}.lst" puts command return command end
def do1tgz_string_only(commands, abbrev, dir) commands.push(tar_cre_string(abbrev, dir)) commands.push(tar_diff_string(abbrev, dir)) commands.push(tar_md5_string(abbrev)) commands.push(tar_lst_string(abbrev)) end
def do1tgz(tgzlogger, tasklogger, abbrev, dir) this_tgz_errors = 0 tgzlogger.stagename ="Directory #{dir} (#{abbrev}{$datestring}.tgz)" tgzlogger.skipline("") tgzlogger.begins()
tasklogger.skipline("Directory #{dir} as #{abbrev}{$datestring}.tgz") tasklogger.stagename = ("Creating #{$tgzdir}#{abbrev}#{$datestring}.tgz") tasklogger.begins() cre_return = system(tar_cre_string(abbrev, dir)) if cre_return then tasklogger.success() else tasklogger.failure("") this_tgz_errors += 1 end
tasklogger.stagename = ("Diffing #{$tgzdir}#{abbrev}#{$datestring}.tgz") tasklogger.begins() diff_return = system(tar_diff_string(abbrev, dir)) if diff_return then tasklogger.success() else tasklogger.failure("") this_tgz_errors += 2 end
tasklogger.stagename = ("Creating md5 #{$tgzdir}#{abbrev}#{$datestring}.md5") tasklogger.begins() md5_return = system(tar_md5_string(abbrev)) if md5_return then tasklogger.success() else tasklogger.failure("") this_tgz_errors += 4 end
tasklogger.stagename = ("Creating lst #{$tgzdir}#{abbrev}#{$datestring}.lst") tasklogger.begins() lst_return = system(tar_lst_string(abbrev)) if lst_return then tasklogger.success() else tasklogger.failure("") this_tgz_errors += 8 end
if this_tgz_errors == 0 then tgzlogger.success() else errmsg = "failed on step(s) " if this_tgz_errors % 1 == 1 then errmsg += "(CREATE) " end
this_tgz_errors /= 2
if this_tgz_errors % 1 == 1 then errmsg += "(DIFF) " end
this_tgz_errors /= 2
if this_tgz_errors % 1 == 1 then errmsg += "(MD5) " end
this_tgz_errors /= 2
if this_tgz_errors % 1 == 1 then errmsg += "(LST) " end
tgzlogger.failure(errmsg) end
end
def main() $datestring = todaystring() Dir.chdir($rsyncdir) system("pwd") system("sleep 1") tasklogger = Logger.new(ENV['HOME'] +"/maketgz_task.log") tgzlogger = Logger.new(ENV['HOME'] + "/maketgz_tgz.log") $tgzs.each do |tgz| do1tgz(tgzlogger, tasklogger, tgz[0], tgz[1]) end end
main()
|
At the top, global constants are defined for the tgz root and the
root of the directories to be tarred. Also at the top is a global array of
directory/abbreviation pairs, each implemented as a 2 element array. I
didn't do it as a hash, in order to allow the backup tarballs to
be created in a specific order. Personally, I like doing the shortest
ones first. The program creates all the tarballs enumerated in the $tgzs array.
The rest of the program facilitates the tarring itself, creating a .md5
file, creating a listing of the files backed up (.lst), verifying the
tarball (.tgz) against the data with which it was created, as well as
implementing a logging capability. The logger should probably write and
flush on each entry, and eventually I'll change it so it does.
Recording backup metadata
Some data about the backup itself needs to be recorded. I put such data in the mirror tree's meta directory. The obvious piece of metadata is the timestamp for the backup. That's meta/timestamp.txt. It's used not only to identify the time of backup, but also to drive incrementation.
The other thing that's needed is the backup programs themselves. I copy all the programs that accomplish my backup system to meta/backup/ with a script called metabup. Here's what that shellscript looks like:
#!/bin/bash
rm -rf /stevebup/rsync/meta/*
date +%Y%m%d_%H%M%S > /stevebup/rsync/meta/timestamp.txt
cp -Rp /d/bats/backup /stevebup/rsync/meta/
At a later step, the meta directory is packaged as a (very small) tarball, which can be included on every DVD in the set. The meta directory can also be burned as a plain text tree. In addition, timestamp.txt can be burned to the DVD in plain text, so it's obvious when the backup was made.
Once again, meta/timestamp.txt
is later used to create an incremental snapshot of the backup, and that
incremental snapshot remains unchanged in spite of changes to the
actual Rsync destination directory.
Burning the tarballs onto removable media
I'm a big believer in minimum user intervention, but I burn DVDs using K3b. Here's why:
When you have several tarballs, and exceed the limits of one DVD, you
need to spread them out over multiple DVDs. Which tarballs go on which
DVDs is not only a space conservation thing, but also a personal
preference. Any script I write today would be obsolete in a few months.
Besides that, once you have more than 1 DVD to burn, you need to insert
DVDs during the burn, so there's no possibility of a "set it forget it"
solution.
One could conceivably create a backtracking algorithm that would
optimally pack tarballs onto multiple DVDs, but because a human must
remove and insert DVDs, you can't get around the human intervention. So
my preference is K3b.
After-burn verification
Earlier I said I never had an error on .tgz creation. The same cannot be said for CD or DVD burning. Those error out a lot!
To facilitate after-burn verification, each created .tgz file is
augmented with a .md5 file listing its md5sum. Therefore, after burn,
even years after burn, one can compare the md5sum of the .tgz file with
the md5 value listed in the .md5 file.
Because my backups have many tarballs on each DVD, I created a script
to md5 compare all tarballs, after mounting the CD or DVD as /mnt/cdrom:
#!/bin/bash
echo
echo

mountpoint=/mnt/cdrom

echo Mountpoint =$mountpoint=

logfile=$HOME/verifycd.tmp
rm -f $logfile

for tgz in $mountpoint/*.tgz
do
        md5=${tgz//\.tgz/\.md5}
        echo -n "Comparing Checksums for $tgz & $md5, please wait... "
        md5val=`cut -d " " -f 1 $md5`
        tgzval=`md5sum $tgz | cut -d " " -f 1`
        echo " finished."
        echo "$md5 value==>$md5val<=="
        echo "$tgz value==>$tgzval<=="
        if (test "$md5val" = "$tgzval"); then
                echo $tgz is good. >> $logfile
        else
                echo ""
                echo $tgz MD5SUM MISMATCH! ERROR ERROR ERROR ERROR ERROR! >> $logfile
        fi
        echo
done
cat $logfile
Passwords and Password Alternatives
By Steve Litt
In an ideal world, you'd institute your daily backup, it would ask you
for the password of the workstation, and then do the backup. Trouble
is, if you're grabbing several directories with several different Rsync
commands, you'll be asked for your password each time, and not at the
beginning, but as each one finishes and the next one starts. Remember
that this magazine started as a desire to make backups easy?
I considered grabbing the password with a master script, and then calling each rsync
command with its password. Umm, no. First, my script echoed my
password, so the guy standing over my shoulder could see my password.
Also, if I use the password on the command line of each rsync call, those calls can be visible in a ps command.
I tried setting the RSYNC_PASSWORD
environment variable to the password, but it had no effect. Apparently
that works only when you connect to an Rsync daemon. Likewise, the --password-file option works only when accessing a daemon, not using a remote shell like ssh. Besides that, I cannot imagine having a password in open text in a file.
The "solution" is to use public and private keys created by ssh-keygen, and do so without specifying a passphrase. If you specify a passphrase, you'll be asked for that passphrase every time rsync runs, so you've gained nothing.
Having a private key without a passphrase is extremely risky, but the
risks can be somewhat ameliorated. There's an excellent document on
eliminating passwords from Rsync while minimizing the risk, at http://www.jdmz.net/ssh/. Read it!
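For reference, here's a minimal sketch of that passphrase-less key setup, using the same workstation address and username as the earlier pictures script. The jdmz.net document layers further restrictions (forced commands, source-address limits) on top of this, and you should read it before trusting such a key:

# On the backup server, as the user who runs the Rsync scripts:
ssh-keygen -t rsa                 # just press Enter at the passphrase prompts

# Install the public key on the workstation being backed up:
ssh-copy-id slitt@192.168.100.2

# Test: this should log in without asking for a password.
ssh slitt@192.168.100.2 hostname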
Musings on Rsync
By Steve Litt
A trip to the Rsync project website (URL in URLs section) is very
revealing. For instance, Andrew Tridgell originated the Rsync project.
Does that name sound familiar? He's much better known for originating
another free software project -- Samba.
The Rsync project website is a treasure trove of information and help
on Rsync, and even contains Andrew's Ph.D. thesis, which itself
contains three chapters on Rsync, and is a fascinating read. Read it!
Life After Windows: What does this have to do with Linux?

Life After Windows is a regular Linux Productivity Magazine column, by Steve Litt, bringing you observations and tips subsequent to Troubleshooters.Com's Windows to Linux conversion.
By Steve Litt
Backing up with Rsync is pretty cool, isn't it? Is there something like it in the Windows world?
I wouldn't know. How would I find out? I Googled the words Windows and backup.
There's something called ZipBackup, which appears to be a front end to
PKZip (or Winzip, etc). Great idea, but I didn't see anything about
network backups, or moving the process to another computer. There was
something called GRBackPro, which was cool because it does
provide for network backup and it can back up to .zip files. It also
has incremental backups, so theoretically you could transfer only
changed files. However, those changed files appear to contribute to a
chain of incremental backups following a full backup, not a mirror like
Rsync gives you.
My quick web research indicated that there are many nice Windows backup
programs costing $29 to $99. Most write to a wide variety of media. A
few seem capable of backing up over a network. These programs are a
nice compromise for a non-technical user, but they're monolithic. I've
gotten used to the fact that if I can get a mirror on another computer,
I can use tar to compress it into an archive, and K3b or growisofs or
cdrecord to get it onto media. If a part of my backup doesn't work, I
can replace just that part. You can't do that with a Windows backup program.
Then there's the cost factor. The Windows backup programs are cheap,
but to try a bunch of them you'd need to spend a fortune. So you'll
probably settle.
There's the knowledge accumulation factor. With Rsync, I learned Rsync
and hardlinks, but I already had prodigious knowledge of creating
tarballs complete with verification and logs, and DVD burning. Now,
going forward, I'll have good knowledge of Rsync and hardlinks with
which to accomplish other tasks. In other words, with Linux, your
knowledge builds and builds, whereas with Windows you just learn a
succession of different programs, often with radically different user
interfaces and data formats.
There's also a huge difference in quality. The Windows backup programs
are adequate for personal or very small business backup, but they're
not scalable. As the business grows, and data becomes more voluminous,
and the computer housing the data is used ever more continuously, the
Windows backup programs can't handle the load.
Contrast that with Kevin Korb's Rsync backup system that runs on Linux.
It's enterprise quality, at least for a midsized enterprise. One backup
computer could poll various other computers, pulling Rsync'ed data
throughout the day. Tarball creation and disk burning would
require a couple hours of complete attention by the backup server (in
other words, no Rsyncing while burning), but not the machines being
backed up. One way to scale up would be to have a dedicated DVD burning
machine with an NFS link to the backup server.
Indeed, if you needed to scale further, you could split computers to be
backed up between several backup servers, each with its own DVD or tape
drive, or with a single DVD or tape drive attached via NFS.
The relationship between Windows and Linux backup programs is
representative of all software. Linux uses the Unix philosophy of lots
of little programs that do one thing and do it well. You glue them
together with Ruby or shellscripts or UMENU or whatever, and get a
product exactly matching your needs. With Windows you try out several
commercial programs, pick the best, work around its idiosyncrasies and
inconveniences. With Linux, every new functionality you build increases
your knowledge, whereas with Windows you keep starting from scratch
with every new functionality. Available retail Windows applications
tend to be good for a moderately loaded desktop, but group or
enterprise class applications cost a fortune. Often times, group or
enterprise class applications, or the major building blocks to build
them, come free on your Linux install CD.
This isn't to say there aren't advantages to the Windows way. For the
totally nontechnical, a monolithic program set up to perform the
functionality right out of the box is what is needed. But for those of
us who are "power users", I believe the Linux way is better.
And it's not that difficult. Each of my Rsync scripts was less than 10
lines of shellscript code. My metadata script was a shellscript less
than 10 lines. My script to create an incremental hardlink copy of the
new backup was less than 10 lines. Indeed, the only sizeable piece of
code was the Ruby program to create the tarballs, and truth be told,
that could be replaced by a series of per-tree shellscripts writing to
a common log file. I like the Ruby solution better, but a nonprogrammer
could easily implement a shellscript solution.
Beyond all of that is the difference in empowerment.
As a Windows guy back in the old days, when I needed something, I
asked around, collected voluminous information and opinions on
alternative proprietary programs, wrote a check for one, and hoped it
would work. Now, as a Linux guy,
when I need something I ask around, assemble the pieces, and get the
thing done. When my needs increase, I modify my previous solution, or
find a better one available on the Internet. I expect to be able to do that.
If you're willing to spend a little time, Linux gives you the power to accomplish anything.
GNU/Linux, open source and free software
By Steve Litt
Linux is a kernel. The operating system often described as "Linux" is that
kernel combined with software from many different sources. One of the most
prominent, and oldest of those sources, is the GNU project.
"GNU/Linux" is probably the most accurate moniker one can give to this
operating system. Please be aware that in all of Troubleshooters.Com,
when I say "Linux" I really mean "GNU/Linux". I completely believe that without
the GNU project, without the GNU Manifesto and the GNU/GPL license it spawned,
the operating system the press calls "Linux" never would have happened.
I'm part of the press and there are times when it's easier to say "Linux"
than explain to certain audiences that "GNU/Linux" is the same as what the
press calls "Linux". So I abbreviate. Additionally, I abbreviate in the same
way one might abbreviate the name of a multi-partner law firm. But make no
mistake about it. In any article in Troubleshooting Professional Magazine,
in the whole of Troubleshooters.Com, and even in the technical books I write,
when I say "Linux", I mean "GNU/Linux".
There are those who think FSF is making too big a deal of this. Nothing
could be farther from the truth. Richard Stallman's GNU Manifesto, and the
GNU General Public License it spawned, are the only reason we can enjoy this
wonderful alternative to proprietary operating systems, and the only reason
proprietary operating systems aren't even more flaky than they are now.
For practical purposes, the license requirements of "free software" and "open
source" are almost identical. Generally speaking, a license that complies
with one complies with the other. The difference between these two is a difference
in philosophy. The "free software" crowd believes the most important aspect
is freedom. The "open source" crowd believes the most important aspect is
the practical marketplace advantage that freedom produces.
I think they're both right. I wouldn't use the software without the freedom
guaranteeing me the right to improve the software, and the guarantee that
my improvements will not later be withheld from me. Freedom is essential.
And so are the practical benefits. Because tens of thousands of programmers
feel the way I do, huge amounts of free software/open source is available,
and its quality exceeds that of most proprietary software.
In summary, I use the terms "Linux" and "GNU/Linux" interchangeably, with
the former being an abbreviation for the latter. I usually use the terms "free
software" and "open source" interchangeably, as from a licensing perspective
they're very similar. Occasionally I'll prefer one or the other depending
on whether I'm writing about freedom or business advantage.
Steve Litt has used GNU/Linux since 1998, and written about it since 1999. Steve can be reached at his email address.
Letters to the Editor
All letters become the property of the publisher (Steve Litt), and may be edited for clarity or brevity. We especially welcome additions, clarifications, corrections or flames from vendors whose products have been reviewed in this magazine. We reserve the right to not publish letters we deem in bad taste (bad language, obscenity, hate, lewd, violence, etc.).

Submit letters to the editor to Steve Litt's email address, and be sure the subject reads "Letter to the Editor". We regret that we cannot return your letter, so please make a copy of it for future reference.
How to Submit an Article
We anticipate two to five articles per issue. We look for articles that pertain to GNU/Linux or open source. This can be done as an essay, with humor, with a case study, or some other literary device. A Troubleshooting poem would be nice. Submissions may mention a specific product, but must be useful without the purchase of that product. Content must greatly overpower advertising. Submissions should be between 250 and 2000 words long.
Any article submitted to Linux Productivity Magazine must be licensed with the Open Publication License, which you can view at http://opencontent.org/openpub/. At your option you may elect to prohibit substantive modifications. However, in order to publish your article in Linux Productivity Magazine, you must decline the option to prohibit commercial use, because Linux Productivity Magazine is a commercial publication.
Obviously, you must be the copyright holder and must be legally able to so license the article. We do not currently pay for articles. Troubleshooters.Com reserves the right to edit any submission for clarity or brevity, within the scope of the Open Publication License. If you elect to prohibit substantive modifications, we may elect to place editor's notes outside of your material, or reject the submission, or send it back for modification.
Any published article will include a two sentence description of the author, a hypertext link to his or her email, and a phone number if desired. Upon request, we will include a hypertext link, at the end of the magazine issue, to the author's website, providing that website meets the Troubleshooters.Com criteria for links and that the author's website first links to Troubleshooters.Com. Authors: please understand we can't place hyperlinks inside articles. If we did, only the first article would be read, and we can't place every article first.

Submissions should be emailed to Steve Litt's email address, with subject line Article Submission. The first paragraph of your message should read as follows (unless other arrangements are previously made in writing):
Copyright (c) 2003 by <your name>. This material may be distributed only subject to the terms and conditions set forth in the Open Publication License, version Draft v1.0, 8 June 1999 (Available at http://www.troubleshooters.com/openpub04.txt/ (wordwrapped for readability at http://www.troubleshooters.com/openpub04_wrapped.txt). The latest version is presently available at http://www.opencontent.org/openpub/).

Open Publication License Option A [ is | is not] elected, so this document [may | may not] be modified. Option B is not elected, so this material may be published for commercial purposes.
After that paragraph, write the title, text of the article, and a two sentence description of the author.
Why not Draft v1.0, 8 June 1999 OR LATER
The Open Publication License recommends using the words "or later" to describe the version of the license. That is unacceptable for Linux Productivity Magazine because we do not know the provisions of that newer version, so it makes no sense to commit to it. We all hope later versions will be better, but there's always a chance that leadership will change. We cannot take the chance that the disclaimer of warranty will be dropped in a later version.
Trademarks
All trademarks are the property of their respective owners. Troubleshooters.Com(R) is a registered trademark of Steve Litt.
URLs Mentioned in this Issue