Troubleshooters.Com Presents
Linux Productivity Magazine
April 2007
NoSQL: The Unix Database (With awk)
Copyright (C) 2007 by Steve Litt. All rights reserved. Materials from guest
authors copyrighted by them and licensed for perpetual use to Linux
Productivity Magazine. All rights reserved to the copyright holder, except for
items specifically marked otherwise (certain free software source code,
GNU/GPL, etc.). All material herein provided "As-Is". User assumes all risk
and responsibility for any outcome.
NoSQL is an extension of the UNIX environment, making available the full power
of UNIX during application development and usage. -- From the Philosophy of NoSQL
Editor's Desk
By Steve Litt
I need your help.
I need your help building a Rapid Application Development (RAD) toolkit for Linux.
With the possible exception of Rails, there are few if any truly fast
RAD toolkits for Linux -- very few ways to construct a full-fledged,
five-table app in a single day.
There were very few for DOS or Windows, for that matter. I'm in the
process of specifying the tools and inter-tool communication for a
Linux RAD toolkit, and need ideas and a little coding help. For this
first iteration in creating the toolkit, I'm using the simplest
database I know of, NoSQL. The toolkit can easily be adapted to MySQL
or Postgres later, but right now let's get it working with NoSQL. For
more details on the upcoming RAD toolkit, see the
RAD on Linux article later in this magazine, and please email me if you're interested.
Data Base Management Systems (DBMS) are wonderful, aren't they?
Transactions, row locking, user passwords, client-server, triggers and
stored procedures. But sometimes, when doing a simple app that just
needs a local relational database, it's just too much
hassle. DBMSs, even the good ones like MySQL and Postgres,
seem like black boxes. Wouldn't it be nice to be able to store
relational data in files? Wouldn't it be nice to have the option to
read them with grep
and write them with cat?
To back them up as ordinary files, rather than having a special
database backup procedure?
Welcome to the world of NoSQL. Like the big boys, it has tables full of
rows and columns. The tables can be joined relationally. But unlike the
big boys, if necessary you can edit them with vi.
And you don't need DBI::DBD, PHP, Rails, GDK, TK or QT to
work
with them. All you need are a few Unix commands, filters really, to
read and write the data you need.
OK, you're not going to scale a NoSQL database up to 100 users hitting
it hard with both reads and writes. But even if that's what you'll
eventually need, NoSQL provides an excellent prototyping environment.
Changing table structures is trivially easy. Adding tons of test (or
real) data consists of a cat
and a sort.
You access your data with Unix commands, not with SQL.
Here's the number one advantage of NoSQL. If you know Unix (or Linux or
BSD), you know NoSQL. You'll learn a few commands packaged with NoSQL,
and you're off and running.
Carlo Strozzi created NoSQL using Debian Linux, so that's your easiest
install.
Probably any Linux is easy. BSD and other Unices are harder -- the more
they differ from Linux, the harder they will be. But any Unix or Unix
workalike can accommodate NoSQL without too much trouble. On Windows:
use Access :-).
If you need to patch together a quick and dirty database app, in a few
minutes, and appearance isn't a priority, consider NoSQL. You can
always migrate the data to a database via comma delimited files or a
simple script to turn a NoSQL table into an SQL dump.
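For instance, here's a rough sketch of such a script. It assumes a people.table
like the one built later in this issue; the table name, column handling and
quoting are illustrative only, and real data would need proper SQL escaping
before being loaded into MySQL or Postgres:

#!/bin/bash
# table2sql.sh -- sketch only: dump a NoSQL table as SQL INSERT statements.
# Usage: ./table2sql.sh people.table people > people.sql
table=${1:-people.table}
sqlname=${2:-people}

awk -F'\t' -v tbl="$sqlname" -v q="'" '
BEGIN { soh = sprintf("%c", 1) }           # the \x01 that precedes each header name
NR == 1 {
    for (i = 1; i <= NF; i++) {
        gsub(soh, "", $i)                  # strip the \x01 markers from the header
        cols = cols (i > 1 ? "," : "") $i
    }
    next
}
{
    vals = ""
    for (i = 1; i <= NF; i++) vals = vals (i > 1 ? "," : "") q $i q
    printf "INSERT INTO %s (%s) VALUES (%s);\n", tbl, cols, vals
}' "$table"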
So kick back, relax, and use the database that every Unix person knows
before they've even heard of it. And remember, if you use GNU/Linux,
this is your magazine.
Is Linux Productivity Magazine Back for Good?
By Steve Litt
Linux Productivity Magazine ceased publication in the summer of 2004,
after hurricanes Charley and Frances wiped out our roof and temporarily
stopped all Troubleshooters.Com business activities. This is the first
Linux Productivity Magazine since then. From now on, Linux Productivity
Magazine will be an occasional
publication.
For that reason the magazines will no longer have volume and issue
numbers. From now on, a new Linux Productivity Magazine will appear only when
I have suitable content and the time to write it.
So check back from time to time. It will be worth it.
Installing NoSQL
By Steve Litt
Like everything else in Linux, dependencies are the challenge.
According to its INSTALL documentation, NoSQL requires the following:
- Perl 5
- GNU textutils (grep, sort, join, etc.).
- The lockfile(1) utility, normally distributed with
procmail(1)
- The mktemp(1) utility
- Mike Brennan's mawk(1), version >= 1.3.3
Here are the tests for these utilities:
[slitt@mydesk 200704]$ perl -v

This is perl, v5.8.8 built for i386-linux
(with 1 registered patch, see perl -V for more detail)

Copyright 1987-2006, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on this
system using "man perl" or "perldoc perl". If you have access to the Internet,
point your browser at http://www.perl.org/, the Perl Home Page.

[slitt@mydesk 200704]$

[slitt@mydesk 200704]$ lockfile -v
lockfile v3.22 2001/09/10
Copyright (c) 1990-2001, Stephen R. van den Berg <srb@cuci.nl>
Copyright (c) 1997-2001, Philip A. Guenther <guenther@sendmail.com>

Submit questions/answers to the procmail-related mailinglist by sending to:
        <procmail-users@procmail.org>

And of course, subscription and information requests for this list to:
        <procmail-users-request@procmail.org>

Your system mailbox's lockfile: /var/spool/mail/slitt.lock
[slitt@mydesk 200704]$

[slitt@mydesk 200704]$ mawk -W version
mawk 1.3.3 Nov 1996, Copyright (C) Michael D. Brennan

compiled limits:
max NF             32767
sprintf buffer      1020
[slitt@mydesk 200704]$

[slitt@mydesk 200704]$ mktemp
/home/slitt/tmp/tmp.kVWCt24221
[slitt@mydesk 200704]$

[slitt@mydesk 200704]$ grep -v
Usage: grep [OPTION]... PATTERN [FILE]...
Try `grep --help' for more information.
[slitt@mydesk 200704]$
By far the most likely stumbling block will be mawk.
Almost any Unix will have Perl 5. Almost any Linux will have the GNU
text utilities. Most Linuces come with procmail packages, so lockfile
shouldn't be a problem. The mktemp
utility is fairly ubiquitous.
Your distro probably didn't come with mawk. Instead
it probably came with GNU's gawk.
Although you could theoretically adapt NoSQL to use gawk instead of
mawk,
installing mawk
is easy enough...
Installing mawk
If mawk
is already on your
system, or if there's an easy way to install it as a package, you can
skip this section. However, I can tell you that on Mandriva, trying to
install mawk
as an RPM package turns into a multilevel dependency nightmare.
So here's what you do. Go to http://freshmeat.net/projects/mawk/ and
download the latest tarball. Untar it into a directory. If you're not
using Linux or BSD, look carefully at the INSTALL file. Otherwise,
start by running ./configure.
If it works, great. If you're like me, it will fail something like this:
[slitt@mydesk mawk-1.3.3]$ ./configure
creating cache ./config.cache
checking for gcc... gcc
checking whether we are using GNU C... yes
checking how to run the C preprocessor... gcc -E
checking for -lm... no
checking for log... no
configure: error: Cannot find a math library.  You need to set MATHLIB in config.user
[slitt@mydesk mawk-1.3.3]$
This is a known problem, and simple to fix. Look in the config.user
file and locate something that looks like this:
# configure will look for libm.  Change if you know this will fail
# or want a different math library
#MATHLIB=-lspecialmath
#MATHLIB=''  # if you don't need a special lib to get sin,sqrt,etc
What you need to do is set MATHLIB to whatever flag you need to tell gcc
to include the math libraries. This is typically -lm, so you'd
make the following change:
# configure will look for libm.  Change if you know this will fail
# or want a different math library
#MATHLIB=-lspecialmath
#MATHLIB=''  # if you don't need a special lib to get sin,sqrt,etc
MATHLIB=-lm
Once you get the right setting for MATHLIB, configure
should work, although you'll see some scary looking warnings about
incompatible implicit declarations. Then do make and make install,
and that's it, mawk
is installed.
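In other words, the whole build is the usual three-step dance (a sketch; exact
paths and the need for root depend on your system):

./configure      # after setting MATHLIB in config.user if configure complained
make
make install     # typically as root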
Installing NoSQL
Grab the tarball from http://www.scriptaworks.com/shared/nosql/. Untar
it into a previously empty directory, then read the INSTALL file, which
will give you some tips and illuminate a few landmines. It also tells
you how to install the NoSQL files into a directory other than the
default /usr/local/nosql.
All of the above being said, as long as you have a working mawk, perl
5, lockfile, mktemp and GNU text tools, probably all you'll need to do
is become root
and run this command:
./install-sh
Then you're all set except for the executable path.
Tweaking the Executable Path
NoSQL uses a lot of very common words, such as join and select, as
commands. These can clash with like-named Linux commands. Therefore,
when running NoSQL based applications, the NoSQL binary directory
(default /usr/local/nosql/bin) must be at the front of the path.
However, under normal, non-NoSQL situations, you want it at the back of the path
or maybe not on the path at all, so that Linux works normally. This
calls for a script to create a NoSQL friendly environment:
PATH=/usr/local/nosql/bin:$PATH
export PATH
OLD_PS1=$PS1
PS1='[nosql_\u@\h \W]\$ '
export PS1
bash
The preceding script puts you in an environment in which the NoSQL
binary directory is at the front of the path, and the string nosql is in the
prompt, so you can instantly recognize what's going on.
WARNING
The rxvt
terminal program,
in its infinite wisdom, resets the $PS1 environment variable when it
starts, so if you're already in a NoSQL environment and then start rxvt,
the prompt will not show the "nosql_" even though the NoSQL binary
directory will still be at the front of the path. All other terminal
programs I tested worked right -- both the path and the prompt were
passed down from parent terminal to child terminal.
Always be sure to have the NoSQL environment running before doing NoSQL
work. Otherwise, wrong commands will do wrong things, creating very
confusing failures.
Hello World, Unix Style
By Steve Litt
Using the Vim or Emacs editor, create the following file called people.table:
[Screenshot: people.table open in Vim, with a ^A (\x01) before each column name in the top row and tabs between columns]
The preceding was created with Vim. It's shown as a screenshot rather
than text so you can see the first line has every column name preceded
by a \x01 character, otherwise known as a Ctrl+A. You can do this in
most Unix text editors, including Emacs. The purpose of the \x01
character is so the row of column names, otherwise known as the heading,
sorts to the top. Each column name is preceded by a \x01 so that if the
table is used to produce a table with columns rearranged, it will still
sort to the top. Implied in all of this is that characters \x00 and \x01
are illegal within NoSQL column values.
Every column's value is separated from the next column's value and the
previous column's value by a tab character. The same is true for the
column names on the top row. The first column is not preceded by a tab
character, and the last column is not followed by a tab character. All
of this means tab characters are illegal within a NoSQL column value.
Because rows are separated by newlines, newlines are illegal within a
column's value. So \x00, \x01, tab and newline are all illegal. If
binary data is required in NoSQL data, there are several alternatives:
- uuencode
- base64
- quoted-printable
- Filename of a file containing the binary data
If you want binary data that's more recognizable as what it is, you
could use the following escape characters:
Binary value    Escape string
------------    -------------
\x00            \x00
\x01            \x01
newline         \x0c
tab             \x0b
backslash       \\
Of course, you
would need to write the programs to encode and decode binary data to
respect those escapes.
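Here's a sketch of such an encoder, using Perl (which NoSQL already requires).
It isn't part of NoSQL; the escape strings follow the table above, and the
decoder would apply the reverse substitutions in the opposite order, backslash
last:

#!/bin/bash
# escape_binary.sh -- sketch: encode stdin so it is safe inside a NoSQL column value.
perl -0777 -pe '
    s/\\/\\\\/g;        # backslash first, so the later escapes stay unambiguous
    s/\x00/\\x00/g;
    s/\x01/\\x01/g;
    s/\t/\\x0b/g;
    s/\n/\\x0c/g;
'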
The leftmost column of any NoSQL table is its primary key, so in this case
it's person_id, and the rows are sorted by it in ascending order. If the
primary key is compound, then it's the leftmost 2 columns, or the
leftmost 3
columns, or however many segments comprise the primary key. The file
must be physically sorted by
primary key. If it is not, you'll see later there is a very easy way to
re-sort it by primary key.
Note also that the primary key is numeric. That means the
sort must be done numerically (similar to sort -n).
This
feature is included in NoSQL. It's usually easier to sort as a string.
If you start your numeric keys with a large number (say 100000), then
all numeric keys will be the same length and will therefore be sortable
as strings.
Finally, note that the last column is a seemingly meaningless number.
Later on we'll join that field to job_id in the jobs.table table. As a
little heads-up, job 1 is Visionary,
job 2 is Troubleshooter,
job 3 is President,
and job 4 is Vice
President.
OK, so we created the file, but is it truly a NoSQL table? Run your
script to create a NoSQL environment, and then use the column command
to see:
[slitt@mydesk 200704]$ nosql_environ.sh
[nosql_slitt@mydesk 200704]$ cat people.table | column -r
person_id       lname   fname   job_id
1       Strozzi Carlo   1
2       Torvalds        Linus   1
3       Stallman        Richard 1
4       Litt    Steve   2
5       Bush    George  3
6       Clinton Bill    3
7       Reagan  Ronald  3
8       Cheney  Dick    4
9       Gore    Al      4
[nosql_slitt@mydesk 200704]$
There's the proof. The -r
argument means "print all columns except the ones named", and since
none are named, it prints them all. Just for fun, let's correctly
justify everything:
[nosql_slitt@mydesk 200704]$ cat people.table | column -r | justify
person_id lname    fname   job_id
--------- -------- ------- ------
1         Strozzi  Carlo   1
2         Torvalds Linus   1
3         Stallman Richard 1
4         Litt     Steve   2
5         Bush     George  3
6         Clinton  Bill    3
7         Reagan   Ronald  3
8         Cheney   Dick    4
9         Gore     Al      4
[nosql_slitt@mydesk 200704]$
That's the proof of concept, assuming you're willing to make the table
in a text editor. If you do so, you're responsible for correctly
inserting tab characters and \x01 characters. After all is done, you
need to search for tab characters to make sure you didn't accidentally
use spaces where you should have had tabs, or tabs where you should
have had spaces.
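Two quick checks help here. GNU cat -T (or cat -A) displays each tab as ^I, so
stray spaces stand out, and a one-line awk script can confirm that every row
has the same number of tab-separated columns as the header. A sketch (run it
as ./checkcols.sh people.table):

#!/bin/bash
# checkcols.sh -- sketch: report rows whose tab-separated column count differs
# from the header's. istable does a fuller check; this just pinpoints the line.
awk -F'\t' 'NR == 1 { want = NF; next }
            NF != want { printf "line %d: %d columns, expected %d\n", NR, NF, want }' "$1"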
If you don't like creating a table with a text editor, read on...
Hello World, NoSQL Style
By Steve Litt
Create the following people.xref template file:
# This is a comment in the people.xref template file
# To create an empty people.table from people.xref,
# just run
#    cat people.xref | maketable > people.table
# BE CAREFUL not to overwrite a good people.table!!!
person_id  Primary key
lname      Person's last name
fname      Person's first name
job_id     ID of person's job (relate to jobs table)
The comments are preceded by hash characters, like any other Unix file.
The first set of non-space characters on non-comment lines is the
column name, while anything after the first space is a description that
is not used further.
Watch this:
[nosql_slitt@mydesk 200704]$ cat people.xref | maketable > people.table
[nosql_slitt@mydesk 200704]$ cat people.table | column -r
person_id       lname   fname   job_id
[nosql_slitt@mydesk 200704]$
The preceding proves that you just created a valid empty table with
filename people.table.
Now it's time to fill the table with data. Run the following command:
edittable people.table
You'll be editing a file that looks like this:
person_id|lname|fname|job_id
---------|-----|-----|------
Note that the editor you will be using will be Vim, unless you've
configured edittable
to use a different editor by setting and exporting the EDITOR environment variable. Continue editing by filling in values
separated by the pipe symbol:
person_id|lname|fname|job_id
---------|-----|-----|------
1|Strozzi|Carlo|1
2|Torvalds|Linus|1
3|Stallman|Richard|1
4|Litt|Steve|2
5|Bush|George|3
6|Clinton|Bill|3
7|Reagan|Ronald|3
8|Cheney|Dick|4
9|Gore|Al|4
Using edittable,
you
needn't type in the \x01 characters in each column name, and instead of
typing tab characters you use a much more visibly discernable pipe
character (|).
Now save and exit using the proper commands for whatever editor edittable is
configured to use (Vim is the default). As edittable
exits, it automatically checks what you typed in, to make sure you
included the right number of columns in every row, and made no other
mistakes. Assuming everything is OK, it converts your input into
properly formatted tab delimited data, as can be shown from
the
following command:
[nosql_slitt@mydesk 200704]$ cat people.table | column -r | justify
person_id lname      fname     job_id
--------- ---------- --------- ------
1         Strozzi    Carlo     1
2         Torvalds   Linus     1
3         Stallman   Richard   1
4         Litt       Steve     2
5         Bush       George    3
6         Clinton    Bill      3
7         Reagan     Ronald    3
8         Cheney     Dick      4
9         Gore       Al        4
[nosql_slitt@mydesk 200704]$
If the check for legality on exiting edittable had
found problems, you'd have been presented with the following choices:
istable: long record at line 9
istable: table not ok

What shall I do ?

1) Re-edit the table
2) Quit without saving the changes
3) Commit the changes anyway

=>
Almost always, the thing you want to do is re-edit. You probably put in
one too many or one too few pipe symbols on one of the lines, or you
left out a \x01 character in a column header, or maybe you used a tab
instead of a pipe symbol, or something else that's easily fixed.
Anyway, you've just proved the concept without directly editing in an
editor.
The justify Command
By Steve Litt
The justify
command makes
the output of a NoSQL command "look right". I've added a
couple
people, including one with the long name "Alexander Montgomery". Here's
what the table looks like using column -r:
[nosql_slitt@mydesk 200704]$ cat people.table | column -r
person_id       lname   fname   job_id
1       Strozzi Carlo   1
2       Torvalds        Linus   1
3       Stallman        Richard 1
4       Litt    Steve   2
5       Bush    George  3
6       Clinton Bill    3
7       Reagan  Ronald  3
8       Cheney  Dick    4
9       Gore    Al      4
10      Litt    Brett   5
11      Montgomery      Alexander       6
[nosql_slitt@mydesk 200704]$
Pretty ugly, isn't it. There's nothing like the combination of varying
string lengths plus the use of tabs to muddle which column is which.
Enter the justify
command:
[nosql_slitt@mydesk 200704]$ cat people.table | column -r | justify
person_id lname      fname     job_id
--------- ---------- --------- ------
1         Strozzi    Carlo     1
2         Torvalds   Linus     1
3         Stallman   Richard   1
4         Litt       Steve     2
5         Bush       George    3
6         Clinton    Bill      3
7         Reagan     Ronald    3
8         Cheney     Dick      4
9         Gore       Al        4
10        Litt       Brett     5
11        Montgomery Alexander 6
[nosql_slitt@mydesk 200704]$
Ahh, that's better. Tabs are converted to spaces, and the longest
string in each column determines the space given to the column. Note
that justify
itself can be used to display the table, so in the preceding command
the column -r
was unnecessary. I could have just done this:
cat people.table | justify
Care must be taken, because the output of justify is not a valid NoSQL
table, and therefore cannot be used as input to any type of database
command. Observe:
[nosql_slitt@mydesk 200704]$ cat people.table | istable
[nosql_slitt@mydesk 200704]$ echo $?
0
[nosql_slitt@mydesk 200704]$ cat people.table | justify | istable
[nosql_slitt@mydesk 200704]$ echo $?
1
[nosql_slitt@mydesk 200704]$
The istable
command returns 0 if fed a valid NoSQL table, nonzero otherwise.
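That exit status makes istable handy as a guard in shellscripts. A sketch:

#!/bin/bash
# Refuse to go further if people.table isn't a valid NoSQL table.
if istable < people.table; then
    sorttable lname fname < people.table | justify
else
    echo "people.table is not a valid NoSQL table" >&2
    exit 1
fi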
Sorting and Indexing
By Steve Litt
Here's how you sort a table alphabetically by rows:
cat people.table | sorttable
The output of that command is people.table sorted by the leftmost
column, then the second leftmost, then the third... If for some reason
you need it reverse sorted, add -r:
cat people.table | sorttable -r
Because tables should already
be sorted by the left column, which in NoSQL tables is the primary key,
or the left two or three columns, which in NoSQL tables are the two or
three keys comprising a multisegment primary key, whole row sorting is
seldom necessary unless you're sorting the output of a command that
left the data unsorted. However, often you need to sort by a column
that is not the primary key. The following is an example where I sort by last name and then first:
[nosql_slitt@mydesk 200704]$ cat people.table | sorttable lname fname | justify
person_id lname      fname     job_id
--------- ---------- --------- ------
100005    Bush       George    3
100008    Cheney     Dick      4
100006    Clinton    Bill      3
100009    Gore       Al        4
100010    Litt       Brett     5
100004    Litt       Steve     2
100012    Mondale    Walter    4
100011    Montgomery Alexander 6
100007    Reagan     Ronald    3
100013    Reagan     Ronald    3
100003    Stallman   Richard   1
100001    Strozzi    Carlo     1
100002    Torvalds   Linus     1
[nosql_slitt@mydesk 200704]$
The sorttable -h help facility says you can also sort different columns in different ways, like this:
cat people.table | sorttable job_id:rn lname fname
In the preceding the rn after the colon means reverse (r) and numeric
(n). I haven't had a lot of luck with the stuff after the colon -- your
mileage may vary.
Being a very lightweight database management system (DBMS), NoSQL
requires that the key for a table be the first (leftmost) column,
and that the table be physically sorted on the key.
Sorting can be alphabetic or numeric. For keys especially, in my
opinion alphabetic sorts are much better. With alphabetic keys, you can
append a record with cat
>>.
If the key were numeric, the file would need re-sorting after every
record addition. Also, if the key is sorted numerically, the keysearch
command fails if some key values have fewer digits than others. All this
can be avoided by making the first key 100001. This would enable
arithmetic incrementing of the key, while the next 899,999 records would
have keys of the same digit length, so they could be appended and
sorted right. With most applications, by the time the key grows another
digit, the app would need a rewrite, at which time the key length could
be increased.
Even if you need to make it longer, you can do it on the fly with the compute command:
[nosql_slitt@mydesk 200704]$ cat people.table | compute person_id=person_id+1000000
NoSQL: compute is deprecated, use 'awktable' instead
person_id       lname   fname   job_id
1100001 Strozzi Carlo   1
1100002 Torvalds        Linus   1
1100003 Stallman        Richard 1
1100004 Litt    Steve   2
1100005 Bush    George  3
1100006 Clinton Bill    3
1100007 Reagan  Ronald  3
1100008 Cheney  Dick    4
1100009 Gore    Al      4
1100010 Litt    Brett   5
1100011 Montgomery      Alexander       6
1100012 Mondale Walter  4
1100013 Reagan  Ronald  3
[nosql_slitt@mydesk 200704]$
Here in 2007, pay no attention to the message of deprecation -- compute works
just fine, and awktable is not yet well documented. You can move people.table
to a temp file, then run the preceding command on the temp file, redirecting
it to people.table. The resulting people.table will have a seven-digit
person_id field. New rows can be appended via cat >>, and the keysearch
command works properly.
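Here's a sketch of that move-to-temp-and-rewrite step. Keep the temp copy
around until you've eyeballed the result:

#!/bin/bash
# Renumber person_id in place, the long way around a temp file.
mytemp=`mktemp`
mv people.table $mytemp
compute person_id=person_id+1000000 < $mytemp > people.table
# rm -f $mytemp    # only after people.table has been checked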
Compound Keys
NoSQL tables accommodate compound keys. For instance, if you were to
key the people table on last name and then first name, you'd
just
make sure lname
was the first column, fname
the second, and then sort the table. In that case you could search on
the first of the key segments like this:
keysearch -p Litt people_byname.table
A disadvantage of meaningful primary keys is that you can never count
on newly inserted rows having a higher key value than the others, so
the table must be sorted each time.
Indexing
As mentioned, in NoSQL every table should be sorted by its primary key.
However, sometimes fast searches must be done on other columns. Enter indexing.
By creating an index on a given column, you enable keysearch to
work with that column.
To index people.table on job_id, you'd do this:
[nosql_slitt@mydesk 200704]$ indextable people.table job_id
gen ndx file people.table._x.job_id (people.table) on: job_id
Reading people.table ...
Finishing sorting ...
[nosql_slitt@mydesk 200704]$ keysearch -I people.table._x.job_id 3 people.table
person_id       lname   fname   job_id
100005  Bush    George  3
100006  Clinton Bill    3
100007  Reagan  Ronald  3
[nosql_slitt@mydesk 200704]$
In the preceding session, the second command proved that the index was
built correctly, using the -I argument to specify which index file to
use in the search.
Whenever you add rows to a table, you must reindex all its indices.
Here's the command you use to reindex the index on job_id:
[nosql_slitt@mydesk 200704]$ indextable --update people.table._x.job_id
gen ndx file people.table._x.job_id (people.table) on: job_id
Reading people.table ...
Finishing sorting ...
[nosql_slitt@mydesk 200704]$
Until its indices are rebuilt, the keysearch command will likely deliver
wrong rows when called on an index made stale by added rows. So it's
sometimes OK to add several rows before reindexing, but reindexing must be
guaranteed to happen before any further calls to keysearch on any of the
table's indices.
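In script form, an add-then-reindex batch might look like this sketch
(new_rows.table is a hypothetical file with the same structure as
people.table):

#!/bin/bash
# Append a batch of rows, then bring the job_id index back in sync
# before anybody runs keysearch against it.
tail -n +2 new_rows.table >> people.table     # skip the batch file's header row
indextable --update people.table._x.job_id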
Does all this sorting and indexing seem slow? Read on...
Speed and Efficiency
By Steve Litt
NOTE
All timing tests in this article were done on an Athlon XP2600+ with
1.5GB of RAM. As of 3/2007, this is a faster than average machine, but
$600.00 will buy you a more powerful desktop, and $1300.00 will buy you
a more powerful notebook, so it's definitely off-the-shelf hardware.
You need to sort and reindex often. Sounds kinda slow.
#!/usr/bin/ruby -w

fnames = [
  "Albert", "Betty", "Charles", "Darlene", "Edward", "Felice", "George",
  "Helen", "Ichibod", "Janet", "Karl", "Linda", "Max", "Nancy", "Oliver",
  "Patricia", "Quincy", "Rita", "Steve", "Tiffany", "Uvon", "Victoria",
  "William", "Xaviera", "Yale", "Zelda"
]

lnames = [
  "Anderson", "Becker", "Chavez", "Daniels", "Elsworth", "Fox", "Garcia",
  "Holden", "Iverson", "Jackson", "Kantor", "Levin", "Martinez", "Nagy",
  "Olmstead", "Peterson", "Quigley", "Ramirez", "Stevenson", "Torkelson",
  "Ullman", "Victor", "Williamson", "Xavier", "Yarborough", "Ziegler"
]

puts "\x01people_id\t\x01lname\t\x01fname\t\x01job_id"

people_id = 100000
multiplier = (ARGV[0] || 10).to_i   # default to 10 repetitions when no argument given

(1..multiplier).each do |n|
  lnames.each do |lname|
    fnames.each do |fname|
      people_id += 1
      job_id = 1000 + (people_id % 7)
      puts "#{people_id.to_s}\t#{lname}\t#{fname}\t#{job_id.to_s}"
    end
  end
end
It usually isn't, at least on moderately sized tables. To research this,
let's create a program to output a people table; the program, written in
Ruby, is shown above. Then we create two scripts: the first runs the (slow)
Ruby program and redirects its output to ./junk.jnk, and then times the
execution of the second one, which sorts the file created by the Ruby.
Here's ./testtime.sh:
#!/bin/bash
./test_table.rb $1 > junk.jnk
time ./testtime2.sh
and here's ./testtime2.sh:
#!/bin/bash
sorttable fname < junk.jnk | wc
The results are listed below, but basically, with 67,600 rows, it sorts in
the time you could hit two or three keystrokes. With 676,000 rows, it takes
about 5 seconds, which is tolerable in some situations. With 6,760,000 rows
(that's 6 million rows, folks), it takes an intolerable 65 seconds.
In other words, when the number of rows is expected to go significantly
above 100,000, try to select before you sort (see the sketch after the
timing results below).
[nosql_slitt@mydesk 200704]$ ./testtime.sh 10
   6761  27044 179953
0.04user 0.00system 0:00.05elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+1275minor)pagefaults 0swaps
[nosql_slitt@mydesk 200704]$ ./testtime.sh 100
  67601 270404 1799233
0.42user 0.02system 0:00.45elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2027minor)pagefaults 0swaps
[nosql_slitt@mydesk 200704]$ ./testtime.sh 1000
 676001 2704004 17992033
5.00user 0.15system 0:05.15elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+9544minor)pagefaults 0swaps
[nosql_slitt@mydesk 200704]$ ./testtime.sh 10000
6760001 27040004 185780034
62.27user 2.83system 1:05.57elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+71507minor)pagefaults 0swaps
[nosql_slitt@mydesk 200704]$
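As a sketch of "select before you sort", using the filtertable and getrow
commands covered later in this issue, you'd prescreen the big file down to
the rows you care about and only then sort:

#!/bin/bash
# Select first, sort second: grep throws away most of junk.jnk cheaply,
# getrow keeps only true Anderson rows, and sorttable works on the remainder.
filtertable grep Anderson < junk.jnk | getrow "lname==\"Anderson\"" | sorttable fname | justify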
Now let's see how much time it takes to reindex, by changing
testtime.sh and testtime2.sh to the following:
testtime.sh:

#!/bin/bash
./test_table.rb $1 > junk.jnk
indextable junk.jnk fname
echo "aaaaaa	aronson	alphonso	1008" >> junk.jnk
echo "bbbbbb	benitez	betty	1009" >> junk.jnk
wc junk.jnk
time ./testtime2.sh

testtime2.sh:

#!/bin/bash
indextable --update junk.jnk._x.fname
[nosql_slitt@mydesk 200704]$ ./testtime.sh 10
gen ndx file junk.jnk._x.fname (junk.jnk) on: fname
Reading junk.jnk ...
Finishing sorting ...
   6763  27052 180008 junk.jnk
gen ndx file junk.jnk._x.fname (junk.jnk) on: fname
Reading junk.jnk ...
Finishing sorting ...
0.05user 0.02system 0:00.07elapsed 102%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2513minor)pagefaults 0swaps
[nosql_slitt@mydesk 200704]$ ./testtime.sh 100
gen ndx file junk.jnk._x.fname (junk.jnk) on: fname
Reading junk.jnk ...
Finishing sorting ...
  67603 270412 1799288 junk.jnk
gen ndx file junk.jnk._x.fname (junk.jnk) on: fname
Reading junk.jnk ...
Finishing sorting ...
0.52user 0.01system 0:00.54elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+3211minor)pagefaults 0swaps
[nosql_slitt@mydesk 200704]$ ./testtime.sh 1000
gen ndx file junk.jnk._x.fname (junk.jnk) on: fname
Reading junk.jnk ...
Finishing sorting ...
 676003 2704012 17992088 junk.jnk
gen ndx file junk.jnk._x.fname (junk.jnk) on: fname
Reading junk.jnk ...
Finishing sorting ...
5.82user 0.40system 0:06.21elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+9031minor)pagefaults 0swaps
[nosql_slitt@mydesk 200704]$ ./testtime.sh 10000
gen ndx file junk.jnk._x.fname (junk.jnk) on: fname
Reading junk.jnk ...
Finishing sorting ...
6760003 27040012 185780089 junk.jnk
gen ndx file junk.jnk._x.fname (junk.jnk) on: fname
Reading junk.jnk ...
Finishing sorting ...
59.06user 4.30system 1:06.90elapsed 94%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+9769minor)pagefaults 0swaps
[nosql_slitt@mydesk 200704]$
Above you see the reindexing results. 67,603 rows take half a second.
676,003 rows take a slightly irritating but still tolerable 6.2 seconds.
6,760,003 rows take an unacceptable 66 seconds.
There are ways around this, as explained on the NoSQL website. One way
is to make all adds and changes go to a small journaling table, and use
that to update the main table every night.
Then there's the other viewpoint: When you have over a million rows in
any table, you've probably outgrown NoSQL if you do many writes, and
you've certainly outgrown an Athlon XP2600+ with 1.5GB of RAM.
The bottom line is this: If you have a reasonably powerful computer and
none of your tables exceed 100,000 rows, NoSQL will keep up with your
keystrokes. 100,000 rows exceeds the needs of most personal, home office,
and mom and pop applications, especially if every year you move every
transaction older than 2 years into an archive table that never changes.
Viewing Data From a Single Table
By Steve Litt
Now let's look more carefully at viewing data from a single table. When
I say "viewing", I use the term loosely, because NoSQL does most of its
work through piping and redirection. Thus, you could create a different
table like this:
[nosql_slitt@mydesk 200704]$ cat people.table | column fname lname job_id | sorttable > people_byname.table
[nosql_slitt@mydesk 200704]$ cat people_byname.table
Al      Gore    4
Bill    Clinton 3
Carlo   Strozzi 1
Dick    Cheney  4
George  Bush    3
Linus   Torvalds        1
Richard Stallman        1
Ronald  Reagan  3
Steve   Litt    2
[nosql_slitt@mydesk 200704]$
Look what you just did with one command. You removed the person_id
column, reversed the order of lname
and fname,
used fname
and lname
as a compound key, and physically sorted the table by that key. In a
system that uses piping and redirection, "viewing" can be a very
powerful thing.
Before moving on, one more thing about the preceding table organization
-- one could ask how NoSQL deduced that it was a compound key comprised
of fname
and lname,
rather than a single key comprised of fname, or a
triple compound key comprised of fname, lname,
and job_id.
It doesn't deduce it at this point -- the table is sorted by the
combination of all three, so it could be used as a single, double
compound or triple compound key. Later, when you do things like create
indexes or schemas, such keys will be defined more exactly. But it
really doesn't matter -- you just select on whatever single or compound
key that is created by the leftmost set of columns, and properly sorted.
When viewing data, you can either view all the data or restrict what you
view. The three kinds of restriction are these:
Restrict by    Commands
-----------    --------
Column         column
Row            keysearch, getrow, filtertable, awktable
Join           jointable
Restricting by Column
The column Command
This is how you view only certain columns, or reorder columns.
Remembering people.table, if you don't care about seeing the
job_id column, you can do this:
cat people.table | column person_id lname fname
A simpler way to accomplish that would have been to use the -r reverse
option, whereby everything not
named as a column is printed:
cat people.table | column -r job_id
Remember that NoSQL defines a primary key as the leftmost
column(s), so using the column
command to re-key the table is as simple as moving the intended key
column(s) into the leftmost position.
Perhaps you want to switch the first and last names while removing
the job_id:
cat people.table | column person_id fname lname
Restricting by Row
These commands select the rows to be output.
The keysearch Command
Certainly the fastest of these commands is the keysearch
command because it binary-searches either the primary key or an index
to come up with the row(s) in just a few random reads. In the
people.table table, here's how to find the row with person_id 100007:
[nosql_slitt@mydesk 200704]$ keysearch 100007 people.table
person_id       lname   fname   job_id
100007  Reagan  Ronald  3
[nosql_slitt@mydesk 200704]$
But what if you want to select by job_id, which is not sorted? Luckily,
there's an index on job_id called people.table._x.job_id. Here's how
you do it:
[nosql_slitt@mydesk 200704]$ keysearch -I people.table._x.job_id 3 people.table
person_id       lname   fname   job_id
100005  Bush    George  3
100006  Clinton Bill    3
100007  Reagan  Ronald  3
[nosql_slitt@mydesk 200704]$
The getrow Command
The keysearch
command is
lightning fast as long as your primary key is sorted and your indices
are up to date. If not, or if the column on which you want to select
isn't indexed, you can use the getrow
command, which searches sequentially. If you have only a few hundred
rows, getrow is fast...
[slitt@mydesk 200704]$ cat people.table | getrow 'lname=="Litt"'
NoSQL: getrow is deprecated, use 'awktable' instead
person_id       lname   fname   job_id
100004  Litt    Steve   2
100010  Litt    Brett   5
[slitt@mydesk 200704]$ cat people.table | getrow 'job_id=="3"'
NoSQL: getrow is deprecated, use 'awktable' instead
person_id       lname   fname   job_id
100005  Bush    George  3
100006  Clinton Bill    3
100007  Reagan  Ronald  3
[slitt@mydesk 200704]$
Notice the single and doublequotes. These work, at least until you try
to use an environment variable for the search term or you try to put
the command in a shellscript. BY FAR the preferred syntax is to use
doublequotes on the outside and escaped double quotes for the inner
quotes, as follows:
[nosql_slitt@mydesk 200704]$ cat people.table | getrow "lname==\"Litt\""
NoSQL: getrow is deprecated, use 'awktable' instead
person_id       lname   fname   job_id
100004  Litt    Steve   2
100010  Litt    Brett   5
[nosql_slitt@mydesk 200704]$ cat people.table | getrow "job_id==\"3\""
NoSQL: getrow is deprecated, use 'awktable' instead
person_id       lname   fname   job_id
100005  Bush    George  3
100006  Clinton Bill    3
100007  Reagan  Ronald  3
[nosql_slitt@mydesk 200704]$
Note that with numeric fields you can forego the quotes entirely, but
with text columns removing the quotes causes wrong answers:
[slitt@mydesk 200704]$ cat people.table | getrow job_id==3
NoSQL: getrow is deprecated, use 'awktable' instead
person_id       lname   fname   job_id
100005  Bush    George  3
100006  Clinton Bill    3
100007  Reagan  Ronald  3
[slitt@mydesk 200704]$ cat people.table | getrow lname==Litt
NoSQL: getrow is deprecated, use 'awktable' instead
person_id       lname   fname   job_id
[slitt@mydesk 200704]$ cat people.table | getrow lname=="Litt"
NoSQL: getrow is deprecated, use 'awktable' instead
person_id       lname   fname   job_id
[slitt@mydesk 200704]$
Once again, I suggest ignoring the deprecation notice. As of March
2007, getrow works just fine, and awktable is not well documented, and
depends on a knowledge of awk.
The filtertable Command
I love this command. It adapts any Unix filter (a command taking input
from stdin
and putting output to stdout)
to work on a NoSQL table. Specifically, it outputs the table's header,
and then runs the filter on the rest. For instance, here's a way you
can use filtertable
to speed selection of data:
[slitt@mydesk 200704]$ filtertable grep Litt < people.table | getrow 'lname=="Litt"'
NoSQL: getrow is deprecated, use 'awktable' instead
person_id       lname   fname   job_id
100004  Litt    Steve   2
100010  Litt    Brett   5
[slitt@mydesk 200704]$
Of course, this won't speed selection on my tiny table, but if your
table has a million rows, the ultra-efficient grep runs circles around getrow. Now of
course, a guy named Litt Stevenson might be in the table, and he
shouldn't be selected because Litt is his first name, not his
last name.
That's why the grep-selected rows must be fed to getrow, but at
that point getrow
might be working with a hundred rows instead of a million.
The Unix utility filters are the ultimate performers, but you might
need to write your own filter. Consider a rather contrived situation
where you need to insert a spouse for every current person. Here's one
way you might do it, starting with a Ruby program to do the insertions:
#!/usr/bin/ruby -w
STDIN.each do |line|
  line.chomp!
  (person_id, lname, fname, job_id) = line.split
  puts line
  spouse_id = person_id.to_i + 200000
  puts "#{spouse_id}\t#{lname}\tSpouse\t9"
end
Now use the Ruby program as a filter to add spouses:
[slitt@mydesk 200704]$ filtertable ./insertspouses.rb < people.table
person_id       lname   fname   job_id
100001  Strozzi Carlo   1
300001  Strozzi Spouse  9
100002  Torvalds        Linus   1
300002  Torvalds        Spouse  9
100003  Stallman        Richard 1
300003  Stallman        Spouse  9
100004  Litt    Steve   2
300004  Litt    Spouse  9
100005  Bush    George  3
300005  Bush    Spouse  9
100006  Clinton Bill    3
300006  Clinton Spouse  9
100007  Reagan  Ronald  3
300007  Reagan  Spouse  9
100008  Cheney  Dick    4
300008  Cheney  Spouse  9
100009  Gore    Al      4
300009  Gore    Spouse  9
100010  Litt    Brett   5
300010  Litt    Spouse  9
100011  Montgomery      Alexander       6
300011  Montgomery      Spouse  9
[slitt@mydesk 200704]$
Of course, you and I know that to be a useful table, the preceding must
be filtered through an additional sort so the primary key is in order,
but I wanted you to see how the filter inserted the proper spouse rows,
and of course with their own unique person_id keys.
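In other words, the complete pipeline would be something like this (the
output filename is just an example):

filtertable ./insertspouses.rb < people.table | sorttable > people_with_spouses.table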
The awktable Command
Currently (March 2007), this command is not documented well. That will
change as time goes on, but for the time being, you might want to try
other selection techniques.
Joining
By Steve Litt
The whole theory of relational databases is that in a given table, you
list only same-typed entities and information that directly
depends on the entity. For instance, if you have a table of
departments, and each department has a manager, you don't put that
manager's name, phone and address in the department row. Instead you
put the manager's employee number in the department row, and then if
necessary look up the manager's information in the employee table,
based on the manager's employee number gotten from the
department
row. The manager's address depends directly on the manager, not on the
department. You'd really hate to update address and phone every time a
manager got promoted and a new one took her place.
Let's make the following jobs.table:
job_id  job_name
1       Visionary
2       Troubleshooter
3       President
4       Vice President
5       Writer
9       Spouse
Be sure to remember to precede each column name in the header with
\x01, and to put tabs between the first and second column, both in the
header and in the data.
[slitt@mydesk 200704]$ cat people.table | sorttable job_id | jointable -j job_id jobs.table | justify
job_id person_id lname    fname   job_name
------ --------- -------- ------- --------------
1      100001    Strozzi  Carlo   Visionary
1      100002    Torvalds Linus   Visionary
1      100003    Stallman Richard Visionary
2      100004    Litt     Steve   Troubleshooter
3      100005    Bush     George  President
3      100006    Clinton  Bill    President
3      100007    Reagan   Ronald  President
4      100008    Cheney   Dick    Vice President
4      100009    Gore     Al      Vice President
5      100010    Litt     Brett   Writer
9      100000    Cheney   Lynn    Spouse
[slitt@mydesk 200704]$
I had to sort the table with the foreign key. The table with the
primary key was, of course, already sorted. That sort can take a long
time if the table with the foreign key is large. More on this later.
Meanwhile, there's typically no need for the job_id, so the following
is a more usual join:
[nosql_slitt@mydesk 200704]$ cat people.table | sorttable job_id | \
        jointable -j job_id jobs.table | column -r job_id | \
        sorttable | justify
person_id lname    fname   job_name
--------- -------- ------- --------------
100000    Cheney   Lynn    Spouse
100001    Strozzi  Carlo   Visionary
100002    Torvalds Linus   Visionary
100003    Stallman Richard Visionary
100004    Litt     Steve   Troubleshooter
100005    Bush     George  President
100006    Clinton  Bill    President
100007    Reagan   Ronald  President
100008    Cheney   Dick    Vice President
100009    Gore     Al      Vice President
100010    Litt     Brett   Writer
[nosql_slitt@mydesk 200704]$
As mentioned, joining on a very large table requires a time-consuming
sort if that's the table with the foreign key. Why would you join such
a large table? Certainly not to look at it -- who has the time to look
at a million rows?
You'd typically either look at a subset of it, or you'd use the join to
select rows. Let's start with using the join to select rows...
The SQL syntax for this would be:
select p.*
from people p, jobs j
where j.job_name='President'
and p.job_id=j.job_id;
Of course, as obvious from the name, you don't use SQL in NoSQL, so
you'd do the equivalent like this:
[nosql_slitt@mydesk 200704]$ cat people.table | sorttable job_id | jointable -j job_id jobs.table | \
        column -r job_id | getrow 'job_name=="President"' | justify
NoSQL: getrow is deprecated, use 'awktable' instead
person_id lname   fname  job_name
--------- ------- ------ ---------
100005    Bush    George President
100006    Clinton Bill   President
100007    Reagan  Ronald President
[nosql_slitt@mydesk 200704]$
As you can see, that pipeline does exactly the same thing as the SQL
command, in pretty much the same way. And it's expensive. You need to
sort the entire people table (let's say it had a million rows) and use
an inefficient getrow comparison on the entire people table. You can do
much better with a short script:
#!/bin/bash
job_id=`cat jobs.table | getrow "job_name==\"President\"" | column -N job_id`
filtertable grep $job_id < people.table | getrow "job_id==\"$job_id\"" | justify
Imagine a million row people table with a short jobs table, which is
typical of one to many relations. The first statement finds the job_id of the
job whose name is "President". That's an expensive getrow, but it's on a
small table. Armed with the job_id, the second statement pipelines several
operations:
- Uses the highly efficient grep command to find any row that has that
number, whether it applies to the job_id or not. This filters out the vast
majority of rows.
- Uses the inefficient but accurate getrow command to select only those rows
where the job_id is the sought number.
Let's call the script find_presidents.sh,
run it, and see what happens:
[nosql_slitt@mydesk 200704]$ ./find_presidents.sh
NoSQL: getrow is deprecated, use 'awktable' instead
NoSQL: getrow is deprecated, use 'awktable' instead
person_id lname   fname  job_id
--------- ------- ------ ------
100005    Bush    George 3
100006    Clinton Bill   3
100007    Reagan  Ronald 3
[nosql_slitt@mydesk 200704]$
There are a plethora of ways to speed up NoSQL. The NoSQL web page (see
URLs section of the magazine) has many speedup methods.
To review, when using SQL to query a one to many pair of joined tables
on a column that's not the primary key of the one side, you need
to perform a selection on a non-sorted (and presumably non-indexed)
column for each row of the many
side. When doing the same thing in NoSQL by using a jointable
operation, you also have to pre-sort the many side.
But unlike pure SQL, NoSQL gives you the option to perform operations
in stages, so you can find the one
side's primary key with a relatively cheap getrow on this
small table, then perform a lightning quick grep command to
pre-screen most of the unneeded rows on the many side, and
finally perform getrow
on the now tiny subset of the many
side.
jointable is best reserved for tables having fewer than 50,000 reasonably
sized rows.
Add, Change and Delete
By Steve Litt
NoSQL has been built from the ground up for efficiency reading data.
The NoSQL documentation is clear about that fact. If you want 50 people
simultaneously writing to the same database on a regular basis, NoSQL
is not a good choice. Even 10 people could get irritating.
- - - BUT - - -
A lot of databases are used by one person at a time. That's perfect for
NoSQL.
Many databases are written by few and read by many. The web shopping
cart catalog (not the purchase facility -- the catalog) is a perfect example, but there are many others.
Many databases are heavily written, but mostly adds, with few changes
or deletes. If instantaneous access to added rows is not critical,
each data entry person can enter data into his/her own table, and then at
the end of the day a merge program can be run to add the rows to the main
table and update all indices.
I know, I know, that last suggestion sounds sooooo 1985. Data is needed
instantaneously. In an airline reservation system, the customer's
booking of the flight must instantaneously delete a seat, or
overbooking will occur.
I guess that's just the point. Don't use NoSQL on an airline
reservation system. But there are MANY other situations where batch
updates are still OK, even in 2007 (and probably still in 2027). A
little kludging is worth it, because the person knowing Unix already
knows 3/4 of NoSQL, and the person liking Unix will like NoSQL.
Oh, and by the way, if you absolutely must have realtime updates with
NoSQL, you can do what the NoSQL website says on their Bigtables page:
put all adds, deletes and updates in a small journal file, and have all
reads first access the (presumably large) main table, and then
sequentially search the small journal table to see any recent
modifications. Then, once an hour or once a day or whatever, merge the
journaling table with the main table and reindex all indices. This
would involve a few extra lines of code on each shellscript comprising
your app, but it's probably a day's work for a moderately sized
application.
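Here's a sketch of that periodic merge, assuming a journal.table with the same
structure as people.table and ..DEL.. rows marking deletions (the updtable
command is covered later in this issue):

#!/bin/bash
# Fold the small journal into the main table, reindex, start a fresh journal.
mytemp=`mktemp`
cat people.table | updtable journal.table > $mytemp
mv $mytemp people.table
indextable --update people.table._x.job_id
head -n1 people.table > journal.table      # empty journal: header line only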
Adding Rows
Here's an easy way to add a row:
#!/bin/bash

lname=$1
fname=$2
job_id=$3

person_id=`tail -n1 people.table | cut -f1`
let person_id=$person_id+1
echo "$person_id	$lname	$fname	$job_id" >> people.table   # separators are tab characters
In the preceding, note that the row to be added is enclosed in
doublequotes, and its columns are separated by tab characters.
Obviously, the preceding command would be almost instant, no matter
what the size of the table. Assuming the table is sorted by primary key
(NoSQL tables should always
be sorted by primary key), and assuming the primary key is
non-meaningful (and
primary keys should always be non-meaningful), this is simply
a lightning quick autoincrement facility that happens just before the
append.
Of course, if the table has indices, those indices must be updated, and
that can take some time on larger tables.
So maybe you shouldn't have indices.
WHAT? HERESY! ABOMINATION! Without indices, data operations will take
forever!
If this were MS SQLserver they might. But this is NoSQL, and you can
use the old prescreen with grep trick to search a large table amazingly
fast. As fast as a binary search on an index? No, but often fast
enough. And anyway, if you need indices, you can use the journaling
table technique described earlier.
By the way, there are two NoSQL provided ways to add a row, addrow and updtable.
Fast Deletions
Say I want to delete the row with a known primary key. Remember, the
primary key is the leftmost key, so watch this mydel.sh:
#!/bin/bash
mytemp=`mktemp`
mv people.table $mytemp
grep -v "^$1	" $mytemp > people.table   # that's a tab after $1
rm -f $mytemp
Notice the tab after the $1. That's to make sure you don't accidentally
delete a record whose longer key starts with the key you want to delete.
If you need to delete a bunch of rows for which you know the primary
keys, do this:
#!/bin/bash
mytemp=`mktemp`
mv people.table $mytemp
cat $mytemp | grep -v -f remove_list > people.table
rm -f $mytemp
In the preceding, remove_list must contain one key per line, preceded
by the caret (^) sign standing for start of line, and ended with a tab
character so it doesn't match a longer key starting with the desired
one.
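If you start from a plain file of doomed keys, one per line (call it keys.txt,
a hypothetical name), a one-liner can decorate them into the form grep -v -f
wants. A sketch:

# build remove_list from keys.txt; the \t in the printf format is a real tab
mawk '{ printf "^%s\t\n", $1 }' keys.txt > remove_list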
updtable: Change, Add and Delete
Let's face it. NoSQL tables are text files meant, for the most part, to
be processed sequentially, not randomly. Yes, its index files have byte
offsets that can be used for random access, but for the most part NoSQL
tables are best accessed sequentially.
For adds that's certainly no problem -- the >>
redirection operator was made just for that. For deletes, well, grep -v
does a fast job in all but the biggest tables. But when dealing with
changes, sequential file operations aren't all that efficient. So NoSQL
gives you the updtable
program, which takes an identically structured table full of adds,
changes and deletes, and applies them to the original table. That way the
operation stays sequential, because the original table is scanned only once.
For example, let's take this people.table:
[nosql_slitt@mydesk 200704]$ justify < people.table
person_id lname      fname     job_id
--------- ---------- --------- ------
100001    Strozzi    Carlo     1
100002    Torvalds   Linus     1
100003    Stallman   Richard   1
100004    Litt       Steve     2
100005    Bush       George    3
100006    Clinton    Bill      3
100007    Reagan     Ronald    3
100008    Cheney     Dick      4
100009    Gore       Al        4
100010    Litt       Brett     5
100011    Montgomery Alexander 6
100012    Mondale    Walter    4
100013    Reagan     Ronald    3
[nosql_slitt@mydesk 200704]$
Imagine you want to perform the following changes:
- Add Ben Franklin as a visionary with person_id 100020
- Add George Washington as a president with person_id 100019
- Add Abraham Lincoln as a president with person_id 100018
- Change Steve Litt from a Troubleshooter to a Visionary
(job_id 2 to 1)
- Delete Alexander Montgomery, who has a person_id of 100011
So here's what you do. You create an empty table with the same
structure as people.table:
head -n1 people.table > xaction.table
Then you populate xaction.table, with Vim, or edittable, or
some automated way. Here's what the populated table looks like when
justified:
[nosql_slitt@mydesk 200704]$ justify < xaction.table
person_id lname      fname   job_id
--------- ---------- ------- ------
100020    Franklin   Ben     1
100019    Washington George  3
100018    Lincoln    Abraham 3
100004    Litt       Steve   1
100011    ..DEL..
[nosql_slitt@mydesk 200704]$
In the preceding, any row with any non-primary-key column populated by the
string ..DEL.. is deleted from the output. Now run people.table through
updtable as modified by xaction.table as follows:
[nosql_slitt@mydesk 200704]$ cat people.table | updtable xaction.table
person_id       lname   fname   job_id
100001  Strozzi Carlo   1
100002  Torvalds        Linus   1
100003  Stallman        Richard 1
100004  Litt    Steve   1
100005  Bush    George  3
100006  Clinton Bill    3
100007  Reagan  Ronald  3
100008  Cheney  Dick    4
100009  Gore    Al      4
100010  Litt    Brett   5
100012  Mondale Walter  4
100013  Reagan  Ronald  3
100018  Lincoln Abraham 3
100019  Washington      George  3
100020  Franklin        Ben     1
[nosql_slitt@mydesk 200704]$
Rows with ..DEL..
in any
non-primary-key column are deleted from the main table. Any rows
existing in the edit table but not in the main table are added to the
main table. Any rows existing in the main table but not in the edit
table are left intact. Any rows that exist in both tables (they have
the same primary key) are modified in the main table.
The updtable command is written so generically that neither the main table
nor the updates table need be sorted. I would guess that requires an
algorithm where every row of the main table must be compared to every row of
the edit table. If both are large, I would think this would be time
consuming, even though those comparisons are done in memory (hopefully).
Steve Litt is the author of many books. Steve can be reached at his email
address.
An awk Primer
By Steve Litt
You can do NoSQL without awk, but awk gives you many more
opportunities. Awk is a programming language specifically designed for
parsing files. Each rule evaluates every line, and each action affiliated
with a rule is taken on every line for which the rule is true. In other
words, there's an implied "for each line" loop around the main routine, but
the programmer doesn't need to program it. Awk automatically parses each
line into fields -- the default field separator is whitespace. Because NoSQL
uses tabs to separate fields and none of our column values contain spaces,
awk's default works perfectly here; if your column values can contain
spaces, set the field separator to a tab. Let's start with a
Hello World awk program, called hello.awk:
#!/usr/local/bin/mawk -We
{print "Hello World"}
Remember to chmod hello.awk to be executable.
In the preceding, the {print "Hello World"} is an action with no rule, meaning that any line matches and therefore it's performed on every line.
Now run it:
[nosql_slitt@mydesk 200704]$ cat people.table | ./hello.awk
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
[nosql_slitt@mydesk 200704]$
As you can see, it printed "Hello World" several times -- in fact one
for every line in people.table. This is a demonstration of the "implied
loop" surrounding every awk program.
Now let's demonstrate a real rule -- we'll print only on the first line:
#!/usr/local/bin/mawk -We
NR==1{print $0}
The rule is NR==1. NR is awk's built in line number variable. When we pipe people.table into it, only the first line (the header) is shown:
[nosql_slitt@mydesk 200704]$ cat people.table | ./hello.awk
person_id       lname   fname   job_id
[nosql_slitt@mydesk 200704]$
Now let's make it a little more useful by showing the number of fields
on each line. Remember, fields are separated by whitespace:
#!/usr/local/bin/mawk -We
{print NF}
Here's the output:
[nosql_slitt@mydesk 200704]$ cat people.table | ./hello.awk
4
4
4
4
4
4
4
4
4
4
4
4
4
4
[nosql_slitt@mydesk 200704]$
Every line has four fields. Let's look at a different file with
different numbers of fields on each line:
[nosql_slitt@mydesk 200704]$ cat /etc/pythonrc.py | ./hello.awk
1
7
2
2
7
3
2
1
1
14
9
[nosql_slitt@mydesk 200704]$
Now let's actually use each field. Remember that $0 stands for the
whole line, while $1 stands for the first field on the line, $2 the
second field, $3 the third, etc. The total number of fields on the line
is represented by the system variable NF. A simple but instructive
exercise would be, on each line, to print the number of fields and then
to place triple colons between the fields. Watch this:
#!/usr/local/bin/mawk -We
{
  i = 1
  outrecord = NF " " $1
  while(++i <= NF){
    outrecord = outrecord ":::" $i
  }
  print outrecord
}
Here's the result:
[nosql_slitt@mydesk 200704]$ cat people.table | ./hello.awk
4 person_id:::lname:::fname:::job_id
4 100001:::Strozzi:::Carlo:::1
4 100002:::Torvalds:::Linus:::1
4 100003:::Stallman:::Richard:::1
4 100004:::Litt:::Steve:::2
4 100005:::Bush:::George:::3
4 100006:::Clinton:::Bill:::3
4 100007:::Reagan:::Ronald:::3
4 100008:::Cheney:::Dick:::4
4 100009:::Gore:::Al:::4
4 100010:::Litt:::Brett:::5
4 100011:::Montgomery:::Alexander:::6
4 100012:::Mondale:::Walter:::4
4 100013:::Reagan:::Ronald:::3
[nosql_slitt@mydesk 200704]$
OK, how about hard coding it so we don't print anyone with a first name
beginning with R. This should eliminate Richard Stallman, and both
Ronald Reagans. We'll install a regular expression that prevents
further processing if triggered. Here it is:
#!/usr/local/bin/mawk -We
$3 ~ /^R/{next}
{
  i = 1
  outrecord = NF " " $1
  while(++i <= NF){
    outrecord = outrecord ":::" $i
  }
  print outrecord
}
In the preceding code, we added a line saying that if the third field
begins with R, then go to the next line (and skip the rest of the
implied loop for this one).
[nosql_slitt@mydesk 200704]$ cat people.table | ./hello.awk
4 person_id:::lname:::fname:::job_id
4 100001:::Strozzi:::Carlo:::1
4 100002:::Torvalds:::Linus:::1
4 100004:::Litt:::Steve:::2
4 100005:::Bush:::George:::3
4 100006:::Clinton:::Bill:::3
4 100008:::Cheney:::Dick:::4
4 100009:::Gore:::Al:::4
4 100010:::Litt:::Brett:::5
4 100011:::Montgomery:::Alexander:::6
4 100012:::Mondale:::Walter:::4
[nosql_slitt@mydesk 200704]$
|
As you can see in the preceding, all references to Richard and Ronald
are gone.
Instead of skipping people whose first name begins with R, let's
prepend an honorary doctorate to the first name of anyone whose job is
"Visionary" -- in
other words, whose job code is 1:
#!/usr/local/bin/mawk -We
$4 ~ /^1\s*$/{
    $3 = "Dr. " $3
}
{
    i = 1
    outrecord = NF " " $1
    while(++i <= NF){
        outrecord = outrecord ":::" $i
    }
    print outrecord
}
|
Notice in the preceding, the action for the rule about job_id being 1
happens first, and then the no-rule action happens. This is a big part
of the power of awk. If you order your actions right, you can do a lot.
[nosql_slitt@mydesk 200704]$ cat people.table | ./hello.awk
4 person_id:::lname:::fname:::job_id
4 100001:::Strozzi:::Dr. Carlo:::1
4 100002:::Torvalds:::Dr. Linus:::1
4 100003:::Stallman:::Dr. Richard:::1
4 100004:::Litt:::Steve:::2
4 100005:::Bush:::George:::3
4 100006:::Clinton:::Bill:::3
4 100007:::Reagan:::Ronald:::3
4 100008:::Cheney:::Dick:::4
4 100009:::Gore:::Al:::4
4 100010:::Litt:::Brett:::5
4 100011:::Montgomery:::Alexander:::6
4 100012:::Mondale:::Walter:::4
4 100013:::Reagan:::Ronald:::3
[nosql_slitt@mydesk 200704]$
|
Sometimes you need code before the beginning of the implied loop, and
sometimes you need code after the end of the implied loop. Next we
consider
totalling both the lines and fields read. We'll initialize the totals
before the loop, and write the totals after the loop. A special rule
called BEGIN is true before the loop starts, and another called END is
true after the loop finishes. This exercise will also introduce
arithmetic:
#!/usr/local/bin/mawk -We
BEGIN{
    fields=0
    lines=0
}
{
    lines++
    fields += NF
    i = 1
    outrecord = NF " " $1
    while(++i <= NF){
        outrecord = outrecord ":::" $i
    }
    print outrecord
}
END{
    print "Lines=" lines ", Fields=" fields "."
}
|
The preceding's pretty obvious. We initialized totals before the loop,
incremented them during the loop, and displayed them after the loop.
The result follows -- note the totals at the bottom:
[nosql_slitt@mydesk 200704]$ cat people.table | ./hello.awk
4 person_id:::lname:::fname:::job_id
4 100001:::Strozzi:::Carlo:::1
4 100002:::Torvalds:::Linus:::1
4 100003:::Stallman:::Richard:::1
4 100004:::Litt:::Steve:::2
4 100005:::Bush:::George:::3
4 100006:::Clinton:::Bill:::3
4 100007:::Reagan:::Ronald:::3
4 100008:::Cheney:::Dick:::4
4 100009:::Gore:::Al:::4
4 100010:::Litt:::Brett:::5
4 100011:::Montgomery:::Alexander:::6
4 100012:::Mondale:::Walter:::4
4 100013:::Reagan:::Ronald:::3
Lines=14, Fields=56.
[nosql_slitt@mydesk 200704]$
|
Now that we've done some totalling, let's do some break logic. We'll
break every time a line's last name ($2) is alphabetically less than
the last name on the line before. At that point we'll print totals from
the last group (of ascending last names) and zero the group totals.
After the comparison, we'll set lastlname to
the current last name ($2) so next time we can compare current and
previous last names.
This program introduces not only break logic, but functions
(subroutines). We define a function called print_totals(),
whose purpose you can probably guess. It's called both from within the if statement at
the head of the main program, and from the END section.
Like all other break logic, when you test at the top, you must finalize
the final group after termination of the loop.
#!/usr/local/bin/mawk -We
BEGIN{
    fields=0
    lines=0
    breaklines=0
    breakfields=0
    prevlname=""    # Ctrl+A: Ensure no break on first line
}
function print_totals(){
    print "Group's Lines=" breaklines ", Group's Fields=" breakfields "."
    print ""
}
{
    if($2 < prevlname){
        print_totals()
        breaklines = 0
        breakfields = 0
    }
    lines++
    fields += NF
    prevlname=$2
    breaklines++
    breakfields += NF
    i = 1
    outrecord = NF " " $1
    while(++i <= NF){
        outrecord = outrecord ":::" $i
    }
    print outrecord
}
END{
    print_totals()
    print "Total Lines=" lines ", Total Fields=" fields "."
}
|
The preceding code prints out pretty much what you'd expect:
[nosql_slitt@mydesk 200704]$ cat people.table | ./hello.awk
4 person_id:::lname:::fname:::job_id
4 100001:::Strozzi:::Carlo:::1
4 100002:::Torvalds:::Linus:::1
Group's Lines=3, Group's Fields=12.

4 100003:::Stallman:::Richard:::1
Group's Lines=1, Group's Fields=4.

4 100004:::Litt:::Steve:::2
Group's Lines=1, Group's Fields=4.

4 100005:::Bush:::George:::3
4 100006:::Clinton:::Bill:::3
4 100007:::Reagan:::Ronald:::3
Group's Lines=3, Group's Fields=12.

4 100008:::Cheney:::Dick:::4
4 100009:::Gore:::Al:::4
4 100010:::Litt:::Brett:::5
4 100011:::Montgomery:::Alexander:::6
Group's Lines=4, Group's Fields=16.

4 100012:::Mondale:::Walter:::4
4 100013:::Reagan:::Ronald:::3
Group's Lines=2, Group's Fields=8.

Total Lines=14, Total Fields=56.
[nosql_slitt@mydesk 200704]$
|
What if you need to deal with two files? For instance, what if you want
to write a version of updtable
that works as a merge, for enhanced performance? You must read one of
the tables from a named file rather than from stdin.
The following is a trivial example of reading from an explicitly named
file, in this case junk.jnk:
#!/usr/local/bin/mawk -We
BEGIN{
    line = ""
    while((getline line < "junk.jnk") > 0){
        print line
    }
}
{ }
END{ }
|
The preceding code creates the following output, which is an exact copy
of the junk.jnk file:
[nosql_slitt@mydesk 200704]$ cat people.table | ./hello.awk
one a b c d
two a b c d
three a b c d
four a b c d
[nosql_slitt@mydesk 200704]$
|
Now let's use an awk array to implement a stack. We create
functions push
and pop,
and also a function called stacklook.
The stacklook
function lets you peek at the stack without popping it: stacklook(arr,0)
shows what the next pop would return, and stacklook(arr,-1)
shows the element that would be popped after that.
If you want to look at the stack first-in, first-out, use positive
numbers, with stacklook(arr,1)
showing the earliest pushed element.
The other interesting thing about this program is that it does nothing
to each line -- all code is in the BEGIN section.
Therefore, instead of piping in a file, you can simply echo a word into
the pipeline. Yet another interesting facet of this program is the introduction of local variables.
You declare a variable local to the subroutine by placing it in the
subroutine's paren-delimited argument list, after all the real
arguments. Specifically, variable x in function pop() is declared to be local, meaning that any x in the main program or other subroutines is not affected by changes in this local x.
#!/usr/local/bin/mawk -We
function push(arr, val){
    arr[++arr[0]] = val
}
function pop(arr,x){
    if(arr[0] < 1) return NULL
    x = arr[arr[0]]
    delete arr[arr[0]--]
    return x
}
function stacklook(arr, num){
    if(arr[0] < 1) return NULL    # stack spent
    if(num <= 0){
        num = arr[0] + num
        if(num < 1) return NULL
        return arr[num]
    } else {
        if(num > arr[0]) return NULL
        return arr[num]
    }
}
BEGIN{
    mystack[0] = 0
    push(mystack, "one")
    push(mystack, "two")
    push(mystack, "three")
    push(mystack, "four")
    push(mystack, "five")
    print "----------STACKLOOK-------------"
    for(i=5; i > -6; i--){
        print "stacklook(mystack, " i ")=" stacklook(mystack, i)
    }
    print "-------------POP----------------"
    while(str = pop(mystack)){
        print str
    }
}
END{ }
|
The preceding code produces the following output:
[slitt@mydesk 200704]$ echo hello | ./hello.awk
----------STACKLOOK-------------
stacklook(mystack, 5)=five
stacklook(mystack, 4)=four
stacklook(mystack, 3)=three
stacklook(mystack, 2)=two
stacklook(mystack, 1)=one
stacklook(mystack, 0)=five
stacklook(mystack, -1)=four
stacklook(mystack, -2)=three
stacklook(mystack, -3)=two
stacklook(mystack, -4)=one
stacklook(mystack, -5)=
-------------POP----------------
five
four
three
two
one
[slitt@mydesk 200704]$
|
Stacks are vital in any language, because they provide an iterative
alternative to recursion. You can use them to parse outlines, or parse markup
language that might be nested. There are probably a million ways to
implement a stack in awk, but I used an array whose element 0 contains
the number of elements, with the first element being element 1.
The preceding exercise also introduced arrays, which are VERY
important. Please be aware that you can actually sort arrays with gawk's
built-in asort() function (mawk has no equivalent), as sketched below.
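Here's a minimal sketch of asort() in action. It uses gawk rather than the mawk used in the rest of this article, since asort() is a gawk extension; the gawk path and the hardcoded names are just for illustration:

#!/usr/bin/gawk -f
# asort() is gawk-only; this sorts a few hardcoded last names.
BEGIN{
    split("Litt Strozzi Torvalds Stallman", lname)   # build a test array
    n = asort(lname)                                 # sort the values in place
    for(i = 1; i <= n; i++) print lname[i]           # prints them alphabetically
}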
WARNING
Awk makes your life "easier" by automatically considering any file on
its command line to be an input file. It iterates completely through
the first file, then the second, etc. Usually that's a timesaving
feature, but sometimes it's a pain. If you want stdin to be one of the
files, you need to put it on the command line as a solitary dash (-).
WARNING
First an ugly workaround will be demonstrated, and then a "nicer"
workaround. Let's start ugly, just to demonstrate the principles...
|
#!/usr/local/bin/mawk -We
{print $0}
|
The preceding very simple awk script, which simply prints each line of
input, does the following when given two files on the command line:
[slitt@mydesk 200704]$ cat /etc/fstab | ./hello.awk people.table edit.table
person_id lname fname job_id
100001 Strozzi Carlo 1
100002 Torvalds Linus 1
100003 Stallman Richard 1
100004 Litt Steve 2
100005 Bush George 3
100006 Clinton Bill 3
100007 Reagan Ronald 3
100008 Cheney Dick 4
100009 Gore Al 4
100010 Litt Brett 5
100011 Montgomery Alexander 6
100012 Mondale Walter 4
100013 Reagan Ronald 3
person_id lname fname job_id
100000 Bush GeorgeHW 4
100004 Litt Steve 1
100011 ..DEL..
100018 Lincoln Abraham 3
100019 Washington George 3
100020 Franklin Ben 1
[slitt@mydesk 200704]$
|
In the preceding command, hello.awk ignored stdin, processed
people.table in its entirety, and then went on to process all of
edit.table. Sometimes that's what you want, and sometimes it's not.
Sometimes you want something more like this:
#!/usr/local/bin/mawk -We
BEGIN{
    line = ""
    while((getline line < edittable) > 0){
        print line
    }
}
{print $0}
|
In the preceding, you anticipate receipt of the edittable
variable from the command line. The following is how you call it, and
what results:
[slitt@mydesk 200704]$ cat people.table | awk -v edittable=edit.table -f ./hello.awk
person_id lname fname job_id
100000 Bush GeorgeHW 4
100004 Litt Steve 1
100011 ..DEL..
100018 Lincoln Abraham 3
100019 Washington George 3
100020 Franklin Ben 1
person_id lname fname job_id
100001 Strozzi Carlo 1
100002 Torvalds Linus 1
100003 Stallman Richard 1
100004 Litt Steve 2
100005 Bush George 3
100006 Clinton Bill 3
100007 Reagan Ronald 3
100008 Cheney Dick 4
100009 Gore Al 4
100010 Litt Brett 5
100011 Montgomery Alexander 6
100012 Mondale Walter 4
100013 Reagan Ronald 3
[slitt@mydesk 200704]$
|
In the preceding, you actually invoked awk yourself so you could name
the variable, and because you invoked awk yourself, you needed to name
a source file with the -f
argument. The result was that the info on the command line was printed
first, and stdin followed. You could have reversed that by printing edittable in
the END
section. Or, you could have printed records from the edittable file
interspersed with records from stdin. You know, like you'd do in a
merge.
I don't know about you, but I find that really ugly. Here's
a better way:
#!/usr/local/bin/mawk -We
BEGIN{
    line = ""
    editfile = ARGV[2]
    ARGC = 1
    while((getline line < editfile) > 0){
        print line
    }
}
{print $0}
|
The preceding is still ugly but it's not nearly as inconvenient. Your
first step is to capture the value of the second command line argument
as variable editfile,
and then remove that as a command line argument by setting ARGC to 1. The
command to invoke this is cleaner:
[slitt@mydesk 200704]$ cat people.table | ./hello.awk - edit.table
person_id lname fname job_id
100000 Bush GeorgeHW 4
100004 Litt Steve 1
100011 ..DEL..
100018 Lincoln Abraham 3
100019 Washington George 3
100020 Franklin Ben 1
person_id lname fname job_id
100001 Strozzi Carlo 1
100002 Torvalds Linus 1
100003 Stallman Richard 1
100004 Litt Steve 2
100005 Bush George 3
100006 Clinton Bill 3
100007 Reagan Ronald 3
100008 Cheney Dick 4
100009 Gore Al 4
100010 Litt Brett 5
100011 Montgomery Alexander 6
100012 Mondale Walter 4
100013 Reagan Ronald 3
[slitt@mydesk 200704]$
|
There are probably better ways to do this, but for now this is good
enough.
Now we're ready to create a merge type version of updtable that
makes several assumptions:
- Both tables are sorted
- Both tables are properly formatted NoSQL tables
- Both tables have identical structures
- The primary key has only one field
- All primary key values are the same length
These are the data and edit tables:
[slitt@mydesk 200704]$ justify < people.table
person_id lname      fname     job_id
--------- ---------- --------- ------
100001    Strozzi    Carlo     1
100002    Torvalds   Linus     1
100003    Stallman   Richard   1
100004    Litt       Steve     2
100005    Bush       George    3
100006    Clinton    Bill      3
100007    Reagan     Ronald    3
100008    Cheney     Dick      4
100009    Gore       Al        4
100010    Litt       Brett     5
100011    Montgomery Alexander 6
100012    Mondale    Walter    4
100013    Reagan     Ronald    3
[slitt@mydesk 200704]$
|
[slitt@mydesk 200704]$ justify < edit.table
person_id lname      fname     job_id
--------- ---------- --------- ------
100000    Bush       George HW 4
100002    Torvalds   Dr. Linus 1
100004    Litt       Steve     1
100011    ..DEL..
100018    Lincoln    Abraham   3
100019    Washington George    3
100020    Franklin   Ben       1
[slitt@mydesk 200704]$
|
This is the source code:
#!/usr/local/bin/mawk -We
# *********************************************************************
# updmerge: inserts/updates/deletes table rows based on the contents
# of an edit table. This is a fast, one pass process that assumes
# the following:
# 1) Both files are sorted by primary key
# 2) All primary key values are the same length
# 3) Single segment primary keys
# 4) Both tables have identical structures
#
# Copyright (C) 2007 by Steve Litt
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; version 2 dated June, 1991.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
#
# *********************************************************************
# $Id: updmerge,v 0.01 pre-alpha, 2007/03/30 09:02:23 steve Exp $

BEGIN{
    ### HANDLE ARGUMENTS, DEFEAT DEFAULT USE OF ARGS AS INPUT FILES ###
    editfilename = ARGV[2]
    ARGC = 1

    ### FIELD SEPARATOR SHOULD BE TAB ###
    FS="\t"

    ### SET SOME OTHER VARS ###
    record_changed_or_deleted_flag = 0
    high_value_string="\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF"

    ### BLOW OFF EDIT FILE'S HEADER, READ ITS FIRST DATA RECORD ###
    ### TO ENABLE THE MERGE ALGORITHM ###
    numchars = getline editline < editfilename  # Blow off edit file's header
    if(numchars == 0){
        print "Empty edit file (" editfilename "), aborting." > "/dev/stderr"
        exit 1
    }

    editline = ""
    numchars = getline editline < editfilename  # Get first edit line
    if(numchars == 0){
        print "No data in edit file (" editfilename "), aborting." > "/dev/stderr"
        exit 1
    }

    ### SPLIT editline INTO FIELDS FOR COMPARISON WITH MAIN FILE RECORD ###
    split(editline, editarray, "\t")
}

### MAIN PROGRAM ### NOT USED FOR ANYTHING IMPORTANT RIGHT NOW
{
    record_changed_or_deleted_flag = 0
}

### RULES
NR == 1{                 # Transfer main header intact, always
    print $0
    next
}
$1 < editarray[1]{       # If main record smaller than edit, print main
    print $0
    next
}

### If the edit record's key is less than the main record's, the edit
### record is an add; if the keys are equal, it's a change or delete.
### Iterate through all lower or equal edit records, not quitting
### til you have an edit record bigger than the main record.
$1 >= editarray[1]{
    while($1 >= editarray[1]){
        if($1 > editarray[1]){
            print editline
        }else {
            record_changed_or_deleted_flag = 1
            delflag = 0                  # reset the delete flag for this record

            ### LOOK FOR DEL AND XFER DATA TO OUTPUT
            for(i=1; i <= NF; i++){
                outarray[i] = $i
            }
            for(i=2; i <= NF; i++){
                if(i in editarray){
                    if(editarray[i] ~ /^\.\.DEL\.\./){
                        delflag = 1
                        break
                    }
                    outarray[i] = editarray[i]
                }
            }
            if(!delflag) {
                ### ASSEMBLE OUTPUT LINE
                output = outarray[1]
                for(i=2; i <= NF; i++){
                    output = output "\t" outarray[i]
                }
                ### PRINT OUTPUT LINE
                print output
            }
        }

        ### GET NEXT EDIT RECORD ###
        numchars = getline editline < editfilename
        if(numchars == 0){
            editline = high_value_string
            editarray[1]=high_value_string
        } else {
            split(editline, editarray)
        }
    }
}

### GUARANTEE PRINT FOR ANY MAIN RECORD THAT FELL THROUGH THE RULE ABOVE ###
### BUT ONLY AFTER ANY LOWER EDIT RECORDS HAVE BEEN PRINTED ###
### AND ONLY IF THE RECORD HASN'T BEEN CHANGED OR DELETED ###
NR >= 2{
    if(!record_changed_or_deleted_flag){print $0}
}

END{
    # INSERT ANY EDIT RECORDS ABOVE HIGHEST MAIN RECORD
    while(editarray[1] != high_value_string){
        ### DON'T PRINT DELETES ###
        delflag=0
        for(i=2; i <= NF; i++){
            if(i in editarray){
                if(editarray[i] ~ /^\.\.DEL\.\./){
                    delflag = 1
                    break
                }
            }
        }

        if(!delflag) print editline

        ### GET NEXT EDIT TABLE RECORD ###
        numchars = getline editline < editfilename
        if(numchars == 0){
            editline = high_value_string
            editarray[1]=high_value_string
        } else {
            split(editline, editarray)
        }
    }
}
|
This
is your plain vanilla, DP101 file merge program. It works the
way
all merges work. Before the main loop begins, it blows off the edit
table's header, loads the edit table's first data row, and calculates
the edit table's primary key field.
The basic algorithm is to compare the current data and edit records on
every data record increment. If the data record is smaller, it's
written. If the edit record is smaller or equal, it's added (if
smaller) or changed or deleted (if equal), and then another edit record
is retrieved. This edit record retrieval loop continues until the edit
record is larger than the data record. When the edit record hits EOF,
high value is placed in the edit record so that any remaining data
records will be lower and will be written.
To accommodate cases where the final edit records are bigger than
the final data record, the END section adds any remaining edit records.
The awk language makes this algorithm easier to write, with much less
nesting than in other languages. As mentioned many times, the main data
record loop is implied -- the programmer needn't write it. Much of the
logic is implemented as "rules" that only happen when certain
conditions happen, and many of those rules call the next
operator to prevent execution of anything farther down. While this is a
violation of structural programming, these "weed out" rules result in
much less nesting and a much more readable program. |
The preceding code, run on the pre-sorted people.table and edit.table
above it, produces the following output:
[slitt@mydesk 200704]$ cat people.table | ./updmerge.awk - edit.table | justify
person_id lname      fname     job_id
--------- ---------- --------- ------
100000    Bush       George HW 4
100001    Strozzi    Carlo     1
100002    Torvalds   Dr. Linus 1
100003    Stallman   Richard   1
100004    Litt       Steve     1
100005    Bush       George    3
100006    Clinton    Bill      3
100007    Reagan     Ronald    3
100008    Cheney     Dick      4
100009    Gore       Al        4
100010    Litt       Brett     5
100012    Mondale    Walter    4
100013    Reagan     Ronald    3
100018    Lincoln    Abraham   3
100019    Washington George    3
100020    Franklin   Ben       1
[slitt@mydesk 200704]$
|
As you can see, George HW Bush was inserted at id
100000,
Linus Torvalds' first name was changed to "Dr. Linus", Steve
Litt's job id was changed to 1, Alexander Montgomery was deleted, and
Lincoln, Washington and Franklin were added at the end.
One impediment to writing large, complex awk programs is the lack of
local variables. Well, not exactly a lack, but more like a hassle.
The only way to make a local variable in awk is to put it in the
parentheses of a function after the function's arguments. There's no
difference between a function's arguments and its local variables
except how they're used:
function fixthis(arg1,arg2,local1,local2)
The following is a program to demonstrate what's local and what's
global within an awk program:
#!/usr/local/bin/mawk -We
BEGIN{
    bingo="begin"
    print "In begin, bingo=" bingo "."
}
NR==1{
    bingo=$1
    print "top norule, bingo=" bingo "."
    whatever()
    print "after whatever, bingo=" bingo "."
    whatever_with_locals()
    print "after whatever_with_locals, bingo=" bingo "."
    changearg(bingo)
    print "after changearg, bingo=" bingo "."
    bingo = changearg(bingo)
    print "after using changearg return, bingo=" bingo "."
    changearg_with_locals(bingo)
    print "after changearg_with_locals, bingo=" bingo "."
    print "bottom norule, bingo=" bingo "."
}
END{
    bingo="end"
    print "In end, bingo=" bingo "."
}

function whatever(){
    print "Top whatever, bingo=" bingo "."
    bingo="changed by whatever"
    print "Bottom whatever, bingo=" bingo "."
}

function whatever_with_locals(bingo){
    print "Top whatever_with_locals, bingo=" bingo "."
    bingo="changed by whatever_with_locals"
    print "Bottom whatever_with_locals, bingo=" bingo "."
}

function changearg(arg){
    print "Top changearg, bingo=" bingo "."
    arg="changed by changearg"
    print "Bottom changarg, bingo=" bingo "."
    return arg
}

function changearg_with_locals(arg,bingo){
    print "Top changearg_with_locals, bingo=" bingo "."
    arg="changed by changearg_with_locals"
    print "Bottom changarg, bingo=" bingo "."
    return arg
}
|
The preceding code produces the following output.
[slitt@mydesk 200704]$ cat people.table | ./test.awk
In begin, bingo=begin.
top norule, bingo=person_id.
Top whatever, bingo=person_id.
Bottom whatever, bingo=changed by whatever.
after whatever, bingo=changed by whatever.
Top whatever_with_locals, bingo=.
Bottom whatever_with_locals, bingo=changed by whatever_with_locals.
after whatever_with_locals, bingo=changed by whatever.
Top changearg, bingo=changed by whatever.
Bottom changarg, bingo=changed by whatever.
after changearg, bingo=changed by whatever.
Top changearg, bingo=changed by whatever.
Bottom changarg, bingo=changed by whatever.
after using changearg return, bingo=changed by changearg.
Top changearg_with_locals, bingo=.
Bottom changarg, bingo=.
after changearg_with_locals, bingo=changed by changearg.
bottom norule, bingo=changed by changearg.
In end, bingo=end.
[slitt@mydesk 200704]$
|
As
you can see in the preceding output, every time variable bingo
is changed by name, it's changed throughout the program, except when
it's declared a local (by being included between the function's
parentheses) in function changearg_with_locals.
As you start to perform more complex operations with awk, requiring
larger programs, take special care to use local variables in functions,
thereby avoiding namespace collisions.
RAD on Linux
By Steve Litt
I need your help creating a RAD toolkit that uses shellscripts to glue
together a few RAD tools. What I envision is something enabling a
semi-decent programmer, familiar with the toolkit, to create a five
table app in a single day.
Initially I see it working with a NoSQL data back end, because that's
the easiest. But it would be simple to adapt it to MySQL or Postgres.
Initially I see it using a Curses user interface, because that can be
used on the widest variety of systems, and it's simple. However, it
would be easy to adapt it to a TK environment, and even a browser
environment, always assuming one adds a state preservation module to
the toolkit.
Apps written with the toolkit I envision would have a special user interface --
the default application user interface from Clarion 2.1. I've seen
hundreds of people pound on Clarion-created programs, and can tell you
they're intuitive and fast.
I've written about it since the last century. Clarion was the best RAD (Rapid Application Development)
toolset I've ever used, and I'd like something with similar power for
Linux. I've created some rudimentary RAD tools in PHP, Perl/Tk and
Curses, but have never been able to put them together into a complete
RAD toolkit.
It was during the Curses attempt that I first seriously considered
using shellscripts to glue together generic form, field and picklist
executables into an application -- RAD by shellscript.
Unfortunately, I couldn't figure out interexecutable communication, and
went on to other things.
A couple weeks ago I mysteriously became part of the NoSQL mailing
list. Several years ago my friend Noel Henson had explained the RAD
aspects of his sequential file relational database, /rdb, and how there
was a free implementation called NoSQL. So when I fortuitously became
part of the NoSQL mailing list, I investigated.
NoSQL was inspirational -- a database made of a few executables whose
glue language was shellscripts. Here was a proof of concept of my
shellscript glue idea. I started thinking again, and came up with two
ways to enable interprocess communication in my new RAD. I now believe
the whole thing's doable, although the generic form executable will be
challenging. I could use some help developing this thing.
If you're considering helping, keep in mind that although my first
version will use NoSQL as its data back end, it would be easy
to adapt it to MySQL or Postgres.
Our Project Needs a Few Good Men and Women
We live in an era of spectacular web development environments -- Zope
and Rails to name a couple. But sometimes, when confronted with a small
task and little time, don't you wish for the days when you could
quickly code an app that hits a local database, and have it done in an
afternoon?
How would you like a toolkit enabling you to write a few shellscripts,
deploying a full featured data application? When I say shellscripts, I
don't mean the 300 line variety -- for picklists I mean the 20 line
variety, and for forms I mean the 50 line variety. Using a NoSQL
database, the database can be created in Vim (or Emacs).
If this sounds good to you, email me.
Life After
Windows: Retro-Grouch, or Revolutionary
Some have greeted my enthusiasm for NoSQL with scorn. NoSQL's
sequential file database is a 1970's concept. Detractors imply that I'm
a
retro-grouch. The modern thing is a transaction based, SQL queriable
DataBase Management System (DBMS). Am I a retro-grouch?
Yes. I am.
First maybe we should define retro-grouch. Mention the phrase, and many
think of an old guy saying "In my day, Sonny, things were better!" The
phrase conjures an image of someone believing change is universally bad.
That's a bad definition. First, very few people believe all change is
bad. Second of all, such people have real problems. So let me give you
my operational definition of a retro-grouch:
Retro-grouch
A retro-grouch is a person who
hates to see good things get worse or disappear.
|
So let's examine Steve Litt as a retro-grouch. I don't hate modern
things. I don't know how I ever got along without Google, Wikipedia,
Paypal and the Internet in general. I would never want to go back to
the days of $4500.00 486's with 8Meg of RAM. Rails is cool, and you'd
have to pry Ruby out of my cold, dead hand. Don't even think of telling
me to give up VimOutliner and go back to Grandview, or give up LyX and
go back to MS Word. And naturally, I'd never want to go back to the
pre-free-software world where I spent over a thousand bucks a year on
software, and much of it was junk that ended up shelfware.
But some changes aren't so good.
- By far the best graphics program I ever used was Micrografx
Windows Draw. For the past several years it's been unsupported, and from
what I hear it doesn't run well on XP or Vista. It doesn't run on Linux
at all, unless there's a Wine or Crossover Office tweak.
- In the golden age of outliners (late 1980's and early
1990's) you
had your choice of many spectacular outliners, including Grandview,
More, Thinktank and Ecco Pro. As far as I know, they've all ceased
development. Luckily, a new outliner called VimOutliner has risen to
fill some of the void.
- By far the best application development environment I ever
saw
was the circa-1990 Clarion 2.1 (Clarion 4 was good too, and for all I
know they're
still making good stuff). Once familiar with Clarion, you could whip
out a production-ready, five table app in a day.
Let me repeat: a retro-grouch is a person who hates to see good things
get worse or disappear. Nice, simple sequential data files that can be
accessed relationally have disappeared. I want them back.
I'm not saying web apps aren't good, or that Rails isn't a spectacular
way to make a good, sturdy multiuser application. I'm just saying
there's room for the elegant simplicity of sequential file relational
data stores. Like NoSQL.
Overcoming NoSQL's Shortcomings
Here are a few shortcomings of NoSQL:
- No transactions
- No row level locking
- Users who can read or update the data can access the raw
data
- Simultaneous writes are a problem
- Writing large tables can be slow
- No scalability
- No triggers and stored procedures
No transactions
This can be a problem in a heavily used database. Nobody wants only
part of a data operation to succeed. By supporting transactions, DBMS's
give us a way to guarantee that data updates are all or nothing.
You'll never remove money from one account without putting it in the
other.
Then again, I'm sure you've seen plenty of programs that fail to take
advantage of the underlying database transaction support. Transactions require a constant discipline
on the part of the developer, and often transactions are
overlooked to beat a deadline.
Sometimes transactions just aren't that important. When they're not,
NoSQL is as good as anything.
No row level locking
That's not exactly accurate. The real scoop is that NoSQL has no built-in row level
locking. However, it can be built into the application by placing a locked column
in every table. Then row level locking can be achieved like this:
- Make the table read-only
- Set the row's locked
column to the user number or another identifiable number
- Make the table read-write
- Read the row
- Change the row
- Write the row
- Make the table read-only
- Set the row's locked
column to 0
- Make the table read-write
As long as every program in every application accessing the table obeys
these rules, and refuses to write or read for write a locked row, you
have row level locking, although it's not enforced by the "database". I
think it would be pretty easy to make a read_for_write
script to check the column for lock, and if not locked, lock and read.
Likewise, a write.sh
script checks that the row is locked by you, and then writes the data
and unlocks the row. Such scripts would make it easy to enforce row
level locking throughout.
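Here's a minimal sketch of such a read_for_write script. It's an illustration, not a NoSQL utility; the table layout (primary key in the first column, the locked column last) and the chmod-based read-only/read-write switching are assumptions:

#!/bin/sh
# read_for_write.sh -- illustration only, not part of NoSQL.
# Usage: read_for_write.sh table key my_id
# Assumes: first column is the primary key, last column is "locked".
table="$1"; key="$2"; me="$3"

chmod a-w "$table"    # make the table read-only while we grab the lock

# refuse if somebody else already holds the lock on this row
if [ -n "$(awk -F'\t' -v k="$key" '$1 == k && $NF != 0' "$table")" ]; then
    chmod a+w "$table"
    echo "row $key is already locked" >&2
    exit 1
fi

# set the row's locked column to our id
chmod u+w "$table"
awk -F'\t' -v OFS='\t' -v k="$key" -v me="$me" \
    '$1 == k { $NF = me } { print }' "$table" > "$table.tmp"
mv "$table.tmp" "$table"
chmod a+w "$table"    # make the table read-write again

# hand the locked row back to the caller
awk -F'\t' -v k="$key" '$1 == k' "$table"

A matching write.sh would do the reverse: check that the locked column holds your id, write the changed row, and zero the column again.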
Users who can read or update the data can access the raw data
If you're using NoSQL to back end a web app, under normal circumstances
the http user must access the NoSQL data, and therefore must have read
and possibly write access to the raw NoSQL files. In an intranet
situation, that might be OK if you trust your users. In an Internet
facing application, that's no good at all.
If this is an issue, you can have the raw data readable and writeable
only by, let's say, user nosql,
and have a daemon running as user nosql
that takes commands from user http.
Now someone coming in off the web can't hit the raw data. For an extra
measure of security, the daemon could check incoming commands for
length and bad data, to prevent the equivalent of SQL injection.
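As a rough sketch of that idea (the paths, the fifo, and the one-word command whitelist are all made up for illustration; this is not a NoSQL facility):

#!/bin/sh
# nosqld.sh -- sketch of a daemon run as user nosql.  User http writes a
# whitelisted command into the fifo and reads the result file; it never
# touches the tables directly.
FIFO=/var/spool/nosql/commands
OUT=/var/spool/nosql/results

[ -p "$FIFO" ] || mkfifo "$FIFO"

while true; do
    # read blocks until user http echoes a command into the fifo
    read cmd < "$FIFO"
    case "$cmd" in
        visionaries)
            # canned query: everyone whose job_id is 1
            awk -F'\t' '$4 == 1' people.table > "$OUT"
            ;;
        *)
            # reject anything not on the whitelist -- no injection possible
            echo "unknown command: $cmd" > "$OUT"
            ;;
    esac
done

User http would then request data with something like echo visionaries > /var/spool/nosql/commands and read the results file.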
Simultaneous writes are a problem
A requirement for lots of simultaneous writes might preclude NoSQL.
Personally, I wouldn't put it in an airline reservation system.
On the other hand, if the writes are all adds, and if the table has a
single segment key, and if it's acceptable for new data to be
unavailable for hours or for a day, no problem. Have all adds appended
to the bottom of a journaling table, incrementing the preceding key
(and probably locking the table for an instant so two people don't grab
the same key).
Have a cron job that does the following (a sketch follows the list):
- Rename the journaling table to a temporary file
- Make an empty new journaling file
- Run updtable on the main table and the temporary journaling
table
- Reindex all indices on the main table
- Delete the temporary journaling table
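Here's a rough sketch of that cron job. The file names are made up, updmerge.awk (developed in the awk article earlier in this issue) stands in for updtable, a new journal is assumed to need the same header row as the main table, and the reindex step is only a comment because the exact command depends on your NoSQL installation:

#!/bin/sh
# rolljournal.sh -- sketch of the cron job described in the list above.
set -e
cd /var/lib/nosql/myapp

mv journal.table journal.tmp           # 1) rename the journaling table
head -n 1 main.table > journal.table   # 2) new, empty journal with same header
cat main.table | ./updmerge.awk - journal.tmp > main.new   # 3) merge into main
mv main.new main.table
# 4) reindex the main table here -- the exact command depends on your
#    NoSQL installation, so it's omitted from this sketch
rm journal.tmp                         # 5) delete the temporary journal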
Under those conditions, you can have 50 data entry people hammering new
data while 50 other people are using the main data.
If you really want, you can have read operations query both the main
table and the journaling table. Because the journaling table is
(presumably) small, searching it is fast. The main table can be
searched by key or index extremely quickly. Yes, it's a little bit of a
hassle, but big tables are a hassle in any situation.
Writing large tables can be slow
Yep. As tables start exceeding 50,000 to 100,000 rows, many write
operations can be slow. Every update requires reading and rewriting
the whole table. Once again you can use a journaling table in which to
write adds, changes and deletes. If that data needs to be instantly
accessible, make all your "read" shellscripts indexed or key read the
main table, and then sequentially read the journaling table. All adds,
changes and deletes go to the journaling table, which in turn updates
the main table every so often via a cron job.
No scalability
Oh fiddlesticks! Who needs scalability when the original app took only
a couple of days to create? So you lose a couple of days' work -- so what?
Think of this app as a prototype.
And speaking of prototypes, one of the best ways to find out what's really needed is to
code a functional prototype, have everyone hammer on it, and find
out the desired improvements.
It's been my experience that this type of prototype actually saves many
multiples of its development time when the "real" app goes into
development.
It's trivial to write a Ruby program to turn one NoSQL table, or a group
of them, into SQL statements suitable for loading your new database, even
if the new database has different column names in a different order.
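As a rough illustration of the idea (in awk rather than Ruby, to stay with this article's toolset), the following sketch turns a NoSQL table on stdin into INSERT statements. The target table name "people" and the naive quoting are assumptions; real data containing quote characters would need escaping:

#!/usr/local/bin/mawk -We
# tosql.awk -- rough sketch: turn a NoSQL table on stdin into SQL INSERTs.
BEGIN{ FS = "\t" }
NR == 1{
    for(i = 1; i <= NF; i++) col[i] = $i    # remember the column names
    next
}
{
    out = "INSERT INTO people (" col[1]
    for(i = 2; i <= NF; i++) out = out ", " col[i]
    out = out ") VALUES ('" $1 "'"
    for(i = 2; i <= NF; i++) out = out ", '" $i "'"
    print out ");"
}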
No triggers and stored procedures
Oh, the scandal: No triggers and stored procedures!
So do the work in the app itself. Or, if you really want triggers and
stored procedures, make shellscripts to simulate the desired triggers
and stored procedures, and access the NoSQL data through those
shellscripts. Personally I think that's too much trouble, but it can
easily be done.
Why NoSQL is Cool
NoSQL is retro to say the least. Its sequential storage of data harkens
back to the 1970's. It's also elegantly simple and transparent. I could
have written NoSQL in 1985 if someone had explained it in English, and
assuming my operating system had grep, awk, and a highly efficient sort
like Linux's sort command.
It's trivially easy to edit NoSQL data in a text editor. Just try that
in MySQL, Postgres or (gulp) MS SQL Server. Editing data in an editor
is sometimes a handy time saver, and sometimes an essential recovery
tool. Speaking of recovery, because it's so human readable, with NoSQL
you have a better chance of recovering from corruption. You can back up
NoSQL data in its native format. Its native
format is about as compressed as a text-only format can get.
You can write a quick and dirty report or data request using grep,
sort, head, and tail. I mean less than 20 lines of shellscript. You can
even join two tables with a couple sorts and a simple awk script.
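For instance, here's a sketch of a join done with a tiny awk script. It assumes a hypothetical jobs.table whose first two tab-separated columns are job_id and job_name, and it loads that small table into an array instead of sorting, which is even simpler than the sort-based approach:

#!/usr/local/bin/mawk -We
# jointables.awk -- rough sketch: append job_name to each people.table row.
BEGIN{
    FS = "\t"; OFS = "\t"
    while((getline line < "jobs.table") > 0){
        split(line, f, "\t")
        jobname[f[1]] = f[2]                # index job names by job_id
    }
}
NR == 1{ print $0, "job_name"; next }       # extend the header row
{ print $0, jobname[$4] }                   # append the matching job name

Running cat people.table | ./jointables.awk would then print each person with the job name tacked onto the end.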
Not that anything in the preceding paragraph is necessary -- NoSQL
comes with fully coded, heavily tested utilities to do all these
things. The point of the preceding paragraph is this: If for some
reason you can't use the NoSQL utilities -- maybe they're not
installed, maybe you're trying something they can't do efficiently --
you can easily get it done with Unix utilities. If, 20 years from now,
you have NoSQL data and for some reason the NoSQL project is no longer
developing, and NoSQL is no longer installable on modern operating
systems, if you find a system with sort, head, tail, grep and awk, you
can continue working with the data.
NoSQL utilities are built to be used in shellscripts, with data passed
from utility to utility by piping. This fits excellently with the
convenience and modularity of a shellscript based Rapid Application
Development environment.
Last but not least, NoSQL is very fast and efficient, because the Unix
utilities it uses -- head, tail, grep, sort, and awk -- are mature,
tested, and very
efficient.
Key and index based reads use binary search. Reads on nonsorted,
nonindexed fields in large tables can be made lightning fast by
pre-selecting with speed-of-light grep:
grep "Rodriguez" < people.table | getrow -N "last_name==\"Rodriguez\""
Or, in its more generic form:
grep "$1" < people.table | getrow -N "last_name==\"$1\""
Will that be as fast as a "real DBMS" with the last name field indexed?
Probably not, but probably close. Will it be as fast as a "real DBMS"
with that field not indexed? I'd imagine much faster. The grep command
is fast!
NoSQL is cool. It isn't the tool for every job, but it's cool. The only
trouble is, it's not too hip to support twenty-year-old technology. Or
is it?...
Futurist vs. Retro-grouch
Newer is better, right?
There are people whose job description is "Futurist". I suppose the
dictionary definition of a Futurist would be something like "one whose
job is to forecast future trends and inform others how to deal with
them." But behind that denotation is a flood of connotation:
- This is how it's going to be, so here's how to use it to
your advantage.
- This is how it's going to be, so here's how to protect
yourself.
- Deal with change.
- Embrace change.
- Don't fight change.
The ultimate realism -- the very measure of maturity.
And it's a cop out!
Don't fight change? Imagine a circa 1930 European Futurist proclaiming
that a wave of national socialism would sweep Germany and later Europe.
Imagine the 1930 Futurist advising all who would listen, to prepare for the change by joining the national socialists. After
all, change is neither good nor bad, it just is, so deal with it!
Imagine what the world would look like today if everyone had followed
such advice. Luckily, retro-grouches
longing for the "antiquated" principles of Democracy and even Communism
fought as partisans and armies throughout Europe to resist the change,
so that I can write this free-speech article today, and you can read
it. World War II was a pretty close contest, and I can't help thinking
those advocates of the quaint concept of Democracy swung
the war.
To me, Futurists are spectators of the future. They're the
money-changers in the temple of trends. Futurists may predict the future,
but revolutionaries and retro-grouches shape the future.
Here's another story of a guy who you might call a retro-grouch, or you
might call a revolutionary, and either way I think you'd be right. By
August 1991 Microsoft Windows was the new thing, after conquering the
world with the incredible Windows 3.0. On August
26 1991, Linus Torvalds advertised for help coding his as-yet unnamed
operating system.
His unnamed operating system was to be a clone of Minix, which itself was a
subset of Unix, an operating system first developed in 1969. Unix had ruled
the minicomputer world by the late 1970's, but by 1991 it was losing
market share to decentralized apps written for the IBM PC clones. In
other words, Unix was the old thing, Windows was the new thing, and
Linus Torvalds was writing a free software version of the old thing.
As we all know now, in every technical regard, the old thing was much
better than the new thing. Torvalds' operating system changed the
world. He didn't predict
history, he shaped
it.
Torvalds could also be considered a revolutionary, having rebelled
against the crippled, academic Minix. He led the masses away from the
older Minix into the promised land of Linux.
Personally, I was a revolutionary when I
created the Universal Troubleshooting
Process that ultimately changed the way the world
troubleshoots. On the
other hand, I was a retro-grouch when I originated
VimOutliner as a substitute for the abandoned outliners of the golden
age of outliners.
It bears repeating. The retro-grouch and the revolutionary shape the
future, while the futurist talks about the future. It also bears
repeating: If the old way was insufficient, bring on something new. If
the old way was better, bring back the old way. If the old way was better for some situations, keep it around for those situations.
There's a little Futurist voice chattering in all of our minds, telling
us to throw away the old. Politely thank that voice, and then choose
based on quality and utility, not on newness. If you need a powerful
and
simple database to get something done today,
consider NoSQL.
Life After Windows is a regular Linux
Productivity Magazine
column,
by Steve Litt, bringing you observations and tips subsequent to
Troubleshooters.Com's
Windows to Linux conversion.
GNU/Linux, open
source and free software
By Steve Litt
Linux is a kernel. The operating system often described as "Linux" is
that
kernel combined with software from many different sources. One of the
most
prominent, and oldest of those sources, is the GNU project.
"GNU/Linux" is probably the most accurate moniker one can
give to this
operating system. Please be aware that in all of
Troubleshooters.Com,
when I say "Linux" I really mean "GNU/Linux". I completely believe that
without
the GNU project, without the GNU Manifesto and the GNU/GPL license it
spawned,
the operating system the press calls "Linux" never would have happened.
I'm part of the press and there are times when it's easier to
say "Linux"
than explain to certain audiences that "GNU/Linux" is the same as what
the
press calls "Linux". So I abbreviate. Additionally, I abbreviate in the
same
way one might abbreviate the name of a multi-partner law firm. But make
no
mistake about it. In any article in Troubleshooting Professional
Magazine,
in the whole of Troubleshooters.Com, and even in the technical books I
write,
when I say "Linux", I mean "GNU/Linux".
There are those who think FSF is making too big a deal of this. Nothing
could be farther from the truth. Richard Stallman's GNU Manifesto, and
the GNU General Public License that
resulted from it,
are the only reason we can enjoy this wonderful alternative to
proprietary
operating systems, and the only reason proprietary operating systems
aren't
even more flaky than they are now.
For practical purposes, the license requirements of "free software" and
"open
source" are almost identical. Generally speaking, a license that
complies
with one complies with the other. The difference between these two is a
difference
in philosophy. The "free software" crowd believes the most important
aspect
is freedom. The "open source" crowd believes the most important aspect
is
the practical marketplace advantage that freedom produces.
I think they're both right. I wouldn't use the software without the
freedom
guaranteeing me the right to improve the software, and the guarantee
that
my improvements will not later be withheld from me. Freedom is
essential.
And so are the practical benefits. Because tens of thousands of
programmers
feel the way I do, huge amounts of free software/open source is
available,
and its quality exceeds that of most proprietary software.
In summary, I use the terms "Linux" and "GNU/Linux" interchangeably,
with
the former being an abbreviation for the latter. I usually use the
terms "free
software" and "open source" interchangably, as from a licensing
perspective
they're very similar. Occasionally I'll prefer one or the other
depending
on whether I'm writing about freedom or business advantage.
Steve Litt has used GNU/Linux since 1998, and written about
it since 1999. Steve can be reached at his email address.
Letters
to the Editor
All letters become the property of the publisher (Steve
Litt), and
may
be edited for clarity or brevity. We especially welcome
additions,
clarifications,
corrections or flames from vendors whose products have been reviewed in
this
magazine. We reserve the right to not publish letters we
deem in
bad
taste (bad language, obscenity, hate, lewd, violence, etc.).
Submit letters to the editor to Steve Litt's email address,
and be
sure
the subject reads "Letter to the Editor". We regret that we cannot
return
your letter, so please make a copy of it for future reference.
How to
Submit an Article
We anticipate two to five articles per issue.
We look for articles that pertain to the GNU/Linux or open source. This
can
be done as an essay, with humor, with a case study, or some other
literary
device. A Troubleshooting poem would be nice. Submissions may mention a
specific
product, but must be useful without the purchase of that product.
Content
must greatly overpower advertising. Submissions should be between 250
and
2000 words long.
Any article submitted to Linux Productivity Magazine must be
licensed
with the Open Publication License, which you can view at
http://opencontent.org/openpub/.
At your option you may elect to prohibit substantive
modifications.
However, in order to publish your article in Linux Productivity
Magazine,
you must decline the option to prohibit commercial use, because Linux
Productivity
Magazine is a commercial publication.
Obviously, you must be the copyright holder and must be
legally able
to
so license the article. We do not currently pay for articles.
Troubleshooters.Com reserves the right to edit any submission
for
clarity
or brevity, within the scope of the Open Publication License. If you
elect
to prohibit substantive modifications, we may elect to place editors
notes
outside of your material, or reject the submission, or send it back for
modification.
Any published article will include a two sentence description of the
author,
a hypertext link to his or her email, and a phone number if desired.
Upon
request, we will include a hypertext link, at the end of the magazine
issue,
to the author's website, providing that website meets the
Troubleshooters.Com
criteria for links
and that the
author's
website first links to Troubleshooters.Com. Authors: please understand
we
can't place hyperlinks inside articles. If we did, only the first
article
would be read, and we can't place every article first.
Submissions should be emailed to Steve Litt's email address,
with
subject
line Article Submission. The first paragraph of your message should
read
as follows (unless other arrangements are previously made in writing):
Copyright (c) 2003 by
<your name>. This
material
may be distributed only subject to the terms and conditions set forth
in
the Open Publication License, version Draft v1.0, 8 June 1999
(Available
at http://www.troubleshooters.com/openpub04.txt/ (wordwrapped for
readability
at http://www.troubleshooters.com/openpub04_wrapped.txt). The latest
version
is presently available at
http://www.opencontent.org/openpub/).
Open Publication License
Option A [ is | is not]
elected,
so this document [may | may not] be modified. Option B is not elected,
so
this material may be published for commercial purposes.
After that paragraph, write the title, text of the article,
and a
two
sentence description of the author.
Why not Draft v1.0, 8 June 1999 OR LATER
The Open Publication License recommends using the word "or later" to
describe
the version of the license. That is unacceptable for Troubleshooting
Professional
Magazine because we do not know the provisions of that newer version,
so
it makes no sense to commit to it. We all hope later versions will be
better,
but there's always a chance that leadership will change. We cannot take
the
chance that the disclaimer of warranty will be dropped in a later
version.
Trademarks
All trademarks are the property of their respective owners.
Troubleshooters.Com(R)
is a registered trademark of Steve Litt.
URLs
Mentioned in this Issue