Troubleshooters.Com Presents

Linux Productivity Magazine

April 2007

NoSQL: The Unix Database (With awk)

Copyright (C) 2007 by Steve Litt. All rights reserved. Materials from guest authors copyrighted by them and licensed for perpetual use to Linux Productivity Magazine. All rights reserved to the copyright holder, except for items specifically marked otherwise (certain free software source code, GNU/GPL, etc.). All material herein provided "As-Is". User assumes all risk and responsibility for any outcome.

See also Troubleshooting Techniques of the Successful Technologist
and Rapid Learning: Secret Weapon of the Successful Technologist
by Steve Litt




NoSQL is an extension of the UNIX environment, making available the full power of UNIX during application development and usage.  -- From the Philosophy of NoSQL

CONTENTS

Editor's Desk

By Steve Litt
I need your help.

I need your help building a Rapid Application Development (RAD) toolkit for Linux.

With the possible exception of Rails, there are few if any truly fast Rapid Application Development (RAD) toolkits for Linux. There are very few ways to construct a full-fledged, five-table app in a single day.

There were very few for DOS or Windows, for that matter. I'm in the process of specifying the tools and inter-tool communication for a Linux RAD toolkit, and need ideas and a little coding help. For this first iteration in creating the toolkit, I'm using the simplest database I know of, NoSQL. The toolkit can easily be adapted to MySQL or Postgres later, but right now let's get it working with NoSQL. For more details on the upcoming RAD toolkit, see the
RAD on Linux article later in this magazine, and please email me if you're interested.

Data Base Management Systems (DBMS) are wonderful, aren't they? Transactions, row locking, user passwords, client-server, triggers and stored procedures. But sometimes, when you're doing a simple app that just needs a local relational database, they're just too much hassle. DBMS's, even the good ones like MySQL and Postgres, seem like black boxes. Wouldn't it be nice to be able to store relational data in files? Wouldn't it be nice to have the option to read them with grep and write them with cat? To back them up as ordinary files, rather than having a special database backup procedure?

Welcome to the world of NoSQL. Like the big boys, it has tables full of rows and columns. The tables can be joined relationally. But unlike the big boys, if necessary you can edit them with vi. And you don't need DBI::DBD, PHP, Rails, GDK,  TK or QT to work with them. All you need are a few Unix commands, filters really, to read and write the data you need.

OK, you're not going to scale a NoSQL database up to 100 users hitting it hard with both reads and writes. But even if that's what you'll eventually need, NoSQL provides an excellent prototyping environment. Changing table structures is trivially easy. Adding tons of test (or real) data consists of a cat and a sort. You access your data with Unix commands, not with SQL.

Here's the number one advantage of NoSQL. If you know Unix (or Linux or BSD), you know NoSQL. You'll learn a few commands packaged with NoSQL, and you're off and running.

Carlo Strozzi created NoSQL using Debian Linux, so that's your easiest install. Probably any Linux is easy. BSD and other Unices are harder -- the more they differ from Linux, the harder they will be. But any Unix or Unix workalike can accommodate NoSQL without too much trouble. On Windows: use Access :-).

If you need to patch together a quick and dirty database app, in a few minutes, and appearance isn't a priority, consider NoSQL. You can always migrate the data to a database via comma delimited files or a simple script to turn a NoSQL table into an SQL dump.
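For instance, here's a minimal sketch of such a migration script; the script name, the target table name and the quoting rules are my own assumptions, not part of NoSQL, but everything in it is ordinary shell, head, tr, tail and awk:

#!/bin/bash
# table2sql.sh (hypothetical): emit SQL INSERT statements from a NoSQL table.
# Usage: ./table2sql.sh people.table people > people.sql
table="$1"
sqltable="$2"
# The first line holds the column names; strip NoSQL's \x01 header markers.
cols=`head -n1 "$table" | tr -d '\001' | tr '\t' ','`
# Every remaining (tab-delimited) line becomes one INSERT statement.
tail -n +2 "$table" | awk -F'\t' -v tbl="$sqltable" -v cols="$cols" -v q="'" '{
   vals = ""
   for (i = 1; i <= NF; i++) {
      v = $i
      gsub(q, q q, v)   # double any embedded single quotes
      vals = vals (i > 1 ? ", " : "") q v q
   }
   printf "INSERT INTO %s (%s) VALUES (%s);\n", tbl, cols, vals
}'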

So kick back, relax, and use the database that every Unix person knows before they've even heard of it. And remember, if you use GNU/Linux, this is your magazine.
Steve Litt is the author of Troubleshooting Techniques of the Successful Technologist.   Steve can be reached at his email address.

Is Linux Productivity Magazine Back for Good?

By Steve Litt
Linux Productivity Magazine ceased publication in the summer of 2004, after hurricanes Charley and Frances wiped out our roof and temporarily stopped all Troubleshooters.Com business activities. This is the first Linux Productivity Magazine since then. From now on, Linux Productivity Magazine will be an occasional publication. For that reason the magazines will no longer have volume and issue numbers. From now on, new Linux Productivity Magazines will appear only when I have suitable content and the time to write them.

So check back from time to time. It will be worth it.
Steve Litt is the author of Manager's Guide to Technical Troubleshooting.   Steve can be reached at his email address.

Installing NoSQL

By Steve Litt
Like everything else in Linux, dependencies are the challenge. According to its INSTALL documentation, NoSQL requires the following: Perl 5, the mawk interpreter, the lockfile utility (from the procmail package), mktemp, grep, and the GNU text utilities.
Here are the tests for these utilities:
[slitt@mydesk 200704]$ perl -v

This is perl, v5.8.8 built for i386-linux
(with 1 registered patch, see perl -V for more detail)

Copyright 1987-2006, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using "man perl" or "perldoc perl". If you have access to the
Internet, point your browser at http://www.perl.org/, the Perl Home Page.

[slitt@mydesk 200704]$
[slitt@mydesk 200704]$ lockfile -v
lockfile v3.22 2001/09/10
Copyright (c) 1990-2001, Stephen R. van den Berg <srb@cuci.nl>
Copyright (c) 1997-2001, Philip A. Guenther <guenther@sendmail.com>

Submit questions/answers to the procmail-related mailinglist by sending to:
<procmail-users@procmail.org>

And of course, subscription and information requests for this list to:
<procmail-users-request@procmail.org>

Your system mailbox's lockfile: /var/spool/mail/slitt.lock
[slitt@mydesk 200704]$

[slitt@mydesk 200704]$ mawk -W version
mawk 1.3.3 Nov 1996, Copyright (C) Michael D. Brennan

compiled limits:
max NF 32767
sprintf buffer 1020
[slitt@mydesk 200704]$
[slitt@mydesk 200704]$ mktemp
/home/slitt/tmp/tmp.kVWCt24221
[slitt@mydesk 200704]$
[slitt@mydesk 200704]$ grep -v
Usage: grep [OPTION]... PATTERN [FILE]...
Try `grep --help' for more information.
[slitt@mydesk 200704]$


By far the most likely stumbling block will be mawk. Almost any Unix will have Perl 5. Almost any Linux will have the GNU text utilities. Most Linuces come with procmail packages, so lockfile shouldn't be a problem. The mktemp utility is fairly ubiquitous.

Your distro probably didn't come with mawk. Instead it probably came with GNU's gawk. Although you could theoretically adapt NoSQL to use gawk instead of mawk, installing mawk is easy enough...

Installing mawk

If mawk is already on your system, or if there's an easy way to install it as a package, you can skip this section. However, I can tell you that on Mandriva, trying to install mawk as an RPM package turns into a multilevel dependency nightmare.

So here's what you do. Go to http://freshmeat.net/projects/mawk/ and download the latest tarball. Untar it into a directory. If you're not using Linux or BSD, look carefully at the INSTALL file. Otherwise, start by running ./configure. If it works, great. If you're like me, it will fail something like this:
[slitt@mydesk mawk-1.3.3]$ ./configure
creating cache ./config.cache
checking for gcc... gcc
checking whether we are using GNU C... yes
checking how to run the C preprocessor... gcc -E
checking for -lm... no
checking for log... no
configure: error: Cannot find a math library. You need to set MATHLIB in config.user
[slitt@mydesk mawk-1.3.3]$

This is a known problem, and simple to fix. Look in the config.user file and locate something that looks like this:
# configure will look for libm. Change if you know this will fail
# or want a different math library
#MATHLIB=-lspecialmath
#MATHLIB='' # if you don't need a special lib to get sin,sqrt,etc

What you need to do is set MATHLIB to whatever flag you need to tell gcc to include math libraries. This is typically -lm, so you'd make the following change:

# configure will look for libm. Change if you know this will fail
# or want a different math library
#MATHLIB=-lspecialmath
#MATHLIB='' # if you don't need a special lib to get sin,sqrt,etc
MATHLIB=-lm

Once you get the right setting for  MATHLIB, configure should work, although you'll see some scary looking warnings about incompatible implicit declarations. Then do make and make install, and that's it, mawk is installed.
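In other words, once MATHLIB is set, the whole build and install boils down to something like this (run the install step as root):

./configure
make
make install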

Installing NoSQL

Grab the tarball from http://www.scriptaworks.com/shared/nosql/. Untar it into a previously empty directory, then read the INSTALL file, which will give you some tips and illuminate a few landmines. It also tells you how to install the NoSQL files into a directory other than the default /usr/local/nosql.

All of the above being said, as long as you have a working mawk, perl 5, lockfile, mktemp and GNU text tools, probably all you'll need to do is become root and run this command:
./install-sh
Then you're all set, except for the executable path.

Tweaking the Executable Path

NoSQL uses a lot of very common words, such as join and select, as commands. These can clash with like-named Linux commands. Therefore, when running NoSQL based applications, the NoSQL binary directory (default /usr/local/nosql/bin) must be at the front of the path. However, under normal, non-NoSQL situations, you want it at the back of the path or maybe not on the path at all, so that Linux works normally. This calls for a script to create a NoSQL friendly environment:

#!/bin/bash
# nosql_environ.sh: start a subshell with NoSQL's binaries first on the path
PATH=/usr/local/nosql/bin:$PATH
export PATH
OLD_PS1=$PS1
export OLD_PS1
PS1='[nosql_\u@\h \W]\$ '
export PS1
bash

The preceding script puts you in an environment in which the NoSQL binary directory is at the front of the path, and the string nosql is in the prompt, so you can instantly recognize what's going on.

WARNING

The rxvt terminal program, in its infinite wisdom, resets the $PS1 environment variable when it starts, so if you're already in a NoSQL environment and then start rxvt, the prompt will not show the "nosql_" even though the NoSQL binary directory will still be at the front of the path. All other terminal programs I tested worked right -- both the path and the prompt were passed down from parent terminal to child terminal.

Always be sure to have the NoSQL environment running before doing NoSQL work. Otherwise, wrong commands will do wrong things, creating very confusing failures.
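One quick way to verify you're really in the NoSQL environment, beyond looking at the prompt, is to ask bash which column command it will run; assuming the default install prefix, you should see NoSQL's version:

type -p column
# should print /usr/local/nosql/bin/column when the NoSQL environment is active
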
Steve Litt is the author of Twenty Eight Tales of Troubleshooting.   Steve can be reached at his email address.

Hello World, Unix Style

By Steve Litt 
Using the Vim or Emacs editor, create the following file called people.table:
[Screenshot: people.table as created in Vim -- each column name in the first row preceded by a \x01 (Ctrl+A) character, and columns separated by tabs]

The preceding was created with Vim. It's shown as a screenshot rather than text so you can see the first line has every column name preceded by a \x01 character, otherwise known as a Ctrl+A. You can do this in most Unix text editors, including Emacs. The purpose of the \x01 character is so the row of column names, otherwise known as the heading, sorts to the top. Each column name is preceded by a \x01 so that if the table is used to produce a table with columns rearranged, it will still sort to the top. Implied in all of this is that characters \x00 and \x01 are illegal within NoSQL column values.

Every column's value is separated from the next column's value and the previous column's value by a tab character. The same is true for the column names on the top row. The first column is not preceded by a tab character, and the last column is not followed by a tab character. All of this means tab characters are illegal within a NoSQL column value.
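To make the format concrete, here's a sketch of the first few lines of people.table with the invisible characters spelled out; <TAB> stands for a real tab and \x01 for a real Ctrl+A (the file contains the control characters themselves, not these placeholders):

\x01person_id<TAB>\x01lname<TAB>\x01fname<TAB>\x01job_id
1<TAB>Strozzi<TAB>Carlo<TAB>1
2<TAB>Torvalds<TAB>Linus<TAB>1
3<TAB>Stallman<TAB>Richard<TAB>1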

Because rows are separated by newlines, newlines are illegal within a column's value. So \x00, \x01, tab and newline are all illegal. If binary data is required in NoSQL data, there are several alternatives. If you want binary data that's more recognizable as what it is, you could use the following escape strings:

Binary value   Escape string
\x00           \x00
\x01           \x01
newline        \x0c
tab            \x0b
backslash      \\

Of course, you would need to write the programs to encode and decode binary data to respect those escapes.

The leftmost column of any NoSQL table is its primary key, sorted in ascending order; in this case that's person_id. If the primary key is compound, then it's the leftmost 2 columns, or the leftmost 3 columns, or however many segments comprise the primary key. The file must be physically sorted by primary key. If it is not, you'll see later that there is a very easy way to re-sort it by primary key.

Note also that the primary key is numeric. That means the sort must be done numerically (similar to sort -n), a feature NoSQL includes, but it's usually easier to sort as a string. If you start your numeric keys with a large number (say 100000), then all numeric keys will be the same length and will therefore be sortable as strings.

Finally, note that the last column is a seemingly meaningless number. Later on we'll join that field to job_id in the jobs.table table. As a little heads-up, job 1 is Visionary, job 2 is Troubleshooter, job 3 is President, and job 4 is Vice President.

OK, so we created the file, but is it truly a NoSQL table? Run your script to create a NoSQL environment, and then use the column command to see:
[slitt@mydesk 200704]$ nosql_environ.sh
[nosql_slitt@mydesk 200704]$ cat people.table | column -r
person_id lname fname job_id
1 Strozzi Carlo 1
2 Torvalds Linus 1
3 Stallman Richard 1
4 Litt Steve 2
5 Bush George 3
6 Clinton Bill 3
7 Reagan Ronald 3
8 Cheney Dick 4
9 Gore Al 4
[nosql_slitt@mydesk 200704]$

There's the proof. The -r argument means "print all columns except the ones named", and since none are named, it prints them all. Just for fun, let's correctly justify everything:
[nosql_slitt@mydesk 200704]$ cat people.table | column -r | justify
person_id lname fname job_id
--------- -------- ------- ------
1 Strozzi Carlo 1
2 Torvalds Linus 1
3 Stallman Richard 1
4 Litt Steve 2
5 Bush George 3
6 Clinton Bill 3
7 Reagan Ronald 3
8 Cheney Dick 4
9 Gore Al 4
[nosql_slitt@mydesk 200704]$

That's the proof of concept, assuming you're willing to make the table in a text editor. If you do so, you're responsible for correctly inserting tab characters and \x01 characters. After all is done, you need to search for tab characters to make sure you didn't accidentally use spaces where you should have had tabs, or tabs where you should have had spaces.
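One quick way to do that check, assuming GNU coreutils and grep, is to make the invisible characters visible:

# cat -A shows tabs as ^I and line ends as $, so stray spaces stand out
cat -A people.table | head
# flag any line containing a space character (none should, in this table)
grep -n ' ' people.table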

If you don't like creating a table with a text editor, read on...
Steve Litt is the author of the Universal Troubleshooting Process Courseware. Steve can be reached at his email address.

Hello World, NoSQL Style

By Steve Litt
Create the following people.xref template file:
# This is a comment in the people.xref template file
# To create an empty people.table from people.xref,
# just run
# cat people.xref | maketable > people.table
# BE CAREFUL not to overwrite a good people.table!!!
person_id Primary key
lname Person's last name
fname Person's first name
job_id ID of person's job (relate to jobs table)
The comments are preceded by hash characters, like any other Unix file. The first set of non-space characters on non-comment lines is the column name, while anything after the first space is a description that is not used further.

Watch this:
[nosql_slitt@mydesk 200704]$ cat people.xref | maketable > people.table
[nosql_slitt@mydesk 200704]$ cat people.table | column -r
person_id lname fname job_id
[nosql_slitt@mydesk 200704]$

The preceding proves that you just created a valid empty table with filename people.table. Now it's time to fill the table with data. Run the following command:
edittable people.table
You'll be editing a file that looks like this:
person_id|lname|fname|job_id
---------|-----|-----|------

Note that the editor you will be using will be Vim, unless you've configured edittable to use a different editor by setting and exporting the EDITOR environment variable. Continue editing by filling in values separated by the pipe symbol:

person_id|lname|fname|job_id
---------|-----|-----|------
1|Strozzi|Carlo|1
2|Torvalds|Linus|1
3|Stallman|Richard|1
4|Litt|Steve|2
5|Bush|George|3
6|Clinton|Bill|3
7|Reagan|Ronald|3
8|Cheney|Dick|4
9|Gore|Al|4

Using edittable, you needn't type in the \x01 characters in each column name, and instead of typing tab characters you use the much more easily discernible pipe character (|).

Now save and exit using the proper commands for whatever editor edittable is configured to use (Vim is the default). As edittable exits, it automatically checks what you typed in, to make sure you included the right number of columns in every row and made no other mistakes. Assuming everything is OK, it converts your input into properly formatted tab delimited data, as shown by the following command:

[nosql_slitt@mydesk 200704]$ cat people.table | column -r | justify
person_id lname fname job_id
--------- ---------- --------- ------
1 Strozzi Carlo 1
2 Torvalds Linus 1
3 Stallman Richard 1
4 Litt Steve 2
5 Bush George 3
6 Clinton Bill 3
7 Reagan Ronald 3
8 Cheney Dick 4
9 Gore Al 4
[nosql_slitt@mydesk 200704]$

If the check for legality on exiting edittable had found problems, you'd have been presented with the following choices:

istable: long record at line 9
istable: table not ok

What shall I do ?

1) Re-edit the table
2) Quit without saving the changes
3) Commit the changes anyway

=>

Almost always, the thing you want to do is re-edit. You probably put in one too many or one too few pipe symbols on one of the lines, or you left out a \x01 character in a column header, or maybe you used a tab instead of a pipe symbol, or made some other mistake that's easy to fix.

Anyway, you've just proved the concept without directly editing in an editor.
Steve Litt is the author of Twenty Eight Tales of Troubleshooting.   Steve can be reached at his email address.

The justify Command

By Steve Litt
The justify command makes the output of a NoSQL command "look right". I've added a couple of people, including one with the long name "Alexander Montgomery". Here's what the table looks like using column -r:
[nosql_slitt@mydesk 200704]$ cat people.table | column -r
person_id lname fname job_id
1 Strozzi Carlo 1
2 Torvalds Linus 1
3 Stallman Richard 1
4 Litt Steve 2
5 Bush George 3
6 Clinton Bill 3
7 Reagan Ronald 3
8 Cheney Dick 4
9 Gore Al 4
10 Litt Brett 5
11 Montgomery Alexander 6
[nosql_slitt@mydesk 200704]$

Pretty ugly, isn't it. There's nothing like the combination of varying string lengths plus the use of tabs to muddle which column is which. Enter the justify command:
[nosql_slitt@mydesk 200704]$ cat people.table | column -r | justify
person_id lname fname job_id
--------- ---------- --------- ------
1 Strozzi Carlo 1
2 Torvalds Linus 1
3 Stallman Richard 1
4 Litt Steve 2
5 Bush George 3
6 Clinton Bill 3
7 Reagan Ronald 3
8 Cheney Dick 4
9 Gore Al 4
10 Litt Brett 5
11 Montgomery Alexander 6
[nosql_slitt@mydesk 200704]$

Ahh, that's better. Tabs are converted to spaces, and the longest string in each column determines the space given to the column. Note that justify itself can be used to display the table, so in the preceding command the column -r was unnecessary. I could have just done this:
cat people.table | justify
Care must be taken, because the output of justify is not a valid NoSQL table, and therefore cannot be used as input to any type of database command. Observe:
[nosql_slitt@mydesk 200704]$ cat people.table | istable
[nosql_slitt@mydesk 200704]$ echo $?
0
[nosql_slitt@mydesk 200704]$ cat people.table | justify | istable
[nosql_slitt@mydesk 200704]$ echo $?
1
[nosql_slitt@mydesk 200704]$

The istable command returns 0 if fed a valid NoSQL table, nonzero otherwise.
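That return code makes istable handy as a guard in shellscripts. Here's a minimal sketch:

#!/bin/bash
# Bail out early if people.table has been mangled.
if ! istable < people.table; then
   echo "people.table is not a valid NoSQL table" >&2
   exit 1
fi
# ...the rest of the NoSQL pipeline goes here...
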
Steve Litt is the author of Troubleshooting: Just the Facts.   Steve can be reached at his email address.

Sorting and Indexing

By Steve Litt
Here's how you sort a table alphabetically by rows:
cat people.table | sorttable
The output of that command is people.table sorted by the leftmost column, then the second leftmost, then the third... If for some reason you need it reverse sorted, add -r:
cat people.table | sorttable -r
NoSQL tables should already be sorted by the leftmost column (the primary key), or by the leftmost two or three columns when the primary key is multisegment, so whole-row sorting is seldom necessary unless you're sorting the output of a command that left the data unsorted. However, you often need to sort by a column that is not the primary key. The following is an example where I sort by last name and then first name:

[nosql_slitt@mydesk 200704]$ cat people.table | sorttable lname fname | justify
person_id lname fname job_id
--------- ---------- --------- ------
100005 Bush George 3
100008 Cheney Dick 4
100006 Clinton Bill 3
100009 Gore Al 4
100010 Litt Brett 5
100004 Litt Steve 2
100012 Mondale Walter 4
100011 Montgomery Alexander 6
100007 Reagan Ronald 3
100013 Reagan Ronald 3
100003 Stallman Richard 1
100001 Strozzi Carlo 1
100002 Torvalds Linus 1
[nosql_slitt@mydesk 200704]$

The sorttable -h help facility says you can also sort different columns in different ways, like this:
cat people.table | sorttable job_id:rn lname fname
In the preceding the rn after the colon means reverse (r) and numeric (n). I haven't had a lot of luck with the stuff after the colon -- your mileage may vary.

Being a very lightweight database management system (DBMS), NoSQL requires that the key for a table must be the first (leftmost) column, and the table must be physically sorted on the key.

Sorting can be alphabetic or numeric. For keys especially, in my opinion alphabetic sorts are much better. With alphabetic keys, you can append a record with cat >>. If the key were numeric, the file would need re-sorting after every record addition. Also, if the key is sorted numerically, the keysearch command fails if some key values have fewer digits than others. All this can be avoided by making the first key 100001. This enables arithmetic incrementing of the key, while the next 899,999 records have keys of the same digit length, so they can be appended and sorted right. With most applications, by the time the key grows another digit, the app would need a rewrite, at which time the key length could be increased.

Even if you need to make it longer, you can do it on the fly with the compute command:

[nosql_slitt@mydesk 200704]$ cat people.table | compute person_id=person_id+1000000
NoSQL: compute is deprecated, use 'awktable' instead
person_id lname fname job_id
1100001 Strozzi Carlo 1
1100002 Torvalds Linus 1
1100003 Stallman Richard 1
1100004 Litt Steve 2
1100005 Bush George 3
1100006 Clinton Bill 3
1100007 Reagan Ronald 3
1100008 Cheney Dick 4
1100009 Gore Al 4
1100010 Litt Brett 5
1100011 Montgomery Alexander 6
1100012 Mondale Walter 4
1100013 Reagan Ronald 3
[nosql_slitt@mydesk 200704]$

Here in 2007, pay no attention to the message of deprecation -- compute works just fine, and awktable is not yet well documented. You can move people.table to a temp file, then run the preceding command on the temp file, redirecting it to people.table. The resulting people.table will have a seven-digit person_id field. New rows can be appended via cat >>, and the keysearch command works properly.
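Here's a sketch of that temp-file shuffle:

#!/bin/bash
# Widen person_id in place by running the whole table through compute.
mytemp=`mktemp`
mv people.table $mytemp
cat $mytemp | compute person_id=person_id+1000000 > people.table
rm -f $mytemp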

Compound Keys

NoSQL tables accommodate compound keys. For instance, if you were to key  the people table on last name and then first name, you'd just make sure lname was the first column, fname the second, and then sort the table. In that case you could search on the first of the key segments like this:
keysearch -p Litt people_byname.table 
A disadvantage of meaningful primary keys is that you can never count on newly inserted rows having a higher key value than the others, so the table must be sorted each time.
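To make that concrete, here's a sketch of building such a name-keyed table from people.table and then searching it on the first segment of the compound key:

# Make lname and fname the leftmost columns, then sort on the new key.
cat people.table | column lname fname person_id job_id | sorttable > people_byname.table
# -p searches on the leading segment(s) of the primary key.
keysearch -p Litt people_byname.table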

Indexing

As mentioned, in NoSQL every table should be sorted by its primary key. However, fast searches must sometimes be done on other columns. Enter indexing. By creating an index on a given column, you enable keysearch to work with that column.

To index people.table on job_id, you'd do this:

[nosql_slitt@mydesk 200704]$ indextable people.table job_id
gen ndx file people.table._x.job_id (people.table) on: job_id
Reading people.table ...
Finishing sorting ...
[nosql_slitt@mydesk 200704]$ keysearch -I people.table._x.job_id 3 people.table
person_id lname fname job_id
100005 Bush George 3
100006 Clinton Bill 3
100007 Reagan Ronald 3
[nosql_slitt@mydesk 200704]$

In the preceding session, the second command proved that the index was built correctly, using the -I argument to specify which index file to use in the search.

Whenever you add rows to a table, you must reindex all its indices. Here's the command you use to reindex the index on job_id:
[nosql_slitt@mydesk 200704]$ indextable --update people.table._x.job_id 
gen ndx file people.table._x.job_id (people.table) on: job_id
Reading people.table ...
Finishing sorting ...
[nosql_slitt@mydesk 200704]$

Until its indices are rebuilt, the keysearch command will likely deliver the wrong row if called on them. Therefore, it's sometimes OK to add several rows before reindexing, but reindexing must be guaranteed to happen before further calls to keysearch on any of the table's indices.

Does all this sorting and indexing seem slow? Read on...

Speed and Efficiency

By Steve Litt
NOTE
All timing tests in this article were done on an Athlon XP2600+ with 1.5GB of RAM. As of 3/2007, this is a faster than average machine, but $600.00 will buy you a more powerful desktop, and $1300.00 will buy you a more powerful notebook, so it's definitely off the shelf hardware.

You need to sort and reindex often. Sounds kinda slow. It usually isn't, at least on moderately sized tables. To research this, let's start with a Ruby program that outputs a people table of arbitrary size:

#!/usr/bin/ruby -w

fnames = [
"Albert",
"Betty",
"Charles",
"Darlene",
"Edward",
"Felice",
"George",
"Helen",
"Ichibod",
"Janet",
"Karl",
"Linda",
"Max",
"Nancy",
"Oliver",
"Patricia",
"Quincy",
"Rita",
"Steve",
"Tiffany",
"Uvon",
"Victoria",
"William",
"Xaviera",
"Yale",
"Zelda"
]

lnames = [
"Anderson",
"Becker",
"Chavez",
"Daniels",
"Elsworth",
"Fox",
"Garcia",
"Holden",
"Iverson",
"Jackson",
"Kantor",
"Levin",
"Martinez",
"Nagy",
"Olmstead",
"Peterson",
"Quigley",
"Ramirez",
"Stevenson",
"Torkelson",
"Ullman",
"Victor",
"Williamson",
"Xavier",
"Yarborough",
"Ziegler"
]

puts "\x01people_id\t\x01lname\t\x01fname\t\x01job_id"
people_id = 100000
multiplier = (ARGV[0] || 10).to_i   # default to 10 if no argument given
(1..multiplier).each do |n|
  lnames.each do |lname|
    fnames.each do |fname|
      people_id += 1
      job_id = 1000 + (people_id % 7)
      puts "#{people_id.to_s}\t#{lname}\t#{fname}\t#{job_id.to_s}"
    end
  end
end
Next, let's create two scripts: the first, testtime.sh, runs the (slow) Ruby program above and redirects its output to ./junk.jnk, and then times the execution of the second, testtime2.sh, which sorts the file the Ruby program created. Here's ./testtime.sh:
#!/bin/bash
./test_table.rb $1 > junk.jnk
time ./testtime2.sh
and here's ./testtime2.sh:
#!/bin/bash
sorttable fname < junk.jnk | wc

The results are listed below, but basically, with 67,600 rows, it sorts in the time you could hit two or three keystrokes. With 676,000 rows, it takes about 5 seconds, which is tolerable in some situations. With 6,760,000 rows (that's 6 million rows folks), it takes an intolerable 65 seconds.

In other words, when the number of rows is expected to go significantly above 100,000, try to select before you sort.

[nosql_slitt@mydesk 200704]$ ./testtime.sh 10
6761 27044 179953
0.04user 0.00system 0:00.05elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+1275minor)pagefaults 0swaps
[nosql_slitt@mydesk 200704]$ ./testtime.sh 100
67601 270404 1799233
0.42user 0.02system 0:00.45elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2027minor)pagefaults 0swaps
[nosql_slitt@mydesk 200704]$ ./testtime.sh 1000
676001 2704004 17992033
5.00user 0.15system 0:05.15elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+9544minor)pagefaults 0swaps
[nosql_slitt@mydesk 200704]$ ./testtime.sh 10000
6760001 27040004 185780034
62.27user 2.83system 1:05.57elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+71507minor)pagefaults 0swaps
[nosql_slitt@mydesk 200704]
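As an example of selecting before you sort, you could prescreen junk.jnk with grep so that sorttable only sees candidate rows; this sketch uses the filtertable command covered later in this issue, which keeps the header line intact:

filtertable grep Anderson < junk.jnk | sorttable fname | wc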

Now let's see how much time it takes to reindex, by changing testtime.sh and testtime2.sh to the following:

Here's the new testtime.sh:

#!/bin/bash
./test_table.rb $1 > junk.jnk
indextable junk.jnk fname
echo -e "aaaaaa\taronson\talphonso\t1008" >> junk.jnk
echo -e "bbbbbb\tbenitez\tbetty\t1009" >> junk.jnk
wc junk.jnk
time ./testtime2.sh

and here's the new testtime2.sh:

#!/bin/bash
indextable --update junk.jnk._x.fname

[nosql_slitt@mydesk 200704]$ ./testtime.sh 10
gen ndx file junk.jnk._x.fname (junk.jnk) on: fname
Reading junk.jnk ...
Finishing sorting ...
6763 27052 180008 junk.jnk
gen ndx file junk.jnk._x.fname (junk.jnk) on: fname
Reading junk.jnk ...
Finishing sorting ...
0.05user 0.02system 0:00.07elapsed 102%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2513minor)pagefaults 0swaps
[nosql_slitt@mydesk 200704]$ ./testtime.sh 100
gen ndx file junk.jnk._x.fname (junk.jnk) on: fname
Reading junk.jnk ...
Finishing sorting ...
67603 270412 1799288 junk.jnk
gen ndx file junk.jnk._x.fname (junk.jnk) on: fname
Reading junk.jnk ...
Finishing sorting ...
0.52user 0.01system 0:00.54elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+3211minor)pagefaults 0swaps
[nosql_slitt@mydesk 200704]$ ./testtime.sh 1000
gen ndx file junk.jnk._x.fname (junk.jnk) on: fname
Reading junk.jnk ...
Finishing sorting ...
676003 2704012 17992088 junk.jnk
gen ndx file junk.jnk._x.fname (junk.jnk) on: fname
Reading junk.jnk ...
Finishing sorting ...
5.82user 0.40system 0:06.21elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+9031minor)pagefaults 0swaps
[nosql_slitt@mydesk 200704]$ ./testtime.sh 10000
gen ndx file junk.jnk._x.fname (junk.jnk) on: fname
Reading junk.jnk ...
Finishing sorting ...
6760003 27040012 185780089 junk.jnk
gen ndx file junk.jnk._x.fname (junk.jnk) on: fname
Reading junk.jnk ...
Finishing sorting ...
59.06user 4.30system 1:06.90elapsed 94%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+9769minor)pagefaults 0swaps
[nosql_slitt@mydesk 200704]$
Above you see the reindexing results. 67,603 rows take half a second. 676,003 rows take a slightly irritating but still tolerable 6.2 seconds. 6,760,003 rows take an unacceptable 66 seconds.

There are ways around this, as explained on the NoSQL website. One way is to make all adds and changes go to a small journaling table, and use that to update the main table every night.

Then there's the other viewpoint: When you have over a million rows in any table, you've probably outgrown NoSQL if you do many writes, and you've certainly outgrown an Athlon XP2600+ with 1.5GB of RAM.

The bottom line is this: If you have a reasonably powerful computer and none of your tables exceed 100,000 rows, NoSQL will keep up with your keystrokes. 100,000 rows exceeds most personal, home office, and mom and pop applications, especially if every year you move every transaction older than 2 years into an archive table that never changes.
Steve Litt is the author of Troubleshooting: Just the Facts.   Steve can be reached at his email address.

Viewing Data From a Single Table

By Steve Litt
Now let's look more carefully at viewing data from a single table. When I say "viewing", I use the term loosely, because NoSQL does most of its work through piping and redirection. Thus, you could create a different table like this:
[nosql_slitt@mydesk 200704]$ cat people.table | column fname lname job_id | sorttable > people_byname.table
[nosql_slitt@mydesk 200704]$ cat people_byname.table
Al Gore 4
Bill Clinton 3
Carlo Strozzi 1
Dick Cheney 4
George Bush 3
Linus Torvalds 1
Richard Stallman 1
Ronald Reagan 3
Steve Litt 2
[nosql_slitt@mydesk 200704]$

Look what you just did with one command. You removed the person_id column, reversed the order of lname and fname, used fname and lname as a compound key, and physically sorted the table by that key. In a system that uses piping and redirection, "viewing" can be a very powerful thing.

Before moving on, one more thing about the preceding table organization -- one could ask how NoSQL deduced that it was a compound key comprised of fname and lname, rather than a single key comprised of fname, or a triple compound key comprised of fname, lname, and job_id. It doesn't deduce it at this point -- the table is sorted by the combination of all three, so it could be used as a single, double, or triple compound key. Later, when you do things like create indexes or schemas, such keys will be defined more exactly. But it really doesn't matter -- you just select on whatever single or compound key is created by the leftmost set of columns, properly sorted.

When viewing data, you can either view all the data, or you restrict the viewed data. The three kinds of restriction are these:
Restrict by   Commands
Column        column
Row           keysearch, getrow, filtertable, awktable
Join          jointable

Restricting by Column

The column Command

This is how you view only certain columns, or reorder columns. Remembering people.table, if you don't care about seeing the job_id column, you can do this:
cat people.table | column person_id lname fname
A simpler way to accomplish that would have been to use the -r reverse option, whereby everything not named as a column is printed:
cat people.table | column -r job_id
Remember that NoSQL defines a primary key as the leftmost column(s), so using the column command to re-key the table is as simple as moving the intended key column(s) into the leftmost position.

Perhaps you want to switch the first and last names while removing the job_id:
cat people.table | column person_id fname lname

Restricting by Row

These commands select the rows to be output.

The keysearch Command

Certainly the fastest of these commands is the keysearch command because it binary-searches either the primary key or an index to come up with the row(s) in just a few random reads. In the people.table table, here's how to find the row with person_id 100007:

[nosql_slitt@mydesk 200704]$ keysearch 100007 people.table
person_id lname fname job_id
100007 Reagan Ronald 3
[nosql_slitt@mydesk 200704]$

But what if you want to select by job_id, which is not sorted? Luckily, there's an index on job_id called people.table._x.job_id. Here's how you do it:

[nosql_slitt@mydesk 200704]$ keysearch -I people.table._x.job_id 3 people.table
person_id lname fname job_id
100005 Bush George 3
100006 Clinton Bill 3
100007 Reagan Ronald 3
[nosql_slitt@mydesk 200704]$

The getrow Command

The keysearch command is lightning fast as long as your primary key is sorted and your indices are up to date. If not, or if the column on which you want to select isn't indexed, you can use the getrow command, which searches sequentially. If you have only a few hundred rows, getrow is fast...

[slitt@mydesk 200704]$ cat people.table | getrow 'lname=="Litt"'
NoSQL: getrow is deprecated, use 'awktable' instead
person_id lname fname job_id
100004 Litt Steve 2
100010 Litt Brett 5
[slitt@mydesk 200704]$ cat people.table | getrow 'job_id=="3"'
NoSQL: getrow is deprecated, use 'awktable' instead
person_id lname fname job_id
100005 Bush George 3
100006 Clinton Bill 3
100007 Reagan Ronald 3
[slitt@mydesk 200704]$

Notice the single and doublequotes. These work, at least until you try to use an environment variable for the search term or you try to put the command in a shellscript. BY FAR the preferred syntax is to use doublequotes on the outside and escaped double quotes for the inner quotes, as follows:

[nosql_slitt@mydesk 200704]$ cat people.table | getrow "lname==\"Litt\""
NoSQL: getrow is deprecated, use 'awktable' instead
person_id lname fname job_id
100004 Litt Steve 2
100010 Litt Brett 5
[nosql_slitt@mydesk 200704]$ cat people.table | getrow "job_id==\"3\""
NoSQL: getrow is deprecated, use 'awktable' instead
person_id lname fname job_id
100005 Bush George 3
100006 Clinton Bill 3
100007 Reagan Ronald 3
[nosql_slitt@mydesk 200704]$

Note that with numeric fields you can forego the quotes entirely, but with text columns removing the quotes causes wrong answers:

[slitt@mydesk 200704]$ cat people.table | getrow job_id==3
NoSQL: getrow is deprecated, use 'awktable' instead
person_id lname fname job_id
100005 Bush George 3
100006 Clinton Bill 3
100007 Reagan Ronald 3
[slitt@mydesk 200704]$ cat people.table | getrow lname==Litt
NoSQL: getrow is deprecated, use 'awktable' instead
person_id lname fname job_id
[slitt@mydesk 200704]$ cat people.table | getrow lname=="Litt"
NoSQL: getrow is deprecated, use 'awktable' instead
person_id lname fname job_id
[slitt@mydesk 200704]$

Once again, I suggest ignoring the deprecation notice. As of March 2007, getrow works just fine, and awktable is not well documented, and depends on a knowledge of awk.

The filtertable Command

I love this command. It adapts any Unix filter (a command taking input from stdin and putting output to stdout) to work on a NoSQL table. Specifically, it outputs the table's header, and then runs the filter on the rest. For instance, here's a way you can use filtertable to speed selection of data:
[slitt@mydesk 200704]$ filtertable grep Litt < people.table | getrow 'lname=="Litt"'
NoSQL: getrow is deprecated, use 'awktable' instead
person_id lname fname job_id
100004 Litt Steve 2
100010 Litt Brett 5
[slitt@mydesk 200704]$

Of course, this won't speed selection on my tiny table, but if your table has a million rows, the ultra-efficient grep runs circles around getrow. Now of course, a guy named Litt Stevenson might be in the table, and he shouldn't be selected because Litt is his first name, not his last name. That's why the grep-selected rows must be fed to getrow, but at that point getrow might be working with a hundred rows instead of a million.

The Unix utility filters are the ultimate performers, but you might need to write your own filter. Consider a rather contrived situation where you need to insert a spouse for every current person. Here's one way you might do it, starting with a Ruby program to do the insertions:

#!/usr/bin/ruby -w
STDIN.each do |line|
  line.chomp!
  (person_id, lname, fname, job_id) = line.split
  puts line
  spouse_id = person_id.to_i + 200000
  puts "#{spouse_id}\t#{lname}\tSpouse\t9"
end

Now use the Ruby program as a filter to add spouses:

[slitt@mydesk 200704]$ filtertable ./insertspouses.rb < people.table
person_id lname fname job_id
100001 Strozzi Carlo 1
300001 Strozzi Spouse 9
100002 Torvalds Linus 1
300002 Torvalds Spouse 9
100003 Stallman Richard 1
300003 Stallman Spouse 9
100004 Litt Steve 2
300004 Litt Spouse 9
100005 Bush George 3
300005 Bush Spouse 9
100006 Clinton Bill 3
300006 Clinton Spouse 9
100007 Reagan Ronald 3
300007 Reagan Spouse 9
100008 Cheney Dick 4
300008 Cheney Spouse 9
100009 Gore Al 4
300009 Gore Spouse 9
100010 Litt Brett 5
300010 Litt Spouse 9
100011 Montgomery Alexander 6
300011 Montgomery Spouse 9
[slitt@mydesk 200704]$

Of course, you and I know that to be a useful table, the preceding must be filtered through an additional sort so the primary key is in order, but I wanted you to see how the filter inserted the proper spouse rows, and of course with their own unique person_id keys.
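In other words, the complete pipeline to produce a usable, properly keyed table would look something like this:

filtertable ./insertspouses.rb < people.table | sorttable > people_with_spouses.table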

The awktable Command

Currently (March 2007), this command is not documented well. That will change as time goes on, but for the time being, you might want to try other selection techniques.
Steve Litt is the author of many books. Steve can be reached at his email address.

Joining

By Steve Litt
The whole theory of relational databases is that in a given table, you list only same-typed entities and information that directly depends on the entity. For instance, if you have a table of departments, and each department has a manager, you don't put that manager's name, phone and address in the department row. Instead you put the manager's employee number in the department row, and then if necessary look up the manager's information in the employee table, based on the manager's employee number gotten from the department row. The manager's address depends directly on the manager, not on the department. You'd really hate to update address and phone every time a manager got promoted and a new one took her place.

Let's make the following jobs.table:

job_id job_name
1 Visionary
2 Troubleshooter
3 President
4 Vice President
5 Writer
9 Spouse
Be sure to remember to precede each column name in the header with \x01, and to put tabs between the first and second column, both in the header and in the data.

[slitt@mydesk 200704]$ cat people.table | sorttable job_id | jointable -j job_id jobs.table | justify
job_id person_id lname fname job_name
------ --------- -------- ------- --------------
1 100001 Strozzi Carlo Visionary
1 100002 Torvalds Linus Visionary
1 100003 Stallman Richard Visionary
2 100004 Litt Steve Troubleshooter
3 100005 Bush George President
3 100006 Clinton Bill President
3 100007 Reagan Ronald President
4 100008 Cheney Dick Vice President
4 100009 Gore Al Vice President
5 100010 Litt Brett Writer
9 100000 Cheney Lynn Spouse
[slitt@mydesk 200704]$

I had to sort the table with the foreign key. The table with the primary key was, of course, already sorted. That sort can take a long time if the table with the foreign key is large. More on this later.

Meanwhile, there's typically no need for the job_id, so the following is a more usual join:

[nosql_slitt@mydesk 200704]$ cat people.table | sorttable job_id | \
jointable -j job_id jobs.table | column -r job_id | \
sorttable | justify
person_id lname fname job_name
--------- -------- ------- --------------
100000 Cheney Lynn Spouse
100001 Strozzi Carlo Visionary
100002 Torvalds Linus Visionary
100003 Stallman Richard Visionary
100004 Litt Steve Troubleshooter
100005 Bush George President
100006 Clinton Bill President
100007 Reagan Ronald President
100008 Cheney Dick Vice President
100009 Gore Al Vice President
100010 Litt Brett Writer
[nosql_slitt@mydesk 200704]$

As mentioned, joining on a very large table requires a time-consuming sort if that's the table with the foreign key. Why would you join such a large table? Certainly not to look at it -- who has the time to look at a million rows?

You'd typically either look at a subset of it, or you'd use the join to select rows. Let's start with using the join to select rows...

The SQL syntax for this would be:
select p.* 
from people p, jobs j
where j.job_name='President'
and p.job_id=j.job_id;
Of course, as obvious from the name, you don't use SQL in NoSQL, so you'd do the equivalent like this:
[nosql_slitt@mydesk 200704]$ cat people.table | sorttable job_id | jointable -j job_id jobs.table | \
column -r job_id | getrow 'job_name=="President"' | justify
NoSQL: getrow is deprecated, use 'awktable' instead
person_id lname fname job_name
--------- ------- ------ ---------
100005 Bush George President
100006 Clinton Bill President
100007 Reagan Ronald President
[nosql_slitt@mydesk 200704]$

As you can see, that pipeline does exactly the same thing as the SQL command, in pretty much the same way. And it's expensive. You need to sort the entire people table (let's say it had a million rows) and use an inefficient getrow comparison on the entire people table. You can do much better with a short script:

#!/bin/bash
job_id=`cat jobs.table | getrow "job_name==\"President\"" | column -N job_id`
filtertable grep $job_id < people.table | getrow "job_id==\"$job_id\"" | justify

Imagine a million row people table with a short jobs table, which is typical of one to many relations. The first statement finds the job_id of the job whose name is "President". That's an expensive getrow, but it's on a small table. Armed with the job_id, the second statement pipelines several operations:
  1. Uses the highly efficient grep command to find any row that has that number, whether it applies to the job_id or not. This filters out the vast majority of rows.
  2. Uses the inefficient but accurate getrow command to select only those rows where the job_id is the sought number.
Let's call the script find_presidents.sh, run it, and see what happens:

[nosql_slitt@mydesk 200704]$ ./find_presidents.sh
NoSQL: getrow is deprecated, use 'awktable' instead
NoSQL: getrow is deprecated, use 'awktable' instead
person_id lname fname job_id
--------- ------- ------ ------
100005 Bush George 3
100006 Clinton Bill 3
100007 Reagan Ronald 3
[nosql_slitt@mydesk 200704]$

There are a plethora of ways to speed up NoSQL. The NoSQL web page (see URLs section of the magazine) has many speedup methods.

To review, when using SQL to query a one to many pair of joined tables on a column that's not the primary key of the one side, you need to perform a selection on a non-sorted (and presumably non-indexed) column for each row of the many side. When doing the same thing in NoSQL by using a jointable operation, you also have to pre-sort the many side.

But unlike pure SQL, NoSQL gives you the option to perform operations in stages, so you can find the one side's primary key with a relatively cheap getrow on this small table, then perform a lightning quick grep command to pre-screen most of the unneeded rows on the many side, and finally perform getrow on the now tiny subset of the many side.

jointable is best reserved for tables having fewer than 50,000 reasonably sized rows.
Steve Litt is the author of Twenty Eight Tales of Troubleshooting.   Steve can be reached at his email address.

Add, Change and Delete

By Steve Litt
NoSQL has been built from the ground up for efficient reading of data. The NoSQL documentation is clear about that fact. If you want 50 people simultaneously writing to the same database on a regular basis, NoSQL is not a good choice. Even 10 people could get irritating.
- - - BUT - - -

A lot of databases are used by one person at a time. That's perfect for NoSQL.

Many databases are written by few and read by many. The web shopping cart catalog (not the purchase facility -- the catalog) is a perfect example, but there are many others.

Many databases are heavily written, but mostly adds, with few changes or deletes. If instantaneous access to added rows is not critical, each data entry person can enter data to his/her own table, and then at the end of the day a merge program can be run to add them to the main table, and update all indices.

I know, I know, that last suggestion sounds sooooo 1985. Data is needed instantaneously. In an airline reservation system, the customer's booking of the flight must instantaneously delete a seat, or overbooking will occur.

I guess that's just the point. Don't use NoSQL on an airline reservation system. But there are MANY other situations where batch updates are still OK, even in 2007 (and probably still in 2027). A little kludging is worth it, because the person knowing Unix already knows 3/4 of NoSQL, and the person liking Unix will like NoSQL.

Oh, and by the way, if you absolutely must have realtime updates with NoSQL, you can do what the NoSQL website says on their Bigtables page: put all adds, deletes and updates in a small journal file, and have all reads first access the (presumably large) main table, and then sequentially search the small journal table to see any recent modifications. Then, once an hour or once a day or whatever, merge the journaling table with the main table and reindex all indices. This would involve a few extra lines of code on each shellscript comprising your app, but it's probably a day's work for a moderately sized application.
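Here's a rough sketch of that periodic merge, assuming the journal lives in a hypothetical journal.table with the same structure as people.table, and using the updtable command covered later in this issue:

#!/bin/bash
# Fold the journal into the main table, then start a fresh, empty journal.
mytemp=`mktemp`
cat people.table | updtable journal.table | sorttable > $mytemp
mv $mytemp people.table
head -n1 people.table > journal.table          # keep only the header
# Rebuild every index on the main table, for instance:
indextable --update people.table._x.job_id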

Adding Rows

Here's an easy way to add a row:

#!/bin/bash

lname=$1
fname=$2
job_id=$3

person_id=`tail -n1 people.table | cut -f1`
let person_id=$person_id+1
echo -e "$person_id\t$lname\t$fname\t$job_id" >> people.table

In the preceding, note that the row to be added is enclosed in doublequotes, and its columns are separated by tab characters (the \t sequences, which echo -e converts to real tabs). Obviously, the preceding command would be almost instant, no matter what the size of the table. Assuming the table is sorted by primary key (NoSQL tables should always be sorted by primary key), and assuming the primary key is non-meaningful (and primary keys should always be non-meaningful), this is simply a lightning quick autoincrement facility that happens just before the append.

Of course, if the table has indices, those indices must be updated, and that can take some time on larger tables.

So maybe you shouldn't have indices.

WHAT? HERESY! ABOMINATION! Without indices, data operations will take forever!

If this were MS SQLserver they might. But this is NoSQL, and you can use the old prescreen with grep trick to search a large table amazingly fast. As fast as a binary search on an index? No, but often fast enough. And anyway, if you need indices, you can use the journaling table technique described earlier.

By the way, there are two NoSQL provided ways to add a row, addrow and updtable.

Fast Deletions

Say I want to delete the row with a known primary key. Remember, the primary key is the leftmost key, so watch this mydel.sh:
#!/bin/bash
mytemp=`mktemp`
mv people.table $mytemp
grep -v "^$1"$'\t' $mytemp > people.table
rm -f $mytemp

Notice the tab after the $1 (written here as $'\t', which bash expands to a literal tab character). That's to make sure you don't accidentally delete a record whose longer key starts with the key you want to delete.

If you need to delete a bunch of rows for which you know the primary keys, do this:
#!/bin/bash
mytemp=`mktemp`
mv people.table $mytemp
cat $mytemp | grep -v -f remove_list > people.table
rm -f $mytemp

In the preceding, remove_list must contain one key per line, preceded by the caret (^) sign standing for start of line, and ended with a tab character so it doesn't match a longer key starting with the desired one.
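For example, to delete the rows with person_ids 100007 and 100011, remove_list would contain the following (with <TAB> standing for a literal tab character):

^100007<TAB>
^100011<TAB>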

updtable: Change, and Add and Delete

Let's face it. NoSQL tables are text files meant, for the most part, to be processed sequentially, not randomly. Yes, its index files have byte offsets that can be used for random access, but for the most part NoSQL tables are best accessed sequentially.

For adds that's certainly no problem -- the >> redirection operator was made just for that. For deletes, well, grep -v does a fast job in all but the biggest tables. But when dealing with changes, sequential file operations aren't all that efficient. So NoSQL gives you the updtable program, which takes an identically structured table full of adds, changes and deletes, and applies them to the original table. That way, sequential processing stays efficient, because the original table is scanned only once.

For example, let's take this people.table:
[nosql_slitt@mydesk 200704]$ justify < people.table
person_id lname fname job_id
--------- ---------- --------- ------
100001 Strozzi Carlo 1
100002 Torvalds Linus 1
100003 Stallman Richard 1
100004 Litt Steve 2
100005 Bush George 3
100006 Clinton Bill 3
100007 Reagan Ronald 3
100008 Cheney Dick 4
100009 Gore Al 4
100010 Litt Brett 5
100011 Montgomery Alexander 6
100012 Mondale Walter 4
100013 Reagan Ronald 3
[nosql_slitt@mydesk 200704]$

Imagine you want to perform the following changes: add Ben Franklin as a Visionary, add George Washington and Abraham Lincoln as Presidents, change Steve Litt's job_id from 2 to 1, and delete Alexander Montgomery (person_id 100011).

So here's what you do. You create an empty table with the same structure as people.table:
head -n1 people.table > xaction.table
Then you populate xaction.table, with Vim, or edittable, or some automated way. Here's what the populated table looks like when justified:
[nosql_slitt@mydesk 200704]$ justify < xaction.table
person_id lname fname job_id
--------- ---------- ------- ------
100020 Franklin Ben 1
100019 Washington George 3
100018 Lincoln Abraham 3
100004 Litt Steve 1
100011 ..DEL..
[nosql_slitt@mydesk 200704]$

In the preceding, any row with any column (other than the primary key) populated by the string ..DEL.. marks that row for deletion from the output. Now run people.table through updtable as modified by xaction.table as follows:

[nosql_slitt@mydesk 200704]$ cat people.table | updtable xaction.table
person_id lname fname job_id
100001 Strozzi Carlo 1
100002 Torvalds Linus 1
100003 Stallman Richard 1
100004 Litt Steve 1
100005 Bush George 3
100006 Clinton Bill 3
100007 Reagan Ronald 3
100008 Cheney Dick 4
100009 Gore Al 4
100010 Litt Brett 5
100012 Mondale Walter 4
100013 Reagan Ronald 3
100018 Lincoln Abraham 3
100019 Washington George 3
100020 Franklin Ben 1
[nosql_slitt@mydesk 200704]$

Rows with ..DEL.. in any non-primary-key column are deleted from the main table. Any rows existing in the edit table but not in the main table are added to the main table. Any rows existing in the main table but not in the edit table are left intact. Any rows that exist in both tables (they have the same primary key) are modified in the main table. 

The updtable command is written so generically that neither the main table nor the updates table need be sorted. I would guess that requires an algorithm where every row of the main table must be compared to every row of the edit table. If both are large, I would think this would be time consuming, even though those comparisons are done in memory (hopefully).
Steve Litt is the author of many books. Steve can be reached at his email address.

An awk Primer

By Steve Litt
You can do NoSQL without awk, but awk gives you many more opportunities. Awk is a programming language specifically designed for parsing files. Each rule evaluates every line, and each action affiliated with a rule is taken on every line for which the rule is true. In other words, there's an implied "for each line" loop around the main routine, but the programmer doesn't need to program it. Awk automatically parses each line into fields -- the default field separator is whitespace. Because NoSQL uses tabs to separate fields, awk's default works perfectly. Let's start with a Hello World awk program, called hello.awk:
#!/usr/local/bin/mawk -We
{print "Hello World"}

Remember to chmod hello.awk to be executable.
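That is:

chmod +x hello.awk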

In the preceding, the {print "Hello World"} is an action with no rule, meaning that any line matches and therefore it's performed on every line.

Now run it:

[nosql_slitt@mydesk 200704]$ cat people.table | ./hello.awk
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
[nosql_slitt@mydesk 200704]$

As you can see, it printed "Hello World" several times -- in fact one for every line in people.table. This is a demonstration of the "implied loop" surrounding every awk program.

Now let's demonstrate a real rule -- we'll print only on the first line:

#!/usr/local/bin/mawk -We
NR==1{print $0}

The rule is NR==1. NR is awk's built in line number variable. When we pipe people.table into it, only the first line (the header) is shown:

[nosql_slitt@mydesk 200704]$ cat people.table | ./hello.awk
person_id lname fname job_id
[nosql_slitt@mydesk 200704]$

Now let's make it a little more useful by showing the number of fields on each line. Remember, fields are separated by whitespace:

#!/usr/local/bin/mawk -We
{print NF}

Here's the output:

[nosql_slitt@mydesk 200704]$ cat people.table | ./hello.awk
4
4
4
4
4
4
4
4
4
4
4
4
4
4
[nosql_slitt@mydesk 200704]$

Every line has four fields. Let's look at a different file with different numbers of fields on each line:

[nosql_slitt@mydesk 200704]$ cat /etc/pythonrc.py | ./hello.awk
1
7
2
2
7
3
2
1
1
14
9
[nosql_slitt@mydesk 200704]$

Now let's actually use each field. Remember that $0 stands for the whole line, while $1 stands for the first field on the line, $2 the second field, $3 the third, etc. The total number of fields on the line is represented by the system variable NF. A simple but instructive exercise would be, on each line, to print the number of fields and then to place triple colons between the fields. Watch this:

#!/usr/local/bin/mawk -We
{
   i = 1
   outrecord = NF " " $1
   while(++i <= NF){
      outrecord = outrecord ":::" $i
   }
   print outrecord
}

Here's the result:

[nosql_slitt@mydesk 200704]$ cat people.table | ./hello.awk
4 person_id:::lname:::fname:::job_id
4 100001:::Strozzi:::Carlo:::1
4 100002:::Torvalds:::Linus:::1
4 100003:::Stallman:::Richard:::1
4 100004:::Litt:::Steve:::2
4 100005:::Bush:::George:::3
4 100006:::Clinton:::Bill:::3
4 100007:::Reagan:::Ronald:::3
4 100008:::Cheney:::Dick:::4
4 100009:::Gore:::Al:::4
4 100010:::Litt:::Brett:::5
4 100011:::Montgomery:::Alexander:::6
4 100012:::Mondale:::Walter:::4
4 100013:::Reagan:::Ronald:::3
[nosql_slitt@mydesk 200704]$


OK, how about hard coding it so we don't print anyone with a first name beginning with R. This should eliminate Richard Stallman, and both Ronald Reagans. We'll install a regular expression that prevents further processing if triggered. Here it is:

#!/usr/local/bin/mawk -We
$3 ~ /^R/{next}
{
i = 1
outrecord = NF " " $1
while(++i <= NF){
outrecord = outrecord ":::" $i
}
print outrecord
}

In the preceding code, we added a line saying that if the third field begins with R, then go to the next line (and skip the rest of the implied loop for this one).

[nosql_slitt@mydesk 200704]$ cat people.table | ./hello.awk
4 person_id:::lname:::fname:::job_id
4 100001:::Strozzi:::Carlo:::1
4 100002:::Torvalds:::Linus:::1
4 100004:::Litt:::Steve:::2
4 100005:::Bush:::George:::3
4 100006:::Clinton:::Bill:::3
4 100008:::Cheney:::Dick:::4
4 100009:::Gore:::Al:::4
4 100010:::Litt:::Brett:::5
4 100011:::Montgomery:::Alexander:::6
4 100012:::Mondale:::Walter:::4
[nosql_slitt@mydesk 200704]$

As you can see in the preceding, all references to Richard and Ronald are gone.

Instead of skipping people with the first name beginning with R, let's prepend an honorary doctorate to the first name of anyone whose job is "Visionary", in other words, their job code is 1:

#!/usr/local/bin/mawk -We
$4 ~ /^1\s*$/{
$3 = "Dr. " $3
}
{
i = 1
outrecord = NF " " $1
while(++i <= NF){
outrecord = outrecord ":::" $i
}
print outrecord
}

Notice in the preceding, the action for the rule about job_id being 1 happens first, and then the no-rule action happens. This is a big part of the power of awk. If you order your actions right, you can do a lot.

[nosql_slitt@mydesk 200704]$ cat people.table | ./hello.awk
4 person_id:::lname:::fname:::job_id
4 100001:::Strozzi:::Dr. Carlo:::1
4 100002:::Torvalds:::Dr. Linus:::1
4 100003:::Stallman:::Dr. Richard:::1
4 100004:::Litt:::Steve:::2
4 100005:::Bush:::George:::3
4 100006:::Clinton:::Bill:::3
4 100007:::Reagan:::Ronald:::3
4 100008:::Cheney:::Dick:::4
4 100009:::Gore:::Al:::4
4 100010:::Litt:::Brett:::5
4 100011:::Montgomery:::Alexander:::6
4 100012:::Mondale:::Walter:::4
4 100013:::Reagan:::Ronald:::3
[nosql_slitt@mydesk 200704]$

Sometimes you need code before the beginning of the implied loop, and sometimes you need code after the end of it. Next we consider totalling both the lines and fields read. We'll initialize the totals before the loop, and write the totals after the loop. A special rule called BEGIN is true before the loop starts, and another called END is true after the loop finishes. This exercise will also introduce arithmetic:

#!/usr/local/bin/mawk -We
BEGIN{
fields=0
lines=0
}
{
lines++
fields += NF
i = 1
outrecord = NF " " $1
while(++i <= NF){
outrecord = outrecord ":::" $i
}
print outrecord
}
END{
print "Lines=" lines ", Fields=" fields "."
}

The preceding's pretty obvious. We initialized totals before the loop, incremented them during the loop, and displayed them after the loop. The result follows -- note the totals at the bottom:

[nosql_slitt@mydesk 200704]$ cat people.table | ./hello.awk
4 person_id:::lname:::fname:::job_id
4 100001:::Strozzi:::Carlo:::1
4 100002:::Torvalds:::Linus:::1
4 100003:::Stallman:::Richard:::1
4 100004:::Litt:::Steve:::2
4 100005:::Bush:::George:::3
4 100006:::Clinton:::Bill:::3
4 100007:::Reagan:::Ronald:::3
4 100008:::Cheney:::Dick:::4
4 100009:::Gore:::Al:::4
4 100010:::Litt:::Brett:::5
4 100011:::Montgomery:::Alexander:::6
4 100012:::Mondale:::Walter:::4
4 100013:::Reagan:::Ronald:::3
Lines=14, Fields=56.
[nosql_slitt@mydesk 200704]$

Now that we've done some totalling, let's do some break logic. We'll break every time a line's last name ($2) is alphabetically less than the last name on the line before. At that point we'll print totals for the last group (of ascending last names) and zero the group totals. After the comparison, we'll set prevlname to the current last name ($2) so that next time we can compare the current and previous last names.

This program introduces not only break logic, but functions (subroutines). We define a function called print_totals(), whose purpose you can probably guess. It's called both from within the if statement at the head of the main program, and from the END section. Like all other break logic, when you test at the top, you must finalize the final group after termination of the loop.

#!/usr/local/bin/mawk -We
BEGIN{
fields=0
lines=0
breaklines=0
breakfields=0
prevlname="" # Ctrl+A: Ensure no break on first line
}
function print_totals(){
print "Group's Lines=" breaklines ", Group's Fields=" breakfields "."
print ""
}
{
if($2 < prevlname){
print_totals()
breaklines = 0
breakfields = 0
}
lines++
fields += NF
prevlname=$2
breaklines++
breakfields += NF
i = 1
outrecord = NF " " $1
while(++i <= NF){
outrecord = outrecord ":::" $i
}
print outrecord
}
END{
print_totals()
print "Total Lines=" lines ", Total Fields=" fields "."
}

The preceding code prints out pretty much what you'd expect:

[nosql_slitt@mydesk 200704]$ cat people.table | ./hello.awk
4 person_id:::lname:::fname:::job_id
4 100001:::Strozzi:::Carlo:::1
4 100002:::Torvalds:::Linus:::1
Group's Lines=3, Group's Fields=12.

4 100003:::Stallman:::Richard:::1
Group's Lines=1, Group's Fields=4.

4 100004:::Litt:::Steve:::2
Group's Lines=1, Group's Fields=4.

4 100005:::Bush:::George:::3
4 100006:::Clinton:::Bill:::3
4 100007:::Reagan:::Ronald:::3
Group's Lines=3, Group's Fields=12.

4 100008:::Cheney:::Dick:::4
4 100009:::Gore:::Al:::4
4 100010:::Litt:::Brett:::5
4 100011:::Montgomery:::Alexander:::6
Group's Lines=4, Group's Fields=16.

4 100012:::Mondale:::Walter:::4
4 100013:::Reagan:::Ronald:::3
Group's Lines=2, Group's Fields=8.

Total Lines=14, Total Fields=56.
[nosql_slitt@mydesk 200704]$

What if you need to deal with two files? For instance, what if you want to write a version of updtable that works as a merge, for better performance? You must read one of the tables from a file rather than from stdin.

The following is a trivial example of reading from an explicitly named file, in this case junk.jnk:

#!/usr/local/bin/mawk -We
BEGIN{
line = ""
while((getline line < "junk.jnk") > 0){
print line
}
}
{
}
END{
}

The preceding code creates the following output, which is an exact copy of the junk.jnk file:

[nosql_slitt@mydesk 200704]$ cat people.table | ./hello.awk
one a b c d
two a b c d
three a b c d
four a b c d
[nosql_slitt@mydesk 200704]$

Now let's use an awk array to implement a stack. We create functions push and pop, and also a function called stacklook. The stacklook function looks at stack elements without popping them: stacklook(arr,0) shows what the next pop would return, and stacklook(arr,-1) shows the element that would be popped after that. If you want to look at the stack first in, first out, use positive numbers, with stacklook(arr,1) showing the earliest pushed element.

The other interesting thing about this program is that it does nothing to each line -- all the code is in the BEGIN section. Therefore, instead of piping in a file, you can simply echo a word into the pipeline. Yet another interesting facet of this program is the introduction of local variables. You declare a variable local to a function by placing it in the function's paren-delimited argument list, after all the real arguments. Specifically, variable x in function pop() is declared local, meaning that any x in the main program or other functions is not affected by changes to this local x.

#!/usr/local/bin/mawk -We
function push(arr, val){ arr[++arr[0]] = val }
function pop(arr,x){
if(arr[0] < 1) return NULL
x = arr[arr[0]]
delete arr[arr[0]--]
return x
}
function stacklook(arr, num){
if(arr[0] < 1) return NULL # stack spent
if(num <= 0){
num = arr[0] + num
if(num < 1) return NULL
return arr[num]
} else {
if(num > arr[0]) return NULL
return arr[num]
}
}
BEGIN{
mystack[0] = 0
push(mystack, "one")
push(mystack, "two")
push(mystack, "three")
push(mystack, "four")
push(mystack, "five")
print "----------STACKLOOK-------------"
for(i=5; i > -6; i--){
print "stacklook(mystack, " i ")=" stacklook(mystack, i)
}
print "-------------POP----------------"
while(str = pop(mystack)){
print str
}
}
END{
}

The preceding code produces the following output:

[slitt@mydesk 200704]$ echo hello | ./hello.awk
----------STACKLOOK-------------
stacklook(mystack, 5)=five
stacklook(mystack, 4)=four
stacklook(mystack, 3)=three
stacklook(mystack, 2)=two
stacklook(mystack, 1)=one
stacklook(mystack, 0)=five
stacklook(mystack, -1)=four
stacklook(mystack, -2)=three
stacklook(mystack, -3)=two
stacklook(mystack, -4)=one
stacklook(mystack, -5)=
-------------POP----------------
five
four
three
two
one
[slitt@mydesk 200704]$

Stacks are vital in any language, because they give you an iterative substitute for recursion. You can use them to parse outlines, or to parse markup languages that might be nested. There are probably a million ways to implement a stack in awk, but I used an array whose element 0 contains the number of elements, with the first stack element stored in element 1.

The preceding exercise also introduced arrays, which are VERY important. Please be aware that gawk can sort arrays with its built-in asort() function (mawk has no equivalent), although that will not be demonstrated in this article.
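
To make the outline-parsing idea concrete, here's a minimal sketch of my own (not part of NoSQL, and the file format and names are assumptions) that uses the same array-as-stack technique to print each headline of a tab-indented outline along with its ancestor headlines:

#!/usr/local/bin/mawk -We
# outlinepath.awk -- sketch only: one headline per line, depth shown by
# leading tabs. Prints each headline prefixed by its ancestors.
function push(arr, val){ arr[++arr[0]] = val }
function pop(arr,x){
if(arr[0] < 1) return ""
x = arr[arr[0]]
delete arr[arr[0]--]
return x
}
BEGIN{ path[0] = 0 }
{
depth = 0
while(substr($0, depth+1, 1) == "\t") depth++  # count leading tabs
headline = substr($0, depth+1)                 # text after the tabs
while(path[0] > depth) pop(path)               # climb back out of deeper levels
push(path, headline)
outline = path[1]
for(i=2; i <= path[0]; i++) outline = outline " / " path[i]
print outline
}

Something like cat myoutline.otl | ./outlinepath.awk (the filename is made up) would then show where every headline sits in the hierarchy.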

WARNING

Awk makes your life "easier" by automatically treating any file on its command line as an input file. It iterates completely through the first file, then the second, and so on. Usually that's a timesaving feature, but sometimes it's a pain. If you want stdin to be one of the files, you need to put it on the command line as a solitary dash (-).


First an ugly workaround will be demonstrated, and then a "nicer" workaround. Let's start ugly, just to demonstrate the principles...

#!/usr/local/bin/mawk -We
{print $0}

The preceding very simple awk script simply prints each line of input. Here's what it does with two files on the command line:

[slitt@mydesk 200704]$ cat /etc/fstab | ./hello.awk people.table edit.table 
person_id lname fname job_id
100001 Strozzi Carlo 1
100002 Torvalds Linus 1
100003 Stallman Richard 1
100004 Litt Steve 2
100005 Bush George 3
100006 Clinton Bill 3
100007 Reagan Ronald 3
100008 Cheney Dick 4
100009 Gore Al 4
100010 Litt Brett 5
100011 Montgomery Alexander 6
100012 Mondale Walter 4
100013 Reagan Ronald 3
person_id lname fname job_id
100000 Bush GeorgeHW 4
100004 Litt Steve 1
100011 ..DEL..
100018 Lincoln Abraham 3
100019 Washington George 3
100020 Franklin Ben 1
[slitt@mydesk 200704]$

In the preceding command, hello.awk ignored stdin, processed people.table in its entirety, and then went on to process all of edit.table. Sometimes that's what you want, and sometimes it's not.

Sometimes you want something more like this:
#!/usr/local/bin/mawk -We
BEGIN{
line = ""
while((getline line < edittable) > 0){
print line
}
}

{print $0}

In the preceding, you anticipate receipt of the edittable variable from the command line. The following is how you call it, and what results:

[slitt@mydesk 200704]$ cat people.table | awk -v edittable=edit.table -f ./hello.awk 
person_id lname fname job_id
100000 Bush GeorgeHW 4
100004 Litt Steve 1
100011 ..DEL..
100018 Lincoln Abraham 3
100019 Washington George 3
100020 Franklin Ben 1
person_id lname fname job_id
100001 Strozzi Carlo 1
100002 Torvalds Linus 1
100003 Stallman Richard 1
100004 Litt Steve 2
100005 Bush George 3
100006 Clinton Bill 3
100007 Reagan Ronald 3
100008 Cheney Dick 4
100009 Gore Al 4
100010 Litt Brett 5
100011 Montgomery Alexander 6
100012 Mondale Walter 4
100013 Reagan Ronald 3
[slitt@mydesk 200704]$

In the preceding, you actually invoked awk yourself so you could name the variable, and because you invoked awk yourself, you needed to name a source file with the -f argument. The result was that the info on the command line was printed first, and stdin followed. You could have reversed that by printing edittable in the END section. Or, you could have printed records from the edittable file interspersed with records from stdin. You know, like you'd do in a merge.

I don't know about you, but I find that really ugly. Here's a better way:
#!/usr/local/bin/mawk -We
BEGIN{
line = ""
editfile = ARGV[2]
ARGC = 1
while((getline line < editfile) > 0){
print line
}
}

{print $0}

The preceding is still ugly, but it's not nearly as inconvenient. Your first step is to capture the value of the second command line argument in the variable editfile, and then remove the arguments from awk's input file list by setting ARGC to 1, which leaves awk reading stdin. The command to invoke this is cleaner:

[slitt@mydesk 200704]$ cat people.table | ./hello.awk - edit.table
person_id lname fname job_id
100000 Bush GeorgeHW 4
100004 Litt Steve 1
100011 ..DEL..
100018 Lincoln Abraham 3
100019 Washington George 3
100020 Franklin Ben 1
person_id lname fname job_id
100001 Strozzi Carlo 1
100002 Torvalds Linus 1
100003 Stallman Richard 1
100004 Litt Steve 2
100005 Bush George 3
100006 Clinton Bill 3
100007 Reagan Ronald 3
100008 Cheney Dick 4
100009 Gore Al 4
100010 Litt Brett 5
100011 Montgomery Alexander 6
100012 Mondale Walter 4
100013 Reagan Ronald 3
[slitt@mydesk 200704]$

There are probably better ways to do this, but for now this is good enough.

Now we're ready to create a merge type version of updtable that makes several assumptions:
  1. Both tables are sorted
  2. Both tables are properly formatted NoSQL tables
  3. Both tables have identical structures
  4. The primary key has only one field
  5. All primary key values are the same length
These are the data and edit tables:

[slitt@mydesk 200704]$ justify < people.table
person_id lname fname job_id
--------- ---------- --------- ------
100001 Strozzi Carlo 1
100002 Torvalds Linus 1
100003 Stallman Richard 1
100004 Litt Steve 2
100005 Bush George 3
100006 Clinton Bill 3
100007 Reagan Ronald 3
100008 Cheney Dick 4
100009 Gore Al 4
100010 Litt Brett 5
100011 Montgomery Alexander 6
100012 Mondale Walter 4
100013 Reagan Ronald 3
[slitt@mydesk 200704]$
   
[slitt@mydesk 200704]$ justify < edit.table
person_id lname fname job_id
--------- ---------- --------- ------
100000 Bush George HW 4
100002 Torvalds Dr. Linus 1
100004 Litt Steve 1
100011 ..DEL..
100018 Lincoln Abraham 3
100019 Washington George 3
100020 Franklin Ben 1
[slitt@mydesk 200704]$

This is the source code:
#!/usr/local/bin/mawk -We
# *********************************************************************
# updmerge: inserts/updates/deletes table rows based on the contents
# of an edit table. This is a fast, one pass process that assumes
# the following:
# 1) Both files are sorted by primary key
# 2) All primary key values are the same length
# 3) Single segment primary keys
# 4) Both tables have identical structures
#
# Copyright (C) 2007 by Steve Litt
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; version 2 dated June, 1991.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
#
# *********************************************************************
# $Id: updmerge,v 0.01 pre-alpha, 2007/03/30 09:02:23 steve Exp $

BEGIN{
### HANDLE ARGUMENTS, DEFEAT DEFAULT USE OF ARGS AS INPUT FILES ###
editfilename = ARGV[2]
ARGC = 1

### FILE SEPARATOR SHOULD BE TAB ###
FS="\t"

### SET SOME OTHER VARS ###
record_changed_or_deleted_flag = 0
high_value_string="\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF"

### BLOW OFF EDIT FILE'S HEADER, READ ITS FIRST DATA RECORD ###
### TO ENABLE THE MERGE ALGORITHM ###
numchars = getline editline < editfilename # Blow off edit file's header
if(numchars == 0){
print "Empty edit file (" editfilename ", aborting." > stderr
exit 1
}

editline = ""
numchars = getline editline < editfilename # Get first edit line
if(numchars == 0){
print "No data in edit file (" editfilename ", aborting." > stderr
exit 1
}

### SPLIT editline INTO FIELDS FOR COMPARISON WITH MAIN FILE RECORD ###
split(editline, editarray, "\t")
}

### MAIN PROGRAM ### RESET THE PER-RECORD FLAGS
{
record_changed_or_deleted_flag = 0
delflag = 0
}

### RULES
NR == 1{ # Transfer main header intact, always
print $0
next
}
$1 < editarray[1]{ # If main record smaller than edit, print main
print $0
next
}

### If the edit record's key is less than or equal to the main record's,
### the edit record must be an add (if less) or a change or delete (if equal).
### Iterate through all such lower-or-equal edit records, not quitting
### until you have an edit record bigger than the main record.
$1 >= editarray[1]{
while($1 >= editarray[1]){
if($1 > editarray[1]){
print editline
}else {
record_changed_or_deleted_flag = 1

### LOOK FOR DEL AND XFER DATA TO OUTPUT
for(i=1; i <= NF; i++){ outarray[i] = $i }
for(i=2; i <= NF; i++){
if(i in editarray){
if(editarray[i] ~ /^\.\.DEL\.\./){
delflag = 1
break
}
outarray[i] = editarray[i]
}
}
if(!delflag) {
### ASSEMBLE OUTPUT LINE
output = outarray[1]
for(i=2; i <= NF; i++){
output = output "\t" outarray[i]
}
### PRINT OUTPUT LINE
print output
}
}

### GET NEXT EDIT RECORD ###
numchars = getline editline < editfilename
if(numchars == 0){
editline = high_value_string
editarray[1]=high_value_string
} else {
split(editline, editarray)
}
}

}

### GUARANTEE PRINT FOR FIRST RECORD OF MAIN FILE ###
### BUT ONLY AFTER ANY LOWER EDIT RECORDS HAVE BEEN PRINTED ###
### AND ONLY IF THE FIRST RECORD HASN'T BEEN CHANGED OR DELETED ###
NR == 2{
if(!record_changed_or_deleted_flag){print $0}
}

END{ # INSERT ANY EDIT RECORDS ABOVE HIGHEST MAIN RECORD
while(editarray[1] != high_value_string){
### DON'T PRINT DELETES ###
delflag=0
for(i=2; i <= NF; i++){
if(i in editarray){
if(editarray[i] ~ /^\.\.DEL\.\./){
delflag = 1
break
}
}
}

if(!delflag) print editline

### GET NEXT EDIT TABLE RECORD ###
numchars = getline editline < editfilename
if(numchars == 0){
editline = high_value_string
editarray[1]=high_value_string
} else {
split(editline, editarray)
}
}
}

This is your plain vanilla, DP101 file merge program. It works the way all merges work. Before the main loop begins, it discards the edit table's header, loads the edit table's first data row, and splits that row into fields so its primary key can be compared.

The basic algorithm is to compare the current data and edit records each time a data record is read. If the data record is smaller, it's written. If the edit record is smaller or equal, it's added (if smaller) or changed or deleted (if equal), and then another edit record is retrieved. This edit-record retrieval loop continues until the edit record is larger than the data record. When the edit file hits EOF, a high value is placed in the edit record's key so that any remaining data records will be lower and will be written.

To accommodate cases when the final edit record(s) is(are) bigger than the final data record, the END section adds any final edit records.

The awk language makes this algorithm easier to write, with much less nesting than in other languages. As mentioned many times, the main data record loop is implied -- the programmer needn't write it. Much of the logic is implemented as "rules" that fire only when certain conditions hold, and many of those rules invoke the next statement to prevent execution of anything farther down. While this is arguably a violation of structured programming, these "weed out" rules result in much less nesting and a much more readable program.

The preceding code, run on the pre-sorted people.table and edit.table shown above it, produces the following output:

[slitt@mydesk 200704]$ cat people.table | ./updmerge.awk - edit.table | justify
person_id lname fname job_id
--------- ---------- --------- ------
100000 Bush George HW 4
100001 Strozzi Carlo 1
100002 Torvalds Dr. Linus 1
100003 Stallman Richard 1
100004 Litt Steve 1
100005 Bush George 3
100006 Clinton Bill 3
100007 Reagan Ronald 3
100008 Cheney Dick 4
100009 Gore Al 4
100010 Litt Brett 5
100012 Mondale Walter 4
100013 Reagan Ronald 3
100018 Lincoln Abraham 3
100019 Washington George 3
100020 Franklin Ben 1
[slitt@mydesk 200704]$
As you can see, George HW Bush was inserted at id 100000, Linus Torvalds' first name was changed to "Dr. Linus", Steve Litt's job id was changed to 1, Alexander Montgomery was deleted, and Lincoln, Washington and Franklin were added at the end.

One impediment to writing large, complex awk programs is the lack of local variables. Well, not exactly a lack, but more like a hassle.

The only way to make a local variable in awk is to put it in the parentheses of a function after the function's arguments. There's no difference between a function's arguments and its local variables except how they're used:
function fixthis(arg1,arg2,local1,local2)
The following is a program to demonstrate what's local and what's global within an awk program:

#!/usr/local/bin/mawk -We
BEGIN{
bingo="begin"
print "In begin, bingo=" bingo "."
}
NR==1{
bingo=$1
print "top norule, bingo=" bingo "."
whatever()
print "after whatever, bingo=" bingo "."
whatever_with_locals()
print "after whatever_with_locals, bingo=" bingo "."
changearg(bingo)
print "after changearg, bingo=" bingo "."
bingo = changearg(bingo)
print "after using changearg return, bingo=" bingo "."
changearg_with_locals(bingo)
print "after changearg_with_locals, bingo=" bingo "."
print "bottom norule, bingo=" bingo "."
}
END{
bingo="end"
print "In end, bingo=" bingo "."
}

function whatever(){
print "Top whatever, bingo=" bingo "."
bingo="changed by whatever"
print "Bottom whatever, bingo=" bingo "."
}

function whatever_with_locals(bingo){
print "Top whatever_with_locals, bingo=" bingo "."
bingo="changed by whatever_with_locals"
print "Bottom whatever_with_locals, bingo=" bingo "."
}

function changearg(arg){
print "Top changearg, bingo=" bingo "."
arg="changed by changearg"
print "Bottom changarg, bingo=" bingo "."
return arg
}

function changearg_with_locals(arg,bingo){
print "Top changearg_with_locals, bingo=" bingo "."
arg="changed by changearg_with_locals"
print "Bottom changarg, bingo=" bingo "."
return arg
}

The preceding code produces the following output.

[slitt@mydesk 200704]$ cat people.table | ./test.awk
In begin, bingo=begin.
top norule, bingo=person_id.
Top whatever, bingo=person_id.
Bottom whatever, bingo=changed by whatever.
after whatever, bingo=changed by whatever.
Top whatever_with_locals, bingo=.
Bottom whatever_with_locals, bingo=changed by whatever_with_locals.
after whatever_with_locals, bingo=changed by whatever.
Top changearg, bingo=changed by whatever.
Bottom changarg, bingo=changed by whatever.
after changearg, bingo=changed by whatever.
Top changearg, bingo=changed by whatever.
Bottom changarg, bingo=changed by whatever.
after using changearg return, bingo=changed by changearg.
Top changearg_with_locals, bingo=.
Bottom changarg, bingo=.
after changearg_with_locals, bingo=changed by changearg.
bottom norule, bingo=changed by changearg.
In end, bingo=end.
[slitt@mydesk 200704]$
As you can see in the preceding output, every time variable bingo is changed by name, it's changed throughout the program, except when it's declared local (by being included between the function's parentheses), as in functions whatever_with_locals and changearg_with_locals.

As you start to perform more complex operations with awk, requiring larger programs, take special care to use local variables in functions, thereby avoiding namespace collisions.
Steve Litt is the author of Twenty Eight Tales of Troubleshooting.   Steve can be reached at his email address.

RAD on Linux

By Steve Litt
I need your help creating a RAD toolkit that uses shellscripts to glue together a few RAD tools. What I envision is something enabling a semi-decent programmer, familiar with the toolkit, to create a five table app in a single day.

Initially I see it working with a NoSQL data back end, because that's the easiest. But it would be simple to adapt it to MySQL or Postgres. Initially I see it using a Curses user interface, because that can be used on the widest variety of systems, and it's simple. However, it would be easy to adapt it to a TK environment, and even a browser environment, always assuming one adds a state preservation module to the toolkit.

Apps written in the toolkit I envision have a special user interface -- the default application user interface from Clarion 2.1. I've seen hundreds of people pound on Clarion created programs, and can tell you they're intuitive and fast.

I've written about it since the last century. Clarion was the best RAD (Rapid Application Development) toolset I've ever used, and I'd like something with similar power for Linux. I've created some rudimentary RAD tools in PHP, Perl/Tk and Curses, but have never been able to put them together into a complete RAD toolkit. It was during the Curses attempt that I first seriously considered using shellscripts to glue together generic form, field and picklist executables into an application -- RAD by shellscript.

Unfortunately, I couldn't figure out interexecutable communication, and went on to other things.

A couple weeks ago I mysteriously became part of the NoSQL mailing list. Several years ago my friend Noel Henson had explained the RAD aspects of his sequential file relational database, /rdb, and how there was a free implementation called NoSQL. So when I fortuitously became part of the NoSQL mailing list, I investigated.

NoSQL was inspirational -- a database made of a few executables whose glue language was shellscripts. Here was a proof of concept of my shellscript glue idea. I started thinking again, and came up with two ways to enable interprocess communication in my new RAD. I now believe the whole thing's doable, although the generic form executable will be challenging. I could use some help developing this thing.

If you're considering helping, keep in mind that although my first version will use NoSQL as its data back end, it would be easy to adapt it to MySQL or Postgres.

Our Project Needs a Few Good Men and Women

We live in an era of spectacular web development environments -- Zope and Rails to name a couple. But sometimes, when confronted with a small task and little time, don't you wish for the days when you could quickly code an app that hits a local database, and have it done in an afternoon?

How would you like a toolkit enabling you to write a few shellscripts, deploying a full featured data application? When I say shellscripts, I don't mean the 300 line variety -- for picklists I mean the 20 line variety, and for forms I mean the 50 line variety. Using a NoSQL database, the database can be created in Vim (or Emacs).

If this sounds good to you, email me.
Steve Litt has been writing Troubleshooting Professional Magazine for ten years. Steve can be reached at his email address.

Life After Windows: Retro-Grouch, or Revolutionary

Some have greeted my enthusiasm for NoSQL with scorn. NoSQL's sequential file database is a 1970's concept. Detractors imply that I'm a retro-grouch. The modern thing is a transaction based, SQL queriable DataBase Management System (DBMS). Am I a retro-grouch?

Yes. I am.

First maybe we should define retro-grouch. Mention the phrase, and many think of an old guy saying "In my day, Sonny, things were better!" The phrase conjures an image of someone believing change is universally bad.

That's a bad definition. First, very few people believe all change is bad. Second of all, such people have real problems. So let me give you my operational definition of a retro-grouch:

Retro-grouch

A retro-grouch is a person who hates to see good things get worse or disappear.

So let's examine Steve Litt as a retro-grouch. I don't hate modern things. I don't know how I ever got along without Google, Wikipedia, Paypal and the Internet in general. I would never want to go back to the days of $4500.00 486's with 8Meg of RAM. Rails is cool, and you'd have to pry Ruby out of my cold, dead hand. Don't even think of telling me to give up VimOutliner and go back to Grandview, or give up LyX and go back to MS Word. And naturally, I'd never want to go back to the pre-free-software world where I spent over a thousand bucks a year on software, and much of it was junk that ended up shelfware.

But some changes aren't so good.
Let me repeat: a retro-grouch is a person who hates to see good things get worse or disappear. Nice, simple sequential data files that can be accessed relationally have disappeared. I want them back.

I'm not saying web apps aren't good, or that Rails isn't a spectacular way to make a good, sturdy multiuser application. I'm just saying there's room for the elegant simplicity of sequential file relational data stores. Like NoSQL.

Overcoming NoSQL's Shortcomings

Here are a few shortcomings of NoSQL:

No transactions

This can be a problem in a heavily used database. Nobody wants only part of a data operation to succeed. By supporting transactions, DBMS's give us a way to guarantee that data updates are all or nothing. You'll never remove money from one account without putting it in the other.

Then again, I'm sure you've seen plenty of programs that fail to take advantage of the underlying database transaction support. Transactions require a constant discipline on the part of the developer,  and often transactions are overlooked to beat a deadline.

Sometimes transactions just aren't that important. When they're not, NoSQL is as good as anything.

No row level locking

That's not exactly accurate. The real scoop is NoSQL has no built in row level locking. However, it can be built into the application by placing a locked column in every table. Now row level locking can be achieved like this:
  1. Make the table read-only
  2. Set the row's locked column to the user number or another identifiable number
  3. Make the table read-write
  4. Read the row
  5. Change the row
  6. Write the row
  7. Make the table read-only
  8. Set the row's locked column to 0
  9. Make the table read-write
As long as every program in every application accessing the table obeys these rules, and refuses to write, or read for write, a locked row, you have row level locking, although it's not enforced by the "database". I think it would be pretty easy to make a read_for_write script that checks the column for a lock, and if not locked, locks and reads. Likewise, a write.sh script checks that the row is locked by you, and then writes the data and unlocks the row. Such scripts would make it easy to enforce row level locking throughout.
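
Here's a minimal sketch of the column-flipping part of that procedure. The script name, the variables key and lockval, and the assumption that the locked column is the table's last field are all mine; the chmod (read-only/read-write) steps would live in the calling shellscript:

#!/usr/local/bin/mawk -We
# lockrow.awk -- sketch only: set (or clear) the locked column for one row.
# Assumes a tab separated NoSQL table on stdin, the primary key in field 1,
# and the locked column as the last field.
# Illustrative usage:
#   mawk -v key=100004 -v lockval=37 -f lockrow.awk < people.table > people.tmp
BEGIN{ FS="\t"; OFS="\t" }
NR==1{ print; next }           # pass the header through untouched
$1 == key { $NF = lockval }    # lockval=0 would unlock the row
{ print }

Setting lockval to the user number locks the row; running it again with lockval=0 unlocks it. The caller would then move people.tmp back over people.table.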

Users who can read or update the data can access the raw data

If you're using NoSQL to back end a web app, under normal circumstances the http user must access the NoSQL data, and therefore must have read and possibly write access to the raw NoSQL files. In an intranet situation, that might be OK if you trust your users. In an Internet facing application, that's no good at all.

If this is an issue, you can have the raw data readable and writable only by, let's say, user nosql, and have a daemon running as user nosql that takes commands from user http. Now someone coming in off the web can't hit the raw data. For an extra measure of security, the daemon could check incoming commands for length and bad data, to prevent the equivalent of SQL injection.
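
The command-checking idea might look something like the following filter, which the daemon could run each request through before acting on it. The length limit and the character whitelist are illustrative assumptions, not anything NoSQL provides:

#!/usr/local/bin/mawk -We
# cmdcheck.awk -- sketch only: pass through requests that look sane,
# silently drop anything suspicious.
length($0) > 200 { next }                # absurdly long request: drop it
$0 !~ /^[A-Za-z0-9_ .-]+$/ { next }      # unexpected characters: drop it
{ print }                                # looks sane: pass it on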

Simultaneous writes are a problem

A requirement for lots of simultaneous writes might preclude NoSQL. Personally, I wouldn't put it in an airline reservation system.

On the other hand, if the writes are all adds, and if the table has a single segment key, and if it's acceptable for new data to be unavailable for hours or for a day, no problem. Have all adds appended to the bottom of a journaling table, incrementing the preceding key (and probably locking the table for an instant so two people don't grab the same key).

Have a cron job that, during off hours, does something like the following:
  1. Lock out writers (make the journaling table read-only for a moment)
  2. Sort the journaling table by primary key
  3. Merge its rows into the main table, with updtable or a merge like the updmerge script shown earlier
  4. Empty the journaling table and make it read-write again
Under those conditions, you can have 50 data entry people hammering new data while 50 other people are using the main data.

If you really want, you can have read operations query both the main table and the journaling table. Because the journaling table is (presumably) small, searching it is fast. The main table can be searched by key or index extremely quickly. Yes, it's a little bit of a hassle, but big tables are a hassle in any situation.

Writing large tables can be slow

Yep. As tables start exceeding 50,000 to 100,000 rows, many write operations can be slow. Every update requires reading and rewriting the whole table. Once again you can use a journaling table in which to write adds, changes and deletes. If that data needs to be instantly accessible, make all your "read" shellscripts do a keyed or indexed read of the main table, and then sequentially read the journaling table. All adds, changes and deletes go to the journaling table, which in turn updates the main table every so often via a cron job.

No scalability

Oh fiddlesticks! Who needs scalability when the original app took only a couple days to create? So you lose a couple days' work -- so what? Think of this app as a prototype.

And speaking of prototypes, one of the best ways to find out what's really needed is to code a functional prototype, have everyone hammer on it, and find out the desired improvements. It's been my experience that this type of prototype actually saves many multiples of its development time when the "real" app goes into development.

It's trivial to write a Ruby program to turn one or a group of NoSQL tables into SQL statements suitable to load your new database, even if your new database has different column names in different orders.
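
The text above suggests Ruby; to stay with this issue's language, here's a comparable rough sketch in awk. The table name is hardcoded, and every value is exported as a quoted string, both of which are simplifications a real converter would have to address:

#!/usr/local/bin/mawk -We
# tabletosql.awk -- sketch only: turn one NoSQL table into INSERT statements,
# using the header row for the column list.
BEGIN{ FS="\t"; tablename = "people" }
NR==1{
for(i=1; i <= NF; i++) cols = (i==1 ? $i : cols ", " $i)
next
}
{
vals = ""
for(i=1; i <= NF; i++){
val = $i
gsub(/'/, "''", val)           # double any single quotes for SQL
vals = (i==1 ? "'" val "'" : vals ", '" val "'")
}
print "INSERT INTO " tablename " (" cols ") VALUES (" vals ");"
}

Run as cat people.table | ./tabletosql.awk > people.sql, it emits one INSERT per data row, ready to feed to the new database's command line client.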

No triggers and stored procedures

Oh, the scandal: No triggers and stored procedures!

So do the work in the app itself. Or, if you really want triggers and stored procedures, make shellscripts to simulate the desired triggers and stored procedures, and access the NoSQL data through those shellscripts. Personally I think that's too much trouble, but it can easily be done.

Why NoSQL is Cool

NoSQL is retro to say the least. Its sequential storage of data harkens back to the 1970's. It's also elegantly simple and transparent. I could have written NoSQL in 1985 if someone had explained it in English, and assuming my operating system had grep, awk, and a highly efficient sort like Linux's sort command.

It's trivially easy to edit NoSQL data in a text editor. Just try that in MySQL, Postgres or (gulp) MS SQL Server. Editing data in an editor is sometimes a handy time saver, and sometimes an essential recovery tool. Speaking of recovery, because it's so human readable, with NoSQL you have a better chance of recovering from corruption. You can back up NoSQL data in its native format. Its native format is about as compressed as a text-only format can get.

You can write a quick and dirty report or data request using grep, sort, head, and tail. I mean less than 20 lines of shellscript. You can even join two tables with a couple sorts and a simple awk script.

Not that anything in the preceding paragraph is necessary -- NoSQL comes with fully coded, heavily tested utilities to do all these things. The point of the preceding paragraph is this: If for some reason you can't use the NoSQL utilities -- maybe they're not installed, maybe you're trying something they can't do efficiently -- you can easily get it done with Unix utilities. If, 20 years from now, you have NoSQL data and for some reason the NoSQL project is no longer developing, and NoSQL is no longer installable on modern operating systems, if you find a system with sort, head, tail, grep and awk, you can continue working with the data.
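
As one concrete illustration, here's a rough sketch of a join done entirely in awk, using the "- second.file" argument trick from the awk primer above. The jobs.table file and its layout (job_id and job_name, tab separated, with a header row) are assumptions for the example:

#!/usr/local/bin/mawk -We
# jointables.awk -- sketch only: append a job_name column to people.table
# by looking up each row's job_id in a hypothetical jobs.table.
# Illustrative usage: cat people.table | ./jointables.awk - jobs.table
BEGIN{
FS="\t"; OFS="\t"
jobfile = ARGV[2]
ARGC = 1
getline line < jobfile                 # discard jobs.table's header
while((getline line < jobfile) > 0){
split(line, f, "\t")
jobname[f[1]] = f[2]                   # key the lookup array on job_id
}
}
NR==1{ print $0, "job_name"; next }    # extend the header row
{ print $0, (($4 in jobname) ? jobname[$4] : "") }

Because the smaller table is loaded into an array keyed on job_id, neither table even needs to be sorted for this particular style of join.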

NoSQL utilities are built to be used in shellscripts, with data passed from utility to utility by piping. This fits excellently with the convenience and modularity of a shellscript based Rapid Application Development environment.

Last but not least, NoSQL is very fast and efficient, because the Unix utilities it uses -- head, tail, grep, sort, and awk -- are mature, tested, and very efficient. Key and index based reads use binary search. Reads on nonsorted, nonindexed fields in large tables can be made lightning fast by pre-selecting with speed-of-light grep:
grep "Rodriguez" < people.table | getrow -N "last_name==\"Rodriguez\""
Or, in its more generic form:
grep "$1" < people.table | getrow -N "last_name==\"$1\""
Will that be as fast as a "real DBMS" with the last name field indexed? Probably not, but probably close. Will it be as fast as a "real DBMS" with that field not indexed? I'd imagine much faster. The grep command is fast!

NoSQL is cool. It isn't the tool for every job, but it's cool. The only trouble is, it's not too hip to support twenty year old technology. Or is it?...

Futurist vs. Retro-grouch

Newer is better, right?

There are people whose job description is "Futurist". I suppose the dictionary definition of a Futurist would be something like "one whose job is to forecast future trends and inform others how to deal with them." But behind that denotation is a flood of connotation: change is inevitable, change is neither good nor bad, and the mature thing to do is adapt rather than fight it.
The ultimate realism -- the very measure of maturity.

And it's a cop out!

Don't fight change? Imagine a circa 1930 European Futurist proclaiming that a wave of national socialism would sweep Germany and later Europe. Imagine the 1930 Futurist advising all who would listen, to prepare for the change by joining the national socialists. After all, change is neither good nor bad, it just is, so deal with it! Imagine what the world would look like today if everyone had followed such advice. Luckily, retro-grouches longing for the "antiquated" principles of Democracy and even Communism fought as partisans and armies throughout Europe to resist the change, so that I can write this free-speech article today, and you can read it. World War II was a pretty close contest, and I can't help thinking those advocates of the quaint concept of Democracy swung the war.

To me, Futurists are spectators of the future. They're the money-changers in the temple of trends. Futurists may  predict the future, but revolutionaries and retro-grouches shape the future.

Here's another story of a guy whom you might call a retro-grouch, or you might call a revolutionary, and either way I think you'd be right. By August 1991 Microsoft Windows was the new thing, after conquering the world with the incredible Windows 3.0. On August 26, 1991, Linus Torvalds advertised for help coding his as-yet unnamed operating system.

His unnamed operating system was to be a clone of Minix, which itself was a subset of Unix, an operating system first developed in 1969. Unix had ruled the minicomputer world by the late 1970's, but by 1991 it was losing market share to decentralized apps written for the IBM PC clones. In other words, Unix was the old thing, Windows was the new thing, and Linus Torvalds was writing a free software version of the old thing.

As we all know now, in every technical regard, the old thing was much better than the new thing. Torvalds' operating system changed the world. He didn't predict history, he shaped it.

Torvalds could also be considered a revolutionary, having rebelled against the crippled, academic Minix. He led the masses away from the older Minix into the promised land of Linux.

Personally, I was a revolutionary when I created the Universal Troubleshooting Process that ultimately changed the way the world troubleshoots. On the other hand, I was a retro-grouch when I originated VimOutliner as a substitute for the abandoned outliners of the golden age of outliners.

It bears repeating. The retro-grouch and the revolutionary shape the future, while the futurist talks about the future. It also bears repeating: If the old way was insufficient, bring on something new. If the old way was better, bring back the old way. If the old way was better for some situations, keep it around for those situations.

There's a little Futurist voice chattering in all of our minds, telling us to throw away the old. Politely thank that voice, and then choose based on quality and utility, not on newness. If you need a powerful and simple database to get something done today, consider NoSQL.

Life After Windows is a regular Linux Productivity Magazine column, by Steve Litt, bringing you observations and tips subsequent to Troubleshooters.Com's Windows to Linux conversion.
Steve Litt is the founder and acting president of Greater Orlando Linux User Group (GoLUG).   Steve can be reached at his email address.

GNU/Linux, open source and free software

By Steve Litt
Linux is a kernel. The operating system often described as "Linux" is that kernel combined with software from many different sources. One of the most prominent, and oldest of those sources, is the GNU project.

"GNU/Linux" is probably the most accurate moniker one can give to this operating system. Please be aware that in all of Troubleshooters.Com, when I say "Linux" I really mean "GNU/Linux". I completely believe that without the GNU project, without the GNU Manifesto and the GNU/GPL license it spawned, the operating system the press calls "Linux" never would have happened.

I'm part of the press and there are times when it's easier to say "Linux" than explain to certain audiences that "GNU/Linux" is the same as what the press calls "Linux". So I abbreviate. Additionally, I abbreviate in the same way one might abbreviate the name of a multi-partner law firm. But make no mistake about it. In any article in Troubleshooting Professional Magazine, in the whole of Troubleshooters.Com, and even in the technical books I write, when I say "Linux", I mean "GNU/Linux".

There are those who think the FSF is making too big a deal of this. Nothing could be farther from the truth. Richard Stallman's GNU Manifesto and the GNU General Public License it spawned are the only reason we can enjoy this wonderful alternative to proprietary operating systems, and the only reason proprietary operating systems aren't even more flaky than they are now.

For practical purposes, the license requirements of "free software" and "open source" are almost identical. Generally speaking, a license that complies with one complies with the other. The difference between these two is a difference in philosophy. The "free software" crowd believes the most important aspect is freedom. The "open source" crowd believes the most important aspect is the practical marketplace advantage that freedom produces.

I think they're both right. I wouldn't use the software without the freedom guaranteeing me the right to improve the software, and the guarantee that my improvements will not later be withheld from me. Freedom is essential. And so are the practical benefits. Because tens of thousands of programmers feel the way I do, huge amounts of free software/open source is available, and its quality exceeds that of most proprietary software.

In summary, I use the terms "Linux" and "GNU/Linux" interchangeably, with the former being an abbreviation for the latter. I usually use the terms "free software" and "open source" interchangeably, as from a licensing perspective they're very similar. Occasionally I'll prefer one or the other, depending on whether I'm writing about freedom or business advantage.
Steve Litt has used GNU/Linux since 1998, and written about it since 1999. Steve can be reached at his email address.

Letters to the Editor

All letters become the property of the publisher (Steve Litt), and may be edited for clarity or brevity. We especially welcome additions, clarifications, corrections or flames from vendors whose products have been reviewed in this magazine. We reserve the right to not publish letters we deem in bad taste (bad language, obscenity, hate, lewd, violence, etc.).
Submit letters to the editor to Steve Litt's email address, and be sure the subject reads "Letter to the Editor". We regret that we cannot return your letter, so please make a copy of it for future reference.

How to Submit an Article

We anticipate two to five articles per issue. We look for articles that pertain to GNU/Linux or open source. This can be done as an essay, with humor, with a case study, or some other literary device. A Troubleshooting poem would be nice. Submissions may mention a specific product, but must be useful without the purchase of that product. Content must greatly overpower advertising. Submissions should be between 250 and 2000 words long.

Any article submitted to Linux Productivity Magazine must be licensed with the Open Publication License, which you can view at http://opencontent.org/openpub/. At your option you may elect the option to prohibit substantive modifications. However, in order to publish your article in Linux Productivity Magazine, you must decline the option to prohibit commercial use, because Linux Productivity Magazine is a commercial publication.

Obviously, you must be the copyright holder and must be legally able to so license the article. We do not currently pay for articles.

Troubleshooters.Com reserves the right to edit any submission for clarity or brevity, within the scope of the Open Publication License. If you elect to prohibit substantive modifications, we may elect to place editors notes outside of your material, or reject the submission, or send it back for modification. Any published article will include a two sentence description of the author, a hypertext link to his or her email, and a phone number if desired. Upon request, we will include a hypertext link, at the end of the magazine issue, to the author's website, providing that website meets the Troubleshooters.Com criteria for links and that the author's website first links to Troubleshooters.Com. Authors: please understand we can't place hyperlinks inside articles. If we did, only the first article would be read, and we can't place every article first.

Submissions should be emailed to Steve Litt's email address, with subject line Article Submission. The first paragraph of your message should read as follows (unless other arrangements are previously made in writing):

Copyright (c) 2003 by <your name>. This material may be distributed only subject to the terms and conditions set forth in the Open Publication License, version  Draft v1.0, 8 June 1999 (Available at http://www.troubleshooters.com/openpub04.txt/ (wordwrapped for readability at http://www.troubleshooters.com/openpub04_wrapped.txt). The latest version is presently available at  http://www.opencontent.org/openpub/).

Open Publication License Option A [ is | is not] elected, so this document [may | may not] be modified. Option B is not elected, so this material may be published for commercial purposes.

After that paragraph, write the title, text of the article, and a two sentence description of the author.

Why not Draft v1.0, 8 June 1999 OR LATER

The Open Publication License recommends using the word "or later" to describe the version of the license. That is unacceptable for Troubleshooting Professional Magazine because we do not know the provisions of that newer version, so it makes no sense to commit to it. We all hope later versions will be better, but there's always a chance that leadership will change. We cannot take the chance that the disclaimer of warranty will be dropped in a later version. 

Trademarks

All trademarks are the property of their respective owners. Troubleshooters.Com(R) is a registered trademark of Steve Litt.

URLs Mentioned in this Issue

