|
April 2007 NoSQL: The Unix
Database (With awk) |
Copyright (C) 2007 by Steve Litt. All rights reserved. Materials from guest authors copyrighted by them and licensed for perpetual use to Linux Productivity Magazine. All rights reserved to the copyright holder, except for items specifically marked otherwise (certain free software source code, GNU/GPL, etc.). All material herein provided "As-Is". User assumes all risk and responsibility for any outcome.
|
and Rapid Learning: Secret Weapon of the Successful Technologist by Steve Litt |
[ Troubleshooters.Com
| Back Issues |Troubleshooting
Professional Magazine
]
|
|
CONTENTS
|
|
|
|
|
|
[slitt@mydesk mawk-1.3.3]$ ./configure |
# configure will look for libm. Change if you know this will fail |
# configure will look for libm. Change if you know this will fail |
./install-shThen you're all set except for the executable path
PATH=/usr/local/nosql/bin:$PATH |
|
WARNING
The rxvt terminal program, in its infinite wisdom, resets the $PS1 environment variable when it starts, so if you're already in a NoSQL environment and then start rxvt, the prompt will not show the "nosql_" even though the NoSQL binary directory will still be at the front of the path. All other terminal programs I tested worked right -- both the path and the prompt were passed down from parent terminal to child terminal. |

| Binary value | Escape string |
| \x00 | \x00 |
| \x01 | \x01 |
| newline | \x0c |
| tab | \x0b |
| backslash | \\ |
[slitt@mydesk 200704]$ nosql_environ.sh |
[nosql_slitt@mydesk 200704]$ cat people.table | column -r | justify |
# This is a comment in the people.xref template file |
[nosql_slitt@mydesk 200704]$ cat people.xref | maketable > people.table |
edittable people.tableYou'll be editing a file that looks like this:
person_id|lname|fname|job_id |
person_id|lname|fname|job_id |
[nosql_slitt@mydesk 200704]$ cat people.table | column -r | justify |
|
[nosql_slitt@mydesk 200704]$ cat people.table | column -r |
[nosql_slitt@mydesk 200704]$ cat people.table | column -r | justify |
cat people.table | justifyCare must be taken, because the output of justify is not a valid NoSQL table, and therefore cannot be used as input to any type of database command. Observe:
[nosql_slitt@mydesk 200704]$ cat people.table | istable |
cat people.table | sorttableThe output of that command is people.table sorted by the leftmost column, then the second leftmost, then the third... If for some reason you need it reverse sorted, add -r:
cat people.table | sorttable -rBecause tables should already be sorted by the left column, which in NoSQL tables is the primary key, or the left two or three columns, which in NoSQL tables are the two or three keys comprising a multisegment primary key, whole row sorting is seldom necessary unless you're sorting the output of a command that left the data unsorted. However, often you need to sort by a column that is not the primary key. The following is an example where I sort by last name and then first:
[nosql_slitt@mydesk 200704]$ cat people.table | sorttable lname fname | justify |
cat people.table | sorttable job_id:rn lname fnameIn the preceding the rn after the colon means reverse (r) and numeric (n). I haven't had a lot of luck with the stuff after the colon -- your mileage may vary.
[nosql_slitt@mydesk 200704]$ cat people.table | compute person_id=person_id+1000000 |
keysearch -p Litt people_byname.tableA disadvantage of meaningful primary keys is that you can never count on newly inserted rows having a higher key value than the others, so the table must be sorted each time.
[nosql_slitt@mydesk 200704]$ indextable people.table job_id |
[nosql_slitt@mydesk 200704]$ indextable --update people.table._x.job_id |
|
NOTE
All timing tests in this article were done on an Athlon XP2600+ with
1.5GB of RAM. As of 3/2007, this is a faster than average machine, but
$600.00 will buy you a more powerful desktop, and $1300.00 will buy you
a more powerful notebook, so it's definitely off the shelf hardware. |
|
It usually isn't,
at least on
moderately sized tables. To research this, let's create a program to
output a people table. The program, written in Ruby, is to the left. We can pipe the program into the sorttable command with a script like this: It usually isn't, at least on moderately sized tables. To research this, let's create two scripts: the first runs the (slow) Ruby program at the left and redirects to ./junk.jnk, and then times the execution of the second one, which sorts the file created by the Ruby. Here's ./testtime.sh:
The results are listed below the Ruby program, but basically, with 67,600 rows, it sorts in the time you could hit two or three keystrokes. With 676,000 rows, it takes about 5 seconds, which is tolerable in some situations. With 6,760,000 rows (that's 6 million rows folks), it takes an intolerable 65 seconds. In other words, when the number of rows is expected to go significantly above 100,000, try to select before you sort. |
[nosql_slitt@mydesk 200704]$ ./testtime.sh 10 |
| testtime.sh | testtime2.sh | |||
|
|
|
To the left you see
the
reindexing results. 67,603 rows take a half a second. 676,003 rows take
a slightly irritating but still tolerable 6.2 seconds. 6,760,003 rows
take an unacceptable 66 seconds. There are ways around this, as explained on the NoSQL website. One way is to make all adds and changes go to a small journaling table, and use that to update the main table every night. Then there's the other viewpoint: When you have over a million rows in any table, you've probably outgrown NoSQL if you do many writes, and you've certainly outgrown an Athlon XP2600+ with 1.5GB of RAM. |
[nosql_slitt@mydesk 200704]$ cat people.table | column fname lname job_id | sorttable > people_byname.table |
| Restrict by |
Commands |
| Column | column |
| Row | keysearch, getrow, filtertable, awktable |
| Join | jointable |
cat people.table | column person_id lname fnameA simpler way to accomplish that would have been to use the -r reverse option, whereby everything not named as a column is printed:
cat people.table | column -r job_idRemember that NoSQL defines a primary key as the leftmost column(s), so using the column command to re-key the table is as simple as moving the intended key column(s) into the leftmost position.
cat people.table | column person_id fname lname
[nosql_slitt@mydesk 200704]$ keysearch 100007 people.table |
[nosql_slitt@mydesk 200704]$ keysearch -I people.table._x.job_id 3 people.table |
[slitt@mydesk 200704]$ cat people.table | getrow 'lname=="Litt"' |
[nosql_slitt@mydesk 200704]$ cat people.table | getrow "lname==\"Litt\"" |
[slitt@mydesk 200704]$ cat people.table | getrow job_id==3 |
[slitt@mydesk 200704]$ filtertable grep Litt < people.table | getrow 'lname=="Litt"' |
#!/usr/bin/ruby -w |
[slitt@mydesk 200704]$ filtertable ./insertspouses.rb < people.table |
job_id job_name |
[slitt@mydesk 200704]$ cat people.table | sorttable job_id | jointable -j job_id jobs.table | justify |
[nosql_slitt@mydesk 200704]$ cat people.table | sorttable job_id | \ |
select p.*Of course, as obvious from the name, you don't use SQL in NoSQL, so you'd do the equivalent like this:
from people p, jobs j
where j.job_name='President'
and p.job_id=j.job_id;
[nosql_slitt@mydesk 200704]$ cat people.table | sorttable job_id | jointable -j job_id jobs.table | \ |
#!/bin/bash |
[nosql_slitt@mydesk 200704]$ ./find_presidents.sh |
#!/bin/bash |
#!/bin/bash |
#!/bin/bash |
[nosql_slitt@mydesk 200704]$ justify < people.table |
head -n1 people.table > xaction.tableThen you populate xaction.table, with Vim, or edittable, or some automated way. Here's what the populated table looks like when justified:
[nosql_slitt@mydesk 200704]$ justify < xaction.table |
[nosql_slitt@mydesk 200704]$ cat people.table | updtable xaction.table |
#!/usr/local/bin/mawk -We |
[nosql_slitt@mydesk 200704]$ cat people.table | ./hello.awk |
#!/usr/local/bin/mawk -We |
[nosql_slitt@mydesk 200704]$ cat people.table | ./hello.awk |
#!/usr/local/bin/mawk -We |
[nosql_slitt@mydesk 200704]$ cat people.table | ./hello.awk |
[nosql_slitt@mydesk 200704]$ cat /etc/pythonrc.py | ./hello.awk |
#!/usr/local/bin/mawk -We |
[nosql_slitt@mydesk 200704]$ cat people.table | ./hello.awk |
#!/usr/local/bin/mawk -We |
[nosql_slitt@mydesk 200704]$ cat people.table | ./hello.awk |
#!/usr/local/bin/mawk -We |
[nosql_slitt@mydesk 200704]$ cat people.table | ./hello.awk |
#!/usr/local/bin/mawk -We |
[nosql_slitt@mydesk 200704]$ cat people.table | ./hello.awk |
#!/usr/local/bin/mawk -We |
[nosql_slitt@mydesk 200704]$ cat people.table | ./hello.awk |
#!/usr/local/bin/mawk -We |
[nosql_slitt@mydesk 200704]$ cat people.table | ./hello.awk |
#!/usr/local/bin/mawk -We |
[slitt@mydesk 200704]$ echo hello | ./hello.awk |
|
WARNING
Awk makes your life "easier" by automatically considering any file on its command line to be an input file. It iterates completely through the first file, then the second, etc. Usually that's a timesaving feature, but sometimes it's a pain. If you want stdin to be one of the files, you need to put it on the command line as a dash solitary (-). WARNING
First an ugly workaround will be demonstrated, and then a "nicer" workaround. Let's start ugly, just to demonstrate the principles... |
#!/usr/local/bin/mawk -We |
[slitt@mydesk 200704]$ cat /etc/fstab | ./hello.awk people.table edit.table |
#!/usr/local/bin/mawk -We |
[slitt@mydesk 200704]$ cat people.table | awk -v edittable=edit.table -f ./hello.awk |
#!/usr/local/bin/mawk -We |
[slitt@mydesk 200704]$ cat people.table | ./hello.awk - edit.table |
[slitt@mydesk 200704]$ justify < people.table |
[slitt@mydesk 200704]$ justify < edit.table |
#!/usr/local/bin/mawk -We |
This
is your plain vanilla, DP101 file merge program. It works the
way
all merges work. Before the main loop begins, it blows off the edit
table's header, loads the edit table's first data row, and calculates
the edit table's primary key field. The basic altorithm is to compare the current data and edit records on every data record increment. If the data record is smaller, it's written. If the edit record is smaller or equal, it's added (if smaller) or changed or deleted (if equal), and then another edit record is retrieved. This edit record retrieval loop continues until the edit record is larger than the data record. When the edit record hits EOF, high value is placed in the edit record so that any remaining data records will be lower and will be written. To accommodate cases when the final edit record(s) is(are) bigger than the final data record, the END section adds any final edit records. The awk language makes this algorithm easier to write, with much less nesting than in other languages. As mentioned many time, the main data record loop is implied -- the programmer needn't write it. Much of the logic is implemented as "rules" that only happen when certain conditions happen, and many of those rules call the next operator to prevent execution of anything farther down. While this is a violation of structural programming, these "weed out" rules result in much less nesting and a much more readable program. |
|
As you can see, George HW Bush was inserted at id 100000, Linus Torvald's first name was changed to "Dr. Linus", Steve Litt's job id was changed to 1, Alexander Montgomery was deleted, and Lincoln, Washington and Franklin were added at the end. |
function fixthis(arg1,arg2,local1,local2)The following is a program to demonstrate what's local and what's global within an awk program:
#!/usr/local/bin/mawk -We |
[slitt@mydesk 200704]$ cat people.table | ./test.awk |