Jdb - a flat-text database for shell scripting
John Heidemann, johnh@isi.edu
JDB is a package of commands for manipulating flat-ASCII databases from shell scripts. JDB is useful for processing medium amounts of data (with very little data you'd do it by hand; with megabytes you might want a real database).
JDB is very good at doing things like:
extracting measurements from experimental output
examining data to address different hypotheses
joining data from different experiments
eliminating/detecting outliers
computing statistics on data (mean, confidence intervals, correlations, histograms)
reformatting data for graphing programs
Rather than hand-code scripts to do each special case, JDB provides higher-level functions. Although it's often easy to throw together a custom script to do any single task, I believe that there are several advantages to using this library:
these programs provide a higher level interface than plain Perl, so
Fewer lines of simpler code:
dbrow '_size == 1024' | dbstats bw
rather than:
while (<>) { split; $sum+=$F[2]; $ss+=$F[2]**2; $n++; } $mean = $sum / $n; $std_dev = ...
in dozens of places.
the library uses names for columns, so
No more $F[2], use _bw.
New columns, or columns in a different order? No changes to your scripts!
the library is self-documenting (each program records what it did)
No more wondering what hacks were used to compute the final data, just look at the comments at the end of the output.
The library is mature, supporting large datasets, corner cases, error handling, backed by an automated test suite.
No more puzzling about bad output because your custom script skimped on error checking.
No more memory thrashing when you try to sort ten million records.
(The disadvantage is that you need to learn what functions JDB provides.)
JDB is built on flat-ASCII databases. By storing data in simple text
files and processing it with pipelines it is easy to experiment (in
the shell) and look at the output. The original implementation of
this idea was /rdb
, a commercial product described in the book
UNIX relational database management: application development in the UNIX environment
by Rod Manis, Evan Schaffer, and Robert Jorgensen (and
also at the web page http://www.rdb.com/). JDB is an incompatible
re-implementation of their idea without any accelerated indexing or
forms support. (But it's free!).
Installation instructions follow at the end of this document. JDB requires Perl 5.005 to run. All commands include a complete description in their usage string, and many include manual pages. All commands are backed by an automated test suite.
The most recent version of JDB is available on the web at http://www.isi.edu/~johnh/SOFTWARE/JDB/index.html.
2.1, 6-Apr-08 --- another alpha 2.0, but now all converted programs understand both listize and colize format
The old dbjoin argument -i is now -a or --type=outer.
A minor change: comments in the source files for dbjoin are now intermixed with output rather than being delayed until the end.
The -e option (to avoid eoln spaces) is now -E, to avoid conflicts with the standard empty field argument.
The -e option is now -E to avoid conflicts. And its -n, -s, and -w options are now -N, -S, and -W to correspond.
Jdb::IO now understands both list-format and column-format data, so all converted programs can now automatically read either format. This capability was one of the milestone goals for 2.0, so yea!
Jdb now uses the standard Perl build and installation from ExtUtils::MakeMaker(3), so the quick answer to installation is to type:
perl Makefile.PL
make
make test
make install
Or, if you want to install it somewhere else, change the first line to
perl Makefile.PL PREFIX=/home/you
and it will go in /home/you/bin, etc. (See the ExtUtils::MakeMaker(3) manpage for more details.)
JDB requires perl 5.005 or later. Some of the commands work on 5.000, but several of the test scripts fail, so buyer beware.
A test suite is available; run it with ``make test''.
In the past there have been some test suite problems due to different printf implementations. I've tried to code around this problem; please let me know if you encounter it again.
A FreeBSD port of JDB is available, see http://www.freshports.org/databases/jdb/.
A Fink (MacOS X) port is available, see http://pdb.finkproject.org/pdb/package.php/jdb. (Thanks to Lars Eggert for maintaining this port.)
These programs are based on the idea of storing data in simple ASCII files. A database is a file with one header line and then data or comment lines. For example:
#h account passwd uid gid fullname homedir shell
johnh * 2274 134 John_Heidemann /home/johnh /bin/bash
greg * 2275 134 Greg_Johnson /home/greg /bin/bash
root * 0 0 Root /root /bin/bash
# this is a simple database
The header line must be first and begins with #h. There are rows (records) and columns (fields), just like in a normal database. Comment lines begin with #.
By default, columns are delimited by whitespace, so it is not possible for a field to contain whitespace. (But see below for alternatives.)
The big advantage of this approach is that it's easy to massage data into this format, and it's reasonably easy to take data out of this format into other (text-based) programs, like gnuplot, jgraph, and LaTeX. Think Unix. Think pipes.
Since no-whitespace in columns was a problem for some applications,
there's an option which relaxes this rule. You can specify the field
separator in the table header with -Fx
where x
is the new field
separator. The special value -FS
sets a separator of two spaces, thus
allowing (single) spaces in fields. An example:
#h -FS account passwd uid gid fullname homedir shell
johnh  *  2274  134  John Heidemann  /home/johnh  /bin/bash
greg  *  2275  134  Greg Johnson  /home/greg  /bin/bash
root  *  0  0  Root  /root  /bin/bash
# this is a simple database
See dbfilealter(1) for more details. Regardless of what the column separator is for the body of the data, it's always whitespace in the header.
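For example, an existing whitespace-separated file can be rewritten with the double-space separator after the fact. This is only a sketch, assuming dbfilealter accepts -F the same way the header does (the output file name here is made up):

dbfilealter -F S < DATA/passwd.jdb > passwd_spaces.jdb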
There's also a third format: a ``list''. Because it's often hard to tell which column is which past the first two, in list format each ``column'' appears on a separate line. The programs dblistize and dbcolize convert to and from this format, and all 2.0 programs work with both formats. The command
dbfilealter -R C < DATA/passwd.jdb
outputs:
#L account passwd uid gid fullname homedir shell
account: johnh
passwd: *
uid: 2274
gid: 134
fullname: John_Heidemann
homedir: /home/johnh
shell: /bin/bash
account: greg
passwd: *
uid: 2275
gid: 134
fullname: Greg_Johnson
homedir: /home/greg
shell: /bin/bash
account: root
passwd: *
uid: 0
gid: 0
fullname: Root
homedir: /root
shell: /bin/bash
# this is a simple database
# | dblistize
See dbfilealter(1) for more details.
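Converting back is symmetric; as a sketch (the input file name is hypothetical), dbcolize turns list-format data back into the usual column format:

dbcolize < passwd_list.jdb > passwd.jdb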
A number of programs exist to manipulate databases. Complex functions can be made by stringing together commands with shell pipelines. For example, to print the home directories of everyone with ``john'' in their names, you would do:
cat DATA/passwd | dbrow '_fullname =~ /John/' | dbcol homedir
The output might be:
#h homedir
/home/johnh
/home/greg
# this is a simple database
# | dbrow _fullname =~ /John/
# | dbcol homedir
(Notice that comments are appended to the output listing each command, providing an automatic audit log.)
In addition to typical database functions (select, join, etc.) there are also a number of statistical functions.
An advantage of JDB is that you can talk about columns by name
(symbolically) rather than simply by their positions. So in the above
example, dbcol homedir
pulled out the home directory column, and
dbrow '_fullname =~ /John/'
matched against column fullname.
In general, you can use the name of the column listed on the #h
line
to identify it in most programs, and _name to identify it in code.
Some alternatives for flexibility:
numeric values identify columns positionally, so 0 or _0 is the first column, 1 is the second, etc.
in code, _last_columnname gets the value of columnname from the previous row
See dbroweval(1) for more details about writing code.
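As a hedged sketch of _last_columnname (reusing the DATA/http_bandwidth sample data used later in this document; the delta column name is made up), one could compute the change in bw between consecutive rows by first creating the new column and then filling it in:

cat DATA/http_bandwidth | \
    dbcoldefine size bw | \
    dbcolcreate delta - | \
    dbroweval '_delta = _bw - _last_bw;'

(The first row's delta may come out empty or zero, depending on how _last_bw is initialized.)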
Enough said. I'll summarize the commands, and then you can experiment. For a detailed description of each command, see its usage line by running it with the argument ``-?''. In some shells (csh) you'll need to quote this (run ``dbcol -\?'' rather than ``dbcol -?'').
add columns to a database
set the column headings for a non-JDB file
select columns from a table
select rows from a table
sort rows based on a set of columns
compute the natural join of two tables
rename a column
merge two columns into one
split one column into two or more columns
split one column into multiple rows
split the file into multiple files per unique fields
check that a db file doesn't have some common errors
split a single input file containing multiple tables into several files
compute statistics over a column (mean, etc., optionally median)
compute a series of stats (mean, etc.) over a table
group rows (map) and then apply a function to each group (reduce)
compare two samples distributions (mean/conf interval/T-test)
compute moving statistics over a column of data
compute simple stats (sums and rates) over multiple timescales
compute Z-scores and T-scores over one column of data
compute the rank or percentile of a column
compute histograms over a column of data
compute the coefficient of correlation over several columns
compute linear regression and correlation for two columns
compute a running sum over a column of data
count the number of rows (a subset of dbstats)
compute differences between each row of a table
number each row
run arbitrary Perl code on each row
count/eliminate identical rows (like Unix uniq(1))
find ``cliffs'' in two-dimensional data
pretty-print columns
un-pretty-print columns
convert between column or list format, or change the column separator
remove comments from a table
remove extra headers that occur from table concatenation
remove leading spaces from (potentially non-JDB) data
generate a script that sends form mail based on each row
(These programs convert data into jdb. See their web pages for details.)
http://ficus-www.cs.ucla.edu/ficus-members/geoff/kitrace.html
spreadsheet tab-delimited files to db
(see man tcpdump(8) on any reasonable system)
(And out of jdb:)
simple conversion of JDB to html tables
Many programs have common options:
-? or --help: show usage
-c: confidence interval (dbstats, dbmultistats)
-C: column separator (dbcolsplit, dbcolmerge)
-d or --debug: debug mode
-a: compute stats over all data, treating non-numerics as zeros (by default, non-numerics are ignored for stats purposes)
-e E or --empty E: give value E as the value for empty (null) records
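For example, a hedged sketch of the -c option, asking dbstats for a 99% confidence interval instead of the default 95% (assuming -c takes the confidence level as a fraction):

cat DATA/grades | dbstats -c 0.99 test1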
When giving Perl code (in dbrow and dbroweval), column names can be embedded if preceded by underscores. Look at dbrow(1) or dbroweval(1) for examples.
Most programs run in constant memory and use temporary files if necessary. Exceptions are dbcolneaten, dbcolpercentile, dbmapreduce, dbmultistats, dbrowsplituniq.
Take the raw data in DATA/http_bandwidth, put a header on it (dbcoldefine size bw), take statistics of each category (dbmultistats size bw), pick out the relevant fields (dbcol size mean stddev pct_rsd), and you get:
#h size mean stddev pct_rsd
1024 1.4962e+06 2.8497e+05 19.047
10240 5.0286e+06 6.0103e+05 11.952
102400 4.9216e+06 3.0939e+05 6.2863
# | dbcoldefine size bw
# | /home/johnh/BIN/DB/dbmultistats size bw
# | /home/johnh/BIN/DB/dbcol size mean stddev pct_rsd
(The whole command was:
cat DATA/http_bandwidth | dbcoldefine size bw | dbmultistats size bw | dbcol size mean stddev pct_rsd
all on one line.)
Then post-process them to get rid of the exponential notation by adding this to the end of the pipeline:
dbroweval '_mean = sprintf("%8.0f", _mean); _stddev = sprintf("%8.0f", _stddev);'
(Actually, this step is no longer required since dbcolstats now uses a different default format.)
giving:
#h size mean stddev pct_rsd
1024 1496200 284970 19.047
10240 5028600 601030 11.952
102400 4921600 309390 6.2863
# | dbcoldefine size bw
# | /home/johnh/BIN/DB/dbmultistats size bw
# | /home/johnh/BIN/DB/dbcol size mean stddev pct_rsd
# | /home/johnh/BIN/DB/dbroweval { _mean = sprintf("%8.0f", _mean); _stddev = sprintf("%8.0f", _stddev); }
In a few lines, raw data is transformed to processed output.
Suppose you expect there is an odd distribution of results of one datapoint. JDB can easily produce a CDF (cumulative distribution function) of the data, suitable for graphing:
cat DB/DATA/http_bandwidth | \
    dbcoldefine size bw | \
    dbrow '_size == 102400' | \
    dbcol bw | \
    dbsort -n bw | \
    dbrowenumerate | \
    dbcolpercentile count | \
    dbcol bw percentile | \
    xgraph
The steps, roughly:
1. get the raw input data and turn it into jdb format,
2. pick out just the relevant column (for efficiency) and sort it,
3. for each data point, assign a CDF percentage to it,
4. pick out the two columns to graph and show them.
The first commercial program I wrote was a gradebook, so here's how to do it with JDB.
Format your data like DATA/grades.
#h name email id test1
a a@ucla.edu 1 80
b b@usc.edu 2 70
c c@isi.edu 3 65
d d@lmu.edu 4 90
e e@caltech.edu 5 70
f f@oxy.edu 6 90
Or if your students have spaces in their names, use -FS and two spaces to separate each column:
#h -FS name email id test1
alfred aho  a@ucla.edu  1  80
butler lampson  b@usc.edu  2  70
david clark  c@isi.edu  3  65
constantine drovolis  d@lmu.edu  4  90
debrorah estrin  e@caltech.edu  5  70
sally floyd  f@oxy.edu  6  90
To compute statistics on an exam, do
cat DATA/grades | dbstats test1 | dblistize
giving
#L ...
mean: 77.5
stddev: 10.84
pct_rsd: 13.987
conf_range: 11.377
conf_low: 66.123
conf_high: 88.877
conf_pct: 0.95
sum: 465
sum_squared: 36625
min: 65
max: 90
n: 6
...
To do a histogram:
cat DATA/grades | dbcolhisto -n 5 -g test1
giving
#h low histogram
65 *
70 **
75
80 *
85
90 **
# | /home/johnh/BIN/DB/dbhistogram -n 5 -g test1
Now you want to send out grades to the students by e-mail. Create a form-letter (in the file test1.txt):
To: _email (_name)
From: J. Random Professor <jrp@usc.edu>
Subject: test1 scores
_name, your score on test1 was _test1.

86+ A
75-85 B
70-74 C
0-69 F
Generate the shell script that will send the mail out:
cat DATA/grades | dbformmail test1.txt > test1.sh
And run it:
sh <test1.sh
The last two steps can be combined:
cat DATA/grades | dbformmail test1.txt | sh
but I like to keep a copy of exactly what I send.
At the end of the semester you'll want to compute grade totals and assign letter grades. Both fall out of dbroweval. For example, to compute weighted total grades with a 40% midterm/60% final where the midterm is 84 possible points and the final 100:
dbcol -rv total | \
    dbcolcreate total - | \
    dbroweval '_total = .40 * _midterm/84.0 + .60 * _final/100.0;
        _total = sprintf("%4.2f", _total);
        if (_final eq "-" || ( _name =~ /^_/)) { _total = "-"; };' | \
    dbcolneaten
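Letter grades fall out the same way. A minimal sketch, reusing the cutoffs from the form letter above and assuming the weighted _total column from the previous pipeline (here expressed as a fraction, so 0.86 corresponds to 86%):

dbcolcreate letter - | \
    dbroweval 'if (_total eq "-") { _letter = "-"; }
        elsif (_total >= 0.86) { _letter = "A"; }
        elsif (_total >= 0.75) { _letter = "B"; }
        elsif (_total >= 0.70) { _letter = "C"; }
        else { _letter = "F"; }' | \
    dbcolneaten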
If you got the data originally from a spreadsheet, save it in ``tab-delimited'' format and convert it with tabdelim_to_db (run tabdelim_to_db -? for examples).
To convert the Unix password file to db:
cat /etc/passwd | sed 's/:/  /g' | \
    dbcoldefine -F S login password uid gid gecos home shell \
    >passwd.jdb
To convert the group file:
cat /etc/group | sed 's/:/  /g' | \
    dbcoldefine -F S group password gid members \
    >group.jdb
To show the names of the groups that div7-members are in (assuming DIV7 is in the gecos field):
cat passwd.jdb | dbrow '_gecos =~ /DIV7/' | dbcol login gid | \
    dbjoin - group.jdb gid | dbcol login group
Which db programs are the most complicated (based on number of test cases)?
ls TEST/*.cmd | \
    dbcoldefine test | \
    dbroweval '_test =~ s@^TEST/([^_]+).*$@$1@' | \
    dbrowuniq -c | \
    dbsort -nr count | \
    dbcolneaten
(Answer: dbstats, then dbjoin.)
Stats on an exam (in FILE, with COLUMN==the name of the exam)?
dbstats -q 4 $COLUMN <$FILE | dblistize | dbstripcomments
cat $FILE | dbcolhisto -g -n 20 $COLUMN | dbcolneaten | dbstripcomments
Merging the hw1 column from file hw1.jdb into grades.jdb, assuming there's a common student id in column ``id'':
dbcol id hw1 <hw1.jdb >t.jdb
dbjoin -i -e - grades.jdb t.jdb id | \
    dbsort name | \
    dbcolneaten >new_grades.jdb
Merging two jdb files with the same columns:
cat file1.jdb file2.jdb >output.jdb
or if you want to clean things up a bit
cat file1.jdb file2.jdb | dbstripextraheaders >output.jdb
or if you want to know where the data came from
for i in 1 2
do
    dbcolcreate source $i < file$i.jdb
done | dbstripextraheaders >output.jdb
(assumes you're using a Bourne-shell compatible shell, not csh).
There have been two versions of JDB; the current is a complete re-write of the first.
JDB (in its various forms) has been used extensively by its author since 1991. Since 1995 it's been used by two other researchers at UCLA and several at ISI. In February 1998 it was announced to the Internet. Since then it has found a few users, some outside where I work.
JDB includes code ported from Geoff Kuenning (DbTDistr.pm).
JDB contributors: Ashvin Goel <goel@cse.oge.edu>, Geoff Kuenning <geoff@fmg.cs.ucla.edu>, Vikram Visweswariah <visweswa@isi.edu>, Kannan Varadahan <kannan@isi.edu>, Lars Eggert <larse@isi.edu>, Arkadi Gelfond <arkadig@dyna.com>, Haobo Yu <haoboy@packetdesign.com>, Pavlin Radoslavov <pavlin@catarina.usc.edu>, Fabio Silva <fabio@isi.edu>, Jerry Zhao <zhaoy@isi.edu>, Ning Xu <nxu@aludra.usc.edu>, Martin Lukac <mlukac@lecs.cs.ucla.edu>.
JDB includes datasets contributed from NIST (DATA/nist_zarr13.jdb), from http://www.itl.nist.gov/div898/handbook/eda/section4/eda4281.htm, the NIST/SEMATECH e-Handbook of Statistical Methods, section 1.4.2.8.1. Background and Data. The source is public domain, and reproduced with permission.
As stated in the introduction, JDB is an incompatible reimplementation
of the ideas found in /rdb. By storing data in simple text files and
processing it with pipelines it is easy to experiment (in the shell)
and look at the output. The original implementation of this idea was
/rdb, a commercial product described in the book UNIX relational
database management: application development in the UNIX environment
by Rod Manis, Evan Schaffer, and Robert Jorgensen (and also at the web
page http://www.rdb.com/).
In August 2002 I found out that Carlo Strozzi extended RDB with his package NoSQL http://www.linux.it/~carlos/nosql/. According to Mr. Strozzi, he implemented NoSQL in awk to avoid the Perl start-up of RDB. Although I haven't found Perl startup overhead to be a big problem on my platforms (from old Sparcstation IPCs to 2GHz Pentium-4s), you may want to evaluate his system. (At some point I'll try to do a comparison of JDB and NoSQL.)
Versions prior to 1.0 were released informally on my web page but were not announced.
started for my own research use
first check-in to RCS
parts now require perl5
adds autoconf support and a test script.
support for double space field separators, better tests
minor changes and release on comp.lang.perl.announce
adds dmalloc_to_db converter
fixes some warnings
dbjoin now can run on unsorted input
fixes a dbjoin bug
some more tests in the test suite
improves error messages (all should now report the program that makes the error)
fixed a bug in dbstats output when the mean is zero
There are two mailing lists: jdb-announce@heidemann.la.ca.us and jdb-talk@heidemann.la.ca.us. To subscribe to either, send mail to jdb-announce-request@heidemann.la.ca.us or jdb-talk-request@heidemann.la.ca.us with ``subscribe'' in the BODY of the message.
dblib_text2html() will do simple conversion of iso-8859-1 to HTML.
2.0, 25-Jan-08 --- a quiet 2.0 release (gearing up towards complete)
dbstats (renamed dbcolstats), dbcolrename, dbcolcreate.
It also provides perl function aliases for the internal modules, so a string of jdb commands in perl is nearly as terse as in the shell:
use Jdb::Filter::dbpipeline qw(:all);
dbpipeline(
    dbrow(qw(name test1)),
    dbroweval('_test1 += 5;')
);
Now outputs - (the default empty value) for statistics it cannot compute (for example, standard deviation if there is only one row), instead of the old mix of - and ``na''.
The -t mean,stddev option is now --tmean mean --tstddev stddev. See dbcolstatscores for details.
-e option.
n output (except without differentiating numeric/non-numeric input), or the equivalent of dbstripcomments | wc -l.
The -i option to include non-matches is now renamed -a, so as to not conflict with the new standard option -i for input file.
JDB is Copyright (C) 1991-2008 by John Heidemann <johnh@isi.edu>.
This program is free software; you can redistribute it and/or modify it under the terms of version 2 of the GNU General Public License as published by the Free Software Foundation.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
A copy of the GNU General Public License can be found in the file ``COPYING''.
Any comments about these programs should be sent to John Heidemann, johnh@isi.edu.