These programs and scripts digest web proxy and server log files. From a sequence of logs they produce a digest file that contains one entry line for each URL. Each URL entry includes the URL's size and the request time for each fetch of that URL that appears in the digested logs.
These digest files capture the dynamics of the fetching activity associated with each URL, across a span of days or weeks, in a format that is much easier to access and store than is a set of raw log files. The fetch history of a URL is useful for determining whether a particular URL file is frequently fetched, and so is `hot'. A server may decide to treat hot URLs differently than other more typical URLs. The analysis of past recorded fetch history is useful for determining the effectiveness of algorithms that select such hot URLs.
Because there are differing log file formats, for example Squid and Common Log formats, much of the software discussed here operates on a simplified intermediate format log file. In the case of Squid proxy logs, the intermediate format log is achieved merely by extracting certain fields from a Squid log via the UNIX cut() program. For other log file formats, conversion may be a bit more difficult, perhaps requiring date reformatting.
The programs and scripts discussed here run under the BSD/SunOS flavor of UNIX and require Perl version 5 or the C-shell. It should be relatively trivial to make them run under other variants of the UNIX operating system.
The National Laboratory for Applied Network Research (NLANR) Information Resource Caching Project has made available to other researchers a sequence of daily Squid logs, as well as some daily statistics. These logs are gathered each day from several proxy cache sites. For a discussion of this project and pointers to other related projects and sites, see http://ircache.nlanr.net.
The intermediate log file format contains a sequence of one-line text records of the form: <time> <client> <log-tag> <size> <URL>. Within a record, each of these fields is separated by a single space character. The fields contain:
<time> --- Timestamp of request in seconds since 1970.
<client> --- IP client address, assumed good for one day only, since privatizing software often substitutes a randomly chosen, unique address for each IP address each day.
<log-tag> --- Log tag and HTTP reply code. If none of this data is needed, this field may be a dash "-".
<size> --- Number of bytes written to the client.
<URL> --- Complete requested URL string (with no embedded spaces).
The Squid log file is a sequence of one-line text records of the form: <time> <etime> <client> <log-tag> <size> <request> <URL> <userid> <hierarchy>. Within a record, each of these fields is separated by a single space character. The fields of interest here contain:
<time> --- Timestamp of request in seconds since 1970.
<etime> --- The elapsed time of the request in milliseconds.
<client> --- IP client address good for one day only.
<log-tag> --- Log tag and HTTP code.
<size> --- Number of bytes written to the client.
<request> --- The HTTP request method.
<URL> --- The requested URL.
<userid> --- Always '-' for NLANR logs.
The <hierarchy> field contains:
<hostname> --- How and where the requested object was fetched.
<content> --- Content-type field of HTTP reply (may be empty)
892796408.126 2276 141.14.76.96 TCP_MISS/304 160 GET http://www.foo.bar/3943.htm - TIMEOUT_CLOSEST_DIRECT/www.republika.co.id
Conversion from Squid log format into intermediate format is simply accomplished by extracting the <time>, <client>, <log-tag>, <size> and <URL> fields. For Squid logs, this is done by the dailylog.pl program discussed below.
The sample Squid proxy log entry above, in intermediate file format:
892796408.126 141.14.76.96 TCP_MISS/304 160 http://www.foo.bar/3943.htm
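The field extraction described above can be sketched in a few lines. This is a minimal Python illustration, not the dailylog.pl code itself (which is Perl); the function name squid_to_intermediate is hypothetical, and the sketch assumes every record has at least the first seven space-separated fields.

```python
def squid_to_intermediate(line):
    """Reduce one Squid log record to the five intermediate-format fields."""
    fields = line.split()
    if len(fields) < 7:
        raise ValueError(f"too few fields in Squid record: {line!r}")
    # Keep <time>, <client>, <log-tag>, <size> and <URL>;
    # drop <etime>, <request> and everything after <URL>.
    time, _etime, client, log_tag, size, _request, url = fields[:7]
    return " ".join((time, client, log_tag, size, url))


sample = ("892796408.126 2276 141.14.76.96 TCP_MISS/304 160 GET "
          "http://www.foo.bar/3943.htm - TIMEOUT_CLOSEST_DIRECT/www.republika.co.id")
print(squid_to_intermediate(sample))
```

Applied to the sample Squid entry, this prints the intermediate-format line shown above. It selects the same fields (1, 3, 4, 5 and 7) that a cut-style extraction would.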
Several programs are provided and are available here. They are:
Daily log processing:
dailylogs --- C-shell script
dailylogs.pl
dailylog.pl
ftpgetfile --- C-shell script
ccachefilter.pl
reqmat.pl
Monthly log processing:
monthly.pl
cmbreqmat.pl
Hot set visualization programs:
reqiat.pl
hshitplot.pl
hstrafficplot.pl
hsurlplot.pl
GNU utilities:
gcut() --- GNU version of cut()
gsort() --- GNU version of sort()
The GNU versions of these utilities are used to avoid line-length limits. Bugs in browsers can result in some very long URL name strings.
Because web cache log files can be very large, a straightforward longitudinal study of many days of logs could consume a great deal of storage. The NLANR site, for example, keeps only the past seven days of log files online. This package collects and processes log data day by day, producing as a result a sequence of reduced-size request-time-matrix (reqmat) files. These reqmat files may later be combined and further reduced to perform a multi-week analysis.
Steps 1 thru 6 below are performed via dailylog.pl().
Steps 1 thru 4 below are performed via monthly.pl(), which is called from dailylogs.pl() when a change of month is noticed.
If one has only past log data from which to make a determination, a `hot' URL is determined by examining the history of activity for that URL and testing that activity against one or more thresholds. Two examples of such thresholds are the number of requests per day and the standard deviation of the interval between those requests.
The number of requests per URL, the intervals between those requests and the standard deviation of those intervals are determined by the reqiat.pl() program. This program reads a reqmat file and produces a corresponding request-interarrival-times (reqiat) summary file.
Perl programs are provided that use the two thresholds mentioned above to select a hot URL set. These thresholds are varied across a range that is determined by adjusting values within each program. What results is a 2-D summarization of the sizes of the hot sets being selected by the varied threshold settings.
These Perl programs produce GNUPLOT data files that are used to generate one of three surface plots, where the Z-axis may be: (1) hot set as a percentage of total URLs, (2) hot set traffic as a percentage of total traffic, or (3) the number of fetches generated by the hot set.
To drastically reduce run time for these programs, the reqiat file must first be reverse sorted on its <request-count> field before being passed to one of these programs.
The programs and how to run them are described briefly below.
This is a simple C-shell script that serves only to set the UNIX filesize limit to the user's largest legal value. Having done that, it calls dailylogs.pl().
This program is meant to be run daily. It copies a set of daily Squid log files from the NLANR storage site (currently ircache.nlanr.net), processes them and stores the results. When a monthly transition occurs, it performs a month-long longitudinal analysis, stores the results, deletes the previous month's daily files and begins storing daily results for the new month. After completing, it resubmits itself to run at midnight.
Its run-time arguments are obtained from a configuration file: dailylogs.config, which is assumed to be in the user's connected directory. The contents of this file are a sequence of text name/value pairs, one per line. For example:
src_host ircache.nlanr.net
src_dir Traces
log_names sd sj sv
sort_dir /tmp
dst_dir /nfs/ruby/lsam/logdata
Where src_host is the host where Squid logs are located, src_dir is the directory within src_host where log files are stored, and log_names contains the names of sites for which Squid logs are maintained.
These daily log files are currently kept at ircache.nlanr.net and are publicly available via FTP from the Traces directory. Their naming convention is: <log-name>.sanitized-access.<date>.gz. The <date> is in the form "YYMMDD". Thus a log file could be named: lj.sanitized-access.980211.gz.
dst_dir --- Destination directory where processed log and other intermediate files are stored.
sort_dir --- Directory to be used by the UNIX sort() program for its temporary storage.
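As an illustration of the configuration format and the log naming convention described above, here is a minimal Python sketch. It is not the dailylogs.pl code (which is Perl); the function names parse_config and log_file_name are hypothetical, and the sketch assumes each line is a name followed by white space and the value.

```python
def parse_config(text):
    """Parse 'name value' lines into a dict; values are kept as strings."""
    config = {}
    for raw in text.splitlines():
        parts = raw.strip().split(None, 1)
        if not parts:
            continue  # skip blank lines
        config[parts[0]] = parts[1] if len(parts) > 1 else ""
    return config


def log_file_name(log_name, yymmdd):
    """Build <log-name>.sanitized-access.<date>.gz for one site and day."""
    return f"{log_name}.sanitized-access.{yymmdd}.gz"


example = """\
src_host ircache.nlanr.net
src_dir Traces
log_names sd sj sv
sort_dir /tmp
dst_dir /nfs/ruby/lsam/logdata
"""
config = parse_config(example)
names = config["log_names"].split()
files = [log_file_name(name, "980211") for name in names]
print(files)
```

With the example configuration, the three site names sd, sj and sv yield three file names for the day 980211, one per site.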
This program is a batch job that performs daily log processing for each named log in log_names. For each log-name in log_names, this program copies the corresponding log file into dst_dir and uncompresses it. It then calls dailylog.pl() to process that particular log file.
NOTE: The C-shell script ftpgetfile is used to perform the FTP transfer that gets the log file. It currently relies on the user's .netrc file to avoid a login and password prompt. See the discussion of ftpgetfile below.
dailylogs.pl keeps track of the days for which it has processed files within the past month via the file dailylogs.run_dates, which is found in the dst_dir directory. This file contains one line per day, each line representing a day for which logs have been processed. Each line takes the form "YYMMDD", as in 980301. The date of the last entry in this file is incremented by 24 hours to determine the "YYMMDD" to be associated with the log files that are to be processed during the next execution.
If a transition to a new month is seen, the monthly.pl program is run to process and combine the previous month's worth of files. The daily files for the previous month are then deleted and a new dailylogs.run_dates file is created.
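The date bookkeeping described above can be sketched as follows. This is a Python illustration under stated assumptions, not the dailylogs.pl code: the function name next_run_date is hypothetical, and the month transition is assumed to be detected by comparing the month of the next date with that of the last processed date. Python's %y directive maps two-digit years 69 to 99 onto 1969 to 1999.

```python
from datetime import datetime, timedelta


def next_run_date(run_dates_text):
    """Return (next YYMMDD, month_changed) given dailylogs.run_dates contents."""
    entries = run_dates_text.split()
    last = datetime.strptime(entries[-1], "%y%m%d")
    nxt = last + timedelta(hours=24)  # the last entry incremented by 24 hours
    month_changed = (nxt.year, nxt.month) != (last.year, last.month)
    return nxt.strftime("%y%m%d"), month_changed


print(next_run_date("980227\n980228\n"))
print(next_run_date("980301\n"))
```

The first call crosses from February into March 1998, so it reports a month transition; the second does not.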
Called as: dailylog.pl <log-file> {<sort-dir>}
Where: <log-file> --- The filename string of a Squid log file.
<sort-dir> --- An optional argument. If present, it specifies to the UNIX sort() program a directory to be used for temporary files created during the merge phase of the sort. If the argument is not present, it defaults to /tmp.
The steps carried out by the dailylog.pl program are:
After successful completion, dailylog.pl will have produced six files in the connected directory:
<log-file>.reqmat --- request matrix file, unfiltered
<log-file>.reqmat.filtered --- ditto, filtered
<log-file>.reqmat.dropped --- ditto, dropped by filtering
<log-name>.summary --- accumulated summary, unfiltered
<log-name>.summary.filtered --- ditto, filtered
<log-name>.summary.dropped --- ditto, dropped by filtering
Where <log-name> is the file name prefix portion of <log-file>.
NOTE: Currently, the <log-file>.reqmat.dropped file is deleted after it is created. However, the <log-name>.summary.dropped is kept.
Called as: ftpgetfile <src-host> <src-dir> <src-file> <dst-file>
A C-shell script that is used to FTP from host <src-host> a file <src-file> in directory <src-dir>. That file is stored into the connected directory and is given the file name <dst-file>.
This script requires the user to have modified their .netrc file to allow FTP to automatically log into <src-host> without prompting for username and password. For example, adding this line to the user JohnDoe's .netrc file would work:
machine ircache.nlanr.net login anonymous password JohnDoe@foo.edu
Called as: ccachefilter.pl <sorted-log-file> <output-log-file>
Where <sorted-log-file> is a Squid proxy log file that is sorted by URL and <output-log-file> is the file name of the filtered output log file.
This Perl program reads each record in <sorted-log-file>, which is a daily Squid proxy log file that has been sorted by URL. It filters that file, removing from it those records for which the client and URL fields are not unique. The result is written to <output-log-file>. This output log file mimics what the input log file would contain if every client had a perfect web cache that was flushed at the end of each day.
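The filtering idea can be sketched as below. This is a minimal Python illustration, not the ccachefilter.pl code. It assumes the input records are in the five-field intermediate format, that the first record seen for each client and URL pair is the one kept, and that records for the same URL are in time order; the function name client_cache_filter is hypothetical.

```python
def client_cache_filter(records):
    """Yield only the first record for each (client, URL) pair.

    `records` is an iterable of intermediate-format lines sorted by URL.
    """
    current_url = None
    seen_clients = set()
    for line in records:
        _time, client, _log_tag, _size, url = line.split()
        if url != current_url:
            # Input is sorted by URL, so the client set can be reset here.
            current_url = url
            seen_clients = set()
        if client in seen_clients:
            continue  # a perfect client cache would have served this request
        seen_clients.add(client)
        yield line


records = [
    "1.0 10.0.0.1 - 10 http://x/",
    "2.0 10.0.0.1 - 10 http://x/",
    "3.0 10.0.0.2 - 10 http://x/",
    "4.0 10.0.0.1 - 5 http://y/",
]
kept = list(client_cache_filter(records))
print(kept)
```

Because the input is sorted by URL, only the set of clients seen for the current URL has to be held in memory, rather than every client and URL pair in the file.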
Called as: reqmat.pl <log-file> <reqmat-file>
Where <log-file> is an intermediate format log file that has been sorted on its URL field and <reqmat-file> is the file name of the output request-time-matrix file.
This Perl program processes an intermediate format log file <log-file>, creating from it an output reqmat file. The <log-file> must already have been sorted by URL. The output <reqmat-file> consists of lines of the form: <URL> <timestamp> <size> <interval.reqs> ... <interval.reqs>. For a description of this file format, see below.
Called as: cmbreqmat.pl <combined-file> <output-file>
Where <combined-file> is a file formed by concatenating several daily reqmat files and sorting the resulting file on its URL field, and <output-file> is the file name for the output reqmat file. The program merges the multiple reqmat records for each URL into a single one-line record, so each output line may be quite long.
In principle one could concatenate all the daily Squid proxy log files into a single log file, sort that file by URL and pass it to reqmat.pl. However, a month of proxy log files can exceed 2GB. This would require at least another 2GB to sort. Because the daily reqmat files are typically much smaller than the log files themselves, it is much more practical to pass each daily log file to reqmat.pl, gather the request-time-matrix files it produces at the end of a month, concatenate them, and finally, sort the resulting combined file by URL.
Called as: reqiat.pl <reqmat_file> <reqiat_file>
This Perl program reads a reqmat file that is produced by the reqmat.pl or cmbreqmat.pl programs. The output <reqiat_file> consists of lines of the form:
<URL> <size> <request-count> <mean> <std-dev>
Where <size> is the byte count of <URL> when first encountered, <request-count> is the number of requests that <URL> was recorded as receiving in the request-time-matrix file, <mean> is the average inter-arrival time between requests, and <std-dev> is the standard deviation of the inter-arrival time between requests. If there is only one hit for a URL, both <mean> and <std-dev> are meaningless.
NOTE: Where there is more than one hit recorded during a single interval period, the arrivals are deemed to be evenly spaced within the interval. If one hit is recorded, it is deemed to have arrived in the middle of the interval. This introduces a small error in both the <mean> and <std-dev> calculations and is a consequence of the data reduction undertaken when producing the request-time-matrix file.
Called as: hsurlplot.pl <reqiat-file> <GNUPLOT-file> <URL-total>
Where <reqiat-file> is a reqiat input file that has been reverse sorted on its <request-count> field, <GNUPLOT-file> is an output file used for input into GNUPLOT, and <URL-total> is the total number of unique URLs in <reqiat-file>.
This program processes a reqiat file that has been sorted in descending order of its <request-count> fields. It produces from that a GNUPLOT data file suitable for producing a 3-D surface plot.
Consider the question "What percentage of URLs in the reqiat file have been fetched at least 60 times per month, where the standard deviation in intervals between fetches is no more than 0.75 days?" That question can be answered by processing each line of the reqiat file, applying both conditions as a conjunction to each record.
This program performs that task for a range of times fetched per month and applies a range of standard deviation of intervals between fetches. This results in a two-dimensional surface of Z-axis percentage values.
The range of times fetched per month and the range of standard deviation of intervals between fetches are controlled by the bounds on two nested for loops within the hsurlplot.pl program itself. To alter these ranges, the user must alter those loops.
Called as: hstrafficplot.pl <reqiat-file> <GNUPLOT-file> <traffic-total>
Where <reqiat-file> is a reqiat input file that has been reverse sorted on its <request-count> field, <GNUPLOT-file> is an output file used for input into GNUPLOT and <traffic-total> is the total number of bytes transferred by all requests for URLs in the reqiat file.
This program processes a reqiat file that has been sorted in descending order of its <request-count> fields. It produces from that a GNUPLOT data file suitable for producing a 3-D surface plot.
Consider the question "What percentage of traffic represented in the reqiat file is produced by URLs that have been fetched at least 60 times per month, where the standard deviation in intervals between fetches is no more than 0.75 days?" That question can be answered by processing each line of the reqiat file, applying both conditions as a conjunction to each record.
This program performs that task for a range of times fetched per month and applies a range of standard deviation of intervals between fetches. This results in a two-dimensional surface of Z-axis percentage values.
The range of times fetched per month and the range of standard deviation of intervals between fetches are controlled by the bounds on two nested for loops within the hstrafficplot.pl program itself. To alter these ranges, the user must alter those loops.
Called as: hshitplot.pl <reqiat-file> <GNUPLOT-file>
Where <reqiat-file> is a reqiat input file that has been reverse sorted on its <request-count> field and <GNUPLOT-file> is an output file used for input into GNUPLOT.
This program processes a reqiat file that has been sorted in descending order of its <request-count> fields. It produces from that a GNUPLOT data file suitable for producing a 3-D surface plot.
Consider the question "How many fetches are represented in the reqiat file by URLs that have been fetched at least 60 times per month, where the standard deviation in intervals between fetches is no more than 0.75 days?" That question can be answered by processing each line of the reqiat file, applying both conditions as a conjunction to each record.
This program performs that task for a range of times fetched per month and applies a range of standard deviation of intervals between fetches. This results in a two-dimensional surface of Z-axis fetch-count values.
The range of times fetched per month and the range of standard deviation of intervals between fetches are controlled by the bounds on two nested for loops within the hshitplot.pl program itself. To alter these ranges, the user must alter those loops.
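The three hot set programs above apply the same two-threshold test and differ only in what they accumulate. The Python sketch below illustrates that test and the threshold sweep; it is not the code of the Perl programs. The names hot_set_stats and sweep are hypothetical, <std-dev> is assumed to be in seconds (so 0.75 days is 64800), and the traffic of a URL is assumed to be its <size> multiplied by its <request-count>.

```python
def hot_set_stats(reqiat_lines, min_requests, max_std_dev):
    """Return (url_count, traffic_bytes, fetch_count) for the selected hot set.

    `reqiat_lines` must be sorted in descending order of <request-count>.
    """
    urls = traffic = fetches = 0
    for line in reqiat_lines:
        _url, size, count, _mean, std_dev = line.split()
        count = int(count)
        if count < min_requests:
            break  # descending sort: no later record can qualify
        if float(std_dev) <= max_std_dev:
            urls += 1
            traffic += int(size) * count
            fetches += count
    return urls, traffic, fetches


def sweep(reqiat_lines, request_thresholds, std_dev_thresholds, url_total):
    """Return (min_requests, max_std_dev, percent_of_urls) rows for a surface plot."""
    rows = []
    for min_requests in request_thresholds:      # outer loop: request threshold
        for max_std_dev in std_dev_thresholds:   # inner loop: std-dev threshold
            urls, _, _ = hot_set_stats(reqiat_lines, min_requests, max_std_dev)
            rows.append((min_requests, max_std_dev, 100.0 * urls / url_total))
    return rows


lines = [
    "http://a/ 1000 90 28000.0 30000.0",
    "http://b/ 500 70 37000.0 90000.0",
    "http://c/ 200 3 800000.0 100.0",
]
print(hot_set_stats(lines, 60, 64800.0))
print(sweep(lines, [2, 60], [64800.0], url_total=3))
```

The early break in hot_set_stats shows why the descending sort on <request-count> reduces run time: once one record falls below the request threshold, the rest of the file can be skipped.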
A reqmat file summarizes the times of requests seen for each URL. The reqmat.pl program produces this file, which is sorted by URL name. Each URL appears only once in this file. Each line of the file consists of at least four text fields separated by a space.
<URL> <timestamp> <size> <interval.reqs> ... <interval.reqs>
The <URL> is the complete URL as recorded in the log file. It may not contain embedded spaces. The <timestamp> records the time of the first request seen for the URL, in seconds since 1/1/1970 GMT.
The <size> is the URL file size in bytes. The <size> value chosen to be represented here is the largest value for all requests seen in the log file for this URL. Because the size of a URL file can vary during the day, this can introduce a source of inaccuracy when computing total traffic per URL.
To reduce the data stored in a reqmat file, time is divided into intervals of equal length, with interval zero starting at time = <timestamp>. The interval definition is currently 15 minutes (900 seconds).
The field notation <interval.reqs> records the interval during which a request for <URL> was seen and the number of times during this interval that such a request was seen. For example, "43.2" indicates that during interval number 43 (the 44th interval) two requests for <URL> were seen. There is no <interval.reqs> field present for intervals during which no requests were seen. The interval numbers increase from left to right.
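The record layout described above can be illustrated by building reqmat lines from a few intermediate-format records. This is a minimal Python sketch, not the reqmat.pl code; the function name reqmat_lines is hypothetical, and the sketch assumes the <timestamp> is written exactly as it appeared in the log.

```python
from itertools import groupby

INTERVAL_SECONDS = 900  # 15-minute intervals, as stated in the text


def reqmat_lines(sorted_records):
    """Yield one reqmat line per URL from URL-sorted intermediate records."""
    parsed = (line.split() for line in sorted_records)
    for url, group in groupby(parsed, key=lambda fields: fields[4]):
        # (numeric time, time as written, size) for each request, oldest first
        requests = sorted((float(f[0]), f[0], int(f[3])) for f in group)
        first_time, first_text, _ = requests[0]
        max_size = max(size for _, _, size in requests)
        counts = {}
        for when, _, _ in requests:
            interval = int((when - first_time) // INTERVAL_SECONDS)
            counts[interval] = counts.get(interval, 0) + 1
        # Only intervals with at least one request are written (sparse form).
        cells = " ".join(f"{i}.{n}" for i, n in sorted(counts.items()))
        yield f"{url} {first_text} {max_size} {cells}"


records = [
    "1000.000 10.0.0.1 - 100 http://a/",
    "1100.000 10.0.0.2 - 120 http://a/",
    "39800.000 10.0.0.1 - 100 http://a/",
    "2000.000 10.0.0.1 - 50 http://b/",
]
for line in reqmat_lines(records):
    print(line)
```

For http://a/, the first two requests fall in interval 0 and the third arrives 38800 seconds after the first, which places it in interval 43, so the line ends in "0.2 43.1".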
If the interval length is 15 minutes, there are 96 intervals per day. A longitudinal study of hit times may span as much as a month. A span of a month with an interval of 15 minutes would result in a matrix with approximately 2900 interval positions per URL encountered. Because the typical URL is referenced infrequently with respect to such an interval, a sparse external representation of the hit matrix is advantageous, if not indeed necessary, considering the amount of storage involved.
A longitudinal analysis is accomplished by appending a time-ordered sequence of reqmat files together and sorting the resulting file by URL name. The cmbreqmat.pl program processes this file to produce a combined reqmat file. The <interval.reqs> fields recorded in this file for each URL seen are relative to the <timestamp> of the first request seen for that URL.
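Merging the daily records for one URL means re-expressing each day's interval numbers relative to the earliest <timestamp>. The Python sketch below shows one way this could be done; it is an assumption about the approach, not the cmbreqmat.pl code. The names parse_reqmat and merge_reqmat are hypothetical, the merged <size> is assumed to be the largest daily value, and an interval is rebased by the position of its start time.

```python
INTERVAL_SECONDS = 900  # 15-minute intervals, as stated in the text


def parse_reqmat(line):
    """Split a reqmat line into (time, time as written, url, size, counts)."""
    url, timestamp, size, *cells = line.split()
    counts = {}
    for cell in cells:
        interval, reqs = cell.split(".")
        counts[int(interval)] = int(reqs)
    return float(timestamp), timestamp, url, int(size), counts


def merge_reqmat(lines_for_one_url):
    """Merge several reqmat records for the same URL into one line."""
    records = sorted(parse_reqmat(line) for line in lines_for_one_url)
    base, base_text, url, _, _ = records[0]  # earliest first request
    merged = {}
    max_size = 0
    for timestamp, _, _, size, counts in records:
        max_size = max(max_size, size)
        offset = timestamp - base
        for interval, reqs in counts.items():
            start = offset + interval * INTERVAL_SECONDS
            rebased = int(start // INTERVAL_SECONDS)
            merged[rebased] = merged.get(rebased, 0) + reqs
    cells = " ".join(f"{i}.{n}" for i, n in sorted(merged.items()))
    return f"{url} {base_text} {max_size} {cells}"


daily = [
    "http://a/ 87400 150 0.1 2.3",   # second day
    "http://a/ 1000 120 0.2 43.1",   # first day
]
print(merge_reqmat(daily))
```

The second day's record starts exactly 86400 seconds (96 intervals) after the first, so its intervals 0 and 2 become 96 and 98 in the combined line. When the offset is not a whole number of intervals, rebasing by the interval start adds a small timing error of less than one interval.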
A request-inter-arrival-time (reqiat) file digests a reqmat file and records statistics on the inter-arrival times between successive requests for each URL. The reqiat.pl() program produces this file, which is sorted by URL name. Each URL appears only once in this file. Each line of the file consists of five text fields separated by white space.
<URL> <size> <request-count> <mean> <std-dev>
The <URL> is the complete URL as recorded in the log file. It may not contain embedded spaces. The <size> is the URL file size in bytes. The <request-count> is the total number of requests seen for this URL in the request-time-matrix file. The <mean> and <std-dev> are the mean of inter-arrival times between requests and the standard deviation of those inter-arrival times.
The <mean> field value is zero if the <request-count> field value is one. The <std-dev> field value is zero if the <request-count> field value is either one or two.
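The statistics described above can be sketched as follows. This is a Python illustration under stated assumptions, not the reqiat.pl code: n requests in one interval are placed at the midpoints of n equal sub-intervals (so a single request lands in the middle), times are in seconds, the standard deviation is the population form, and values are printed with one decimal place. The function names are hypothetical.

```python
from statistics import mean, pstdev

INTERVAL_SECONDS = 900  # 15-minute intervals, as stated in the text


def arrival_times(cells):
    """Expand <interval.reqs> fields into assumed arrival times in seconds."""
    times = []
    for cell in cells:
        interval, reqs = (int(part) for part in cell.split("."))
        step = INTERVAL_SECONDS / reqs
        start = interval * INTERVAL_SECONDS
        # Evenly spaced arrivals; one request lands mid-interval.
        times.extend(start + (j + 0.5) * step for j in range(reqs))
    return times


def reqiat_line(reqmat_line):
    """Produce '<URL> <size> <request-count> <mean> <std-dev>' from a reqmat line."""
    url, _timestamp, size, *cells = reqmat_line.split()
    times = arrival_times(cells)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    avg = mean(gaps) if gaps else 0.0            # zero when there is one request
    dev = pstdev(gaps) if len(gaps) > 1 else 0.0  # zero for one or two requests
    return f"{url} {size} {len(times)} {avg:.1f} {dev:.1f}"


print(reqiat_line("http://a/ 1000 120 0.2 43.1"))
print(reqiat_line("http://b/ 2000 50 0.1"))
```

For the first line, the assumed arrival times are 225, 675 and 39150 seconds after the start of interval zero, giving gaps of 450 and 38475 seconds, a mean of 19462.5 and a standard deviation of 19012.5.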
A longitudinal analysis of request inter-arrival times is accomplished by passing to reqiat.pl() a combined reqmat file as produced by cmbreqmat.pl.