Spam Statistics
Explanations are further down.
Current Plots
(Six plots, each available in postscript and png.)
Explanations
I get a fair amount of spam. Not a huge amount from what I hear from others, but a fair amount. The last time I reconfigured everything to chase spam away, I added hooks to track how well the filters were doing. Specifically, every time a piece of mail is actively detected as spam or is manually deleted as spam, I now keep track of it. I say “actively detected” because a fair amount of spam drops through to my spam bucket simply because it's not addressed to me; those messages are not counted.
Reading the Plots
Six plots are generated: two hourly and the rest daily. The hourly plots are a bar chart of the number of e-mail messages detected as spam each hour and a cumulative plot of the total number of spam messages detected up to that point.
Each plot is also broken out by detection mechanism. Between me, ISI, and USC, there are four possible ways spam can get marked. Specifically, the spam detectors are:
- ISI runs a version of spamassassin on the Institute's mail servers. It is the first line of spam defense, and catches much of the inbound spam without my local workstation having to do any work. Its stats are plotted in red, labeled “central.”
- I run a local, more up-to-date version of spamassassin on my workstation, which catches a fair amount of spam that ISI misses. I don't think the central spamassassin is using the learning system, but I know that my local version is, so it's better able to pick out what I think is spam. I suspect that if ISI did not run any spamassassin software, my local version would catch the same amount as both in tandem, but that's just a supposition. Statistics for my local version of spamassassin are plotted in green, labeled “local.”
- USC runs some form of spam detector as well, though I'm not sure what it is. Little of my mail passes through USC's mail servers, though enough does that the USC system picks up a handful of messages a day for me. It's plotted in orange and labeled “USC.”
- Spam that gets through that gauntlet I mark by hand. I read mail in mutt, and I've written a little macro to report a message to spamassassin as spam (so spamassassin can learn to spot similar messages), report it to the statistics system, and delete the mail. The same keystroke that deletes it also reports it for statistics. I don't delete an awful lot by hand, and I've recently gotten better at using spamassassin's learning facilities, so I hope to do even better in the future.
Each bar in the bar chart represents the number of messages detected in the hour starting with the plotted value. The bar over 1 is the messages detected between 1:00 and 1:59 AM. Similarly, the point plotted above 2 on the cumulative plot represents the number of messages detected from midnight until 2:59 AM. The bars for the different categories are stacked, so the total number of messages detected in a given hour is the height of the whole stack. Though stacking makes it hard to read off each component's exact contribution to the total, I think the proportions are more interesting.
The other four plots are daily summaries of the number of spam messages detected each day. Again there are cumulative and single-day summaries, broken out by detection mechanism.
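The bucketing described above can be sketched in a few lines of Python. This is only an illustration, not the perl that actually produces the plots: the (hour, detector) event format and the “manual” label for hand-deleted mail are assumptions made for the example.

```python
# Labels follow the plot legends; "manual" is a made-up label for mail deleted by hand.
DETECTORS = ("central", "local", "USC", "manual")


def hourly_counts(events):
    """Count detections per detector for each hour of one day.

    events is an iterable of (hour, detector) pairs with hour in 0..23
    (an assumed input format, chosen for the example).
    """
    counts = {detector: [0] * 24 for detector in DETECTORS}
    for hour, detector in events:
        counts[detector][hour] += 1
    return counts


def stack_heights(counts):
    """Height of the whole stacked bar for each hour."""
    return [sum(counts[d][hour] for d in DETECTORS) for hour in range(24)]


def cumulative(per_hour):
    """Running total: entry h covers midnight through the end of hour h."""
    total = 0
    running = []
    for n in per_hour:
        total += n
        running.append(total)
    return running
```

With this layout, the bar over 1 is `stack_heights(counts)[1]`, and the cumulative point above 2 is `cumulative(stack_heights(counts))[2]`, which covers midnight until 2:59 AM.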
A Word About False Positives
I don't keep track of the number of false positives (good mail flagged as spam) that this system generates. The manually deleted spam is a measure of false negatives, and I probably should measure false positives as well, but I haven't found a transparent way to do it with my system. For that matter, the false negative count isn't very scientific, either: I occasionally delete a spam message without using the macro for learning and marking, and I may use the macro on mail that's not strictly speaking spam if I want to stop seeing those messages. A real study would address these issues, but I'm just keeping stats for fun and to aid my intuition a little.
Generating the Plots
In principle, generating the plots is easy. In practice, a fair number of systems are brought to bear.
For each message detected as spam, its Subject: line is appended to a simple text file. There are two such files, one for spamassassin's detections and one for mine. The spamassassin file is actually maintained by the procmail mail filtering program. Procmail does a lot for me, including picking my football pool when I don't get to it. Because it's already putting messages that spamassassin has identified as spam aside, it's easy to add a step that puts a copy of the subject into a file.
I read mail using mutt, and I have a macro to pass a message to spamassassin as spam (so spamassassin can spot similar messages in the future) and delete the message. I've added a line to copy the subject out as well. When I do see a spam message that's gotten through, one keypress teaches spamassassin how to spot it, records my deletion, and deletes it.
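Both paths come down to the same small operation: pull the Subject: line out of a message and append it to a file. A minimal Python sketch of that step follows. It is not the procmail recipe or mutt macro actually in use; the function names and the log path argument are hypothetical. The page does not say how arrival times are recorded, so the sketch leaves that out.

```python
import email
from email import policy


def subject_of(raw_message):
    """Return the message's Subject: header as a single line of text."""
    msg = email.message_from_string(raw_message, policy=policy.default)
    subject = msg.get("Subject", "")
    # Collapse folded or repeated whitespace so each log entry stays on one line.
    return " ".join(str(subject).split())


def log_subject(raw_message, log_path):
    """Append the subject to a plain text log, one subject per line."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(subject_of(raw_message) + "\n")
```

A filter or mail reader could pipe the raw message to a small script that calls `log_subject` with the file for the detector in question.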
Once all that data is recorded, plotting is the next step. Plots are generated from the files created above using perl and grap/groff. They get some help from the netpbm suite of tools for the image conversions.
There are no thrilling breakthroughs in the scripting. A perl script analyzes the data, generates a grap script, and calls grap, groff, and the netpbm tools to create the plots. Another perl script puts links to the most recent versions of the plots into this page after any new plots are generated. All of this is coordinated through cron. The most intensive work is the conversion from postscript (grap/groff's usual output format) into png.
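To give a feel for the generate-then-convert step, here is a rough Python sketch. The real scripts are perl and are not shown on this page, so everything here is an assumption: the function names, the axis labels, the `.g` file suffix, and the exact command lines, which are just one plausible arrangement of grap, groff, and netpbm.

```python
def grap_cumulative(totals):
    """Build a small grap script that draws running totals as a solid line."""
    lines = [
        ".G1",
        'label bot "hour"',
        'label left "messages"',
        "draw solid",  # connect the data points that follow
    ]
    lines += [f"{hour} {total}" for hour, total in enumerate(totals)]
    lines.append(".G2")
    return "\n".join(lines) + "\n"


def conversion_commands(stem):
    """Shell pipelines that would turn stem.g into stem.ps and stem.png.

    Assumed invocations: grap emits pic, groff -p runs pic and writes
    postscript, and netpbm's pstopnm and pnmtopng handle the image conversion.
    """
    return [
        f"grap {stem}.g | groff -p -Tps > {stem}.ps",
        f"pstopnm -stdout {stem}.ps | pnmtopng > {stem}.png",
    ]
```

A cron job could write the output of `grap_cumulative` to a file and then run the commands in order; the sketch only builds the strings and does not execute anything.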
