Spam Statistics

Explanations are further down.

Current Plots

[Plots: hourly, weekly, and monthly spam, each with a cumulative version]

Explanations

I get a fair amount of spam. Not a huge amount from what I hear from others, but a fair amount. The last time I reconfigured everything to chase spam away, I added hooks to track how well the filters were doing. Specifically, every time a piece of mail is actively detected as spam or is manually deleted as spam, I now keep track of it. I say “actively detected” because a fair amount of spam drops through to my spam bucket because it's simply not addressed to me; those messages are not counted.

Reading the Plots

Six plots are generated: two hourly and four daily. The hourly plots are a bar chart of the number of e-mail messages detected as spam each hour and a cumulative plot of the total number of spam messages detected up to that point.

Each plot is also broken out by detection mechanism. Between me, ISI, and USC, there are 4 possible ways spam can get marked. Specifically the spam detectors are:

Each bar in the bar chart represents the number of messages detected in the hour starting with the plotted value. The bar over 1 is the messages detected between 1:00 and 1:59 AM. Similarly, the point plotted above 2 on the cumulative plot represents the number of messages detected from midnight until 2:59 AM. The bars for the different categories are stacked, so the total number of messages detected in a given hour is the height of the whole stack. Though stacking makes each component's exact contribution to the total hard to read off, I think the proportions are more interesting.
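To make the hour convention concrete, here is a minimal Python sketch of the binning described above. It is not the script that produces these plots (that one is in perl); the input format, a list of (hour, detector) pairs, and the detector label "manual" are assumptions made up for the illustration.

```python
from collections import Counter

def hourly_counts(events):
    """Bin (hour, detector) pairs into 24 per-hour counters.

    Hour h covers h:00 through h:59, matching the bar plotted over h.
    """
    per_hour = [Counter() for _ in range(24)]
    for hour, detector in events:
        per_hour[hour][detector] += 1
    return per_hour

def cumulative_totals(per_hour):
    """Running total of all detections from midnight through each hour."""
    totals = []
    running = 0
    for counts in per_hour:
        running += sum(counts.values())
        totals.append(running)
    return totals

# Hypothetical sample data: (hour, detector) pairs.
events = [
    (0, "spamassassin"),
    (1, "spamassassin"),
    (1, "manual"),
    (2, "spamassassin"),
]
per_hour = hourly_counts(events)
totals = cumulative_totals(per_hour)

# Height of the whole stack over 1 (1:00 to 1:59 AM).
print(sum(per_hour[1].values()))
# Point over 2 on the cumulative plot (midnight until 2:59 AM).
print(totals[2])
```

Keeping a separate counter per detector in each hour is what lets the bars be stacked: each detector's count is one segment, and their sum is the stack height.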

The other four plots are daily summaries of the number of spam messages detected each day. Again there are cumulative and single-day summaries, broken out by detection mechanism.

A Word About False Positives

I don't keep track of the number of false positives (good mail flagged as spam) that this system generates. The manually deleted spam is a measure of false negatives, and I probably should measure false positives too, but I haven't found a transparent way to do it with my system. For that matter, I occasionally delete a spam message without using the macro for learning and marking, or I may use the macro on mail that's not strictly speaking spam if I want to stop seeing those messages, so the false negative count isn't really very scientific, either. A real study would address these issues, but I'm just keeping stats for fun and to aid my intuition a little.

Generating the Plots

In principle, generating the plots is easy. In practice, a fair number of systems are brought to bear.

For each message detected as spam, its Subject: line is put into one of two simple text files, one for spamassassin and one for me. The spamassassin file is actually maintained by the procmail mail filtering program. Procmail does a lot for me, including picking my football pool when I don't get to it. Because it's already putting messages that spamassassin has identified as spam aside, it's easy to add a step that puts a copy of the subject into a file.
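The logging step itself is small. The real one is a procmail step, which isn't shown here; the following is only a Python sketch of the same idea, pulling the Subject: header out of a raw message and appending it to a log file. The function name, the sample message, and the log file name are all hypothetical.

```python
import email
import os
import tempfile
from email import policy

def log_subject(raw_message: bytes, log_path: str) -> str:
    """Append the message's Subject: header to log_path, one per line."""
    msg = email.message_from_bytes(raw_message, policy=policy.default)
    # A missing header is logged as an empty line so counts stay in step.
    subject = str(msg.get("Subject", "")).replace("\n", " ").strip()
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(subject + "\n")
    return subject

# Hypothetical message and log location, for illustration only.
raw = b"From: a@example.com\r\nSubject: Cheap watches\r\n\r\nbody\r\n"
log_path = os.path.join(tempfile.mkdtemp(), "spamassassin.subjects")
logged = log_subject(raw, log_path)
print(logged)
```

With one subject per line, counting detections is just counting lines in each file.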

I read mail using mutt, and I have a macro to pass a message to spamassassin as spam (so spamassassin can spot similar messages in the future) and delete the message. I've added a line to copy the subject out as well. When I do see a spam message that's gotten through, one keypress teaches spamassassin how to spot it, records my deletion, and deletes it.

Once all that data is recorded, plotting is the next step. Plots are generated from the files created above using perl and grap/groff. They get some help from the netpbm suite of tools for the image conversions.

There are no thrilling breakthroughs in the scripting. The data's analyzed in perl and a grap script is generated. The perl script calls grap, groff and the netpbm tools to create the plots. Another perl script puts links to the most recent versions of the plots into this page after any new plots are generated. All of this is coordinated through cron. The most intensive work is the conversion from postscript – grap/groff's usual output format – into png.
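As a rough illustration of the "generate a grap script" step, here is a Python sketch that turns a list of cumulative totals into grap input. The actual generator is a perl script and certainly differs; the function name, the axis labels, and the choice of grap statements (label bot, label left, draw solid) are my assumptions about a minimal line plot, not the real script.

```python
def grap_script(totals, x_label="Hour", y_label="Cumulative spam"):
    """Return grap input that plots totals[i] against i as a solid line."""
    lines = [
        ".G1",                       # start of a grap block
        f'label bot "{x_label}"',    # assumed axis labels
        f'label left "{y_label}"',
        "draw solid",                # connect the data points
    ]
    lines += [f"{x} {y}" for x, y in enumerate(totals)]
    lines.append(".G2")              # end of the grap block
    return "\n".join(lines) + "\n"

# Hypothetical cumulative totals for the first four hours.
script = grap_script([1, 3, 4, 4])
print(script)
```

The resulting text would then be handed to grap and groff, and the PostScript output converted to PNG with the netpbm tools, as described above.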

This page written and maintained by Ted Faber.
Please mail me any problems with, or comments about this page.