Choosing a Spam Filter
Spam Filter Mechanics
BySpam filters have different modes of operation. Understanding how they work can help you choose which one to use.
These days, the choice of spam filters comes down to Bogofilter and SpamAssassin. Other choices, like DSPAM, are no longer in development. Although a few other choices (e.g., SpamBayes) are available, when an email reader offers a plugin, it is almost always for either Bogofilter or SpamAssassin. However, what is less often discussed is which filter is the best to use in which circumstances.
Instead, most users simply nod solemnly when they read that both involve “Bayesian filtering.” Most of us – including many who use the phrase – have no idea what Bayesian filtering is, but it sounds scientific and reassures us that either choice is acceptable.
In fact, learning that Bogofilter and SpamAssassin are “Bayesian” is useless for choosing between them. To call them Bayesian means nothing more than their structure is based on the the 18th century work of Thomas Bayes in statistics and probability. More specifically, both apply Bayes’ work by collecting words and assigning a probability that each word indicates spam. The more suspect words contained in an email, the greater the chances it is spam. However, to make an informed choice between spam filters requires considerably more detail.
Bogofilter
Bogofilter has its roots in “A Plan for Spam,” a 2002 essay by English developer Paul Graham. After trying to develop filters based on the identifying characteristics of spam, Graham concluded that beyond a certain point, the more rules he added, the more false positives he obtained – that is, the more email messages that were incorrectly identified as spam.
Graham’s solution was to parse his samples of spam and non-spam into tokens, or individual words, and use Bayesian tools to assign each token the possibility that it indicates spam, biasing them slightly in favor of not being spam to minimize false positives. By examining the top 15 tokens in the header and body of each new email message, he calculated the possibility that it was spam. If the probability was greater than 0.9, the message was considered spam.
According to Graham, the advantage of this statistical approach is that it refers to something real – the probability of being spam – and worked with both neutral and spam-indicating words.
However, he also recognized that the more personalized the filter was, the more accurate it would be. For this reason, he also included the possibility of using white lists to indicate non-spam, or “ham,” and black lists to indicate spam.
After reading Graham’s essay, Eric S. Raymond founded the Bogofilter project. Today, Bogofilter is maintained by other developers,and has refined Graham’s calculations based on Gary Robinson’s suggestions. The modern refinements include recognizing MIME types, treating each hostname and IP address as a separate token (rather than dividing them up into separate words), and ignoring dates and Message-IDs as irrelevant. However, the basic approach remains that advocated by Graham.
The mathematically inclined can learn more about how Bogofilter assigns the probability of an email being spam by following the links and reading the man page for the filter. However, the most important point for the average user is that Bogofilter relies on statistical probability, supplemented by each user’s list of spam and ham. Advocates of this approach emphasize its simplicity, as well as its lower number of false positives once it is trained – that is, once the white and black lists are produced. These lists are contained in the .bogofilter folder in your home directory.
SpamAssassin
SpamAssassin takes a different approach from Bogofilter. SpamAssassin’s main approach is to identify the characteristics of spam and then run tests to locate them. Many tests, although not all, rely heavily on regular expressions to catch variations of words and phrases.
You can view the Perl scripts used by SpamAssassin in /usr/share/spamassassin. More than 50 are listed in my current installation of Debian Stable. From their number alone, you can tell they are a varied lot, but they include tests for the common indicators of spam in headings, in the bodies of email, and in HTML code, as well as tests for recognizing offers for anti-viruses, drugs, and pornography. In the English version, some basic tests for French, German, and Italian are also included. They also include a Bayesian probability test similar to Bogofilter’s, as well as white and black lists for individual customization.
Additionally, /etc/spamassassin includes a test developed for Debian that looks for spam involving anacron, cron, and debconf, as well as plugins installed with each recent version.
Each test assigns an email message a positive or negative value, which is added to the results of other tests to determine whether the email is ham or spam. Unlike Bogofilter, exactly what these values represent is uncertain, although considering many users probably have no understanding of Bayesian analysis, much the same could also be said for Bogofilter, of course.
With all these tests, SpamAssassin exemplifies the basic security principle of “defense in depth.” Unlike Bogofilter, it does not rely on one or two approaches, but on a wide variety of defenses. A piece of spam might slip by a single SpamAssassin test, but the odds of it slipping by all of them is unlikely.
Context is Everything
Both Bogofilter and SpamAssassin are available as plugins for major email readers and generally require little customization. Both also have high success rates. However, because black and white lists greatly improve each filter’s accuracy, be wary of the various comparisons online. Your own results are likely to be very different from those posted, especially before you have trained the filter to suit your personal email.
In fact, the filters are so different in their approaches and so dependent on how they are trained that deciding in any objective sense which one is most effective is almost impossible. To some extent, your decision as to which filter to use may depend on whether you prefer Bogofilter’s single, all-encompassing approach or SpamAssassin’s defense in depth.
Even more importantly, your choice will depend on context. To start, if the speed of filtering matters, Bogofilter is much faster than SpamAssassin for the simple reason that it runs fewer tests. If you ordinarily receive several hundred email messages in the first download of the day, SpamAssassin runs so many tests that you might be unable to access your email for five minutes – a delay that you might consider worse than manually deleting spam.
By contrast, in my experience, Bogofilter requires several days of training before it reaches full effectiveness. On the one hand, stopping to train Bogofilter in the middle of other tasks can be a nuisance, especially because it seems to require several examples before it recognizes posts on a mailing list as ham. On the other hand, SpamAssassin is so comprehensive that it generally identifies spam more accurately without training. If you prefer to minimize training, SpamAssassin is probably the filter you want.
Another consideration is how many false positives you have once your filter of choice has been trained. My experience is that, once trained, Bogofilter has fewer false positives. Just as Graham observed,adding more rules, the way SpamAssassin does, beyond a certain point seems to increase false positives.
Still another consideration is that SpamAssassin is reactive. It adds tests in response to the latest tactics used by spammers but appears to be slower to discard tests that are no longer needed – if it does so at all. Similarly, if new spamming tactics appear, you might temporarily have less effective filtering until a new software release is made. However, because Bogofilter relies on probability rather than on spam characteristics, it might not have the same problems – at least not to the same extent.
As you can see, the decision of which filter to use has no absolute answer. However, once you understand how both filters work, you can at least make a more informed choice to accommodate your preferences and your needs. If nothing else, you can choose the lesser of two evils.
Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters
Support Our Work
Linux Magazine content is made possible with support from readers like you. Please consider contributing when you’ve found an article to be beneficial.
News
-
New Linux Kernel Patch Allows Forcing a CPU Mitigation
Even when CPU mitigations can consume precious CPU cycles, it might not be a bad idea to allow users to enable them, even if your machine isn't vulnerable.
-
Red Hat Enterprise Linux 9.5 Released
Notify your friends, loved ones, and colleagues that the latest version of RHEL is available with plenty of enhancements.
-
Linux Sees Massive Performance Increase from a Single Line of Code
With one line of code, Intel was able to increase the performance of the Linux kernel by 4,000 percent.
-
Fedora KDE Approved as an Official Spin
If you prefer the Plasma desktop environment and the Fedora distribution, you're in luck because there's now an official spin that is listed on the same level as the Fedora Workstation edition.
-
New Steam Client Ups the Ante for Linux
The latest release from Steam has some pretty cool tricks up its sleeve.
-
Gnome OS Transitioning Toward a General-Purpose Distro
If you're looking for the perfectly vanilla take on the Gnome desktop, Gnome OS might be for you.
-
Fedora 41 Released with New Features
If you're a Fedora fan or just looking for a Linux distribution to help you migrate from Windows, Fedora 41 might be just the ticket.
-
AlmaLinux OS Kitten 10 Gives Power Users a Sneak Preview
If you're looking to kick the tires of AlmaLinux's upstream version, the developers have a purrfect solution.
-
Gnome 47.1 Released with a Few Fixes
The latest release of the Gnome desktop is all about fixing a few nagging issues and not about bringing new features into the mix.
-
System76 Unveils an Ampere-Powered Thelio Desktop
If you're looking for a new desktop system for developing autonomous driving and software-defined vehicle solutions. System76 has you covered.