Troubleshooters.Com Presents

Linux Productivity Magazine

Volume 3 Issue 2, February 2004

Spamassassin

Copyright (C) 2004 by Steve Litt. All rights reserved. Materials from guest authors copyrighted by them and licensed for perpetual use to Linux Productivity Magazine. All rights reserved to the copyright holder, except for items specifically marked otherwise (certain free software source code, GNU/GPL, etc.). All material herein provided "As-Is". User assumes all risk and responsibility for any outcome.


See also Troubleshooting Techniques of the Successful Technologist
and Rapid Learning: Secret Weapon of the Successful Technologist
by Steve Litt



 
THERE IS A STRICT JUNE DEADLINE. THE TIME TO START IS NOW!! -- Laurence Canter
(From the 4/12/1994 Canter and Siegel Greencard Lottery Usenet spam)

Editor's Desk

By Steve Litt
Sanford A. Wallace.

How that name takes me back. Back to a simpler time, a more innocent time. The radio featured the Cranberries, Ace of Base, and Boyz II Men. Your TV sported a brand new sitcom called "Friends". The year was 1994, I was on CompuServe, and I read every email as if it were gold. Even the ads. It was a time when the phrase "You've got mail" made you happy.

Many of those ancient ads came from Sanford A. Wallace. Great ads. Well written ads. Ads you could enjoy.

For a while.

After several months of ads from Sanford A. Wallace, I wrote him a rather blunt email telling him not to send me any more ads. As I remember (and my memory is rather vague), he wrote me back.

Sanford A. Wallace was the proprietor of Cyber Promotions, a pioneer in the field of unsolicited commercial email. He led the way for countless others to follow.

Not that any of this was important. In 1994 I could go through a day's email in 5 minutes. Whole days went by without unsolicited advertisements.

2004. Now we have a name for what Sanford A. Wallace did -- spam. 80% of my email is spam. I get at least 400 spams per day -- way too many to examine closely.

But I can't simply delete casually. Brand new customers contact me through email. If one of those were deleted, it could cost me ten thousand dollars. What's a businessperson to do?

Spam filtering is the answer. Create an automated program that flags probable spam messages in such a way that your email client can place them in a probable spam folder. When going through that folder, you quickly glance to see whether a message looks like something needed, and if not, delete it. I often delete 8 or 10 probable spams at a time, based only on their subjects and senders.

For messages that aren't probable spam, you use more consideration in making the deletion decision.

When a non-spam message is flagged as probable spam, that's called a false positive. It serves as a warning that your spam marking criteria are too aggressive. Put the sender's address in the whitelist, and use that message to train SpamAssassin not to flag similar messages.

Today spammers make war on the public. We maintain blacklists -- they send out emails designed specifically to poison spam detectors into creating false positives so that people will dump their spam detectors. How I long for the innocent days of Sanford A. Wallace.

This issue of Linux Productivity Magazine details SpamAssassin: how to install it, how to configure it, and how to use it. No two SpamAssassin installations are alike, because email is handled so differently from one situation to the next. But this issue will guide you through a few of the most common scenarios.

Using Spamassassin, you can put back the genie that Sanford A. Wallace released in a year when Newt Gingrich briefly became the most powerful politician on the planet.

So kick back, relax, and read this month's Linux Productivity Magazine. And remember, if you're a free software user, contributor, or evangelist, this is your magazine. Enjoy!
Steve Litt is the author of Samba Unleashed.   Steve can be reached at his email address.

Help Publicize Linux Productivity Magazine

By Steve Litt
Loyal readers, I need your help.

For months I've publicized Linux Productivity Magazine, expanding it from a new magazine to a mainstay read by thousands. There's a limit to what I can do alone, but if you take one minute to help, the possibilities are boundless.

If you like this magazine, please report it to one of the Linux magazines. Tell them the URL, why you like it, and ask them to link to it.

I report it to them, but they don't take it very seriously when an author blows his own horn. When a hundred readers report the magazine, they'll sit up and take notice.

Reporting is simple enough. Just click on one of these links, and report the magazine. It will take less than 5 minutes.

News Mag | Submission URL or Email address | Comments
Slashdot | http://slashdot.org/submit.pl | Just fill in the short form.
LinuxToday | http://linuxtoday.com/contribute.php3 | Just fill in the short form.
Linux Weekly News | lwn@lwn.net | Just tell them the URL, why you like it.
NewsForge | webmaster@linux.com | Just tell them the URL, why you like it.
VarLinux | http://www.varlinux.org/vlox/html/modules/news/submit.php | Just fill in the short form.
LinuxInsider.Com, Newsfactor Network | http://www.newsfactor.com/perl/contact_form.pl?to=contact | Just tell them the URL, why you like it.
The Linux Knowledge Portal | webmaster@linux-knowledge-portal.org | Just tell them the URL, why you like it.
OS News | http://www.osnews.com/submit.php | Just tell them the URL, why you like it.
DesktopLinux | http://www.desktoplinux.com/cgi-bin/news_post.cgi | Only for LPM issues involving the Linux desktop, not for programming or server issues.

If you really like this magazine, please take 5 minutes to help bring it to a wider audience. Submit it to one of the preceding sites.
Steve Litt is the founder and acting president of Greater Orlando Linux User Group (GoLUG).   Steve can be reached at his email address.

GNU/Linux, open source and free software

By Steve Litt
Linux is a kernel. The operating system often described as "Linux" is that kernel combined with software from many different sources. One of the most prominent, and oldest of those sources, is the GNU project.

"GNU/Linux" is probably the most accurate moniker one can give to this operating system. Please be aware that in all of Troubleshooters.Com, when I say "Linux" I really mean "GNU/Linux". I completely believe that without the GNU project, without the GNU Manifesto and the GNU/GPL license it spawned, the operating system the press calls "Linux" never would have happened.

I'm part of the press and there are times when it's easier to say "Linux" than explain to certain audiences that "GNU/Linux" is the same as what the press calls "Linux". So I abbreviate. Additionally, I abbreviate in the same way one might abbreviate the name of a multi-partner law firm. But make no mistake about it. In any article in Troubleshooting Professional Magazine, in the whole of Troubleshooters.Com, and even in the technical books I write, when I say "Linux", I mean "GNU/Linux".

There are those who think FSF is making too big a deal of this. Nothing could be further from the truth. The GNU project, combined with Richard Stallman's GNU Manifesto and the resulting GNU-GPL License, are the only reason we can enjoy this wonderful alternative to proprietary operating systems, and the only reason proprietary operating systems aren't even more flaky than they are now.

For practical purposes, the license requirements of "free software" and "open source" are almost identical. Generally speaking, a license that complies with one complies with the other. The difference between these two is a difference in philosophy. The "free software" crowd believes the most important aspect is freedom. The "open source" crowd believes the most important aspect is the practical marketplace advantage that freedom produces.

I think they're both right. I wouldn't use the software without the freedom guaranteeing me the right to improve the software, and the guarantee that my improvements will not later be withheld from me. Freedom is essential. And so are the practical benefits. Because tens of thousands of programmers feel the way I do, huge amounts of free software/open source is available, and its quality exceeds that of most proprietary software.

In summary, I use the terms "Linux" and "GNU/Linux" interchangeably, with the former being an abbreviation for the latter. I usually use the terms "free software" and "open source" interchangeably, as from a licensing perspective they're very similar. Occasionally I'll prefer one or the other depending on whether I'm writing about freedom or business advantage.

Steve Litt is the author of Troubleshooting Techniques of the Successful Technologist.   Steve can be reached at his email address.

Obligatory Abbreviations

By Steve Litt
I wish I didn't have to write this article. In my opinion the abbreviations MTA, MDA and MUA are so similar sounding as to be utterly confusing. But when you hear someone glibly rattle off these abbreviations, perhaps it's best to know them. I try not to use them. Everyone knows what an "email client" is. They might not know it as an "MUA".
[Diagram: Email route from Chandler to Monica]

MTA (Mail Transport Agent)
    Function: Moves email from one host to another via the SMTP protocol.
    Examples: Sendmail, qmail, Exim, Postfix, Exchange
    Label on preceding diagram: SMTP

MDA (Mail Delivery Agent)
    Function: Delivers email to the user's mail queue.
    Examples: Procmail
    Label on preceding diagram: Procmail

MUA (Mail User Agent)
    Function: AKA: email client. This is what you use to read and compose email. Most modern email clients (MUA, if you must) can grab mail from a POP3 or IMAP server.
    Examples: Kmail, mutt, pine, Evolution, Eudora, Outlook
    Label on preceding diagram: Email Client

Pop server
    Function: Just to make things more difficult, the agent serving up data from the server's user queue to email clients doesn't have a cutesy name, but instead is called the Pop server. Ughh!
    Examples: ipop3d (Red Hat, Mandrake)
    Label on preceding diagram: POP3

IMAP server
    Function: Analogous to the Pop server, but uses the IMAP protocol instead.
    Examples: imapd (Red Hat, Mandrake)
    Label on preceding diagram: n/a


I'll try not to use abbreviations MTA, MDA and MUA throughout this document. They're just too similar, and therefore confusing. Whenever possible, I'll use diagrams.
Steve Litt is the author of the Universal Troubleshooting Process Courseware.   Steve can be reached at his email address.

Email Basics

By Steve Litt
NOTE

The following documentation is Sendmail based and Red Hat/Mandrake centric. When this documentation talks about SMTP, it's referring to Sendmail's implementation of SMTP. When this documentation refers to Procmail, it's referring to the program packaged with Red Hat to drop email into local mail queues.

That being said, the principles are sound. If you use qmail, Postfix, exim or whatever, just substitute your SMTP server's components.


Before understanding Spamassassin, you must understand the basics of email transmission.
[Diagram: Email client, SMTP and POP3]

Definition
Email client: A computer program to compose and read email messages. Kmail, pine, mutt, and Outlook are examples of email clients. Most modern email clients have the ability to send a composed email to an SMTP server, and to retrieve an email from a POP3 server. This document assumes your email client has those abilities.



The Email client is how users interface with email. For many users, it's the only visible component of email.

Looking a little deeper, when you send an email after composing it, what you are really doing is pushing the email, as a file, to an SMTP server located at your ISP.


Definition
SMTP server: Simple Mail Transfer Protocol server. A computer program that runs continuously, transferring email. When an email client pushes an email onto the SMTP server, the SMTP server reads the recipient address and pushes the email to the SMTP server on the recipient's ISP, which then drops the email in the correct mailbox. SMTP is described fully in RFC 821.

Sendmail, qmail, Postfix and exim are all examples of SMTP servers. This document is Sendmail-centric, but the principles can be applied universally.

NOTE

In all diagrams in this document, blocks labeled SMTP refer to SMTP servers, not to the SMTP protocol. If one really wanted to get picky, one could make it look like this:
.--------.          .--------.          .--------.    .--------.    .--------.
| Email  |   SMTP   |  SMTP  |   SMTP   |  SMTP  |    |  proc  |     \ Email  \
| client |--------->| server |--------->| server |--->|  mail  |---->/ queue  /
`--------' protocol `--------' protocol `--------'    `--------'    '--------'

In the preceding diagram, on Sendmail systems the "SMTP server" is sendmail.

For simplicity, in this document we leave out the protocol indicator, and abbreviate "SMTP server" as "SMTP":
.--------.          .--------.          .--------.    .--------.    .--------.
| Email  |          |  SMTP  |          |  SMTP  |    |  proc  |     \ Email  \
| client |--------->|        |--------->|        |--->|  mail  |---->/ queue  /
`--------'          `--------'          `--------'    `--------'    '--------'


The following diagram shows the route an email takes from the time Chandler sends it to the time Monica opens it.

[Diagram: Email route from Chandler to Monica]

Chandler composes his email on his email client (kmail, Evolution, mutt, Eudora, Outlook), and sends it. It's sent to the SMTP server on the ISP that Chandler uses. Chandler's SMTP evaluates the message, notes that it's not destined for anyone local, and retransmits it, this time to the SMTP server at Monica's ISP. Monica's SMTP evaluates the message and deduces it IS for someone local, namely, Monica. So Monica's SMTP passes Chandler's email message to the procmail program, which deposits the email in Monica's mail queue  on her ISP's server.

This queue would typically be /var/spool/mail/monica, but this is configurable. Some Sendmail configurations deposit email directly in the user's home directory tree. For the remainder of this document we'll assume that incoming mail is stored in a file whose name is the same as the receiving user, and that this file is kept in directory /var/spool/mail. Note that on many systems, symlink /var/mail points to directory /var/spool/mail. For brevity's sake, this document often refers to the shorter symlink.
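As a rough illustration of what "a file whose name is the same as the receiving user" means in practice, here is a minimal Python sketch that lists the messages in such a queue file. It assumes the queue is in the traditional mbox format; the addresses and the temporary file standing in for /var/spool/mail/monica are made up for the demonstration.

```python
import mailbox
import os
import tempfile

def list_queue(path):
    """Return (From, Subject) pairs for every message in an mbox file."""
    box = mailbox.mbox(path, create=False)
    try:
        return [(m.get("From", ""), m.get("Subject", "")) for m in box]
    finally:
        box.close()

# A throwaway queue file with one message (hypothetical addresses).
demo = (
    "From chandler@example.com Sat Jan 31 20:14:38 2004\n"
    "From: chandler@example.com\n"
    "To: monica@example.com\n"
    "Subject: Dinner\n"
    "\n"
    "See you at eight.\n"
)
with tempfile.TemporaryDirectory() as tmp:
    queue = os.path.join(tmp, "monica")  # stands in for /var/spool/mail/monica
    with open(queue, "w") as f:
        f.write(demo)
    entries = list_queue(queue)

print(entries)
```

On a real system you would pass the actual queue path to list_queue, provided you have permission to read it.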

At this point, Monica could read Chandler's message using her ISP's webmail program. But Monica wants to read and store her email locally, so she runs her email client, and clicks the "check mail" icon. Her email client then reaches out on port 110 to contact the POP3 server at her ISP and retrieve the email stored in her folder at the ISP. The email is placed into a folder within the $HOME/Mail tree according to the configuration and filters set up in Monica's email client.

Perhaps Monica wants more control over her email. If so, she could choose not to have her email client retrieve mail from the POP3 server directly. Perhaps she would instead use fetchmail and procmail between the POP3 server and her email client. This gives her many opportunities, including the opportunity to insert spamassassin:

[Diagram: Monica chooses fetchmail to retrieve from POP3]


In the preceding diagram, Monica has chosen to use fetchmail to retrieve email from her ISP's POP3 server, and has chosen to use fetchmail's default behavior of passing the email on to the procmail program. The procmail program's purpose is to deposit incoming email into the user's email queue file, in this case /var/mail/monica, after calling any necessary filtering programs. Monica chooses to use spamassassin as a filtering program called by procmail.

Monica now modifies her email client (kmail for example) so instead of retrieving mail from the ISP's POP3 server, it retrieves it directly from the email queue on her local Linux box.

NOTE

In real life, Monica's fetchmail program would probably send email to some sort of local SMTP server listening on port 25, instead of sending it directly to procmail as shown in the preceding diagram. However, fetchmail can be configured to output directly to the procmail executable, and that is what we have chosen to show throughout this document.

If, on Monica's machine, there were nothing listening on port 25 (in other words, no sort of SMTP server was running), Monica could run her fetchmail like this:
fetchmail -d60 -m "/usr/bin/procmail -d %T"
The preceding command causes fetchmail to dump email directly to the procmail executable, bypassing port 25.

For the purposes of understanding SpamAssassin, a conceptual mapping of fetchmail dumping directly to procmail is easiest to understand.
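The quoting in that fetchmail command matters: everything inside the double quotes reaches fetchmail as a single argument to -m. The following small Python sketch only splits the command line the way a POSIX shell would, to make that visible. It does not run fetchmail.

```python
import shlex

# The command from the note above, exactly as typed at the shell prompt.
command = 'fetchmail -d60 -m "/usr/bin/procmail -d %T"'

# shlex.split applies POSIX shell quoting rules without executing anything.
argv = shlex.split(command)
print(argv)
# → ['fetchmail', '-d60', '-m', '/usr/bin/procmail -d %T']
```

Without the quotes, "-d" and "%T" would be handed to fetchmail as separate arguments instead of being part of the procmail command.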

Notice that an ISP could implement Spamassassin in exactly the same way, except that the SMTP server would replace fetchmail. Such a configuration provides Spamassassin filtering to all the ISP's POP3 users. Let's say that Rachel, Ross, Phoebe and Joey have a different ISP than Chandler. Watch the message flow:

[Diagram: Spamassassin on a mail server]

NOTE

Mandrake and Red Hat use symbolic link /var/mail to point to the real directory, /var/spool/mail. The preceding diagram uses the shorter names (/var/mail/ross etc) to save space. If your distro doesn't have this handy symlink, use /var/spool/mail/ross for Ross's mail queue.


The preceding diagram is just one of many ways to incorporate spamassassin onto a mail server. It has the advantage of not needing to reconfigure the server's sendmail configuration. On the other hand, it might not be the most effective use of Spamassassin, and certainly for performance's sake you'd substitute spamc for spamassassin.

Notice the preceding configuration filters all local email through Spamassassin, while leaving email "just passing through" unchanged. Unless you want to police the net, that's OK. The preceding configuration filters email for spam before it reaches user queues (/var/mail/username) on the server, so any web email system can also take advantage of Spamassassin.

Servers and Protocols

The following is a list of protocols and their usual port numbers:
Protocol         Port   Typical Server
SMTP             25     Sendmail
POP3             110    ipop3d
IMAP             143    imapd
SMTP over SSL    465
POP3 over SSL    995
IMAP over SSL    993

One way to ascertain whether a server is listening to a port is to run nmap on localhost:
Starting nmap V. 3.00 ( www.insecure.org/nmap/ )
Interesting ports on obscured.obscure.fyi (127.0.0.1):
(The 1015 ports scanned but not shown below are in state: closed)
Port State Service
22/tcp open ssh
25/tcp open smtp
80/tcp open http
110/tcp open pop-3
111/tcp open sunrpc
443/tcp open https
631/tcp open ipp
783/tcp open hp-alarm-mgr
1011/tcp open unknown

Nmap run completed -- 1 IP address (1 host up) scanned in 1 second
[root@newbox root]#

The preceding shows the SMTP server running on port 25, and the POP3 server running on port 110.
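If nmap isn't installed, a few lines of Python can answer the narrower question of whether anything accepts TCP connections on a given port. This is only a sketch: the function name is_port_open is made up here, and it checks one port at a time rather than scanning.

```python
import socket

def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        # connect_ex returns 0 on success and an error number otherwise.
        return s.connect_ex((host, port)) == 0

# Check the two ports this article cares about most.
for name, port in [("SMTP", 25), ("POP3", 110)]:
    state = "open" if is_port_open("127.0.0.1", port) else "closed"
    print(f"{name} (port {port}): {state}")
```

What it prints depends entirely on which servers are running on your machine.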
Steve Litt is the author of Rapid Learning: Secret Weapon of the Successful Technologist.   Steve can be reached at his email address.

SpamAssassin Basics

By Steve Litt
At its simplest, SpamAssassin is an executable file, specifically a Perl script that works as a UNIX-style filter. It takes the email file coming in through stdin, parses and evaluates it every way from Sunday, and sends it to stdout, adding a few lines to the mail header. Those lines describe where it was filtered, what tests were performed, and, most importantly, a spam score which correlates fairly well to the likelihood of the email being spam. That spam score header looks like this:
X-Spam-Level: ****
The number of stars represents the likelihood that it's spam. Typically, a plain text email, to a single recipient, not containing words like "mortgage", "Viagra", offers of 15% of a Ugandan fortune, mention of various provocative body parts and the like, will have no stars.  An email with more than 10 stars is extremely likely to be spam. With 20 stars, I often throw it away sight unseen, although this runs the risk of throwing away good email if SpamAssassin somehow goes bad.

In your email client, you could use the following filters in the following order:
X-Spam-Level: ********************
X-Spam-Level: **********
X-Spam-Level: *****
The first one could send everything with 20 or more stars to the bitbucket. The second one might send everything with 10 to 20 stars to a folder called "probablyspam". The third could send everything with 5 to 10 stars to a folder called "likelyspam".
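To make the ordering of those three filters concrete, here is a minimal Python sketch of the same decision. The thresholds and the "probablyspam" and "likelyspam" folder names come from the text above; the function names and the "inbox" fallback are assumptions made for illustration, and a real email client would do this with its own filter rules rather than code.

```python
from email import message_from_string

def spam_stars(raw_message):
    """Count the stars in the X-Spam-Level header (0 if the header is absent)."""
    msg = message_from_string(raw_message)
    return msg.get("X-Spam-Level", "").count("*")

def choose_folder(stars):
    # Test the highest threshold first, just as the filters are ordered above.
    if stars >= 20:
        return "bitbucket"
    if stars >= 10:
        return "probablyspam"
    if stars >= 5:
        return "likelyspam"
    return "inbox"

# A made-up message carrying 12 stars.
sample = "X-Spam-Level: " + "*" * 12 + "\nSubject: Lose Pounds\n\nBody text\n"
print(choose_folder(spam_stars(sample)))
# → probablyspam
```

Checking the 20-star rule first matters: a message with 25 stars also matches the 10-star and 5-star patterns, so the order decides where it lands.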

To foster a better understanding, let's start with some terminology:

Email message: An electronically transmitted message comprised of a header and a body, and possibly one or more attachments.
Email header: The part of the email describing the email itself -- addressees, subject, priority and the like.
Email body: The part of the email containing the message told by the sender to the receiver.
Email attachment: A file that piggybacks along with the email.
Spam: An unasked for, non-personalized commercial email message. Often concerns size of body parts, sexual performance, or mortgage rates.
File: A disk based chunk of data. Email messages are often stored as files (Maildir), or as parts of a larger file (mbox, and inside mail queues).
Data repository: A container for data. A data repository does not alter or act on the data.
Process: A computer program or system that receives data from a data repository or another process, and gives it to a data repository or process, usually after altering or acting on the data.
Push: When a process initiates a transfer of data from itself to another process or a data repository. SMTP (Simple Mail Transfer Protocol) servers push email, but receive email passively.
Pull: When a process sucks data out of another process or a data repository. Note that in any single data transfer between two processes, one process either pushes or pulls, and the other is passive. POP and IMAP servers passively wait for email clients to pull email data from them, and then pull that data from the user's mail queue.
Filter (Unix terminology): A process receiving data from one process or data repository and transferring it to another process or repository. The filter usually either alters the data, takes action on the data, or both.
Filter (Email terminology): The act of directing an email to a certain mailbox, or to the garbage can (/dev/null), based on the attributes of the email. As a noun, it refers to a single configuration to route a certain type of email to a certain mailbox.
Spamassassin: A (Unix style) filter that is passed an email file, analyzes that email file, determines the likelihood that the email file is spam, records that determination in the header of the email file, and then pushes that email file to another process. The receiving process, or a process downstream from that one, typically routes  that email file to a specific mailbox, depending on Spamassassin's determination.
SMTP: Simple Mail Transfer Protocol. A method of transferring emails between two servers (called SMTP servers). Your email client pushes an outgoing email message to an SMTP server, which then pushes the email to the SMTP server local to the recipient. Possibly the email message flows through one or more relay SMTP servers. SMTP servers sit passively until another process pushes email to them, and then either relay the email to another SMTP server, or store the email for later pickup. You can read the details in RFC 821. Chances are the SMTP server you use is located at your ISP.

This document is based on the Sendmail implementation of SMTP. If you have a different SMTP implementation (Postfix, qmail, or exim), the specifics of this document must be changed to fit your situation, but the principles are valid.
POP, POP3: POP stands for Post Office Protocol. POP3 refers to version 3 of that protocol. RFC 1939 describes version 3. Your email client pulls your email from a POP3 server, which in turn obtains your email from the mail queue (typically /var/spool/mail/username) where the SMTP server deposited your email. Chances are the POP server is located at your ISP.
IMAP: IMAP is similar to POP, but is more versatile. For instance, IMAP can send the email client only the header information (Subject, sender, size and the like). From there, the user decides, for each email, whether to delete it, or whether to download the body. This is a huge plus for a person checking email from multiple computers.
Web Mail: A web app enabling a user with a web browser to directly view the email messages in his or her mail queue on the ISP's server.
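The first few terms in that list (message, header, body, attachment) are easy to see by building a message in Python's standard email library. This is only an illustrative sketch. The addresses, subject and attachment are made up.

```python
from email.message import EmailMessage

msg = EmailMessage()

# Header: the part describing the email itself.
msg["From"] = "chandler@example.com"
msg["To"] = "monica@example.com"
msg["Subject"] = "Dinner"

# Body: what the sender is telling the receiver.
msg.set_content("See you at eight.\n")

# Attachment: a file that piggybacks along with the email.
msg.add_attachment(b"menu contents", maintype="application",
                   subtype="octet-stream", filename="menu.txt")

body_text = msg.get_body(preferencelist=("plain",)).get_content()
attachment_names = [part.get_filename() for part in msg.iter_attachments()]

print(msg["Subject"])
print(body_text.strip())
print(attachment_names)
```

Printing msg.as_string() shows the whole thing the way it travels: header lines first, a blank line, then the body and the encoded attachment.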

As mentioned, the spamassassin executable is a filter, meaning it can be inserted anywhere between two processes that pipe email. For my desktop system it looks something like this:

[Diagram: Email system including Spamassassin]

In the preceding diagram, fetchmail pulls your mail from your ISP's pop server, and pushes it on to procmail. Procmail deposits it in the user's mail queue (this is a file) after sending it through a pipeline of various filters. One of those filters is spamassassin, which inserts several spam related headers, including the X-Spam-Level header. This header contains a number of stars corresponding to the likelihood that the email is spam. Kmail, or whatever email client you use, can then deposit email in its own mailboxes based partially on the headers inserted by SpamAssassin.
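To show the shape of a filter in that pipeline, here is a toy Python stand-in: it reads a message from an input stream, adds an X-Spam-Level header, and writes the message to an output stream. The scoring is deliberately silly (one star per occurrence of a suspect word, using a made-up word list) and is in no way how SpamAssassin actually scores mail. Only the in, annotate, out pattern is the point.

```python
import io
from email import message_from_string

# Toy word list standing in for SpamAssassin's real tests (an assumption).
SUSPECT_WORDS = ("mortgage", "viagra")

def toy_filter(raw_message):
    """Return the message with an X-Spam-Level header added."""
    msg = message_from_string(raw_message)
    body = msg.get_payload()
    text = body.lower() if isinstance(body, str) else ""
    stars = sum(text.count(word) for word in SUSPECT_WORDS)
    del msg["X-Spam-Level"]  # drop any copy of the header the sender supplied
    msg["X-Spam-Level"] = "*" * stars
    return msg.as_string()

def run_filter(infile, outfile):
    """Unix-style filter: read one message from infile, write it to outfile."""
    outfile.write(toy_filter(infile.read()))

# In-memory streams stand in for stdin and stdout here.
demo_in = io.StringIO("Subject: Rates\n\nLow mortgage rates! Cheap mortgage!\n")
demo_out = io.StringIO()
run_filter(demo_in, demo_out)
print(demo_out.getvalue())
```

Calling run_filter(sys.stdin, sys.stdout) instead would let the script sit in a pipe, which is the position spamassassin occupies in the diagram.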
Steve Litt is the author of Samba Unleashed.   Steve can be reached at his email address.

Quick and Dirty Spamassassin

By Steve Litt
Time is money. This article gets you up and running with Spamassassin in record time. Here are the steps:
  1. Download, compile and install Spamassassin
  2. Test the Spamassassin program with a file containing a single email message
  3. Pipe each email through the spamassassin command by inserting it somewhere within the travel of emails. For instance, on my box, email goes thru fetchmail to procmail to spamassassin to kmail.
  4. Once step 3 is running well, improve performance by substituting the spamd daemon for the pipe through spamassassin. You'll also need the client side, spamc.

Download, compile and install Spamassassin

Some modern Linux distributions come with Spamassassin. If so, just use your package management to install it. Otherwise, download it from http://www.spamassassin.org. The file will probably be called something like Mail-SpamAssassin-2.61.tar.bz2. Logged in as an ordinary user, put that file in your home directory and execute the following command:
tar xjvf Mail-SpamAssassin-2.61.tar.bz2
The preceding command creates a directory tree called Mail-SpamAssassin-2.61 inside your home directory.

Compiling is pretty easy. Do the following:
If you've done it correctly, typing spamassassin will run a program that appears to do nothing but hang. Then type Ctrl+D in order to send an EOF on stdin, and after a few seconds the program outputs some text. Here's what it did on my computer:
[slitt@newbox slitt]$ spamassassin
X-Spam-Checker-Version: SpamAssassin 2.61 (1.212.2.1-2003-12-09-exp) on
newbox.domain.cxm
X-Spam-Level: **
X-Spam-Status: No, hits=2.9 required=5.0 tests=DATE_MISSING,FROM_NO_LOWER
autolearn=no version=2.61

[slitt@newbox slitt]$

If you get that message, you know you've done something right.

Test the Spamassassin program with a single email file

There are a million ways to get a single email message, but in case you cannot find a way, the following works. This assumes you have the mutt email client -- most people do. In the following procedure, text between angle brackets are comments, and you do not type them. Here's the procedure:
Now test that file with the following command:
cat /home/yourname/Mail/sa_test | spamassassin
Here's the output I got:

[slitt@newbox slitt]$ cat /home/slitt/Mail/sa_test | spamassassin
From slitt@newbox.domain.cxm Sat Jan 31 20:14:38 2004
Return-Path: <slitt@newbox.domain.cxm>
Received: from newbox.domain.cxm (newbox.domain.cxm [127.0.0.1])
by newbox.domain.cxm (8.12.8/8.12.8) with ESMTP id i111EcBq021319
for <slitt@newbox.domain.cxm>; Sat, 31 Jan 2004 20:14:38 -0500
Received: (from slitt@localhost)
by newbox.domain.cxm (8.12.8/8.12.8/Submit) id i111Eck6021317
for slitt@localhost; Sat, 31 Jan 2004 20:14:38 -0500
Date: Sat, 31 Jan 2004 20:14:38 -0500
From: slitt@newbox.domain.cxm
To: slitt@newbox.domain.cxm
Subject: Test
Message-ID: <20040201011438.GA21311@newbox.domain.cxm>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.4i
Content-Length: 11
Lines: 1
X-Spam-Checker-Version: SpamAssassin 2.61 (1.212.2.1-2003-12-09-exp) on
newbox.domain.cxm
X-Spam-Level:
X-Spam-Status: No, hits=0.3 required=5.0 tests=NO_REAL_NAME autolearn=no
version=2.61

Test email

[slitt@newbox slitt]$

Note that the X-Spam-Level has no stars, and the hits is listed as 0.3.
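If you'd rather read those numbers programmatically than by eye, the X-Spam-Status header is easy to pick apart. The following Python sketch assumes the field layout shown in the output above (a Yes/No verdict followed by name=value pairs such as hits, required and tests). The function name is made up, and other SpamAssassin versions may label the fields differently.

```python
import re
from email import message_from_string

def parse_spam_status(raw_message):
    """Extract verdict, hits, required and tests from X-Spam-Status."""
    msg = message_from_string(raw_message)
    # The header may be folded across lines; collapse all whitespace first,
    # then rejoin test names that were split after a comma.
    status = " ".join(msg.get("X-Spam-Status", "").split())
    status = re.sub(r",\s+", ",", status)
    verdict, _, rest = status.partition(",")
    fields = dict(re.findall(r"(\w+)=(\S+)", rest))
    tests = fields.get("tests", "")
    return {
        "is_spam": verdict.strip().lower() == "yes",
        "hits": float(fields.get("hits", "0")),
        "required": float(fields.get("required", "0")),
        "tests": tests.split(",") if tests else [],
    }

# The status header from the test message above, folded as mail headers are.
sample = (
    "Subject: Test\n"
    "X-Spam-Status: No, hits=0.3 required=5.0 tests=NO_REAL_NAME autolearn=no\n"
    "\tversion=2.61\n"
    "\n"
    "Test email\n"
)
print(parse_spam_status(sample))
```

The hits value is the number to watch: the message is flagged as spam once hits reaches the required value.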

Now use mutt to create the following spam-like email to yourself:
To: you@localhost

Subject: ADV: As seen on TV, Free Instant GUARANTEED  H.O.T B.A.B.E.S for  Your Family. Lose  Pounds

Body: This is not spam! We strongly oppose the use of spam email too. This email conforms with House Bill 4176, HR 3113, the UCE-Mail Act. We GUARANTEE it. There is no catch! You can make lots of money. Nobody's perfect, but this is a cure for impotence. Order our report right now, subject to credit approval. This is a free investment with no credit check, and we do accept credit cards. Use our program to consolidate your bills, and stop those creditors from calling. No inventory, just invaluable marketing information with huge potential earnings. It even reverses aging while you sleep. Stop snoring, lose body fat, and get paid for hidden assets with an insurance policy from our affiliate partners.

The preceding is a caricature of a spam. It has it all: snoring, fat, money, insurance, affiliate partners, non-spam protests, and G.A.P.P.Y T.E.X.T. Following the previous instructions, save this one as /home/yourname/Mail/sa_test_spam. Then run it through SpamAssassin, and see what happens:
[slitt@newbox slitt]$ cat /home/slitt/Mail/sa_test_spam | spamassassin
From slitt@newbox.domain.cxm Sat Jan 31 21:10:48 2004
Received: from localhost [127.0.0.1] by newbox.domain.cxm
with SpamAssassin (2.61 1.212.2.1-2003-12-09-exp);
Sat, 31 Jan 2004 21:14:26 -0500
From: slitt@newbox.domain.cxm
To: slitt@newbox.domain.cxm
Subject: ADV: As seen on TV, Free Instant GUARANTEED H.O.T B.A.B.E.S for Your Family. Lose Pounds
Date: Sat, 31 Jan 2004 21:10:48 -0500
Message-Id: <20040201021048.GA21383@newbox.domain.cxm>
X-Spam-Flag: YES
X-Spam-Checker-Version: SpamAssassin 2.61 (1.212.2.1-2003-12-09-exp) on
newbox.domain.cxm
X-Spam-Level: **************************************************
X-Spam-Status: Yes, hits=66.1 required=5.0 tests=ACCEPT_CREDIT_CARDS,
ADVERT_CODE,AS_SEEN_ON,BAD_CREDIT,CONSOLIDATE_DEBT,EARNINGS,EXCUSE_15,
FREE_INVESTMENT,GAPPY_SUBJECT,GUARANTEE,HIDDEN_ASSETS,
INVALUABLE_MARKETING,LOSEBODYFAT,LOSE_POUNDS,NO_CATCH,NO_CREDIT_CHECK,
NO_INVENTORY,NO_REAL_NAME,OUR_AFFILIATE_PARTNERS,REVERSE_AGING,
STOP_SNORING,SUBJ_2_CREDIT,SUBJ_AS_SEEN,SUBJ_FREE_INSTANT,
SUBJ_GUARANTEED,SUBJ_YOUR_FAMILY,THIS_AINT_SPAM,WE_HATE_SPAM,
WHILE_YOU_SLEEP autolearn=no version=2.61
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="----------=_401C6102.5F49A468"

This is a multi-part message in MIME format.

------------=_401C6102.5F49A468
Content-Type: text/plain
Content-Disposition: inline
Content-Transfer-Encoding: 8bit

Spam detection software, running on the system "newbox.domain.cxm", has
identified this incoming email as possible spam. The original message
has been attached to this so you can view it (if it isn't spam) or block
similar future email. If you have any questions, see
the administrator of that system for details.

Content preview: This is not spam! We strongly oppose the use of spam
email too. This email conforms with House Bill 4176, HR 3113, the
UCE-Mail Act. We GUARANTEE it. There is no catch! You can make lots of
money. Nobody's perfect, but this is a cure for impotence. Order our
report right now, subject to credit approval. This is a free investment
with no credit check, and we do accept credit cards. Use our program to
consolidate your bills, and stop those creditors from calling. No
inventory, just invaluable marketing information with huge potential
earnings. It even reverses aging while you sleep. Stop snoring, lose
body fat, and get paid for hidden assets with an insurance policy from
our affiliate partners. [...]

Content analysis details: (66.1 points, 5.0 required)

pts rule name description
---- ---------------------- --------------------------------------------------
2.8 SUBJ_FREE_INSTANT Subject contains "Free Instant"
0.3 NO_REAL_NAME From: does not include a real name
2.6 SUBJ_AS_SEEN Subject contains "As Seen"

The preceding scored 66.1 points. You've proven the concept. Now you're ready to put Spamassassin to work.
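
If you only want the score and not the whole report, you can pull it out of the X-Spam-Status header. The following is a minimal sketch, not part of Spamassassin: the function name spam_hits is made up for illustration, and it assumes the "hits=N required=N" header layout shown in the output above.

```shell
# spam_hits: print the hits= value from the first X-Spam-Status header on stdin.
# Hypothetical helper; assumes the "hits=N required=N" layout shown above.
spam_hits() {
  sed -n 's/^X-Spam-Status:.* hits=\([-0-9.]*\).*/\1/p' | head -n 1
}

# Example: feed it a single header line (prints 66.1)
printf 'X-Spam-Status: Yes, hits=66.1 required=5.0 tests=AS_SEEN_ON\n' | spam_hits
```

With the function defined, something like cat /home/slitt/Mail/sa_test | spamassassin | spam_hits should print just the score of the test message.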

Insert Spamassassin in the email processing chain

Email travels in a series of transfers. Spamassassin is an executable file. It's a filter that modifies the data passing through it. It is implemented by inserting it between two of the transfer points.

Personal Protection

If you're a typical user, you can implement Spamassassin by inserting it as a filter called by the Procmail program. In this scenario, Fetchmail pulls the mail from your ISP's POP3 server, and sends the mail to Procmail. Procmail sends the email through Spamassassin before depositing it in your Linux box's mail queue file /var/spool/mail/yourname. Then, at a later time, your kmail email client (or whatever email client you use) pulls the mail out of /var/spool/mail/yourname.

In your email client, set a filter so that if the X-Spam-Level header contains more than a certain number of stars, the mail is sent to a special folder (usually Trash). In the Trash folder, you can quickly scan to reassure yourself there are no false positives, and then delete the mail entirely.
Diagram of email system including Spamassassin
Configure Fetchmail with the following ~/.fetchmailrc:

~/.fetchmailrc
# Configuration created Mon Jun  2 09:16:09 2003 by fetchmailconf
set postmaster "yourname"
set bouncemail
set no spambounce
set properties ""
set daemon 600
poll pop.myisp.com with proto POP3
user 'yourname@yourdomain.com' there is 'yourname' here limit 50000000 warnings 3200 expunge 60

In the preceding, Fetchmail queries every 600 seconds (10 minutes). It grabs email from POP server pop.myisp.com using the POP3 protocol, pulling mail sent to user 'yourname@yourdomain.com', and sends it on to user yourname on the local box. It does not pull emails over 50000000 bytes, and after each set of 60 emails downloaded it deletes those 60 on the server.

Expunging at intervals of 60 minimizes those horrible situations where you time out or otherwise bomb while downloading 3000 messages after a week's vacation, and each time it bombs you need to redownload everything.

You run Fetchmail as a daemon by configuring your server to do this:
fetchmail -d60
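That command runs it as a daemon and awakens it every 60 seconds. An interval given on the command line takes precedence over the set daemon 600 line in ~/.fetchmailrc, so pick one or the other.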

Unless Fetchmail is configured to do otherwise, its default behavior is to hand the pulled mail to the local mail transport agent listening on port 25, which on many distributions delivers through Procmail (if it exists). Typically, Procmail is implemented as a filter (/usr/bin/procmail). That filter program is configured with /etc/procmailrc. Here's an example:

/etc/procmailrc
LOGFILE=/var/log/procmail.log
VERBOSE=ON

# send to spamassassin
:0 fw
* < 256000
|/usr/bin/spamassassin
# |/usr/bin/spamc -f
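
In the recipe above, the f flag tells Procmail to treat the pipe as a filter, the w flag tells it to wait for the filter's exit code, and the * < 256000 condition restricts the recipe to messages smaller than 256000 bytes. To try the same decision by hand, here is a rough shell sketch. The function name filter_if_small is hypothetical, and this only imitates the recipe's effect; it is not how Procmail works internally.

```shell
# filter_if_small LIMIT CMD...: copy stdin to stdout, running it through CMD
# only when the message is smaller than LIMIT bytes. Larger messages pass
# through untouched. Hypothetical helper for hand testing.
filter_if_small() {
  limit=$1; shift
  tmp=$(mktemp) || return 1
  cat > "$tmp"                          # save the message so it can be measured
  size=$(( $(wc -c < "$tmp") ))         # byte count, whitespace stripped
  if [ "$size" -lt "$limit" ]; then
    "$@" < "$tmp"                       # small enough: run it through the filter
  else
    cat "$tmp"                          # too big: deliver unchanged
  fi
  rc=$?
  rm -f "$tmp"
  return $rc
}
```

For example, filter_if_small 256000 /usr/bin/spamassassin < /home/slitt/Mail/sa_test should behave like the recipe does for that one message.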

The preceding ~/.fetchmailrc and /etc/procmailrc are sufficient to send the mail through Spamassassin, which places Spamassassin headers in every message. The next step is to configure Spamassassin. That is done with ~/.spamassassin/user_prefs:

~/.spamassassin/user_prefs
# SpamAssassin user preferences file.  See 'perldoc Mail::SpamAssassin::Conf'
# for details of what can be tweaked.
###########################################################################

# How many hits before a mail is considered spam.
required_hits 40

spam_level_stars 1
score MICROSOFT_EXECUTABLE 15
score PYZOR_CHECK 5
score RAZOR2_CHECK 5
score HTML_WEB_BUGS 15

In the preceding, required_hits is the number of hits required for Spamassassin to declare the email a spam. The overwhelming majority of emails with a score of 10 are pure spam, so why set this so high? I set it high because I don't want Spamassassin converting the message into an attachment, which is what it does if it declares an email spam. Checking emails for false positives would be VERY slow if one needed to go into an attachment for each one. So I crank it up to 40. If something comes in higher than 40, I feel confident in deleting it without further investigation.

The spam_level_stars is a boolean declaring whether the X-Spam-Level line should have stars. It should, because that's the easiest thing for an email filter to parse for.

The remainder of the preceding file changes the weights of certain tests.
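
To see at a glance which test weights a preferences file overrides, a one-line awk filter is enough. This is only a sketch; the function name list_scores is invented for illustration, and it assumes each override sits on its own line in the form score RULE VALUE, as in the file above.

```shell
# list_scores: print "RULE SCORE" for every score line read from stdin,
# skipping comments and all other settings. Hypothetical helper.
list_scores() {
  awk '$1 == "score" && NF >= 3 { print $2, $3 }'
}
```

Running list_scores < ~/.spamassassin/user_prefs against the file above should list the four overridden tests, one per line, starting with MICROSOFT_EXECUTABLE 15.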

Once Spamassassin has imprinted the email with an X-Spam-Level with a number of stars corresponding to the number of hits, the final step is to configure a filter in your email client to send such emails directly to the trash can. From there, you can quickly scan those files to ascertain there are no false positives, and then delete the bunch of them.
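
If your email client can't count stars, or you want to check a message from the command line, the count is easy to get in the shell. The following is a sketch under the assumption that the header looks like the X-Spam-Level line shown earlier; the function name spam_stars is hypothetical.

```shell
# spam_stars: print the number of stars in the first X-Spam-Level header
# on stdin, or 0 if there is no such header. Hypothetical helper.
spam_stars() {
  stars=$(sed -n 's/^X-Spam-Level: *\([*]*\).*/\1/p' | head -n 1)
  echo "${#stars}"
}

# Example: a header with five stars (prints 5)
printf 'X-Spam-Level: *****\n' | spam_stars
```

A test such as [ "$(spam_stars < somemessage)" -ge 10 ] then makes the same decision a client filter would, with 10 standing in for whatever star count you settle on.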

Thus, you've just set up a Spamassassin for personal protection. But what about protection for everyone? That comes later, but first, let's discuss throughput...

Improve performance by substituting the spamd daemon for the pipe through spamassassin.

Can you imagine starting and stopping spamassassin 150 times when you download 150 emails? I've done it while watching my handy-dandy IceWM CPU monitor. When I download, the constant starting and stopping of the spamassassin filter pegs both CPUs. Spamassassin is a huge program that takes bigtime resources just to load.

What's needed is a Spamassassin that runs constantly. That's what spamd is. But how do you filter emails through a constantly running program? It's simple. spamd is run as a server, and a tiny program called spamc acts as the client. You filter through tiny spamc, and spamc sends the message, via a socket, to spamd. The spamd server filters the email and sends the result back, via the socket, to the spamc client that sent it.

To repeat, you substitute spamc -f for spamassassin in your procmailrc file:

/etc/procmailrc
LOGFILE=/var/log/procmail.log
VERBOSE=ON

# send to spamassassin
:0 fw
* < 256000
# |/usr/bin/spamassassin
|/usr/bin/spamc -f

Here's a diagram:
Spamc and spamd diagram

The result is much lighter CPU usage. But there are two little tricks:
  1. spamd must be running at all times.
  2. Configuration cannot be done in ~/.spamassassin/user_prefs, but instead must be done site wide in /etc/mail/spamassassin/local.cf.
To ensure spamd runs all the time, create or obtain a startup file. Here's a quick and dirty /etc/rc.d/init.d/spamd I created:


#!/bin/sh
#########################################################################
#
# chkconfig: 2345 99 99
# description: "spamd" is the spamassassin daemon
#
# Start / stop script for spamd
#
# In order to be distribution independent, the script knows only a few
# commands:
# start
# stop
# restart
# status
#
##########################################################################

# Special options, adapt this
NAME=spamd
PIDFILE=/var/run/spamd.pid

# where the program is located
PROG=/usr/bin/spamd

# Print [ OK ] or [FAILED] according to the exit status passed as $1
report() {
  if [ "$1" -eq 0 ]; then echo "[ OK ]"; else echo "[FAILED]"; fi
}

case "$1" in
start)
  echo -n "Starting $NAME "
  # -d runs spamd as a daemon, -r writes its process ID to $PIDFILE
  $PROG -d -r $PIDFILE
  RETVAL=$?
  report $RETVAL;;
stop)
  echo -n "Stopping $NAME "
  # Succeed only if the PID file exists and the kill is delivered
  if [ -f $PIDFILE ] && kill `cat $PIDFILE`; then
    RETVAL=0
    rm -f $PIDFILE
  else
    RETVAL=1
  fi
  report $RETVAL;;
restart)
  $0 stop
  sleep 2
  $0 start
  RETVAL=$?;;
status)
  # spamd has no status command, so probe the recorded process ID instead
  if [ -f $PIDFILE ] && kill -0 `cat $PIDFILE` 2>/dev/null; then
    echo "$NAME is running"; RETVAL=0
  else
    echo "$NAME is not running"; RETVAL=1
  fi;;
*)
  echo "Syntax `basename $0` start|stop|status|restart"
  RETVAL=1;;
esac

exit $RETVAL

Perhaps a better idea would be to copy the pld-rc-script.sh or redhat-rc-script.sh or debian-rc-script.sh or netbsd-rc-script.sh or whatever script in the spamd directory of your distribution to /etc/rc.d/init.d/spamd. Either way, after creating that file (and you might be able to find a much better one elsewhere), make sure it's on at boot time with the following command:
chkconfig spamd on
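
Before pointing Procmail at spamc, it's worth confirming that spamd really is up. One simple way, sketched below, is to probe the process ID recorded in the PIDFILE used by the startup file above. The function name spamd_running is hypothetical, and the check assumes you run it as root or as the user that owns spamd, because kill -0 fails on other users' processes.

```shell
# spamd_running [PIDFILE]: succeed if PIDFILE exists and names a live process.
# Hypothetical helper; the default path matches the startup file's PIDFILE.
spamd_running() {
  pidfile=${1:-/var/run/spamd.pid}
  [ -f "$pidfile" ] || return 1
  pid=$(cat "$pidfile")
  # kill -0 sends no signal; it only tests whether the process can be signalled
  [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
}
```

A quick spamd_running && echo up || echo down after booting tells you whether the daemon came up.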
That brings us to modifying /etc/mail/spamassassin/local.cf. Copy all your special configuration lines from ~/.spamassassin/user_prefs into that file. Also, if you will be doing site-wide bayes filtering, insert the following two lines:
bayes_path         /var/sa_bayes
bayes_file_mode 0666
In the preceding case, make sure /var/sa_bayes is world readable and writeable (ugh!). Alternatively, let users maintain their own bayes filters and leave those statements out of /etc/mail/spamassassin/local.cf.
Steve Litt is the author of the Universal Troubleshooting Process Courseware.   Steve can be reached at his email address.

Spamassassin and Sendmail

By Steve Litt
There are many ways to incorporate Spamassassin with Sendmail. Most are beyond the scope of this magazine, but one was discussed in an earlier article -- simply have Procmail run Spamassassin. Once again, here is the diagram:

Spamassassin and Sendmail

Remember, to save space the preceding diagram shows the mail queues as /var/mail/username, which is the symlink on some distros. The real location is /var/spool/mail/username.
Steve Litt is the author of Rapid Learning: Secret Weapon of the Successful Technologist.   Steve can be reached at his email address.

Life After Windows: Who's the Boss?

Life After Windows is a regular Linux Productivity Magazine column, by Steve Litt, bringing you observations and tips subsequent to Troubleshooters.Com's Windows to Linux conversion.
By Steve Litt
My interest in Spamassassin started when two different web hosts could not keep their Spamassassin correctly configured. After researching Spamassassin, I understand the challenge in applying it to huge numbers of remote users. It would be difficult for an ISP to consistently maintain Spamassassin such that most users are pleased most of the time.

As I lost trust in ISPs' Spamassassin, the solution became obvious -- install my own. Now I'm the boss. When the spam marking criteria need change, there's no need to go through a restrictive web interface or beg tech support to please, please, please help me. I just fire up an editor, make the change, and I'm done.

When a bug appears, there's no begging and pleading. I just troubleshoot it with logs and other techniques, and, if absolutely necessary, by going into the source code.

If Spamassassin needs updating, no waiting is necessary. Just download, back up, run perl Makefile.PL, make, make install, and done.

Having your own Spamassassin makes you more portable. Twice in my life as a webmaster it's been necessary to very quickly transfer my websites to a different web host. The more plain-vanilla your website, email, and other accoutrements, the easier it is to move to a new web host. By moving your Spamassassin to your local machine, you can depend on your web host for your core need -- bandwidth.

Life after Windows is being the boss. If you want control over your spam filtering, you download Spamassassin.
Steve Litt is the founder and acting president of Greater Orlando Linux User Group (GoLUG).   Steve can be reached at his email address.

Letters to the Editor

All letters become the property of the publisher (Steve Litt), and may be edited for clarity or brevity. We especially welcome additions, clarifications, corrections or flames from vendors whose products have been reviewed in this magazine. We reserve the right to not publish letters we deem in bad taste (bad language, obscenity, hate, lewd, violence, etc.).


Submit letters to the editor to Steve Litt's email address, and be sure the subject reads "Letter to the Editor". We regret that we cannot return your letter, so please make a copy of it for future reference.

How to Submit an Article

We anticipate two to five articles per issue, with issues coming out monthly. We look for articles that pertain to GNU/Linux or open source. This can be done as an essay, with humor, with a case study, or some other literary device. A Troubleshooting poem would be nice. Submissions may mention a specific product, but must be useful without the purchase of that product. Content must greatly overpower advertising. Submissions should be between 250 and 2000 words long.

Any article submitted to Linux Productivity Magazine must be licensed with the Open Publication License, which you can view at http://opencontent.org/openpub/. At your option you may elect the option to prohibit substantive modifications. However, in order to publish your article in Linux Productivity Magazine, you must decline the option to prohibit commercial use, because Linux Productivity Magazine is a commercial publication.

Obviously, you must be the copyright holder and must be legally able to so license the article. We do not currently pay for articles.

Troubleshooters.Com reserves the right to edit any submission for clarity or brevity, within the scope of the Open Publication License. If you elect to prohibit substantive modifications, we may elect to place editor's notes outside of your material, or reject the submission, or send it back for modification. Any published article will include a two-sentence description of the author, a hypertext link to his or her email, and a phone number if desired. Upon request, we will include a hypertext link, at the end of the magazine issue, to the author's website, providing that website meets the Troubleshooters.Com criteria for links and that the author's website first links to Troubleshooters.Com. Authors: please understand we can't place hyperlinks inside articles. If we did, only the first article would be read, and we can't place every article first.

Submissions should be emailed to Steve Litt's email address, with subject line Article Submission. The first paragraph of your message should read as follows (unless other arrangements are previously made in writing):

Copyright (c) 2003 by <your name>. This material may be distributed only subject to the terms and conditions set forth in the Open Publication License, version Draft v1.0, 8 June 1999 (available at http://www.troubleshooters.com/openpub04.txt/, wordwrapped for readability at http://www.troubleshooters.com/openpub04_wrapped.txt). The latest version is presently available at http://www.opencontent.org/openpub/.

Open Publication License Option A [ is | is not] elected, so this document [may | may not] be modified. Option B is not elected, so this material may be published for commercial purposes.

After that paragraph, write the title, text of the article, and a two sentence description of the author.

Why not Draft v1.0, 8 June 1999 OR LATER

The Open Publication License recommends using the words "or later" to describe the version of the license. That is unacceptable for Troubleshooting Professional Magazine because we do not know the provisions of that newer version, so it makes no sense to commit to it. We all hope later versions will be better, but there's always a chance that leadership will change. We cannot take the chance that the disclaimer of warranty will be dropped in a later version.

Trademarks

All trademarks are the property of their respective owners. Troubleshooters.Com(R) is a registered trademark of Steve Litt.

URLs Mentioned in this Issue

