Corpus reset

by Michael Alderete on 4/26/2005

SpamSieve, by far the best anti-spam email tool I’ve used, was updated to version 2.3 yesterday. The biggest change listed was increased accuracy, due to improvements in the tokenizers and parsers. John Gruber reported that the beta versions were running at 99.9% accuracy for him, which is several tenths of a percent above where I’d peaked.

When you get more than one thousand spams a week, you live for improvements of a couple of tenths of a percent. I of course upgraded immediately.

It was a little anticlimactic. Given the minor delta between my 99.5% accuracy and John’s 99.9%, I did not see an immediate difference in my spam protection. It remains very, very good. (No surprise there. With a difference that small, it will take months before I have anything to compare.)
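To put rough numbers on that delta, here is a back-of-the-envelope sketch. It assumes a round 1,000 spams a week and treats the accuracy figure as applying to the spam stream alone, which is a simplification; the function name `expected_errors` is made up for illustration and has nothing to do with SpamSieve itself.

```python
def expected_errors(messages_per_week: int, accuracy: float, weeks: int = 1) -> float:
    """Expected number of misclassified messages over the given number of weeks."""
    return messages_per_week * (1.0 - accuracy) * weeks


# Assumption: a round 1,000 spams per week, with accuracy applied to spam only.
weekly_spam = 1000

for label, accuracy in (("99.5%", 0.995), ("99.9%", 0.999)):
    per_week = expected_errors(weekly_spam, accuracy)
    print(f"{label} accurate: about {per_week:.0f} misfiled per week")
```

Under those assumptions the gap is roughly five misfiled messages a week versus one, which is small enough that ordinary week-to-week variation could easily hide it for a while.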

Tonight I got around to reading the full version history, which explained the improvements in a bit more detail:

Made lots of improvements to SpamSieve’s parsers and tokenizer for better accuracy. To fully take advantage of this, you will need to reset SpamSieve’s corpus and re-train it (e.g. with 300 recent good messages and 600 recent spams). However, this is certainly not required, and I expect that most people will opt for the simpler upgrade of just installing the new SpamSieve application. [Emphasis added.]

Given the amount of spam I get, I want to achieve the maximum level of accuracy, not opt for the easiest upgrade path. So, curious, I checked my Junk mailbox in Eudora: 598 spam messages. A sign from god. I collected 300 recent good messages, backed up my SpamSieve corpus, and reset it. Just finished retraining.

We’ll see how it goes. In a month or two.
