Latent Semantic Analysis Is Not Bayesian Filtering

Macworld recently ran an article about anti-spam tools for Mac OS X, which incorrectly simplified the world of anti-spam tools down to Boolean, points-based, and Bayesian filters. There are at least two more categories of anti-spam tools.

Two additional categories are distributed recognition, such as the Distributed Checksum Clearinghouse (DCC) and Vipul’s Razor, and latent semantic analysis. I don’t know of any distributed recognition products for the Mac (there’s a very good one for Windows Outlook, SpamNet by Cloudmark), but there certainly is a latent semantic analysis tool — Apple’s Mail in Jaguar!

The simplification (or oversight) is relatively understandable. From an end-user perspective, there’s no meaningful difference between a Bayesian filter and a latent semantic analysis filter, even though the math is very different. It’s not clear which approach will prove better at filtering out spam, even though Mail’s filtering did the best in the article. Seems like it’s good to have both in the fight!
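To make the "very different math" a little more concrete, here is a minimal sketch in Python. It is not how Mail or any Bayesian product is actually implemented; the toy messages and every function name are assumptions made up for illustration, and a simple power iteration stands in for the full singular value decomposition that a real latent semantic analysis tool would use. The Bayesian half scores each word on its own and multiplies the odds together; the latent semantic half squeezes word counts into a couple of "concept" dimensions and compares a message to the spam and non-spam averages in that space.

```python
import math
from collections import Counter

# Hypothetical toy training mail; a real filter trains on the user's own messages.
SPAM = ["buy cheap pills now", "cheap pills cheap offer", "buy now limited offer"]
HAM = ["meeting notes attached", "lunch meeting tomorrow", "notes from tomorrow lunch"]

def tokens(text):
    return text.lower().split()

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# --- Bayesian filtering: each word contributes independent evidence ---
def bayes_spam_probability(text):
    spam_df = Counter(t for d in SPAM for t in set(tokens(d)))
    ham_df = Counter(t for d in HAM for t in set(tokens(d)))
    log_odds = math.log(len(SPAM) / len(HAM))  # prior odds of spam
    for t in set(tokens(text)):
        # Add-one smoothing so unseen words do not zero out the product.
        p_spam = (spam_df[t] + 1) / (len(SPAM) + 2)
        p_ham = (ham_df[t] + 1) / (len(HAM) + 2)
        log_odds += math.log(p_spam / p_ham)
    return 1 / (1 + math.exp(-log_odds))

# --- Latent semantic analysis: compare messages in a reduced "concept" space ---
VOCAB = sorted({t for d in SPAM + HAM for t in tokens(d)})

def term_vector(text):
    counts = Counter(tokens(text))
    return [float(counts[t]) for t in VOCAB]

def latent_basis(doc_vectors, k=2, iterations=200):
    """Top-k directions of the term co-occurrence matrix, found by power
    iteration with deflation (a stand-in for a truncated SVD)."""
    basis = []
    for _ in range(k):
        v = [1.0 / (i + 1) for i in range(len(VOCAB))]  # deterministic start
        for _ in range(iterations):
            # Multiply by (docs^T * docs) without building the matrix.
            scores = [dot(d, v) for d in doc_vectors]
            v = [sum(s * d[i] for s, d in zip(scores, doc_vectors))
                 for i in range(len(VOCAB))]
            # Remove any component along directions already found.
            for b in basis:
                p = dot(v, b)
                v = [x - p * y for x, y in zip(v, b)]
            norm = math.sqrt(dot(v, v))
            v = [x / norm for x in v]
        basis.append(v)
    return basis

def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0

def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def lsa_similarities(text, k=2):
    """Return (similarity to spam centroid, similarity to non-spam centroid)."""
    docs = [term_vector(d) for d in SPAM + HAM]
    basis = latent_basis(docs, k)
    latent = [[dot(d, b) for b in basis] for d in docs]
    spam_center = centroid(latent[:len(SPAM)])
    ham_center = centroid(latent[len(SPAM):])
    message = [dot(term_vector(text), b) for b in basis]
    return cosine(message, spam_center), cosine(message, ham_center)

if __name__ == "__main__":
    print(bayes_spam_probability("limited pills"))
    print(lsa_similarities("limited pills"))
```

In this toy data "limited" and "pills" never appear in the same training message. The Bayesian score simply multiplies the evidence from each word, while the latent semantic score places both words on the same concept axis because they share neighbors such as "buy" and "offer". Both approaches end up flagging the message, which fits the point that the end user sees little difference.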

While I’m posting about it, I should note that the article was written prior to the release of my new favorite anti-spam tool, Spamnix, and so it doesn’t include it in the roundup. From my own experience with Mac OS anti-spam tools, I think it would have done well in the evaluation, with the caveat that it only works with Eudora. Perhaps Geoff Duncan, or someone else at TidBITS, will review it soon and confirm that guess. I know they like Eudora at TidBITS: they literally wrote the book!