Bioinformatics techniques and spam
It seems that the fight against spam is a tough one: not only Microsoft but also IBM is investing heavily in it. The latest news is the use of an algorithm originally built for analysing DNA sequences to detect spam:
Instead of chains of characters representing DNA sequences, the research group fed the algorithm 65,000 examples of known spam. Each email was treated as a long, DNA-like chain of characters. Teiresias identified six million recurring patterns in this collection, such as "Viagra".
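To get a feel for what "recurring patterns" means here, the idea can be sketched in a few lines of Python. This is not Teiresias, which discovers variable-length patterns with wildcards; it is a much simpler stand-in that counts fixed-length substrings shared by several messages. The function name `recurring_patterns`, the `length` and `min_support` parameters, and the toy messages are all made up for illustration.

```python
from collections import Counter

def recurring_patterns(messages, length=6, min_support=2):
    """Return fixed-length substrings that occur in at least min_support messages.

    A naive stand-in for pattern discovery, not the Teiresias algorithm.
    """
    support = Counter()
    for text in messages:
        text = text.lower()
        # A set ensures each message counts a given substring at most once.
        seen = {text[i:i + length] for i in range(len(text) - length + 1)}
        support.update(seen)
    return {p for p, n in support.items() if n >= min_support}

# Toy corpus (hypothetical messages).
spam = ["Buy Viagra now", "cheap viagra here", "meeting at noon"]
patterns = recurring_patterns(spam)
print("viagra" in patterns)
```

On a real corpus the pattern length would not be fixed in advance, which is exactly the part a dedicated pattern-discovery algorithm takes care of.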
And it seems that the new algorithm is quite aware of spammers' tricks:
Chung-Kwei deals with common spammer strategies to dodge pattern-recognition schemes, such as replacing the s with a $, as in "increa$e your $ex power" using its built-in tolerance for different, but functionally equivalent, DNA sequences.
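One way to picture this tolerance is to treat look-alike characters as members of the same equivalence class, much as different DNA sequences can be functionally equivalent. The sketch below is only my guess at the general idea, not how Chung-Kwei is implemented; the `EQUIVALENT` table and the function names are hypothetical.

```python
# Hypothetical equivalence table: each look-alike character maps to the
# letter it usually stands in for. A real filter would need a larger table.
EQUIVALENT = str.maketrans({"$": "s", "0": "o", "1": "i", "@": "a", "3": "e"})

def canonical(text):
    """Lower-case the text and collapse look-alike characters."""
    return text.lower().translate(EQUIVALENT)

def matches(pattern, text):
    """True if the pattern occurs in the text once both are canonicalised."""
    return canonical(pattern) in canonical(text)

print(canonical("increa$e your $ex power"))
print(matches("increase", "INCREA$E now"))
```

A table like this also rewrites legitimate digits and symbols, so it would be used for matching only and never for changing the message itself.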
The success rate is 97%, which is quite good and probably better than most speech recognition algorithms. The false-positive rate is around 1 in 6,000, which is also not bad at all.
One possible flaw is that the algorithm has to let through large messages that contain few spam-like sequences. It is very easy to imagine spammers simply adding a load of gibberish at the end of the e-mail to lower the ratio of spam-like to good text. The position of the spam sequences certainly counts too. I wonder whether in the future we will have to be careful about e-mail content. If a guy advises a friend to try Viagra in a short message, it might be flagged as spam...
Also, there is no mention of the consonant/vowel multiplying technique which I mentioned in the other post.
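The padding worry is easy to illustrate with a toy score. The sketch below assumes the filter scores a message by the fraction of its characters covered by known spam patterns; that scoring rule is my own assumption, not something the article describes, and `spam_ratio` and the sample texts are made up.

```python
def spam_ratio(text, patterns):
    """Fraction of characters covered by at least one known spam pattern.

    Assumes patterns are non-empty, lower-case strings.
    """
    lowered = text.lower()
    covered = [False] * len(lowered)
    for p in patterns:
        start = lowered.find(p)
        while start != -1:
            for i in range(start, start + len(p)):
                covered[i] = True
            start = lowered.find(p, start + 1)
    return sum(covered) / len(lowered) if lowered else 0.0

patterns = {"viagra", "cheap"}
short = "cheap viagra"
# The same message with harmless filler appended at the end.
padded = short + " " + "lorem ipsum dolor sit amet " * 20

print(spam_ratio(short, patterns))
print(spam_ratio(padded, patterns))
```

With this kind of score the short message is almost entirely spam-like, while the padded one drops to a few percent even though the spam content is unchanged. That is the same reason a short, honest message mentioning Viagra would score badly.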
One funny note: to train the algorithm, non-spam e-mails are used. These are called 'ham'.
The article can be read here.