2004-08-24

Bioinformatics techniques and spam

It seems that the fight against spam is a tough one, and not only Microsoft but also IBM is investing heavily in it. The latest news is the use of DNA sequence analysis algorithms to detect spam:

Instead of chains of characters representing DNA sequences, the research group fed the algorithm 65,000 examples of known spam. Each email was treated as a long, DNA-like chain of characters. Teiresias identified six million recurring patterns in this collection, such as "Viagra".
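
To get a feel for what "recurring patterns" means here, below is a minimal sketch in Python. It is not Teiresias, which is considerably more sophisticated; it only counts fixed-length substrings that show up in more than one message. The function name `recurring_patterns`, its parameters and the three sample messages are all made up for illustration.

```python
from collections import Counter

def recurring_patterns(messages, length=6, min_messages=2):
    """Return fixed-length substrings that occur in at least min_messages messages.

    A naive stand-in for pattern discovery, not the Teiresias algorithm.
    """
    counts = Counter()
    for text in messages:
        text = text.lower()
        # A set makes each message count at most once per substring.
        seen = {text[i:i + length] for i in range(len(text) - length + 1)}
        counts.update(seen)
    return {pattern for pattern, n in counts.items() if n >= min_messages}

# Made-up sample corpus, standing in for the 65,000 known spam messages.
spam = [
    "Buy Viagra now",
    "Cheap viagra here",
    "Viagra for less",
]
patterns = recurring_patterns(spam)
print("viagra" in patterns)  # True
```

With real data one would want variable-length patterns and wildcards, which is where an algorithm designed for biological sequences presumably earns its keep.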


And it seems that the new algorithm is well aware of spammers' tricks:

Chung-Kwei deals with common spammer strategies to dodge pattern-recognition schemes, such as replacing the s with a $, as in "increa$e your $ex power" using its built-in tolerance for different, but functionally equivalent, DNA sequences.
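
One simple way to picture this tolerance is to map look-alike characters to a canonical letter before matching. The sketch below does only that; the `EQUIVALENT` table and the function names are my own assumptions, and the article does not say that Chung-Kwei works this way internally.

```python
# Hypothetical equivalence table: each look-alike maps to a canonical letter.
EQUIVALENT = {"$": "s", "@": "a", "0": "o", "1": "i", "3": "e"}

def canonical(text):
    """Lower-case the text and replace look-alike characters."""
    return "".join(EQUIVALENT.get(ch, ch) for ch in text.lower())

def contains_pattern(text, pattern):
    """Check whether pattern occurs in text once both are canonicalised."""
    return canonical(pattern) in canonical(text)

print(canonical("increa$e your $ex power"))                     # increase your sex power
print(contains_pattern("increa$e your $ex power", "increase"))  # True
```

The obvious cost is that legitimate text containing digits or symbols gets rewritten too, so a real filter would need something smarter than a blanket substitution.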

The success rate is 97%, which is quite good and probably better than most speech recognition algorithms. The false positive rate is around 1 in 6,000, also not bad at all.
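
To put those two numbers side by side, here is a small back-of-the-envelope calculation. It assumes that "success rate" means the fraction of spam that gets caught, and the mailbox sizes are made up.

```python
DETECTION_RATE = 0.97            # fraction of spam caught (my reading of "success rate")
FALSE_POSITIVE_RATE = 1 / 6000   # fraction of good mail wrongly flagged

def expected_errors(spam_received, ham_received):
    """Return (spam that slips through, good mail wrongly flagged)."""
    missed_spam = spam_received * (1 - DETECTION_RATE)
    flagged_ham = ham_received * FALSE_POSITIVE_RATE
    return missed_spam, flagged_ham

# Made-up mailbox: 1,000 spam messages and 6,000 legitimate ones.
missed, flagged = expected_errors(1000, 6000)
print(round(missed), round(flagged))  # 30 1
```

So under these assumptions about 30 spam messages would still reach the inbox, while one legitimate message would end up in the spam folder.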

One possible flaw is that the algorithm has to let through large messages that contain only a few spam-like sequences. It is easy to imagine spammers simply adding a load of gibberish at the end of the e-mail to lower the ratio of spam-like to good text. The position of the spam sequences certainly counts too. I wonder whether in the future we will have to be careful about e-mail content: if a guy advises a friend to try Viagra in a short message, it might be flagged as spam...
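
The padding trick is easy to illustrate with a toy score: the fraction of characters in a message that are covered by known spam patterns. This scoring rule is my own assumption for the sake of the example, not the one described in the article, and the patterns and messages are made up.

```python
def spam_coverage(text, patterns):
    """Fraction of characters in text covered by at least one known pattern.

    Patterns are assumed to be lower-case strings.
    """
    text = text.lower()
    covered = [False] * len(text)
    for pattern in patterns:
        start = text.find(pattern)
        while start != -1:
            for i in range(start, start + len(pattern)):
                covered[i] = True
            start = text.find(pattern, start + 1)
    return sum(covered) / len(text) if text else 0.0

patterns = {"viagra", "cheap"}
short = "cheap viagra"
padded = short + " " + "xq zv kp " * 20  # same spam plus trailing gibberish

print(round(spam_coverage(short, patterns), 2))         # 0.92
print(round(spam_coverage(padded, patterns), 2))        # 0.06
print(round(spam_coverage("try viagra", patterns), 2))  # 0.6
```

With this kind of ratio the padded spam scores far lower than the original, while the innocent two-word tip to a friend scores quite high, which is exactly the worry above.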

Also, there is no mention of the consonant/vowel multiplying technique that I mentioned in the other post.

One funny note: to train the algorithm, non-spam e-mails are used. These are called 'ham'.

The article can be read here.

