2004-08-22

Anti-spam technology improvement

I do not receive much spam on most of my e-mail accounts, largely thanks to server-side anti-spam tools. I never relied much on automatic spam tagging because of false positives, but those are rarer now (they still happen once in a while with commercial e-mail I actually requested). One message I received on my ISP account, correctly tagged as spam, is curious:

''Cheeapest Medicaationns
High Qua1ity
shiiiip to all countriies
70% off discccountt


Cl1ick to ennjoy our offfeer''

The name of the fake sender is also curious: "Claretta Masako", a mix of an Italian and a Japanese name? But it could pass for a perfectly good American name. Anyway, even with the intentional spelling eeeerrrorrrrs, it was tagged as spam. And it did not contain any of the easily identifiable spam terms like "Viagra", "low interest rates", "size does matter", etc. Another possibility is blocking by IP, but I doubt it, since the message apparently came from Cox Communications, and it would be really bad if my provider blocked ALL e-mail coming from them. Now it occurs to me that the filter could be using both kinds of information...

The point is that humans can easily recognize the trick, but in a language like English, which allows lots of doubled consonants (and vowels), writing an algorithm to detect it is tricky, especially when the spammer also puts digits inside the words. A plain dictionary approach is not feasible, because the number of possible variants is too large. Maybe a dictionary combined with some good string matching (regular expressions), probably applied in a quantitative fashion, would work, but I am not sure about that either. I doubt there are people with a linguistics background helping to improve anti-spam technology, and I doubt they are really necessary at all, but the problem certainly has a lot to do with language.
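To make the "dictionary plus string matching" idea a bit more concrete, here is a minimal sketch in Python. It is only an illustration, not a description of how any real filter works. The digit table `LOOKALIKES`, the helper names `skeleton`, `build_index` and `deobfuscate`, and the tiny vocabulary are all assumptions made up for the example. The idea is to reduce both the dictionary words and the incoming words to a "skeleton" (digits mapped back to look-alike letters, runs of a repeated character collapsed to one) and compare skeletons instead of raw spellings.

```python
import re

# Hypothetical digit-to-letter table; real spam uses many more substitutions.
LOOKALIKES = str.maketrans({"1": "l", "0": "o", "3": "e", "4": "a", "5": "s", "7": "t"})

def skeleton(word):
    """Lowercase, undo digit look-alikes, collapse runs of a repeated character."""
    word = word.lower().translate(LOOKALIKES)
    return re.sub(r"(.)\1+", r"\1", word)

def build_index(vocabulary):
    """Map each skeleton to the set of dictionary words that share it."""
    index = {}
    for w in vocabulary:
        index.setdefault(skeleton(w), set()).add(w)
    return index

def deobfuscate(text, index):
    """Return (token, candidates) pairs for tokens that look like disguised words."""
    hits = []
    for token in re.findall(r"[A-Za-z0-9]+", text):
        if token.isdigit():
            continue  # plain numbers such as "70" are not disguised words
        candidates = index.get(skeleton(token))
        # A hit is a token whose skeleton matches a dictionary word
        # but whose actual spelling does not.
        if candidates and token.lower() not in candidates:
            hits.append((token, sorted(candidates)))
    return hits

# Tiny made-up vocabulary, just for the demonstration.
index = build_index(["cheapest", "medications", "quality", "click", "enjoy", "offer"])
for token, guesses in deobfuscate("Cl1ick to ennjoy our offfeer", index):
    print(token, "->", guesses)
```

The number of hits per message could then feed a score, which is the "quantitative" part. The obvious weakness is that collapsing doubled letters also merges legitimate word pairs such as "of" and "off", so this would need to be combined with other evidence.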

LO

5 Comments:

Blogger W1ll13 30% Hacker said...

"The point is that humans can easily recognize the trick..."

On a somewhat similar tack (though mostly dissimilar), I read about professional eBayers using creative spellings to win auctions cheaply. If you were looking for a book by John Searle you might try an eBay search for Surl or Searll and find an auction that all the other correct-spelling Searle bidders missed. So the thing to do would be to come up with an algorithm that would give you all the right wrong spellings.

4:53 PM  
Blogger LO said...

An algorithm to find the eBay misspellings is less tricky. You can go by phonological similarity. As long as you have a list of correspondences between common spellings and the phonemes they represent, that would be easy. Of course, that is sort of a nightmare for English names, but it is potentially a finite problem. For orthographic misspellings, the concept of an orthographic lexical neighborhood could be used: you search for words that differ by adding, deleting, or replacing one or more letters. The phonological one is more interesting to solve, but I am not sure that most errors are phonological.
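As a rough sketch of the orthographic neighborhood idea, the following Python function generates every string that differs from a word by exactly one added, deleted, or replaced letter. The function name `neighbors` and the restriction to the 26 lowercase ASCII letters are assumptions made for the example. Applying the function again to its own output would give the variants that differ by two letters, and so on.

```python
import string

def neighbors(word, alphabet=string.ascii_lowercase):
    """All strings one deletion, replacement, or insertion away from `word`."""
    word = word.lower()
    # Every way of cutting the word into a left part and a right part.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletions = {a + b[1:] for a, b in splits if b}
    replacements = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    insertions = {a + c + b for a, b in splits for c in alphabet}
    # Replacing a letter with itself gives the word back, so drop it.
    return (deletions | replacements | insertions) - {word}

variants = neighbors("searle")
print("searll" in variants)  # one replaced letter
print("surl" in variants)    # a phonological variant, more than one edit away
```

Note that "Searll" falls inside the neighborhood of "Searle" but "Surl" does not, which is why the phonological case needs a different treatment.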

8:11 PM  
Blogger Shock Carlos said...

Hey, quite the interesting blog you have here. You obviously put some thought into it.
I have an ISP blog/site you may find interesting.

If you get a chance, especially if you're looking for online-ISP services or
ISP, please come on by for a visit.

2:44 AM  
Blogger Jack Naka said...

Just thought I would say hi from Japan. Doing some blog surfing and found your site. I'm looking for some cool styles of fashion illustrator for my own blog. There are some really amazing blogs about. If you have time, check out my site; you will find information on fashion illustrator. Well, when I get my blog running, I hope you come and check it out.

11:28 AM  
Blogger Bruce Riddell said...

Does your blog ever get spammed?
If you hate blog spamming then you should read this post! Is there such a thing as spamming a blog?

6:49 AM  
