I do not receive that much spam on most of my e-mail accounts, mostly thanks to server-side anti-spam tools. I never relied much on automatic spam tagging because of false positives, but those are rarer now (they still happen once in a while with commercial e-mail I actually requested). One message I received on my ISP account, correctly tagged as spam, is curious:
shiiiip to all countriies
70% off discccountt
Cl1ick to ennjoy our offfeer
The name of the fake sender is also curious: "Claretta Masako", a mix of Italian and Japanese names? But it could pass for a perfectly good American name. Anyway, even with the intentional spelling eeeerrrorrrrs, it was tagged as spam. And it did not contain any of the easily identifiable spam terms like "Viagra", "low interest rates", "size does matter", etc. Another possibility is blocking by IP, but I doubt it, since the message apparently came from Cox Communications, and it would be really bad if my provider blocked ALL e-mail coming from them. Now it occurs to me that the filter could be using both pieces of information...
The point is that humans can easily recognize the trick, but in a language like English, which allows lots of doubled consonants (and vowels), writing an algorithm to detect it is tricky, especially when the spammer also puts digits inside the words. A plain dictionary approach is not feasible, because of the large number of possible variations. Maybe a dictionary combined with some good string matching (regular expressions), probably applied in a quantitative fashion, would work, but I am not sure about that either. I doubt there are people with a linguistic background helping to improve anti-spam technology, and I doubt they are really necessary at all, but the problem certainly has a lot to do with language.
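To make the "dictionary plus regular expressions, in a quantitative fashion" idea a bit more concrete, here is a rough Python sketch. It is only my guess at one way to do it, not how any real spam filter works. The word list, the function names (`skeleton`, `deobfuscate`, `obfuscation_score`) and the idea of scoring a message by its fraction of disguised words are all assumptions made up for this illustration.

```python
import re

# Tiny illustrative word list (an assumption); a real filter would
# load a full dictionary instead.
WORDS = {"ship", "to", "all", "countries", "off", "discount",
         "click", "enjoy", "our", "offer"}


def skeleton(token):
    """Lowercase, drop everything but letters, collapse repeated letters."""
    letters = re.sub(r"[^a-z]", "", token.lower())
    # "(.)\1+" matches a run of the same character; keep one copy of it.
    return re.sub(r"(.)\1+", r"\1", letters)


# Map each dictionary word's skeleton back to the word itself.
SKELETONS = {skeleton(word): word for word in WORDS}


def deobfuscate(token):
    """Return the dictionary word a token seems to disguise, else None."""
    if token.lower() in WORDS:
        return None  # spelled normally, nothing to flag
    return SKELETONS.get(skeleton(token))


def obfuscation_score(text):
    """Fraction of tokens that look like disguised dictionary words."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if deobfuscate(t) is not None)
    return hits / len(tokens)


spam = "shiiiip to all countriies\n70% off discccountt\nCl1ick to ennjoy our offfeer"
print(deobfuscate("Cl1ick"))    # click
print(obfuscation_score(spam))  # 0.5
```

Half of the tokens in that message turn out to be disguised dictionary words, which is the kind of number a filter could weigh together with other evidence. The sketch has obvious weaknesses: different words can share a skeleton ("of" and "off" both collapse to "of"), honest typos get counted too, and digits standing in for letters (a "1" replacing an "l") are simply dropped rather than translated.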