A list compiled by Sophos. The press release talks about the long-known techniques spammers use to disguise their target words. They claim to be able to detect 5.6 billion variations of the word "Viagra", which is probably the number of possible combinations according to some distance metric. But being able to do that looks like a total waste of resources. Testing even a fraction of them would be a bit dumb. Not to mention that they would need to cover at least the top 10 of the list as well. Well, I don't have the details, but it sounds like an unnecessary exaggeration. The top 5:


"Cialis" has the same function as Viagra. The meaning of "milf" is unpublishable (as is half the list). What I don't buy is that "debt" is number 102 or "credit" is in the 64th position!

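To get a feel for how a figure like 5.6 billion could come about, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the substitution table, the separator list and the function names are hypothetical, and nothing here is taken from Sophos, whose actual method is not described in the press release. It counts how many disguised spellings a small per-letter table yields, and shows that one regular expression can cover all of them without testing each variant separately.

```python
import re
from math import prod

# Hypothetical look-alike table (an assumption, not Sophos's data).
SUBSTITUTES = {
    "v": ["v", "V", "\\/"],
    "i": ["i", "I", "1", "!", "l"],
    "a": ["a", "A", "@", "4"],
    "g": ["g", "G", "9"],
    "r": ["r", "R"],
}
# Hypothetical separators a spammer might put between letters ("" = none).
SEPARATORS = ["", " ", ".", "-", "_"]


def count_variants(word):
    """Count the combinations of letter and separator choices for a word."""
    letter_choices = prod(len(SUBSTITUTES.get(c, [c])) for c in word.lower())
    gap_choices = len(SEPARATORS) ** (len(word) - 1)
    return letter_choices * gap_choices


def build_pattern(word):
    """Build one regex that matches every variant counted above."""
    separator = "[" + "".join(re.escape(s) for s in SEPARATORS if s) + "]?"
    letters = [
        "(?:" + "|".join(re.escape(s) for s in SUBSTITUTES.get(c, [c])) + ")"
        for c in word.lower()
    ]
    return re.compile(separator.join(letters))


print(count_variants("viagra"))  # 4500000
print(bool(build_pattern("viagra").fullmatch("V1@gr4")))       # True
print(bool(build_pattern("viagra").fullmatch("v.i.a.g.r.a")))  # True
print(bool(build_pattern("viagra").fullmatch("vinegar")))      # False
```

Even this tiny table gives 4.5 million combinations, so a slightly larger one would reach the billions easily. If the number was computed along these lines, it says more about the arithmetic than about the work the filter does.
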
Now, the part that might be interesting to linguists is that this list can indicate what might be biased in a corpus harvested from the web or from discussion groups (which would be a corpus considered closer to speech).

It is also worth mentioning a recent discussion on the Corpora List about several problems with Google counts.

Sophos articles about spam: Sophos report reveals words that spammers most commonly try to disguise


Interesting list! Especially as the only word from the list I even know the meaning of is "shipping"!

