Thursday, January 14, 2010

Contextual Spam

While I was posting regularly to this site, I had wondered to myself why more spammers weren't trying out contextual spam. After all, they already had bots out there scanning web pages for email addresses, and they already had bots that were trying to manipulate their SpamAssassin scores down by using Bayesian theory in reverse (well, really more of a Markov Chain I guess)...

So it seemed to me the next step was that they would scan in the text from sites where they get the email address, and then use that text to build up a Markov Chain of text for the email.

Theoretically, that should mean that whatever was generated would click more with the end recipient of the spam. Think about it: if you see an email come in from an address that you never deal with, and it says "Hey Friend!" in the subject, you are likely going to think "Hey Spam!" and delete it (I know I do).
But if it had the name of someone you talk to on a discussion board where your email address was listed, or a subject that you were just talking about in a blog, then you might be more drawn to the email.

So time passed, and within the last 6 months I have seen an absolutely huge increase in spam that does exactly this. At first I thought I was just seeing things, but then I started to see enough references to things I had posted publicly on the web that it became clear this is what at least one bot system out there is doing.

On the good side, they are doing it very poorly - perhaps partially due to poor programming, or perhaps due to the limits of the data. If the bot doesn't have much text to build a database on, then it is going to output fairly garbled text.

I won't go into the nitty gritty details of what is involved since it is boring and I don't really want to tell spammers what they are doing wrong, but the general idea is that you build a Markov transition matrix in which you track the text you are looking at at some level of granularity. I'm guessing that these people are doing it at the word level. You then essentially just count how many times each word shows up in the text directly after the word before it.
Then you work your way back out of it when generating text: given the current word, you pick the next word at random, weighted by the statistical probability that it follows.
And out comes something that looks somewhat like what it learned on (there are ways to greatly improve on that, but that is the general idea).
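
To make that general idea concrete, here is a minimal sketch of a word-level Markov chain in Python. This is only the textbook version of the technique, not what any particular bot actually does, and the function names (build_chain, generate) and the sample sentence are made up for illustration.

```python
import random
from collections import Counter, defaultdict


def build_chain(text):
    """For each word, count how often each other word directly follows it."""
    words = text.split()
    chain = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        chain[prev][nxt] += 1
    return chain


def generate(chain, start, length, rng=None):
    """Walk the chain from `start`, picking each next word at random,
    weighted by how often it followed the current word in the source."""
    rng = rng or random.Random()
    word = start
    out = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:
            break  # dead end: this word was never followed by anything
        candidates = list(followers)
        weights = list(followers.values())
        word = rng.choices(candidates, weights=weights, k=1)[0]
        out.append(word)
    return " ".join(out)


if __name__ == "__main__":
    sample = "the cat sat on the mat the cat ran"  # hypothetical input text
    chain = build_chain(sample)
    print(generate(chain, "the", 8))
```

With so little input the output is mostly a reshuffle of the original sentence, which is the same small-data problem described above.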

What is interesting is that it not only tends to get by the Bayesian-based spam filters, it will also get by the human many times, at least to the point where they open the email.
Of course, once they open it and see that it is junk, the bulk of users will toss it. But spammers survive on the tiny percentage of people who open that email and then actually click through and buy whatever is on the other side.
