Michel Py wrote: 1. Reduce the efficiency of Bayesian-like filters: Trouble with this kind of email is that they are a) of sufficient length b) contain only "real" words c) contain none of the words regularly used by spammers such as the v. word.
Paul Jakma wrote: Good bayesian filters do not score on single words alone, they also score on "phrases" (ie multiple words). Random strings of words will result in neutral scores (presuming those words are also used in non-spam), while the phrases will be slightly higher. Re-used gibberish (ie apparently random) strings of words will result in "phrases" from that gibberish having high scores.
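A minimal sketch of what scoring on "phrases" as well as single words could look like at the tokenizing stage. The function name `tokens` and the two-word limit are assumptions for illustration, not taken from any particular filter:

```python
def tokens(text, max_len=2):
    """Return single words plus runs of adjacent words up to max_len long.

    Hypothetical helper: real filters differ in how they split and
    normalise text; this one only lowercases and splits on whitespace.
    """
    words = text.lower().split()
    out = []
    for n in range(1, max_len + 1):
        # Slide a window of n words across the message.
        for i in range(len(words) - n + 1):
            out.append(" ".join(words[i:i + n]))
    return out
```

Each returned token, whether one word or two, would then get its own spam probability, so a re-used string of gibberish produces the same two-word tokens every time it is sent.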
Indeed; notice I did write "Bayesian-like" and not "Bayesian" and never mentioned anything about good ones or not-as-good ones.
Also, a good bayesian filter should regularly prune its database of phrases (including one-word phrases) that have not had their score updated recently, further reducing "pollution" by random words and phrases. Noise is just noise. The spam-specific stuff will still be statistically significant, hopefully.
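The pruning step described above might be sketched as follows. The database layout (a dict mapping each phrase to a record with a `last_update` day number) and the 30-day default are assumptions made for the example:

```python
def prune_stale(db, today, max_age_days=30):
    """Return a copy of db without phrases not updated in max_age_days.

    db is assumed to map phrase -> {"score": float, "last_update": int},
    where last_update is a day number; both names are hypothetical.
    """
    return {
        phrase: entry
        for phrase, entry in db.items()
        if today - entry["last_update"] <= max_age_days
    }
```

A phrase that turned up once in a random-word spam and was never seen again ages out, while phrases that keep recurring have their `last_update` refreshed and survive.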
I understand this too. However, I think the point you are missing here is the difference between "what could be done" and "what people have". The fact of the matter is that spam messages including a bunch of random dictionary words have had and still have a much higher penetration rate than messages that don't feature it. The proof is in the pudding. And as I said earlier, expect the "bunch of dictionary words" to mutate into a more sophisticated animal that includes correct grammar.
What you and I do or could do (on a small scale) in terms of spam filtering is largely irrelevant. If spammers were smart they would not send us (collectively) spam to begin with, as the only thing it achieves is to get us pissed and make us implement more filtering. In the end, the only thing that matters is neither what we could do about filtering nor how much spam _we_ get, but how many spams joe-six-pack gets per day.
WRT this, although it is true that we have made tremendous progress in terms of filtering, it is equally true that the spammers have made tremendous progress in defeating our counter-measures, resulting in end-users getting unprecedented and still increasing amounts of spam.
The measuring metric here is _not_ that we successfully filter 90% or 95% or 99.99% of spam; this is meaningless. The meaningful metric is: how many spams does joe-six-pack get a day. There is no difference between a) joe-six-pack getting 50 spams a day and us canceling 450 a day and b) joe-six-pack getting 50 spams a day and us canceling 9950 a day. Actually, there might be one: the spammers laughing their bottoms off thinking that filtering 9950 spams per day per user costs us 100 times more than it takes them to send 10000 spams per user per day.
Michel.
On Sun, 4 Apr 2004, Michel Py wrote:
: And as I said earlier, expect the "bunch of dictionary words" to mutate
: into a more sophisticated animal that includes correct grammar.
This has already happened; there is some well known spam that consists of HTML "content" or an image-based ad, with snippets of recent AP newswire stories in the plaintext body section.
--
-- Todd Vierling <tv@duh.org> <tv@pobox.com>
On Sun, 4 Apr 2004, Michel Py wrote:
Indeed; notice I did write "Bayesian-like" and not "Bayesian" and never mentioned anything about good ones or not-as-good ones.
Right, but if we're going to talk about bayesian filtering in general, there's little sense in constraining the discussion to "not-as-good" bayesian filters. The not-as-good filters are obviously doomed to extinction, if they do not improve and become good ones.
I understand this too. However, I think the point you are missing here is the difference between "what could be done" and "what people have".
I don't see why that matters to a general discussion about the limits of, or "attacks" against, bayesian filters.
penetration rate than messages that don't feature it. The proof is in the pudding. And as I said earlier, expect the "bunch of dictionary words" to mutate into a more sophisticated animal that includes correct grammar.
However, if we ignore "probe" mails, these emails will still have a spam payload somewhere in them; otherwise they're not spam. That spam payload should, in theory, stick out like a sore thumb in bayesian terms. The added text will just eventually cause the bayesian filter to score those phrases towards 0.5, i.e. no indicator, and, once again, a good bayesian filter will only consider phrases that are good indicators of spam or non-spam. I.e. drop all phrases whose probability P satisfies x <= P <= y, where x and y are arbitrary thresholds (e.g. 0.1 and 0.9). If we add in the probe emails, these will just help with better weighting of common text towards 0.5.
The problem at the moment is that *not enough* spammers are using the extraneous-added-text bayesian attack for filters to classify common text towards 0.5, and hence for that text to be pruned from the outcome by the x <= P <= y rule. As more spam starts to use this attack, the (half-decent) bayesian filters will become increasingly immune to it.
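The x <= P <= y rule can be sketched like this, using the 0.1 and 0.9 thresholds given as an example above. The function names, the example phrases and the naive product rule used to combine the surviving probabilities are assumptions for illustration; a real filter may combine scores differently:

```python
from math import prod

def informative(probs, low=0.1, high=0.9):
    """Drop phrases whose spam probability P satisfies low <= P <= high."""
    return {t: p for t, p in probs.items() if not (low <= p <= high)}

def spam_score(probs, low=0.1, high=0.9):
    """Combine the surviving probabilities with a naive Bayes product rule.

    Assumes every probability is strictly between 0 and 1 (i.e. the
    filter clamps its estimates); returns 0.5 when nothing is informative.
    """
    kept = list(informative(probs, low, high).values())
    if not kept:
        return 0.5
    spam = prod(kept)
    ham = prod(1 - p for p in kept)
    return spam / (spam + ham)

# Hypothetical per-phrase probabilities for one message.
payload_only = {"cheap pills": 0.99}
padded = {"cheap pills": 0.99, "the": 0.5, "weather": 0.45, "garden": 0.55}
```

Here `spam_score(padded)` equals `spam_score(payload_only)`, because the padding words sit inside the neutral band and never reach the combining step.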
What you and I do or could do (on a small scale) in terms of spam filtering is largely irrelevant.
I don't see why it is irrelevant. What you or I or others use today for our spam filtering is potentially what you or I or others will use tomorrow to protect joe-six-pack customers. I give friends, family and some others email; what I find works well for me, I eventually apply to their email too if I can. If I had to protect customers from spam, I would try to apply the experience gained from using filtering solutions in more personal situations or, if I lacked direct experience, I would try to go by the experience of others.
have made tremendous progress in terms of filtering, it is equally true that the spammers have made tremendous progress in defeating our counter-measures, resulting in end-users getting unprecedented and still increasing amounts of spam.
Right.
The measuring metric here is _not_ that we successfully filter 90% or 95% or 99.99% of spam; this is meaningless. The meaningful metric is: how many spams does joe-six-pack get a day.
If you pick "90%" or "95%", then you can indeed try to imply a percentage metric is meaningless. However, I'm pretty sure that those who receive email via services I admin are much happier that those services catch x% of spam than 0%.
There is no difference between a) joe-six-pack getting 50 spams a day and us canceling 450 a day and b) joe-six-pack getting 50 spams a day and us canceling 9950 a day.
If you wish to compare 90% against 95%, yes. I wonder, though, if we're anywhere near a 90% filter rate (at least not for any useful filtering service that doesn't have a similarly large false-positive rate).
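For reference, the filter rates implied by the two quoted scenarios can be worked out directly. The function name `catch_rate` is hypothetical; the figures are the ones from the quoted text:

```python
def catch_rate(delivered, cancelled):
    """Fraction of all spam sent to a user that the filter cancelled."""
    return cancelled / (delivered + cancelled)

# Scenario a): 50 delivered, 450 cancelled (500 sent per day).
rate_a = catch_rate(50, 450)
# Scenario b): 50 delivered, 9950 cancelled (10000 sent per day).
rate_b = catch_rate(50, 9950)
```

Scenario a) corresponds to a 90% rate and scenario b) to 99.5%, with the same 50 spams reaching the user in both cases.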
Michel.
regards,
--
Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A
warning: do not ever send email to spam@dishone.st
Fortune:
Cats are smarter than dogs. You can't make eight cats pull a sled through the snow.
participants (3)
- Michel Py
- Paul Jakma
- Todd Vierling