
Topic	Bayesian spam filter and large java object storage
Name	James B Fraser

So I implemented this idea. I've got a decent Notes-client-based Java spam filter working. I used different base64 code from the one in the previous post. Also, I had to break writes to a rich text field into chunks of 25 KB or smaller.
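For anyone who wants to see the chunking idea in isolation, here is a minimal sketch, not the actual agent code. It assumes a modern JVM with java.util.Base64 and leaves out the Notes API entirely; the class and method names are made up for illustration. In a real agent each chunk would presumably be appended to the rich text item (for example with RichTextItem.appendText) instead of being collected in a list.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

// Hypothetical helper: serialize an object, base64-encode it, and split the
// text into pieces no larger than the 25 KB write limit mentioned above.
final class ChunkedObjectStore {
    // Base64 output is plain ASCII, so one char is one byte here.
    static final int MAX_CHUNK_CHARS = 25 * 1024;

    static List<String> toChunks(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        String encoded = Base64.getEncoder().encodeToString(bytes.toByteArray());
        List<String> chunks = new ArrayList<>();
        for (int i = 0; i < encoded.length(); i += MAX_CHUNK_CHARS) {
            int end = Math.min(encoded.length(), i + MAX_CHUNK_CHARS);
            chunks.add(encoded.substring(i, end));
        }
        return chunks;
    }

    // Reverse of toChunks: join the pieces in order, decode, deserialize.
    static Object fromChunks(List<String> chunks)
            throws IOException, ClassNotFoundException {
        StringBuilder encoded = new StringBuilder();
        for (String chunk : chunks) {
            encoded.append(chunk);
        }
        byte[] raw = Base64.getDecoder().decode(encoded.toString());
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(raw))) {
            return in.readObject();
        }
    }
}
```

The chunks have to be written and read back in the same order, since only the joined string is valid base64.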

See the web links below for details on the theory of my implementation.

One Java agent builds the "probability table" and stores it in a Notes document; a second agent assigns each new incoming mail a "spamminess" number.
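To make the two steps concrete, here is a rough sketch of the calculation as described in Paul Graham's essay (linked below). It is not the agent code, the names are invented for illustration, and it simplifies one detail: Graham ignores tokens seen fewer than five times and gives 0.4 only to never-seen words, while this sketch returns 0.4 in both cases.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the scoring described in "A Plan for Spam".
final class GrahamSketch {
    // Probability that a mail containing this token is spam.
    // nGood and nBad are the numbers of good and spam mails in the
    // corpus; both are assumed to be greater than zero.
    static double tokenProbability(int goodCount, int badCount,
                                   int nGood, int nBad) {
        double g = 2.0 * goodCount; // good counts doubled to reduce false positives
        double b = badCount;
        if (g + b < 5) {
            return 0.4;             // simplification: rare tokens get the "unknown" value
        }
        double goodFreq = Math.min(1.0, g / nGood);
        double badFreq = Math.min(1.0, b / nBad);
        double p = badFreq / (goodFreq + badFreq);
        return Math.max(0.01, Math.min(0.99, p));
    }

    // Combine the 15 token probabilities farthest from 0.5 into one score.
    static double spamminess(Collection<Double> tokenProbs) {
        List<Double> sorted = new ArrayList<>(tokenProbs);
        sorted.sort(Comparator.comparingDouble((Double p) -> -Math.abs(p - 0.5)));
        double prod = 1.0;
        double inverseProd = 1.0;
        for (double p : sorted.subList(0, Math.min(15, sorted.size()))) {
            prod *= p;
            inverseProd *= 1.0 - p;
        }
        return prod / (prod + inverseProd);
    }
}
```

In the essay, a mail is treated as spam when the combined score is above 0.9; the first agent would fill a table of tokenProbability values, and the second would look up each token of a new mail and call something like spamminess on the results.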

I started saving my spam last Thursday morning. (I'm up to 255 spam messages.) I'm just about ready to trust the agent to file away some spam without my looking at it.

Right now I run the agent against new mail manually. My probability hash table is already at 1 MB, and base64-decoding it for every new mail individually seems a bit heavy-handed. So for now the decode happens only when I run the agent, usually once every several emails.

Any ideas on how to efficiently store a >1 MB Java object in a Notes database? I'm going to start filtering out some tokens, but I'll also be adding many more as I get more spam, so I think the object size will only grow.
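One direction that might be worth trying, sketched here outside Notes and without any measurements: gzip the serialized bytes before base64-encoding them, so there is less text to write and decode. The class and method names are hypothetical, and how much it saves will depend on the data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: serialize through a gzip stream so the stored
// bytes (and the base64 text made from them) are smaller.
final class CompressedObjectBytes {
    static byte[] serializeCompressed(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // Closing the outer stream also finishes the gzip trailer.
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new GZIPOutputStream(bytes))) {
            out.writeObject(obj);
        }
        return bytes.toByteArray();
    }

    static Object deserializeCompressed(byte[] data)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                 new GZIPInputStream(new ByteArrayInputStream(data)))) {
            return in.readObject();
        }
    }
}
```

The resulting byte array could then be base64-encoded and chunked exactly as before; compression costs some CPU on each load, so it trades decode time for storage size.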

This message may only make sense in the context of this (rather old) thread. If folks are interested in how to implement a pretty good adaptive spam filter based on http://www.paulgraham.com/spam.html and
http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html
let me know, and I'll put together all the details of my implementation. It's currently ugly code with cryptic comments, so it might take me a few days to clean it up.

Jamie Fraser
jamie.fraser@doblin.com

Bayesian spam filter - Andrew Price - 08-16-2002
  RE: Bayesian spam filter - Joshua b Jore - 08-16-2002
    RE: Bayesian spam filter - Joshua b Jore - 08-16-2002
Bayesian spam filter and large java object storage - James B Fraser - 08-23-2002
        * Cool stuff. Looking forward to trying it out. No... - Andrew Price - 01-15-2003
        Large java object storage - Robert L Shaver - 01-06-2003
  very interesting, lets leverage what we have that ... - Alan Bell - 08-16-2002
  RE: Bayesian spam filter - Scott Kingery - 08-16-2002
  RE: Bayesian spam filter - Scott Kingery - 08-16-2002
    RE: Bayesian spam filter - Dietmar Dumke - 06-23-2003
      I'll go for that - James Redmond - 08-20-2003