OpenNTF.org - The Open Source Community for Collaboration Solutions

Obvious Spam with KS_BL_PROB = 0.0000
By Scott Davis 2 decades ago

I have been running the Bayesian filter for a couple of days now, and it seems to be working fairly well, except that I'm getting a fair number of obvious spam messages passing with a KS_BL_PROB of zero or very close to zero, even though the KS_BL_TOKENS shows many tokens with a 0.99 probability. Is there any way to tell what is causing this?

Could you post the probabilities of the individual tokens here
By Tom Lyne 2 decades ago

Then I can run them through the code to see if there is a bug of some sort.

-Tom

Examples
By Scott Davis 2 decades ago

Example 1:

KS_BL_PROB = 0.0000

KS_BL_TOKENS=

Develop: 0.1336

Larger: 0.4000

Penis: 0.9900

Weeks: 0.5812

dosage: 0.4000

Content-Type: 0.4000

www: 0.1820

href: 0.1336

type: 0.0422

index: 0.0555

table: 0.0522

HTML: 0.0127

cid: 0.0100

quoted-printable: 0.9900

MIME: 0.9900

Example 2:

KS_BL_PROB=0.0000

KS_BL_TOKENS=

hello: 0.4000

there: 0.1026

check: 0.1236

this: 0.5000

out: 0.2121

otjv: 0.4000

Content-Transfer-Encoding: 0.4000

world: 0.2254

Endorsed: 0.1952

text: 0.0808

html: 0.0525

visit: 0.1128

href: 0.0422

title: 0.0243

HGH: 0.9900

I can post some more if you need them.

Thanks,

Scott

kSpam uses combined probabilities
By Tom Lyne 2 decades ago

Taken from http://www.paulgraham.com/naivebayes.html ;

If a and b are the probabilities associated with two independent pieces of evidence, then combined they indicate a probability of:

ab / (ab + (1 - a)(1 - b))
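
With more than two tokens the same rule extends naturally: multiply all the probabilities together for the numerator, and add the product of all the complements in the denominator. Below is a minimal Python sketch of that, assuming the tokens are treated as independent. The function name `combine` is made up for illustration, and this is not kSpam's actual code.

```python
from math import prod  # available in Python 3.8 and later


def combine(probs):
    """Combine independent per-token spam probabilities into one score."""
    spam = prod(probs)                    # a * b * c * ...
    ham = prod(1.0 - p for p in probs)    # (1 - a) * (1 - b) * (1 - c) * ...
    return spam / (spam + ham)


# Two strongly spammy tokens reinforce each other (result is above 0.999).
print(combine([0.99, 0.99]))

# One spammy and one equally hammy token cancel out (result is about 0.5).
print(combine([0.99, 0.01]))

# One 0.99 token against three 0.05 tokens: roughly 0.014.
print(combine([0.99, 0.05, 0.05, 0.05]))
```

A single 0.99 token is easily outvoted by a handful of low-probability tokens, which is consistent with the behaviour described at the top of this thread.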

See also Paul's article on Bayesian filtering ;

http://www.paulgraham.com/spam.html

-Tom

Spam calc
By Scott Kingery 2 decades ago

Did the math with Excel:

Using the probabilities posted by Scott Davis:

hello: 0.4000

there: 0.1026

check: 0.1236

this: 0.5000

out: 0.2121

otjv: 0.4000

Content-Transfer-Encoding: 0.4000

world: 0.2254

Endorsed: 0.1952

text: 0.0808

html: 0.0525

visit: 0.1128

href: 0.0422

title: 0.0243

HGH: 0.9900

Using Paul Graham's explanation http://www.paulgraham.com/naivebayes.html :

ab / (ab + (1 - a)(1 - b))

1.83781E-12 / 0.000231422

Answer:

7.94139E-09

Very close to zero, so not considered spam.
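
One way to see why the lone 0.99 token loses: the combining formula is mathematically equivalent to adding up each token's log-odds, ln(p / (1 - p)), and converting the sum back into a probability. The Python sketch below does that for the Example 2 tokens. The variable and function names are illustrative only, and nothing here is taken from kSpam's source.

```python
from math import exp, log

# Token probabilities from Example 2 earlier in this thread.
example2 = [
    ("hello", 0.4000), ("there", 0.1026), ("check", 0.1236),
    ("this", 0.5000), ("out", 0.2121), ("otjv", 0.4000),
    ("Content-Transfer-Encoding", 0.4000), ("world", 0.2254),
    ("Endorsed", 0.1952), ("text", 0.0808), ("html", 0.0525),
    ("visit", 0.1128), ("href", 0.0422), ("title", 0.0243),
    ("HGH", 0.9900),
]


def log_odds(p):
    """Positive values push towards spam, negative values towards good mail."""
    return log(p / (1.0 - p))


# Sort from most "good mail" evidence to most "spam" evidence.
weights = sorted((log_odds(p), token) for token, p in example2)
total = sum(w for w, _ in weights)

# Convert the summed log-odds back into a probability.
combined = 1.0 / (1.0 + exp(-total))

for w, token in weights:
    print(f"{token:28s} {w:+.2f}")
print(f"total log-odds {total:+.2f}, combined probability {combined:.3e}")
```

HGH contributes about +4.6, but title and href alone contribute roughly -3.7 and -3.1, so the total ends up strongly negative and the combined probability rounds to 0.0000. The exact digits may differ slightly from the spreadsheet figures above.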

calculating probabilities in Excel
By Randy L Dyck 2 decades ago

I've been trying to figure out some of my obvious spam that had been getting through kspam and intended to work out the calculations in Excel. I've included that below.

I finally did get a handle on most of my HTML spam by adding 20, 50 or 100 each of a bunch of HTML formatting codes (e.g. font, FACE, BODY) that don't show up much in normal email but are valid. I added enough of these to goodmail.nsf by pasting the same words into a single text mail over and over.

One note: it seems that to use the new probabilities it's not enough just to let kspam recalc, which it does every 6 hours by default. I had to unload nspam and reload it (or restart Domino). That forced the new calcs to be applied to new email.

Anyway, here is the method I used in Excel.

Token        Probability   1 - Probability
You:         0.2181        0.7819
center:      0.2449        0.7551
idea:        0.0272        0.9728
font:        0.1142        0.8858
div:         0.1005        0.8995
bgcolor:     0.0773        0.9227
corp:        0.01          0.99
FF0000:      0.99          0.01
52:01.0      0.99          0.01
44:01.0      0.99          0.01
51:01.0      0.99          0.01
Verdana:     0.99          0.01
50:00.0      0.01          0.99
Editor:      0.01          0.99
FrontPage:   0.99          0.01

=PRODUCT(E1:E15)

=PRODUCT(F1:F15)

=PRODUCT(E1:E15)/(PRODUCT(E1:E15)+PRODUCT(F1:F15))

Try it, and it works out to within 5 decimal places of what kSpam is reporting as KS_BL_PROB.
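
For anyone without Excel to hand, the same three cell formulas can be mirrored in a few lines of Python. This is only a sketch of the spreadsheet method above, using the probabilities from the table in the same order. The names `col_e`, `col_f` and `combined` are made up to match the spreadsheet columns.

```python
from math import prod  # available in Python 3.8 and later

# Column E: the token probabilities from the table, top to bottom.
col_e = [0.2181, 0.2449, 0.0272, 0.1142, 0.1005, 0.0773, 0.01,
         0.99, 0.99, 0.99, 0.99, 0.99, 0.01, 0.01, 0.99]

# Column F: 1 - probability for each row.
col_f = [1.0 - p for p in col_e]

product_e = prod(col_e)                            # =PRODUCT(E1:E15)
product_f = prod(col_f)                            # =PRODUCT(F1:F15)
combined = product_e / (product_e + product_f)     # the third formula

print(f"{combined:.4f}")
```

Printing to four decimal places matches the format of the KS_BL_PROB values quoted in this thread, which makes it easy to compare against what kSpam reports.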

kSpam.bload recalc
By Tom Lyne 2 decades ago

The period between recalculations of the probabilities can be changed with the KS_BL_PERIOD notes.ini variable. However, the new probabilities don't become effective until kSpam reloads its configuration, which it does every 60 minutes.

So by default the longest time between probability reloads is 7 hours (6 hours for the recalc plus up to 60 minutes for the configuration reload).

-tom
