• Obvious Spam with KS_BL_PROB = 0.0000

    By Scott Davis 2 decades ago

    I have been running the Bayesian filter for a couple of days now, and it seems to be working fairly well, except that I'm getting a fair number of obvious spam messages passing with a KS_BL_PROB of zero or very close to zero, even though the KS_BL_TOKENS shows many tokens with a 0.99 probability. Is there any way to tell what is causing this?

    • Could you post the probabilities of the individual tokens here

      By Tom Lyne 2 decades ago

      Then I can run them through the code to see if there is a bug of some sort.



      -Tom

      • Examples

        By Scott Davis 2 decades ago

        Example 1:

        KS_BL_PROB = 0.0000



        KS_BL_TOKENS=

        Develop: 0.1336

        Larger: 0.4000

        Penis: 0.9900

        Weeks: 0.5812

        dosage: 0.4000

        Content-Type: 0.4000

        www: 0.1820

        href: 0.1336

        type: 0.0422

        index: 0.0555

        table: 0.0522

        HTML: 0.0127

        cid: 0.0100

        quoted-printable: 0.9900

        MIME: 0.9900



        Example 2:

        KS_BL_PROB=0.0000



        KS_BL_TOKENS=

        hello: 0.4000

        there: 0.1026

        check: 0.1236

        this: 0.5000

        out: 0.2121

        otjv: 0.4000

        Content-Transfer-Encoding: 0.4000

        world: 0.2254

        Endorsed: 0.1952

        text: 0.0808

        html: 0.0525

        visit: 0.1128

        href: 0.0422

        title: 0.0243

        HGH: 0.9900



        I can post some more if you need them.



        Thanks,

        Scott

        • kSpam uses combined probabilities

          By Tom Lyne 2 decades ago

          Taken from http://www.paulgraham.com/naivebayes.html:



          If a and b are the probabilities associated with two independent pieces of evidence, then combined they indicate a probability of:




                  ab
          -------------------
          ab + (1 - a)(1 - b)
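
For anyone who wants to try the formula directly, here is a minimal Python sketch. It is not kSpam's actual code: the function name `combine` is made up for illustration, and it assumes the two-value formula extends to more tokens by multiplying every term in the same way, which is how the Excel calculations later in this thread treat it. It also assumes every probability is strictly between 0 and 1.

```python
from math import prod  # Python 3.8+


def combine(probs):
    """Combine independent per-token spam probabilities into one score.

    Hypothetical helper for illustration; assumes 0 < p < 1 for every p,
    otherwise the denominator can be zero.
    """
    spam = prod(probs)                    # a * b * ...
    ham = prod(1.0 - p for p in probs)    # (1 - a) * (1 - b) * ...
    return spam / (spam + ham)


# Two strong spam indicators reinforce each other.
print(combine([0.99, 0.99]))
# One strong spam indicator and one equally strong ham indicator cancel out.
print(combine([0.99, 0.01]))
```

The second call shows why a single 0.99 token does not guarantee a high score: every low-probability token in the list pulls the result the other way.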





          See also Paul's article on Bayesian filtering:



          http://www.paulgraham.com/spam.html



          -Tom

          • Spam calc

            By Scott Kingery 2 decades ago

            Did the math with Excel:

            Using the probabilities posted by Scott Davis:

            hello: 0.4000

            there: 0.1026

            check: 0.1236

            this: 0.5000

            out: 0.2121

            otjv: 0.4000

            Content-Transfer-Encoding: 0.4000

            world: 0.2254

            Endorsed: 0.1952

            text: 0.0808

            html: 0.0525

            visit: 0.1128

            href: 0.0422

            title: 0.0243

            HGH: 0.9900



            Using Paul Graham's explanation (http://www.paulgraham.com/naivebayes.html):



            ab

            ——————-

            ab + (1 - a)(1 - b)





            1.83781E-12

            ————–

            0.000231422



            Answer:

            7.94139E-09



            Very close to zero, so not considered spam.
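
To see which tokens are responsible for a low score, the same formula can be rewritten as a sum of log-odds, since the combined odds P / (1 - P) equal the product of the per-token odds p / (1 - p). The Python sketch below is only a diagnostic aid under that assumption, not something kSpam itself provides; the names `log_odds` and `contributions` are made up for illustration. It uses the Example 2 probabilities posted by Scott Davis.

```python
from math import exp, log

# Token probabilities from Example 2, as posted by Scott Davis.
tokens = {
    "hello": 0.4000, "there": 0.1026, "check": 0.1236, "this": 0.5000,
    "out": 0.2121, "otjv": 0.4000, "Content-Transfer-Encoding": 0.4000,
    "world": 0.2254, "Endorsed": 0.1952, "text": 0.0808, "html": 0.0525,
    "visit": 0.1128, "href": 0.0422, "title": 0.0243, "HGH": 0.9900,
}


def log_odds(p):
    """Natural log of p / (1 - p): positive leans spam, negative leans ham."""
    return log(p / (1.0 - p))


# Per-token contribution to the combined score (hypothetical diagnostic).
contributions = {tok: log_odds(p) for tok, p in tokens.items()}
total = sum(contributions.values())
combined = 1.0 / (1.0 + exp(-total))  # summed log-odds back to a probability

# List tokens from most ham-like to most spam-like.
for tok, value in sorted(contributions.items(), key=lambda kv: kv[1]):
    print(f"{tok:28s} {value:+.2f}")
print(f"total log-odds {total:+.2f}, combined probability {combined:.3e}")
```

In this example only HGH has a positive contribution (about +4.6, the log-odds of 0.99), and the many tokens scored well below 0.5 outweigh it, which is why the combined probability ends up so close to zero.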

          • calculating probabilities in Excel

            By Randy L Dyck 2 decades ago

            I've been trying to figure out some of my obvious spam that had been getting through kspam and intended to work out the calculations in Excel. I've included that below.



            I finally did get a handle on most of my HTML spam by adding 20, 50 or 100 each of a bunch of HTML formatting codes (e.g. font, FACE, BODY) that don't show up much in normal email but are valid. I added enough of these to goodmail.nsf by pasting the same words into a single text mail over and over.



            One note: it seems that in order to use the new probabilities, it's not enough just to let kspam recalc, which it does every 6 hours by default. I had to unload nspam and reload it (or restart Domino). That forced the new calcs to be applied against new email.



            Anyway, here is the method I used in Excel.



            Token        Probability   1 - Probability

            You:         0.2181        0.7819
            center:      0.2449        0.7551
            idea:        0.0272        0.9728
            font:        0.1142        0.8858
            div:         0.1005        0.8995
            bgcolor:     0.0773        0.9227
            corp:        0.01          0.99
            FF0000:      0.99          0.01
            52:01.0      0.99          0.01
            44:01.0      0.99          0.01
            51:01.0      0.99          0.01
            Verdana:     0.99          0.01
            50:00.0      0.01          0.99
            Editor:      0.01          0.99
            FrontPage:   0.99          0.01

            =PRODUCT(E1:E15)
            =PRODUCT(F1:F15)
            =PRODUCT(E1:E15)/(PRODUCT(E1:E15)+PRODUCT(F1:F15))
            



            Try it, and it works out to within 5 decimal places of what kSpam is reporting as KS_BL_PROB.
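
The same spreadsheet layout can be mirrored in a few lines of Python for anyone without Excel to hand. This is only a sketch of the method described above: the variable names (`col_e`, `col_f`, `ks_bl_prob`) are made up to match the Excel columns and the kSpam field, and the token names are copied exactly as listed, including the ones that look as if the spreadsheet reformatted them.

```python
from math import prod  # Python 3.8+

# (token, probability) rows copied from the table above.
rows = [
    ("You", 0.2181), ("center", 0.2449), ("idea", 0.0272), ("font", 0.1142),
    ("div", 0.1005), ("bgcolor", 0.0773), ("corp", 0.01), ("FF0000", 0.99),
    ("52:01.0", 0.99), ("44:01.0", 0.99), ("51:01.0", 0.99), ("Verdana", 0.99),
    ("50:00.0", 0.01), ("Editor", 0.01), ("FrontPage", 0.99),
]

col_e = [p for _, p in rows]        # "Probability" column
col_f = [1.0 - p for p in col_e]    # "1 - Probability" column

product_e = prod(col_e)             # =PRODUCT(E1:E15)
product_f = prod(col_f)             # =PRODUCT(F1:F15)

# Same expression as the third Excel formula; the name is illustrative only.
ks_bl_prob = product_e / (product_e + product_f)
print(f"{ks_bl_prob:.4f}")
```

Changing a single row, for example dropping one of the 0.01 tokens, and rerunning gives a quick feel for how much each token moves the final score.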

            • kSpam.bload recalc

              By Tom Lyne 2 decades ago

              The period between recalculating the probabilities can be changed by the KS_BL_PERIOD notes.ini variable. However, the new probabilities don't become effective until kSpam reloads its configuration, which it does every 60 minutes.



              So by default the longest time between probability reloads is 7 hours.



              -tom