• Bayesian becomes ineffective (MIME & Token-limit)

    By Daniel Stelter 2 decades ago

    I've been running kSpam for a long time now (great tool!) and I

    think I have it well under control. ;)

    (for reference: running kSpam 1.6b2 on Notes 7.0.2FP3 on W2k.)



    For some weeks now I've noticed that the Bayesian probability system

    is becoming less and less effective.

    The problem is with MIME encoded mails.

    kSpam limits the Bayesian tokens to 15 per mail, and I have seen that

    when receiving MIME mails (spam), most of the tokens are useless ones like

    "Content-Type", "multipart", "Content-Transfer-Encoding",

    "NextPart" and other "-----" boundary stuff.

    As far as I can see, kSpam takes the RAW-Text (via NSFItemGetText)

    of the mail and goes from top-to-bottom to get the first 15 Tokens…

    The "Subject" is picked up fine, but after that

    the mail body starts with its MIME encoding text (part headers and boundaries).

    So before kSpam gets to the real body content, the 15 token slots are

    already filled. Hm…
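
    To make the effect concrete, here is a minimal sketch in Python (not kSpam's actual code). It assumes the behaviour described above: the raw text is scanned from top to bottom and the first 15 distinct tokens win. The sample mail, the regex tokenizer and the name first_tokens are made up for illustration.

```python
import re

TOKEN_LIMIT = 15  # the per-mail limit described in the post

# Hypothetical raw text of a small multipart spam mail.
SAMPLE_RAW = """Subject: Cheap pills
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="----=_NextPart_000"

This is a multi-part message in MIME format.

------=_NextPart_000
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

Buy cheap pills online today
"""


def first_tokens(raw_text, limit=TOKEN_LIMIT):
    """Return the first `limit` distinct tokens, scanning top to bottom."""
    seen = []
    # Assumed tokenizer: a letter followed by letters, digits or hyphens.
    for tok in re.findall(r"[A-Za-z][A-Za-z0-9-]*", raw_text):
        if tok not in seen:
            seen.append(tok)
        if len(seen) == limit:
            break
    return seen


tokens = first_tokens(SAMPLE_RAW)
print(tokens)
```

    With this sample, all 15 slots are used up by the subject, the MIME headers and the multipart preamble, so body words such as "online" never reach the calculation.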

    I tried several things… I wrote an agent to get rid of (better: convert)

    MIME mails to RichText (with exactly one Body item containing the spam message) in

    the mailspam.nsf, so that bload calculates only "real" spam text. So at least I

    now have real spam tokens with real probabilities.

    That doesn't help much if these tokens are never "touched" by the calculation

    for incoming mails, I know…

    Even setting up Bayesian to ignore the tokens described above will not help, because

    the tokens themselves are not ignored, only their probabilities (think about THIS!).



    So what…I don't know…

    Is anybody experiencing similar things?



    I know (do I?) that kSpam is no longer in development…it's really a pity…

    so what chances do we (okay, I) have of keeping kSpam alive?



    -Daniel.

    • kspam is alive

      By Nico Vis 2 decades ago

      Hi Daniel,

      thank you for your post.

      Even though the last version was released in October-November, we are always looking for new ideas to implement.



      The calculation of the bayesian tokens is always a "problem". Depending on the method you use there are good and bad points.

      For the Bayesian part I populate both databases with Internet mail only, so that "Content-Type" and similar tokens are balanced. In addition, you can set these tokens to be ignored in the configuration document.

      Sometimes it is useful to tokenize MIME messages, because parts of encoded images and files can be tokenized (it works with viruses), which helps to block later messages that have different image or zip names but the same content.
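
      As a rough illustration of what "balanced" means here, the following Python sketch uses a generic naive-Bayes scheme, which is an assumption and not necessarily the formula kSpam uses. The function names and all counts are hypothetical.

```python
def token_probability(spam_count, ham_count, n_spam, n_ham):
    """Spam probability of one token from its relative frequency per corpus."""
    spam_freq = spam_count / n_spam
    ham_freq = ham_count / n_ham
    if spam_freq + ham_freq == 0:
        return 0.5  # token never seen: treat it as neutral
    return spam_freq / (spam_freq + ham_freq)


def combine(probabilities):
    """Naive-Bayes combination of the per-token probabilities of one mail."""
    spam = 1.0
    ham = 1.0
    for p in probabilities:
        spam *= p
        ham *= 1.0 - p
    return spam / (spam + ham)


# "Content-Type" occurs in every mail of both databases: balanced, so neutral.
p_header = token_probability(1000, 1000, n_spam=1000, n_ham=1000)

# A hypothetical body word seen in 400 spam mails and 10 good mails.
p_word = token_probability(400, 10, n_spam=1000, n_ham=1000)

print(p_header)                         # 0.5
print(combine([p_header] * 15))         # 0.5, the mail looks neutral
print(combine([p_header] * 14 + [p_word]))
```

      In this scheme a balanced token scores 0.5 and does not push the result either way, but it still occupies one of the 15 slots, so a mail whose slots hold only such tokens ends up at 0.5.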



      We'll think about it.



      Regards



      Nico

      • I'm glad to hear

        By Daniel Stelter 2 decades ago

        Hi Nico -

        thank you for reading… ;)



        Sure, you're right. There are good reasons for keeping MIME messages tokenized. But at least on my side, other software is taking care of viruses…

        With the "Ignore Token" option it's up to the (kSpam) admin whether he wants to deal with the "Content-Type" stuff or not… that's good!

        I'm still playing with adding ignore tokens to the conf doc, but the ignored tokens still fill up the 15 slots available for the probability calculation.

        Looking at it from the "outside", an easy solution that fits all needs would be an option to really ignore the ignored tokens…so that in the end, each mail has 15 "real" tokens… but I think on the programming side it's not so easy. As far as I understand, bload only looks at the ignore tokens at startup, while loading/calculating tokens…to really ignore tokens, bload would have to check the ignore list in real time when receiving mails…tricky…

        What could (maybe?) be easier…: what about increasing the number of tokens a mail gets? Maybe configurable in the conf doc. I'm not sure what other effects this might have (performance? memory usage?)…
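
        Both ideas above (really skipping ignored tokens, and a configurable token count) can be sketched in a few lines of Python. This is only an illustration under assumptions: select_tokens, its parameters and the sample token list are hypothetical and say nothing about how bload works internally.

```python
def select_tokens(tokens, ignore, limit=15, filter_first=True):
    """Pick up to `limit` tokens of one mail for the probability calculation.

    filter_first=True  : ignored tokens are dropped before the limit applies.
    filter_first=False : the limit applies first and ignored tokens are
                         dropped afterwards, so they still use up slots.
    """
    ignore_lc = {t.lower() for t in ignore}
    if filter_first:
        kept = [t for t in tokens if t.lower() not in ignore_lc]
        return kept[:limit]
    return [t for t in tokens[:limit] if t.lower() not in ignore_lc]


IGNORE = {"Content-Type", "multipart", "Content-Transfer-Encoding", "NextPart"}

# Hypothetical token stream of one MIME mail, in top-to-bottom order.
mail = (["Subject", "pills"]
        + ["Content-Type", "multipart", "NextPart"] * 5
        + ["buy", "cheap", "online"])

slots_wasted = select_tokens(mail, IGNORE, limit=15, filter_first=False)
really_ignored = select_tokens(mail, IGNORE, limit=15, filter_first=True)
bigger_limit = select_tokens(mail, IGNORE, limit=30, filter_first=False)

print(slots_wasted)    # ['Subject', 'pills']
print(really_ignored)  # ['Subject', 'pills', 'buy', 'cheap', 'online']
print(bigger_limit)    # ['Subject', 'pills', 'buy', 'cheap', 'online']
```

        In this toy case both approaches reach the body words. Filtering first needs an ignore-list lookup for every token of every incoming mail, while a larger limit only postpones the problem for mails with very long MIME headers.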



        Other thoughts/ideas…?!



        It's time for Easter now…have a good time!



        Cheers -Daniel.

        • More tokens

          By Nico Vis 2 decades ago

          Hi Daniel,

          thank you for your post. It's a good idea, and I will look into this possibility.

          Have a nice Easter!

          ;-)