
Instructions for Training to Exhaustion

TRAINING TO EXHAUSTION WITH A PRE-CLASSIFIED EMAIL DATABASE

Here's how the training set is handled.

Start with an empty word probability database.

Classify each email in the training set in sequence, alternating between ham and spam.

If an email is classified correctly, ignore it.

When an email is encountered that is misclassified, update the probability database using the misclassified email, and continue processing the emails in sequence using the updated probability database, until the last email is reached.

Go back to the first email in the training set and start again.

The above is repeated until either: a) the very last email is properly classified, or b) there is one complete iteration through the training set with no improvement in accuracy.
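The steps above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the filter's actual code: the `TokenCounts` class, its smoothed log-odds scoring rule, the `interleave` helper, and the `max_passes` safety limit are all hypothetical stand-ins for whatever word probability database and classifier is really used. The stopping test is one reading of the two conditions: stop after a pass with no misclassifications, or after a pass that makes no fewer errors than the previous one.

```python
import math
from collections import Counter
from itertools import zip_longest


class TokenCounts:
    """Hypothetical word probability database: raw per-class token counts."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}

    def update(self, tokens, label):
        # Count each distinct token once per training event.
        self.counts[label].update(set(tokens))

    def score(self, tokens):
        """Spam score in (0, 1). Unknown tokens are neutral, so an empty
        database scores everything at exactly 0.5. (Assumed scoring rule.)"""
        log_odds = 0.0
        for tok in set(tokens):
            s = self.counts["spam"][tok]
            h = self.counts["ham"][tok]
            log_odds += math.log((s + 0.5) / (h + 0.5))
        log_odds = max(-30.0, min(30.0, log_odds))  # keep exp() in range
        return 1.0 / (1.0 + math.exp(-log_odds))


def interleave(hams, spams):
    """Yield (tokens, label) pairs, switching between ham and spam."""
    for ham, spam in zip_longest(hams, spams):
        if ham is not None:
            yield ham, "ham"
        if spam is not None:
            yield spam, "spam"


def train_to_exhaustion(hams, spams, cutoff=0.5, max_passes=100):
    db = TokenCounts()  # start with an empty database
    prev_errors = None
    for _ in range(max_passes):  # max_passes is an added safety limit
        errors = 0
        for tokens, label in interleave(hams, spams):
            predicted = "spam" if db.score(tokens) > cutoff else "ham"
            if predicted != label:
                # Train only on mistakes; later emails see the updated counts.
                errors += 1
                db.update(tokens, label)
        if errors == 0 or (prev_errors is not None and errors >= prev_errors):
            break
        prev_errors = errors
    return db


hams = [["meeting", "tomorrow"], ["lunch", "friday"]]
spams = [["cheap", "pills"], ["win", "money"]]
db = train_to_exhaustion(hams, spams)
```

Note that the database is created once, before the first pass, and only ever added to, which matches note 1 below.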

Notes:

1) The probability database is not recreated during each iteration. Rather, the counts that lead to the probabilities are further refined with each iteration.

2) Some emails are counted more than once. In practice this happens relatively rarely. It can also be avoided, at the cost of possibly leaving some messages wrongly classified at the end.

3) Experiments have shown that it is useful to use a so-called security margin: an interval around the cutoff value in which no message's score is allowed to fall. For example, spams may be required to score above .7 and hams below .2. In practice, almost all mails can be made to conform to this (if you don't allow messages to be used more than once, a few might not conform). Then, when .5 is used as the cutoff in testing, messages are more likely to get scores far from the cutoff and so are more likely to be classified correctly.

4) In preliminary tests, this procedure has led to very good performance.

5) Using this procedure, we don't need to tune filter parameters, such as finding the ideal spam/ham cutoff point. Rather, the probability database is automatically adjusted to match the cutoff point(s) initially chosen by the software developer (see 3 above).

6) In actual use in an inline spam filter, the process would ideally iterate through the entire database of emails every time there is a misclassification. If resources are limited this process might be restricted to recent messages.
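The security margin in note 3 amounts to using a stricter test during training than during classification. A minimal sketch in Python, assuming the example thresholds from note 3 (.7 for spam, .2 for ham, .5 as the testing cutoff); the function names `needs_training` and `classify` are hypothetical:

```python
def needs_training(score, label, spam_min=0.7, ham_max=0.2):
    """Return True if a training message fails to clear the security margin.

    During training, a spam must score above spam_min and a ham below
    ham_max; anything else is treated like a misclassification and used
    to update the probability database.
    """
    if label == "spam":
        return score <= spam_min
    return score >= ham_max


def classify(score, cutoff=0.5):
    """Testing-time decision: a single cutoff, no margin."""
    return "spam" if score > cutoff else "ham"


# A spam scoring 0.6 would be classified correctly at the 0.5 cutoff,
# but it still sits inside the margin, so training would use it again.
borderline_spam = 0.6
decision = classify(borderline_spam)
retrain = needs_training(borderline_spam, "spam")
```

Replacing the plain "misclassified" test in the training loop with a check like `needs_training` is what pushes scores away from the cutoff.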
