Instructions for Training to Exhaustion
TRAINING TO EXHAUSTION WITH A PRE-CLASSIFIED EMAIL DATABASE
Here's how the training set is handled.
Start with an empty word probability database.
Classify each email in the training set in sequence, alternating between ham and spam.
If an email is classified correctly, ignore it.
When an email is encountered that is misclassified, update the probability database using the misclassified email, and continue processing the emails in sequence using the updated probability database, until the last email is reached.
Go back to the first email in the training set and start again.
The above is repeated until either: a) the very last email is properly classified, or b) there is one complete iteration through the system with no improvement in accuracy.
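The steps above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the filter's actual code: the `WordDB` class, its smoothed per-word scoring, and the names `interleave`, `classify`, `train_to_exhaustion` and `max_passes` are all hypothetical. Stopping condition (a) is read here as "a full pass with no misclassifications", and (b) as "a pass with at least as many errors as the previous one".

```python
import math
from collections import Counter
from itertools import zip_longest


class WordDB:
    """Toy word probability database (hypothetical; any scorer in [0, 1] would do)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.trained = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Count each distinct word once per trained message.
        self.counts[label].update(set(text.lower().split()))
        self.trained[label] += 1

    def score(self, text):
        """Return a spam score in [0, 1]; an empty database yields 0.5."""
        log_odds = 0.0
        for word in set(text.lower().split()):
            p_spam = (self.counts["spam"][word] + 1) / (self.trained["spam"] + 2)
            p_ham = (self.counts["ham"][word] + 1) / (self.trained["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        log_odds = max(-50.0, min(50.0, log_odds))  # keep exp() in range
        return 1.0 / (1.0 + math.exp(-log_odds))


def interleave(hams, spams):
    """Alternate ham and spam; leftovers of the longer list follow at the end."""
    ordered = []
    for ham, spam in zip_longest(hams, spams):
        if ham is not None:
            ordered.append((ham, "ham"))
        if spam is not None:
            ordered.append((spam, "spam"))
    return ordered


def classify(db, text, cutoff=0.5):
    return "spam" if db.score(text) > cutoff else "ham"


def train_to_exhaustion(emails, cutoff=0.5, max_passes=100):
    db = WordDB()  # start with an empty database
    previous_errors = None
    for _ in range(max_passes):  # safety cap, an assumption of this sketch
        errors = 0
        for text, label in emails:
            if classify(db, text, cutoff) != label:
                errors += 1
                db.train(text, label)  # update only on a misclassification
        # Stop on a clean pass, or when a pass brings no improvement.
        if errors == 0 or (previous_errors is not None and errors >= previous_errors):
            break
        previous_errors = errors
    return db


training_set = interleave(
    ["meeting agenda for monday", "lunch on monday"],
    ["cheap pills buy now", "buy cheap watches now"],
)
db = train_to_exhaustion(training_set)
```

Note that the database is created once, outside the loop over passes, so the counts carry over from one pass to the next rather than being rebuilt.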
Notes:
1) The probability database is not recreated during each iteration. Rather, the counts that lead to the probabilities are further refined with each iteration.
2) Some emails are counted more than once. In practice this happens relatively rarely. It can also be avoided, at the cost of possibly leaving some messages wrongly classified at the end.
3) Experiments have shown that it is useful to use a so-called security margin. This is an interval around the cutoff value in which no message's score is allowed to fall. For example, spams may be required to score >.7 and hams <.2. In practice, almost all mails can be made to conform to this (if you don't allow messages to be used more than once, a few might not conform). Then, when .5 is used as the cutoff in testing, messages are more likely to get scores well away from the cutoff and so are more likely to be classified correctly.
4) In preliminary tests, this procedure has led to very good performance.
5) Using this procedure, we don't need to tune filter parameters such as the ideal spam/ham cutoff point. Rather, the probability database is automatically adjusted to match the cutoff point(s) initially chosen by the software developer (see 3 above).
6) In actual use in an inline spam filter, the process would ideally iterate through the entire database of emails every time there is a misclassification. If resources are limited, this process might be restricted to recent messages.
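The security margin from note 3 might be expressed as a separate "needs training" test, applied during training in place of a plain right-or-wrong check. The sketch below uses the example thresholds from that note (.7 for spam, .2 for ham) and a testing cutoff of .5; the function and parameter names are hypothetical.

```python
def needs_training(score, label, spam_min=0.7, ham_max=0.2):
    """During training, treat a message as misclassified unless its score
    lies outside the security margin on the correct side."""
    if label == "spam":
        return not score > spam_min
    return not score < ham_max


def classify(score, cutoff=0.5):
    """At test time only the single cutoff is used."""
    return "spam" if score > cutoff else "ham"


# A spam scoring 0.6 would pass the plain cutoff check,
# but still triggers a database update during training.
print(classify(0.6), needs_training(0.6, "spam"))
```

Under this scheme a message is retrained on until its score clears the margin, which is what pushes scores away from the cutoff used in testing.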