The "class-train" goal is used for text classification experiments.
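This goal expects as input one or several files in DSV format (see formats). Each entry must contain two fields: "Class", the true class of the message, and "File", the path of the file that contains the message.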
Each DSV file specifies a test run to execute. For each DSV file, the internally used classifier is reset to zero knowledge (the prediction model is deleted). The classifier is then presented with each message file (in the order listed in the DSV file) and required to classify the message. After each classification the true class of the message is revealed, and the classifier can update its prediction model accordingly before classifying the next file.
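The classify-then-reveal loop described above can be sketched in a few lines of shell. This is only an illustration of the protocol, not of the classifier used by the system: the function name simulate_run is hypothetical, the "model" is a toy that predicts whichever class it has seen most often so far, and the placeholder prediction "none" (used before anything has been learned) is an assumption of this sketch.

```shell
# Toy illustration of one test run; reads a DSV file (Class|File) on stdin.
# The "classifier" is a stand-in: it predicts the class seen most often so far.
simulate_run() {
  awk -F'|' '
    NR == 1 { print "File|Class|Classification"; next }   # replace the input header
    {
      true_class = $1; file = $2
      # 1. classify using only what has been learned so far
      pred = "none"; best = 0
      for (c in seen) if (seen[c] > best) { best = seen[c]; pred = c }
      # 2. record "+" for a correct prediction, else the predicted class
      print file "|" true_class "|" (pred == true_class ? "+" : pred)
      # 3. reveal the true class and update the model before the next file
      seen[true_class]++
    }'
}
```

Because the model is updated only after each prediction has been recorded, every message is classified by a model that has never seen it.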
For each input file FILENAME.dsv, an output file FILENAME.cls is written that records the prediction of the classifier in a third field, "Classification". This field contains either a "+" (if the prediction was correct) or the name of the predicted class (if the prediction was wrong). These output files can be used to generate statistics.
Sample input file (from the SpamAssassin corpus described below):
Class|File
nonspam|easy_ham/01408.3967d7ac324000b3216b1bdadf32ad69
nonspam|easy_ham/02152.8df514c41920019281f8f0723dad0001
spam|spam_2/01164.55202a4234914b20004dc7f9264313e5
.....
Corresponding sample output file:
File|Class|Classification
easy_ham/01408.3967d7ac324000b3216b1bdadf32ad69|nonspam|spam
easy_ham/02152.8df514c41920019281f8f0723dad0001|nonspam|+
spam_2/01164.55202a4234914b20004dc7f9264313e5|spam|nonspam
.....
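As a minimal sketch of how such an output file can be turned into a number, the following function counts messages and wrong predictions in one .cls file. The name cls_summary is hypothetical; the sketch assumes the three-field layout shown in the sample and treats every value other than "+" in the third field as an error.

```shell
# Summarise one .cls file read from standard input.
cls_summary() {
  awk -F'|' '
    NF != 3 || $3 == "Classification" { next }   # skip header, "....." filler, blank lines
    { total++ }
    $3 != "+" { errors++ }                       # anything but "+" is a wrong prediction
    END { printf "messages=%d errors=%d\n", total, errors }'
}
```

It could be called as cls_summary < FILENAME.cls for each output file.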
Spam filtering experiments were performed on the SpamAssassin corpus available at http://www.spamassassin.org/publiccorpus/. To allow comparison with the experiments reported by CRM114 (http://crm114.sourceforge.net/Plateau_Paper.html), only a subset of the corpus was used.
To run these experiments, download the files 20030228_easy_ham.tar.bz2, 20030228_hard_ham.tar.bz2, and 20030228_spam_2.tar.bz2 from the SpamAssassin corpus and unpack them in a directory of your choice. The DSV files that control the test runs are stored in http://www.inf.fu-berlin.de/inst/ag-db/data/ties/spamassassin-runs.tar.bz2. Unpack this file in the same directory. The DSV files contain the ten "shuffles" (test runs) that were also used for the CRM114 experiments.
For each experiment, you should create a subdirectory (e.g. eval1, eval2, etc.) in this directory. This avoids confusion between the results of different experiments.
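The setup steps can be sketched as two small shell functions. Both names (unpack_corpus, new_experiment) are hypothetical, and the sketch assumes that the four archives named above have already been downloaded into the corpus directory; archives that are not present are simply skipped.

```shell
# Unpack whichever of the four archives are present in the corpus directory ($1).
unpack_corpus() {
  for archive in 20030228_easy_ham.tar.bz2 20030228_hard_ham.tar.bz2 \
                 20030228_spam_2.tar.bz2 spamassassin-runs.tar.bz2; do
    if [ -f "$1/$archive" ]; then
      tar -xjf "$1/$archive" -C "$1" || return 1
    fi
  done
}

# Create a fresh subdirectory ($2) below the corpus directory ($1) and print its path.
# mkdir is used without -p on purpose: reusing an existing directory fails,
# so results of different experiments cannot be mixed by accident.
new_experiment() {
  mkdir "$1/$2" && printf '%s\n' "$1/$2"
}
```

For example, new_experiment . eval1 creates ./eval1, and a second call with the same name reports an error instead of reusing the directory.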
You start an experiment by changing into a new subdirectory and executing the following command (assuming you have defined an alias ties, see install):
ties class-train -outdir=. -classifier=Winnow ../*.dsv
The outdir option tells the system to write output files to the current subdirectory (instead of the directory of the corresponding input file); the classifier option specifies the classifier to use. After the experiment is finished (this will take a while), the directory should contain a log file (ties.log) and ten *.cls files containing the results of the ten test runs.
You can use the errorgrep alias (see install) to check the log file for warnings (there shouldn't be any). To get statistics from the *.cls files, we found the following alias useful:
alias spamstats='tail -q --lines 500 *.cls|grep -v "|+"|wc -l; tail -q --lines 500 *.cls|grep "spam|nonspam"|wc -l; tail -q --lines 500 *.cls|grep "nonspam|spam"|wc -l; tail -q --lines 1000 *.cls|grep -v "|+"|wc -l; egrep -v "(\|\+|Classification)" *.cls|wc -l'
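This prints five numbers, each summed over all *.cls files: the number of errors among the last 500 messages of each test run; the number of spam messages among these that were misclassified as nonspam; the number of nonspam messages among these that were misclassified as spam; the number of errors among the last 1000 messages of each test run; and the total number of errors over all messages.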
For the command given above, you should get the numbers:
21 13 8 37 525
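A labelled variant of the same statistics can make the output easier to read. The following is a sketch, not a replacement for the alias: the function names are hypothetical, it assumes the class names spam and nonspam and a tail that supports -q (as the alias does), and unlike the alias it always skips the header line, which only matters for files shorter than 500 lines.

```shell
# Count errors in .cls records read from standard input.
# Prints: <errors> <spam classified as nonspam> <nonspam classified as spam>
count_cls_errors() {
  awk -F'|' '
    NF != 3 || $3 == "Classification" { next }    # skip header and non-record lines
    $3 != "+"                          { errors++ }
    $2 == "spam"    && $3 == "nonspam" { missed++ }
    $2 == "nonspam" && $3 == "spam"    { false_alarms++ }
    END { printf "%d %d %d\n", errors, missed, false_alarms }'
}

# Labelled statistics; the arguments are the .cls files to evaluate.
spamstats_labeled() {
  last500=$(tail -q -n 500 "$@" | count_cls_errors)
  last1000=$(tail -q -n 1000 "$@" | count_cls_errors)
  overall=$(cat "$@" | count_cls_errors)
  # Keep all three counts for the last 500, but only the error count otherwise.
  set -- $last500 ${last1000%% *} ${overall%% *}
  printf 'errors_last_500=%s\n' "$1"
  printf 'spam_as_nonspam_last_500=%s\n' "$2"
  printf 'nonspam_as_spam_last_500=%s\n' "$3"
  printf 'errors_last_1000=%s\n' "$4"
  printf 'errors_total=%s\n' "$5"
}
```

Inside an experiment subdirectory it would be called as spamstats_labeled *.cls.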
We reached our best results on the SpamAssassin corpus when using Jaakko Hyvätti's normalizemime (http://hyvatti.iki.fi/~jaakko/spam/) for preprocessing the mails. To reproduce this, you need to download and install normalizemime.
Then you can create normalized variants of all mails by executing the command:
find . -name "?????.????????????????????????????????" -exec normalizemime "{}" "{}.norm" \;
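Before starting a run on the normalized mails, it can be useful to check that every mail actually has a .norm variant. The following sketch (the function name list_unnormalized is hypothetical) reuses the file name pattern from the command above, i.e. five characters, a dot, and 32 characters, and prints every mail that lacks a normalized counterpart.

```shell
# List mails below directory $1 (default: current directory) without a .norm variant.
list_unnormalized() {
  hash8='????????'                                   # 8 wildcard characters
  pattern="?????.$hash8$hash8$hash8$hash8"           # 5 chars, a dot, 32 chars
  find "${1:-.}" -type f -name "$pattern" |
  while IFS= read -r mail; do
    [ -f "$mail.norm" ] || printf '%s\n' "$mail"
  done
}
```

An empty output means that all mails matched by the pattern have been normalized.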
To process the normalized variants, add the two options -file.ext=.norm -charset=UTF-8 when calling ties. This time you should get the following statistics:
16 12 4 33 514
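To keep the two variants of the command apart, the full command line can be assembled by a small helper. This is a sketch: the function name ties_command is hypothetical, it only prints the command rather than running it, and placing the two additional options before the input files is an assumption based on the option order of the original command.

```shell
# Print the ties command line for an experiment.
# $1: classifier name (default: Winnow); $2: "norm" to use the normalized mails.
ties_command() {
  cmd="ties class-train -outdir=. -classifier=${1:-Winnow}"
  if [ "${2:-}" = "norm" ]; then
    cmd="$cmd -file.ext=.norm -charset=UTF-8"
  fi
  printf '%s ../*.dsv\n' "$cmd"
}
```

Calling ties_command Winnow norm prints the command for the normalized run, which can then be executed inside a new experiment subdirectory.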
Other preprocessing tools can be integrated in the same way.