The "class-train" goal is used for text classification.
This goal expects as input one or several files in DSV format (see formats). Each entry must contain one or two fields: a "File" field naming a text file, and optionally a "Class" field giving its true class.
Each DSV file specifies a list of text files to train on and/or to classify. The classifier is presented with each text (in the order listed in the DSV file) and required to classify it. If the true class of a text is given (in the "Class" field), it is revealed to the classifier after each classification, and the classifier has the chance to update its prediction model accordingly before classifying the next file.
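The online protocol described above can be sketched in a few lines of Python. This is a hedged illustration, not TIES code: DummyClassifier is a hypothetical stand-in model, and run_online mimics only the predict-then-reveal loop.

```python
class DummyClassifier:
    """Placeholder model: always predicts the class seen most often so far."""
    def __init__(self):
        self.counts = {}

    def predict(self, text):
        if not self.counts:
            return "?"  # nothing learned yet
        return max(self.counts, key=self.counts.get)

    def train(self, text, true_class):
        self.counts[true_class] = self.counts.get(true_class, 0) + 1

def run_online(entries):
    """entries: (text, true_class or None) pairs, in DSV file order.
    Returns the "Classification" value for each entry: "+" if the
    prediction was correct, otherwise the predicted class."""
    clf = DummyClassifier()
    rows = []
    for text, true_class in entries:
        predicted = clf.predict(text)
        if true_class is not None:
            rows.append("+" if predicted == true_class else predicted)
            clf.train(text, true_class)  # update the model before the next text
        else:
            rows.append(predicted)  # true class unknown: classify only, no training
    return rows
```

Note that each prediction is made *before* the true class is revealed, so early texts are classified by a barely trained model.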
For each input file FILENAME.dsv, an output file FILENAME.cls is written that records the prediction of the classifier in an additional field "Classification". This field contains either a "+" (if the prediction was correct) or the name of the predicted class (in case of a wrong prediction or if the true class is unknown).
Sample input file (from the SpamAssassin corpus described below):
File|Class
easy_ham/01408.3967d7ac324000b3216b1bdadf32ad69|nonspam
easy_ham/02152.8df514c41920019281f8f0723dad0001|nonspam
spam_2/01164.55202a4234914b20004dc7f9264313e5|spam
.....
Corresponding sample output file:
File|Class|Classification
easy_ham/01408.3967d7ac324000b3216b1bdadf32ad69|nonspam|spam
easy_ham/02152.8df514c41920019281f8f0723dad0001|nonspam|+
spam_2/01164.55202a4234914b20004dc7f9264313e5|spam|nonspam
.....
If the true class of all texts is given, you will also get a corresponding '*.metrics' file that contains the classification accuracy.
The "Class" field is optional: If it is empty or missing, the file will be classified and the predicted class will be written to the "Classification" field, but no training will take place (since we do not know the true class).
By default, the same classification model is used for all files specified on the command line. This means that you can separate training and test data into different files (e.g. train.dsv and test.dsv), where the "Class" field is only given for the training data. Then invoke the goal as "class-train train.dsv test.dsv". If you spread your training instances across several files, note that the first file is inspected to find out which classes exist, so each class must occur at least once in the first file.
Sample training file (train.dsv):
File|Class
file1.html|bad
file2.html|good
file3.html|good
file4.html|bad
.....
Sample test file (test.dsv):
File
file1.html
file2.html
.....
The corresponding sample test output file (test.cls) adds the "Classification" proposed by the classifier:
File|Classification
file1.html|bad
file2.html|good
.....
You will also get output files corresponding to the training file (train.cls + train.metrics), but you can ignore them since they only document the training process.
If your texts use a character set that's different from the standard encoding used by your system, you should pass it in using the -charset=... option, otherwise the tokenizer might get confused.
If several files are specified on the command line, the same classification model is re-used for all of them. If you want to initialize a new classifier with an empty classification model for each file, you must set the option -classifier.re-use=false.
It is also possible to store the prediction model persistently. If you set -classifier.store (which is a shortcut for -classifier.store=true), the classifier will write its final classification model into a file classifier.xsj (a gzip-compressed XML file, in case you want to take a look). At the next invocation, if the classifier finds this file, it will load the initial classification model from it instead of starting with a new one (unless you have set -classifier.re-use=false as mentioned above).
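Since the stored model is described as gzip-compressed XML, you can peek at it with standard tools. The snippet below is only a convenience sketch; the internal XML structure of classifier.xsj is TIES-specific and not documented here.

```python
import gzip
import xml.etree.ElementTree as ET

def peek_model(path="classifier.xsj"):
    """Decompress the stored model and return the name of its root element."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        root = ET.parse(f).getroot()
    return root.tag
```

An equivalent shell one-liner would be `zcat classifier.xsj | head`.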
By setting the parameter -classifier.file=... you can specify a different name for the classification model (a relative or absolute file name). Whenever you want to create a new classification model, you'll have to delete or rename any existing model with the chosen name, otherwise the old model will be loaded.
Note that a lot of main memory is required while reading and writing this file. You might have to increase the amount of memory assigned to the virtual machine to prevent OutOfMemoryErrors.
If you specify -classifier.test-only (a shortcut for -classifier.test-only=true) on the command line, the "Class" field (if present) is only used for checking the correctness of the prediction, but no training takes place. Of course, this only makes sense if you have built and stored a classification model in a prior run, so you'll have to invoke the goal at least twice:
Training (will create a classifier.xsj file):
ties class-train -classifier.store train.dsv
Evaluation (will load the classifier.xsj file):
ties class-train -classifier.store -classifier.test-only test.dsv
You can enable iterative (TUNE) training by setting the option
-train.tune.text=15
when training. This will cause the classifier to
iterate over the training set for at most 15 iterations (typically
training stops earlier when training accuracy saturates). For small corpora,
TUNE training seems to be unnecessary, but for large ones it might help.
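The iteration-with-early-stopping behavior described above can be sketched as follows. This is a hypothetical illustration of the control flow, not the TIES implementation; train_pass is a stand-in for one full classify-and-update pass over the training set.

```python
def tune_train(train_pass, max_iterations=15):
    """Run up to max_iterations passes (e.g. 15, as set with
    -train.tune.text=15) and stop early once training accuracy
    stops improving. train_pass() returns the pass's accuracy in [0, 1].
    Returns (passes_run, best_accuracy)."""
    best = -1.0
    for i in range(max_iterations):
        accuracy = train_pass()
        if accuracy <= best:      # accuracy saturated: stop early
            return i + 1, best
        best = accuracy
    return max_iterations, best   # hit the iteration cap
```

On small corpora the accuracy typically saturates after very few passes, which matches the observation that TUNE training adds little there.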
Spam filtering experiments were performed on the SpamAssassin corpus available at http://www.spamassassin.org/publiccorpus/. To allow comparison with the experiments reported by CRM114 (http://crm114.sourceforge.net/Plateau_Paper.html), only a subset of the corpus was used.
To run these experiments, download the files 20030228_easy_ham.tar.bz2, 20030228_hard_ham.tar.bz2, and 20030228_spam_2.tar.bz2 from the SpamAssassin corpus and unpack them in a directory of your choice. The DSV files that control the test runs are stored in http://www.inf.fu-berlin.de/inst/ag-db/data/ties/spamassassin-runs.tar.bz2. Unpack this file in the same directory. The DSV files contain the ten "shuffles" (test runs) that were also used for the CRM114 experiments.
For each experiment, you should create a subdirectory (e.g. eval1, eval2, etc.) in this folder. This avoids confusion between the results of different experiments.
You start an experiment by changing into a new subdirectory and executing the following command (assuming you have defined an alias ties, see install):
ties class-train -outdir=. -classifier.re-use=false ../*.dsv
The -outdir option tells the system to write output files to the current subdirectory (instead of the directory of the corresponding input file); the -classifier.re-use=false option starts each test run with a fresh, empty classification model. After the experiment is finished (this will take a while), the directory should contain a log file (ties.log) and ten *.cls files containing the results of the ten test runs.
You can use the errorgrep alias (see install) to check the log file for warnings (there shouldn't be any). For getting statistics from the *.cls files, we found the following alias useful:
alias spamstats='tail -q --lines 500 *.cls|grep -v "|+"|wc -l; tail -q --lines 500 *.cls|grep "spam|nonspam"|wc -l; tail -q --lines 500 *.cls|grep "nonspam|spam"|wc -l; tail -q --lines 1000 *.cls|grep -v "|+"|wc -l; egrep -v "(\|\+|Classification)" *.cls|wc -l'
This prints five numbers: the number of misclassifications among the last 500 messages of each run; among these, the number of spam messages classified as nonspam and the number of nonspam messages classified as spam; the number of misclassifications among the last 1000 messages of each run; and the total number of misclassifications.
For the command given above, you should get the numbers:
21 13 8 37 525
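The spamstats alias can be re-expressed as a short Python function, which mirrors the tail/grep pipeline literally (same substring matches, same line windows). This is an equivalent sketch, not part of the TIES distribution:

```python
def spam_stats(files):
    """files: dict mapping .cls file name to its data lines (header removed).
    Returns the same five numbers as the spamstats alias."""
    last500 = [l for lines in files.values() for l in lines[-500:]]
    last1000 = [l for lines in files.values() for l in lines[-1000:]]
    alllines = [l for lines in files.values() for l in lines]
    return (
        sum("|+" not in l for l in last500),        # errors in last 500 per run
        sum("spam|nonspam" in l for l in last500),  # spam classified as nonspam
        sum("nonspam|spam" in l for l in last500),  # nonspam classified as spam
        sum("|+" not in l for l in last1000),       # errors in last 1000 per run
        sum("|+" not in l for l in alllines),       # total errors
    )
```

As with the alias, a correct prediction is any line containing "|+", and the two middle counts rely on the "true|predicted" field order of the .cls format.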
We reached our best results on the SpamAssassin corpus when using Jaakko Hyvätti's normalizemime (http://hyvatti.iki.fi/~jaakko/spam/) for preprocessing the mails. To reproduce this, you need to download and install normalizemime. Then you can create normalized variants of all mails by executing the command:
find . -name "?????.????????????????????????????????" -exec \
  normalizemime "{}" "{}.norm" \;
To process the normalized variants, add the two options -file.ext=.norm -charset=UTF-8 when calling ties. This time you should get the following statistics:
16 12 4 33 514
Other preprocessing tools can be integrated in the same way.