Text Classification How-to

The "class-train" goal is used for text classification.

This goal expects one or more files in DSV format as input (see formats). Each entry contains one or two fields:

File
Relative path to the file to classify. Required.
Class
The correct classification of this file (e.g. "spam" or "nonspam" in case of spam filtering). Optional (see below).

Each DSV file specifies a list of text files to train on and/or to classify. The classifier is presented with each text (in the order listed in the DSV file) and required to classify it. If the true class of a text is given (in the "Class" field), it is revealed to the classifier after each classification, and the classifier can update its prediction model accordingly before classifying the next file.

For each input file FILENAME.dsv, an output file FILENAME.cls is written that records the prediction of the classifier in an additional field "Classification". This field contains either a "+" (if the prediction was correct) or the name of the predicted class (if the prediction was wrong or if the true class is unknown).

Sample input file (from the SpamAssassin corpus described below):

    File|Class
    easy_ham/01408.3967d7ac324000b3216b1bdadf32ad69|nonspam
    easy_ham/02152.8df514c41920019281f8f0723dad0001|nonspam
    spam_2/01164.55202a4234914b20004dc7f9264313e5|spam
    .....

Corresponding sample output file:

    File|Class|Classification
    easy_ham/01408.3967d7ac324000b3216b1bdadf32ad69|nonspam|spam
    easy_ham/02152.8df514c41920019281f8f0723dad0001|nonspam|+
    spam_2/01164.55202a4234914b20004dc7f9264313e5|spam|nonspam
    .....

If the true class of all texts is given, you will also get a corresponding '*.metrics' file that contains the classification accuracy.

The "Class" field is optional: If it is empty or missing, the file will be classified and the predicted class will be written to the "Classification" field, but no training will take place (since we do not know the true class).

By default, the same classification model is used for all files specified on the command line. This means that you can separate training and test data in different files (e.g. train.dsv and test.dsv), where the "Class" field is only given for the training data.

Then invoke the goal as shown below. If you spread your training instances across several files, note that the first file is inspected to find out which classes exist, so each class must occur at least once in the first file.
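
Assuming you have defined the ties alias (see install), the invocation looks like this:

    ties class-train train.dsv test.dsv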

Sample training file (train.dsv):

    File|Class
    file1.html|bad
    file2.html|good
    file3.html|good
    file4.html|bad
    .....

Sample test file (test.dsv):

    File
    file1.html
    file2.html
    .....

The corresponding sample test output file (test.cls) adds the "Classification" proposed by the classifier:

    File|Classification
    file1.html|bad
    file2.html|good
    .....

You will also get output files corresponding to the training file (train.cls and train.metrics), but you can ignore them since they only document the training process.

If your texts use a character set different from your system's standard encoding, specify it with the -charset=... option; otherwise the tokenizer might get confused.
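
For example, if your texts are encoded in ISO-8859-1 (an example; substitute your actual encoding):

    ties class-train -charset=ISO-8859-1 train.dsv test.dsv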

If several files are specified on the command line, the same classification model is re-used for all of them. If you want to initialize a new classifier with an empty classification model for each file, set the option -classifier.re-use=false.
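
For example, to classify two file lists with independent models (hypothetical file names):

    ties class-train -classifier.re-use=false runA.dsv runB.dsv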

Storing and Loading of Classification Models

It is also possible to store the prediction model persistently. If you set -classifier.store (a shortcut for -classifier.store=true), the classifier will write its final classification model into a file classifier.xsj (a gzip-compressed XML file, in case you want to take a look). At the next invocation, if the classifier finds this file, it will load the initial classification model from it instead of starting with a new one (unless you have set -classifier.re-use=false as mentioned above).

By setting the parameter -classifier.file=..., you can specify a different name for the classification model (relative or absolute file name).
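
A sketch combining this with -classifier.store ("mymodel.xsj" is an arbitrary example name):

    ties class-train -classifier.store -classifier.file=mymodel.xsj train.dsv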

Whenever you want to create a new classification model, you'll have to delete or rename any existing model with the chosen name; otherwise the old model will be loaded.
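
For example:

    mv classifier.xsj classifier-old.xsj    # keep the old model around
    # or simply: rm classifier.xsj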

Note that a lot of main memory is required while reading and writing this file. You might have to increase the amount of memory assigned to the virtual machine to prevent OutOfMemoryErrors.
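
How you do that depends on how your ties alias is defined (see install); if it wraps a plain java invocation, you can raise the maximum heap size with the standard -Xmx option. A sketch (the 1 GB value and the placeholder are illustrative only):

    # add a heap-size option to the java command behind the "ties" alias:
    java -Xmx1024m <rest of the ties command line>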

Separating Evaluation from Training

If you specify -classifier.test-only (a shortcut for -classifier.test-only=true) on the command line, the "Class" field (if present) is only used for checking the correctness of the prediction, but no training takes place. Of course, this only makes sense if you have built and stored a classification model in a prior run, so you'll have to invoke the goal at least twice:

Training (will create a classifier.xsj file):

    ties class-train -classifier.store train.dsv

Evaluation (will load the classifier.xsj file):

    ties class-train -classifier.store -classifier.test-only test.dsv

TUNE Training ("Train-Until-No-Errors")

You can enable iterative (TUNE) training by setting the option -train.tune.text=15 when training. This causes the classifier to iterate over the training set for at most 15 iterations (training typically stops earlier, when the training accuracy saturates). For small corpora, TUNE training seems to be unnecessary, but for large ones it might help.
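
For example, combined with storing the resulting model:

    ties class-train -classifier.store -train.tune.text=15 train.dsv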

Spam Filtering Experiments (SpamAssassin Corpus)

Spam filtering experiments were performed on the SpamAssassin corpus available at http://www.spamassassin.org/publiccorpus/. To allow comparison with the experiments reported by CRM114 (http://crm114.sourceforge.net/Plateau_Paper.html), only a subset of the corpus was used.

To run these experiments, download the files 20030228_easy_ham.tar.bz2, 20030228_hard_ham.tar.bz2, and 20030228_spam_2.tar.bz2 from the SpamAssassin corpus and unpack them in a directory of your choice. The DSV files that control the test runs are available at http://www.inf.fu-berlin.de/inst/ag-db/data/ties/spamassassin-runs.tar.bz2. Unpack this file in the same directory. The DSV files contain the ten "shuffles" (test runs) that were also used for the CRM114 experiments.
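
A sketch of the download and unpacking steps (assuming the archive files sit directly below the URLs given above and that wget is available):

    wget http://www.spamassassin.org/publiccorpus/20030228_easy_ham.tar.bz2
    wget http://www.spamassassin.org/publiccorpus/20030228_hard_ham.tar.bz2
    wget http://www.spamassassin.org/publiccorpus/20030228_spam_2.tar.bz2
    wget http://www.inf.fu-berlin.de/inst/ag-db/data/ties/spamassassin-runs.tar.bz2
    for f in *.tar.bz2; do tar xjf "$f"; done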

For each experiment, you should create a subdirectory (e.g. eval1, eval2 etc.) in this folder. This avoids confusion between the results of different experiments.

You start an experiment by changing into a new subdirectory and executing the following command (assuming you have defined an alias ties, see install):

    ties class-train -outdir=. -classifier.re-use=false ../*.dsv

The -outdir option tells the system to write output files to the current subdirectory (instead of the directory of the corresponding input file); the -classifier.re-use=false option ensures that each test run starts with a new, empty classification model. After the experiment is finished (this will take a while), the directory should contain a log file (ties.log) and ten *.cls files containing the results of the ten test runs.

You can use the errorgrep alias (see install) to check the log file for warnings (there shouldn't be any). To get statistics from the *.cls files, we found the following alias useful:

    alias spamstats='tail -q --lines 500 *.cls|grep -v "|+"|wc -l;
      tail -q --lines 500 *.cls|grep "spam|nonspam"|wc -l;
      tail -q --lines 500 *.cls|grep "nonspam|spam"|wc -l;
      tail -q --lines 1000 *.cls|grep -v "|+"|wc -l;
      egrep -v "(\|\+|Classification)" *.cls|wc -l'

This prints five numbers:

  1. Number of errors on the last 10x500 mails
  2. False negatives (spam misclassified as nonspam) on the last 10x500 mails
  3. False positives (nonspam misclassified as spam) on the last 10x500 mails
  4. Number of errors on the last 10x1000 mails
  5. Number of errors on all 10x4147 mails

For the command given above, you should get the numbers:

     21
     13
      8
     37
    525

Preprocessing (Normalization)

We reached our best results on the SpamAssassin corpus when using Jaakko Hyvätti's normalizemime (http://hyvatti.iki.fi/~jaakko/spam/) for preprocessing the mails. To reproduce this, you need to download and install normalizemime.

Then you can create normalized variants of all mails by executing the following command:

    find . -name "?????.????????????????????????????????" -exec \
      normalizemime "{}" "{}.norm" \;

To process the normalized variants, add the two options -file.ext=.norm -charset=UTF-8 when calling ties.
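
For example, repeating the experiment command from above with these options added:

    ties class-train -outdir=. -classifier.re-use=false -file.ext=.norm \
      -charset=UTF-8 ../*.dsv

This time you should get the following statistics: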

     16
     12
      4
     33
    514

Other preprocessing tools can be integrated in the same way.
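
For example, a hypothetical filter "striptool" that writes its output to files with a ".txt" extension (both names are placeholders) could be hooked in like this:

    find . -name "?????.????????????????????????????????" -exec \
      striptool "{}" "{}.txt" \;
    ties class-train -outdir=. -classifier.re-use=false -file.ext=.txt ../*.dsv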