Text Classification Experiments

The "class-train" goal is used for text classification experiments.

This goal expects one or more files in DSV format (see formats) as input. Each entry must contain two fields:

File
Relative path to the file to classify.
Class
The correct classification of this file (e.g. "spam" or "nonspam" in case of spam filtering).

Each DSV file specifies one test run. At the start of each run, the classifier is reset to zero knowledge (its prediction model is deleted). The classifier is then presented with each listed file, in the order given in the DSV file, and must classify it. After each classification, the true class of the file is revealed, and the classifier can update its prediction model before classifying the next file.
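The classify-then-reveal protocol of a single test run can be sketched in a few lines of Python. This is only an illustration of the evaluation loop, not the TIES implementation: the MajorityClassifier class, its classify and update methods, and the run_test function are hypothetical names, and the toy classifier ignores the text entirely.

```python
from collections import Counter


class MajorityClassifier:
    """Hypothetical stand-in for a trainable classifier (not the TIES Winnow)."""

    def __init__(self):
        # A fresh instance has zero knowledge: no class has been seen yet.
        self.counts = Counter()

    def classify(self, text):
        if not self.counts:
            return "nonspam"  # arbitrary default before any feedback
        return self.counts.most_common(1)[0][0]

    def update(self, text, true_class):
        self.counts[true_class] += 1


def run_test(entries, make_classifier=MajorityClassifier):
    """Run one test run over (name, text, true_class) entries, in order."""
    classifier = make_classifier()  # reset to zero knowledge for this run
    results = []
    for name, text, true_class in entries:
        predicted = classifier.classify(text)  # classify first ...
        mark = "+" if predicted == true_class else predicted
        results.append((name, true_class, mark))
        classifier.update(text, true_class)  # ... then reveal the true class
    return results
```

Because every file is classified before its true class is revealed, each prediction is made on data the classifier has not yet seen, so no separate training set is needed.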

For each input file FILENAME.dsv, an output file FILENAME.cls is written that records the classifier's prediction in a third field, "Classification". This field contains either "+" (if the prediction was correct) or the name of the predicted class (if the prediction was wrong). These output files can be used to generate statistics.

Sample input file (from the SpamAssassin corpus described below):

    Class|File
    nonspam|easy_ham/01408.3967d7ac324000b3216b1bdadf32ad69
    nonspam|easy_ham/02152.8df514c41920019281f8f0723dad0001
    spam|spam_2/01164.55202a4234914b20004dc7f9264313e5
    .....

Corresponding sample output file:

    File|Class|Classification
    easy_ham/01408.3967d7ac324000b3216b1bdadf32ad69|nonspam|spam
    easy_ham/02152.8df514c41920019281f8f0723dad0001|nonspam|+
    spam_2/01164.55202a4234914b20004dc7f9264313e5|spam|nonspam
    .....
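As a sketch of how the "Classification" field can be interpreted, the following Python snippet expands "+" back into the predicted class and tallies (true class, predicted class) pairs. It assumes that the fields are separated by "|", that the first line is a header, and that no quoting or escaping is used; the function name confusion_counts is hypothetical. The sample data is the output shown above, without the elided lines.

```python
import csv
import io
from collections import Counter

SAMPLE = """File|Class|Classification
easy_ham/01408.3967d7ac324000b3216b1bdadf32ad69|nonspam|spam
easy_ham/02152.8df514c41920019281f8f0723dad0001|nonspam|+
spam_2/01164.55202a4234914b20004dc7f9264313e5|spam|nonspam
"""


def confusion_counts(handle):
    """Count (true class, predicted class) pairs in one .cls file."""
    counts = Counter()
    reader = csv.DictReader(handle, delimiter="|", quoting=csv.QUOTE_NONE)
    for row in reader:
        # "+" means the prediction was correct, i.e. equal to the true class.
        if row["Classification"] == "+":
            predicted = row["Class"]
        else:
            predicted = row["Classification"]
        counts[(row["Class"], predicted)] += 1
    return counts


counts = confusion_counts(io.StringIO(SAMPLE))
for (true_class, predicted), number in sorted(counts.items()):
    print(true_class, predicted, number)
```

For a real result file, pass an open file handle instead of the io.StringIO wrapper.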

Spam Filtering (SpamAssassin Corpus)

Spam filtering experiments were performed on the SpamAssassin corpus available at http://www.spamassassin.org/publiccorpus/. To allow comparison with the experiments reported by CRM114 (http://crm114.sourceforge.net/Plateau_Paper.html), only a subset of the corpus was used.

To run these experiments, download the files 20030228_easy_ham.tar.bz2, 20030228_hard_ham.tar.bz2, and 20030228_spam_2.tar.bz2 from the SpamAssassin corpus and unpack them in a directory of your choice. The DSV files that control the test runs are available at http://www.inf.fu-berlin.de/inst/ag-db/data/ties/spamassassin-runs.tar.bz2. Unpack this archive in the same directory. The DSV files contain the ten "shuffles" (test runs) that were also used for the CRM114 experiments.

For each experiment, create a subdirectory (e.g. eval1, eval2, etc.) in this directory. This keeps the results of different experiments apart.

To start an experiment, change into a new subdirectory and execute the following command (this assumes you have defined the alias ties; see install):

    ties class-train -outdir=. -classifier=Winnow ../*.dsv

The outdir option tells the system to write output files to the current subdirectory (instead of the directory of the corresponding input file); the classifier option specifies the classifier to use. After the experiment is finished (this will take a while), the directory should contain a log file (ties.log) and ten *.cls files containing the results of the ten test runs.

You can use the errorgrep alias (see install) to check the log file for warnings (there shouldn't be any). To get statistics from the *.cls files, we found the following alias useful:

    alias spamstats='tail -q --lines 500 *.cls|grep -v "|+"|wc -l;
      tail -q --lines 500 *.cls|grep "spam|nonspam"|wc -l;
      tail -q --lines 500 *.cls|grep "nonspam|spam"|wc -l;
      tail -q --lines 1000 *.cls|grep -v "|+"|wc -l;
      grep -E -v "(\|\+|Classification)" *.cls|wc -l'

This prints five numbers:

  1. Number of errors on the last 10x500 mails
  2. False negatives (spam misclassified as nonspam) on the last 10x500 mails
  3. False positives (nonspam misclassified as spam) on the last 10x500 mails
  4. Number of errors on the last 10x1000 mails
  5. Number of errors on all 10x4147 mails

For the command given above, you should get the numbers:

     21
     13
      8
     37
    525
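The five counts printed by the spamstats alias can also be computed in a script, which avoids the dependency on GNU tail and grep. The following Python sketch assumes "|"-separated .cls files with a header line and the class names "spam" and "nonspam"; the function names load_run, count_errors, and spam_stats are hypothetical and not part of TIES.

```python
import csv


def load_run(path):
    """Read one .cls file into a list of dicts keyed by the header fields."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="|", quoting=csv.QUOTE_NONE)
        return list(reader)


def count_errors(rows, true_class=None):
    """Count wrong predictions, optionally only those for one true class."""
    return sum(
        1
        for row in rows
        if row["Classification"] != "+"
        and (true_class is None or row["Class"] == true_class)
    )


def spam_stats(runs):
    """Return the five counts in the order listed above."""
    last_500 = [row for run in runs for row in run[-500:]]
    last_1000 = [row for run in runs for row in run[-1000:]]
    everything = [row for run in runs for row in run]
    return (
        count_errors(last_500),             # errors on the last 500 per run
        count_errors(last_500, "spam"),     # false negatives
        count_errors(last_500, "nonspam"),  # false positives
        count_errors(last_1000),            # errors on the last 1000 per run
        count_errors(everything),           # errors on all mails
    )


# Possible usage in a result directory (requires "import glob"):
#   runs = [load_run(path) for path in sorted(glob.glob("*.cls"))]
#   print(*spam_stats(runs), sep="\n")
```

Unlike the alias, this version compares the "Class" field directly instead of matching the text "spam|nonspam", so it does not depend on the position of the fields within the line.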

Preprocessing (Normalization)

We reached our best results on the SpamAssassin corpus when using Jaakko Hyvätti's normalizemime (http://hyvatti.iki.fi/~jaakko/spam/) to preprocess the mails. To reproduce this, you need to download and install normalizemime.

Then create normalized variants of all mails by executing the following command:

    find . -name "?????.????????????????????????????????" -exec \
      normalizemime "{}" "{}.norm" \;
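Where find is not available, the same traversal can be sketched in Python. The pattern below mirrors the one in the command above (five characters, a dot, and 32 characters, as in the sample file names), and the tool is called with an input and an output path exactly as in that command. The function names find_mails and normalize_all are hypothetical, and normalize_all assumes that normalizemime is installed and on the PATH.

```python
import fnmatch
import os
import subprocess

# Five characters, a dot, then 32 characters, e.g.
# 01408.3967d7ac324000b3216b1bdadf32ad69
MAIL_PATTERN = "?????." + "?" * 32


def find_mails(root="."):
    """Return all files below root whose name matches MAIL_PATTERN."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if fnmatch.fnmatchcase(filename, MAIL_PATTERN):
                matches.append(os.path.join(dirpath, filename))
    return sorted(matches)


def normalize_all(root=".", tool="normalizemime"):
    """Write a normalized FILE.norm next to every matching mail file."""
    for path in find_mails(root):
        # Same call as in the find command: tool, input file, output file.
        subprocess.run([tool, path, path + ".norm"], check=True)
```

Existing .norm files do not match the pattern because their names are longer, so running the normalization a second time does not normalize the already normalized variants.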

To process the normalized variants, add the two options -file.ext=.norm and -charset=UTF-8 when calling ties. This time you should get the following statistics:

     16
     12
      4
     33
    514

Other preprocessing tools can be integrated in the same way.