de.fu_berlin.ties.preprocess
Class PreProcessor

java.lang.Object
  extended by de.fu_berlin.ties.ConfigurableProcessor
      extended by de.fu_berlin.ties.TextProcessor
          extended by de.fu_berlin.ties.preprocess.PreProcessor
All Implemented Interfaces:
Processor

public class PreProcessor
extends TextProcessor

Preprocesses documents by converting them to a suitable XML format and adding linguistic information. Instances of this class are thread-safe.

Version:
$Revision: 1.16 $, $Date: 2004/12/07 12:02:05 $, $Author: siefkes $
Author:
Christian Siefkes

Field Summary
static String CONFIG_HTMLCONV_COMMAND
          Configuration key prefix: command name and arguments of an external converter from a specified type to HTML.
static String CONFIG_PREPROCESS_TAGGER
          Configuration key: A tagger (or a list of taggers) used to annotate a text e.g. with linguistic information.
static String CONFIG_PREPROCESS_TEXT
          Configuration key: Whether plain text is preprocessed to recognize and reformat definition lists.
 
Fields inherited from class de.fu_berlin.ties.TextProcessor
CONFIG_POST, KEY_DIRECTORY, KEY_LOCAL_NAME, KEY_OUT_DIRECTORY, KEY_URL
 
Constructor Summary
PreProcessor()
          Creates and configures a new instance, using a default extension and the standard configuration.
PreProcessor(String outExt)
          Creates and configures a new instance, using the standard configuration.
PreProcessor(String outExt, TiesConfiguration config)
          Creates and configures a new instance.
 
Method Summary
 String cleanHTML(String input, String charset)
          Converts HTML input to a clean XHTML representation, if necessary.
protected  void doProcess(Reader reader, Writer writer, ContextMap context)
          Preprocesses the contents of a file.
 String toString()
          Returns a string representation of this object.
 
Methods inherited from class de.fu_berlin.ties.TextProcessor
getOutFileExt, process, process, process, process, process, process
 
Methods inherited from class de.fu_berlin.ties.ConfigurableProcessor
getConfig
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

CONFIG_HTMLCONV_COMMAND

public static final String CONFIG_HTMLCONV_COMMAND
Configuration key prefix: command name and arguments of an external converter from a specified type to HTML.

See Also:
Constant Field Values

CONFIG_PREPROCESS_TEXT

public static final String CONFIG_PREPROCESS_TEXT
Configuration key: Whether plain text is preprocessed to recognize and reformat definition lists.

See Also:
Constant Field Values

CONFIG_PREPROCESS_TAGGER

public static final String CONFIG_PREPROCESS_TAGGER
Configuration key: A tagger (or a list of taggers) used to annotate a text e.g. with linguistic information. Each tagger must be a TextProcessor and accept a string (the output extension) as its single constructor argument.

See Also:
Constant Field Values
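
The constructor requirement above suggests that taggers are created reflectively from configured class names. The following is a minimal sketch of that pattern, not the actual TIES implementation: SketchTagger, NoOpTagger, TaggerLoaderSketch and loadTaggers are hypothetical stand-ins, and the extension "out" is an arbitrary value. A class that cannot be instantiated is reported as an IllegalArgumentException, mirroring the constructors documented on this page.

```java
import java.lang.reflect.Constructor;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a tagger base type; NOT the TIES TextProcessor. */
abstract class SketchTagger {
    private final String outExt;

    protected SketchTagger(String outExt) {
        this.outExt = outExt;
    }

    String getOutFileExt() {
        return outExt;
    }
}

/** Hypothetical tagger exposing the single-String constructor. */
class NoOpTagger extends SketchTagger {
    public NoOpTagger(String outExt) {
        super(outExt);
    }
}

final class TaggerLoaderSketch {
    /** Instantiates each named class via its (String) constructor. */
    static List<SketchTagger> loadTaggers(String[] classNames, String outExt) {
        List<SketchTagger> taggers = new ArrayList<>();
        for (String name : classNames) {
            try {
                Constructor<? extends SketchTagger> ctor = Class.forName(name)
                        .asSubclass(SketchTagger.class)
                        .getConstructor(String.class);
                taggers.add(ctor.newInstance(outExt));
            } catch (ReflectiveOperationException | ClassCastException e) {
                // Unknown class, wrong type, or missing constructor
                throw new IllegalArgumentException(
                        "Cannot instantiate tagger " + name, e);
            }
        }
        return taggers;
    }

    public static void main(String[] args) {
        List<SketchTagger> taggers =
                loadTaggers(new String[] {NoOpTagger.class.getName()}, "out");
        System.out.println(taggers.size() + " tagger(s) loaded");
    }
}
```

Failing early in the loader means a misconfigured tagger name is reported at construction time rather than when the first document is processed.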

Constructor Detail

PreProcessor

public PreProcessor()
Creates and configures a new instance, using a default extension and the standard configuration.


PreProcessor

public PreProcessor(String outExt)
             throws IllegalArgumentException
Creates and configures a new instance, using the standard configuration.

Parameters:
outExt - the extension to use for output files
Throws:
IllegalArgumentException - if the configured linguistic tagger(s) cannot be instantiated

PreProcessor

public PreProcessor(String outExt,
                    TiesConfiguration config)
             throws IllegalArgumentException
Creates and configures a new instance.

Parameters:
outExt - the extension to use for output files
config - used to configure superclasses
Throws:
IllegalArgumentException - if the configured linguistic tagger(s) cannot be instantiated

Method Detail

cleanHTML

public final String cleanHTML(String input,
                              String charset)
                       throws IOException
Converts HTML input to a clean XHTML representation, if necessary. Delegates to JTidy for checking and cleaning the HTML code.

Parameters:
input - the HTML to tidy
charset - the character set to be used for storing the resulting XHTML document (required to write the XML Declaration correctly)
Returns:
the cleaned-up XHTML
Throws:
IOException - if an I/O error occurs
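
To illustrate why the charset parameter matters, here is a small sketch of how an XML declaration depends on the character set in which the document is stored. It does not show how cleanHTML or JTidy work internally; XmlDeclarationSketch and xmlDeclaration are hypothetical names used only for this example.

```java
import java.nio.charset.Charset;

final class XmlDeclarationSketch {
    /** Builds an XML declaration naming the charset used to store the document. */
    static String xmlDeclaration(String charset) {
        // Charset.forName resolves aliases and rejects unknown names
        String canonical = Charset.forName(charset).name();
        return "<?xml version=\"1.0\" encoding=\"" + canonical + "\"?>";
    }

    public static void main(String[] args) {
        String body = "<html><body><p>caf\u00e9</p></body></html>";
        for (String cs : new String[] {"UTF-8", "ISO-8859-1"}) {
            String doc = xmlDeclaration(cs) + "\n" + body;
            // The declared encoding must match the bytes actually written
            byte[] stored = doc.getBytes(Charset.forName(cs));
            System.out.println(cs + ": " + stored.length + " bytes");
        }
    }
}
```

If the declared encoding and the stored bytes disagree, an XML parser may misread or reject non-ASCII characters, which is why the caller has to pass the charset explicitly.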

doProcess

protected final void doProcess(Reader reader,
                               Writer writer,
                               ContextMap context)
                        throws IOException,
                               ProcessingException
Preprocesses the contents of a file. Neither the input reader nor the output writer is closed by this method.

Specified by:
doProcess in class TextProcessor
Parameters:
reader - a reader containing the text to preprocess; not closed by this method
writer - a writer used to store the preprocessed text; flushed but not closed by this method
context - a map of objects that are made available for processing; the ContentType.KEY_MIME_TYPE key should be mapped to the MIME type of the document and the IOUtils.KEY_LOCAL_CHARSET key to the character set of the writer
Throws:
IOException - if an I/O error occurred
ProcessingException - if the file couldn't be parsed, e.g. due to an error in the XML input
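
The stream contract described above (the reader is not closed, the writer is flushed but not closed) leaves ownership of both streams with the caller. The sketch below shows that contract with plain java.io classes; copyAndFlush is a hypothetical stand-in for a doProcess-style method and performs no actual preprocessing.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

final class StreamContractSketch {
    /** Hypothetical stand-in: copies the text, flushes, and never closes. */
    static void copyAndFlush(Reader reader, Writer writer) throws IOException {
        char[] buffer = new char[4096];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            writer.write(buffer, 0, n);
        }
        writer.flush(); // flushed, but deliberately left open
    }

    public static void main(String[] args) throws IOException {
        StringWriter sink = new StringWriter();
        // The caller owns both streams and closes them itself.
        try (Reader reader = new StringReader("<p>text</p>");
             Writer writer = new BufferedWriter(sink)) {
            copyAndFlush(reader, writer);
            // The writer is still usable after the call
            writer.write("<!-- appended by caller -->");
        }
        System.out.println(sink);
    }
}
```

Because the method only flushes, a caller can write several results to the same writer and close it once at the end.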

toString

public String toString()
Returns a string representation of this object.

Overrides:
toString in class TextProcessor
Returns:
a textual representation


Copyright © 2003-2004 Christian Siefkes. All Rights Reserved.