Class CIndex
java.lang.Object
|
+----CIndex
- public class CIndex
- extends Object
Central index of the search engine consisting of:
- An index map containing the stems of all indexed words.
Each stem is linked to its posting list via a unique ID.
- An indexed array of posting lists.
- A hashed URL list. Each URL is linked to the appropriate document
via a unique ID.
- An indexed array of documents.
Creating the search engine's index is a two-step process:
- First, all tables are set up as either vectors or nested arrays
so that they can be extended quickly and easily while pages are added.
- Then all vectors and nested arrays are replaced by "normal" arrays
(see optimize()) to make retrieval fast and easy.
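The two-step idea can be illustrated with a minimal sketch. The class TwoPhasePostingList and its members are hypothetical names chosen for illustration and are not part of CIndex; java.util.ArrayList stands in for the vectors mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical two-phase posting list: growable while indexing, a fixed array afterwards. */
class TwoPhasePostingList {
    private List<Integer> growing = new ArrayList<>(); // build phase: cheap appends
    private int[] frozen;                              // retrieval phase: plain array

    /** Build phase: append a document ID. */
    void add(int docId) {
        if (frozen != null) {
            throw new IllegalStateException("already optimized");
        }
        growing.add(docId);
    }

    /** Switches to the retrieval phase by copying the list into a plain array. */
    void optimize() {
        if (frozen != null) {
            return; // already optimized
        }
        frozen = new int[growing.size()];
        for (int i = 0; i < frozen.length; i++) {
            frozen[i] = growing.get(i);
        }
        growing = null; // the growable structure is no longer needed
    }

    /** Retrieval phase: returns the document IDs as an array. */
    int[] toArray() {
        if (frozen == null) {
            throw new IllegalStateException("call optimize() first");
        }
        return frozen;
    }
}
```

In this sketch, additions after optimize() are rejected. Whether CIndex behaves the same way is not stated in this documentation.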
-
CIndex()
-
-
addPage(PPage)
- adds a document to the index.
-
dump()
-
-
getMatchingWordIDs(String)
- returns the IDs of all words within the index that match a
given wildcard expression (string*)
-
getPage(int)
- returns a document from a given document ID.
-
getPage(String)
- returns a document from a given URL.
-
getPostingList(int)
- returns a given word's posting list
-
getPostingList(String)
- returns a given word's posting list
-
getWordLink(String)
- returns a given word's ID as used in docReps.
-
optimize()
- optimizes the index by replacing all vectors and nested arrays
with "real" arrays.
-
statistics(PrintStream)
-
CIndex
public CIndex()
addPage
public boolean addPage(PPage page)
- adds a document to the index. The steps performed are:
- add the document to the document table
- add the document's URL to the URL table
- create a local index for this document
- rank all words based on the local index
- create the document's docRep from the local index
- merge the local index with the global index. Add all
new words to the global word list.
- Parameters:
- page - page to be added to the index
- Returns:
- true, if the page was added, otherwise false.
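The local-index and merge steps above might look roughly like the following sketch. Everything here is an assumption made for illustration: the class MergeSketch, the representation of a local index as a map from stem to occurrence count, and the choice to start word IDs at 1 (suggested by getWordLink returning 0 for unknown words). Ranking and docRep creation are omitted.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of merging a per-document index into global tables. */
class MergeSketch {
    // stem -> word ID; IDs start at 1 so that 0 can mean "not found"
    final Map<String, Integer> wordIds = new HashMap<>();
    // word ID -> posting list of document IDs; slot 0 is an unused placeholder
    final List<List<Integer>> postings = new ArrayList<>();

    MergeSketch() {
        postings.add(new ArrayList<>());
    }

    /** Builds a local index for one document: stem -> number of occurrences. */
    static Map<String, Integer> localIndex(String[] stems) {
        Map<String, Integer> counts = new HashMap<>();
        for (String stem : stems) {
            counts.merge(stem, 1, Integer::sum);
        }
        return counts;
    }

    /** Merges a local index into the global tables, registering new words as needed. */
    void merge(int docId, Map<String, Integer> local) {
        for (String stem : local.keySet()) {
            Integer id = wordIds.get(stem);
            if (id == null) {
                id = postings.size();          // next free word ID
                wordIds.put(stem, id);
                postings.add(new ArrayList<>());
            }
            postings.get(id).add(docId);       // one posting per document
        }
    }
}
```

Because the local index holds each stem once, a document appears at most once in any posting list, however often the word occurs in it.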
optimize
public void optimize()
- optimizes the index by replacing all vectors and nested arrays
with "real" arrays.
The steps performed are:
- flatten the word list by using a 3-letter prefixed array
- flatten the posting array by un-nesting it
- optimize all posting lists by transforming them into arrays
- flatten the document list by un-nesting it
- replace the vectors holding each document's links by arrays
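The documentation does not say how the "3-letter prefixed array" is laid out. One possible reading, shown here purely as an assumption, is a sorted word array plus a table that maps each three-letter prefix to the slice of the array holding the words with that prefix. The class PrefixedWordArray and its methods are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical flattened word list: a sorted array with a 3-letter prefix table. */
class PrefixedWordArray {
    private final String[] words;                              // sorted, fixed size
    private final Map<String, int[]> ranges = new HashMap<>(); // prefix -> {start, endExclusive}

    PrefixedWordArray(String[] input) {
        words = input.clone();
        Arrays.sort(words);
        int start = 0;
        // Words sharing a prefix are adjacent after sorting, so one pass finds all slices.
        for (int i = 1; i <= words.length; i++) {
            if (i == words.length || !prefix(words[i]).equals(prefix(words[start]))) {
                ranges.put(prefix(words[start]), new int[] {start, i});
                start = i;
            }
        }
    }

    /** First three characters, or the whole word if it is shorter. */
    static String prefix(String word) {
        return word.length() <= 3 ? word : word.substring(0, 3);
    }

    /** Returns the word's position in the sorted array, or -1 if it is absent. */
    int indexOf(String word) {
        int[] range = ranges.get(prefix(word));
        if (range == null) {
            return -1;
        }
        int pos = Arrays.binarySearch(words, range[0], range[1], word);
        return pos >= 0 ? pos : -1;
    }
}
```

The prefix table narrows each lookup to a small slice before the binary search runs. The actual CIndex layout may differ.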
getWordLink
public int getWordLink(String strWord)
- returns a given word's ID as used in docReps.
- Returns:
- an integer > 0 if the word was found, otherwise 0.
getPostingList
public int[] getPostingList(String strWord)
- returns a given word's posting list
- Parameters:
- strWord - word to be looked up
- Returns:
- posting list as an array of document IDs. If
strWord is unknown, null will be returned.
getPostingList
public int[] getPostingList(int idWord)
- returns a given word's posting list
- Parameters:
- idWord - word ID to be looked up
- Returns:
- posting list as an array of document IDs
getMatchingWordIDs
public Vector getMatchingWordIDs(String mask)
- returns the IDs of all words within the index that match a
given wildcard expression (string*)
- Parameters:
- mask - wildcard expression
- Returns:
- Vector containing the IDs of all words that match the expression
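A trailing-wildcard match of the form string* can be sketched as a prefix scan over a sorted word array. This is an illustration under stated assumptions, not the CIndex implementation: the class WildcardSketch is hypothetical, words are assumed to be unique and sorted, a word's ID is assumed to be its array position plus one, and a List is returned instead of a Vector.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical prefix matcher for masks of the form "string*". */
class WildcardSketch {
    /**
     * sortedWords must be sorted ascending and free of duplicates.
     * Returns assumed word IDs (array position + 1) of all matching words.
     */
    static List<Integer> matchingWordIds(String[] sortedWords, String mask) {
        boolean wildcard = mask.endsWith("*");
        String prefix = wildcard ? mask.substring(0, mask.length() - 1) : mask;

        // Find the first word that is >= prefix; all matches follow it directly.
        int pos = Arrays.binarySearch(sortedWords, prefix);
        if (pos < 0) {
            pos = -pos - 1; // convert "not found" result into the insertion point
        }

        List<Integer> ids = new ArrayList<>();
        for (int i = pos; i < sortedWords.length; i++) {
            String word = sortedWords[i];
            boolean hit = wildcard ? word.startsWith(prefix) : word.equals(prefix);
            if (!hit) {
                break; // sorted order: no later word can match
            }
            ids.add(i + 1);
        }
        return ids;
    }
}
```

With the words in, indent, index, search, seat, the mask ind* yields the assumed IDs 2 and 3 in this sketch.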
getPage
public CIndexedPage getPage(int idPage)
- returns a document from a given document ID.
- Parameters:
- idPage - document ID
- Returns:
- CIndexedPage object if the document was found,
otherwise null.
getPage
public CIndexedPage getPage(String strURL)
- returns a document from a given URL.
- Parameters:
- strURL - URL
- Returns:
- CIndexedPage object if the document was found,
otherwise null.
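The relationship between the two getPage variants (URL to document ID, document ID to document) can be sketched as below. The class PageTableSketch is hypothetical, a plain String stands in for CIndexedPage, and document IDs are assumed to start at 0. None of these details are confirmed by this documentation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical URL table plus document table, linked by document ID. */
class PageTableSketch {
    private final Map<String, Integer> urlToId = new HashMap<>(); // hashed URL list
    private final List<String> pages = new ArrayList<>();         // index = document ID

    /** Adds a page and returns its new document ID, or -1 if the URL is already known. */
    int add(String url, String content) {
        if (urlToId.containsKey(url)) {
            return -1;
        }
        int id = pages.size();
        pages.add(content);
        urlToId.put(url, id);
        return id;
    }

    /** Looks up a document by ID; returns null if the ID is out of range. */
    String getPage(int idPage) {
        return (idPage >= 0 && idPage < pages.size()) ? pages.get(idPage) : null;
    }

    /** Looks up a document by URL via its ID; returns null if the URL is unknown. */
    String getPage(String strURL) {
        Integer id = urlToId.get(strURL);
        return id == null ? null : getPage(id.intValue());
    }
}
```

Both lookups return null for unknown input, mirroring the documented return values of getPage.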
dump
public void dump()
statistics
public void statistics(PrintStream pout)