There are two types of indexes: word lists and persistent indexes.
Words and properties extracted from a document first appear in a word list, then move to a persistent index.
This organization optimizes query responsiveness and performance while also making efficient use of resources. Although there are multiple indexes internally, these details are completely hidden from the user.
The user sees only a list of documents that satisfy the query posted.
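The flow from word list to persistent index can be pictured with a small model. This is only a sketch under stated assumptions, not Index Server's actual implementation: the class name TwoTierIndex, the merge_threshold parameter, and the use of in-memory dictionaries to stand in for on-disk storage are all hypothetical.

```python
from collections import defaultdict


class TwoTierIndex:
    """Hypothetical model: a small word list in front of a persistent index."""

    def __init__(self, merge_threshold=3):
        self.word_list = defaultdict(set)    # recent postings, kept in memory
        self.persistent = defaultdict(set)   # stands in for the on-disk index
        self.merge_threshold = merge_threshold  # assumed merge trigger
        self._pending_docs = 0

    def add_document(self, doc_id, words):
        # New words always land in the word list first.
        for word in words:
            self.word_list[word].add(doc_id)
        self._pending_docs += 1
        if self._pending_docs >= self.merge_threshold:
            self.merge()

    def merge(self):
        # Move every posting from the word list into the persistent index.
        for word, doc_ids in self.word_list.items():
            self.persistent[word] |= doc_ids
        self.word_list.clear()
        self._pending_docs = 0

    def query(self, word):
        # The caller sees one result list, whichever index holds the posting.
        hits = self.word_list.get(word, set()) | self.persistent.get(word, set())
        return sorted(hits)
```

A caller only ever uses query, which returns the same kind of document list whether the matching entry is still in the word list or has already been merged.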
Content filters do several tasks, including:
- Extracting text chunks.
- Recognizing language shifts in multilingual documents.
- Handling embedded objects.
Content filters emit streams of characters, but Index Server indexes words, so it must be able to identify the words within the character stream. Different languages treat words and breaks between words differently.
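As a rough illustration of turning a character stream into words, the sketch below joins the incoming text chunks and splits them with a regular expression. The function name break_words and the pattern are assumptions made for this example; the rule it applies suits languages that separate words with spaces and punctuation, and a real word breaker would use language-specific rules.

```python
import re

# Assumed rule: a word is a run of letters/digits, optionally with
# internal apostrophes (as in "don't").
_WORD = re.compile(r"\w+(?:'\w+)*")


def break_words(chunks):
    """Yield (position, word) pairs from an iterable of text chunks."""
    # Join first so a word split across two chunks is still seen whole.
    text = "".join(chunks)
    for position, match in enumerate(_WORD.finditer(text)):
        yield position, match.group()
```

For example, list(break_words(["Index Ser", "ver indexes words."])) yields the words "Index", "Server", "indexes", and "words" at positions 0 through 3.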
Normalizer
The normalizer cleans up the words emitted by the word breaker, handling things like capitalization, punctuation and noise word removal.
In most languages, written text contains many noise words. English examples include “the,” “of,” “and,” “you,” and several hundred similar words. References to these words are not stored in the content index.
The system maintains a system-wide list of noise words on a per-language basis, which the administrator can customize. When one of these noise words is produced while a document is being filtered, the noise word is ignored.
Noise word removal can significantly reduce the size of the overall index because noise words constitute the bulk of written text. Users can customize noise word lists to account for local slang and application-specific words.
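A minimal sketch of the normalization step might look like the following. The NOISE_WORDS table, the normalize function, and its parameters are hypothetical names chosen for illustration; the tiny English list holds only the four examples named above, whereas a real list contains several hundred entries.

```python
import string

# Hypothetical per-language noise word table (illustrative entries only).
NOISE_WORDS = {"en": {"the", "of", "and", "you"}}


def normalize(words, language="en", noise_words=NOISE_WORDS):
    """Yield cleaned words, skipping noise words for the given language."""
    noise = noise_words.get(language, set())
    for word in words:
        # Strip surrounding punctuation and fold capitalization.
        cleaned = word.strip(string.punctuation).lower()
        # Drop empty results and anything on the noise word list.
        if cleaned and cleaned not in noise:
            yield cleaned
```

Customizing the list amounts to passing a different table, for example noise_words={"en": NOISE_WORDS["en"] | {"server"}} to treat an application-specific word as noise.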
Once words are normalized, they are finally put into the content index.