Lucene in Teamcenter: Unravelling Inverted Index and Search Optimization
In this blog, we aim to provide a deeper understanding of the Indexer components in Teamcenter. While many of us use the Indexer in Teamcenter, few are familiar with its inner components. Our goal is to offer the Teamcenter community valuable insights, which can help you customize or optimize the Indexer as needed.
The key areas we are going to cover are:
Lucene
Inverted Index
Indexing mechanism
What is an Inverted Index?
An inverted index is an index data structure.
In simple words, it inverts the “document-centric” data structure (document -> terms) into a “term-centric” data structure (term -> documents).
The index is built by analyzing the text of the documents and extracting terms from them.
The inverted index allows for fast and efficient searching by providing a way to look up documents that contain a specific term or set of terms.
The inverted index is composed of two substructures:
Term dictionary - groups all the terms that appear in the documents into a sorted list.
Postings list - stores, for each term, the list of documents in which that term appears.
As an example, consider three documents indexed into Lucene’s inverted index. Each document’s content is analyzed (tokenized) into terms, which are inserted into the inverted index.
Since the terms in the dictionary are sorted, we can quickly find a term (think binary search) and, subsequently, its occurrences in the postings structure. This is the opposite of a “forward index”, which lists the terms contained in a specific document.
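To make this concrete, here is a minimal, self-contained Java sketch of a term-centric index built over three hypothetical documents. The document texts, class name, and map layout are illustrative assumptions, not Lucene’s actual storage format (Lucene keeps its term dictionary and postings lists in compressed on-disk structures).

import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

public class InvertedIndexSketch {
    public static void main(String[] args) {
        // Three hypothetical documents, identified by position (doc ID 0, 1, 2)
        String[] docs = {
            "Lucene builds an inverted index",
            "Teamcenter uses Lucene for search",
            "An index maps terms to documents"
        };

        // Term dictionary (kept sorted by TreeMap) -> postings list of doc IDs
        Map<String, TreeSet<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.length; docId++) {
            for (String token : docs[docId].toLowerCase().split("\\W+")) {
                index.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }

        // Term-centric lookup: which documents contain "lucene"?
        System.out.println("lucene -> " + index.get("lucene")); // prints [0, 1]

        // Dump the whole index: each term followed by its postings list
        index.forEach((term, postings) ->
                System.out.println(term + " -> " + postings));
    }
}

A sorted map stands in for the term dictionary here; a real engine additionally stores term frequencies, positions, and skip structures in the postings so that queries can be ranked and intersected efficiently.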
What is an Analyzer?
Analysis, in Lucene, is the process of converting field text into its most fundamental indexed representation, terms. These terms are used to determine what documents match a query during searching.
An analyzer is an encapsulation of the analysis process.
An analyzer tokenizes text by performing any number of operations on it, which could include extracting words, discarding punctuation, removing accents from characters, lowercasing (also called normalizing), removing common words, reducing words to their root form (stemming), or converting words to their dictionary base form (lemmatization).
This process is also called tokenization; the chunks of text pulled from the stream are called tokens (terms), which are optimized for indexing and searching, allowing for efficient and accurate search operations.
Let’s understand this with a short example of how an analyzer processes a sentence.
Consider the sentence: “Lucene is an amazing library for searching and indexing text.”
Tokenization: The sentence is broken down into individual words or tokens: ["Lucene", "is", "an", "amazing", "library", "for", "searching", "and", "indexing", "text"].
Lowercasing: The tokens are then converted to lowercase: ["lucene", "is", "an", "amazing", "library", "for", "searching", "and", "indexing", "text"].
Stop Word Removal: Common words like "is", "an", "for", "and" might be removed: ["lucene", "amazing", "library", "searching", "indexing", "text"].
Stemming: Words could be reduced to their root forms, so "searching" becomes "search" and "indexing" becomes "index": ["lucene", "amazing", "library", "search", "index", "text"].
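This pipeline can be observed directly in code. The following sketch assumes a recent Lucene release (lucene-core and lucene-analysis-common on the classpath) and prints the tokens produced by the built-in EnglishAnalyzer; exact output may vary between Lucene versions.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalysisDemo {
    public static void main(String[] args) throws Exception {
        String text = "Lucene is an amazing library for searching and indexing text.";
        try (Analyzer analyzer = new EnglishAnalyzer();
             TokenStream stream = analyzer.tokenStream("content", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();                   // must be called before iterating
            while (stream.incrementToken()) { // advance to the next token
                System.out.print("[" + term.toString() + "] ");
            }
            stream.end();
        }
    }
}

With the default Porter stemmer this prints [lucen] [amaz] [librari] [search] [index] [text]: the stop words are gone, and each remaining token is lowercased and stemmed. Note that the real Porter stemmer is more aggressive than the simplified walk-through above.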
Types of Analyzers:
Standard Analyzer: A general-purpose analyzer that uses the Standard Tokenizer, removing common English stop words and applying lowercasing.
Text: "Lucene is awesome!"
Tokens: ["lucene", "awesome"]
Explanation: This removes stop words ("is") and converts to lowercase.
Whitespace Analyzer: Tokenizes text based solely on whitespace, without applying any additional filtering or processing.
Text: "Lucene is awesome!"
Tokens: ["Lucene", "is", "awesome!"]
Explanation: This splits on whitespace without further processing.
Keyword Analyzer: Treats the entire input text as a single token, useful for exact match scenarios.
Text: "Lucene is awesome!"
Tokens: ["Lucene is awesome!"]
Explanation: The entire text is treated as one token.
Simple Analyzer: Splits text on non-letter characters, converts tokens to lowercase, and does not remove stop words.
Text: "Lucene is awesome!"
Tokens: ["lucene", "is", "awesome"]
Explanation: This splits on non-letter characters and lowercases tokens.
Stop Analyzer: Similar to SimpleAnalyzer but removes a predefined list of stop words.
Text: "Lucene is awesome!"
Tokens: ["lucene", "awesome"]
Explanation: Similar to Simple Analyzer but also removes stop words.
Language Analyzer: Designed for a specific language; the English variant, for example, incorporates stemming, stop word removal, and other filters tailored to the language.
Text: "Lucene is awesome!"
Tokens: ["lucene", "awesom"]
Explanation: Includes stemming ("awesome" to "awesom") and stop word removal.
Custom Analyzer: You can create a custom analyzer by combining various tokenizers and filters to suit specific needs.
Text: "Lucene is awesome!"
Tokens: Depends on the specific combination of tokenizers and filters used.
Explanation: You could create an analyzer that, for example, removes stop words, lowercases, and stems the tokens, as sketched below.
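As a sketch of this, Lucene ships a CustomAnalyzer builder (in the lucene-analysis-common module) that assembles a pipeline from factory names. The names used below ("standard", "lowercase", "stop", "porterStem") are the SPI names of Lucene’s bundled factories; the exact set available depends on your Lucene version.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;

public class CustomAnalyzerDemo {
    public static void main(String[] args) throws Exception {
        // Tokenize with the standard tokenizer, then lowercase,
        // remove stop words, and apply Porter stemming
        Analyzer analyzer = CustomAnalyzer.builder()
                .withTokenizer("standard")
                .addTokenFilter("lowercase")
                .addTokenFilter("stop")
                .addTokenFilter("porterStem")
                .build();

        // Analyzing "Lucene is awesome!" with the token-printing loop
        // from the earlier sketch would yield roughly: [lucen] [awesom]
        analyzer.close();
    }
}

Swapping or reordering filters changes which terms reach the inverted index, and therefore what a query can match; this is the main lever available when tuning how indexed text is searched.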