The Solr Search Engine

The Solr search engine is one of the most widely deployed search platforms worldwide. Solr is written in Java and provides REST-like HTTP APIs that return results in formats such as XML and JSON, on top of which search applications can be built.

  • Solr isn’t a web search engine like Google or Bing.
  • Solr has nothing to do with search engine optimization (SEO) for a website.

Solr solves a specific class of problems quite well: problems that require searching across large amounts of unstructured text and returning the most relevant results.

Solr is designed to extract the implicit structure of text into its index to improve searching; every query is then matched against these indexes rather than against the raw text.

Common search-engine use cases

  • Basic keyword search : In general, users want to type a few simple keywords and get back great results as quickly as possible. This sounds simple, but even this basic requirement poses several challenges:
  1. Response time should be kept as low as possible to deliver a good search experience.
  2. Spelling correction and synonym mapping may be required for some user inputs.
  3. Users should be able to paginate through 'n' results from a single entry of the search string.
  4. Localization can also be a challenge in some use cases.

With a search engine like Solr, these features come out of the box and are easy to implement.
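For instance, a basic keyword search is just an HTTP GET against a core's `/select` handler. A minimal sketch in Python; the host, port, and core name `products` are assumptions for illustration:

```python
from urllib.parse import urlencode

def build_search_url(base_url, query, rows=10, start=0):
    """Build a Solr /select URL for a basic keyword search."""
    params = {
        "q": query,      # the user's keywords
        "rows": rows,    # page size
        "start": start,  # offset, for pagination
        "wt": "json",    # ask for a JSON response
    }
    return f"{base_url}/select?{urlencode(params)}"

# Assumed local Solr instance with a core named "products".
url = build_search_url("http://localhost:8983/solr/products", "wireless headphones")
```

Fetching that URL returns the top 10 matches; subsequent pages only change `start`.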

  • Ranked retrieval : A SQL query either matches a row or it doesn't; there is no option (or no inexpensive option) for a partial match against the data. A search engine like Solr, by contrast, matches the query terms against the stored documents in a fuzzy way and assigns each match a relevance score indicating the strength of the match: the higher the score, the more relevant the document is to the search term. Solr can also be tweaked to assign more weight to certain fields or terms, so that they contribute more to a document's ranking.
  • Beyond keyword search : Consider the search experience on an e-commerce site such as amazon.com. A user's journey typically starts by typing an item of interest into the search bar, and the search engine responds with matching items. The journey doesn't end there: customers then narrow the results using the filters provided. The challenge is that different kinds of items need different filters; the filters for fashion items are completely different from the filters for electronics. So the search solution must also categorize results using document features, allowing users to narrow down their results. This is known as faceted search, and it's one of the main strengths of Solr.
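The field weighting mentioned above can be expressed through Solr's eDisMax query parser, where the `qf` parameter lists fields with `^boost` factors. A sketch; the field names `title` and `description` are assumptions:

```python
def build_boosted_params(query, field_boosts):
    """Build query params where some fields count more toward ranking.

    With Solr's eDisMax parser, `qf` lists the fields to search, and
    `field^3` makes a match in that field weigh 3x a plain field.
    """
    qf = " ".join(f"{field}^{boost}" if boost != 1 else field
                  for field, boost in field_boosts.items())
    return {"q": query, "defType": "edismax", "qf": qf}

# A title match now contributes three times as much as a description match.
params = build_boosted_params("solr tutorial", {"title": 3, "description": 1})
# params["qf"] == "title^3 description"
```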

Built-in features of Solr that enhance the user search experience

  • Pagination and sorting : A typical search-engine use case returns a small set of documents per query, usually 10 to 100. Further documents for the same query can be retrieved using Solr's built-in paging support. Results are sorted by relevance score by default, but you can ask Solr to sort results by other fields in your documents.
  • Faceting : Discussed above under beyond-keyword search: Solr can categorize results by document features and return counts for each category.
  • Autosuggest : Solr's suggester component can propose query completions as the user types.
  • Spell-checking : This is familiar from the Google search experience: if a search term is mistyped, Google auto-corrects it and searches accordingly, and when it presents the results it informs the user that it corrected the term from "acb" to "abc". There are two parts to this: auto-correcting the spelling, and informing the user which corrected term was actually used for the search. Both can be achieved in Solr using its built-in spell-checking feature.
  • Hit highlighting : When searching documents that contain a significant amount of text, you can display the specific sections of each document that matched using Solr's hit-highlighting feature.
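All of the features above surface as parameters on an ordinary query request. A combined sketch; the field names `category`, `content`, and `price` are assumptions, and spell-checking assumes a spellcheck component is configured in solrconfig.xml:

```python
from urllib.parse import urlencode

# One request exercising pagination, sorting, faceting,
# spell-checking, and hit highlighting together.
params = {
    "q": "wireles speakers",          # user input, possibly misspelled
    "start": 20, "rows": 10,          # pagination: third page of 10
    "sort": "price asc",              # override default relevance ordering
    "facet": "true",                  # enable faceting...
    "facet.field": ["category"],      # ...on this (assumed) field
    "spellcheck": "true",             # suggest corrections for "wireles"
    "spellcheck.collate": "true",     # return a corrected whole query
    "hl": "true",                     # enable hit highlighting...
    "hl.fl": "content",               # ...on the (assumed) content field
}
query_string = urlencode(params, doseq=True)  # doseq repeats list-valued keys
```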

Advanced features of Solr

Advanced Full-Text Search Capabilities : Solr enables powerful matching capabilities, including phrases, wildcards, joins, grouping, and much more, across any data type.

Optimized for High Volume Traffic : Solr is proven at extremely large scales the world over. For higher query throughput, you add replicas of your index so that more servers can handle more requests. This means that if your index is replicated across three servers, you can handle roughly three times the number of queries per second, because each server handles one-third of the query traffic.
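The scaling claim above amounts to a back-of-the-envelope capacity model; the numbers below are illustrative, not benchmarks:

```python
def cluster_qps(per_server_qps, replicas):
    """Rough capacity model: each replica serves an equal share of
    queries, so throughput grows roughly linearly with replica count."""
    return per_server_qps * replicas

cluster_qps(200, 3)  # → 600: three replicas handle ~3x the queries
```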

Near Real-Time Indexing : Solr takes advantage of Lucene's near-real-time indexing capabilities to make sure you see your content when you want to see it. This feature allows you to index thousands of documents per second and have them be searchable almost immediately. Solr achieves this using a mechanism known as a soft commit, which makes new documents visible to searches without the cost of a full, durable commit to disk.
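As a sketch, a client can ask for near-real-time visibility by attaching `commitWithin` (milliseconds) to an update request, which Solr typically satisfies with a soft commit. The host, core name `articles`, and document fields below are assumptions:

```python
import json

def build_update_request(core, docs, commit_within_ms=1000):
    """Build the URL, params, and JSON body for a Solr update whose
    documents should become searchable within `commit_within_ms` ms."""
    url = f"http://localhost:8983/solr/{core}/update"  # assumed host/port
    params = {"commitWithin": commit_within_ms}        # triggers a (soft) commit
    body = json.dumps(docs)                            # JSON list of documents
    return url, params, body

url, params, body = build_update_request(
    "articles", [{"id": "1", "title": "Solr NRT"}])
```

POSTing `body` to `url` with those params makes the document searchable within about a second, without waiting for a hard commit.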

Highly Configurable and User Extensible Caching : Fine-grained controls on Solr's built-in caches make it easy to optimize performance.

Multiple search indices : Solr supports multi-tenant architectures, making it easy to isolate users and content.

Solr architecture

[Figure: Solr architecture diagram]


Query parser is responsible for parsing the queries passed by the end user and converting them into Lucene query objects. Lucene provides TermQuery, BooleanQuery, PhraseQuery, PrefixQuery, RangeQuery, MultiTermQuery, FilteredQuery, SpanQuery, and so on as query implementations.
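As a rough illustration of how query syntax maps onto those Lucene classes, consider a toy classifier. This is a sketch for intuition, not Solr's actual parser:

```python
def lucene_query_type(q):
    """Toy mapping from query syntax to the Lucene query class a parser
    would roughly produce (real parsing is far more involved)."""
    if q.startswith('"') and q.endswith('"'):
        return "PhraseQuery"   # "apache solr"
    if " AND " in q or " OR " in q:
        return "BooleanQuery"  # title:solr AND text:search
    if q.endswith("*"):
        return "PrefixQuery"   # sol*
    if "[" in q and " TO " in q:
        return "RangeQuery"    # price:[10 TO 20]
    return "TermQuery"         # solr
```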

Index searcher is the core search component of Solr, backed by a default base searcher class. This class is responsible for returning the results that match the searched keywords, ordered by their computed relevance scores.

Index reader provides access to the indexes stored in the filesystem and is used to read the index when searching.

Index writer allows you to create and maintain indexes in Apache Lucene.

Analyzer is responsible for examining field text and generating tokens. You can define your own custom analyzers depending on your use case.

Tokenizer breaks field data into lexical units or tokens.

The filter examines the stream of tokens from the tokenizer and either keeps them, transforms them, discards them, or creates new ones. A tokenizer and its filters together form an analyzer chain, or pipeline; there can be only one tokenizer per analyzer, and the output of each stage is fed to the next. Solr uses this analysis process both for indexing and for querying.
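The pipeline above can be modeled with a toy implementation. These are simplified stand-ins for intuition, not Solr's actual tokenizer and filter classes:

```python
def whitespace_tokenizer(text):
    """Toy tokenizer: break field data into lexical units (tokens)."""
    return text.split()

def lowercase_filter(tokens):
    """Filter that transforms tokens (normalizes case)."""
    return [t.lower() for t in tokens]

def stop_filter(tokens, stopwords=frozenset({"the", "a", "an", "of"})):
    """Filter that discards tokens (common stop words)."""
    return [t for t in tokens if t not in stopwords]

def analyze(text):
    """One tokenizer followed by a chain of filters, each stage
    feeding the next, as in a Solr analyzer."""
    tokens = whitespace_tokenizer(text)
    for f in (lowercase_filter, stop_filter):
        tokens = f(tokens)
    return tokens

analyze("The Solr Search Engine")  # → ["solr", "search", "engine"]
```

Running the same chain at index time and at query time is what lets a query for "solr" match a document containing "Solr".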

When a user fires a search query at Solr, it is actually passed to a request handler. Based on the request, the request handler calls the query parser. You can see an example of the analyzer chain in the figure below.

[Figure: an example analyzer chain of tokenizer and filters]

Once the query is parsed, it is handed over to the index searcher. The index searcher runs the query against the index store, using the index reader, and passes the gathered results to the response writer.

More articles by Avinash Upadhya
