Vector Indexing
This post is part of my series on vector databases, which began with an overview of vector databases and vector embeddings.
As discussed in the last post, vector embeddings transform queries and content into high-dimensional vectors. Vector indexing is a method for efficiently storing and searching this high-dimensional vector data, enabling fast similarity searches across large datasets.
Many of these techniques involve some form of vector quantization, i.e., mapping a fine-granularity vector space onto a coarser one. Some methods used for indexing and querying vectors are:
Semantic Vector Encoding and Similarity Search Using Fulltext Search Engines: This technique encodes and searches dense semantic vectors with full-text search engines like Elasticsearch. The main idea is to transform the vectors into text-encoded documents that the full-text engine can index and query while preserving vector similarity: the vector values are encoded as text tokens and indexed with a full-text index. For example, w~ = [0.12, −0.13, 0.065] can be encoded as w~ = ['0I10i0d1', '1I10ineg0d2', '2I10i0d0']. The actual search runs in two phases. In phase one, the query vector is encoded and searched with the full-text engine. In phase two, the full-text results are converted back to vectors and evaluated for similarity. One might ask why this approach is used instead of plain text search. The answer lies in the fact that these vectors encode not only the text itself but also contextual information, which can yield more accurate relevance matches.
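To make the two-phase idea concrete, here is a toy sketch in Python. The token format (dimension index plus sign plus rounded magnitude) and the dict-based inverted index are illustrative stand-ins, not the paper's exact encoding or a real engine like Elasticsearch:

```python
import numpy as np

def to_tokens(vec, precision=1):
    """Turn each dimension's rounded value into a string token.
    The token scheme here is purely illustrative."""
    tokens = []
    for i, v in enumerate(vec):
        sign = "neg" if v < 0 else "pos"
        tokens.append(f"d{i}_{sign}_{round(abs(v), precision)}")
    return tokens

def phase1(query_tokens, index, top=10):
    """Phase 1: full-text-style retrieval, scoring docs by token overlap."""
    scores = {}
    for t in query_tokens:
        for doc_id in index.get(t, []):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    return sorted(scores, key=scores.get, reverse=True)[:top]

def phase2(query_vec, candidates, vectors):
    """Phase 2: rerank the candidates with exact cosine similarity."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(candidates, key=lambda i: cosine(query_vec, vectors[i]))

rng = np.random.default_rng(0)
vectors = rng.normal(size=(200, 8))

# Build the inverted index: token -> list of document ids.
index = {}
for doc_id, v in enumerate(vectors):
    for t in to_tokens(v):
        index.setdefault(t, []).append(doc_id)

q = vectors[5]
cands = phase1(to_tokens(q), index)
best = phase2(q, cands, vectors)
```

Searching with one of the indexed vectors as the query should return that vector itself, since all of its tokens match in phase one and cosine similarity is maximal in phase two.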
Inverted file index (IVF): This vector indexing strategy partitions the vector space into clusters using k-means or other clustering algorithms and assigns each vector to the nearest cluster centroid. Then, it creates an inverted index that maps each centroid to a list of vectors in its cluster. To search for a query vector, it only compares the query with the centroids and the vectors in the closest clusters.
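A minimal IVF sketch in NumPy, using plain Lloyd's k-means for the clustering step (production systems like Faiss use heavily optimized implementations; the cluster count and `nprobe` value below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(data, k, iters=10):
    """Plain Lloyd's k-means: returns (centroids, assignments)."""
    centroids = data[rng.choice(len(data), k, replace=False)]
    for _ in range(iters):
        # Assign each vector to its nearest centroid.
        d = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            members = data[assign == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids, assign

def build_ivf(data, k):
    """Inverted index: centroid id -> ids of the vectors in its cluster."""
    centroids, assign = kmeans(data, k)
    lists = {j: np.where(assign == j)[0] for j in range(k)}
    return centroids, lists

def ivf_search(query, data, centroids, lists, nprobe=2):
    """Compare the query only against vectors in the nprobe closest clusters."""
    order = np.linalg.norm(centroids - query, axis=1).argsort()[:nprobe]
    cand = np.concatenate([lists[j] for j in order])
    dists = np.linalg.norm(data[cand] - query, axis=1)
    return cand[dists.argmin()]

data = rng.normal(size=(1000, 16)).astype(np.float32)
centroids, lists = build_ivf(data, k=16)
q = data[42] + 0.01 * rng.normal(size=16).astype(np.float32)
best = ivf_search(q, data, centroids, lists)
```

Increasing `nprobe` trades speed for recall: probing more clusters examines more candidates but reduces the chance of missing the true nearest neighbor.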
Product quantization (PQ): Product quantization is a method for compressing high-dimensional vectors to reduce memory usage and improve search speed. The process involves splitting each vector into several equal-length subvectors, learning a separate codebook of centroids for each subspace (typically with k-means), and replacing each subvector with the ID of its nearest centroid, so that a long vector of floats is stored as a short sequence of compact codes. Distances can then be approximated directly from these codes at query time.
This method reduces the memory footprint of indexes, which matters when comparing large arrays of vectors, since they must all be loaded into memory to be compared. PQ is implemented in Faiss, an open-source library for efficient similarity search and clustering of dense vectors, and is widely used in domains such as image retrieval, face recognition, and audio processing.
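The idea can be sketched in NumPy: split each vector into subvectors, learn a small codebook per subspace, store only centroid IDs, and answer queries with per-subspace distance lookup tables (the codebook and subvector sizes below are illustrative, far smaller than typical production settings):

```python
import numpy as np

rng = np.random.default_rng(1)

def kmeans(data, k, iters=10):
    """Plain Lloyd's k-means, returning only the centroids."""
    centroids = data[rng.choice(len(data), k, replace=False)]
    for _ in range(iters):
        assign = np.linalg.norm(data[:, None] - centroids[None], axis=2).argmin(1)
        for j in range(k):
            members = data[assign == j]
            if len(members):
                centroids[j] = members.mean(0)
    return centroids

def pq_train(data, m, k=16):
    """Learn one codebook per subspace; m = number of subvectors."""
    return [kmeans(sub, k) for sub in np.split(data, m, axis=1)]

def pq_encode(data, codebooks):
    """Replace each subvector with the id of its nearest centroid."""
    subs = np.split(data, len(codebooks), axis=1)
    codes = [np.linalg.norm(s[:, None] - cb[None], axis=2).argmin(1)
             for s, cb in zip(subs, codebooks)]
    return np.stack(codes, axis=1).astype(np.uint8)  # n x m code matrix

def pq_search(query, codes, codebooks):
    """Asymmetric distance: precompute per-subspace tables, then sum lookups."""
    qsubs = np.split(query, len(codebooks))
    tables = [np.linalg.norm(cb - qs, axis=1) ** 2
              for qs, cb in zip(qsubs, codebooks)]
    dists = sum(tables[i][codes[:, i]] for i in range(len(codebooks)))
    return dists.argmin()

data = rng.normal(size=(500, 32)).astype(np.float32)
codebooks = pq_train(data, m=4)
codes = pq_encode(data, codebooks)   # 32 floats -> 4 bytes per vector
best = pq_search(data[7], codes, codebooks)
```

Here 32 four-byte floats shrink to 4 one-byte codes per vector, a 32x compression, and query-time distances come from small table lookups rather than full-precision arithmetic.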
Hierarchical navigable small worlds (HNSW): This vector indexing strategy builds a graph structure that connects each vector to its nearest neighbors at different levels of granularity. The graph is constructed by inserting vectors one by one and linking them to existing nodes using a greedy heuristic. To search for a query vector, it traverses the graph from an entry point in the top, sparsest layer, moving closer to the query at each step and descending through the layers until it reaches a local minimum in the bottom layer. HNSW is an efficient and scalable way to perform approximate nearest neighbor search on high-dimensional vectors: it achieves high accuracy and recall while keeping memory consumption and query time low, and it can handle dynamic insertion of vectors without rebuilding the index.
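The greedy traversal at the heart of HNSW can be illustrated with a deliberately simplified, single-layer version: a plain kNN graph plus a hill-descending walk. Real HNSW builds multiple layers incrementally and keeps a beam of candidates rather than a single current node, so treat this only as a sketch of the core idea:

```python
import numpy as np

rng = np.random.default_rng(2)

def build_knn_graph(data, m=8):
    """Link every vector to its m nearest neighbors (single flat layer;
    real HNSW maintains several layers and inserts points incrementally)."""
    d = np.linalg.norm(data[:, None] - data[None], axis=2)
    return {i: list(d[i].argsort()[1:m + 1]) for i in range(len(data))}

def greedy_search(query, data, graph, entry=0):
    """Hop to whichever neighbor is closest to the query until no
    neighbor improves on the current node: a local minimum, which
    serves as the approximate nearest neighbor."""
    current = entry
    best = np.linalg.norm(data[current] - query)
    while True:
        improved = False
        for nb in graph[current]:
            dist = np.linalg.norm(data[nb] - query)
            if dist < best:
                current, best, improved = nb, dist, True
        if not improved:
            return current

data = rng.normal(size=(300, 8)).astype(np.float32)
graph = build_knn_graph(data)
q = data[17] + 0.001
result = greedy_search(q, data, graph)
```

Each hop strictly decreases the distance to the query, so the walk always terminates; the "small world" link structure of the full algorithm is what keeps the number of hops low even on large datasets.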
Spherical hashing (SH): This vector indexing strategy maps high-dimensional vectors onto binary codes using hyperspheres as hash functions. Each hypersphere defines a binary bit by dividing the vector space into two regions: inside or outside. Each vector is then hashed by checking its membership in each hypersphere, forming a binary code word. To search for a query vector, it computes the Hamming distance between the query code word and the database code words and retrieves the nearest ones. SH is effective for similarity search over normalized unit vectors, e.g., under cosine similarity or angular distance. It generates compact, balanced binary codes that preserve the similarity structure of the original vectors, and it can be combined with IVF (IVF-SH) to improve search performance by reducing the number of hash functions that must be evaluated.
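A toy spherical-hashing sketch: random data points serve as hypersphere centres, and each radius is set to the median distance so that every bit splits the data roughly in half. The published algorithm additionally optimizes the centres and radii iteratively for balance and pairwise independence, which is omitted here:

```python
import numpy as np

rng = np.random.default_rng(3)

def train_spheres(data, n_bits):
    """Pick random data points as hypersphere centres; set each radius
    to the median distance so each sphere contains about half the data."""
    centers = data[rng.choice(len(data), n_bits, replace=False)]
    d = np.linalg.norm(data[:, None] - centers[None], axis=2)
    radii = np.median(d, axis=0)
    return centers, radii

def encode(data, centers, radii):
    """Bit b is 1 when the vector falls inside hypersphere b."""
    d = np.linalg.norm(data[:, None] - centers[None], axis=2)
    return (d < radii).astype(np.uint8)

def hamming_search(query_code, codes):
    """Return the index of the code word closest in Hamming distance."""
    return np.count_nonzero(codes != query_code, axis=1).argmin()

data = rng.normal(size=(400, 16)).astype(np.float32)
data /= np.linalg.norm(data, axis=1, keepdims=True)   # normalize to unit vectors
centers, radii = train_spheres(data, n_bits=32)
codes = encode(data, centers, radii)
query_code = encode(data[3:4], centers, radii)[0]
best = hamming_search(query_code, codes)
```

Each 16-dimensional float vector collapses to a 32-bit code, and search reduces to cheap bit comparisons; recall can be tuned by retrieving the top few Hamming matches and reranking them with exact distances.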
If you want to delve deeper, consider checking out Awesome Vector Search. It's a curated collection of outstanding vector search frameworks, libraries, cloud services, and research papers for vector similarity search. It encompasses a range of techniques and uses for vector indexing and retrieval.