Vector Quantization
Vector quantization (VQ) encodes high-dimensional vectors into compact representations, enabling efficient similarity search and storage in RAG pipelines.
Why Vector Quantization?
- Reducing Storage Requirements
- High-dimensional vectors (e.g., embeddings) consume significant memory, making them infeasible to store or transmit at scale. VQ compresses vectors by approximating them with a small set of representative centroids or binary codes, drastically reducing storage while retaining essential information.
- Accelerating Similarity Search
- Exact nearest-neighbor search in high-dimensional spaces is computationally expensive. VQ enables approximate nearest-neighbor (ANN) search by comparing compact representations instead of full-dimensional vectors, improving retrieval speed.
- Handling Large-Scale Datasets
- Processing millions or billions of high-dimensional vectors requires immense computational power. VQ makes large-scale retrieval systems scalable, since quantized vectors need fewer resources for similarity computations.
- Enhancing Efficiency on Resource-Constrained Systems
- Devices with limited memory and processing power (e.g., mobile or edge devices) cannot handle large high-dimensional datasets. VQ provides lightweight, compressed vector representations suitable for such environments.
VQ is used alongside other techniques to improve end-to-end retrieval. Refer to this post for a code example of multi-vector query processing with the Milvus vector database.
The following are a few examples of quantization techniques:
1. Product Quantization
- Concept:
  - Divides each high-dimensional vector into smaller subvectors.
  - Quantizes each subvector independently by mapping it to a cluster centroid in its subspace.
  - Represents each vector as a sequence of cluster indices, one per subvector.
- Advantages:
  - Allows compressed storage of vectors.
  - Enables approximate nearest-neighbor (ANN) search with high recall and reduced storage overhead.
  - Efficient in large-scale retrieval systems.
- Applications in Retrieval:
  - Often used in systems requiring fast vector similarity search, such as image or text retrieval.
  - Embeddings are stored as compact quantized representations, reducing memory usage.
from sklearn.cluster import KMeans
import numpy as np
# Simulate high-dimensional vectors
np.random.seed(0)
vectors = np.random.rand(100, 8) # 100 vectors of dimension 8
# Divide into subvectors (e.g., two subvectors of dimension 4 each)
subvectors = [vectors[:, :4], vectors[:, 4:]]
# Quantize each subvector independently
codebooks = []          # one centroid table per subspace
quantized_vectors = []  # one array of centroid indices per subspace
for subvector in subvectors:
    kmeans = KMeans(n_clusters=8, random_state=0).fit(subvector)  # 8 centroids per subspace
    codebooks.append(kmeans.cluster_centers_)
    quantized_vectors.append(kmeans.predict(subvector))
print("Quantized vector indices for subvector 1:", quantized_vectors[0])
print("Quantized vector indices for subvector 2:", quantized_vectors[1])
2. Binary Quantization
- Concept:
  - Maps continuous vectors to binary vectors (e.g., using hashing techniques such as Locality-Sensitive Hashing or neural binarization).
  - Approximates the original space by encoding vectors as binary codes, making computations fast and efficient.
- Advantages:
  - Fast computation: similarity metrics such as Hamming distance are computationally efficient.
  - Storage efficiency: binary representations are very compact.
  - Energy efficiency: suitable for low-power devices.
- Applications in Retrieval:
  - Widely used where computational speed and storage constraints are critical, such as on mobile devices or in edge computing.
import numpy as np
from sklearn.preprocessing import binarize
# Simulate high-dimensional vectors
np.random.seed(1)
vectors = np.random.rand(100, 8)
# Binarize vectors (e.g., threshold at 0.5)
binary_vectors = binarize(vectors, threshold=0.5)
print("Original vectors:\n", vectors[:5])
print("Binary vectors:\n", binary_vectors[:5])
3. Scalar Quantization
- Concept:
  - Maps each component of the vector independently to a finite set of quantized values.
  - Represents each scalar dimension as a small integer or codebook index.
- Advantages:
  - Simplest form of quantization, with minimal computation.
  - Useful for coarse-grained approximations of vector data.
- Applications in Retrieval:
  - Often employed as a preprocessing step in systems where finer-grained quantization such as PQ is applied later; a sketch of a simple uniform scalar quantizer follows below.
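To make this concrete, here is a minimal sketch of uniform scalar quantization to 8-bit codes; the per-dimension min/max scaling is one common choice among several, not a specific library's API:
import numpy as np
# Simulate high-dimensional vectors
np.random.seed(2)
vectors = np.random.rand(100, 8)
# Uniform scalar quantization: each dimension is mapped independently
# onto 256 equal steps between its own min and max.
mins = vectors.min(axis=0)
scale = (vectors.max(axis=0) - mins) / 255.0
codes = np.round((vectors - mins) / scale).astype(np.uint8)  # 1 byte per dimension
reconstructed = codes.astype(np.float64) * scale + mins      # dequantize
print("Max reconstruction error:", np.abs(vectors - reconstructed).max())
Each float dimension shrinks from 4 or 8 bytes to a single byte, and the reconstruction error is bounded by half a quantization step per dimension.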