Vector Databases — Concepts, Architecture, and Usage in AI, RAG, and Modern ML Systems
Prerequisites: Basic Python proficiency, a general understanding of what machine learning models do (no deep math required), and familiarity with traditional databases (SQL or NoSQL). Some awareness of LLMs and embeddings is helpful but not required — the guide explains them from scratch.
What Vector Databases Are and Why They Exist
A vector database is a purpose-built storage and retrieval system designed for high-dimensional vector data — the dense numerical representations (called embeddings) produced by modern AI models. Unlike traditional databases that store and query scalar values like strings, integers, and dates, a vector database is optimized for a fundamentally different operation: similarity search.
The distinction matters because the queries you ask are different. In PostgreSQL, you ask "give me all rows where status = 'active'." In a vector database, you ask "give me the 10 items most similar to this query vector." Equality versus similarity — that is the dividing line between the two worlds.
The Core Problem: Traditional Databases Weren't Built for This
Relational databases excel at exact-match lookups and range queries. B-tree and hash indexes make it trivial to find rows where price > 100 or email = 'user@example.com'. These operations compare scalar values along well-defined orderings.
But modern AI doesn't produce scalar values — it produces embeddings. An embedding model like OpenAI's text-embedding-3-small, Sentence-BERT, or CLIP takes raw input (text, images, audio, code) and compresses its semantic meaning into a fixed-length array of floating-point numbers. A single sentence might become a vector of 768 or 1,536 dimensions. In this space, there's no meaningful "greater than" or "equals" — only distance and direction.
graph LR
subgraph Vector Path
direction LR
A["📄 Raw Data
(text, images, audio)"] --> B["🧠 Embedding Model
(e.g. text-embedding-3-small)"]
B --> C["📊 High-Dimensional Vectors
[0.021, -0.87, ..., 0.34]"]
C --> D[("🗄️ Vector Database")]
D --> E["🔍 Similarity Query
'find 10 nearest neighbors'"]
end
subgraph Traditional Path
direction LR
F["📄 Raw Data
(forms, transactions)"] --> G["📋 Structured Rows
(name, age, price)"]
G --> H[("🗄️ Relational Database")]
H --> I["🔍 SQL Query
WHERE status = 'active'"]
end
How Data Becomes Vectors
Embedding models are trained to place semantically similar inputs close together in high-dimensional space. The sentence "How do I reset my password?" and "I forgot my login credentials" produce vectors that are nearly identical in direction, even though they share almost no words. This is what makes vector search so powerful — it captures meaning, not just keyword overlap.
Here's what an embedding actually looks like in practice. The model takes your input and returns a plain array of floats:
from openai import OpenAI

client = OpenAI()
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="How do I reset my password?"
)
vector = response.data[0].embedding
print(len(vector))  # 1536
print(vector[:5])   # [0.0213, -0.0087, 0.0341, -0.0156, 0.0278]
That 1,536-element float array is the embedding. Every piece of content you want to search — every document, image, or audio clip — gets transformed into one of these vectors. You store millions of them in the vector database, and at query time, you embed the user's question with the same model and ask for the nearest neighbors.
The embedding model used to generate stored vectors must match the model used to embed queries. Vectors from text-embedding-3-small (1,536 dims) are incompatible with vectors from Sentence-BERT (768 dims) — they live in completely different spaces.
What Modalities Can Be Embedded?
Embeddings aren't limited to text. The table below shows common modalities and the models that produce their vector representations:
| Modality | Example Models | Typical Dimensions |
|---|---|---|
| Text | OpenAI text-embedding-3-small, Sentence-BERT, Cohere Embed | 384 – 3,072 |
| Images | CLIP (ViT-L/14), DINOv2, OpenCLIP | 512 – 1,024 |
| Audio | CLAP, Whisper encoder, AudioMAE | 512 – 768 |
| Code | CodeBERT, StarEncoder, OpenAI text-embedding-3-small | 768 – 1,536 |
| Multimodal | CLIP, ImageBind, ONE-PEACE | 512 – 1,024 |
The key insight: once data from any modality is converted into a vector, the search problem is identical. Finding a similar image and finding a similar document are the same mathematical operation — computing distances in high-dimensional space.
The Curse of Dimensionality: Why Brute Force Fails
You might wonder: why not just store vectors in a regular database column and compute distances at query time? For a few thousand vectors, that works. But the math falls apart at scale.
Comparing a single query vector against one stored vector of 1,536 dimensions requires 1,536 floating-point multiplications plus a sum. Against a million vectors, that's ~1.5 billion operations per query. Against 100 million vectors, it becomes completely impractical for real-time responses.
This is a manifestation of the curse of dimensionality — as dimensions increase, the volume of the space grows so fast that data becomes sparse, distances between points converge, and naive search strategies collapse. Traditional indexing structures like B-trees and R-trees, designed for low-dimensional data (2D, 3D), lose their effectiveness above roughly 10–20 dimensions.
A brute-force cosine similarity search works perfectly in a Jupyter notebook with 10,000 vectors. But it doesn't scale. The jump from "works in a notebook" to "serves production traffic at millisecond latency over 50 million vectors" is exactly the gap that vector databases fill — through specialized indexing algorithms like HNSW and IVF.
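To make the gap concrete, here is that brute-force approach as a minimal NumPy sketch (random stand-in vectors; a real corpus would hold actual embeddings). It is exactly this O(n × d) scan that specialized indexes avoid:

```python
import numpy as np

rng = np.random.default_rng(42)
corpus = rng.normal(size=(10_000, 1536)).astype("float32")  # 10k stored vectors
query = rng.normal(size=1536).astype("float32")

# Normalize once so cosine similarity reduces to a single dot product
corpus_norm = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
query_norm = query / np.linalg.norm(query)

# One matrix-vector product: 10,000 x 1,536 multiply-adds per query
scores = corpus_norm @ query_norm
top_10 = np.argsort(scores)[-10:][::-1]  # indices of the 10 most similar vectors
print(top_10)
```

At this scale the query finishes in milliseconds; multiply the corpus by 10,000 and the same line of code becomes the bottleneck.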
The Mental Model
Think of it this way: a vector database is to embeddings what PostgreSQL is to structured rows. PostgreSQL gives you schemas, indexes (B-tree, GIN, GiST), query planning, ACID transactions, and a query language (SQL) — all optimized for structured, scalar data. A vector database gives you specialized indexes (HNSW, IVF, PQ), distance metrics (cosine, euclidean, dot product), efficient memory management, and APIs — all optimized for dense floating-point vectors where the only meaningful query is "what's closest to this?"
Both are storage engines. Both build indexes to avoid full scans. The difference is the shape of the data and the nature of the questions you ask.
Vector Embeddings, Similarity Search, and Distance Metrics
At the heart of every vector database is a deceptively simple idea: represent data as points in high-dimensional space, then find nearby points. The numerical arrays that encode this spatial meaning are called vector embeddings — and understanding how they work, and how "nearby" is defined, is essential to building effective search systems.
What Are Vector Embeddings?
A vector embedding is a dense, fixed-length array of floating-point numbers (typically 256–1536 dimensions) produced by a trained neural network. Unlike sparse representations like one-hot encoding or TF-IDF — where most values are zero — embeddings pack semantic meaning into every dimension. Each dimension captures some learned feature of the input, and the geometric relationships between vectors encode semantic relationships between the original data.
When an embedding model is trained, it learns to place semantically similar items close together and dissimilar items far apart. The word "dog" ends up near "puppy" and "canine," but far from "refrigerator." A photo of a sunset clusters near other sunset photos, not near spreadsheets. This geometric proximity is the similarity — and it's what makes vector search possible.
The Classic Example: Word Arithmetic
The most famous demonstration of embedding structure is the word analogy task. In a well-trained word embedding space, the vectors for words like king, queen, man, and woman form a parallelogram. The vector from man to king captures the concept of "royalty," and that same directional offset, applied to woman, lands you near queen.
Algebraically, this means:
# Conceptual: vectors from a trained word embedding model
# king - man + woman ≈ queen
result = embeddings["king"] - embeddings["man"] + embeddings["woman"]
nearest = find_closest_vector(result, vocabulary_embeddings)
# nearest → "queen"
This works because the embedding space has learned to separate the "gender" axis from the "royalty" axis. Subtracting man from king removes the male-associated direction, and adding woman reintroduces the female-associated direction — landing near queen. These aren't hand-coded rules; they emerge from the statistical patterns in training data.
No human decides what each dimension means. The neural network discovers structure during training. Dimension 47 might loosely correlate with "animacy" and dimension 312 with "formality," but these features are distributed and entangled — not cleanly labeled. What matters is that the geometry captures meaning.
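The analogy can be reproduced end to end in a toy, hand-built space where one axis stands in for gender and the other for royalty (a real model learns hundreds of entangled dimensions, but the geometry is the same):

```python
import numpy as np

# Toy 2-D embedding space: axis 0 ≈ "maleness", axis 1 ≈ "royalty"
embeddings = {
    "man":   np.array([ 1.0, 0.0]),
    "woman": np.array([-1.0, 0.0]),
    "king":  np.array([ 1.0, 1.0]),
    "queen": np.array([-1.0, 1.0]),
}

# king - man + woman: remove the male direction, add the female direction
result = embeddings["king"] - embeddings["man"] + embeddings["woman"]

def closest(vec, vocab):
    """Return the vocabulary word whose vector is nearest to vec (L2)."""
    return min(vocab, key=lambda word: np.linalg.norm(vocab[word] - vec))

print(closest(result, embeddings))  # → queen
```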
The Three Primary Distance Metrics
Once you have embeddings, you need a way to measure how "close" two vectors are. This is where distance and similarity metrics come in. Vector databases support several, but three dominate in practice: Cosine Similarity, Euclidean Distance (L2), and Dot Product (Inner Product). Each defines "closeness" differently, and the right choice depends on your embedding model and use case.
1. Cosine Similarity
Cosine similarity measures the angle between two vectors, ignoring their magnitude entirely. Two vectors pointing in nearly the same direction have a cosine similarity close to 1, regardless of whether one is twice as long as the other. Perpendicular vectors score 0, and opposite vectors score −1.
This is the default choice for text embeddings. When a sentence embedding model encodes "How do I reset my password?" and "I need to change my password," both vectors point in a similar direction — even though longer documents might produce vectors with larger magnitudes. By stripping out magnitude, cosine similarity focuses purely on what a piece of text is about, not how much text there is.
2. Euclidean Distance (L2)
Euclidean distance is the straight-line distance between two points in space — the familiar Pythagorean formula generalized to n dimensions. Unlike cosine similarity, it's sensitive to both direction and magnitude. Two vectors pointing in the same direction but with different lengths will have a nonzero Euclidean distance.
This metric is commonly used with image embeddings and spatial data where absolute position in the embedding space matters. If your model encodes images such that brightness or complexity affects vector magnitude, L2 captures those differences. Note that smaller L2 values mean more similar vectors — the opposite convention from cosine similarity, where bigger is better.
3. Dot Product (Inner Product)
The dot product multiplies corresponding elements of two vectors and sums the results. It combines both direction and magnitude into a single score: vectors that are both aligned and large produce the highest dot products. This makes it useful in two scenarios.
First, when your vectors are already normalized to unit length (magnitude = 1), the dot product becomes mathematically identical to cosine similarity — but it's faster to compute because you skip the normalization step. Second, when magnitude carries meaningful signal (e.g., a popularity or confidence score baked into the vector), dot product naturally weights higher-magnitude vectors more heavily.
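This equivalence is easy to verify directly. A short NumPy check (random vectors) normalizes both inputs and confirms that their dot product matches cosine similarity to floating-point precision:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=768)
b = rng.normal(size=768)

# Normalize both vectors to unit length (magnitude = 1)
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot = np.dot(a_unit, b_unit)  # no per-query normalization needed

print(abs(cosine - dot) < 1e-12)  # → True
```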
Computing All Three in Python
The math behind these metrics is straightforward. Here's a concrete implementation using NumPy on two sample vectors:
import numpy as np
# Two sample embedding vectors (small for clarity; real ones are 256-1536 dims)
a = np.array([0.12, 0.85, 0.44, 0.31])
b = np.array([0.15, 0.79, 0.48, 0.28])
# 1. Cosine Similarity — angle between vectors (range: -1 to 1)
cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"Cosine Similarity: {cosine_sim:.4f}") # → 0.9971
# 2. Euclidean Distance (L2) — straight-line distance (range: 0 to ∞)
l2_distance = np.linalg.norm(a - b)
print(f"Euclidean Distance: {l2_distance:.4f}") # → 0.0837
# 3. Dot Product — direction + magnitude combined
dot_product = np.dot(a, b)
print(f"Dot Product: {dot_product:.4f}") # → 0.9875
Notice how these two vectors — nearly identical in direction — score a cosine similarity of ~0.997 (very close to 1) and a small Euclidean distance of ~0.084. The dot product of ~0.988 reflects both their alignment and their magnitudes. On a pair of dissimilar vectors, all three metrics would tell a dramatically different story.
When to Use Each Metric
The choice of distance metric isn't arbitrary — it should match the geometry that your embedding model was trained to optimize. Using the wrong metric can degrade search quality even with a perfect index.
| Metric | Best For | Key Property | Convention |
|---|---|---|---|
| Cosine Similarity | Text / NLP embeddings (OpenAI, Cohere, Sentence-Transformers) | Ignores magnitude; compares direction only | Higher = more similar |
| Euclidean (L2) | Image embeddings, spatial data, anomaly detection | Sensitive to both direction and magnitude | Lower = more similar |
| Dot Product | Pre-normalized vectors; recommendation systems with magnitude signals | Combines direction and magnitude; fastest to compute | Higher = more similar |
The distance metric you configure in your vector database must align with how the embedding model was trained. OpenAI's text-embedding-3-small is optimized for cosine similarity. Using L2 distance with it won't crash anything — but your search results will be subtly worse because the model never learned to make L2 distances meaningful. Always check the model card or documentation for the recommended metric.
In practice, if you're building a RAG pipeline or semantic search system with text embeddings, start with cosine similarity — it's the safest default. If you're working with image embeddings from models like CLIP, check whether the model normalizes its output vectors. If it does, dot product and cosine become equivalent and dot product is the faster choice. Reserve L2 for cases where absolute positioning in the embedding space matters, such as spatial data or certain anomaly detection pipelines.
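Checking whether a model emits unit-length vectors takes only a few lines. The sketch uses random stand-in arrays; with a real model you would pass its actual output matrix:

```python
import numpy as np

def is_normalized(vectors: np.ndarray, tol: float = 1e-3) -> bool:
    """True if every row of the matrix has L2 norm approximately 1."""
    norms = np.linalg.norm(vectors, axis=1)
    return bool(np.all(np.abs(norms - 1.0) < tol))

raw = np.random.rand(5, 512)                             # not unit-length
unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)  # explicitly normalized

print(is_normalized(raw))   # → False
print(is_normalized(unit))  # → True
```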
Indexing Algorithms: Trading Accuracy for Speed
Finding the single closest vector in a collection of a billion 1536-dimensional embeddings using brute-force comparison means computing a billion distance calculations — each involving 1536 floating-point operations. That is O(n × d) per query, and it is completely unusable at production scale. A single query could take seconds or minutes instead of milliseconds.
This is the fundamental tradeoff in vector search: Approximate Nearest Neighbor (ANN) algorithms sacrifice a small, controlled amount of recall (the fraction of true nearest neighbors returned) in exchange for orders-of-magnitude speed improvements. Instead of scanning every vector, they build clever data structures that let queries skip most of the dataset.
Recall@k measures what fraction of the true top-k nearest neighbors appear in the ANN result. A recall of 0.95 means 95 out of 100 true nearest neighbors are found. Most production systems target 0.95–0.99 recall while achieving 10–1000× speedup over brute-force.
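Computing recall@k is a one-liner once you have ground-truth neighbors from an exact brute-force search (toy IDs shown here):

```python
def recall_at_k(true_ids, ann_ids):
    """Fraction of the true top-k neighbors present in the ANN result."""
    return len(set(true_ids) & set(ann_ids)) / len(true_ids)

# Ground truth from exact search vs. what the ANN index actually returned
true_top10 = [3, 17, 42, 56, 61, 77, 80, 91, 95, 99]
ann_top10  = [3, 17, 42, 56, 61, 77, 80, 91, 12, 48]  # missed 2 of 10

print(recall_at_k(true_top10, ann_top10))  # → 0.8
```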
Four major algorithm families dominate the landscape. Each makes a different structural bet on how to organize vectors for fast retrieval.
HNSW — Hierarchical Navigable Small World
HNSW builds a multi-layer graph where each vector is a node connected by edges to its nearby neighbors. The top layer is the sparsest — only a few long-range connections between distant regions of the space. Each layer below adds more nodes and shorter-range connections, with the bottom layer containing every vector. Think of it like navigating a city: first you take the highway (top layer), then a main road, then a side street to your exact destination.
At query time, the algorithm starts at the graph's entry point in the top layer and greedily walks toward the query vector. When it can't get closer in the current layer, it drops down to the next layer and continues with finer-grained connections. This greedy traversal is extremely fast — typically O(log n) — and routinely achieves recall above 0.95.
# HNSW index in Qdrant — ef_construct controls build quality,
# m controls max edges per node (higher = better recall, more RAM)
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, HnswConfigDiff

client = QdrantClient(url="http://localhost:6333")
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    hnsw_config=HnswConfigDiff(
        m=16,              # edges per node (default: 16)
        ef_construct=128,  # build-time search width (default: 100)
    ),
)
The catch is memory: HNSW stores the full graph in RAM. For 1 billion 768-dim vectors at float32, that is ~3 TB of vector data alone, plus graph overhead. Despite this, HNSW is the default algorithm in most modern vector databases — Qdrant, Weaviate, pgvector (hnsw index), and Milvus all use it as their primary index type.
IVF — Inverted File Index
IVF takes a completely different approach: instead of building a graph, it partitions the entire vector space into regions using k-means clustering. Each cluster has a centroid, and every vector is assigned to its nearest centroid. These regions form Voronoi cells — geometric partitions where every point inside a cell is closer to that cell's centroid than to any other.
At query time, the algorithm computes distances from the query vector to all centroids (fast — there are only nlist of them, typically 256–4096), then searches only the vectors inside the nprobe nearest clusters. If you have 1000 clusters and set nprobe=10, you search just 1% of the dataset.
# IVF index in FAISS — nlist = number of clusters
import faiss
import numpy as np

d = 768       # vector dimension
nlist = 1024  # number of Voronoi cells
quantizer = faiss.IndexFlatL2(d)  # brute-force index for the centroids
index = faiss.IndexIVFFlat(quantizer, d, nlist)
vectors = np.random.rand(1_000_000, d).astype("float32")
index.train(vectors)  # runs k-means to find centroids
index.add(vectors)
index.nprobe = 10     # search 10 nearest clusters at query time
query_vector = np.random.rand(1, d).astype("float32")  # FAISS expects shape (n_queries, d)
distances, ids = index.search(query_vector, k=10)
IVF uses less memory than HNSW since it stores only the centroids plus the original vectors (no graph edges). The tradeoff is that recall depends heavily on nprobe: too low and you miss neighbors that sit near cluster boundaries; too high and you approach brute-force speed. IVF is a strong choice for very large datasets where HNSW's memory footprint becomes prohibitive.
Product Quantization (PQ) — Compressing Vectors
Product Quantization attacks the problem from a different angle: instead of organizing vectors for faster search, it compresses the vectors themselves. PQ splits each vector into m sub-vectors (e.g., a 768-dim vector split into 96 sub-vectors of 8 dimensions each), then quantizes each sub-vector independently to its nearest centroid from a small codebook (typically 256 entries = 8 bits per sub-vector).
The result is dramatic compression. A 768-dim float32 vector normally takes 3072 bytes. With PQ using 96 sub-vectors × 8 bits each, it compresses to just 96 bytes — a 32× reduction. Distance calculations use precomputed lookup tables against the codebooks, making them surprisingly fast.
PQ is almost always combined with IVF as IVF-PQ, which is the go-to approach for billion-scale search in FAISS. IVF narrows the search space to a few clusters; PQ compresses the vectors within those clusters so they fit in memory.
# IVF-PQ in FAISS — billion-scale search with compressed vectors
# (continues the faiss / numpy imports from the previous example)
d = 768
nlist = 4096  # number of Voronoi cells
m = 96        # number of sub-vectors (must divide d)
nbits = 8     # bits per sub-quantizer (256 centroids each)
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
training_vectors = np.random.rand(256_000, d).astype("float32")  # training sample
index.train(training_vectors)  # learns both centroids and codebooks
index.add(all_vectors)         # all_vectors: float32 array of shape (n, d)
index.nprobe = 32
query = np.random.rand(1, d).astype("float32")
distances, ids = index.search(query, k=10)
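To demystify what PQ actually stores, here is a hand-rolled encode/decode sketch in NumPy. It uses random codebooks in place of the k-means-trained ones FAISS learns, so reconstruction quality is illustrative only; the byte counts are the point:

```python
import numpy as np

d, m, k = 768, 96, 256  # dims, sub-vectors, codebook entries per sub-space
sub_d = d // m          # 8 dims per sub-vector

rng = np.random.default_rng(1)
codebooks = rng.normal(size=(m, k, sub_d))  # in FAISS these come from k-means
vector = rng.normal(size=d)

# Encode: for each sub-vector, store the index of its nearest codebook entry
subs = vector.reshape(m, sub_d)
codes = np.array(
    [np.argmin(np.linalg.norm(codebooks[j] - subs[j], axis=1)) for j in range(m)],
    dtype=np.uint8,  # 8 bits per sub-vector
)

# Decode: look the centroids back up (lossy reconstruction)
reconstructed = np.concatenate([codebooks[j, codes[j]] for j in range(m)])

print(codes.nbytes, vector.astype(np.float32).nbytes)  # → 96 3072
```

The 96-byte code versus the 3,072-byte float32 original is the 32× compression described above.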
Annoy — Approximate Nearest Neighbors Oh Yeah
Annoy, created at Spotify, builds a forest of binary trees using random projections. Each tree recursively splits the vector space with a random hyperplane at each node, creating a binary partition. At query time, Annoy traverses the trees using a shared priority queue and merges their candidate lists. More trees yield higher recall at the cost of more memory and slower builds and queries.
Annoy's key advantage is simplicity: the index is a single memory-mapped file that can be shared across processes. It is fast to build and very efficient for read-only, static datasets. However, it does not support adding or removing vectors after the index is built, and its recall is generally lower than HNSW at equivalent speed.
# Annoy — static index with memory-mapped file
import numpy as np
from annoy import AnnoyIndex

d = 768
vectors = np.random.rand(100_000, d)  # stand-in for real embeddings
index = AnnoyIndex(d, "angular")      # "angular", "euclidean", "dot"
for i, vector in enumerate(vectors):
    index.add_item(i, vector)
index.build(n_trees=100)              # more trees = better recall
index.save("embeddings.ann")          # single file, memory-mappable

# Query — search_k controls accuracy/speed tradeoff
query = vectors[0]
ids, distances = index.get_nns_by_vector(query, n=10, search_k=5000,
                                         include_distances=True)
How HNSW and IVF Search Differently
The diagram below shows the two most common approaches side by side. HNSW navigates a layered graph from coarse to fine; IVF narrows search to a few clusters in a partitioned space.
graph LR
subgraph HNSW["HNSW: Multi-Layer Graph Traversal"]
direction TB
L2_title["Layer 2 — Sparse"]
L2A(("A")) ---|long-range link| L2B(("B"))
L2B ---|long-range link| L2C(("C"))
L2_title ~~~ L2A
L1_title["Layer 1 — Medium"]
L1A(("A")) --- L1D(("D"))
L1D --- L1B(("B"))
L1B --- L1E(("E"))
L1E --- L1C(("C"))
L1_title ~~~ L1A
L0_title["Layer 0 — Dense"]
L0A(("A")) --- L0F(("F"))
L0F --- L0D(("D"))
L0D --- L0G(("G"))
L0G --- L0B(("B"))
L0B --- L0H(("H"))
L0H --- L0E(("E"))
L0E --- L0C(("C"))
L0_title ~~~ L0A
L2B --> L1B
L1D --> L0D
L0G -->|result| Q1["✦ Query Result"]
end
subgraph IVF["IVF: Cluster-Based Partitioning"]
direction TB
QV["Query Vector"] --> Centroids["Compare to all centroids"]
Centroids --> C1["Cluster 1\n●●●●"]
Centroids --> C2["Cluster 2 ✓\n●●●●●"]
Centroids --> C3["Cluster 3 ✓\n●●●●"]
Centroids --> C4["Cluster 4\n●●●"]
Centroids --> C5["Cluster 5 ✓\n●●●●●●"]
C2 --> Search["Search only\nnprobe=3 clusters"]
C3 --> Search
C5 --> Search
Search --> Q2["✦ Query Result"]
end
Algorithm Comparison
Choosing the right algorithm depends on your dataset size, memory budget, and whether your data changes frequently. The table below summarizes the practical tradeoffs.
| Algorithm | Query Speed | Memory Usage | Recall | Build Time | Update Support | Best For |
|---|---|---|---|---|---|---|
| HNSW | ⚡ Excellent (~1 ms) | 🔴 High (vectors + graph) | 🟢 >0.95 typical | Moderate | ✅ Insert & delete | Default choice, <100M vectors |
| IVF-Flat | ⚡ Good (~2-5 ms) | 🟡 Medium (vectors + centroids) | 🟢 0.90–0.99 (nprobe-dependent) | Slow (k-means training) | ⚠️ Add only, no delete | Large datasets, memory-constrained |
| IVF-PQ | ⚡ Good (~2-5 ms) | 🟢 Very low (compressed) | 🟡 0.80–0.95 (lossy) | Slow (k-means + PQ training) | ⚠️ Add only, no delete | Billion-scale, tight memory |
| Annoy | ⚡ Good (~3-10 ms) | 🟡 Medium (tree forest) | 🟡 0.70–0.90 | Fast | ❌ Static only | Read-only, static datasets |
Newer Approaches Worth Knowing
ScaNN (Google, 2020) introduces anisotropic vector quantization — instead of minimizing reconstruction error uniformly, it focuses compression fidelity on the directions that matter most for inner-product ranking. ScaNN consistently tops the ANN Benchmarks leaderboard and is available as an open-source library.
DiskANN (Microsoft, 2019) solves the "too big for RAM" problem by building a graph-based index (similar to HNSW) that lives on SSD rather than in memory. It uses a small in-memory compressed representation for routing, then fetches full vectors from disk only when needed. DiskANN can index billions of vectors on a single machine with just 64 GB of RAM. It powers vector search in Microsoft's Bing and Azure Cognitive Search.
Start with HNSW. It gives the best recall-to-speed ratio and is the default in nearly every managed vector database. Only move to IVF-PQ or DiskANN when your dataset exceeds your RAM budget — typically beyond 50–100 million vectors. If your data is static and you want simplicity, Annoy is a lightweight alternative for smaller scales.
Architecture of Vector Databases vs Traditional Databases
Traditional databases and vector databases solve fundamentally different problems, and their internals reflect that. An RDBMS like PostgreSQL organizes data into rows and columns, indexes them with B-trees or LSM-trees, and optimizes for exact lookups, joins, and range scans — all governed by ACID transactions. A vector database, by contrast, is purpose-built around approximate nearest neighbor (ANN) search across high-dimensional embedding spaces. Its core index structures — HNSW graphs, IVF clusters, PQ-compressed representations — have no equivalent in the relational world.
This distinction isn't cosmetic. The entire data path — how data is ingested, indexed, stored, queried, and distributed — is re-engineered to serve a single goal: find the most similar vectors to a query vector, fast, at scale.
Architecture at a Glance
The following diagram traces the lifecycle of a vector through a modern vector database, from ingestion to query result. Notice how the architecture separates concerns — batching and validation happen before indexing, the index lives in memory for speed while raw vectors persist on disk, and the query engine combines ANN search with metadata filtering before returning ranked results.
graph LR
subgraph Client
C1["Client App
vectors + metadata"]
end
subgraph Ingestion Layer
I1["Batching &
Validation"]
I2["Normalization &
Pre-processing"]
end
subgraph Index Layer
IX1["HNSW Graph
Builder"]
IX2["IVF / PQ
Index"]
end
subgraph Storage Layer
S1[("Vectors
on Disk")]
S2[("Index
in Memory")]
S3[("Metadata
Store")]
end
subgraph Query Engine
Q1["ANN Search"]
Q2["Metadata
Filter"]
Q3["Re-ranking &
Scoring"]
end
subgraph Distributed Layer
D1["Shard 1"]
D2["Shard 2"]
D3["Shard N"]
end
C1 --> I1
I1 --> I2
I2 --> IX1
I2 --> IX2
IX1 --> S2
IX2 --> S2
I2 --> S1
I2 --> S3
C1 -. "query vector" .-> Q1
S2 --> Q1
S3 --> Q2
Q1 --> Q2
Q2 --> Q3
Q3 -. "ranked results" .-> C1
S1 -.-> D1
S1 -.-> D2
S1 -.-> D3
Traditional vs Vector Database — Structural Comparison
Before diving into each layer, it helps to see the two architectures side-by-side. The table below maps each concern from a traditional RDBMS to its vector database counterpart.
| Concern | Traditional RDBMS | Vector Database |
|---|---|---|
| Primary data unit | Row / Document | High-dimensional vector + optional payload |
| Index structures | B-tree, LSM-tree, hash index | HNSW graph, IVF-Flat, IVF-PQ, ScaNN |
| Query paradigm | Exact match, range scan, JOIN | Approximate nearest neighbor (ANN) similarity |
| Consistency model | ACID transactions | Typically eventual consistency; some offer tunable levels |
| Distance metrics | N/A | Cosine similarity, Euclidean (L2), dot product |
| Scaling strategy | Vertical scaling plus read replicas | Horizontal sharding of vector partitions |
| Memory profile | Disk-oriented with buffer pool | Memory-heavy (index often pinned in RAM) |
| Filtering | WHERE clauses via indexed B-tree | Pre-filter, post-filter, or integrated ANN filter |
Key Architectural Components of a Vector Database
1. Ingestion Layer
The ingestion layer is the front door. It accepts vectors (typically 128–4096 dimensional float arrays) along with optional metadata payloads — structured fields like category, timestamp, or source_url that you'll later use for filtering. Before anything hits the index, this layer handles batching (grouping individual inserts into bulk operations), validation (checking dimensionality consistency), and normalization (e.g., L2-normalizing vectors for cosine similarity).
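The normalization step can be sketched in a few lines of NumPy; production systems do the same thing, just batched alongside validation (names here are illustrative):

```python
import numpy as np

def l2_normalize(batch: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so cosine similarity becomes a dot product."""
    norms = np.linalg.norm(batch, axis=1, keepdims=True)
    return batch / np.clip(norms, 1e-12, None)  # guard against all-zero rows

batch = np.random.rand(1_000, 1536).astype("float32")  # stand-in for embeddings
normalized = l2_normalize(batch)
print(np.allclose(np.linalg.norm(normalized, axis=1), 1.0))  # → True
```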
Most production vector databases expose both single-insert and batch-upsert APIs. Batching matters enormously: inserting 100,000 vectors one at a time can be 10–50× slower than a single bulk operation because each insert may trigger index maintenance overhead.
# Qdrant batch upsert — vectors + metadata payloads
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

client = QdrantClient(url="http://localhost:6333")
points = [
    PointStruct(id=i, vector=embeddings[i], payload={"category": "ai", "source": urls[i]})
    for i in range(len(embeddings))
]
# Batch upsert: far faster than inserting one at a time
client.upsert(collection_name="articles", points=points)
2. Index Layer
The index layer is the heart of what makes a vector database different. Instead of B-trees that support exact lookups, vector databases maintain ANN index structures designed for fast approximate similarity search. The two most common are:
- HNSW (Hierarchical Navigable Small World) — A multi-layered graph where each node is a vector. Search starts at the top layer (sparse, long-range connections) and greedily descends to the bottom layer (dense, local connections). Offers excellent recall (95–99%) with sub-millisecond latency at the cost of high memory usage.
- IVF (Inverted File Index) — Partitions the vector space into clusters using k-means. At query time, only the nearest clusters are searched. Faster to build than HNSW and uses less memory, but typically lower recall for the same speed budget.
A critical design decision is real-time inserts vs. batch re-indexing. HNSW supports incremental inserts — you can add vectors to the graph without rebuilding it. IVF-based indexes, however, often require periodic re-training of cluster centroids as data distribution shifts. Some systems (like Milvus) use a hybrid approach: new vectors land in a small, mutable "growing segment" and are periodically merged into the main immutable index.
HNSW's ef_construction (graph build quality) and M (max connections per node) directly trade build time and memory for search recall. IVF's nlist (number of clusters) and nprobe (clusters searched at query time) offer a similar dial. Tuning these is not optional in production — default values are rarely optimal for your specific dataset size and latency budget.
3. Storage Layer
The storage layer manages three distinct categories of data: raw vectors, index data, and metadata. Each has different access patterns, so many vector databases treat them separately.
- Raw vectors are stored on disk and loaded on demand. A single 1536-dimensional float32 vector (OpenAI's text-embedding-3-small) consumes ~6 KB. At 100 million vectors, that's roughly 600 GB of raw vector data alone.
- Index data (the HNSW graph or IVF centroids) typically lives in memory for fast traversal. This is the primary memory bottleneck — an HNSW graph for 100M vectors can consume 50–200 GB of RAM depending on M and dimensionality.
- Metadata is often stored in a secondary structure — a columnar store, RocksDB-backed key-value store, or SQLite — optimized for the filtering operations that accompany vector search.
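These capacity figures follow from a single formula: storage = count × dimensions × bytes per value. A tiny helper (hypothetical, for planning) makes the arithmetic explicit:

```python
def vector_storage_bytes(n_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw vector storage; float32 = 4 bytes per value, float16 = 2."""
    return n_vectors * dims * bytes_per_value

# One text-embedding-3-small vector: 1536 dims x 4 bytes = 6,144 bytes (~6 KB)
print(vector_storage_bytes(1, 1536))                  # → 6144
# 100 million of them: ~614 GB of raw vector data, before any index overhead
print(vector_storage_bytes(100_000_000, 1536) / 1e9)  # → 614.4
```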
Storage engines vary significantly across implementations. Qdrant and Weaviate use memory-mapped files (mmap), letting the OS page vector data in and out of RAM transparently. Milvus uses a tiered architecture with separate object storage (S3/MinIO) for cold data and in-memory caches for hot segments. FAISS, being a library rather than a database, gives you raw control — you choose whether indexes live on CPU memory, GPU memory, or disk.
4. Query Engine
The query engine receives a query vector, searches the ANN index for the top-K most similar vectors, optionally applies metadata filters, and returns scored results. The mechanics look like this:
# Qdrant query: ANN search + metadata filtering
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue

client = QdrantClient(url="http://localhost:6333")

results = client.search(
    collection_name="articles",
    query_vector=query_embedding,  # 1536-dim vector from your embedding model
    query_filter=Filter(
        must=[FieldCondition(key="category", match=MatchValue(value="ai"))]
    ),
    limit=10,              # top-K results
    score_threshold=0.75,  # minimum similarity score
)

for result in results:
    print(f"ID: {result.id}, Score: {result.score:.4f}, Source: {result.payload['source']}")
Many vector databases also support hybrid search — combining dense vector similarity with sparse keyword signals (BM25) into a single ranked result. Weaviate and Qdrant both offer this natively, using reciprocal rank fusion (RRF) or weighted linear combination to merge the two score distributions. This is particularly valuable in RAG pipelines where pure semantic search can miss keyword-specific matches.
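Reciprocal rank fusion itself is only a few lines. A sketch, assuming rankings arrive as ordered lists of document IDs; k=60 is the damping constant conventionally used:

```python
def rrf_merge(dense_ranked, sparse_ranked, k=60):
    """Reciprocal rank fusion: score(d) = sum over lists of 1/(k + rank).
    Documents ranked highly in either list float to the top of the merge."""
    scores = {}
    for ranking in (dense_ranked, sparse_ranked):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF works on ranks rather than raw scores, it sidesteps the problem that BM25 scores and cosine similarities live on incompatible scales.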
5. Distributed Layer
When a single node can't hold all your vectors in memory, you need horizontal scaling. The distributed layer handles this through two mechanisms: sharding (splitting vectors across nodes) and replication (copying shards for fault tolerance).
Sharding strategies vary. Some systems hash vector IDs to assign shards (simple, but queries must fan out to all shards). Others use partition-based sharding — assigning vectors to shards based on their position in the embedding space, so a query only hits the relevant shards. Milvus, for example, uses a segment-based architecture where data is split into sealed segments that can be distributed across a cluster of query nodes.
Replication adds availability but introduces consistency challenges. Since most vector databases favor eventual consistency over strict ACID, a recently upserted vector might not be immediately searchable on all replicas. For most AI/ML workloads this is acceptable, but it's worth understanding if your application needs strong read-after-write guarantees.
Filtering Strategies: Pre, Post, and Integrated
One of the trickiest problems in vector database design is combining metadata filters with ANN search. This comes up constantly: "find the 10 most similar articles, but only from the last 30 days." There are three approaches, each with real tradeoffs.
| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Pre-filtering | Apply metadata filter first to get candidate IDs, then run ANN search only within that subset | No wasted ANN computation; results always match filter | If the filter is very selective (few matches), ANN index can't work efficiently — may degrade to brute-force scan |
| Post-filtering | Run ANN search first to get top-K candidates, then discard those that don't match the filter | ANN search runs at full efficiency; simple to implement | Wasteful — you may discard most results. If top-100 ANN results all fail the filter, you return nothing |
| Integrated filtering | Filter-aware ANN — the index structure itself prunes non-matching vectors during graph traversal | Best of both worlds: efficient search that respects filters natively | Complex to implement; requires tight coupling between index structure and metadata store |
If your queries almost always include metadata filters (common in RAG and multi-tenant systems), choose a database with integrated filtering. Qdrant and Weaviate both implement filter-aware HNSW traversal. Pinecone uses a hybrid approach that automatically selects pre- or post-filtering based on filter selectivity. Avoid relying on pure post-filtering in production — it fails silently by returning fewer results than requested.
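The post-filtering failure mode is easy to reproduce. Here is a brute-force stand-in for the ANN index over hypothetical toy data (not a real client API):

```python
# 100 points on a line; only the last 10 carry category "b".
points = [
    {"id": i, "vec": [float(i)], "cat": "a" if i < 90 else "b"}
    for i in range(100)
]

def dist(q, p):
    return abs(q[0] - p["vec"][0])

def ann_search(q, fetch):
    # Stand-in for the ANN index: exact top-`fetch` by distance.
    return sorted(points, key=lambda p: dist(q, p))[:fetch]

def post_filter(q, k, pred, fetch=10):
    # Fetch top-`fetch` by similarity first, then drop non-matching rows.
    return [p for p in ann_search(q, fetch) if pred(p)][:k]

def pre_filter(q, k, pred):
    # Restrict to matching rows first, then rank the survivors.
    return sorted((p for p in points if pred(p)), key=lambda p: dist(q, p))[:k]
```

With a selective filter, post-filtering returns an empty list even though ten matching rows exist. That is the silent failure mode to avoid in production.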
The Deployment Spectrum
Vector databases span a wide range of deployment models, from a Python import to a fully managed cloud service. Your choice depends on dataset size, operational maturity, and whether you need persistence, scaling, or managed infrastructure.
| Deployment Model | Examples | Best For | Limitations |
|---|---|---|---|
| Embedded / Library | FAISS, ChromaDB, LanceDB | Prototyping, small datasets (<1M vectors), single-process apps | No built-in replication, no multi-tenant isolation, limited filtering |
| Self-hosted Server | Milvus, Qdrant, Weaviate | Full control, data sovereignty, custom tuning, large-scale deployments | Operational burden — you manage upgrades, backups, scaling |
| Managed Cloud | Pinecone, Zilliz Cloud, Weaviate Cloud | Zero-ops, auto-scaling, production workloads without infra team | Vendor lock-in risk, cost at scale, less tuning control |
The single most common surprise when moving from prototype to production is memory consumption. An HNSW index for 10 million 1536-dimensional vectors can easily require 30–60 GB of RAM — far more than the raw vector data itself. Plan your deployment model and infrastructure budget around the index memory footprint, not just the data size.
The Vector Database Landscape: Pinecone, Weaviate, Qdrant, Milvus, ChromaDB, and pgvector
The vector database market has exploded alongside the rise of LLMs and retrieval-augmented generation (RAG). As of 2024–2025, you have roughly a dozen serious options — but they fall into four distinct categories based on how they're built, deployed, and operated. Understanding these categories is more useful than memorizing feature lists, because each category implies a different set of trade-offs around control, cost, and operational burden.
Below, we walk through each category and its flagship databases, then consolidate everything into a comparison table at the end.
Purpose-Built Managed: Pinecone
Pinecone is the fully managed, proprietary option. You don't deploy anything — you get an API key, create an index, and start upserting vectors. Their serverless architecture (launched in 2024) means you no longer even provision pods; you pay per query and per storage unit, which makes small-scale experimentation essentially free.
Pinecone supports namespaces for logical partitioning within an index, metadata filtering for combining vector search with attribute-based constraints, and sparse-dense hybrid queries for blending keyword and semantic search. The developer experience is polished — client SDKs for Python, Node.js, Java, and Go, plus a clean dashboard for monitoring.
The catch: Pinecone is not open source and cannot be self-hosted. You're locked into their infrastructure and pricing. For prototyping and moderate production workloads, costs are reasonable. But at scale — billions of vectors, high query throughput — pricing can escalate quickly, and you have limited control over index tuning or infrastructure placement.
from pinecone import Pinecone
pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("product-catalog")
# Upsert vectors with metadata
index.upsert(vectors=[
("vec-1", [0.12, 0.84, ...], {"category": "electronics", "price": 299.99}),
("vec-2", [0.45, 0.21, ...], {"category": "books", "price": 14.99}),
])
# Query with metadata filter
results = index.query(
vector=[0.11, 0.82, ...],
top_k=5,
filter={"category": {"$eq": "electronics"}, "price": {"$lt": 500}}
)
Purpose-Built Open Source: Weaviate, Qdrant, and Milvus
This category gives you the power of a dedicated vector engine with the freedom to self-host, inspect, and modify the source. Each of these three has carved out a distinct identity.
Weaviate
Weaviate is written in Go and distinguishes itself with vectorization modules — built-in integrations that generate embeddings for you at ingest time. You can plug in OpenAI, Cohere, Hugging Face, or local models, meaning you send raw text and Weaviate handles the embedding step. This simplifies your pipeline significantly.
Weaviate exposes a GraphQL API, supports hybrid search (BM25 + vector), and has strong multi-tenancy support, making it a solid fit for SaaS platforms that serve per-tenant data. The managed version is Weaviate Cloud Services (WCS), but you can also run it via Docker or Kubernetes.
{
Get {
Article(
hybrid: { query: "machine learning trends", alpha: 0.75 }
where: { path: ["year"], operator: GreaterThan, valueInt: 2023 }
limit: 10
) {
title
summary
_additional { score }
}
}
}
Qdrant
Qdrant is written in Rust, and it shows — benchmarks consistently place it among the fastest options for both indexing and querying. It offers both gRPC and REST APIs, giving you the choice between raw throughput and ease of integration.
Qdrant's payload indexes are a standout feature. You can create indexes on metadata fields (string, integer, float, geo, datetime) and combine them with vector search in a single query, all with efficient pre-filtering. The documentation is exceptionally well-organized, which has contributed to its rapid community growth. Qdrant Cloud provides the managed option.
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue, Range
client = QdrantClient(url="http://localhost:6333")
results = client.query_points(
collection_name="articles",
query=[0.11, 0.82, ...],
query_filter=Filter(
must=[
FieldCondition(key="category", match=MatchValue(value="tech")),
FieldCondition(key="year", range=Range(gte=2024)),
]
),
limit=10,
)
Milvus
Milvus is the heavy lifter, built in C++ and Go, designed for billion-scale deployments. It supports GPU-accelerated indexing, a wide variety of index types (IVF_FLAT, IVF_SQ8, HNSW, DiskANN, and more), and separates storage from compute for elastic scaling. If you're working at the scale of a large recommendation engine or image search across hundreds of millions of items, Milvus is built for that.
The trade-off is operational complexity. A production Milvus deployment involves multiple components — proxy, query nodes, data nodes, index nodes, etcd, MinIO/S3, and Pulsar/Kafka. It's not something you spin up casually. Zilliz Cloud provides the fully managed version for teams that want Milvus-scale without the infrastructure burden.
Weaviate, Qdrant, and Milvus are all open source, but they vary enormously in operational complexity. Qdrant and Weaviate can run as a single binary or Docker container for small deployments. Milvus, by contrast, is a distributed system from the ground up — even the "standalone" mode pulls in etcd and MinIO. Factor in ops burden, not just features, when choosing.
Lightweight / Embedded: ChromaDB
ChromaDB takes the SQLite approach to vector databases: it's a Python-native library that runs in-process, requires zero configuration, and stores everything locally (backed by SQLite and a local HNSW index). You pip install chromadb and start inserting documents in under a minute.
This makes ChromaDB the default choice for prototyping RAG applications, Jupyter notebook experiments, and small-to-medium production apps where your collection fits comfortably in memory on a single machine (up to a few million vectors). It also handles document storage and optional embedding generation via integrations with OpenAI, Cohere, and sentence-transformers.
The limitation is scaling. ChromaDB doesn't shard across nodes, doesn't support replication, and performance degrades as collections grow into the tens of millions. For production systems at scale, you'll eventually graduate to one of the purpose-built options above. But for getting a RAG pipeline running in an afternoon, nothing beats it.
import chromadb
client = chromadb.Client() # in-memory; use PersistentClient for disk
collection = client.create_collection("docs")
collection.add(
documents=["Vector databases store embeddings", "RAG combines retrieval with generation"],
ids=["doc-1", "doc-2"],
metadatas=[{"topic": "databases"}, {"topic": "llm"}],
)
results = collection.query(query_texts=["how do vector DBs work?"], n_results=3)
Extension-Based: pgvector
pgvector is a PostgreSQL extension that adds vector column types and approximate nearest-neighbor search to your existing Postgres database. If your team already runs Postgres — and statistically, you probably do — pgvector lets you add vector search without introducing a new database into your stack.
As of pgvector 0.7+, it supports both HNSW and IVF-Flat indexes for approximate nearest-neighbor search, as well as exact (brute-force) search for smaller datasets. You store vectors alongside your relational data, which means you can join vector similarity results with traditional SQL queries — user tables, permissions, application data — in a single transaction. No synchronization headaches between two databases.
-- Enable the extension
CREATE EXTENSION IF NOT EXISTS vector;
-- Create a table with a vector column
CREATE TABLE articles (
id SERIAL PRIMARY KEY,
title TEXT,
category TEXT,
embedding vector(1536) -- OpenAI ada-002 dimension
);
-- Create an HNSW index for fast approximate search
CREATE INDEX ON articles USING hnsw (embedding vector_cosine_ops);
-- Query: find similar articles in a specific category
SELECT id, title, 1 - (embedding <=> '[0.12, 0.84, ...]'::vector) AS similarity
FROM articles
WHERE category = 'technology'
ORDER BY embedding <=> '[0.12, 0.84, ...]'::vector
LIMIT 10;
The trade-off is performance at scale. Dedicated vector databases optimize for this one job — memory layout, SIMD instructions, custom disk formats, distributed query execution. pgvector runs within Postgres's general-purpose query engine, which means it can't match the throughput or latency of Qdrant or Milvus at tens of millions of vectors. For most applications with under ~5 million vectors, the difference is negligible. Beyond that, benchmark carefully.
If your dataset is under a few million vectors and you already use PostgreSQL, pgvector is the pragmatic default. You avoid adding infrastructure, your team already knows SQL, and you can always migrate to a dedicated system later if performance demands it. The migration path is straightforward — export vectors, re-index in the new system, swap the query layer.
Head-to-Head Comparison
The table below summarizes the key differentiators across all six options. Use it as a quick reference, but remember that the "best" choice depends on your scale, team, and existing infrastructure — not just raw features.
| Database | Open Source | Primary Language | Index Types | Max Scale | Filtering | Pricing Model | Ideal Use Case |
|---|---|---|---|---|---|---|---|
| Pinecone | No (proprietary) | N/A (managed service) | Proprietary (auto-tuned) | Billions (managed) | Metadata filters with $eq, $gt, $in, etc. | Serverless (pay-per-use) or pod-based | Teams wanting zero-ops vector search with fast time-to-production |
| Weaviate | Yes (BSD-3) | Go | HNSW, flat, BQ/PQ compression | Hundreds of millions | Where filters via GraphQL; hybrid BM25 + vector | Self-host free; WCS managed tiers | SaaS apps needing multi-tenancy, built-in vectorization, or hybrid search |
| Qdrant | Yes (Apache 2.0) | Rust | HNSW (default), scalar/product quantization | Hundreds of millions | Payload indexes (string, int, float, geo, datetime); efficient pre-filtering | Self-host free; Qdrant Cloud managed | Performance-critical apps needing rich filtering and low latency |
| Milvus | Yes (Apache 2.0) | C++ / Go | IVF_FLAT, IVF_SQ8, HNSW, DiskANN, GPU indexes | Billions (GPU-accelerated) | Boolean expressions on scalar fields | Self-host free; Zilliz Cloud managed | Billion-scale deployments: recommendation engines, large-scale image search |
| ChromaDB | Yes (Apache 2.0) | Python | HNSW (via hnswlib) | Low millions (single node) | Metadata where filters with $eq, $gt, $in, etc. | Free (open source) | Prototyping, Jupyter notebooks, small-to-medium RAG apps |
| pgvector | Yes (PostgreSQL License) | C | HNSW, IVF-Flat, exact (brute-force) | Low millions (single Postgres instance) | Full SQL WHERE clauses — any Postgres expression | Free extension; managed via cloud Postgres providers | Teams already on Postgres who want vector search without new infrastructure |
Public benchmarks (like ANN Benchmarks) test raw ANN performance on specific datasets at specific scales. Your real-world performance depends on dimensionality, dataset size, filtering complexity, update frequency, and hardware. Always run your own benchmarks with your actual data and query patterns before committing to a database at scale.
Powering RAG Pipelines with Vector Search
Large language models are powerful, but they hallucinate, their training data goes stale, and they know nothing about your proprietary documents. Fine-tuning on custom data is one solution — but it's expensive, slow to iterate, and the model still can't tell you which source it drew from. Retrieval-Augmented Generation (RAG) takes a different approach: at query time, you search a knowledge base for relevant context, stuff that context into the prompt, and let the LLM generate an answer grounded in real data.
Vector databases are the backbone of this pattern. They let you store document chunks as embeddings and retrieve the semantically closest ones in milliseconds — even across millions of records. The result is an LLM that can answer questions about your codebase, internal docs, or product catalog without ever having been trained on them.
The RAG Pipeline at a Glance
A RAG system has two distinct data paths. The offline ingestion path preprocesses your source documents into searchable embeddings. The online query path runs every time a user asks a question. Here's how the two paths interact:
sequenceDiagram
box rgb(45, 50, 60) Offline Ingestion
participant Docs as Source Documents
participant Chunker as Chunker
participant EmbModel as Embedding Model
end
box rgb(45, 50, 60) Online Query
participant User as User
participant App as Application
participant VDB as Vector DB
participant LLM as LLM
end
Note over Docs, EmbModel: Offline — run once or on schedule
Docs->>Chunker: Raw text (PDF, HTML, Markdown)
Chunker->>EmbModel: Text chunks + metadata
EmbModel->>VDB: Vectors + chunk text + metadata
Note over User, LLM: Online — every user query
User->>App: Ask a question
App->>EmbModel: Embed the question
EmbModel-->>App: Query vector
App->>VDB: ANN search (top-k)
VDB-->>App: Relevant chunks + scores
App->>LLM: Prompt = context chunks + question
LLM-->>App: Grounded answer
App-->>User: Response with sources
Phase 1 — Offline Ingestion
Before you can retrieve anything, you need to build the index. Ingestion is a batch process you run whenever your source material changes — on a schedule, triggered by a CI pipeline, or on document upload.
Split documents into chunks
Raw documents — PDFs, Notion pages, Confluence articles, code files — get split into smaller units. Each chunk becomes one row in your vector database. The chunking strategy you choose has an outsized effect on retrieval quality (more on this below).
Generate embeddings
Each chunk is passed through an embedding model that maps it to a dense vector, typically 384–4096 dimensions depending on the model. Popular choices include:
| Model | Provider | Dimensions | Notes |
|---|---|---|---|
| text-embedding-3-small | OpenAI | 1536 | Good quality-to-cost ratio; API-based |
| text-embedding-3-large | OpenAI | 3072 | Higher accuracy; supports dimension reduction via API param |
| BGE-large-en-v1.5 | BAAI (open-source) | 1024 | Strong MTEB benchmark scores; self-hostable |
| E5-mistral-7b-instruct | Microsoft (open-source) | 4096 | LLM-based embedder; excellent for long documents |
| nomic-embed-text-v1.5 | Nomic (open-source) | 768 | Fully open weights + training data; Matryoshka support |
Store in a vector database
You store three things per record: the embedding vector, the original chunk text (so you can inject it into prompts later), and metadata — the source URL, page number, section heading, or any filter you'll want at query time.
Phase 2 — Query Time Retrieval
When a user asks a question, the application embeds the question using the same model that was used during ingestion. This is critical — mixing models produces vectors in incompatible spaces, and similarity search returns garbage.
The query vector is sent to the vector database, which runs an Approximate Nearest Neighbor (ANN) search and returns the top-k most relevant chunks ranked by cosine similarity (or another distance metric). Typical values of k range from 3 to 10, depending on your context window budget.
The application then constructs a prompt that combines the retrieved chunks with the user's question and sends it to a generative LLM like GPT-4, Claude, or Llama 3.
The embedding model used at query time must match the one used during ingestion. If you switch models, you need to re-embed your entire corpus. This is the single most common "why is my RAG returning irrelevant results?" debugging issue.
Phase 3 — Grounded Generation
The LLM generates its answer using only (or primarily) the retrieved context. Because the relevant source text is right there in the prompt, the model is far less likely to hallucinate. You can further reduce hallucination by instructing the model to cite which chunk it drew from, or to say "I don't know" when the context doesn't contain the answer.
Why Chunking Strategy Matters
Chunking is the most under-discussed part of RAG, yet it has the biggest impact on answer quality. The goal is to create chunks that are semantically self-contained — each one should represent a coherent idea that stands on its own when retrieved out of context.
| Strategy | Chunk Size | Pros | Cons |
|---|---|---|---|
| Fixed-size token split | 256–512 tokens | Simple, predictable | Cuts mid-sentence; loses context |
| Recursive character split | 500–1000 chars | Respects paragraph boundaries | Uneven chunk sizes |
| Semantic chunking | Variable | Preserves meaning boundaries | Slower; requires embedding each sentence |
| Document-structure-aware | Section/heading-based | Natural divisions; great for docs | Sections can be too large or too small |
A good rule of thumb: start with recursive character splitting at ~500 tokens with 50-token overlap between chunks. Measure retrieval quality with a test set of questions, then iterate. Overlap ensures you don't lose information that falls at a boundary.
Too-large chunks dilute the signal — the LLM gets a wall of text where only one sentence is relevant. Too-small chunks lose context — you retrieve a sentence fragment that's meaningless without its surrounding paragraph. Always evaluate empirically on your specific data.
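The overlap mechanic is simple enough to sketch in a few lines of plain Python. This version is character-based; token-based splitters work the same way:

```python
def chunk_with_overlap(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size chunker with overlap: each chunk starts `overlap`
    characters before the previous chunk ended, so content at a boundary
    appears whole in at least one chunk."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```

This is a stand-in for smarter recursive or semantic splitters, which additionally try to place cut points at paragraph and sentence boundaries.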
Advanced RAG Patterns
Basic RAG — embed, retrieve, generate — works surprisingly well. But when you need higher accuracy or have tricky data, these patterns push quality further.
Re-ranking
ANN search is fast but approximate. A re-ranker takes the initial top-k results (say, 20) and re-scores them using a more expensive cross-encoder model that looks at the query-document pair jointly, not just their embeddings independently. You then take the top 5 from the re-ranked list. Cohere Rerank and open-source cross-encoders from Sentence Transformers are the most popular options. This two-stage approach (fast recall → precise re-rank) consistently improves answer quality.
Hybrid Search
Vector similarity excels at semantic matching ("How do I authenticate?" matches "login flow documentation"), but it can miss exact keyword matches that BM25 catches trivially (e.g., a specific error code like ERR_CONN_REFUSED). Hybrid search runs both a BM25 keyword search and a vector ANN search in parallel, then merges results using Reciprocal Rank Fusion (RRF). Many vector databases — Weaviate, Qdrant, Milvus — support hybrid search natively.
Multi-Query Retrieval
A single user question might be ambiguous or multi-faceted. Multi-query retrieval uses the LLM to generate 3–5 rephrased versions of the original question, runs each as a separate retrieval query, and unions the results. This improves recall by approaching the same information from multiple semantic angles.
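The union-with-dedup step can be sketched as follows; the `rephrase` and `search` callables are placeholders standing in for the LLM call and the vector DB query:

```python
def multi_query_retrieve(question, rephrase, search, k=5):
    """Run the original question plus LLM-generated rephrasings,
    then union the per-query top-k hits, preserving first-seen order."""
    seen, merged = set(), []
    for q in [question] + rephrase(question):
        for doc_id in search(q, k):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged
```

Deduplicating by document ID matters because the rephrasings deliberately overlap; without it the same chunk would crowd out novel results in the context window.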
Parent-Document Retrieval
You embed small chunks for precise matching but store a reference to the full parent document (or a larger enclosing section). When a small chunk matches, you retrieve the parent for full context before sending it to the LLM. This gives you the best of both worlds: precise retrieval with rich context.
End-to-End RAG Code Outline
Here's a minimal but complete RAG pipeline using LangChain and ChromaDB. This covers ingestion, retrieval, and generation in under 40 lines.
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
# --- Phase 1: Offline Ingestion ---
loader = TextLoader("docs/product-guide.md")
documents = loader.load()
splitter = RecursiveCharacterTextSplitter(
chunk_size=500, chunk_overlap=50, separators=["\n\n", "\n", ". ", " "]
)
chunks = splitter.split_documents(documents)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
chunks, embedding=embeddings, persist_directory="./chroma_db"
)
# --- Phase 2 & 3: Query-Time Retrieval + Generation ---
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
llm = ChatOpenAI(model="gpt-4o", temperature=0)
prompt = ChatPromptTemplate.from_template("""
Answer the question based only on the following context.
If the context doesn't contain the answer, say "I don't know."
Context:
{context}
Question: {question}
""")
def format_docs(docs):
return "\n\n".join(doc.page_content for doc in docs)
chain = (
{"context": retriever | format_docs, "question": RunnablePassthrough()}
| prompt
| llm
)
response = chain.invoke("How do I reset a user's password?")
print(response.content)
The ingestion phase loads a Markdown file, splits it into ~500-character chunks with 50-character overlap, embeds each chunk, and stores everything in a local ChromaDB instance. The query phase creates a retriever that pulls 5 chunks per query, formats them into a prompt template, and pipes the result to GPT-4o. The entire chain runs with a single .invoke() call.
To add Cohere re-ranking to this pipeline, swap the retriever with ContextualCompressionRetriever wrapping a CohereRerank compressor. Retrieve 20 candidates, re-rank to 5 — you'll often see a measurable jump in answer relevance with no other changes.
AI/ML Use Cases Beyond RAG: Semantic Search, Recommendations, and More
RAG gets most of the attention, but it represents just one pattern in a much broader landscape. Vector databases are general-purpose engines for any task that reduces to "find the nearest things in embedding space." Once you internalize that framing, use cases appear everywhere — search, recommendations, anomaly detection, deduplication, and multimodal retrieval all follow the same core mechanic.
Let's walk through the major categories, with concrete examples of how each one works in practice.
mindmap
root((Vector DB Use Cases))
RAG / LLM Context
Semantic Search
E-commerce
Documentation
Support tickets
Recommendations
Content-based
Collaborative filtering
Image / Multimodal Search
CLIP embeddings
Reverse image search
Anomaly Detection
Fraud detection
Cybersecurity
De-duplication
Near-duplicate docs
Duplicate images
Semantic Search
Traditional keyword search (BM25, TF-IDF) matches on literal token overlap. If a user searches for "comfortable shoes for hiking," they won't find a product described as "lightweight trail boots with cushioned insoles" — none of the query words appear in the product description. Semantic search closes this gap by comparing meaning, not words.
The approach is straightforward: embed every document (product listing, help article, support ticket) into a vector, store those vectors in a vector database, then embed the query at search time and retrieve the nearest neighbors. The embedding model handles the semantic heavy-lifting — it knows that "comfortable hiking shoes" and "cushioned trail boots" live in the same neighborhood of vector space.
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct
model = SentenceTransformer("all-MiniLM-L6-v2")
client = QdrantClient(":memory:")
# Create a collection for product embeddings
client.create_collection(
collection_name="products",
vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)
# Product catalog — note: no keyword overlap with the query below
products = [
{"id": 1, "text": "Lightweight trail boots with cushioned insoles"},
{"id": 2, "text": "Waterproof Gore-Tex hiking sneakers"},
{"id": 3, "text": "Formal leather dress shoes for office wear"},
]
# Embed and upsert products
points = [
PointStruct(id=p["id"], vector=model.encode(p["text"]).tolist(),
payload={"description": p["text"]})
for p in products
]
client.upsert(collection_name="products", points=points)
# Semantic query — matches meaning, not keywords
query = "comfortable shoes for hiking"
results = client.query_points(
collection_name="products",
query=model.encode(query).tolist(),
limit=2,
)
for r in results.points:
print(f"{r.score:.3f} {r.payload['description']}")
# 0.725 Lightweight trail boots with cushioned insoles
# 0.648 Waterproof Gore-Tex hiking sneakers
Notice that "comfortable shoes for hiking" matched "lightweight trail boots with cushioned insoles" despite zero keyword overlap. This is exactly the gap that BM25 cannot bridge. Semantic search is now used across e-commerce product discovery, internal documentation search, and customer support ticket routing.
Recommendation Systems
Recommendations and vector search are natural partners. The core idea: represent users and items as vectors in the same embedding space, so that "what should this user see next?" becomes a nearest-neighbor query. This works with multiple flavors of embeddings:
- Content-based embeddings — embed items by their features (text descriptions, genres, tags). Recommend items that are close to what the user already liked.
- Collaborative filtering embeddings — learn user and item vectors from interaction data (clicks, purchases, ratings) using matrix factorization or neural collaborative filtering. Users who behave similarly end up in similar regions.
- Two-tower models — a user encoder and an item encoder are trained jointly so that matching user–item pairs produce high cosine similarity. At serving time, you encode the user, then do ANN search over pre-encoded item vectors.
This pattern is deployed at scale. Spotify embeds tracks and playlists into a shared vector space to power "Discover Weekly" recommendations. Pinterest uses two-tower models for pin recommendations. E-commerce platforms like Etsy and Amazon use item embeddings for "similar products" and "customers also bought" sections — all backed by approximate nearest-neighbor retrieval from vector indices.
With millions of items, brute-force similarity search is too slow for real-time serving. Vector databases provide sub-millisecond ANN retrieval — you trade a small amount of recall for orders-of-magnitude speed gains, which is the right tradeoff for recommendation latency budgets.
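At serving time the two-tower pattern reduces to one user encode plus one nearest-neighbor query. A toy sketch with hard-coded 2-D vectors; real systems use learned encoders and a vector database for the item side:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Pre-encoded item vectors: produced offline by the item tower and,
# in production, stored in the vector DB (hypothetical toy values here).
items = {"item_a": [1.0, 0.0], "item_b": [0.9, 0.1], "item_c": [0.0, 1.0]}

def recommend(user_vec, k=2):
    """Serving step: encode the user once (input here), then rank
    pre-encoded items by cosine similarity and return the top-k."""
    ranked = sorted(items, key=lambda i: cosine(user_vec, items[i]), reverse=True)
    return ranked[:k]
```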
Image and Multimodal Search
Models like OpenAI's CLIP, Google's SigLIP, and Meta's ImageBind embed images and text into a shared vector space. This means the vector for the text "a golden retriever playing in snow" is close to the vector for a photo of exactly that scene. This unlocks powerful cross-modal retrieval:
- Text-to-image search — type a natural language description, retrieve matching images. No manual tagging required.
- Reverse image search — upload a photo, find visually similar products or images in your catalog.
- Image-to-text matching — given an image, find the most relevant text descriptions or captions.
The workflow mirrors text-based semantic search: pre-embed your image catalog using the vision encoder, store the vectors, then at query time embed the input (text or image) with the corresponding encoder and search. Fashion e-commerce uses this heavily — a user can photograph a jacket they like and find similar items for sale, or type "floral summer dress with pockets" and get visually relevant results without relying on product metadata.
Anomaly Detection
If you embed "normal" behavior into vectors, then anomalies reveal themselves as points far from any cluster. The intuition is simple: the embedding model learns a compressed representation of typical patterns — typical network traffic, normal transaction profiles, expected sensor readings. When a new data point lands in an empty region of vector space, distant from all known centroids, it's flagged for review.
In practice, you embed a corpus of known-good behavior, then for each new observation, query the vector database for the k nearest neighbors and measure distance. If the average distance exceeds a threshold, the observation is anomalous. This is used in:
- Fraud detection — embed transaction features (amount, merchant category, time, location). Fraudulent transactions tend to be distant from a user's historical cluster.
- Cybersecurity — embed network flow logs or system call sequences. Novel attack patterns appear as outliers in vector space.
- Manufacturing QA — embed sensor readings from normal production runs. Defective-run patterns deviate measurably.
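That k-nearest-neighbor distance check is easy to sketch. This toy version uses NumPy brute force in place of a vector database query, and the feature vectors are synthetic placeholders rather than real embeddings:

```python
import numpy as np

rng = np.random.default_rng(42)

# Embeddings of known-good behavior (synthetic stand-ins for real feature vectors).
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 8))

def anomaly_score(point, reference, k=5):
    """Mean distance to the k nearest reference points — higher means more anomalous."""
    dists = np.linalg.norm(reference - point, axis=1)
    return float(np.sort(dists)[:k].mean())

typical = rng.normal(size=8)     # drawn from the same distribution as the corpus
outlier = np.full(8, 10.0)       # lands far from every known cluster

print(anomaly_score(typical, normal))   # small: close to known-good points
print(anomaly_score(outlier, normal))   # large: flag for review
```

In a real deployment, the `reference` corpus lives in the vector database and `anomaly_score` becomes a single top-k query plus a mean over the returned distances.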
De-duplication and Clustering
Finding exact duplicates is trivial (hash comparison). Finding near duplicates — an article republished with minor edits, a product image with a different background, a customer record with a misspelled name — requires semantic similarity. Embed each record, then flag pairs whose cosine similarity exceeds a threshold (e.g., 0.95).
Vector databases make this efficient even at scale. Instead of comparing every pair (O(n²)), you query each vector's nearest neighbors above the threshold, reducing the problem to O(n × k). Common applications include deduplicating web crawl data before training LLMs, merging customer records in CRM systems, and identifying reposted content on media platforms.
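The thresholding logic itself is a few lines. Note that this toy example compares all pairs — the O(n²) approach the text warns about — which a vector database replaces with per-item nearest-neighbor queries; the embeddings below are hand-picked toy vectors, not real model output:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-picked toy embeddings; real ones come from an embedding model.
vecs = {
    "article_a":  np.array([0.90, 0.10, 0.00]),
    "article_a2": np.array([0.89, 0.12, 0.01]),   # near duplicate of article_a
    "article_b":  np.array([0.00, 0.20, 0.95]),
}

threshold = 0.95
ids = list(vecs)
dupes = [
    (ids[i], ids[j])
    for i in range(len(ids))
    for j in range(i + 1, len(ids))
    if cosine(vecs[ids[i]], vecs[ids[j]]) > threshold
]
print(dupes)  # [('article_a', 'article_a2')]
```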
| Use Case | Input Embeddings | Query Pattern | Real-World Example |
|---|---|---|---|
| Semantic Search | Text (docs, products) | Query → top-k nearest | E-commerce product discovery |
| Recommendations | User + item vectors | User → nearest items | Spotify Discover Weekly |
| Multimodal Search | CLIP image + text | Text → nearest images | Reverse image search |
| Anomaly Detection | Behavior features | Point → distance from cluster | Fraud detection systems |
| De-duplication | Any content type | Each → neighbors above threshold | Web crawl dedup for LLM training |
Across every use case listed above, the quality of your results is bounded by the quality of your embeddings. A vector database retrieves nearest neighbors faithfully — but if your embedding model doesn't place semantically related items near each other, no amount of index tuning will fix it. Invest in evaluating and fine-tuning your embedding model before optimizing your vector DB configuration.
Hands-On: Building Semantic Search with Python
Every vector database follows the same core loop: embed → store → query → rank. The syntax changes between providers, but the pattern never does. In this section, you'll build a working semantic search system four different ways — starting simple and adding control as you go.
Whether you use ChromaDB, Pinecone, pgvector, Weaviate, or Qdrant, you always: (1) turn text into vectors, (2) store those vectors with metadata, (3) turn your query into a vector, and (4) find the nearest stored vectors. Internalize this pattern and every new vector DB becomes a syntax exercise.
1. ChromaDB — Zero-Config Quickstart
ChromaDB is the fastest path from zero to semantic search. It runs embedded (no server process), handles embedding automatically using its built-in default model (all-MiniLM-L6-v2), and stores everything locally. You don't even need to think about vectors — just pass strings.
Install
pip install chromadb
Complete Example
import chromadb
# Create an in-memory client (no server needed)
client = chromadb.Client()
# Create a collection — ChromaDB auto-embeds with all-MiniLM-L6-v2
collection = client.create_collection(name="articles")
# Add documents — just pass strings and IDs
collection.add(
documents=[
"Python is a versatile programming language used in web development and data science.",
"Machine learning models require large datasets for effective training.",
"PostgreSQL supports advanced indexing strategies for fast query performance.",
"Neural networks are inspired by the structure of the human brain.",
"Docker containers package applications with their dependencies for consistent deployment.",
],
ids=["doc1", "doc2", "doc3", "doc4", "doc5"],
)
# Query with natural language — ChromaDB embeds the query automatically
results = collection.query(
query_texts=["How do neural networks learn?"],
n_results=3,
)
# Inspect results
for i, (doc, distance) in enumerate(zip(results["documents"][0], results["distances"][0])):
print(f"#{i+1} (distance: {distance:.4f}): {doc[:80]}...")
That's it — about a dozen lines of code. ChromaDB embeds both documents and queries behind the scenes, computes distances (L2 by default; pass metadata={"hnsw:space": "cosine"} at collection creation to switch), and returns the closest matches. The output looks something like:
#1 (distance: 0.8531): Neural networks are inspired by the structure of the human brain....
#2 (distance: 1.0632): Machine learning models require large datasets for effective training....
#3 (distance: 1.3781): Python is a versatile programming language used in web development and...
Lower distance means higher similarity. The query "How do neural networks learn?" correctly surfaces the neural network document first, the ML training document second, and a loosely related programming doc third.
2. ChromaDB with Explicit Embeddings & Metadata Filtering
The auto-embedding convenience is great for prototyping, but production systems need more control. You may want a specific embedding model, custom metadata for filtering, or pre-computed embeddings from an external service. Here's how to take full control while still using ChromaDB as the vector store.
Install Dependencies
pip install chromadb sentence-transformers
Complete Example with Manual Embeddings
import chromadb
from sentence_transformers import SentenceTransformer
# Load a specific embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")
# Documents with rich metadata
documents = [
{"text": "RAG combines retrieval systems with generative models for grounded answers.",
"meta": {"source": "blog", "year": 2024, "category": "ai"}},
{"text": "Kubernetes orchestrates containerized applications across clusters.",
"meta": {"source": "docs", "year": 2023, "category": "devops"}},
{"text": "Fine-tuning LLMs on domain data improves task-specific performance.",
"meta": {"source": "paper", "year": 2024, "category": "ai"}},
{"text": "Terraform enables infrastructure as code for cloud provisioning.",
"meta": {"source": "docs", "year": 2023, "category": "devops"}},
{"text": "Vector databases enable fast approximate nearest neighbor search at scale.",
"meta": {"source": "blog", "year": 2024, "category": "ai"}},
]
# Generate embeddings manually
texts = [d["text"] for d in documents]
embeddings = model.encode(texts).tolist()
# Store in ChromaDB — pass embeddings directly, no auto-embedding
client = chromadb.Client()
collection = client.create_collection(
name="knowledge_base",
metadata={"hnsw:space": "cosine"}, # explicit distance metric
)
collection.add(
embeddings=embeddings,
documents=texts,
metadatas=[d["meta"] for d in documents],
ids=[f"doc{i}" for i in range(len(documents))],
)
# Query: "retrieval augmented generation" — but only 2024 AI articles
query = "retrieval augmented generation"
query_embedding = model.encode([query]).tolist()
results = collection.query(
query_embeddings=query_embedding,
n_results=3,
where={"$and": [{"year": {"$eq": 2024}}, {"category": {"$eq": "ai"}}]},
)
for doc, meta, dist in zip(
results["documents"][0], results["metadatas"][0], results["distances"][0]
):
print(f"[{dist:.4f}] ({meta['source']}, {meta['year']}) {doc[:70]}...")
The where clause filters documents before the similarity ranking. This is critical for multi-tenant systems, date-scoped search, or any scenario where you need to combine semantic similarity with structured constraints. ChromaDB supports $eq, $ne, $gt, $gte, $lt, $lte, $in, and logical operators $and / $or.
3. Pinecone — Managed Cloud Vector Search
Pinecone is a fully managed vector database — you don't run any infrastructure. It's designed for production workloads where you want SLA-backed uptime, auto-scaling, and zero operational overhead. The trade-off is vendor lock-in and cost at scale. The API is straightforward: create an index, upsert vectors, query.
Install
pip install pinecone sentence-transformers
Complete Example
from pinecone import Pinecone, ServerlessSpec
from sentence_transformers import SentenceTransformer
import time
# Initialize client
pc = Pinecone(api_key="YOUR_API_KEY")
# Create a serverless index (384 dims = all-MiniLM-L6-v2 output size)
index_name = "semantic-search-demo"
if index_name not in pc.list_indexes().names():
pc.create_index(
name=index_name,
dimension=384,
metric="cosine",
spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)
time.sleep(10) # wait for index to be ready
index = pc.Index(index_name)
# Prepare data
model = SentenceTransformer("all-MiniLM-L6-v2")
documents = [
{"id": "doc1", "text": "RAG combines retrieval with generation for grounded answers.",
"source": "blog", "year": 2024},
{"id": "doc2", "text": "Kubernetes orchestrates containers across clusters.",
"source": "docs", "year": 2023},
{"id": "doc3", "text": "Fine-tuning LLMs improves task-specific performance.",
"source": "paper", "year": 2024},
]
# Upsert vectors with metadata
vectors = []
for doc in documents:
embedding = model.encode(doc["text"]).tolist()
vectors.append({
"id": doc["id"],
"values": embedding,
"metadata": {"text": doc["text"], "source": doc["source"], "year": doc["year"]},
})
index.upsert(vectors=vectors)
# Query with metadata filter — only 2024 documents
query_embedding = model.encode("How does retrieval augmented generation work?").tolist()
results = index.query(
vector=query_embedding,
top_k=3,
include_metadata=True,
filter={"year": {"$eq": 2024}},
)
for match in results["matches"]:
print(f"[{match['score']:.4f}] {match['metadata']['text'][:70]}...")
Key differences from ChromaDB: Pinecone requires you to specify the vector dimension upfront, returns a score (similarity, higher is better) rather than a distance (lower is better), and uses top_k instead of n_results. The metadata filter syntax is nearly identical.
4. pgvector — Vector Search in PostgreSQL
If you already run PostgreSQL, pgvector lets you add vector search without introducing a new database. Your embeddings live alongside your relational data, you get full SQL for filtering, and you keep your existing backup/replication infrastructure. This is a compelling choice when you don't need billion-scale vector search but do need transactional consistency.
Setup
# Install the pgvector extension (varies by OS)
# Ubuntu/Debian: sudo apt install postgresql-16-pgvector
# macOS (Homebrew): brew install pgvector
pip install psycopg2-binary sentence-transformers
Complete Example
import psycopg2
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
conn = psycopg2.connect("dbname=vectordb user=postgres host=localhost")
cur = conn.cursor()
# Enable the vector extension and create table
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
CREATE TABLE IF NOT EXISTS articles (
id SERIAL PRIMARY KEY,
content TEXT NOT NULL,
source VARCHAR(50),
year INTEGER,
embedding vector(384) -- 384 dims for all-MiniLM-L6-v2
);
""")
conn.commit()
# Insert documents with embeddings
documents = [
("RAG combines retrieval with generation for grounded answers.", "blog", 2024),
("Kubernetes orchestrates containers across clusters.", "docs", 2023),
("Fine-tuning LLMs improves task-specific performance.", "paper", 2024),
("Terraform enables infrastructure as code for cloud provisioning.", "docs", 2023),
("Vector databases enable fast approximate nearest neighbor search.", "blog", 2024),
]
for content, source, year in documents:
embedding = model.encode(content).tolist()
cur.execute(
"INSERT INTO articles (content, source, year, embedding) VALUES (%s, %s, %s, %s::vector)",
(content, source, year, str(embedding)),
)
conn.commit()
# Create an IVFFlat index for faster queries at scale
cur.execute("""
CREATE INDEX IF NOT EXISTS articles_embedding_idx
ON articles USING ivfflat (embedding vector_cosine_ops) WITH (lists = 5);
""")
conn.commit()
# Query: find similar docs from 2024 using cosine distance (<=>)
query = "How does retrieval augmented generation work?"
query_embedding = model.encode(query).tolist()
cur.execute("""
SELECT content, source, year, embedding <=> %s::vector AS distance
FROM articles
WHERE year = 2024
ORDER BY distance
LIMIT 3;
""", (str(query_embedding),))
for content, source, year, distance in cur.fetchall():
print(f"[{distance:.4f}] ({source}, {year}) {content[:70]}...")
cur.close()
conn.close()
The <=> operator computes cosine distance. pgvector also supports <-> for L2 (Euclidean) distance and <#> for negative inner product. Because everything is SQL, you combine vector search with WHERE, JOIN, GROUP BY, and any other PostgreSQL feature you already know.
API Ergonomics Compared
The table below highlights how the same semantic search operations map across all three providers. The concepts are identical — the syntax is what differs.
| Operation | ChromaDB | Pinecone | pgvector |
|---|---|---|---|
| Initialize | chromadb.Client() | Pinecone(api_key=...) | psycopg2.connect(...) |
| Create store | create_collection() | create_index(dimension, metric) | CREATE TABLE ... vector(384) |
| Insert vectors | collection.add() | index.upsert() | INSERT INTO ... VALUES |
| Query | collection.query(n_results=3) | index.query(top_k=3) | ORDER BY <=> LIMIT 3 |
| Metadata filter | where={"year": {"$eq": 2024}} | filter={"year": {"$eq": 2024}} | WHERE year = 2024 |
| Result ranking | Distance (lower = better) | Score (higher = better) | Distance (lower = better) |
| Auto-embedding | Yes (built-in) | No | No |
| Infrastructure | Embedded / self-hosted | Fully managed cloud | Your PostgreSQL instance |
Prototyping or learning? Start with ChromaDB — zero config, instant results. Already running Postgres? pgvector keeps your stack simple. Need managed production infrastructure? Pinecone (or similar managed services like Weaviate Cloud or Qdrant Cloud) removes operational burden. You can always migrate later — the embed → store → query → rank pattern stays the same.
The Pattern That Never Changes
Strip away the provider-specific syntax and every example above follows the exact same four steps:
1. Embed — turn text into vectors. Use a model like all-MiniLM-L6-v2 (or let ChromaDB do it for you) to convert documents and queries into fixed-length numerical arrays. The embedding model determines what "similar" means — choose it carefully.
2. Store — persist vectors with metadata. Insert your embeddings into the vector store alongside any metadata you'll need for filtering later (dates, categories, user IDs, source URLs). This is your only chance to attach structured data — do it at insert time.
3. Query — embed the question, search the space. Embed the user's query with the same model used for documents (this is non-negotiable — mismatched models produce meaningless distances). Send the query vector to the store and request the top-K nearest neighbors.
4. Rank — return results by similarity. The vector store returns results ordered by distance or score. In a RAG pipeline, these results become the context passed to the LLM. In a search UI, they become the ranked result list your users see.
Once this mental model clicks, you'll find that learning a new vector database takes minutes, not hours. The hard problems — choosing the right embedding model, chunking strategies, and index tuning — live outside the API layer entirely.
Best Practices, Performance, and Production Considerations
Moving from a working prototype to a production vector search system involves a series of decisions that compound on each other. The quality of your retrieval depends less on which database you pick and more on how you chunk your documents, which embedding model you use, and how you tune the search parameters. This section covers each of those decisions with concrete guidance.
Chunking Strategies
Chunking is the process of splitting source documents into smaller pieces before embedding. The chunk size directly affects retrieval quality: too large and you dilute the semantic signal with irrelevant context; too small and you lose the context needed to answer a question. There is no universally correct chunk size — but there are well-understood trade-offs.
| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Fixed-size with overlap | Split every N tokens, overlap by M tokens | Simple, predictable chunk count, easy to reason about | Splits mid-sentence/mid-paragraph, ignores document structure |
| Recursive / semantic splitting | Split on headings, paragraphs, then sentences — recursively until under size limit | Respects document structure, keeps related content together | Variable chunk sizes, more complex to implement |
| Sentence-level | Each sentence (or 2-3 sentences) is its own chunk | Fine-grained retrieval, precise matching | Very noisy results, loses broader context, high vector count |
In practice, fixed-size chunking with overlap is the best starting point for most use cases. It is deterministic, easy to debug, and performs surprisingly well. Recursive splitting is worth the complexity when your documents have clear structure (technical docs, legal contracts, research papers).
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Start here: 512 tokens ≈ 2000 chars, 10-20% overlap
splitter = RecursiveCharacterTextSplitter(
chunk_size=2000,
chunk_overlap=200, # ~10% overlap
separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks from {len(documents)} documents")
For most RAG applications, 256–512 token chunks with 10–20% overlap provide the best balance of precision and context. Benchmark with your actual queries before going smaller or larger — chunk size changes can swing recall by 10–15%.
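If you want the fixed-size strategy without a dependency, a character-based chunker with overlap is only a few lines. This is a sketch — production splitters like the one above also try to break on separators rather than mid-word:

```python
def chunk_text(text, chunk_size=2000, overlap=200):
    """Split text into fixed-size chunks where consecutive chunks share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "x" * 5000
chunks = chunk_text(doc)
print(len(chunks), [len(c) for c in chunks])  # 3 [2000, 2000, 1400]
```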
Embedding Model Selection
Your embedding model determines the quality ceiling of your entire retrieval pipeline. No amount of index tuning can compensate for embeddings that don't capture the semantics your queries need. Here are the models worth considering in 2024–2025:
| Model | Dimensions | Cost | Notes |
|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | $0.02 / 1M tokens | Best cost/quality ratio for most teams. Supports Matryoshka dimension reduction. |
| OpenAI text-embedding-3-large | 3072 | $0.13 / 1M tokens | Highest quality from OpenAI. Use when retrieval quality is critical and cost is secondary. |
| BGE-large-en-v1.5 | 1024 | Free (self-hosted) | Open-source, strong MTEB scores. Good default for self-hosted pipelines. |
| E5-mistral-7b-instruct | 4096 | Free (self-hosted) | LLM-based embeddings. Top-tier quality but requires GPU for inference. |
| nomic-embed-text-v1.5 | 768 | Free (self-hosted) | Competitive quality, small footprint. Supports Matryoshka dimensions. Runs on CPU. |
Use the MTEB Leaderboard (Massive Text Embedding Benchmark) to compare models across retrieval, classification, and clustering tasks. Filter by your specific use case — a model that tops "Retrieval" may underperform on "STS" (semantic textual similarity) and vice versa.
from openai import OpenAI
client = OpenAI()
# Embed with dimension reduction (Matryoshka)
response = client.embeddings.create(
model="text-embedding-3-small",
input="How does HNSW indexing work?",
dimensions=512 # reduce from 1536 → 512 with minimal quality loss
)
embedding = response.data[0].embedding
print(f"Dimensions: {len(embedding)}") # 512
Metadata Filtering
Raw vector similarity search is rarely enough in production. You almost always need to filter results by metadata — restricting search to a specific user's documents, a date range, or a document category. Store metadata alongside every vector at insert time, not as an afterthought.
from qdrant_client import QdrantClient, models
client = QdrantClient(url="http://localhost:6333")
# Upsert vectors with rich metadata
client.upsert(
collection_name="documents",
points=[
models.PointStruct(
id=1,
vector=embedding,
payload={
"source": "engineering-handbook.pdf",
"category": "infrastructure",
"user_id": "tenant_42",
"created_at": "2024-11-15T10:30:00Z",
"chunk_index": 7,
},
)
],
)
# Search with metadata filter — enables multi-tenancy
results = client.query_points(
collection_name="documents",
query=query_embedding,
query_filter=models.Filter(
must=[
models.FieldCondition(
key="user_id",
match=models.MatchValue(value="tenant_42"),
),
models.FieldCondition(
key="category",
match=models.MatchValue(value="infrastructure"),
),
]
),
limit=5,
)
Useful metadata fields to always consider: source (file name or URL), created_at or updated_at (for recency filtering), category or doc_type (for scoping search), user_id or org_id (for multi-tenancy), and chunk_index (for reconstructing surrounding context after retrieval).
Hybrid Search: Dense + Sparse
Pure vector search excels at finding semantically similar content, but it struggles with exact keyword matches — especially proper nouns, product names, error codes, or domain-specific acronyms. A query for "CUDA error 11" might retrieve generic GPU troubleshooting content instead of the exact error. Hybrid search solves this by combining dense vector search with sparse keyword search (BM25).
The idea is straightforward: run both a vector similarity search and a BM25 keyword search, then merge the results using Reciprocal Rank Fusion (RRF) or a learned re-ranker. Several databases support this natively — you don't need to build the merging logic yourself.
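If your database doesn't fuse results natively, RRF itself is tiny. A sketch with hypothetical document IDs and the commonly used constant k=60:

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: score(doc) = sum of 1 / (k + rank) across result lists."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits  = ["doc3", "doc1", "doc7"]   # dense similarity results
keyword_hits = ["doc7", "doc3", "doc9"]   # BM25 results

print(rrf([vector_hits, keyword_hits]))   # ['doc3', 'doc7', 'doc1', 'doc9']
```

Documents ranked highly by both retrievers (doc3, doc7) float to the top, without ever needing to reconcile the incomparable raw scores of cosine similarity and BM25.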
# Weaviate hybrid search — combines BM25 + vector in one query
import weaviate
client = weaviate.connect_to_local()
collection = client.collections.get("Document")
results = collection.query.hybrid(
query="CUDA error 11 out of memory",
alpha=0.5, # 0 = pure keyword, 1 = pure vector, 0.5 = equal blend
limit=10,
return_metadata=weaviate.classes.query.MetadataQuery(score=True),
)
for obj in results.objects:
print(f"{obj.properties['title']} — score: {obj.metadata.score:.3f}")
Hybrid search provides the biggest lift when your queries contain specific identifiers — error codes, product SKUs, person names, API endpoints. If your queries are mostly natural-language questions, pure dense retrieval often performs comparably. Test both with your real query distribution before adding complexity.
Performance Tuning: Recall vs. Latency
Approximate Nearest Neighbor (ANN) indexes trade a small amount of recall for dramatic speed improvements. The key is understanding which knobs to turn and what they cost you.
HNSW Tuning
HNSW is the most widely used ANN index. Its two critical parameters are ef_construction (build-time quality — set once) and ef_search (query-time quality — tune per-query). Higher ef_search explores more of the graph, improving recall but increasing latency.
# Qdrant: configure HNSW parameters
client.create_collection(
collection_name="production_docs",
vectors_config=models.VectorParams(
size=1536,
distance=models.Distance.COSINE,
),
hnsw_config=models.HnswConfigDiff(
m=16, # connections per node (default 16, increase for higher recall)
ef_construct=200, # build quality (higher = better graph, slower build)
),
# Quantization: reduce memory 4x with scalar quantization
quantization_config=models.ScalarQuantization(
scalar=models.ScalarQuantizationConfig(
type=models.ScalarType.INT8,
always_ram=True,
),
),
)
# At query time, control recall/speed tradeoff
results = client.query_points(
collection_name="production_docs",
query=query_embedding,
search_params=models.SearchParams(
hnsw_ef=128, # higher = better recall, slower (default 64-128)
exact=False, # set True for brute-force when recall matters most
),
limit=10,
)
Quantization
Quantization compresses vectors to reduce memory usage — roughly 4× for scalar quantization and up to ~32× for binary — with varying recall impact. Three approaches exist, each with different trade-offs:
| Type | Memory Reduction | Recall Impact | Best For |
|---|---|---|---|
| Scalar (INT8) | ~4× | Very low (1–2%) | Default choice — easy win with negligible quality loss |
| Binary | ~32× | Moderate (5–10%) | Massive datasets where memory is the primary constraint |
| Product (PQ) | ~8–16× | Low–moderate (2–5%) | Large-scale systems; use with re-scoring for best results |
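The scalar (INT8) scheme is simple enough to sketch from scratch. This illustrative version min-max scales each vector independently into 256 levels — one of several possible calibration choices, not the exact scheme any particular database uses:

```python
import numpy as np

def quantize_int8(vec):
    """Min-max scale a float vector into 256 uint8 levels; keep params for reconstruction."""
    lo, hi = float(vec.min()), float(vec.max())
    scale = (hi - lo) / 255.0
    codes = np.round((vec - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    return codes.astype(np.float32) * scale + lo

rng = np.random.default_rng(1)
v = rng.normal(size=1536).astype(np.float32)
codes, lo, scale = quantize_int8(v)

print(v.nbytes, "->", codes.nbytes)   # 6144 -> 1536 bytes: the 4x reduction
print(float(np.abs(dequantize(codes, lo, scale) - v).max()))  # small per-value error
```

The reconstruction error is bounded by half a quantization step, which is why the recall impact in the table above is so small.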
Matryoshka Embeddings
OpenAI's text-embedding-3-* models and some open-source models (like Nomic) support Matryoshka representation learning. This means you can truncate the embedding dimensions (e.g., 1536 → 512 or 256) and still retain most of the retrieval quality. Fewer dimensions means less memory, smaller indexes, and faster distance calculations. Test your specific workload, but truncating to 512 dimensions typically retains 95%+ of the original recall.
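Mechanically, truncation is just a slice plus re-normalization (so cosine math still behaves). The sketch below uses a random vector to show the mechanics only — the quality-retention property holds for models actually trained with Matryoshka objectives, not for arbitrary embeddings:

```python
import numpy as np

def truncate_embedding(vec, dims):
    """Keep the first `dims` dimensions and re-normalize to unit length."""
    v = np.asarray(vec, dtype=np.float32)[:dims]
    return v / np.linalg.norm(v)

full = np.random.default_rng(7).normal(size=1536)   # stand-in for a real embedding
short = truncate_embedding(full, 512)
print(short.shape, round(float(np.linalg.norm(short)), 4))  # (512,) 1.0
```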
Batch Upserts
Never insert vectors one at a time. Batch your upserts in groups of 100–500 to amortize network and indexing overhead. Most client libraries support this directly:
import itertools
def batched(iterable, n):
it = iter(iterable)
while batch := list(itertools.islice(it, n)):
yield batch
# Batch upsert — 200 points per request
for batch in batched(all_points, 200):
client.upsert(
collection_name="production_docs",
points=batch,
)
print(f"Upserted {len(batch)} points")
Scaling: When to Grow and How
The right infrastructure depends on your dataset size, query volume, and latency requirements. The most common mistake is over-engineering early — a single-node setup handles more than most teams expect.
graph TD
START([Start]) --> PG{"Already on
PostgreSQL?"}
PG -->|Yes| PGVEC["Consider pgvector first
Simplest ops, good to ~1M vectors"]
PG -->|No| SIZE{"Dataset size?"}
PGVEC --> SIZE
SIZE -->|"Under 100K vectors"| EMB["Embedded
ChromaDB / FAISS
In-process, zero ops"]
SIZE -->|"100K – 10M vectors"| SINGLE["Single-node
Qdrant / Weaviate / pgvector
~$50-200/mo on cloud"]
SIZE -->|"10M – 1B vectors"| DIST["Distributed / Managed
Milvus cluster / Pinecone
Sharding + replication"]
SIZE -->|"Over 1B vectors"| MEGA["Distributed + Quantization
+ DiskANN / on-disk indexes
Careful capacity planning required"]
EMB --> QPS{"High QPS or
HA needed?"}
QPS -->|Yes| SINGLE
QPS -->|No| DONE([Ship it])
SINGLE --> GROWTH{"Outgrowing
single node?"}
GROWTH -->|Yes| DIST
GROWTH -->|No| DONE
DIST --> DONE
MEGA --> DONE
style START fill:#4a9eff,stroke:#2d7cd6,color:#fff
style DONE fill:#22c55e,stroke:#16a34a,color:#fff
style EMB fill:#f0f9ff,stroke:#4a9eff
style SINGLE fill:#f0f9ff,stroke:#4a9eff
style DIST fill:#fef9f0,stroke:#f59e0b
style MEGA fill:#fef2f2,stroke:#ef4444
style PGVEC fill:#f0fdf4,stroke:#22c55e
Key Scaling Concepts
- Sharding — Splits your index across multiple nodes. Each node holds a subset of vectors and searches its portion in parallel. Required when your data no longer fits in a single machine's RAM. Milvus, Qdrant (cluster mode), and Weaviate support automatic sharding.
- Replication — Copies of each shard on multiple nodes. Improves read throughput (queries can hit any replica) and provides fault tolerance. Add replicas when read QPS exceeds what one node can handle, not before.
- Memory vs. disk — HNSW indexes must fit in RAM for fast search. For a 10M-vector collection with 1536-dimensional float32 vectors, the raw data alone is ~57 GB. Quantization (INT8) brings this to ~14 GB. DiskANN-style indexes trade latency for the ability to keep vectors on SSD.
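The memory arithmetic above is worth keeping as a helper. This sketch counts raw vector bytes only — real indexes add HNSW graph links and metadata on top:

```python
def index_memory_gib(n_vectors, dims, bytes_per_dim=4):
    """Raw vector storage in GiB; excludes HNSW graph links and metadata overhead."""
    return n_vectors * dims * bytes_per_dim / 2**30

print(f"{index_memory_gib(10_000_000, 1536):.1f} GiB float32")                # 57.2 GiB float32
print(f"{index_memory_gib(10_000_000, 1536, bytes_per_dim=1):.1f} GiB int8")  # 14.3 GiB int8
```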
A single Qdrant or Weaviate node on a 16 GB machine can handle 2–5 million 1536-dimension vectors with quantization, serving queries in under 20ms. Start here. Move to a distributed setup only when you have measured evidence that a single node is insufficient — not because you think you might need it someday.
When Embedded (In-Process) Solutions Are Enough
ChromaDB and FAISS run inside your application process with no separate server. This is ideal for prototypes, single-user tools, CLI applications, and datasets under 100K vectors (ChromaDB can even persist to local disk via chromadb.PersistentClient). The moment you need concurrent access from multiple services, high availability, or server-side scaling, move to a client-server database like Qdrant or Weaviate.