Why the Choice Matters
The vector database is the backbone of any RAG system. It stores your embeddings, executes similarity searches at query time, and — in production — needs to handle concurrent queries, metadata filtering, index updates, and backups without falling over. Choosing wrong means either paying 10x more than necessary or discovering latency problems when your biggest client starts using the system.
We have deployed production RAG systems on Pinecone, pgvector (via PostgreSQL), and Qdrant. This post shares our benchmarks and recommendations based on real deployment experience, not synthetic tests.
Test Methodology
We benchmarked all three systems on a standardized workload modeled after our LEXIS AI contract analysis pipeline:
- Corpus size: 500,000 vectors (1024 dimensions, float32)
- Query patterns: Single vector search with metadata filter, batch retrieval, concurrent load
- Metadata: Each vector has 5 metadata fields (document_type, date_range, client_id, jurisdiction, clause_type)
- Hardware: pgvector on RDS db.r6g.xlarge (4 vCPU, 32GB RAM), Qdrant on a 4 vCPU / 16GB RAM EC2 instance, Pinecone serverless (no capacity provisioning)
- Query volume: 50 queries/second sustained, burst to 200 queries/second
All tests use the same embedding vectors generated by our fine-tuned E5-large model. We report p50 and p99 latencies at sustained load.
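For reference, the percentile reporting works along these lines (a minimal stdlib sketch; the function name and the sample distribution are illustrative, not our actual harness):

```python
import statistics

def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    """Compute p50/p99 from raw per-query latency samples (milliseconds)."""
    # quantiles(n=100) returns the 99 cut points between percentile buckets;
    # index 49 is the 50th percentile, index 98 is the 99th.
    cuts = statistics.quantiles(samples_ms, n=100)
    return {"p50": cuts[49], "p99": cuts[98]}

# Synthetic distribution: mostly fast queries with a slow tail
samples = [5.0] * 98 + [50.0, 90.0]
print(latency_percentiles(samples))
```

The p99 is what matters for user-facing SLAs: it is dominated by the tail, which is exactly where the three systems diverge under load.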
Results: Raw Performance
Query Latency (sustained 50 QPS, top-10 retrieval)
| System | p50 (ms) | p99 (ms) | p50 filtered (ms) | p99 filtered (ms) |
|---|---|---|---|---|
| Pinecone serverless | 18 | 45 | 22 | 58 |
| Qdrant (HNSW, m=16) | 4 | 12 | 8 | 24 |
| pgvector (HNSW, ef_search=100) | 23 | 85 | 41 | 180 |
Qdrant is fastest by a significant margin, especially under filtered queries. Pinecone offers consistent (if slightly higher) latencies with no operational overhead. pgvector is the slowest, and its filtered query performance degrades notably — pgvector applies filters after the vector search, meaning it sometimes needs to over-fetch and filter, increasing latency.
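The gap on filtered queries comes down to where the filter runs. A toy sketch (illustrative only, not any engine's actual code) shows why a post-filtering engine must over-fetch when the filter is selective:

```python
# Toy illustration of pre- vs post-filtering. Vectors are 1-D "embeddings";
# similarity is negative absolute distance. The 5% selectivity mirrors a
# selective metadata filter.
import random

random.seed(0)
corpus = [{"id": i, "vec": random.random(), "mna": i % 20 == 0} for i in range(10_000)]

def search(candidates, query, top_k):
    return sorted(candidates, key=lambda r: abs(r["vec"] - query))[:top_k]

query, top_k = 0.5, 10

# Pre-filtering: restrict the candidate set first, then search it.
pre = search([r for r in corpus if r["mna"]], query, top_k)

# Post-filtering: search the full corpus first, then filter. With a 5%
# selective filter, an over-fetch of ~top_k / 0.05 candidates is needed
# on average -- and even that may come up short.
overfetched = search(corpus, query, top_k * 20)
post = [r for r in overfetched if r["mna"]][:top_k]

print(len(pre), len(post))  # pre always returns 10; post may fall short
```

The pre-filtering path always returns a full result set; the post-filtering path pays for the over-fetch in latency and can still return fewer than top_k results.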
Throughput at Burst (200 QPS for 60 seconds)
| System | Avg latency (ms) | Max latency (ms) | Error rate |
|---|---|---|---|
| Pinecone serverless | 34 | 210 | 0% |
| Qdrant | 11 | 85 | 0% |
| pgvector | 120 | 1,400 | 2.3% |
Under burst load, pgvector struggles. The PostgreSQL connection pool saturates, queries queue up, and tail latencies explode. Pinecone handles the burst gracefully — serverless auto-scales with no configuration. Qdrant maintains excellent performance if pre-scaled.
Index Build Time (500K vectors)
| System | Initial index | Incremental insert (1K vectors) |
|---|---|---|
| Pinecone | ~8 minutes | ~2 seconds |
| Qdrant | ~12 minutes | ~3 seconds |
| pgvector | ~45 minutes | ~15 seconds* |
*pgvector HNSW index builds are particularly slow. During index building, query performance degrades significantly. We recommend building indexes during maintenance windows.
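As a sketch, the build can be made considerably faster with session settings applied before creating the HNSW index (parameter values here are illustrative and should be sized to your instance; the `clauses` table and `embedding` column follow the schema used later in this post):

```sql
-- Speed up the one-time HNSW build (values are illustrative)
SET maintenance_work_mem = '8GB';
SET max_parallel_maintenance_workers = 4;

-- CONCURRENTLY avoids blocking writes during the build
CREATE INDEX CONCURRENTLY clauses_embedding_idx
    ON clauses
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
```

If `maintenance_work_mem` is too small to hold the graph, the build spills and slows down dramatically, which is one reason the 45-minute figure above is so sensitive to instance sizing.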
Cost Analysis (Monthly)
For our reference workload (500K vectors, 50 QPS average):
Pinecone serverless: ~$70/month. No infrastructure to manage. Scales automatically. This is the cost of the managed service — no compute instances, no maintenance.
Qdrant (self-hosted on EC2): ~$145/month (c6i.xlarge instance + EBS). You manage the instance, upgrades, backups, and monitoring. Qdrant Cloud managed service would be ~$95/month.
pgvector (RDS): ~$280/month (db.r6g.xlarge). But if you already have a PostgreSQL instance for your application, the incremental cost is the additional memory and CPU needed for vector operations — potentially $0-$50/month.
The pgvector cost advantage only materializes if you are already running PostgreSQL. If you are provisioning a dedicated instance for vector search, it is the most expensive option with the worst performance.
Filtering: The Hidden Differentiator
In production RAG, you almost never want a pure similarity search. You want "find the most similar clauses in M&A agreements from the New York jurisdiction signed after 2023." This is filtered vector search, and the three systems handle it very differently.
Pinecone applies metadata filters during the search, not after. Filters narrow the candidate set before the ANN search runs. This means filtered queries are nearly as fast as unfiltered queries. Filter syntax is straightforward:
```python
results = index.query(
    vector=query_embedding,
    top_k=10,
    filter={
        "document_type": {"$eq": "mna"},
        "jurisdiction": {"$eq": "ny"},
        "date": {"$gte": 20230101},  # Pinecone range operators require numeric values
    },
)
```
Qdrant also supports pre-filtering with payload indexes. You define which fields should be indexed, and Qdrant builds efficient filter structures. Performance is excellent, and the filter language is expressive (supports nested conditions, geo filters, and full-text match).
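A sketch of the equivalent Qdrant query via its REST search API (shown as a plain dict rather than the `qdrant-client` models; the field names and numeric date encoding are illustrative):

```python
import json

# Qdrant REST search payload: the filter runs against payload indexes,
# pre-restricting candidates before the HNSW traversal.
search_request = {
    "vector": [0.1] * 1024,  # query embedding (placeholder values)
    "limit": 10,
    "filter": {
        "must": [
            {"key": "document_type", "match": {"value": "mna"}},
            {"key": "jurisdiction", "match": {"value": "ny"}},
            {"key": "signed_date", "range": {"gte": 20230101}},
        ]
    },
}

# POST this body to /collections/<name>/points/search
body = json.dumps(search_request)
```

The `must` clause is a conjunction; Qdrant also accepts `should` and `must_not` blocks, which is where its filter language pulls ahead for complex conditions.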
pgvector applies filters via standard SQL WHERE clauses, but the filter is applied after the vector search. The query planner retrieves more candidates than requested, filters them, and returns the top-k. For selective filters (matching <5% of the corpus), this means the system may need to scan significantly more vectors than expected, causing latency spikes.
```sql
-- pgvector filtered query
-- Performance depends heavily on filter selectivity
SELECT id, content, metadata,
       embedding <=> $1::vector AS distance
FROM clauses
WHERE document_type = 'mna'
  AND jurisdiction = 'ny'
  AND signed_date >= '2023-01-01'
ORDER BY embedding <=> $1::vector
LIMIT 10;
```
For workloads with complex, selective filters, pgvector's post-filtering approach is a significant limitation.
Operational Complexity
Pinecone is fully managed. There is nothing to operate. Index creation, scaling, backups, and upgrades are handled automatically. The tradeoff is vendor lock-in and less control over infrastructure. If Pinecone has an outage, you have an outage.
Qdrant is operationally moderate. The Docker-based deployment is straightforward. It supports replication and sharding for high availability. Upgrades require rolling restarts. You need to manage backups (snapshots to S3), monitoring (Prometheus metrics are exposed), and capacity planning.
pgvector leverages existing PostgreSQL operational tooling — which is a huge advantage if your team already manages PostgreSQL. Backups, replication, monitoring, and scaling are handled by your existing database infrastructure. The disadvantage is that vector workloads compete with transactional workloads for PostgreSQL resources, which can cause mutual performance degradation under load.
Our Recommendations
Based on our deployment experience:
Use Pinecone when: You want zero operational overhead, your query patterns are relatively simple (similarity + metadata filtering), and you are comfortable with managed service pricing. This is our default recommendation for startups and teams without dedicated infrastructure engineers. We use Pinecone for LEXIS AI.
Use Qdrant when: You need the lowest possible latency, have complex filtering requirements, want to self-host for data sovereignty or compliance reasons, and have the engineering capacity to manage infrastructure. We use Qdrant for HALO's real-time anomaly correlation where sub-10ms query latency is critical.
Use pgvector when: You already run PostgreSQL, your vector corpus is under 1 million vectors, your query volume is moderate (<20 QPS sustained), and you want to avoid adding another database to your stack. pgvector is a reasonable choice for prototypes and smaller-scale applications. We use it for internal tools and development environments.
Do not use pgvector when: You need consistent sub-50ms filtered queries, your corpus exceeds 2 million vectors, or you have bursty query patterns. pgvector's performance characteristics make it unsuitable for high-throughput, latency-sensitive RAG pipelines.
The Migration Path
A practical pattern we use: start with pgvector during development and early production (simple, no additional infrastructure), migrate to Pinecone or Qdrant when query volume or latency requirements exceed pgvector's capabilities. The migration is straightforward — re-embed your corpus (or export and re-import vectors), update the retrieval service configuration, and validate with your evaluation framework.
The abstraction layer that makes this migration painless:
```python
from abc import ABC, abstractmethod

class VectorStore(ABC):
    @abstractmethod
    async def upsert(self, vectors: list[VectorRecord]) -> None: ...

    @abstractmethod
    async def query(
        self, vector: list[float], top_k: int, filter: dict | None = None
    ) -> list[SearchResult]: ...

    @abstractmethod
    async def delete(self, ids: list[str]) -> None: ...

class PineconeStore(VectorStore): ...
class QdrantStore(VectorStore): ...
class PgVectorStore(VectorStore): ...
```
Build to an interface from day one, and the vector database becomes a swappable infrastructure decision rather than an architectural commitment.
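As a usage sketch of the pattern, here is a toy in-memory store that satisfies the same interface shape (the `VectorRecord` and `SearchResult` dataclasses and the `InMemoryStore` are illustrative stand-ins, not part of the real service):

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class VectorRecord:
    id: str
    vector: list[float]
    metadata: dict = field(default_factory=dict)

@dataclass
class SearchResult:
    id: str
    score: float

class InMemoryStore:
    """Toy store matching the VectorStore interface; useful in tests.
    In production, a config flag selects PineconeStore / QdrantStore / PgVectorStore."""

    def __init__(self) -> None:
        self._records: dict[str, VectorRecord] = {}

    async def upsert(self, vectors: list[VectorRecord]) -> None:
        for r in vectors:
            self._records[r.id] = r

    async def query(self, vector, top_k, filter=None) -> list[SearchResult]:
        # Brute-force dot-product similarity over all records
        def dot(a, b):
            return sum(x * y for x, y in zip(a, b))
        hits = [SearchResult(r.id, dot(vector, r.vector)) for r in self._records.values()]
        return sorted(hits, key=lambda h: h.score, reverse=True)[:top_k]

    async def delete(self, ids: list[str]) -> None:
        for i in ids:
            self._records.pop(i, None)

async def main() -> None:
    store = InMemoryStore()  # swap implementation here, callers are unchanged
    await store.upsert([VectorRecord("a", [1.0, 0.0]), VectorRecord("b", [0.0, 1.0])])
    hits = await store.query([1.0, 0.1], top_k=1)
    print(hits[0].id)

asyncio.run(main())
```

Because callers only see the interface, the migration described above touches configuration and the corpus re-import, not retrieval code.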