The Starting Point: $23K/Month
When we launched our first enterprise AI system, the monthly bill was terrifying: $14K in LLM API costs, $4K in infrastructure, $3K in vector database hosting, and $2K in monitoring. For a system serving 50K queries/month, that's $0.46 per query. The client's budget was $0.08.
Here's what we did, in order of impact:
1. Semantic Caching ($14K -> $6.2K)
55% of queries were semantically similar to previous ones. We built a semantic cache on embedding similarity: if a new query's embedding has cosine similarity above 0.95 with a cached query's, serve the cached response. Hit rate: 48%. API costs dropped by 56%.
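Here's a minimal sketch of the mechanism, assuming you already have query embeddings in hand (the SemanticCache class and the in-memory brute-force scan are illustrative, not our production code):

```python
import numpy as np

class SemanticCache:
    """Serve a stored response when a new query's embedding is close
    enough (cosine similarity > threshold) to a previously seen query."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self._embeddings: list[np.ndarray] = []  # unit-normalized vectors
        self._responses: list[str] = []

    @staticmethod
    def _unit(v: np.ndarray) -> np.ndarray:
        return v / np.linalg.norm(v)

    def get(self, query_embedding: np.ndarray) -> str | None:
        """Return a cached response on a hit, None on a miss."""
        if not self._embeddings:
            return None
        q = self._unit(query_embedding)
        # Dot product of unit vectors == cosine similarity. Brute force
        # is fine for small caches; swap in an ANN index at scale.
        sims = np.stack(self._embeddings) @ q
        best = int(np.argmax(sims))
        return self._responses[best] if sims[best] > self.threshold else None

    def put(self, query_embedding: np.ndarray, response: str) -> None:
        self._embeddings.append(self._unit(query_embedding))
        self._responses.append(response)
```

The threshold is the main tuning knob: set it too low and superficially similar queries get wrong answers, too high and the hit rate collapses.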
2. Model Routing ($6.2K -> $3.8K)
Not every query needs Claude Opus. We built a lightweight classifier that routes queries to the appropriate model: simple factual lookups go to Haiku ($0.25/1M input tokens), standard analysis to Sonnet ($3/1M), complex reasoning to Opus ($15/1M). Distribution: 40% Haiku, 45% Sonnet, 15% Opus.
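A sketch of the routing layer. Our production router is a trained classifier; the keyword heuristic below is a stand-in to show the control flow, and the model IDs are the public Claude 3 ones:

```python
# Illustrative heuristic router; production used a trained classifier.
HAIKU = "claude-3-haiku-20240307"     # $0.25/1M input tokens
SONNET = "claude-3-sonnet-20240229"   # $3/1M input tokens
OPUS = "claude-3-opus-20240229"       # $15/1M input tokens

LOOKUP_HINTS = ("what is", "who is", "when did", "define", "list")
REASONING_HINTS = ("why", "compare", "trade-off", "design", "derive")

def route(query: str) -> str:
    q = query.lower()
    # Short factual lookups: cheapest model is good enough.
    if any(h in q for h in LOOKUP_HINTS) and len(q) < 200:
        return HAIKU
    # Multi-step reasoning or very long inputs: strongest model.
    if any(h in q for h in REASONING_HINTS) or len(q) > 1500:
        return OPUS
    # Default: standard analysis.
    return SONNET
```

The routing decision itself has to be near-free: if classifying a query costs as much as answering it with Haiku, you've gained nothing.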
3. Prompt Optimization ($3.8K -> $2.9K)
We audited every prompt. Removed redundant instructions, compressed few-shot examples, eliminated verbose system prompts. Average prompt length dropped from 2,100 tokens to 1,340 tokens. Same output quality — we validated against our eval suite.
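A sketch of the audit gate we ran every compression through. tiktoken's cl100k_base encoding is a rough proxy, since Claude's exact tokenizer isn't public, and run_eval_suite is a hypothetical stand-in for your own eval harness:

```python
import tiktoken  # rough proxy; Claude's exact tokenizer isn't public

enc = tiktoken.get_encoding("cl100k_base")

def token_count(text: str) -> int:
    return len(enc.encode(text))

def accept_compressed(name: str, old: str, new: str,
                      run_eval_suite, tolerance: float = 0.01) -> int:
    """Accept a compressed prompt only if eval quality holds.

    run_eval_suite(prompt) -> float is your own harness (hypothetical).
    Returns tokens saved; raises if quality regresses beyond tolerance.
    """
    old_score, new_score = run_eval_suite(old), run_eval_suite(new)
    if new_score < old_score - tolerance:
        raise ValueError(f"{name}: quality regressed "
                         f"({old_score:.3f} -> {new_score:.3f})")
    return token_count(old) - token_count(new)
```

The eval gate is the important part: cutting tokens is trivial, cutting them without quality loss is not.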
4. Infrastructure Right-Sizing ($4K -> $1.2K)
Moved from over-provisioned EC2 instances to Fargate with auto-scaling. Our traffic is bursty — heavy during business hours, dead at night. Fargate scales to zero. Also switched from Redis Cluster to a single ElastiCache node with connection pooling.
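The scale-to-zero piece needs scheduled actions alongside target tracking, because with zero running tasks there is no CPU metric to trigger a scale-out. One way to wire this up with boto3 (cluster and service names below are made up):

```python
import boto3

aas = boto3.client("application-autoscaling")
SERVICE = "service/ai-cluster/rag-api"  # hypothetical cluster/service
DIMENSION = "ecs:service:DesiredCount"

# Let the service run between 0 and 8 Fargate tasks.
aas.register_scalable_target(
    ServiceNamespace="ecs", ResourceId=SERVICE,
    ScalableDimension=DIMENSION, MinCapacity=0, MaxCapacity=8,
)

# During the day, track CPU: add tasks when average CPU exceeds 60%.
aas.put_scaling_policy(
    PolicyName="cpu-target-tracking",
    ServiceNamespace="ecs", ResourceId=SERVICE,
    ScalableDimension=DIMENSION,
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"},
    },
)

# Pin the service to zero overnight, wake it before business hours
# (cron times are UTC; adjust for your timezone).
aas.put_scheduled_action(
    ServiceNamespace="ecs", ScheduledActionName="sleep",
    ResourceId=SERVICE, ScalableDimension=DIMENSION,
    Schedule="cron(0 2 * * ? *)",
    ScalableTargetAction={"MinCapacity": 0, "MaxCapacity": 0},
)
aas.put_scheduled_action(
    ServiceNamespace="ecs", ScheduledActionName="wake",
    ResourceId=SERVICE, ScalableDimension=DIMENSION,
    Schedule="cron(0 12 * * ? *)",
    ScalableTargetAction={"MinCapacity": 0, "MaxCapacity": 8},
)
```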
5. Vector DB Optimization ($3K -> $0.8K)
Reduced embedding dimensions from 1536 to 768 using Matryoshka embeddings. Same retrieval quality (within 0.3% on our benchmark), half the storage. Also implemented metadata filtering to shrink the candidate set before the vector similarity search runs.
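Matryoshka-trained embeddings front-load information into the leading coordinates, so dimension reduction is literally slicing. A minimal sketch, assuming your embedding model was trained with Matryoshka representation learning (truncating an ordinary embedding this way will wreck retrieval quality):

```python
import numpy as np

def truncate_embedding(v: np.ndarray, dims: int = 768) -> np.ndarray:
    """Matryoshka truncation: keep the first `dims` coordinates and
    re-normalize so downstream cosine similarity stays meaningful.

    Only valid for MRL-trained embeddings, where early dimensions
    carry most of the signal by construction.
    """
    t = np.asarray(v, dtype=np.float32)[:dims]
    return t / np.linalg.norm(t)

# 1536-dim vector from the embedding model -> 768-dim vector for storage.
full = np.random.default_rng(0).normal(size=1536)  # stand-in embedding
half = truncate_embedding(full)  # shape (768,), unit norm
```

Metadata filtering is the other half of the win: filtering on fields like tenant or document type before the ANN search means the index does less work per query.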
End result: $6.9K/month all-in ($2.9K LLM API, $1.2K infrastructure, $0.8K vector DB, plus the untouched $2K monitoring), or about $0.14 per query. Same throughput, better latency (thanks to caching), and the client is happy. The 70% reduction came from engineering discipline, not magic.