The Starting Point: $23K/Month
When we launched our first enterprise AI system, the monthly bill was terrifying: $14K in LLM API costs, $4K in infrastructure, $3K in vector database hosting, and $2K in monitoring. For a system serving 50K queries/month, that's $0.46 per query. The client's budget was $0.08.
Here's what we did, in order of impact:
1. Semantic Caching ($14K -> $6.2K)
55% of queries were semantically similar to previous ones. We built a semantic cache on embedding similarity: if a new query's embedding has cosine similarity above 0.95 with a cached query's, serve the cached response. Hit rate: 48%. API costs dropped by 56%.
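Here's a minimal sketch of the mechanism, assuming you already have query embeddings in hand (the SemanticCache class and the in-memory brute-force scan are illustrative, not our production code):

```python
import numpy as np

class SemanticCache:
    """Serve a stored response when a new query's embedding is close
    enough (cosine similarity > threshold) to a previously seen query."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self._embeddings: list[np.ndarray] = []  # unit-normalized vectors
        self._responses: list[str] = []

    @staticmethod
    def _unit(v: np.ndarray) -> np.ndarray:
        return v / np.linalg.norm(v)

    def get(self, query_embedding: np.ndarray) -> str | None:
        """Return a cached response on a hit, None on a miss."""
        if not self._embeddings:
            return None
        q = self._unit(query_embedding)
        # Dot product of unit vectors == cosine similarity. Brute force
        # is fine for small caches; swap in an ANN index at scale.
        sims = np.stack(self._embeddings) @ q
        best = int(np.argmax(sims))
        return self._responses[best] if sims[best] > self.threshold else None

    def put(self, query_embedding: np.ndarray, response: str) -> None:
        self._embeddings.append(self._unit(query_embedding))
        self._responses.append(response)
```

The threshold is the main tuning knob: set it too low and superficially similar queries get wrong answers, too high and the hit rate collapses.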
2. Model Routing ($6.2K -> $3.8K)
Not every query needs Claude Opus. We built a lightweight classifier that routes queries to the appropriate model: simple factual lookups go to Haiku ($0.25/1M input tokens), standard analysis to Sonnet ($3/1M), complex reasoning to Opus ($15/1M). Distribution: 40% Haiku, 45% Sonnet, 15% Opus.
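A sketch of the routing layer. Our production router is a trained classifier; the keyword heuristic below is a stand-in to show the control flow, and the model IDs are the public Claude 3 ones:

```python
# Illustrative heuristic router; production used a trained classifier.
HAIKU = "claude-3-haiku-20240307"     # $0.25/1M input tokens
SONNET = "claude-3-sonnet-20240229"   # $3/1M input tokens
OPUS = "claude-3-opus-20240229"       # $15/1M input tokens

LOOKUP_HINTS = ("what is", "who is", "when did", "define", "list")
REASONING_HINTS = ("why", "compare", "trade-off", "design", "derive")

def route(query: str) -> str:
    q = query.lower()
    # Short factual lookups: cheapest model is good enough.
    if any(h in q for h in LOOKUP_HINTS) and len(q) < 200:
        return HAIKU
    # Multi-step reasoning or very long inputs: strongest model.
    if any(h in q for h in REASONING_HINTS) or len(q) > 1500:
        return OPUS
    # Default: standard analysis.
    return SONNET
```

The routing decision itself has to be near-free: if classifying a query costs as much as answering it with Haiku, you've gained nothing.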
3. Prompt Optimization ($3.8K -> $2.9K)
We audited every prompt. Removed redundant instructions, compressed few-shot examples, eliminated verbose system prompts. Average prompt length dropped from 2,100 tokens to 1,340 tokens. Same output quality — we validated against our eval suite.
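A sketch of the audit gate we ran every compression through. tiktoken's cl100k_base encoding is a rough proxy, since Claude's exact tokenizer isn't public, and run_eval_suite is a hypothetical stand-in for your own eval harness:

```python
import tiktoken  # rough proxy; Claude's exact tokenizer isn't public

enc = tiktoken.get_encoding("cl100k_base")

def token_count(text: str) -> int:
    return len(enc.encode(text))

def accept_compressed(name: str, old: str, new: str,
                      run_eval_suite, tolerance: float = 0.01) -> int:
    """Accept a compressed prompt only if eval quality holds.

    run_eval_suite(prompt) -> float is your own harness (hypothetical).
    Returns tokens saved; raises if quality regresses beyond tolerance.
    """
    old_score, new_score = run_eval_suite(old), run_eval_suite(new)
    if new_score < old_score - tolerance:
        raise ValueError(f"{name}: quality regressed "
                         f"({old_score:.3f} -> {new_score:.3f})")
    return token_count(old) - token_count(new)
```

The eval gate is the important part: cutting tokens is trivial, cutting them without quality loss is not.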
4. Infrastructure Right-Sizing ($4K -> $1.2K)
Moved from over-provisioned EC2 instances to Fargate with auto-scaling. Our traffic is bursty — heavy during business hours, dead at night. Fargate scales to zero. Also switched from Redis Cluster to a single ElastiCache node with connection pooling.
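The scale-to-zero piece needs scheduled actions alongside target tracking, because with zero running tasks there is no CPU metric to trigger a scale-out. One way to wire this up with boto3 (cluster and service names below are made up):

```python
import boto3

aas = boto3.client("application-autoscaling")
SERVICE = "service/ai-cluster/rag-api"  # hypothetical cluster/service
DIMENSION = "ecs:service:DesiredCount"

# Let the service run between 0 and 8 Fargate tasks.
aas.register_scalable_target(
    ServiceNamespace="ecs", ResourceId=SERVICE,
    ScalableDimension=DIMENSION, MinCapacity=0, MaxCapacity=8,
)

# During the day, track CPU: add tasks when average CPU exceeds 60%.
aas.put_scaling_policy(
    PolicyName="cpu-target-tracking",
    ServiceNamespace="ecs", ResourceId=SERVICE,
    ScalableDimension=DIMENSION,
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"},
    },
)

# Pin the service to zero overnight, wake it before business hours
# (cron times are UTC; adjust for your timezone).
aas.put_scheduled_action(
    ServiceNamespace="ecs", ScheduledActionName="sleep",
    ResourceId=SERVICE, ScalableDimension=DIMENSION,
    Schedule="cron(0 2 * * ? *)",
    ScalableTargetAction={"MinCapacity": 0, "MaxCapacity": 0},
)
aas.put_scheduled_action(
    ServiceNamespace="ecs", ScheduledActionName="wake",
    ResourceId=SERVICE, ScalableDimension=DIMENSION,
    Schedule="cron(0 12 * * ? *)",
    ScalableTargetAction={"MinCapacity": 0, "MaxCapacity": 8},
)
```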
5. Vector DB Optimization ($3K -> $0.8K)
Reduced embedding dimensions from 1536 to 768 using Matryoshka embeddings. Same retrieval quality (within 0.3% on our benchmark), half the storage. Also implemented metadata filtering to shrink the candidate set before the vector similarity search runs.
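Matryoshka-trained embeddings front-load information into the leading coordinates, so dimension reduction is literally slicing. A minimal sketch, assuming your embedding model was trained with Matryoshka representation learning (truncating an ordinary embedding this way will wreck retrieval quality):

```python
import numpy as np

def truncate_embedding(v: np.ndarray, dims: int = 768) -> np.ndarray:
    """Matryoshka truncation: keep the first `dims` coordinates and
    re-normalize so downstream cosine similarity stays meaningful.

    Only valid for MRL-trained embeddings, where early dimensions
    carry most of the signal by construction.
    """
    t = np.asarray(v, dtype=np.float32)[:dims]
    return t / np.linalg.norm(t)

# 1536-dim vector from the embedding model -> 768-dim vector for storage.
full = np.random.default_rng(0).normal(size=1536)  # stand-in embedding
half = truncate_embedding(full)  # shape (768,), unit norm
```

Metadata filtering is the other half of the win: filtering on fields like tenant or document type before the ANN search means the index does less work per query.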
End result: $6.9K/month all-in ($2.9K LLM API, $1.2K infrastructure, $0.8K vector DB, plus the untouched $2K monitoring), or about $0.14 per query. Same throughput, better latency (thanks to caching), and the client is happy. The 70% reduction came from engineering discipline, not magic.