Question 1

How much does it cost to build a RAG system?

Accepted Answer

For a typical 'chat with your PDFs' bot (1,000 documents and 10,000 questions a month on GPT-4o-mini), expect roughly $6 per month (about Rs 1,800 at Rs 300/USD) plus a one-time indexing charge of around $0.02. Cost scales mainly with query volume and the generation model you pick; knowledge-base size barely moves it. The calculator above prices your exact numbers.

Question 2

Is RAG cheaper than fine-tuning a model?

Accepted Answer

Usually yes, at least when getting started. RAG has near-zero setup cost: you pay a few cents to embed your documents, then per-query API fees. Fine-tuning charges for a training run up front (often tens to hundreds of dollars) plus higher per-token inference rates on the custom model. RAG also lets you update knowledge by re-indexing instead of retraining. Fine-tuning wins mainly on style and format consistency, not factual recall.

Question 3

How much do embeddings cost for a RAG knowledge base?

Accepted Answer

Indexing is the cheapest line in the whole pipeline. At OpenAI text-embedding-3-small's $0.02 per million tokens, embedding a one-million-token corpus (about 1,300 PDF pages) costs $0.02. Even ten million tokens is $0.20 on the small model or $1.30 on text-embedding-3-large. You pay it once, plus a tiny per-query embedding fee on each question.
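The indexing arithmetic above can be sketched in a few lines. This is a minimal illustration, not the calculator's own code: the function name is hypothetical, and the $0.13 per million tokens for text-embedding-3-large is inferred from the $1.30 per ten million tokens quoted above.

```python
def embedding_cost_usd(corpus_tokens: int, price_per_million_usd: float) -> float:
    """One-time cost of embedding a corpus at a flat per-million-token price."""
    return corpus_tokens / 1_000_000 * price_per_million_usd


# USD per 1M tokens. The small-model price is quoted in the answer above;
# the large-model price is inferred from "$1.30 for ten million tokens".
SMALL_PRICE = 0.02
LARGE_PRICE = 0.13

for tokens in (1_000_000, 10_000_000):
    small = embedding_cost_usd(tokens, SMALL_PRICE)
    large = embedding_cost_usd(tokens, LARGE_PRICE)
    print(f"{tokens:>12,} tokens: small ${small:.2f}, large ${large:.2f}")
```

Re-indexing after a document update repeats this charge only for the tokens that are re-embedded.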

Question 4

What is the per-query cost of a RAG chatbot?

Accepted Answer

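Per query you pay a near-free retrieval embedding plus the LLM generation cost. Generation dominates: it bills the system prompt, the question, and every retrieved chunk as input tokens, then the answer as output tokens. A 5-chunk retrieval with a 500-token chunk size sends about 2,700 input tokens (2,500 from the chunks plus roughly 200 for the system prompt and question). That works out to about $0.0006 per query on GPT-4o-mini, or roughly $0.013 on Claude Sonnet 4.5 at its $3 input and $15 output per million tokens, assuming an answer of around 300 tokens.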
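A minimal sketch of the per-query generation cost for the GPT-4o-mini case follows. Several values are assumptions rather than figures from the calculator: the GPT-4o-mini prices ($0.15 input, $0.60 output per million tokens), the 200-token overhead for system prompt and question (the gap between 2,700 and 5 × 500), and the 300-token answer length. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    input_per_million: float   # USD per 1M input tokens
    output_per_million: float  # USD per 1M output tokens


def query_cost_usd(top_k: int, chunk_size: int, overhead_tokens: int,
                   output_tokens: int, price: ModelPrice) -> float:
    """Generation cost of one RAG query: retrieved chunks + overhead in, answer out."""
    input_tokens = top_k * chunk_size + overhead_tokens
    return (input_tokens * price.input_per_million
            + output_tokens * price.output_per_million) / 1_000_000


# Assumed list prices; check the vendor's pricing page before relying on them.
GPT_4O_MINI = ModelPrice(input_per_million=0.15, output_per_million=0.60)

cost = query_cost_usd(top_k=5, chunk_size=500, overhead_tokens=200,
                      output_tokens=300, price=GPT_4O_MINI)
print(f"Per query: ${cost:.4f}")
print(f"10,000 queries: ${cost * 10_000:.2f}")
```

Swapping in another model is a matter of passing a different `ModelPrice`; the token counts stay the same.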

Question 5

Does increasing top-k retrieval increase RAG cost?

Accepted Answer

Yes, directly. Retrieved chunks add topK × chunkSize tokens to the generation model's input on every query. Doubling top-k from 5 to 10 at a 500-token chunk size adds 2,500 input tokens per query. Since generation is the biggest cost line, retrieval depth is one of the strongest levers on your bill. Retrieve only as many chunks as your answer quality actually needs.
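To put a price on a change in top-k, a rough sketch like the one below may help. The $0.15 per million input tokens (GPT-4o-mini) and the 10,000 queries per month are assumptions for illustration, and the function names are hypothetical.

```python
def retrieved_tokens(top_k: int, chunk_size: int) -> int:
    """Input tokens contributed by retrieved chunks: topK x chunkSize."""
    return top_k * chunk_size


def extra_input_cost_usd(old_k: int, new_k: int, chunk_size: int,
                         input_price_per_million: float) -> tuple[int, float]:
    """Extra input tokens and extra USD per query when top-k changes."""
    extra_tokens = retrieved_tokens(new_k, chunk_size) - retrieved_tokens(old_k, chunk_size)
    return extra_tokens, extra_tokens / 1_000_000 * input_price_per_million


# Assumed input price (USD per 1M tokens) and monthly query volume.
extra_tokens, extra_per_query = extra_input_cost_usd(5, 10, 500, 0.15)
print(f"Extra input tokens per query: {extra_tokens:,}")
print(f"Extra cost over 10,000 queries: ${extra_per_query * 10_000:.2f}")
```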

Question 6

Why does LLM generation dominate the bill?

Accepted Answer

Indexing is one-time and embeddings are priced at cents per million tokens, while generation runs on every single query and bills the much pricier chat-model rates — often 5 to 100 times the embedding rate per token. Multiply that by thousands of monthly queries, each carrying several retrieved chunks as input, and generation routinely accounts for over 95% of a RAG bill. The breakdown bar in the calculator makes this share explicit.
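The share claim can be checked with a small breakdown like the sketch below. Every input is an illustrative assumption, not a value read from the calculator: a per-query generation cost of $0.000585, a 50-token question embedded at $0.02 per million tokens, and a 0.0123 GB index stored at $0.33 per GB-month. One-time indexing is left out because it is not a recurring charge.

```python
def cost_shares(line_items: dict[str, float]) -> dict[str, float]:
    """Return each line item's fraction of the total monthly bill."""
    total = sum(line_items.values())
    return {name: cost / total for name, cost in line_items.items()}


QUERIES_PER_MONTH = 10_000  # assumed volume

monthly_usd = {
    # Assumed per-query generation cost (GPT-4o-mini, 5 x 500-token chunks).
    "generation": QUERIES_PER_MONTH * 0.000585,
    # Assumed 50-token question, embedded at $0.02 per 1M tokens.
    "query_embedding": QUERIES_PER_MONTH * 50 / 1_000_000 * 0.02,
    # Assumed 0.0123 GB index at $0.33 per GB-month.
    "vector_storage": 0.0123 * 0.33,
}

shares = cost_shares(monthly_usd)
for name, share in shares.items():
    print(f"{name:<16} ${monthly_usd[name]:.4f}  {share:.2%}")
```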

Question 7

How is the LKR figure calculated?

Accepted Answer

LKR = USD × your chosen USD→LKR rate. The default is the Central Bank of Sri Lanka daily indicative rate (about Rs 300/USD on the verification date) — edit it to match your bank's actual conversion or what Wise or Payoneer quoted. Banks usually settle 1–3% weaker than the CBSL indicative, so budget a small buffer when quoting a client.
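The conversion, with an optional buffer for the bank's weaker settlement rate, might be sketched as follows. The function name and the `buffer` parameter are hypothetical; the Rs 300/USD default mirrors the approximate rate mentioned above and should be replaced with your own.

```python
def usd_to_lkr(usd: float, rate: float = 300.0, buffer: float = 0.0) -> float:
    """Convert USD to LKR at `rate`, padded by `buffer` (e.g. 0.03 for 3%)."""
    return usd * rate * (1.0 + buffer)


monthly_usd = 6.0  # the example monthly bill from Question 1
print(f"Indicative rate: Rs {usd_to_lkr(monthly_usd):,.0f}")
print(f"With 3% buffer:  Rs {usd_to_lkr(monthly_usd, buffer=0.03):,.0f}")
```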

Question 8

Does this include vector-database query (read) costs?

Accepted Answer

No. The tool prices vector storage at your chosen $/GB-month (default Pinecone serverless $0.33) but not per-query read/write units, which vary widely by provider and index size. On serverless tiers these read units are typically a small fraction of the LLM generation cost. If your provider charges meaningful read fees, add them on top of the figure here.
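For a sense of scale, the storage line can be approximated from the raw vector size. This is a back-of-envelope sketch and not necessarily how the tool sizes an index: it assumes float32 vectors, 1,536 dimensions (the default for text-embedding-3-small), no chunk overlap, no metadata overhead, and 1 GB = 10^9 bytes. Function names are hypothetical.

```python
def index_size_gb(corpus_tokens: int, chunk_size: int, dimensions: int,
                  bytes_per_value: int = 4) -> float:
    """Approximate raw vector storage in GB (vectors only, no metadata)."""
    num_vectors = -(-corpus_tokens // chunk_size)  # ceiling division
    return num_vectors * dimensions * bytes_per_value / 1e9


def storage_cost_usd_per_month(size_gb: float, price_per_gb_month: float = 0.33) -> float:
    """Monthly storage cost at a flat $/GB-month price."""
    return size_gb * price_per_gb_month


# One-million-token corpus split into 500-token chunks (assumed settings).
size = index_size_gb(corpus_tokens=1_000_000, chunk_size=500, dimensions=1536)
print(f"Index size: {size:.4f} GB")
print(f"Storage: ${storage_cost_usd_per_month(size):.4f} per month")
```

Real indexes also store metadata and the chunk text, so actual usage will be higher than this vectors-only figure.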

Question 9

Are these prices live?

Accepted Answer

No. Every per-token and per-GB price is hand-verified against the vendor's pricing page and stored with its source URL in the calculator's code. The last full verification was on 2026-06-09. Because provider pricing changes, every price, including the USD→LKR rate and storage cost, is editable so you can keep the estimate current.

Question 10

What's out of scope for this calculator?

Accepted Answer

Application hosting and bandwidth (deployment-specific), reranker models like Cohere Rerank, self-hosted GPU embedding inference, latency or throughput, and semantic-caching savings. Caching in particular can cut real bills substantially but is workload-specific, so the tool leaves it out rather than overstating confidence. Use the self-hosting cost calculator for on-GPU pipelines.

RAG Cost Calculator
