Real RAG with Cloudflare Vectorize + AI Gateway in 80 lines
Most "build RAG" tutorials in 2026 walk you through Pinecone + OpenAI embeddings + LangChain. Pinecone is paid from day 1. OpenAI ada-002 embeddings cost ~$0.02 per million tokens. LangChain adds ~600 KB to your bundle.
The Cloudflare equivalent – Vectorize + Workers AI bge-m3 + AI Gateway – is free up to 30M stored vectors and 50M queries/month, runs at the edge with no cold starts, and ships in ~80 lines of code total. This post is the wiring.
What you need
- A Cloudflare account (free)
- A Workers/Pages project (already deployed, see our previous post)
- Wrangler CLI
Create the Vectorize index
wrangler vectorize create my-rag --dimensions=1024 --metric=cosine
1024 dims because that's what bge-m3 emits. Cosine metric because that's what bge-m3 was trained on. If you switch embedding models later, the dimensions have to match – they're fixed at index creation, so you have to recreate the index.
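There's no in-place resize. If you ever do swap models, the move is delete-and-recreate, then re-embed and re-upsert everything – a sketch (the 768 here is a placeholder for whatever your new model emits):

wrangler vectorize delete my-rag
wrangler vectorize create my-rag --dimensions=768 --metric=cosine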
Bind it
Add to wrangler.toml:
[[vectorize]]
binding = "VECTORIZE"
index_name = "my-rag"
Now your Worker has env.VECTORIZE available.
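The snippets below also assume a Workers AI binding. If your project doesn't have one yet, it's two more lines in the same wrangler.toml:

[ai]
binding = "AI"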
Embedding via Workers AI
const out = await env.AI.run('@cf/baai/bge-m3', { text: ['your query here'] })
const vec = out.data[0] // 1024-dimensional embedding vector
The Workers AI free tier covers a lot of bge-m3 calls – typically 10k+ per day at no cost. If you ever cross the free tier, it's $0.011 per 1M input tokens. Compare OpenAI ada-002 at $0.10 per 1M tokens.
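The text field takes an array, so bulk ingestion can embed a batch of documents in one call instead of looping – a sketch:

const docs = ['first document text', 'second document text']
const out = await env.AI.run('@cf/baai/bge-m3', { text: docs })
// out.data[i] is the embedding for docs[i]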
Upsert to the index
await env.VECTORIZE.upsert([{
  id: 'doc-001',
  values: Array.from(vec), // Vectorize wants plain arrays
  metadata: { slug: 'doc-001', title: 'Whatever you want to look up later' },
}])
You can include arbitrary metadata. Vectorize returns it on query, so use it to avoid a second roundtrip to KV/D1 for "what was the actual content of this doc".
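Putting the embed and upsert steps together, a sketch of a bulk ingest helper (the docs array and its slug/title/text fields are placeholders for your own shape):

async function ingest(env, docs) {
  // docs: [{ slug, title, text }, ...]
  const out = await env.AI.run('@cf/baai/bge-m3', { text: docs.map(d => d.text) })
  await env.VECTORIZE.upsert(docs.map((d, i) => ({
    id: d.slug,
    values: Array.from(out.data[i]),
    metadata: { slug: d.slug, title: d.title },
  })))
}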
Query
const matches = await env.VECTORIZE.query(Array.from(qVec), {
  topK: 5,
  returnMetadata: 'all',
})
// matches.matches = [{ id, score, metadata }, ...]
Vectorize writes are asynchronous: a fresh upsert typically becomes queryable within a few seconds. Same caveat as KV – don't depend on "I just upserted, query immediately" returning the new vector everywhere.
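One option the snippet above skips: Vectorize can filter on metadata at query time, provided you first create a metadata index for the property. A sketch, assuming the slug property from our upsert example (worth checking against current docs):

wrangler vectorize create-metadata-index my-rag --property-name=slug --type=string

const matches = await env.VECTORIZE.query(Array.from(qVec), {
  topK: 5,
  returnMetadata: 'all',
  filter: { slug: 'doc-001' },
})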
The full RAG pattern
async function rag(env, query, kFromIndex = 3) {
  // 1. Embed the query
  const out = await env.AI.run('@cf/baai/bge-m3', { text: [query] })
  const qVec = out.data[0]

  // 2. Retrieve similar past documents
  const matches = await env.VECTORIZE.query(Array.from(qVec), {
    topK: kFromIndex,
    returnMetadata: 'all',
  })
  const context = matches.matches
    .filter(m => m.score >= 0.55) // drop weak matches
    .map(m => `[${m.metadata.slug}] ${m.metadata.title}`)
    .join('\n')

  // 3. Generate with LLM (Llama-3.3-70B on Workers AI, free up to N calls/day)
  const llm = await env.AI.run('@cf/meta/llama-3.3-70b-instruct-fp8-fast', {
    messages: [
      { role: 'system', content: 'Use the prior context as inspiration. Author a fresh response.' },
      { role: 'user', content: `Prior context:\n${context}\n\nUser query: ${query}` },
    ],
  })

  // 4. Upsert this query+response back into the index for future hits
  await env.VECTORIZE.upsert([{
    id: crypto.randomUUID(),
    values: Array.from(qVec),
    metadata: { slug: query.slice(0, 50), title: llm.response.slice(0, 200) },
  }])

  return llm.response
}
~25 lines of actual logic. The embedding model, the index, and the LLM are all bound to the same Worker via env. External API calls: zero – everything runs on Cloudflare.
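To expose it, a minimal handler sketch (the request shape – a JSON POST body with a query field – is an assumption, not part of the pattern):

export default {
  async fetch(request, env) {
    const { query } = await request.json() // assumes { "query": "..." }
    const answer = await rag(env, query)
    return Response.json({ answer })
  },
}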
AI Gateway in front of it
Cloudflare AI Gateway is a free observability + caching layer. Create a gateway
in the dashboard (AI → AI Gateway), then configure your env.AI.run
calls to route through it:
await env.AI.run('@cf/meta/llama-3.3-70b-instruct-fp8-fast', {
  messages,
}, {
  gateway: { id: 'my-gateway' },
})
Set cache_ttl=86400 in the gateway settings. Now identical messages arrays dedupe at the gateway – saving the LLM call entirely. Combined with our own KV-based response cache, repeat queries hit in ~5ms.
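If you'd rather control caching per call than gateway-wide, a sketch – assuming the binding's gateway options accept cacheTtl and skipCache fields (check current docs before relying on this):

await env.AI.run('@cf/meta/llama-3.3-70b-instruct-fp8-fast', {
  messages,
}, {
  gateway: { id: 'my-gateway', cacheTtl: 86400 }, // seconds; skipCache: true bypasses
})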
Cost ceiling
Free tier ceilings, May 2026:
- Workers: 100k requests/day
- Workers AI bge-m3: typically 10k+ embed calls/day
- Workers AI Llama-70B: limited but generous (a few k/day)
- Vectorize: 30M stored vectors, 50M queries/month
- AI Gateway: free, no usage cap on the cache layer itself
For a typical MCP service with ~100 paying users, you'll never approach any of these. Migrating off later (if you ever need to) is a thin-abstraction job – we wrap Vectorize behind a 30-line _vector.ts in the buyer guide, so swapping to Upstash Vector / pgvector is a one-file change.
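For a sense of the shape such a wrapper takes, an illustrative sketch (this is not the guide's actual file; the interface and names are made up for this post):

// _vector.ts – illustrative only
export interface VectorStore {
  upsert(items: { id: string; values: number[]; metadata?: Record<string, string> }[]): Promise<void>
  query(values: number[], topK: number): Promise<{ id: string; score: number; metadata?: unknown }[]>
}

// Vectorize-backed implementation; migrating means reimplementing these two methods
export function vectorizeStore(index: VectorizeIndex): VectorStore {
  return {
    async upsert(items) {
      await index.upsert(items)
    },
    async query(values, topK) {
      const res = await index.query(values, { topK, returnMetadata: 'all' })
      return res.matches
    },
  }
}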
Get every line of this – wired into a working MCP server template
The full guide on Gumroad ships the RAG layer above pre-wired into the Worker template, plus the AI Gateway integration, plus SSE streaming for live progress events. $29.
Build Your Own MCP Server – $29 →