Real RAG with Cloudflare Vectorize + AI Gateway in 80 lines

Most "build RAG" tutorials in 2026 walk you through Pinecone + OpenAI embeddings + LangChain. Pinecone is paid from day 1. OpenAI ada-002 embeddings cost ~$0.02 per million tokens. LangChain adds ~600 KB to your bundle.

The Cloudflare equivalent (Vectorize + Workers AI bge-m3 + AI Gateway) is free up to 30M stored vectors and 50M queries/month, runs at the edge with no cold-start penalty, and ships in ~80 lines of code total. This post is the wiring.

What you need

A Cloudflare account, the wrangler CLI, and a Worker with two bindings: Workers AI (env.AI) and Vectorize (env.VECTORIZE). Both get wired up below.

Create the Vectorize index

wrangler vectorize create my-rag --dimensions=1024 --metric=cosine

1024 dims because that's what bge-m3 emits, cosine because that's what bge-m3 was trained on. If you switch embedding models later, the dimension count has to match, and an index's dimensions are fixed at creation: recreate the index.
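
To double-check what an index was created with, wrangler can describe it:

wrangler vectorize get my-rag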

Bind it

Add to wrangler.toml:

[[vectorize]]
binding    = "VECTORIZE"
index_name = "my-rag"

Now your Worker has env.VECTORIZE available.
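
The embedding and LLM calls below also use env.AI, which needs the Workers AI binding in the same wrangler.toml:

[ai]
binding = "AI"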

Embedding via Workers AI

const out = await env.AI.run('@cf/baai/bge-m3', { text: ['your query here'] })
const vec = out.data[0]   // the 1024-dim embedding for the first input
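
The text field takes an array (the example above already passes one), so batching is just a longer array; embeddings come back in input order:

const batch = await env.AI.run('@cf/baai/bge-m3', { text: ['first doc', 'second doc'] })
// batch.data[0] embeds 'first doc', batch.data[1] embeds 'second doc'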

Workers AI's free tier covers a lot of bge-m3 calls, typically 10k+ per day at no cost. If you ever cross the free tier, it's $0.011 per 1M input tokens. Compare to OpenAI ada-002 at $0.10 per 1M tokens.

Upsert to the index

await env.VECTORIZE.upsert([{
  id:       'doc-001',
  values:   Array.from(vec),       // Vectorize wants plain arrays
  metadata: { slug: 'doc-001', title: 'Whatever you want to look up later' },
}])

You can include arbitrary metadata. Vectorize returns it on query, so use it to avoid a second roundtrip to KV/D1 for "what was the actual content of this doc".
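
Seeding the index is those same two calls batched. A sketch of a hypothetical ingest helper, assuming docs shaped like { slug, title, text }:

async function ingest(env, docs) {
  // one Workers AI call to embed every text, one Vectorize call to store them
  const out = await env.AI.run('@cf/baai/bge-m3', { text: docs.map(d => d.text) })
  await env.VECTORIZE.upsert(docs.map((d, i) => ({
    id:       d.slug,
    values:   Array.from(out.data[i]),
    metadata: { slug: d.slug, title: d.title },
  })))
  // for a large corpus, split docs into smaller batches of these two calls
}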

Query

const matches = await env.VECTORIZE.query(Array.from(qVec), {
  topK: 5, returnMetadata: true,
})
// matches.matches = [{ id, score, metadata }, ...]
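
query also takes a metadata filter when you want to scope results. A hedged sketch, assuming you filter on the slug property (current Vectorize indexes want a metadata index created for the property first):

wrangler vectorize create-metadata-index my-rag --property-name=slug --type=string

const scoped = await env.VECTORIZE.query(Array.from(qVec), {
  topK: 5,
  returnMetadata: true,
  filter: { slug: 'doc-001' },   // exact match on indexed metadata
})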

Vectorize writes become visible to queries within a few seconds, not instantly. Same caveat as KV: don't depend on an upsert being queryable immediately, everywhere.
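
If a flow genuinely needs to read its own write, a short poll works. A minimal sketch (waitForUpsert is hypothetical, not part of the Vectorize API); the vector you just wrote should be its own nearest neighbor, so query with it and wait for the id to surface:

async function waitForUpsert(env, id, vec, tries = 10) {
  for (let i = 0; i < tries; i++) {
    const res = await env.VECTORIZE.query(Array.from(vec), { topK: 1 })
    if (res.matches.some(m => m.id === id)) return true
    await new Promise(r => setTimeout(r, 500))   // back off half a second
  }
  return false   // still not visible; treat as eventually consistent
}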

The full RAG pattern

async function rag(env, query, kFromIndex = 3) {
  // 1. Embed the query
  const out = await env.AI.run('@cf/baai/bge-m3', { text: [query] })
  const qVec = out.data[0]

  // 2. Retrieve similar past documents
  const matches = await env.VECTORIZE.query(Array.from(qVec), {
    topK: kFromIndex, returnMetadata: true,
  })
  const context = matches.matches
    .filter(m => m.score >= 0.55)
    .map(m => `[${m.metadata.slug}] ${m.metadata.title}`)
    .join('\n')

  // 3. Generate with LLM (Llama-3.3-70B on Workers AI, free up to N calls/day)
  const llm = await env.AI.run('@cf/meta/llama-3.3-70b-instruct-fp8-fast', {
    messages: [
      { role: 'system', content: 'Use the prior context as inspiration. Author a fresh response.' },
      { role: 'user',   content: `Prior context:\n${context}\n\nUser query: ${query}` },
    ],
  })

  // 4. Upsert this query+response back into the index for future hits
  await env.VECTORIZE.upsert([{
    id:       crypto.randomUUID(),
    values:   Array.from(qVec),
    metadata: { slug: query.slice(0, 50), title: llm.response.slice(0, 200) },
  }])

  return llm.response
}

~25 lines of actual logic. The embedding model, the index, and the LLM are all bound to the same Worker via env. External calls: zero; everything runs on Cloudflare.
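
Wiring it into a Worker is one handler. A minimal sketch, assuming a GET endpoint with a ?q= parameter:

export default {
  async fetch(request, env) {
    const q = new URL(request.url).searchParams.get('q')
    if (!q) return new Response('missing ?q=', { status: 400 })
    return new Response(await rag(env, q), {
      headers: { 'content-type': 'text/plain; charset=utf-8' },
    })
  },
}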

AI Gateway in front of it

Cloudflare AI Gateway is a free observability + caching layer. Create a gateway in the dashboard (AI → AI Gateway), then configure your env.AI.run calls to route through it:

await env.AI.run('@cf/meta/llama-3.3-70b-instruct-fp8-fast', {
  messages,
}, {
  gateway: { id: 'my-gateway' }
})

Set cache_ttl=86400 in the gateway settings. Now identical messages arrays dedupe at the gateway, skipping the LLM call entirely. Combined with our own KV-based response cache, repeat queries hit in ~5ms.
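
If you'd rather control caching per call than globally, the binding's gateway options also accept cacheTtl and skipCache:

await env.AI.run('@cf/meta/llama-3.3-70b-instruct-fp8-fast', { messages }, {
  gateway: {
    id:        'my-gateway',
    cacheTtl:  86400,   // cache identical requests for 24h
    skipCache: false,   // flip to true to bypass the cache for one call
  },
})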

Cost ceiling

Free tier ceilings, May 2026:

Vectorize: 30M stored vectors, 50M queries/month
Workers AI: ~10k bge-m3 calls/day, then $0.011 per 1M input tokens
AI Gateway: free (observability + caching)

For a typical MCP service with ~100 paying users, you'll never approach any of these. Migrating off later (if you ever need to) is a thin-abstraction job: the buyer guide wraps Vectorize behind a 30-line _vector.ts, so swapping to Upstash Vector / pgvector is a one-file change.
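
Not the guide's actual file, but the shape of that wrapper is roughly this (hypothetical sketch):

// _vector.ts, sketched: pin the two operations you use behind one object
export function makeVectorStore(index) {          // index = env.VECTORIZE
  return {
    upsert: (items) => index.upsert(items),
    query:  async (values, topK) => {
      const res = await index.query(values, { topK, returnMetadata: true })
      return res.matches                          // [{ id, score, metadata }]
    },
  }
}
// swapping backends = reimplementing these two methods with the same shapes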

Get every line of this, wired into a working MCP server template

The full guide on Gumroad ships the RAG layer above pre-wired into the Worker template, plus the AI Gateway integration, plus SSE streaming for live progress events. $29.

Build Your Own MCP Server ($29) →