Cloudflare AI Explained

Serverless AI inference, agents, and RAG — running on the network that powers 20% of the internet

89+ Models Available · 200+ GPU Cities · $0.011 / 1K Neurons · 10K Free Neurons/Day

AI inference at the edge. No GPU clusters required.

Cloudflare AI is a suite of products within Cloudflare's Developer Platform that lets you run open-source AI models — LLMs, image generators, embedding models, speech-to-text, and more — on serverless GPUs distributed across Cloudflare's global network. You write a few lines of code, call an API, and Cloudflare handles provisioning, scaling, and routing requests to the nearest available GPU.

Think of it like this: the same company that already sits between your users and your servers for CDN, DNS, and DDoS protection has strapped NVIDIA H100 GPUs to that network and is selling inference by the request. Your AI workload runs physically close to your users, which means lower latency and fewer hops. No ML engineering team required. No GPU reservation contracts. No idle hardware bills.

The Problem

GPU Clusters Are Expensive and Idle

Average GPU utilisation sits between 20% and 40% for most organisations. You reserve capacity, pay for it, and watch it sit idle between spikes. Scaling across regions means duplicating infrastructure.

The Solution

Serverless Inference at the Edge

Workers AI runs models on Cloudflare's network. You pay per request via "Neurons" — a unit representing actual GPU compute consumed. No provisioning, no idle cost, no region selection.

The Result

Single-Platform AI Development

Developers call a single API, get low-latency inference in 200+ cities, and combine it with Vectorize, AI Gateway, and R2 for full-stack AI applications — all within one platform.

💻 Your Code (Worker, Page, or REST) → 🛡️ AI Gateway (cache, rate limit, log) → 🧠 Workers AI (GPU inference at edge) → 📤 Response (text, image, audio)

Edge inference at scale. Why it's different.

Most AI inference today runs in a handful of centralised data centres. Cloudflare's approach is to run inference at the edge — on the same global network that already handles CDN, DNS, and security traffic. When every interaction with an AI model adds network latency on top of compute latency, physical proximity to users reduces total response time.

330+ cities in Cloudflare's network
89+ open-source models available
82% less CPU overhead vs vLLM (Infire engine)
77% cost reduction on internal agent workloads

Why not just use OpenAI / Anthropic / Google APIs directly?

AI Gateway proxies requests to third-party providers (OpenAI, Anthropic, etc.) as well as Workers AI, adding caching, rate limiting, and fallback across any model. This means you can route open-source models for cost-sensitive workloads and proprietary APIs for others through a single control plane. Cloudflare has reported that switching an internal agent workload from a mid-tier proprietary model to Kimi K2.5 on Workers AI reduced costs by 77%.

Six products. One AI stack.

Cloudflare's AI platform isn't a single product — it's a set of integrated building blocks. Each one does a specific job, and they're designed to work together without glue code.

Product 01

Workers AI

The inference engine. Call 89+ open-source models (LLMs, image gen, embeddings, TTS, ASR) via a binding in your Worker or a REST API from anywhere. Models run on NVIDIA H100 GPUs across Cloudflare's network. Pricing is per-Neuron (GPU compute unit), with 10,000 free per day.

// Call Llama from a Worker
const response = await env.AI.run(
  "@cf/meta/llama-4-scout-17b-16e-instruct",
  { messages: [{ role: "user", content: prompt }] }
);
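The same model is also reachable over plain HTTPS from outside a Worker, via the Workers AI REST endpoint. A minimal sketch, assuming placeholder credentials (`accountId`, `apiToken` are yours to supply):

```typescript
// Builds the Workers AI REST endpoint for a given model.
const workersAiUrl = (accountId: string, model: string): string =>
  `https://api.cloudflare.com/client/v4/accounts/${accountId}/ai/run/${model}`;

// Call Llama from any environment with fetch (Node 18+, Bun, Deno).
async function askLlama(accountId: string, apiToken: string, prompt: string) {
  const res = await fetch(
    workersAiUrl(accountId, "@cf/meta/llama-4-scout-17b-16e-instruct"),
    {
      method: "POST",
      headers: { Authorization: `Bearer ${apiToken}` },
      body: JSON.stringify({ messages: [{ role: "user", content: prompt }] }),
    }
  );
  return res.json();
}
```

The binding form (`env.AI.run`) avoids the round trip to the REST API and is the idiomatic choice inside a Worker; the REST form exists for everything else.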

Product 02

AI Gateway

A proxy layer that sits in front of any AI provider — Workers AI, OpenAI, Anthropic, etc. One line of config gives you request logging, analytics, caching (including semantic caching), rate limiting, model fallback, and cost tracking. Works with the OpenAI SDK. Free on all plans.
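Because the gateway speaks the OpenAI wire format, pointing the OpenAI SDK at it is mostly a base-URL change. A sketch, assuming a gateway named `my-gateway` (the account ID and gateway name are placeholders):

```typescript
// Builds the AI Gateway base URL for the OpenAI-compatible provider route.
const gatewayBaseUrl = (accountId: string, gatewayId: string): string =>
  `https://gateway.ai.cloudflare.com/v1/${accountId}/${gatewayId}/openai`;

// Sketch (assumes the `openai` npm package is installed):
//
//   import OpenAI from "openai";
//   const client = new OpenAI({
//     apiKey: process.env.OPENAI_API_KEY,
//     baseURL: gatewayBaseUrl("ACCOUNT_ID", "my-gateway"),
//   });
//
// Requests now flow through the gateway, picking up caching, logging,
// rate limiting, and fallback without further code changes.
```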

Product 03

Vectorize

A globally distributed vector database for storing embeddings. Supports metadata filtering (string, number, boolean), integrates natively with Workers AI embedding models, and enables RAG, semantic search, and recommendation workflows. Data stays close to users.
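A typical embed-then-query flow inside a Worker might look like the sketch below. The binding names (`AI`, `VECTOR_INDEX`), the embedding model, and the `buildQuery` helper are illustrative, not prescribed by the API:

```typescript
// Shape of the query options we pass to a Vectorize index (illustrative subset).
interface QueryOpts {
  topK: number;
  returnMetadata: "all" | "none";
  filter?: Record<string, string | number | boolean>;
}

// Hypothetical helper: builds a query, optionally scoped to one tenant
// via a metadata filter (string/number/boolean filtering is supported).
const buildQuery = (topK: number, tenant?: string): QueryOpts => ({
  topK,
  returnMetadata: "all",
  ...(tenant ? { filter: { tenant } } : {}),
});

// Inside a Worker handler (not runnable outside Workers):
//
//   const { data } = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
//     text: [question],
//   });
//   const matches = await env.VECTOR_INDEX.query(data[0], buildQuery(5, "acme"));
```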

Product 04

AI Search

Formerly called AutoRAG. A fully managed RAG pipeline — connect an R2 bucket or website, and Cloudflare handles chunking, embedding, indexing, and querying automatically. Supports continuous re-crawling, multitenancy via folder-based filters, and streaming responses. Currently in open beta.

Product 05

Agents SDK

A TypeScript SDK for building persistent, stateful AI agents on Durable Objects. Agents maintain state across long sessions and support WebSocket hibernation (zero cost when idle), MCP client and server connections, scheduled tasks, and "Code Mode", in which the LLM writes code against a typed SDK instead of making individual tool calls, reducing token usage by up to 87.5%.
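A minimal stateful agent could be sketched as follows. The `CounterAgent` class, its state shape, and the `increment` reducer are illustrative, assuming the `agents` package's `Agent` base class:

```typescript
// Illustrative state for a toy counting agent.
type CounterState = { count: number };

// Pure state transition the agent applies on each request.
const increment = (s: CounterState): CounterState => ({ count: s.count + 1 });

// Sketch (runs on a Durable Object; not runnable outside Workers):
//
//   import { Agent } from "agents";
//
//   export class CounterAgent extends Agent<unknown, CounterState> {
//     initialState: CounterState = { count: 0 };
//     async onRequest(_req: Request) {
//       this.setState(increment(this.state));   // state survives restarts
//       return Response.json(this.state);
//     }
//   }
```

Because the state lives in a Durable Object, the count persists across requests and across hibernation, which is what distinguishes an agent from a stateless Worker.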

Engine

Infire

Cloudflare's custom LLM inference engine, written in Rust. Replaces Python-based stacks like vLLM. Uses granular CUDA graphs, JIT kernel compilation, and paged KV caching. Benchmarks show 7% faster inference than vLLM 0.10.0 on H100s, with 82% lower CPU overhead. Powers all Workers AI model serving.

Where Cloudflare AI fits alongside everything else.

Cloudflare's AI products don't exist in isolation. They plug into the broader Developer Platform — Workers for compute, R2 for storage, D1 for SQL, Durable Objects for state, Pages for frontends. The AI layer is designed to be consumed alongside these, not as a standalone service.

Compute

Workers + Durable Objects

Serverless functions (V8 isolates) and persistent stateful objects. The execution environment for all AI application logic. Durable Objects give agents identity, state, and long-lived connections.

Storage

R2 + D1 + KV

R2 provides S3-compatible object storage with zero egress fees — ideal for training data, model weights, and document stores. D1 is serverless SQL. KV is global key-value storage.

Protocol

MCP Servers

Cloudflare provides managed remote MCP servers for its own API (2,500+ endpoints via just two tools) and partners with Anthropic, Stripe, Asana, and others to host their MCP servers on the platform.

When to use what

Need | Use This | Why
Run an LLM or image model | Workers AI | Serverless, pay-per-request, 89+ open-source models at the edge
Monitor & control AI costs | AI Gateway | Caching, fallback, logging across any provider, including third-party APIs
Build semantic search / RAG | Vectorize + AI Search | Vector DB for custom RAG; AI Search for zero-config managed RAG pipelines
Build a persistent AI agent | Agents SDK | Stateful execution on Durable Objects with MCP, scheduling, and Code Mode
Store documents / training data | R2 | Zero egress fees, S3-compatible, feeds directly into AI Search pipelines

What people actually build with it.

Cloudflare's AI stack is broad enough to support most common AI application patterns. Here's where teams are getting value.

Support

AI-Powered Customer Support

RAG pipelines that ground LLM responses in your knowledge base. AI Search handles ingestion and indexing; Workers AI generates responses. The agent runs on Durable Objects for session continuity.

Search

Semantic Search & Recommendations

Generate embeddings with Workers AI, store them in Vectorize, and query by meaning rather than keywords. Works for product search, content discovery, and anomaly detection.

Content

Image Generation & Media Processing

FLUX.2, Stable Diffusion, and Leonardo.Ai models for text-to-image. Deepgram and Whisper for speech-to-text. MeloTTS for voice synthesis. All serverless, all at the edge.

Agentic

Autonomous AI Agents

Long-running agents built with the Agents SDK that plan, reason, and act. MCP integration for tool use. Code Mode for token-efficient multi-tool orchestration. WebSocket hibernation for cost control.

Security

AI-Enhanced App Security

Cloudflare uses Workers AI internally for DLP false-positive reduction, email threat analysis, and automated code review. Their "Bonk" agent reviews PRs using Kimi K2.5, processing over 7B tokens daily.

Developer Tools

AI-Powered DevEx

MCP servers that expose your entire API to AI agents. Markdown for Agents auto-converts HTML to markdown. Cloudflare's own MCP server covers 2,500+ API endpoints with just two tools and ~1,000 tokens.

From CDN to AI platform. A fast pivot.

Cloudflare's AI journey has been remarkably compressed. In under three years, they've gone from announcing GPU availability to offering a full-stack AI developer platform with custom inference engines and frontier model support.

SEP 2023

Workers AI Launch

Initial launch with a small set of open-source models. GPU deployment begins in 100+ cities. Partnership with Hugging Face announced. Vectorize vector database launches alongside.

APR 2024

Workers AI Goes GA

General availability with Neuron-based pricing ($0.011/1K). Support for LoRA fine-tuned models and one-click Hugging Face deploys. Python support added to Workers. AI Playground launched.

SEP 2024

GPU Upgrade & Larger Models

H100 NVL GPUs deployed. Support for Llama 3.1 70B and Llama 3.2 family. GPUs in 180+ cities. Vectorize upgraded with metadata filtering. Infire inference engine revealed.

MAR 2025

Agents SDK & MCP

Agents SDK launches for building persistent, stateful AI agents on Durable Objects. Remote MCP server support added. Partnerships with Anthropic, Stripe, Asana, and others for hosted MCP servers.

SEP 2025

AI Week — Infire Deep Dive

Full technical reveal of Infire (Rust-based LLM engine). MCP Server Portals for enterprise security. AI Search (AutoRAG) enters open beta. NLWeb integration for conversational search. Firewall for AI announced.

DEC 2025

Replicate Acquisition

Cloudflare acquires Replicate, the popular model hosting platform. Strengthens the inference platform's model catalogue and developer reach.

MAR 2026

Frontier Models & Code Mode

Kimi K2.5 (256K context, multi-turn tool calling) launches on Workers AI — the first frontier-scale open-source model on the platform. Code Mode reduces MCP token usage by 87.5%. Unified Cloudflare MCP server covers 2,500+ endpoints via two tools. NVIDIA Nemotron 3 and OpenAI GPT-OSS models added.

Is Cloudflare AI right for your project?

Cloudflare's AI platform fits certain use cases well and is less suited to others. Here's a factual breakdown.

✓ Use Cloudflare AI when

You want serverless, pay-per-use inference with no GPU reservation — particularly for variable or spiky workloads.

Latency matters and your users are geographically distributed. Edge inference is available in 200+ cities globally.

You're already on Cloudflare for Workers, Pages, R2, or DNS. The AI products integrate natively with the existing developer platform.

You want to run open-source models (Llama, FLUX, Whisper, etc.) without managing GPU infrastructure, model optimisation, or scaling.

✗ Skip Cloudflare AI when

You need proprietary frontier models (GPT-4o, Claude, Gemini) as your primary models — Workers AI only runs open-source models. You can still proxy them via AI Gateway, but you're not saving on compute.

You need fine-tuning or custom model training. Cloudflare supports LoRAs but doesn't offer training infrastructure. Dedicated GPU providers or hyperscalers are better here.

You need guaranteed GPU capacity for sustained high-throughput workloads. Serverless means shared resources and potential cold starts for infrequently used models.

You need models Cloudflare doesn't carry. The catalogue is curated, not open-ended. Custom model hosting requires enterprise plans.

What it costs. What you get.

Neuron-based billing

Cloudflare measures AI compute in "Neurons" — a unit representing the GPU resources consumed by a request. Pricing is presented per-model in familiar token-based units, but the backend billing converts everything to Neurons. All limits reset daily at 00:00 UTC.

Free Tier

10,000 Neurons / Day

Available to all Cloudflare accounts, including the Workers Free plan. Access to the full model catalogue. Hard rate limit at 10,000 Neurons — requests fail after that. No credit card required.

Workers Paid

$0.011 / 1,000 Neurons

Requires the Workers Paid plan ($5/month base). Same 10,000 free Neurons per day, then pay-as-you-go above that. Per-model token pricing varies — for example, Llama 3.1 8B runs at $0.045/M input tokens, $0.384/M output tokens. Llama 3.3 70B is $0.293/M input, $2.253/M output.
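The arithmetic above can be sketched as two small helpers. The rates are the ones quoted on this page and may change; check Cloudflare's pricing docs for current values:

```typescript
// Published rates used below (from this page, subject to change).
const NEURON_RATE = 0.011 / 1_000;    // $ per Neuron on Workers Paid
const FREE_NEURONS_PER_DAY = 10_000;  // daily free allowance, resets 00:00 UTC

// Daily Workers AI bill: only Neurons above the free allowance are charged.
const dailyNeuronCost = (neuronsUsed: number): number =>
  Math.max(0, neuronsUsed - FREE_NEURONS_PER_DAY) * NEURON_RATE;

// Per-request token cost for a model priced per million tokens,
// e.g. Llama 3.1 8B at $0.045/M input and $0.384/M output.
const tokenCost = (
  inTokens: number,
  outTokens: number,
  inRatePerM: number,
  outRatePerM: number
): number => (inTokens / 1e6) * inRatePerM + (outTokens / 1e6) * outRatePerM;
```

For example, a day that consumes 25,000 Neurons bills only the 15,000 above the free tier, roughly $0.165 at $0.011 per 1K.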

AI Gateway

Free on All Plans

Core features (caching, rate limiting, analytics, model fallback) are free. Workers Free tier includes 100K gateway logs/month. Workers Paid includes 1M logs/month. Logpush (streaming to external storage) is paid only. No per-request gateway fee.

Product | Free Tier | Paid Pricing
Workers AI | 10K Neurons/day | $0.011 / 1K Neurons above free
AI Gateway | 100K logs/month | 1M logs on Workers Paid; Logpush extra
Vectorize | Included in Workers plans | Usage-based (request units + storage)
AI Search | Open beta (free during beta) | TBD, currently free
Agents SDK | Durable Objects free tier | Durable Objects usage-based pricing

Documentation & references.

Official

Cloudflare AI Resources

Workers AI Docs — Full developer documentation ↗
AI Gateway Docs — Observability and control layer ↗
Vectorize Docs — Vector database documentation ↗
AI Search Docs — Managed RAG pipelines ↗
Agents SDK Docs — Build persistent AI agents ↗
Agents SDK GitHub — SDK source and examples ↗
Cloudflare AI Cloud — Product overview and architecture ↗

Community

Developer Channels

Cloudflare Developer Discord — #workers-ai and #vectorize channels ↗
Cloudflare Community Forum — Support and discussion ↗
Model Catalogue — Browse all 89+ available models ↗
Pricing Reference — Per-model Neuron and token costs ↗

Sources & References

Cloudflare Workers AI Docs · Powering the Agents: Workers AI Large Models (Mar 2026) · Infire Inference Engine Blog · Code Mode: MCP in ~1,000 Tokens · Workers AI Pricing · Agents SDK v0.5.0 (MarkTechPost)

Content validated March 2026. Cloudflare, Workers AI, Vectorize, and AI Gateway are trademarks of Cloudflare, Inc. This is an independent educational explainer by Imbila.AI.