Pay-per-use AI inference built on Cloudflare Workers and Durable Objects. Authenticate via EIP-4361 (SIWE), pay in USDC on Base through the x402 protocol, and start prompting — no signup, no API key.
Most payment-gated APIs bolt a database onto a stateless server — Postgres for balances, Redis for rate-limiting, a separate auth layer for identity. Durable Objects collapse all of that into a single primitive.
Each wallet is routed to its own DO instance via idFromName. The Worker calls typed RPC methods directly on the stub — no HTTP routing inside the DO, just plain function calls with full type safety. On activation, blockConcurrencyWhile() in the constructor loads the wallet state from embedded SQLite once, before any request is served — eliminating per-request persistence overhead. Each instance owns its token balance exclusively — no shared state, no contention, no double-spend. Replay prevention uses a seen_transactions SQL table to atomically check tx hashes and top up the balance, at the edge, with no external infrastructure. The DO alarm handles grace-mode re-verification of provisional deposits when the Base RPC is unreachable, and runs TTL-based cleanup of expired records.
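As a rough illustration of the replay check described above, the sketch below models one wallet's state in memory. The class name WalletLedger and its methods are hypothetical, and a Set stands in for the seen_transactions table. A real Durable Object would presumably run the same check-and-credit step against its embedded SQLite, where each instance handling its own wallet keeps the two steps from interleaving.

```typescript
// Hypothetical in-memory stand-in for a single wallet's Durable Object state.
// Assumption: the real service keeps `seenTransactions` in the seen_transactions
// SQL table and the balance in SQLite; only the logic is sketched here.
class WalletLedger {
  private balance = 0;
  private readonly seenTransactions = new Set<string>();

  /**
   * Credits `tokens` once per transaction hash.
   * Returns true if the deposit was applied, false if the hash was a replay.
   */
  applyDeposit(txHash: string, tokens: number): boolean {
    // Assumption: hashes are compared case-insensitively.
    const key = txHash.toLowerCase();
    if (this.seenTransactions.has(key)) {
      return false; // already credited, ignore the replay
    }
    this.seenTransactions.add(key);
    this.balance += tokens;
    return true;
  }

  getBalance(): number {
    return this.balance;
  }
}
```

Submitting the same hash twice through applyDeposit credits the balance only the first time, which is the property the seen_transactions table is there to guarantee.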
Each wallet's DO instance stores the conversation in an embedded SQLite history table. On every request, the full message array is sent to the model, so follow-up questions and multi-turn reasoning work without the client managing any state. The reply is captured and saved via ctx.waitUntil() after the stream completes, so persistence adds no latency to the response. Each assistant message stores usage metadata (cost, model) so details persist across reloads. History is capped at 20 messages and tied to the wallet, not the browser.
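The 20-message cap could be enforced with a small helper like the one below. This is only a sketch: the ChatMessage shape and the appendCapped name are hypothetical, and it assumes the oldest messages are the ones dropped, which the description above does not state explicitly.

```typescript
// Hypothetical message shape; the real history table may store more columns
// (for example the usage metadata mentioned above).
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

const HISTORY_CAP = 20;

/**
 * Returns a new history array with `next` appended, keeping only the most
 * recent `cap` messages. Assumption: the oldest messages are evicted first.
 */
function appendCapped(
  history: ChatMessage[],
  next: ChatMessage,
  cap: number = HISTORY_CAP
): ChatMessage[] {
  const updated = [...history, next];
  return updated.length > cap ? updated.slice(updated.length - cap) : updated;
}
```

Returning a new array rather than mutating the input keeps the helper easy to test in isolation from the storage layer.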
Each wallet can upload text documents that become part of its personal knowledge base. Documents are chunked (~400 tokens, 50-token overlap), embedded via Workers AI (bge-base-en-v1.5, 768-dim), and stored in a shared Cloudflare Vectorize index with per-wallet metadata filtering. When useRag: true is set on an inference request, the user's prompt is embedded, matched against their documents (top-5 chunks, cosine similarity ≥ 0.45), and the relevant text is injected as a system message. Total input (prompt + history + RAG context) is validated against each model's context window — the server returns 413 if exceeded. RAG context is ephemeral — not stored in history — and regenerated fresh each request. RAG failure is non-fatal: if Vectorize is unreachable, inference proceeds without context. Embedding cost is deducted at document upload; the extra LLM input tokens from retrieved context are captured by the normal billing formula.
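The chunking and retrieval thresholds above can be sketched as two pure functions. The names chunkTokens, selectContext, and VectorMatch are hypothetical, and the input is assumed to be already tokenized (the actual tokenizer is not specified here). The defaults mirror the figures quoted above: 400-token chunks with a 50-token overlap, and the top 5 matches with a score of at least 0.45.

```typescript
/**
 * Splits a token array into windows of `size` tokens where each window
 * starts `size - overlap` tokens after the previous one.
 */
function chunkTokens(tokens: string[], size = 400, overlap = 50): string[][] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new RangeError("require size > 0 and 0 <= overlap < size");
  }
  const chunks: string[][] = [];
  const step = size - overlap;
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + size));
    // Stop once this window has reached the end of the input.
    if (start + size >= tokens.length) break;
  }
  return chunks;
}

// Hypothetical shape of a similarity match returned by the vector index.
interface VectorMatch {
  id: string;
  score: number; // cosine similarity
  text: string;
}

/** Keeps matches at or above `minScore`, best first, limited to `topK`. */
function selectContext(
  matches: VectorMatch[],
  topK = 5,
  minScore = 0.45
): VectorMatch[] {
  return matches
    .filter((m) => m.score >= minScore)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```

With these defaults a 1,000-token document yields three chunks starting at tokens 0, 350, and 700, so consecutive chunks share 50 tokens.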
SIWE (default): The flow above shows standard Sign-In with Ethereum (EIP-4361). The client requests a nonce, signs a message with their wallet, and receives a session cookie for subsequent authenticated requests.
SIWX alternative: x402-compatible clients can skip the nonce/login steps and pass a SIGN-IN-WITH-X header on POST /infer for single-request wallet auth. The 402 response advertises this via a sign-in-with-x extension.
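For the default SIWE flow, the message the client signs follows the plain-text layout defined by EIP-4361. The helper below is a minimal sketch of building that message from a server-issued nonce. The function name, the field values in the usage note, and the choice of which optional fields to include are assumptions, since the service's exact message contents are not given here. Chain ID 8453 is Base mainnet.

```typescript
// Fields of an EIP-4361 message used in this sketch. Optional EIP-4361 fields
// (expiration time, resources, and so on) are omitted for brevity.
interface SiweFields {
  domain: string;    // host requesting the sign-in
  address: string;   // wallet address of the signer
  statement: string; // human-readable statement shown in the wallet
  uri: string;       // URI of the resource being signed in to
  chainId: number;   // e.g. 8453 for Base mainnet
  nonce: string;     // one-time value, e.g. from GET /auth/nonce
  issuedAt: string;  // ISO 8601 timestamp
}

/** Builds the EIP-4361 plain-text message that the wallet will sign. */
function buildSiweMessage(f: SiweFields): string {
  return [
    `${f.domain} wants you to sign in with your Ethereum account:`,
    f.address,
    "",
    f.statement,
    "",
    `URI: ${f.uri}`,
    "Version: 1",
    `Chain ID: ${f.chainId}`,
    `Nonce: ${f.nonce}`,
    `Issued At: ${f.issuedAt}`,
  ].join("\n");
}
```

A client would typically fetch the nonce, build the message, ask the wallet to sign it, and send the message and signature to POST /auth/login. The request body shape for that call is not specified here.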
Public
GET /health liveness probe
GET /payment-info payment address + network details
Auth (SIWE / SIWX)
GET /auth/nonce generate one-time nonce
POST /auth/login verify signature → session cookie (SIWE)
POST /auth/logout clear session cookie
SIGN-IN-WITH-X header for single-request auth (SIWX)
Authenticated (Cookie or SIWX)
POST /infer run inference (post-billed, optional systemPrompt)
POST /deposit top-up balance without inference
GET /balance token balance + usage stats
GET /history conversation messages + meta
DELETE /history clear conversation
RAG Documents (Cookie)
POST /documents upload document for RAG
GET /documents list uploaded documents
DELETE /documents/:id delete document + embeddings
POST /documents/reindex re-upsert all document vectors
Admin (Bearer ADMIN_SECRET)
GET /admin/wallets paginated wallet list
GET /admin/wallets/:wallet/status wallet status + balance
GET /admin/stats aggregate statistics
GET /admin/stale zero-balance inactive wallets
POST /infer with 0 balance → 402 + PAYMENT-REQUIRED header (includes sign-in-with-x extension for x402 clients)
POST /deposit with proof → balance topped up
POST /infer requests deduct from balance automatically
0.001 USDC → 1,000 tokens
Model              Context       Cost
Llama 3.1 8B       7,968 tok     ~8–10 tokens/req
Llama 3.3 70B      24,000 tok    ~9–13 tokens/req
Gemma 3 12B        8,000 tok     ~8–10 tokens/req
Mistral 7B         8,000 tok     ~4–8 tokens/req
DeepSeek R1 32B    80,000 tok    ~10–25 tokens/req
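To make the rate concrete: USDC uses 6 decimal places, so 0.001 USDC is 1,000 atomic units, and the quoted rate of 0.001 USDC → 1,000 tokens works out to one token per atomic unit. The sketch below encodes that arithmetic. The function names are hypothetical, and the actual per-request billing formula is not reproduced here; only the deposit conversion and a simple affordability estimate are shown.

```typescript
const ATOMIC_PER_USDC = 1000000; // USDC has 6 decimals
const TOKENS_PER_USDC = 1000000; // from the quoted rate: 0.001 USDC -> 1,000 tokens

/** Parses a decimal USDC string (up to 6 decimals) into atomic units. */
function usdcToAtomic(amount: string): number {
  if (!/^\d+(\.\d{1,6})?$/.test(amount)) {
    throw new Error("invalid USDC amount: " + amount);
  }
  const [whole, frac = ""] = amount.split(".");
  // Right-pad the fractional part to exactly 6 digits.
  const fracAtomic = Number((frac + "000000").slice(0, 6));
  return Number(whole) * ATOMIC_PER_USDC + fracAtomic;
}

/** Tokens credited for a deposit at the quoted rate. */
function tokensForDeposit(amount: string): number {
  return usdcToAtomic(amount) * (TOKENS_PER_USDC / ATOMIC_PER_USDC);
}

/** Whole requests a balance covers at a given per-request token cost. */
function requestsAffordable(tokens: number, costPerRequest: number): number {
  if (costPerRequest <= 0) throw new RangeError("cost must be positive");
  return Math.floor(tokens / costPerRequest);
}
```

For example, tokensForDeposit("0.001") gives 1,000 tokens, and at roughly 10 tokens per request (the upper end quoted for Llama 3.1 8B) requestsAffordable(1000, 10) gives 100 requests.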
GET /openapi.json OpenAPI 3.1 specification
GET /.well-known/agent.json A2A agent card
GET /.well-known/agents.json Agents.json (multi-step flows)
GET /SKILL.md Agent-readable markdown