Inference, decentralized.
1. TL;DR (a.k.a. Abstract, but Honest)
The AI industry has a curious shape. On one side, OpenAI, Anthropic, AWS Bedrock, and Google Vertex run inference on enormous, expensive datacenter farms and charge accordingly. On the other side, there are millions of underutilized GPUs sitting in gaming PCs, ex-mining rigs, university labs, small datacenters, and the back closets of crypto natives who definitely did not buy these RTX 4090s for just gaming. INFERA connects the two.
That description sounds simple. The hard parts — and the parts most prior projects have either skipped or hand-waved through — are: (1) how do you prove that a GPU operator actually ran the model you asked for, and didn't return cheap garbage from a smaller model? (2) how do you handle latency-sensitive workloads on a permissionless network? And (3) how do you make the developer experience feel less like "deploying a Cosmos validator" and more like "calling an OpenAI endpoint"?
This whitepaper answers all three. We use a hybrid verification model (TEE attestation by default, zkML for high-value jobs, optimistic challenges as a backstop), a reputation-weighted scheduler with hot-pool warm models, and an SDK that ships with an OpenAI-compatible API — yes, the same one your existing Python script already speaks. You change the base URL and an API key. That's the day-one developer pitch.
We are not the first project in this neighborhood. We have enormous respect for Akash, Render, Bittensor, Gensyn, and io.net — they've done a lot of the heavy intellectual lifting. INFERA is specifically focused on the inference layer, with verifiability as a first-class citizen and developer UX as a non-negotiable. If everyone wins, great. If we have to be loud about being faster and cheaper, we will be.
2. The Problem (Or: Why We Bothered)
2.1 Inference is Eating the World, and the Bill Is Coming
Inference — the process of actually running a trained model to produce an output — has quietly become the dominant cost center in modern AI. Training a frontier model is a heroic, one-time(ish) capex event. Inference is a forever-opex line item. Every chatbot reply, every code completion, every image generated, every embedding indexed: that's inference. And it scales linearly with usage, which is to say, it scales with success.
Today, the vast majority of that inference runs through three or four cloud providers. Their margins are healthy. Their latency is good. Their reliability is excellent. Their prices, however, are set by a market with very few sellers, and it shows.
2.2 Meanwhile, in the Real World
There are millions of GPUs in the wild that are not running anything most of the time:
- Gaming rigs: an RTX 4090 used 4 hours a night for Counter-Strike has 20 hours of idle compute per day.
- Ex-mining farms: post-Merge Ethereum miners pivoted, but plenty of facilities have GPU racks looking for purpose.
- Small datacenters and academic clusters: routinely underbooked, especially nights and weekends.
- Web3-native power users with NVIDIA H100s in their garage. Yes, those people exist. We have met them. They are wonderful.
The total addressable supply here is genuinely enormous — credible estimates place idle consumer-grade GPU capacity at multiple exaFLOPS globally. The problem is not supply. The problem is matching, trust, and settlement.
2.3 What Existing Protocols Get Right (and What They Don**'**t)
| Protocol | Strength | Gap for AI Inference |
|---|---|---|
| Akash Network | General compute marketplace, mature bidding system | Optimized for containerized workloads broadly; inference-specific UX (model registry, token-billed APIs, sub-second routing) is not the focus |
| Render Network | Battle-tested at scale, strong creator economy | Built around 3D rendering and offline jobs; latency profile and verification model don't fit inference |
| Bittensor | Brilliant incentive design, real model-serving subnets | Reward distribution is consensus-driven and indirect; pricing is opaque to the average developer |
| Gensyn | Cutting-edge work on verifiable compute | Heavily training-focused; inference latency requirements are different |
| io.net | Excellent supply-side aggregation | Verification of output correctness is largely social/trust-based, not cryptographic |
INFERA is not trying to replace any of these. We are trying to occupy a specific, currently-vacant chair: a permissionless, verifiable, pay-per-token inference layer with a developer experience indistinguishable from a Web2 API. If we do this well, the rest of the stack becomes obvious.
2.4 The Three Hard Problems
Anyone proposing decentralized AI inference has to answer three questions. We will spend most of this paper on these:
- Verifiability: If I send a job to a stranger's GPU, how do I know they actually ran Llama-3.1-70B and didn't just return the output of a tiny 7B model with a confident smile?
- Latency: Real users don't wait 30 seconds for an autocomplete. How do we get sub-second time-to-first-token over a permissionless network?
- Developer Experience: If onboarding requires reading three PhDs' worth of papers, no one will use it. What does "five-minute first call" look like?
3. The Solution, in One Picture and One Page
3.1 The Five Roles in the Network
- Developers (a.k.a. Requesters): Anyone running an app, agent, or product that needs to call an LLM, vision model, embedding model, or other open-source AI. They send jobs and pay in stablecoins.
- Providers (a.k.a. Operators): Anyone with a capable GPU and a stable internet connection. They register, stake $INFR, advertise capabilities, and earn fees.
- Routers: Off-chain coordinators that match jobs to providers in milliseconds, cryptographically commit to assignments on-chain, and stake $INFR against misrouting.
- Validators (a.k.a. Verifiers): Independent nodes that probabilistically re-check outputs, verify TEE attestations, and serve as challengers in the optimistic verification game.
- $INFR Holders / Governors: Whoever holds the token participates in upgrades, parameter changes, and treasury allocation. We will get into this later, with appropriate skepticism about the word "DAO."
3.2 The Lifecycle of a Single Inference Call
In the time it takes you to read this section, the following has happened approximately 1,000 times on the network. (Future-tense disclaimer: assuming we ship.)
- A developer's app calls api.infera.network/v1/chat/completions with an OpenAI-shaped JSON body.
- The request hits a Router, which checks the developer's prepaid balance, signs a job ticket, and dispatches it to the best-matched Provider in the hot pool for the requested model.
- The Provider, running the model inside a TEE (or with zkML proof generation enabled, depending on tier), executes the inference and returns the output along with an attestation signature.
- The Router streams the response back to the developer (sub-second time-to-first-token in the common case).
- The job ticket and attestation are batched and committed on-chain in the next settlement window. Provider gets paid. Router gets a tiny cut. Validators randomly sample 1–5% of jobs for re-verification.
- If a Validator catches a bad output, the Provider's stake gets slashed and the developer is reimbursed. The Provider's reputation score takes a noticeable hit.
That's the whole show. The rest of this paper is how each of those steps actually works, and why it doesn't fall apart under adversarial conditions.
4. System Architecture
4.1 The Layered View
INFERA is built as four loosely-coupled layers. Each layer can be reasoned about, audited, and replaced independently — which is important because a few of these components (especially the verification layer) are still active research areas, and we want to ship without painting ourselves into a corner.
| Layer | Lives On | Responsibility |
|---|---|---|
| Settlement Layer | Base (Ethereum L2) | Stake registry, payments, slashing, governance, dispute resolution |
| Coordination Layer | Off-chain (with on-chain commitments) | Job routing, scheduling, hot-pool management, latency optimization |
| Execution Layer | Provider hardware | Actual model inference inside TEE / with zkML witness generation |
| Application Layer | Developer's stack | OpenAI-compatible SDK, dashboard, billing, observability |
4.2 Why Base?
We deploy the Settlement Layer on Base, Coinbase's Ethereum L2. Reasons, ranked by importance to a working engineer:
- Fees: paying $0.30 in gas to settle a $0.0008 inference call is, mathematically, not the move. Base routinely settles transactions for fractions of a cent.
- Ethereum security: Base is an OP-stack rollup posting to Ethereum L1. We get the security guarantees of mainnet without paying mainnet gas. This matters because slashing is a security-critical operation.
- EVM-compatible: Solidity, Foundry, every audit firm in the world. No exotic VMs, no language risk.
- Distribution: Base has a real, growing population of users with USDC in their wallets. Our customers and providers are already there.
- Bridging: stablecoin liquidity flows freely between Base, Ethereum L1, and other L2s via canonical and third-party bridges.
We are explicitly chain-aware but not chain-maxi. The contracts are designed to be re-deployable on any EVM L2 (Arbitrum, Optimism, Polygon, etc.) and we plan to multi-chain by Phase 2. Long term, the protocol could live anywhere with cheap fast finality. Short term, Base is the right call.
4.3 The Hot Pool: Why Models Don**'**t Cold-Start Per Job
Loading a 70B-parameter LLM into GPU VRAM takes 30–90 seconds. If every inference call had to do that, latency would be unusable. INFERA's coordination layer solves this with a Hot Pool: a continuously-maintained set of Provider nodes that are warm-loaded with specific models, advertise their loadout on-chain, and earn a small idle subsidy for keeping models resident.
Routers prefer Hot Pool providers for any matching request. If demand spikes for a particular model, the Router signals the network, and idle Providers compete to spin up that model and join the pool. If demand drops, providers gracefully exit. The mechanism is a continuous-auction model demand market — providers bid for the right to be in the hot pool for popular models because being in the pool means more job flow.
4.4 The Network Diagram, In Words
Picture a circle. In the center is the Base L2, with INFERA's smart contracts. Around the outside, in three concentric rings:
- Inner ring — Routers and Validators: small number, well-staked, latency-sensitive infrastructure.
- Middle ring — Providers: many, geographically distributed, varying hardware tiers.
- Outer ring — Developers and end users: even more numerous, calling in via SDK or HTTP.
Information and value flow in both directions: jobs and payments going outward, results and proofs coming back. Every edge of every interaction has at least one cryptographic commitment behind it. Which brings us to the part that is, frankly, the most fun.
5. Verifiable Inference: The Hard, Interesting Part
If you remember nothing else from this paper, remember this: a decentralized inference network is only as valuable as its ability to detect cheating. Without verification, every Provider has an economic incentive to run the smallest, cheapest possible model and pretend it was the big expensive one you paid for. Solve that, and everything else follows.
There is no single magic answer. Verification has a cost vs. assurance vs. latency triangle, and you can pick at most two corners. INFERA therefore offers three verification tiers, and developers pick per-call.
5.1 Tier 1 — TEE Attestation (Default)
Modern data-center GPUs (NVIDIA H100, H200, B100) and certain CPU/GPU pairings (AMD SEV-SNP, Intel TDX, NVIDIA Confidential Compute) support Trusted Execution Environments — hardware-enforced enclaves where code runs in isolation and the hardware itself can sign attestations proving:
- Which exact code (hash) was loaded
- Which exact model weights (hash) were loaded
- That the output was produced by that code on that data
The attestation is a signed blob from the chip manufacturer's root of trust. INFERA validates it on-chain via a precompile or attestation oracle, and the developer gets cryptographic assurance that the inference happened as advertised, with negligible runtime overhead (typically <5%).
This is our default tier. Fast, cheap, and good enough for the overwhelming majority of consumer and enterprise AI workloads. The trust assumption is: "NVIDIA's signing keys are not compromised." That's a reasonable assumption for everything short of nation-state adversaries.
5.2 Tier 2 — Optimistic Verification with Fraud Proofs
For Provider hardware that doesn't support TEE (older H100s in non-confidential mode, RTX 4090s, etc.), we use an optimistic model borrowed from rollup design:
- Provider executes the inference and posts the output plus a commitment to the full computation trace.
- Output is delivered to the developer immediately. (Optimistic = fast.)
- During a challenge window (default: 90 seconds for chat, 30 minutes for batch), any Validator can re-execute the inference on the same model and inputs and dispute the result on-chain.
- If a dispute is raised, an interactive bisection protocol identifies the exact differing computation step. The chain adjudicates that single step. The losing party (Provider or challenger) gets slashed.
This is how Arbitrum settles disputes for general computation, and we use the same intellectual lineage. The catch: it requires deterministic execution. We pin model versions, CUDA driver versions, and sampling seeds explicitly per job, which gets us bit-identical reproducibility for the vast majority of workloads.
5.3 Tier 3 — zkML Proofs (Premium)
For the highest-stakes applications — financial inference, medical decision support, regulated workflows — INFERA supports zero-knowledge machine learning proofs. The Provider generates a SNARK proving that running model M on input x yields output y, and the proof can be verified on-chain in milliseconds.
Let us be honest about where this technology is in 2026: zkML for full-size LLM inference is expensive. State-of-the-art proving systems (EZKL, Risc Zero zkVM, Jolt, and our own optimizations) can handle smaller models (<3B parameters) in real-time and larger models (up to 70B) in batch mode at single-digit-cents-per-call cost. We expect this to improve roughly 10x per year through 2028, at which point Tier 3 becomes the default and the world looks quite different.
For now, Tier 3 is offered as a premium option for the workloads where it economically makes sense. Pricing reflects the actual cost of proof generation, currently 5–50x base inference cost depending on model size.
5.4 Validator Sampling: The Glue
Independent of which tier a developer chooses, INFERA Validators randomly sample 1–5% of all jobs (parameter is governed) and re-execute them. The sampling rate scales inversely with the Provider's reputation: a brand-new Provider gets ~10% of jobs sampled; a long-tenured high-reputation Provider gets <0.5%.
Validators themselves stake $INFR and earn rewards for honest checks. If a Validator falsely accuses a Provider, they get slashed. This makes verification a permissionless economic game, not a privileged role.
5.5 Putting It All Together
| Tier | Trust Assumption | Latency Overhead | Cost Premium | Best For |
|---|---|---|---|---|
| 1 — TEE | Hardware vendor not compromised | <5% | 0% (baseline) | Most apps: chatbots, agents, RAG, generation |
| 2 — Optimistic | ≥1 honest Validator exists | 0% to user; settlement delayed | +10% | Cost-sensitive batch workloads |
| 3 — zkML | SNARK soundness (cryptographic) | Significant | +500% to +5000% | Regulated, financial, medical |
6. Smart Contract Architecture
All on-chain logic lives in a small set of audited Solidity contracts on Base. We follow OpenZeppelin patterns where possible, use UUPS proxies for upgradability behind a 7-day timelock, and have intentionally kept the contract surface small. Less code is less attack surface.
6.1 The Contract Set
| Contract | Purpose | Approx LoC |
|---|---|---|
| StakeRegistry | Provider, Router, Validator stake; slashing; withdrawals | ~400 |
| JobRegistry | Commits job tickets, tracks state, emits settlement events | ~350 |
| PaymentVault | Holds developer prepaid balances (USDC), releases on settlement | ~250 |
| DisputeManager | Optimistic-tier challenge game; bisection; final adjudication | ~500 |
| AttestationVerifier | Verifies TEE attestations; whitelists vendor root keys | ~300 |
| zkVerifier | On-chain SNARK verification for Tier 3 jobs | ~200 |
| Reputation | On-chain reputation score (EWMA over verification outcomes) | ~150 |
| Governor | Standard OZ Governor + Timelock for $INFR voting | OZ stock |
| INFR Token | ERC-20 with permit, snapshots, and burn hooks | OZ stock |
6.2 Job Lifecycle, On-Chain
Here is what actually gets written to the chain for a single inference job. (Most happens in batches — see §6.5 — but conceptually, this is the per-job state machine.)
6.3 Stake and Slashing
Every Provider deposits $INFR into StakeRegistry before serving traffic. The minimum stake scales with claimed throughput: a Provider claiming 100 tokens/sec on Llama-3.1-70B needs more stake than one claiming 10 tokens/sec, because their potential to harm is higher.
Slashing is triggered by:
- Verified incorrect output (Tier 1: invalid attestation; Tier 2: lost dispute; Tier 3: invalid SNARK)
- Failure to deliver after accepting a job (proven by Router signature + timeout)
- Documented downtime exceeding the SLA the Provider committed to
Slash amounts are graduated. First offense for low-severity issues might be 1% of stake. Provable fraud — returning bad outputs maliciously — is up to 100%. Slashed funds split: 40% to harmed developer (refund + damages), 40% to challenger/validator (bounty), 20% burned. Burning matters because it gives $INFR an actual deflationary pressure tied to bad behavior.
6.4 Payment Flow
- Deposit. Developer deposits USDC into PaymentVault. Balance shows up in the dashboard. They get a per-account API key.
- Reserve. When a job is dispatched, an estimated max cost is reserved (escrowed) from the developer's balance.
- Settle. On settlement (after challenge window for Tier 2; immediately for Tier 1/3), actual cost based on real input/output tokens is computed. Developer is refunded any reserve overage. Provider is paid in USDC. Router takes its 1.5% cut. Protocol takes 1.5% (governed). Validators paid pro-rata from a separate inflation budget.
- Withdraw. Provider can withdraw earned USDC at any time. Developer can withdraw remaining deposit balance at any time.
6.5 Why Batch Settlement (and How It Stays Cheap)
If we wrote one transaction per inference call, even on Base we would burn meaningful gas. Instead, Routers batch up to 10,000 settled jobs into a single Merkle root and commit it on-chain every 60 seconds. Individual jobs are auditable via Merkle proof. Disputes can still be raised against any leaf during the challenge window.
Net effect: per-job on-chain cost amortizes to under $0.0001, which is well below the typical fee of an inference call. The economics work.
6.6 A Note on Upgradability
The contracts are upgradable through a Timelock-controlled UUPS pattern, with a 7-day delay between governance approval and execution. This is non-negotiable: the verification layer in particular will evolve as zkML matures, and a frozen contract set is a contract set that becomes obsolete. Critics of upgradability are right to be suspicious; we mitigate by publishing all proposed upgrades on-chain, auditing them with at least two independent firms before any vote, and sizing the timelock so users have real time to exit if they disagree.
7. Scheduling, Routing, and Latency
The blockchain handles settlement and slashing. It does not, and should not, handle real-time matching of milliseconds-old requests to specific GPUs across the planet. For that we have Routers.
7.1 What a Router Does
- Maintains a real-time index of registered Providers, their advertised models (Hot Pool membership), recent latency, recent error rates, and current load.
- Receives developer requests via standard HTTPS (with API-key auth backed by an on-chain registry).
- Picks the best Provider for the request using a scoring function (more on this below).
- Stream-proxies the response back to the developer.
- Signs and submits the settled job ticket to the JobRegistry contract within the next batch window.
7.2 The Scoring Function (Plain English Version)
For each candidate Provider, the Router computes:
The exact weights are governance parameters. The point is: cheaper, closer, more reliable Providers naturally win more traffic. A Provider can't just undercut on price and dominate — they also need uptime and a clean reputation history. Conversely, a Provider with great uptime and reputation can charge a small premium and still win.
7.3 Are Routers Trusted?
Routers are not trusted with funds — they never custody developer USDC. They are trusted with routing decisions, in the sense that a malicious Router could systematically favor a specific Provider to extract bribes. We mitigate this in three ways:
- Routers stake $INFR and can be slashed for provably biased routing (statistical anomaly detection across the whole network).
- Multiple Routers compete. A developer's SDK can use any Router; if one is unreliable or expensive, the SDK fails over.
- Routers publish their scoring function and recent decisions in a tamper-evident log, anchored on-chain.
Long-term, Routers may become further decentralized via a small consensus protocol (think DA committees), but the v1 model — multiple staked, mutually competitive Routers — is sufficient for launch and avoids overengineering.
7.4 Latency Numbers We Care About
| Phase | Target P50 | Target P99 | Notes |
|---|---|---|---|
| DNS + TLS to Router | 20 ms | 60 ms | Standard CDN/Anycast |
| Router → Provider dispatch | 10 ms | 40 ms | Geographic routing |
| Time-to-first-token (TEE, hot pool, 70B model) | 350 ms | 900 ms | Comparable to hyperscaler |
| Per-token throughput (70B, single H100) | 60 tok/s | — | Driven by hardware |
| Settlement finality (Tier 1) | ~60 s | ~120 s | Next batch + Base block time |
| Settlement finality (Tier 2) | ~150 s | ~300 s | Includes challenge window |
8. The $INFR Token
8.1 What $INFR Is For
- Staking. Providers, Routers, and Validators must stake $INFR to participate. This is real, locked utility.
- Slashing collateral. Bad behavior burns or reallocates $INFR.
- Governance. Parameter changes, upgrades, treasury allocation.
- Protocol fee buyback. A portion of protocol fees (paid in USDC) is used to buy and burn $INFR, linking real usage to token supply.
Importantly, developers do not need to hold $INFR to use the network. They pay in USDC. This is deliberate. Forcing customers to hold a volatile token to use your product is a tax on adoption, and adoption is the only thing that matters in year one.
8.2 Supply
Total supply: 1,000,000,000 INFR, fixed. No inflation. No hidden mints. Supply schedule below.
| Allocation | Amount | % of Supply | Vesting |
|---|---|---|---|
| Ecosystem Incentives (provider bootstrap, validator rewards, grants) | 350,000,000 | 35% | Released over 5 years on a programmatic curve tied to network usage |
| Core Team & Future Hires | 180,000,000 | 18% | 12-month cliff, 36-month linear vest |
| Treasury (governance-controlled) | 150,000,000 | 15% | Locked 12 months, then governance-released |
| Investors | 150,000,000 | 15% | 12-month cliff, 24-month linear vest |
| Public Launch (LBP / community sale) | 80,000,000 | 8% | Unlocked at TGE |
| Liquidity & Market Making | 50,000,000 | 5% | Locked in protocol-owned liquidity |
| Foundation Operating Reserve | 40,000,000 | 4% | Released as needed for ops, audited annually |
8.3 Buyback-and-Burn Mechanics
Of the 1.5% protocol fee on every settled job:
- 0.5% goes to the Treasury (governance-controlled, USDC).
- 0.5% goes to Validator rewards (paid in USDC).
- 0.5% goes to a public-market USDC → INFR buyback executed weekly via a TWAP. Bought INFR is sent to a burn address.
This means: if INFERA does $1B in annual inference volume, roughly $5M of $INFR is bought and burned per year. Whether that's bullish for the token depends entirely on whether the network achieves volume. Which it should, if the product is good.
8.4 What $INFR Is Not
It is not a security under our analysis (consult your own counsel — this is not legal advice). It is not yield-bearing by default; staking yields come from real protocol fees and validation rewards, not from inflation. It is not a payment token for end users. It is, fundamentally, a governance and security-bond token tied to a real, productive network. We hope it's also a good investment over time, but that's a function of execution, not promises.
9. Developer Experience
If you have ever called the OpenAI API, you can use INFERA in less than five minutes. The SDK is intentionally a drop-in replacement.
9.1 The Five-Minute Onboarding
- Connect a wallet (or create a hosted account with email + Coinbase Commerce on-ramp).
- Deposit USDC. Anything from $5 to $50,000. Receive an API key bound to your account.
- Point your code at https://api.infera.network/v1. Done.
Python (using the openai library):
TypeScript:
9.2 Supported Models at Launch
We launch with the following open-weight models hot-loaded across the Provider network. Adding a new model is a governance proposal that any community member can submit, with hot-pool incentives auto-activating once approved.
- Llama 3.1 — 8B, 70B, 405B Instruct
- Mistral / Mixtral families (7B, 8x7B, 8x22B)
- Qwen 2.5 (7B, 32B, 72B)
- DeepSeek-V3 and DeepSeek-R1
- Stable Diffusion XL, FLUX.1
- Whisper-large-v3 (transcription)
- BAAI bge embedding family
- All other major open-weight models added on community demand
9.3 Native Web3 Features
Beyond the OpenAI-compatible surface, INFERA offers Web3-native features that hyperscalers literally cannot:
- Pay-per-call from a smart contract: Your on-chain agent can call INFERA directly via a precompiled adapter, no off-chain relayer needed.
- Verifiable inference receipts: Every response includes a cryptographic receipt your dApp can post on-chain to prove an AI output is real.
- Privacy-preserving inference: TEE mode means even the Provider can't see your prompts or outputs.
- Streaming usage logs: Every job is logged on-chain. No vendor controls your usage history; no surprise bills.
10. Provider Onboarding
This section is for the people with the GPUs. If you are reading this and you have an unused 4090, 5090, A100, H100, or anything Apple-Silicon-fancy, this is for you.
10.1 Hardware Tiers
| Tier | Example Hardware | Suitable For | Approx. Earnings (USD/day at 80% util) |
|---|---|---|---|
| Consumer | RTX 4090, RTX 5090, M3 Ultra | Models up to ~13B params, embeddings, image generation | $8 – $25 |
| Prosumer | RTX 6000 Ada, A100 40GB, dual 4090s with NVLink | Up to ~70B with quantization | $30 – $90 |
| Datacenter | H100, H200, B100, B200, MI300X | Full-fat 70B+ at FP16, 405B with multi-GPU | $150 – $600 |
| Confidential Datacenter | H100/H200 in CC mode, AMD MI300X with SEV-SNP | Tier-1 verified inference at premium rates | $200 – $800 |
Numbers are illustrative. Actual earnings depend on utilization, your bid price, your model loadout, and how much you stake. The honest version: at network maturity, well-run datacenter-class providers should comfortably outperform spot-market rentals on equivalent hardware. Consumer-tier providers will earn meaningful pocket money but probably not retire.
10.2 The Provider Stack
We ship a Docker-Compose-based Provider node that runs alongside your existing GPU workloads. It includes:
- infera-agent: registers with the network, signs job tickets, manages the model cache.
- vLLM, TGI, or SGLang as the inference runtime (your choice; we benchmark all three).
- TEE attestation helpers for compatible hardware.
- Optional: zkML proof generation worker.
- Prometheus metrics, Grafana dashboard, the works.
Quick start:
10.3 Operational Reality Check
Running a Provider is real ops work. You need a reliable internet connection (gigabit recommended; symmetric upload genuinely matters), a stable power situation, decent cooling, and the willingness to actually monitor a service. We have automated as much as possible — the agent handles model downloads, hot-pool management, graceful shutdown, etc. — but if your home internet drops twice a day, your reputation score will reflect that, and the network will route around you.
Conversely, if you run a tight ship, the network will reward you. Reputation compounds. Long-tenured Providers with good track records get the cushiest jobs.
11. Governance
We are skeptical of governance theater. Most "DAOs" are either disguised oligarchies, ungovernable mobs, or both. We have tried to design something that is actually functional.
11.1 What Governance Controls
- Protocol fee percentages (within bounded ranges)
- Slashing parameters and minimum stakes
- Adding/removing supported models from the official registry
- Treasury allocation
- Smart contract upgrades (always behind a 7-day timelock)
- Whitelisting new TEE vendor root keys
11.2 What Governance Does Not Control
- The total supply of $INFR (immutably 1B)
- Slashing a specific user retroactively (no clawbacks)
- Custody of developer funds (they cannot be reassigned by vote)
- Censorship of specific addresses or models (the protocol is permissionless)
11.3 Voting Mechanics
Standard OpenZeppelin Governor with a few INFERA-specific tweaks:
- Vote-escrowed $INFR ("veINFR"): you can lock INFR for up to 4 years for boosted voting power. Lockers also get amplified rewards from the Validator program.
- Quadratic dampening: voting power is sqrt(veINFR) for parameter changes (not for upgrades, where flat voting applies). This reduces whale dominance for routine governance.
- Quorum: 4% of circulating supply for parameter changes; 8% for upgrades.
- All votes are on Base, with standard OZ Governor + Tally UI.
11.4 The Foundation
INFERA Foundation is a Cayman foundation company that holds the protocol's intellectual property and treasury. The Foundation is not a substitute for governance — it executes governance decisions and handles legal/operational matters that do not lend themselves to on-chain votes (like signing a contract with an audit firm). The Foundation's annual budget is approved by $INFR governance.
12. Roadmap
This is what we are building, in order. The honest sub-text: software estimates are aspirational, and ours are no exception. We commit to public retrospectives every quarter regardless of how the quarter went.
12.1 Phase 0 — Devnet (Q2 2026)
- Smart contracts on Base Sepolia
- Provider client and Router reference implementation
- Closed alpha with ~50 selected developers and ~20 selected providers
- OpenAI-compatible API working end-to-end with TEE attestation on H100
12.2 Phase 1 — Mainnet Launch (Q4 2026)
- Audited contracts on Base mainnet
- Public Provider onboarding (permissionless after 30-day allowlist warmup)
- Public developer access; deposits in USDC
- $INFR TGE concurrent with mainnet
- Tier 1 (TEE) and Tier 2 (Optimistic) verification live
- Initial supported models: Llama 3.1 family, Mixtral, Qwen 2.5, SDXL
12.3 Phase 2 — Scale and zkML (2027)
- Tier 3 zkML support for models up to 13B params (real-time) and 70B (batch)
- Multi-chain deployment: Arbitrum, Optimism, Polygon zkEVM
- Native fiat on-ramps for non-crypto-native developers
- Enterprise SLAs: dedicated capacity, BYO model, private deployments
- Provider mobile app (yes, monitor your stake from the bus)
12.4 Phase 3 — Full Verifiability (2028+)
- zkML real-time for 70B+ models (depends on proving system advances)
- Cross-chain settlement via canonical bridges and CCIP
- Native support for fine-tuned and proprietary models with encrypted weights
- Decentralization of Routers via consensus protocol
- Provider hardware federation: pool small operators behind a single interface
13. Risks and Limitations
13.1 Technical Risks
- TEE compromise. If a hardware vendor's signing keys leak, Tier 1 verification breaks for that vendor until keys rotate. We mitigate by supporting multiple TEE vendors and falling back to Tier 2 (Optimistic) for affected providers.
- zkML proving cost. Tier 3 cost projections assume continued progress in zkML systems. If progress stalls, Tier 3 remains a niche premium product longer than expected. The protocol still works; it just has a smaller premium tier.
- Smart contract bugs. We will be audited by at least two top-tier firms before mainnet, run a Code4rena contest, and maintain a continuous bug bounty (up to $1M for critical issues). None of this guarantees zero bugs. Insurance protocols (Nexus Mutual, etc.) will be supported.
- Determinism for fraud proofs. Tier 2 requires bit-identical reproducibility. Some inference stacks have non-deterministic kernels. We pin versions tightly, but edge cases will surface and require fixes.
13.2 Economic and Market Risks
- Hyperscaler price wars. AWS, Azure, and GCP could cut inference prices aggressively. Our cost advantage shrinks but does not disappear — we still aggregate idle hardware they cannot access.
- Cold-start chicken-and-egg. Networks like ours need both supply and demand from day one. We address this with Phase 0 incentives funded by the ecosystem allocation; the risk is that incentives expire before organic flywheels start.
- $INFR price volatility. Provider stakes are denominated in INFR. A sharp drop in token price reduces absolute slashing collateral. We address this with USDC-denominated minimums recalculated weekly.
13.3 Regulatory Risks
- Token classification. Securities law in the US and elsewhere is evolving. We have structured $INFR as a utility/governance token with significant in-protocol use, and have engaged outside counsel in multiple jurisdictions, but we cannot guarantee how regulators will act.
- AI regulation. The EU AI Act, US state-level rules, and emerging international frameworks may impose obligations on inference providers. We are designing the protocol to support compliance flags (e.g., region restrictions, content filtering at the SDK layer) without compromising permissionlessness at the base layer.
13.4 Misuse Risks
A permissionless inference network can, in principle, serve harmful workloads. We take this seriously. Mitigations:
- SDK-level safety filters that developers can opt into (with transparent allow/block logs).
- Provider-side opt-out: any Provider can refuse jobs matching their declared content policy. Their reputation is unaffected by such refusals.
- Standard CSAM hash detection at the input/output level for image models, run on-Provider. Non-negotiable.
- Compliance with valid legal process directed at the Foundation, recognizing the Foundation does not control individual Providers.
We are not naive about the tension between permissionlessness and safety. We will continue to publish our thinking and update mechanisms publicly.
14. Conclusion
Three propositions sit underneath this entire document:
- AI inference is becoming the largest compute workload in human history, and concentrating it in three or four cloud providers is bad for prices, bad for innovation, and bad for resilience.
- There is enormous idle GPU capacity in the world, and pulling it into a coherent global market is a tractable engineering problem — not a moonshot.
- Verifiability — the ability to cryptographically prove an inference was honest — is the missing piece that turns "random GPUs talking to each other" into a real, trust-minimized marketplace.
INFERA is our attempt to make all three of those true at once. We've designed it to work today (with TEE attestation and OpenAI-compatible APIs), to scale tomorrow (with zkML as it matures), and to remain useful forever (because the underlying need for cheap, trustworthy inference is not going anywhere).
If you're a developer: come build on it. The first $50 of inference is on us.
If you have GPUs: come provide. We have made onboarding painless on purpose.
If you're a researcher: come argue with us. The hardest problems in this design are open, and we publish.
And if you're someone who has read this far on a Tuesday: thank you. That's a real investment of attention. We hope the protocol returns it.
— The INFERA Core Contributors
Base · Ethereum · Open Source · April 2026
Appendix A — Glossary
| Term | Plain-English Definition |
|---|---|
| Inference | Running a trained AI model on input to produce output. The everyday work of an AI. |
| TEE | Trusted Execution Environment. Hardware-enforced secure enclave that can prove what code ran inside it. |
| zkML | Zero-Knowledge Machine Learning. Mathematically proving an AI ran correctly without re-running it. |
| Optimistic Verification | Assume honest by default; allow challenges within a window; punish whoever was wrong. |
| TTFT | Time-to-first-token. How long after you press enter before the AI starts replying. |
| Hot Pool | Set of Provider nodes with a model already loaded and ready to serve. |
| Slashing | Losing some or all of your staked tokens for misbehavior. Existential, intentionally. |
| Router | Off-chain matchmaker that pairs developer requests with the best Provider in milliseconds. |
| Validator | Independent node that re-checks Provider outputs and challenges bad ones. |
| L2 / Rollup | A blockchain that runs on top of Ethereum, inheriting its security but with cheaper fees. |
| UUPS Proxy | An upgrade pattern for smart contracts. Lets the team fix bugs without breaking user state. |
Appendix B — Selected References
This is not an academic paper, but credit where due. The architecture above stands on the shoulders of:
- EZKL, Risc Zero, and the broader zkML community for proving-system advances.
- Akash Network and Render Network, for early demonstrations that decentralized compute markets can work at scale.
- Bittensor, for the cleanest existing thinking on incentivized model serving.
- Optimism and Arbitrum, for fraud-proof system designs we have shamelessly learned from.
- OpenZeppelin, for governance and access-control primitives that just work.
- vLLM, SGLang, and TGI, for making high-throughput LLM serving a solved problem.
- NVIDIA, AMD, and Intel TEE engineering teams, for hardware-level confidential computing.
Read more, contribute, or yell at us at: infera.network · github.com/infera-protocol