INFERA Protocol · Whitepaper

Inference, decentralized.

A Decentralized Marketplace for Verifiable AI Inference

Idle GPUs in. Cheap, trustworthy AI tokens out.

Version: v1.0 · Date: April 2026 · Authors: INFERA Core Contributors · Web: infera.network

1. TL;DR (a.k.a. Abstract, but Honest)

If you only read one box, read this one. INFERA is an open protocol that lets anyone with a capable GPU rent it out for AI inference, and lets any developer pay per token to run open-source models on that hardware — at a fraction of hyperscaler prices. Jobs are scheduled on-chain, executed inside Trusted Execution Environments (or proven with zkML for high-stakes workloads), and settled in stablecoins. The protocol is deployed on Base, secured by Ethereum.

The AI industry has a curious shape. On one side, OpenAI, Anthropic, AWS Bedrock, and Google Vertex run inference on enormous, expensive datacenter farms and charge accordingly. On the other side, there are millions of underutilized GPUs sitting in gaming PCs, ex-mining rigs, university labs, small datacenters, and the back closets of crypto natives who definitely did not buy these RTX 4090s for just gaming. INFERA connects the two.

That description sounds simple. The hard parts — and the parts most prior projects have either skipped or hand-waved through — are: (1) how do you prove that a GPU operator actually ran the model you asked for, and didn't return cheap garbage from a smaller model? (2) how do you handle latency-sensitive workloads on a permissionless network? And (3) how do you make the developer experience feel less like "deploying a Cosmos validator" and more like "calling an OpenAI endpoint"?

This whitepaper answers all three. We use a hybrid verification model (TEE attestation by default, zkML for high-value jobs, optimistic challenges as a backstop), a reputation-weighted scheduler with hot-pool warm models, and an SDK that ships with an OpenAI-compatible API — yes, the same one your existing Python script already speaks. You change the base URL and an API key. That's the day-one developer pitch.

We are not the first project in this neighborhood. We have enormous respect for Akash, Render, Bittensor, Gensyn, and io.net — they've done a lot of the heavy intellectual lifting. INFERA is specifically focused on the inference layer, with verifiability as a first-class citizen and developer UX as a non-negotiable. If everyone wins, great. If we have to be loud about being faster and cheaper, we will be.

2. The Problem (Or: Why We Bothered)

2.1 Inference is Eating the World, and the Bill Is Coming

Inference — the process of actually running a trained model to produce an output — has quietly become the dominant cost center in modern AI. Training a frontier model is a heroic, one-time(ish) capex event. Inference is a forever-opex line item. Every chatbot reply, every code completion, every image generated, every embedding indexed: that's inference. And it scales linearly with usage, which is to say, it scales with success.

Today, the vast majority of that inference runs through three or four cloud providers. Their margins are healthy. Their latency is good. Their reliability is excellent. Their prices, however, are set by a market with very few sellers, and it shows.

2.2 Meanwhile, in the Real World

There are millions of GPUs in the wild that are not running anything most of the time:

Gaming rigs: an RTX 4090 used 4 hours a night for Counter-Strike has 20 hours of idle compute per day.
Ex-mining farms: post-Merge Ethereum miners pivoted, but plenty of facilities have GPU racks looking for purpose.
Small datacenters and academic clusters: routinely underbooked, especially nights and weekends.
Web3-native power users with NVIDIA H100s in their garage. Yes, those people exist. We have met them. They are wonderful.

The total addressable supply here is genuinely enormous — credible estimates place idle consumer-grade GPU capacity at multiple exaFLOPS globally. The problem is not supply. The problem is matching, trust, and settlement.

2.3 What Existing Protocols Get Right (and What They Don't)

Protocol	Strength	Gap for AI Inference
Akash Network	General compute marketplace, mature bidding system	Optimized for containerized workloads broadly; inference-specific UX (model registry, token-billed APIs, sub-second routing) is not the focus
Render Network	Battle-tested at scale, strong creator economy	Built around 3D rendering and offline jobs; latency profile and verification model don't fit inference
Bittensor	Brilliant incentive design, real model-serving subnets	Reward distribution is consensus-driven and indirect; pricing is opaque to the average developer
Gensyn	Cutting-edge work on verifiable compute	Heavily training-focused; inference latency requirements are different
io.net	Excellent supply-side aggregation	Verification of output correctness is largely social/trust-based, not cryptographic

INFERA is not trying to replace any of these. We are trying to occupy a specific, currently-vacant chair: a permissionless, verifiable, pay-per-token inference layer with a developer experience indistinguishable from a Web2 API. If we do this well, the rest of the stack becomes obvious.

2.4 The Three Hard Problems

Anyone proposing decentralized AI inference has to answer three questions. We will spend most of this paper on these:

Verifiability: If I send a job to a stranger's GPU, how do I know they actually ran Llama-3.1-70B and didn't just return the output of a tiny 7B model with a confident smile?
Latency: Real users don't wait 30 seconds for an autocomplete. How do we get sub-second time-to-first-token over a permissionless network?
Developer Experience: If onboarding requires reading three PhDs' worth of papers, no one will use it. What does "five-minute first call" look like?

3. The Solution, in One Picture and One Page

The mental model. Think of INFERA as Uber for AI inference, except the drivers are GPUs, the rides are model calls, the dispatch system is a smart contract on Base, and the receipts are cryptographic proofs. And nobody has to talk to the driver.

3.1 The Five Roles in the Network

Developers (a.k.a. Requesters): Anyone running an app, agent, or product that needs to call an LLM, vision model, embedding model, or other open-source AI. They send jobs and pay in stablecoins.
Providers (a.k.a. Operators): Anyone with a capable GPU and a stable internet connection. They register, stake $INFR, advertise capabilities, and earn fees.
Routers: Off-chain coordinators that match jobs to providers in milliseconds, cryptographically commit to assignments on-chain, and stake $INFR against misrouting.
Validators (a.k.a. Verifiers): Independent nodes that probabilistically re-check outputs, verify TEE attestations, and serve as challengers in the optimistic verification game.
$INFR Holders / Governors: Whoever holds the token participates in upgrades, parameter changes, and treasury allocation. We will get into this later, with appropriate skepticism about the word "DAO."

3.2 The Lifecycle of a Single Inference Call

In the time it takes you to read this section, the following has happened approximately 1,000 times on the network. (Future-tense disclaimer: assuming we ship.)

A developer's app calls api.infera.network/v1/chat/completions with an OpenAI-shaped JSON body.
The request hits a Router, which checks the developer's prepaid balance, signs a job ticket, and dispatches it to the best-matched Provider in the hot pool for the requested model.
The Provider, running the model inside a TEE (or with zkML proof generation enabled, depending on tier), executes the inference and returns the output along with an attestation signature.
The Router streams the response back to the developer (sub-second time-to-first-token in the common case).
The job ticket and attestation are batched and committed on-chain in the next settlement window. Provider gets paid. Router gets a tiny cut. Validators randomly sample 1–5% of jobs for re-verification.
If a Validator catches a bad output, the Provider's stake gets slashed and the developer is reimbursed. The Provider's reputation score takes a noticeable hit.

That's the whole show. The rest of this paper is how each of those steps actually works, and why it doesn't fall apart under adversarial conditions.

4. System Architecture

4.1 The Layered View

INFERA is built as four loosely-coupled layers. Each layer can be reasoned about, audited, and replaced independently — which is important because a few of these components (especially the verification layer) are still active research areas, and we want to ship without painting ourselves into a corner.

Layer	Lives On	Responsibility
Settlement Layer	Base (Ethereum L2)	Stake registry, payments, slashing, governance, dispute resolution
Coordination Layer	Off-chain (with on-chain commitments)	Job routing, scheduling, hot-pool management, latency optimization
Execution Layer	Provider hardware	Actual model inference inside TEE / with zkML witness generation
Application Layer	Developer's stack	OpenAI-compatible SDK, dashboard, billing, observability

4.2 Why Base?

We deploy the Settlement Layer on Base, Coinbase's Ethereum L2. Reasons, ranked by importance to a working engineer:

Fees: paying $0.30 in gas to settle a $0.0008 inference call is, mathematically, not the move. Base routinely settles transactions for fractions of a cent.
Ethereum security: Base is an OP-stack rollup posting to Ethereum L1. We get the security guarantees of mainnet without paying mainnet gas. This matters because slashing is a security-critical operation.
EVM-compatible: Solidity, Foundry, every audit firm in the world. No exotic VMs, no language risk.
Distribution: Base has a real, growing population of users with USDC in their wallets. Our customers and providers are already there.
Bridging: stablecoin liquidity flows freely between Base, Ethereum L1, and other L2s via canonical and third-party bridges.

We are explicitly chain-aware but not chain-maxi. The contracts are designed to be re-deployable on any EVM L2 (Arbitrum, Optimism, Polygon, etc.), and we plan to go multi-chain by Phase 2. Long term, the protocol could live anywhere with cheap fast finality. Short term, Base is the right call.

4.3 The Hot Pool: Why Models Don't Cold-Start Per Job

Loading a 70B-parameter LLM into GPU VRAM takes 30–90 seconds. If every inference call had to do that, latency would be unusable. INFERA's coordination layer solves this with a Hot Pool: a continuously-maintained set of Provider nodes that are warm-loaded with specific models, advertise their loadout on-chain, and earn a small idle subsidy for keeping models resident.

Routers prefer Hot Pool providers for any matching request. If demand spikes for a particular model, the Router signals the network, and idle Providers compete to spin up that model and join the pool. If demand drops, providers gracefully exit. The mechanism is a continuous auction for model demand: providers bid for the right to be in the hot pool for popular models, because being in the pool means more job flow.

A small concession to physics. INFERA cannot — and will not pretend to — beat a co-located GPU on absolute latency. If your application requires <10ms time-to-first-token from a model running in the same datacenter as your app server, you should buy that. INFERA is competing on the 50ms–500ms TTFT band, where 90% of real-world AI applications actually live, and where our cost advantage is enormous.

4.4 The Network Diagram, In Words

Picture a circle. In the center is the Base L2, with INFERA's smart contracts. Around the outside, in three concentric rings:

Inner ring — Routers and Validators: small number, well-staked, latency-sensitive infrastructure.
Middle ring — Providers: many, geographically distributed, varying hardware tiers.
Outer ring — Developers and end users: even more numerous, calling in via SDK or HTTP.

Information and value flow in both directions: jobs and payments going outward, results and proofs coming back. Every edge of every interaction has at least one cryptographic commitment behind it. Which brings us to the part that is, frankly, the most fun.

5. Verifiable Inference: The Hard, Interesting Part

If you remember nothing else from this paper, remember this: a decentralized inference network is only as valuable as its ability to detect cheating. Without verification, every Provider has an economic incentive to run the smallest, cheapest possible model and pretend it was the big expensive one you paid for. Solve that, and everything else follows.

There is no single magic answer. Verification has a cost vs. assurance vs. latency triangle, and you can pick at most two corners. INFERA therefore offers three verification tiers, and developers pick per-call.

5.1 Tier 1 — TEE Attestation (Default)

Modern data-center GPUs (NVIDIA H100, H200, B100) and certain CPU/GPU pairings (AMD SEV-SNP, Intel TDX, NVIDIA Confidential Compute) support Trusted Execution Environments — hardware-enforced enclaves where code runs in isolation and the hardware itself can sign attestations proving:

Which exact code (hash) was loaded
Which exact model weights (hash) were loaded
That the output was produced by that code on that data

The attestation is a signed blob from the chip manufacturer's root of trust. INFERA validates it on-chain via a precompile or attestation oracle, and the developer gets cryptographic assurance that the inference happened as advertised, with negligible runtime overhead (typically <5%).

This is our default tier. Fast, cheap, and good enough for the overwhelming majority of consumer and enterprise AI workloads. The trust assumption is that the hardware vendor's signing keys (NVIDIA's, for example) are not compromised. That's a reasonable assumption for everything short of nation-state adversaries.

5.2 Tier 2 — Optimistic Verification with Fraud Proofs

For Provider hardware that doesn't support TEE (older H100s in non-confidential mode, RTX 4090s, etc.), we use an optimistic model borrowed from rollup design:

Provider executes the inference and posts the output plus a commitment to the full computation trace.
Output is delivered to the developer immediately. (Optimistic = fast.)
During a challenge window (default: 90 seconds for chat, 30 minutes for batch), any Validator can re-execute the inference on the same model and inputs and dispute the result on-chain.
If a dispute is raised, an interactive bisection protocol identifies the exact differing computation step. The chain adjudicates that single step. The losing party (Provider or challenger) gets slashed.

This is how Arbitrum settles disputes for general computation, and we use the same intellectual lineage. The catch: it requires deterministic execution. We pin model versions, CUDA driver versions, and sampling seeds explicitly per job, which gets us bit-identical reproducibility for the vast majority of workloads.
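The bisection step is easier to picture with a small sketch. It assumes each party has committed to a chained list of per-step state hashes, that both agree on the initial state, and that they disagree on the final one. The function names and the list-of-hashes trace format are illustrative assumptions, not the DisputeManager interface, and SHA-256 stands in for whatever hash the contracts actually use. On-chain, the same search would run interactively, one round per midpoint.

```python
import hashlib


def step_hash(prev_hash: bytes, step_output: bytes) -> bytes:
    """Chain one step's output onto the previous state hash."""
    return hashlib.sha256(prev_hash + step_output).digest()


def build_trace(step_outputs: list[bytes]) -> list[bytes]:
    """Return state hashes h[0..n]; h[0] is the agreed initial state."""
    trace = [hashlib.sha256(b"genesis").digest()]
    for out in step_outputs:
        trace.append(step_hash(trace[-1], out))
    return trace


def first_divergent_step(provider: list[bytes], challenger: list[bytes]) -> int:
    """Binary-search for the first step whose resulting state differs.

    Returns i such that the traces agree at i - 1 and disagree at i.
    Only that single step then needs to be adjudicated.
    """
    if len(provider) != len(challenger):
        raise ValueError("traces must have equal length")
    if provider[0] != challenger[0] or provider[-1] == challenger[-1]:
        raise ValueError("need an agreed start and a disputed end")
    lo, hi = 0, len(provider) - 1  # invariant: agree at lo, disagree at hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if provider[mid] == challenger[mid]:
            lo = mid
        else:
            hi = mid
    return hi
```

Because the hashes are chained, two traces that diverge at one step stay diverged afterwards, which is what makes the binary search valid. A trace of n steps is narrowed to one step in about log2(n) rounds.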

5.3 Tier 3 — zkML Proofs (Premium)

For the highest-stakes applications — financial inference, medical decision support, regulated workflows — INFERA supports zero-knowledge machine learning proofs. The Provider generates a SNARK proving that running model M on input x yields output y, and the proof can be verified on-chain in milliseconds.

Let us be honest about where this technology is in 2026: zkML for full-size LLM inference is expensive. State-of-the-art proving systems (EZKL, RISC Zero zkVM, Jolt, and our own optimizations) can handle smaller models (<3B parameters) in real-time and larger models (up to 70B) in batch mode at single-digit-cents-per-call cost. We expect this to improve roughly 10x per year through 2028, at which point Tier 3 becomes the default and the world looks quite different.

For now, Tier 3 is offered as a premium option for the workloads where it economically makes sense. Pricing reflects the actual cost of proof generation, which currently adds roughly 5–50x the base inference cost (a +500% to +5000% premium) depending on model size.

5.4 Validator Sampling: The Glue

Independent of which tier a developer chooses, INFERA Validators randomly sample jobs and re-execute them. Averaged across the network, the sampling rate is 1–5% of all jobs (the parameter is governed), but the per-Provider rate scales inversely with reputation: a brand-new Provider gets ~10% of jobs sampled; a long-tenured high-reputation Provider gets <0.5%.
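A minimal sketch of a reputation-scaled sampling rate follows. The two endpoints (~10% and 0.5%) come from the text; the geometric interpolation between them, the normalization of reputation to [0, 1], and the function names are assumptions made for illustration. A production Validator would also draw its randomness from a verifiable source rather than a local generator.

```python
import random

NEW_PROVIDER_RATE = 0.10  # ~10% for a brand-new Provider
VETERAN_RATE = 0.005      # 0.5%, used here as the floor for top reputation


def sampling_rate(reputation: float) -> float:
    """Interpolate geometrically between the two published endpoints.

    `reputation` is assumed to be normalized to [0, 1]; values outside
    that range are clamped. The curve shape is an assumption.
    """
    r = min(max(reputation, 0.0), 1.0)
    return NEW_PROVIDER_RATE * (VETERAN_RATE / NEW_PROVIDER_RATE) ** r


def should_sample(reputation: float, rng: random.Random) -> bool:
    """Decide whether a Validator re-executes this particular job."""
    return rng.random() < sampling_rate(reputation)
```

With this curve, a Provider at reputation 0.5 is sampled on roughly 2.2% of jobs, which sits inside the 1–5% network-wide band.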

Validators themselves stake $INFR and earn rewards for honest checks. If a Validator falsely accuses a Provider, they get slashed. This makes verification a permissionless economic game, not a privileged role.

5.5 Putting It All Together

Tier	Trust Assumption	Latency Overhead	Cost Premium	Best For
1 — TEE	Hardware vendor not compromised	<5%	0% (baseline)	Most apps: chatbots, agents, RAG, generation
2 — Optimistic	≥1 honest Validator exists	0% to user; settlement delayed	+10%	Cost-sensitive batch workloads
3 — zkML	SNARK soundness (cryptographic)	Significant	+500% to +5000%	Regulated, financial, medical

6. Smart Contract Architecture

All on-chain logic lives in a small set of audited Solidity contracts on Base. We follow OpenZeppelin patterns where possible, use UUPS proxies for upgradability behind a 7-day timelock, and have intentionally kept the contract surface small. Less code is less attack surface.

6.1 The Contract Set

Contract	Purpose	Approx LoC
StakeRegistry	Provider, Router, Validator stake; slashing; withdrawals	~400
JobRegistry	Commits job tickets, tracks state, emits settlement events	~350
PaymentVault	Holds developer prepaid balances (USDC), releases on settlement	~250
DisputeManager	Optimistic-tier challenge game; bisection; final adjudication	~500
AttestationVerifier	Verifies TEE attestations; whitelists vendor root keys	~300
zkVerifier	On-chain SNARK verification for Tier 3 jobs	~200
Reputation	On-chain reputation score (EWMA over verification outcomes)	~150
Governor	Standard OZ Governor + Timelock for $INFR voting	OZ stock
INFR Token	ERC-20 with permit, snapshots, and burn hooks	OZ stock
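The EWMA behind the Reputation contract can be sketched in a few lines. This assumes each verification outcome is scored 1.0 (pass) or 0.0 (fail) and that the smoothing factor `alpha` is a tunable parameter; both are illustrative assumptions, and an on-chain version would use fixed-point integers rather than floats.

```python
def update_reputation(score: float, outcome_ok: bool, alpha: float = 0.05) -> float:
    """One EWMA step: new = (1 - alpha) * old + alpha * observation."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    observation = 1.0 if outcome_ok else 0.0
    return (1.0 - alpha) * score + alpha * observation


def replay(initial: float, outcomes: list[bool], alpha: float = 0.05) -> float:
    """Fold a history of verification outcomes into a single score."""
    score = initial
    for ok in outcomes:
        score = update_reputation(score, ok, alpha)
    return score
```

A larger `alpha` makes the score react faster to recent outcomes; a smaller one rewards long tenure, since a single failure moves a well-established score less.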

6.2 Job Lifecycle, On-Chain

Here is what actually gets written to the chain for a single inference job. (Most of this happens in batches, as §6.5 explains, but conceptually this is the per-job state machine.)

// Simplified: the actual contract uses packed structs + bitfields for gas
enum JobStatus { Pending, Executed, Settled, Disputed, Slashed }

struct Job {
    bytes32 jobId;             // keccak256(developer, nonce, modelHash, inputHash)
    address developer;         // who paid
    address provider;          // who served
    address router;            // who matched
    bytes32 modelHash;         // exact model weights identifier
    uint64 inputTokens;
    uint64 outputTokens;
    uint128 priceUSDC;         // total job price in USDC (6 decimals)
    uint8 tier;                // 1 = TEE, 2 = Optimistic, 3 = zkML
    bytes32 outputCommitment;  // keccak256(output) for fraud-proof anchoring
    uint64 executedAt;
    uint64 challengeUntil;     // 0 for Tier 1 and Tier 3 (instant finality)
    JobStatus status;
}

event JobCommitted(bytes32 indexed jobId, address indexed provider, uint8 tier);
event JobSettled(bytes32 indexed jobId, uint128 paid);
event JobDisputed(bytes32 indexed jobId, address challenger);
event JobSlashed(bytes32 indexed jobId, address slashedParty, uint128 amount);

6.3 Stake and Slashing

Every Provider deposits $INFR into StakeRegistry before serving traffic. The minimum stake scales with claimed throughput: a Provider claiming 100 tokens/sec on Llama-3.1-70B needs more stake than one claiming 10 tokens/sec, because their potential to harm is higher.

Slashing is triggered by:

Verified incorrect output (Tier 1: invalid attestation; Tier 2: lost dispute; Tier 3: invalid SNARK)
Failure to deliver after accepting a job (proven by Router signature + timeout)
Documented downtime exceeding the SLA the Provider committed to

Slash amounts are graduated. A first offense for a low-severity issue might cost 1% of stake. Provable fraud (maliciously returning bad outputs) can cost up to 100%. Slashed funds are split: 40% to the harmed developer (refund + damages), 40% to the challenger/validator (bounty), and 20% burned. Burning matters because it puts deflationary pressure on $INFR that is tied directly to bad behavior.
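The 40/40/20 split is simple, but integer rounding deserves a moment of care. The sketch below works in the token's smallest unit and sends any rounding dust to the burn so that the three parts always sum to the slashed amount. That dust rule and the function name are assumptions, not the StakeRegistry behavior.

```python
def split_slash(amount: int) -> dict[str, int]:
    """Split a slashed amount (smallest token units) 40/40/20."""
    if amount < 0:
        raise ValueError("amount must be non-negative")
    developer = amount * 40 // 100   # refund + damages
    challenger = amount * 40 // 100  # validator bounty
    burned = amount - developer - challenger  # 20% plus rounding dust
    return {"developer": developer, "challenger": challenger, "burned": burned}
```

For a slash of 1,000 units this gives 400 / 400 / 200; for an awkward amount such as 7 units it gives 2 / 2 / 3, with the leftover unit burned rather than lost.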

6.4 Payment Flow

Deposit. Developer deposits USDC into PaymentVault. Balance shows up in the dashboard. They get a per-account API key.
Reserve. When a job is dispatched, an estimated max cost is reserved (escrowed) from the developer's balance.
Settle. On settlement (after challenge window for Tier 2; immediately for Tier 1/3), actual cost based on real input/output tokens is computed. Developer is refunded any reserve overage. Provider is paid in USDC. Router takes its 1.5% cut. Protocol takes 1.5% (governed). Validators are paid pro rata from the validator-rewards share of the protocol fee (§8.3), supplemented by the Ecosystem Incentives allocation (§8.2).
Withdraw. Provider can withdraw earned USDC at any time. Developer can withdraw remaining deposit balance at any time.
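The Reserve and Settle steps can be sketched as integer arithmetic in micro-USDC (6 decimals, matching the Job struct). This assumes that both 1.5% fees are deducted from the job price rather than added on top, and that prices are quoted per 1,000 tokens. Both are assumptions for illustration, as are the names `Settlement` and `settle`.

```python
from dataclasses import dataclass

ROUTER_FEE_BPS = 150    # 1.5% Router cut
PROTOCOL_FEE_BPS = 150  # 1.5% protocol fee (governed)
BPS = 10_000


@dataclass(frozen=True)
class Settlement:
    cost: int             # actual job cost, micro-USDC
    refund: int           # unused reserve returned to the developer
    router_fee: int
    protocol_fee: int
    provider_payout: int


def settle(reserved: int, input_tokens: int, output_tokens: int,
           price_in: int, price_out: int) -> Settlement:
    """Settle one job; prices are micro-USDC per 1,000 tokens (assumed unit)."""
    cost = (input_tokens * price_in + output_tokens * price_out) // 1000
    if cost > reserved:
        raise ValueError("actual cost exceeds the reserved amount")
    router_fee = cost * ROUTER_FEE_BPS // BPS
    protocol_fee = cost * PROTOCOL_FEE_BPS // BPS
    provider_payout = cost - router_fee - protocol_fee
    return Settlement(cost, reserved - cost, router_fee, protocol_fee, provider_payout)
```

For example, `settle(2000, 500, 1500, 400, 800)` charges 1,400 micro-USDC, refunds 600, pays 21 to the Router and 21 to the protocol, and leaves 1,358 for the Provider.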

6.5 Why Batch Settlement (and How It Stays Cheap)

If we wrote one transaction per inference call, even on Base we would burn meaningful gas. Instead, Routers batch up to 10,000 completed job tickets into a single Merkle root and commit it on-chain every 60 seconds. Individual jobs are auditable via Merkle proof. Disputes can still be raised against any leaf during the challenge window.

Net effect: per-job on-chain cost amortizes to under $0.0001, which is well below the typical fee of an inference call. The economics work.
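To make the batching concrete, here is a minimal Merkle-root and inclusion-proof sketch. It uses SHA-256 from the standard library as a stand-in (the Solidity contracts would more plausibly use keccak256, which Python's hashlib does not provide), and it pairs an odd node with itself for simplicity. The leaf encoding, domain-separation prefixes, and function names are all assumptions.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def _next_level(level: list[bytes]) -> list[bytes]:
    """Hash adjacent pairs; `level` must have even length."""
    return [_h(b"\x01" + level[i] + level[i + 1]) for i in range(0, len(level), 2)]


def merkle_root(leaves: list[bytes]) -> bytes:
    """Root over a batch of serialized job tickets."""
    if not leaves:
        raise ValueError("empty batch")
    level = [_h(b"\x00" + leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # pair an odd node with itself
        level = _next_level(level)
    return level[0]


def merkle_proof(leaves: list[bytes], index: int) -> list[tuple[bytes, bool]]:
    """Sibling path for leaves[index]; the flag is True if the sibling sits on the right."""
    level = [_h(b"\x00" + leaf) for leaf in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        sibling = index ^ 1
        proof.append((level[sibling], sibling > index))
        level = _next_level(level)
        index //= 2
    return proof


def verify(leaf: bytes, proof: list[tuple[bytes, bool]], root: bytes) -> bool:
    """Recompute the root from one leaf and its sibling path."""
    node = _h(b"\x00" + leaf)
    for sibling, sibling_is_right in proof:
        pair = node + sibling if sibling_is_right else sibling + node
        node = _h(b"\x01" + pair)
    return node == root
```

A proof for one job in a 10,000-job batch is about 14 hashes long, so auditing or disputing a single leaf never requires the whole batch to be posted on-chain.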

6.6 A Note on Upgradability

The contracts are upgradable through a Timelock-controlled UUPS pattern, with a 7-day delay between governance approval and execution. This is non-negotiable: the verification layer in particular will evolve as zkML matures, and a frozen contract set is a contract set that becomes obsolete. Critics of upgradability are right to be suspicious; we mitigate by publishing all proposed upgrades on-chain, auditing them with at least two independent firms before any vote, and sizing the timelock so users have real time to exit if they disagree.

7. Scheduling, Routing, and Latency

The blockchain handles settlement and slashing. It does not, and should not, handle real-time matching of milliseconds-old requests to specific GPUs across the planet. For that we have Routers.

7.1 What a Router Does

Maintains a real-time index of registered Providers, their advertised models (Hot Pool membership), recent latency, recent error rates, and current load.
Receives developer requests via standard HTTPS (with API-key auth backed by an on-chain registry).
Picks the best Provider for the request using a scoring function (more on this below).
Stream-proxies the response back to the developer.
Signs and submits the settled job ticket to the JobRegistry contract within the next batch window.

7.2 The Scoring Function (Plain English Version)

For each candidate Provider, the Router computes:

score = (model_match_quality) * (geographic_proximity_factor) * (provider_reputation) * (1 / current_load) * (1 / advertised_price_per_token) * (recent_uptime_fraction)

The exact weights are governance parameters. The point is: cheaper, closer, more reliable Providers naturally win more traffic. A Provider can't just undercut on price and dominate — they also need uptime and a clean reputation history. Conversely, a Provider with great uptime and reputation can charge a small premium and still win.
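As a rough illustration, the scoring function translates almost directly into code. The sketch below uses the unweighted product (the governed weights are omitted), assumes each quality factor is normalized to [0, 1], and treats `current_load` as a positive number such as queued requests plus one. The `Candidate` type and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    provider_id: str
    model_match_quality: float          # 0..1
    geographic_proximity_factor: float  # 0..1
    provider_reputation: float          # 0..1
    current_load: float                 # > 0, e.g. queued requests + 1
    advertised_price_per_token: float   # > 0, in USDC
    recent_uptime_fraction: float       # 0..1


def score(c: Candidate) -> float:
    """Unweighted product of the six factors from the formula above."""
    if c.current_load <= 0 or c.advertised_price_per_token <= 0:
        raise ValueError("load and price must be positive")
    return (c.model_match_quality
            * c.geographic_proximity_factor
            * c.provider_reputation
            * (1.0 / c.current_load)
            * (1.0 / c.advertised_price_per_token)
            * c.recent_uptime_fraction)


def pick_provider(candidates: list[Candidate]) -> Candidate:
    """Return the highest-scoring candidate."""
    return max(candidates, key=score)
```

With these inputs, a Provider charging 4e-7 USDC per token at reputation 0.4 and 80% uptime loses to one charging 5e-7 at reputation 0.95 and 99.9% uptime, which is the "can't just undercut on price" behavior described above.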

7.3 Are Routers Trusted?

Routers are not trusted with funds — they never custody developer USDC. They are trusted with routing decisions, in the sense that a malicious Router could systematically favor a specific Provider to extract bribes. We mitigate this in three ways:

Routers stake $INFR and can be slashed for provably biased routing (statistical anomaly detection across the whole network).
Multiple Routers compete. A developer's SDK can use any Router; if one is unreliable or expensive, the SDK fails over.
Routers publish their scoring function and recent decisions in a tamper-evident log, anchored on-chain.

Long-term, Routers may become further decentralized via a small consensus protocol (think DA committees), but the v1 model — multiple staked, mutually competitive Routers — is sufficient for launch and avoids overengineering.

7.4 Latency Numbers We Care About

Phase	Target P50	Target P99	Notes
DNS + TLS to Router	20 ms	60 ms	Standard CDN/Anycast
Router → Provider dispatch	10 ms	40 ms	Geographic routing
Time-to-first-token (TEE, hot pool, 70B model)	350 ms	900 ms	Comparable to hyperscaler
Per-token throughput (70B, single H100)	60 tok/s	—	Driven by hardware
Settlement finality (Tier 1)	~60 s	~120 s	Next batch + Base block time
Settlement finality (Tier 2)	~150 s	~300 s	Includes challenge window

8. The $INFR Token

We will be honest about this section. Most token-economic sections in whitepapers are written to look complicated and inevitable. Most of them are neither. We've tried to keep ours boring, useful, and grounded in actual demand the protocol generates.

8.1 What $INFR Is For

Staking. Providers, Routers, and Validators must stake $INFR to participate. This is real, locked utility.
Slashing collateral. Bad behavior burns or reallocates $INFR.
Governance. Parameter changes, upgrades, treasury allocation.
Protocol fee buyback. A portion of protocol fees (paid in USDC) is used to buy and burn $INFR, linking real usage to token supply.

Importantly, developers do not need to hold $INFR to use the network. They pay in USDC. This is deliberate. Forcing customers to hold a volatile token to use your product is a tax on adoption, and adoption is the only thing that matters in year one.

8.2 Supply

Total supply: 1,000,000,000 INFR, fixed. No inflation. No hidden mints. Supply schedule below.

Allocation	Amount	% of Supply	Vesting
Ecosystem Incentives (provider bootstrap, validator rewards, grants)	350,000,000	35%	Released over 5 years on a programmatic curve tied to network usage
Core Team & Future Hires	180,000,000	18%	12-month cliff, 36-month linear vest
Treasury (governance-controlled)	150,000,000	15%	Locked 12 months, then governance-released
Investors	150,000,000	15%	12-month cliff, 24-month linear vest
Public Launch (LBP / community sale)	80,000,000	8%	Unlocked at TGE
Liquidity & Market Making	50,000,000	5%	Locked in protocol-owned liquidity
Foundation Operating Reserve	40,000,000	4%	Released as needed for ops, audited annually
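The cliff-plus-linear schedules in the table can be read as a simple function of time. The sketch below assumes that nothing unlocks during the cliff and that the linear period starts when the cliff ends; the table does not settle this, so treat that reading as an assumption. It works in whole months.

```python
def vested(total: int, months_since_tge: int, cliff: int, linear: int) -> int:
    """Tokens unlocked after `months_since_tge` whole months.

    Assumes zero unlock during the cliff, then a straight line over
    `linear` months starting at the end of the cliff.
    """
    if linear <= 0:
        raise ValueError("linear period must be positive")
    if months_since_tge < cliff:
        return 0
    elapsed = min(months_since_tge - cliff, linear)
    return total * elapsed // linear
```

Under this reading, the Core Team allocation (12-month cliff, 36-month vest) is half unlocked at month 30 and fully unlocked at month 48; the Investor allocation (12-month cliff, 24-month vest) is fully unlocked at month 36.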

8.3 Buyback-and-Burn Mechanics

Of the 1.5% protocol fee on every settled job:

0.5% goes to the Treasury (governance-controlled, USDC).
0.5% goes to Validator rewards (paid in USDC).
0.5% goes to a public-market USDC → INFR buyback executed weekly via a TWAP. Bought INFR is sent to a burn address.

This means: if INFERA does $1B in annual inference volume, roughly $5M of $INFR is bought and burned per year. Whether that's bullish for the token depends entirely on whether the network achieves volume. Which it should, if the product is good.
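The arithmetic behind that figure, as a short sketch. It works in whole USDC and basis points; the function name is illustrative.

```python
def protocol_fee_split(volume_usdc: int) -> dict[str, int]:
    """Split the 1.5% protocol fee into its three 0.5% buckets."""
    if volume_usdc < 0:
        raise ValueError("volume must be non-negative")
    bucket = volume_usdc * 50 // 10_000  # 0.5% = 50 basis points
    return {"treasury": bucket, "validators": bucket, "buyback": bucket}
```

At $1B of annual volume each bucket is $5M, so the total protocol fee is $15M and the buyback share is the $5M quoted above.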

8.4 What $INFR Is Not

It is not a security under our analysis (consult your own counsel — this is not legal advice). It is not yield-bearing by default; staking yields come from real protocol fees and validation rewards, not from inflation. It is not a payment token for end users. It is, fundamentally, a governance and security-bond token tied to a real, productive network. We hope it's also a good investment over time, but that's a function of execution, not promises.

9. Developer Experience

If you have ever called the OpenAI API, you can use INFERA in less than five minutes. The SDK is intentionally a drop-in replacement.

9.1 The Five-Minute Onboarding

Connect a wallet (or create a hosted account with email + Coinbase Commerce on-ramp).
Deposit USDC. Anything from $5 to $50,000. Receive an API key bound to your account.
Point your code at https://api.infera.network/v1. Done.

Python (using the openai library):

from openai import OpenAI

client = OpenAI(
    api_key="infera_sk_live_...",
    base_url="https://api.infera.network/v1",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Hello, decentralized world."}],
    extra_headers={"X-Infera-Tier": "tee"},  # optional: tee | optimistic | zkml
)

print(response.choices[0].message.content)

TypeScript:

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.INFERA_API_KEY,
  baseURL: "https://api.infera.network/v1",
});

const stream = await client.chat.completions.create({
  model: "mistralai/Mixtral-8x22B-Instruct",
  messages: [{ role: "user", content: "Stream me something useful." }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}

9.2 Supported Models at Launch

We launch with the following open-weight models hot-loaded across the Provider network. Adding a new model is a governance proposal that any community member can submit, with hot-pool incentives auto-activating once approved.

Llama 3.1 — 8B, 70B, 405B Instruct
Mistral / Mixtral families (7B, 8x7B, 8x22B)
Qwen 2.5 (7B, 32B, 72B)
DeepSeek-V3 and DeepSeek-R1
Stable Diffusion XL, FLUX.1
Whisper-large-v3 (transcription)
BAAI bge embedding family
All other major open-weight models added on community demand

9.3 Native Web3 Features

Beyond the OpenAI-compatible surface, INFERA offers Web3-native features that hyperscalers literally cannot:

Pay-per-call from a smart contract: Your on-chain agent can call INFERA directly via a precompiled adapter, no off-chain relayer needed.
Verifiable inference receipts: Every response includes a cryptographic receipt your dApp can post on-chain to prove an AI output is real.
Privacy-preserving inference: TEE mode means even the Provider can't see your prompts or outputs.
Streaming usage logs: Every job is logged on-chain. No vendor controls your usage history; no surprise bills.

10. Provider Onboarding

This section is for the people with the GPUs. If you are reading this and you have an unused 4090, 5090, A100, H100, or anything Apple-Silicon-fancy, this is for you.

10.1 Hardware Tiers

Tier	Example Hardware	Suitable For	Approx. Earnings (USD/day at 80% util)
Consumer	RTX 4090, RTX 5090, M3 Ultra	Models up to ~13B params, embeddings, image generation	$8 – $25
Prosumer	RTX 6000 Ada, A100 40GB, dual 4090s	Up to ~70B with quantization	$30 – $90
Datacenter	H100, H200, B100, B200, MI300X	Full-fat 70B+ at FP16, 405B with multi-GPU	$150 – $600
Confidential Datacenter	H100/H200 in CC mode, AMD MI300X with SEV-SNP	Tier-1 verified inference at premium rates	$200 – $800

Numbers are illustrative. Actual earnings depend on utilization, your bid price, your model loadout, and how much you stake. The honest version: at network maturity, well-run datacenter-class providers should comfortably outperform spot-market rentals on equivalent hardware. Consumer-tier providers will earn meaningful pocket money but probably not retire.

10.2 The Provider Stack

We ship a Docker-Compose-based Provider node that runs alongside your existing GPU workloads. It includes:

infera-agent: registers with the network, signs job tickets, manages the model cache.
vLLM, TGI, or SGLang as the inference runtime (your choice; we benchmark all three).
TEE attestation helpers for compatible hardware.
Optional: zkML proof generation worker.
Prometheus metrics, Grafana dashboard, the works.

Quick start:

# Pull and run the provider node
curl -sSL https://get.infera.network/install.sh | bash

# Initialize with your wallet
infera init --wallet 0xYourAddress... --models llama-3.1-70b,mixtral-8x22b

# Stake (you'll be prompted for amount; minimum varies by claimed throughput)
infera stake

# Go live
infera serve

# Check earnings
infera earnings --since 7d

10.3 Operational Reality Check

Running a Provider is real ops work. You need a reliable internet connection (gigabit recommended; symmetric upload genuinely matters), a stable power situation, decent cooling, and the willingness to actually monitor a service. We have automated as much as possible — the agent handles model downloads, hot-pool management, graceful shutdown, etc. — but if your home internet drops twice a day, your reputation score will reflect that, and the network will route around you.

Conversely, if you run a tight ship, the network will reward you. Reputation compounds. Long-tenured Providers with good track records get the cushiest jobs.

11. Governance

We are skeptical of governance theater. Most "DAOs" are either disguised oligarchies, ungovernable mobs, or both. We have tried to design something that is actually functional.

11.1 What Governance Controls

Protocol fee percentages (within bounded ranges)
Slashing parameters and minimum stakes
Adding/removing supported models from the official registry
Treasury allocation
Smart contract upgrades (always behind a 7-day timelock)
Whitelisting new TEE vendor root keys
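
As a rough illustration of what "within bounded ranges" and the 7-day timelock imply, the Python sketch below validates a proposed parameter against hard bounds and computes the earliest time a queued upgrade could execute. The class name, the function names, and the example fee bounds (0.5% to 5%) are hypothetical; this whitepaper does not state the actual ranges, and the real checks would live in the on-chain Governor and timelock contracts rather than in Python.

```python
from dataclasses import dataclass

# The 7-day timelock on contract upgrades is stated in the list above.
UPGRADE_TIMELOCK_SECONDS = 7 * 24 * 60 * 60


@dataclass(frozen=True)
class BoundedParam:
    """A governable parameter with hard bounds that a vote cannot exceed."""
    name: str
    lower: float
    upper: float

    def validate(self, proposed: float) -> float:
        """Return the proposed value if it is in range, otherwise raise."""
        if not (self.lower <= proposed <= self.upper):
            raise ValueError(
                f"{self.name}={proposed} is outside [{self.lower}, {self.upper}]"
            )
        return proposed


def earliest_upgrade_execution(queued_at_unix: int) -> int:
    """Earliest Unix timestamp at which a queued upgrade may execute."""
    return queued_at_unix + UPGRADE_TIMELOCK_SECONDS


# Hypothetical bounds, for illustration only.
protocol_fee = BoundedParam("protocol_fee", lower=0.005, upper=0.05)
```

A proposal to set the fee to 2% would pass validation under these assumed bounds, while a proposal for 20% would be rejected before it ever reached a vote.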

11.2 What Governance Does Not Control

The total supply of $INFR (immutably 1B)
Slashing a specific user retroactively (no clawbacks)
Custody of developer funds (they cannot be reassigned by vote)
Censorship of specific addresses or models (the protocol is permissionless)

11.3 Voting Mechanics

Standard OpenZeppelin Governor with a few INFERA-specific tweaks:

Vote-escrowed $INFR ("veINFR"): you can lock INFR for up to 4 years for boosted voting power. Lockers also get amplified rewards from the Validator program.
Quadratic dampening: voting power is sqrt(veINFR) for parameter changes (not for upgrades, where flat voting applies). This reduces whale dominance for routine governance.
Quorum: 4% of circulating supply for parameter changes; 8% for upgrades.
All votes are cast on Base through the standard OZ Governor contracts, with Tally as the voting UI.
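
The dampening and quorum rules above can be sketched in a few lines of Python. This is a minimal illustration, not the Governor implementation: the function names and example balances are hypothetical, and the sketch does not model how veINFR is derived from lock duration or how dampened votes count toward quorum, since neither is specified here.

```python
import math

# Quorum in basis points of circulating supply, per the list above:
# 4% for parameter changes, 8% for upgrades.
QUORUM_BPS = {"parameter": 400, "upgrade": 800}


def voting_power(ve_infr: float, proposal_kind: str) -> float:
    """Square-root voting for parameter changes, flat voting for upgrades."""
    if ve_infr < 0:
        raise ValueError("veINFR balance cannot be negative")
    if proposal_kind == "parameter":
        return math.sqrt(ve_infr)
    if proposal_kind == "upgrade":
        return float(ve_infr)
    raise ValueError(f"unknown proposal kind: {proposal_kind!r}")


def quorum_tokens(circulating_supply: int, proposal_kind: str) -> int:
    """Quorum threshold in whole tokens, using integer arithmetic."""
    return circulating_supply * QUORUM_BPS[proposal_kind] // 10_000


# Hypothetical veINFR balances for a large and a small holder.
whale, small = 1_000_000, 10_000
flat_ratio = voting_power(whale, "upgrade") / voting_power(small, "upgrade")
damped_ratio = voting_power(whale, "parameter") / voting_power(small, "parameter")
```

With these balances, the large holder's advantage shrinks from 100x under flat voting to 10x under the square-root rule, which is the whale-dampening effect described above.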

11.4 The Foundation

INFERA Foundation is a Cayman foundation company that holds the protocol's intellectual property and treasury. The Foundation is not a substitute for governance — it executes governance decisions and handles legal/operational matters that do not lend themselves to on-chain votes (like signing a contract with an audit firm). The Foundation's annual budget is approved by $INFR governance.

12. Roadmap

This is what we are building, in order. The honest subtext: software estimates are aspirational, and ours are no exception. We commit to public retrospectives every quarter regardless of how the quarter went.

12.1 Phase 0 — Devnet (Q2 2026)

Smart contracts on Base Sepolia
Provider client and Router reference implementation
Closed alpha with ~50 selected developers and ~20 selected providers
OpenAI-compatible API working end-to-end with TEE attestation on H100

12.2 Phase 1 — Mainnet Launch (Q4 2026)

Audited contracts on Base mainnet
Public Provider onboarding (permissionless after 30-day allowlist warmup)
Public developer access; deposits in USDC
$INFR TGE concurrent with mainnet
Tier 1 (TEE) and Tier 2 (Optimistic) verification live
Initial supported models: Llama 3.1 family, Mixtral, Qwen 2.5, SDXL

12.3 Phase 2 — Scale and zkML (2027)

Tier 3 zkML support for models up to 13B params (real-time) and 70B (batch)
Multi-chain deployment: Arbitrum, Optimism, Polygon zkEVM
Native fiat on-ramps for non-crypto-native developers
Enterprise SLAs: dedicated capacity, BYO model, private deployments
Provider mobile app (yes, monitor your stake from the bus)

12.4 Phase 3 — Full Verifiability (2028+)

zkML real-time for 70B+ models (depends on proving system advances)
Cross-chain settlement via canonical bridges and CCIP
Native support for fine-tuned and proprietary models with encrypted weights
Decentralization of Routers via consensus protocol
Provider hardware federation: pool small operators behind a single interface

13. Risks and Limitations

We will not pretend this part doesn't exist. Every protocol whitepaper has a risks section. Most of them say nothing useful. We've tried to write one that is actually useful — both to potential users and to ourselves, as a forcing function for thinking clearly about what could break.

13.1 Technical Risks

TEE compromise. If a hardware vendor's signing keys leak, Tier 1 verification breaks for that vendor until keys rotate. We mitigate by supporting multiple TEE vendors and falling back to Tier 2 (Optimistic) for affected providers.
zkML proving cost. Tier 3 cost projections assume continued progress in zkML systems. If progress stalls, Tier 3 remains a niche premium product longer than expected. The protocol still works; it just has a smaller premium tier.
Smart contract bugs. We will be audited by at least two top-tier firms before mainnet, run a Code4rena contest, and maintain a continuous bug bounty (up to $1M for critical issues). None of this guarantees zero bugs. Insurance protocols (Nexus Mutual, etc.) will be supported.
Determinism for fraud proofs. Tier 2 requires bit-identical reproducibility. Some inference stacks have non-deterministic kernels. We pin versions tightly, but edge cases will surface and require fixes.
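
To make the bit-identical requirement concrete, here is a minimal Python sketch of the kind of comparison a Validator could run: hash the Provider's output tokens, re-execute the job, and challenge if the digests differ. The commitment format (SHA-256 over big-endian 32-bit token IDs) and the names output_digest and should_challenge are assumptions for illustration; this section does not specify the actual scheme.

```python
import hashlib
import struct
from typing import Iterable


def output_digest(token_ids: Iterable[int]) -> str:
    """SHA-256 over token IDs, each packed as an unsigned big-endian 32-bit int."""
    h = hashlib.sha256()
    for token_id in token_ids:
        h.update(struct.pack(">I", token_id))
    return h.hexdigest()


def should_challenge(claimed_digest: str, recomputed_token_ids: Iterable[int]) -> bool:
    """True if the Validator's re-execution does not match the Provider's claim."""
    return output_digest(recomputed_token_ids) != claimed_digest
```

A single differing token changes the digest, which is why a non-deterministic kernel that flips even one token under identical inputs would trigger a false challenge against an honest Provider.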

13.2 Economic and Market Risks

Hyperscaler price wars. AWS, Azure, and GCP could cut inference prices aggressively. Our cost advantage shrinks but does not disappear — we still aggregate idle hardware they cannot access.
Cold-start chicken-and-egg. Networks like ours need both supply and demand from day one. We address this with Phase 0 incentives funded by the ecosystem allocation; the risk is that incentives expire before organic flywheels start.
$INFR price volatility. Provider stakes are denominated in $INFR. A sharp drop in token price reduces the dollar value of the collateral available for slashing. We address this with USDC-denominated minimums recalculated weekly.
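
The weekly recalculation of USDC-denominated minimums amounts to simple arithmetic, sketched below in Python. The function names, the 5,000 USDC minimum, and the token prices are hypothetical values chosen for illustration; the price source and the exact rounding rule are not specified in this whitepaper.

```python
from decimal import Decimal, ROUND_CEILING


def min_stake_infr(min_stake_usdc: Decimal, infr_price_usdc: Decimal) -> int:
    """INFR tokens needed to cover a USDC-denominated minimum, rounded up."""
    if infr_price_usdc <= 0:
        raise ValueError("token price must be positive")
    tokens = (min_stake_usdc / infr_price_usdc).to_integral_value(
        rounding=ROUND_CEILING
    )
    return int(tokens)


def is_undercollateralized(
    staked_infr: int, min_stake_usdc: Decimal, infr_price_usdc: Decimal
) -> bool:
    """True if the current stake no longer meets the recalculated minimum."""
    return staked_infr < min_stake_infr(min_stake_usdc, infr_price_usdc)


# Hypothetical figures: a 5,000 USDC minimum at two different token prices.
MIN_USDC = Decimal("5000")
at_high_price = min_stake_infr(MIN_USDC, Decimal("0.50"))
at_low_price = min_stake_infr(MIN_USDC, Decimal("0.20"))
```

Under these assumed numbers, a Provider who staked exactly the minimum at the higher price (10,000 tokens) would need 25,000 tokens after the drop, so the weekly recalculation would flag the stake as short.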

13.3 Regulatory Risks

Token classification. Securities law in the US and elsewhere is evolving. We have structured $INFR as a utility/governance token with significant in-protocol use, and have engaged outside counsel in multiple jurisdictions, but we cannot guarantee how regulators will act.
AI regulation. The EU AI Act, US state-level rules, and emerging international frameworks may impose obligations on inference providers. We are designing the protocol to support compliance flags (e.g., region restrictions, content filtering at the SDK layer) without compromising permissionlessness at the base layer.

13.4 Misuse Risks

A permissionless inference network can, in principle, serve harmful workloads. We take this seriously. Mitigations:

SDK-level safety filters that developers can opt into (with transparent allow/block logs).
Provider-side opt-out: any Provider can refuse jobs matching their declared content policy. Their reputation is unaffected by such refusals.
Standard CSAM hash detection at the input/output level for image models, run on-Provider. Non-negotiable.
Compliance with valid legal process directed at the Foundation, recognizing the Foundation does not control individual Providers.

We are not naive about the tension between permissionlessness and safety. We will continue to publish our thinking and update mechanisms publicly.

14. Conclusion

Three propositions sit underneath this entire document:

AI inference is becoming the largest compute workload in human history, and concentrating it in three or four cloud providers is bad for prices, bad for innovation, and bad for resilience.
There is enormous idle GPU capacity in the world, and pulling it into a coherent global market is a tractable engineering problem — not a moonshot.
Verifiability — the ability to cryptographically prove an inference was honest — is the missing piece that turns "random GPUs talking to each other" into a real, trust-minimized marketplace.

INFERA is our attempt to make all three of those true at once. We've designed it to work today (with TEE attestation and OpenAI-compatible APIs), to scale tomorrow (with zkML as it matures), and to remain useful forever (because the underlying need for cheap, trustworthy inference is not going anywhere).

If you're a developer: come build on it. The first $50 of inference is on us.

If you have GPUs: come provide. We have made onboarding painless on purpose.

If you're a researcher: come argue with us. The hardest problems in this design are open, and we publish.

And if you're someone who has read this far on a Tuesday: thank you. That's a real investment of attention. We hope the protocol returns it.

— The INFERA Core Contributors

Base · Ethereum · Open Source · April 2026

Appendix A — Glossary

Term	Plain-English Definition
Inference	Running a trained AI model on input to produce output. The everyday work of an AI.
TEE	Trusted Execution Environment. Hardware-enforced secure enclave that can prove what code ran inside it.
zkML	Zero-Knowledge Machine Learning. Mathematically proving an AI ran correctly without re-running it.
Optimistic Verification	Assume honest by default; allow challenges within a window; punish whoever was wrong.
TTFT	Time-to-first-token. How long after you press enter before the AI starts replying.
Hot Pool	Set of Provider nodes with a model already loaded and ready to serve.
Slashing	Losing some or all of your staked tokens for misbehavior. Existential, intentionally.
Router	Off-chain matchmaker that pairs developer requests with the best Provider in milliseconds.
Validator	Independent node that re-checks Provider outputs and challenges bad ones.
L2 / Rollup	A blockchain that runs on top of Ethereum, inheriting its security but with cheaper fees.
UUPS Proxy	An upgrade pattern for smart contracts. Lets the team fix bugs without breaking user state.

Appendix B — Selected References

This is not an academic paper, but credit where due. The architecture above stands on the shoulders of:

EZKL, Risc Zero, and the broader zkML community for proving-system advances.
Akash Network and Render Network, for early demonstrations that decentralized compute markets can work at scale.
Bittensor, for the cleanest existing thinking on incentivized model serving.
Optimism and Arbitrum, for fraud-proof system designs we have shamelessly learned from.
OpenZeppelin, for governance and access-control primitives that just work.
vLLM, SGLang, and TGI, for making high-throughput LLM serving a solved problem.
NVIDIA, AMD, and Intel TEE engineering teams, for hardware-level confidential computing.

Read more, contribute, or yell at us at: infera.network · github.com/infera-protocol

— End of Whitepaper · v1.0 —
