We ran 842 tasks across 6 AI providers. Here's what the data says.

March 2026 — No synthetic benchmarks. Real spec writing, real implementations, real code reviews. Thompson Sampling picked the provider. The data picked the winner.

The setup

We needed to run hundreds of tasks through an automated pipeline: write specs, implement features, write tests, review code. Instead of picking one AI provider, we asked: what if the system learned which provider works best for each task type?

Six providers, all running on the same machine, same codebase, same task descriptions. Thompson Sampling — a classic multi-armed bandit algorithm — selected which provider got each task based on historical success rates, with a recency bias so the system reacts quickly to changes.

The providers

Provider	Success rate	Runs	Avg duration
Claude Code	96%	108	121s
Cursor Agent	96%	94	125s
OpenAI Codex	91%	169	38s
Gemini CLI	83%	35	214s
Ollama Local	100%	18	294s
Ollama Cloud	100%	15	8s

What surprised us

Codex is the speed champion

At 38 seconds average with 91% success, Codex handled the most volume. It got 169 runs because Thompson Sampling kept selecting it: fast + reliable = high selection probability. Its failures were mostly CLI argument issues, which we fixed and which never recurred.

Ollama Cloud was the sleeper

8 seconds average. 100% success. GLM-5 via Ollama Cloud was the fastest provider by far. It got fewer runs (15) because it started with no data and Thompson Sampling needed time to discover its quality. But once it had 5+ samples, its selection probability climbed rapidly.

Gemini needed a one-flag fix

Gemini had 0% success for its first 6 runs. All timeouts. No output captured. We were about to write it off. Then we discovered the root cause: the -y flag (auto-approve tool use) was missing. Without it, Gemini would try to use tools, wait for interactive approval that never came, and hang until timeout. One flag. 0% to 83%.

False positives are worse than failures

Ollama and OpenRouter reported "success" on implementation tasks — but they have no tools. They generated confident text describing the files they "created" without actually creating anything. We added git-diff validation: after every impl/spec/test task, the runner checks if files actually changed. Text-only providers are now restricted to review tasks where text output IS the deliverable.
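
A minimal sketch of that kind of check, under stated assumptions: the function names changed_files and validate_task are hypothetical, the working tree is assumed to be clean before the task starts, and the sketch uses git status --porcelain rather than git diff so that newly created, untracked files also count as changes. It is not the runner's actual code.

```python
import subprocess

# Task types whose deliverable is files on disk (names taken from the text above).
FILE_PRODUCING_TASKS = {"impl", "spec", "test"}

def changed_files(repo_dir):
    """Return paths that git reports as modified, added, deleted, or untracked."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    # Porcelain lines look like "XY path": two status characters, a space, the path.
    return [line[3:] for line in result.stdout.splitlines() if line.strip()]

def validate_task(repo_dir, task_type, reported_success):
    """Downgrade a reported success to a failure when no files changed."""
    needs_files = task_type in FILE_PRODUCING_TASKS
    if reported_success and needs_files and not changed_files(repo_dir):
        return False  # the provider described work it never did
    return reported_success
```

Review tasks pass through unchanged, since their deliverable is the text itself.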

Timeouts should be data-driven

We started with a flat 300-second timeout for everything. But Codex finishes specs in 20 seconds while Claude needs 180 seconds for complex implementations. Now each provider gets a timeout of 2.5x its p90 duration, per task type. A Codex spec gets 50 seconds. A Claude impl gets 450 seconds. Tight enough to catch real hangs, loose enough to not kill slow-but-working tasks.
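
The rule above can be sketched in a few lines. The duration samples below are made up for illustration, the helper names p90 and timeout_for are hypothetical, and the nearest-rank percentile is one reasonable choice among several. The quoted 50-second and 450-second timeouts are consistent with this rule if the 20-second and 180-second figures are read as p90 values.

```python
def p90(durations):
    """Nearest-rank 90th percentile of a non-empty list of durations."""
    ordered = sorted(durations)
    rank = (9 * len(ordered) + 9) // 10  # integer form of ceil(0.9 * n)
    return ordered[rank - 1]

def timeout_for(durations, multiplier=2.5, fallback=300.0):
    """Timeout in seconds: multiplier x p90, or the flat fallback with no data."""
    if not durations:
        return fallback
    return multiplier * p90(durations)

# Hypothetical duration history in seconds, keyed by (provider, task type).
history = {
    ("codex", "spec"): [12, 14, 15, 16, 17, 18, 18, 19, 20, 35],
    ("claude", "impl"): [90, 110, 120, 130, 140, 150, 160, 170, 180, 240],
}

timeouts = {key: timeout_for(samples) for key, samples in history.items()}
# timeouts[("codex", "spec")] is 50.0 and timeouts[("claude", "impl")] is 450.0
```

Using p90 rather than the maximum keeps a single slow outlier (the 35-second and 240-second samples here) from inflating the timeout.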

How Thompson Sampling works here

Each provider is a "slot" in a multi-armed bandit. For every task, the system draws a random sample from each provider's Beta distribution (shaped by its success/failure history) and picks the highest draw. This naturally balances exploration (trying under-sampled providers) with exploitation (favoring proven winners).
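
Here is a minimal sketch of that selection step. It assumes a uniform Beta(1, 1) prior and looks only at success and failure counts. The function name pick_provider is hypothetical, and the counts are rough back-calculations from the table above plus one provider with no history, so treat it as an illustration of the technique rather than the project's code.

```python
import random

def pick_provider(history, rng=random):
    """Draw once from each provider's Beta posterior and return the top draw.

    history maps provider name -> (successes, failures).
    """
    draws = {
        name: rng.betavariate(successes + 1, failures + 1)  # Beta(1, 1) prior
        for name, (successes, failures) in history.items()
    }
    return max(draws, key=draws.get)

# Approximate counts for illustration only.
history = {
    "claude": (104, 4),
    "codex": (154, 15),
    "gemini": (29, 6),
    "new-provider": (0, 0),  # no data yet: its draws are uniform on [0, 1]
}

rng = random.Random(0)
picks = [pick_provider(history, rng) for _ in range(1000)]
counts = {name: picks.count(name) for name in history}
```

A provider with no history still wins the occasional draw, which is how an untested provider gets its first runs.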

We added recency weighting: the last 5 runs count for 60% of the signal, all-time history for 40%. This means if a provider degrades (rate limit hit, API change, model update), the system reacts within a few runs instead of being anchored by old data.
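
One way to read the 60/40 split is as a weighted average of two success rates. The sketch below assumes exactly that; the function name blended_success_rate is hypothetical, and how the blended value feeds back into the Beta distribution is not specified above, so that step is left out.

```python
def blended_success_rate(outcomes, recent_n=5, recent_weight=0.6):
    """Blend the success rate of the last recent_n runs with the all-time rate.

    outcomes is a list of booleans, oldest first. Returns None with no data.
    """
    if not outcomes:
        return None
    all_time_rate = sum(outcomes) / len(outcomes)
    recent = outcomes[-recent_n:]
    recent_rate = sum(recent) / len(recent)
    return recent_weight * recent_rate + (1 - recent_weight) * all_time_rate

# A provider with 45 straight successes that then fails 5 times in a row.
outcomes = [True] * 45 + [False] * 5
score = blended_success_rate(outcomes)
# all-time rate is 0.90, but the blended score drops to roughly 0.36
```

Five bad runs are enough to pull the score well below the all-time rate, which is the fast reaction described above.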

The infrastructure

Everything runs through a single abstraction called SlotSelector. It works for provider selection, prompt variant testing, model selection within a provider — any decision point where you want data to pick the winner instead of a human.

Measurements are stored locally per node and pushed to a federation hub. Multiple machines can run tasks independently, and the hub aggregates their data. A Mac running 6 providers and a VPS running 2 providers both contribute to the same picture.
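
As a rough sketch of the aggregation idea, assume each node reports per-provider success and failure counts and the hub simply sums them. The report shape and the function name aggregate are assumptions for illustration, not the hub's actual protocol.

```python
def aggregate(node_reports):
    """Sum per-provider (successes, failures) counts across node reports."""
    totals = {}
    for report in node_reports:
        for provider, (successes, failures) in report.items():
            prev_s, prev_f = totals.get(provider, (0, 0))
            totals[provider] = (prev_s + successes, prev_f + failures)
    return totals

# Hypothetical reports from two nodes.
mac_report = {"codex": (10, 1), "claude": (8, 0)}
vps_report = {"codex": (5, 0), "gemini": (2, 1)}

combined = aggregate([mac_report, vps_report])
# combined == {"codex": (15, 1), "claude": (8, 0), "gemini": (2, 1)}
```

Because counts add, nodes can report in any order and the combined picture comes out the same.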

Try it yourself

The entire system is open source. Clone the repo, run the local runner, and your machine joins the network. Thompson Sampling starts learning from your providers immediately.

git clone --recurse-submodules https://github.com/seeker71/Coherence-Network.git
cd Coherence-Network/api
pip install -e .
python scripts/local_runner.py --timeout 300

Auto-detects your providers. No config needed.

Or install the skill in any agent that supports the AgentSkills standard:

clawhub install coherence-network

The data is at coherencycoin.com/automation. The ideas are at coherencycoin.com/ideas. The code is at github.com/seeker71/Coherence-Network.
