We ran 842 tasks across 6 AI providers. Here's what the data says.
March 2026 — No synthetic benchmarks. Real spec writing, real implementations, real code reviews. Thompson Sampling picked the provider. The data picked the winner.
The setup
We needed to run hundreds of tasks through an automated pipeline: write specs, implement features, write tests, review code. Instead of picking one AI provider, we asked: what if the system learned which provider works best for each task type?
Six providers, all running on the same machine, same codebase, same task descriptions. Thompson Sampling — a classic multi-armed bandit algorithm — selected which provider got each task based on historical success rates, with a recency bias so the system reacts quickly to changes.
The providers
| Provider | Success rate | Runs | Avg duration |
|---|---|---|---|
| Claude Code | 96% | 108 | 121s |
| Cursor Agent | 96% | 94 | 125s |
| OpenAI Codex | 91% | 169 | 38s |
| Gemini CLI | 83% | 35 | 214s |
| Ollama Local | 100% | 18 | 294s |
| Ollama Cloud | 100% | 15 | 8s |
What surprised us
Codex is the speed champion
At 38 seconds average with 91% success, Codex handled the most volume. It got 169 runs because Thompson Sampling kept selecting it: fast + reliable = high selection probability. Its failures were mostly CLI argument issues; once we fixed them, they never recurred.
Ollama Cloud was the sleeper
8 seconds average. 100% success. GLM-5 via Ollama Cloud was the fastest provider by far. It got fewer runs (15) because it started with no data and Thompson Sampling needed time to discover its quality. But once it had 5+ samples, its selection probability climbed rapidly.
Gemini needed a one-flag fix
Gemini had 0% success for its first 6 runs. All timeouts. No output captured. We were about to write it off. Then we discovered the root cause: the -y flag (auto-approve tool use) was missing. Without it, Gemini would try to use tools, wait for interactive approval that never came, and hang until timeout. One flag. 0% to 83%.
False positives are worse than failures
Ollama and OpenRouter reported "success" on implementation tasks — but they have no tools. They generated confident text describing the files they "created" without actually creating anything. We added git-diff validation: after every impl/spec/test task, the runner checks if files actually changed. Text-only providers are now restricted to review tasks where text output IS the deliverable.
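A minimal sketch of that kind of check, not the runner's actual code. It assumes the task ran inside a git working tree, and it uses `git status --porcelain` rather than `git diff` so that newly created, untracked files also count as changes. The function names and the `review` task label are hypothetical; `impl`, `spec` and `test` are the task types named above.

```python
import subprocess

# Task types whose deliverable is files on disk (from the text above).
FILE_PRODUCING_TASKS = {"impl", "spec", "test"}


def changed_paths(porcelain_output):
    """Parse `git status --porcelain` (v1) output into a list of paths.

    Each line is two status characters, a space, then the path.
    Renames appear as "old -> new" and are kept as a single entry here.
    """
    paths = []
    for line in porcelain_output.splitlines():
        if len(line) > 3:
            paths.append(line[3:])
    return paths


def workspace_changed(repo_dir):
    """Return True if the working tree in repo_dir has any changed or new files."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return bool(changed_paths(result.stdout))


def validate(task_type, reported_success, files_changed):
    """Downgrade a reported success when a file-producing task changed nothing."""
    needs_files = task_type in FILE_PRODUCING_TASKS
    return reported_success and (files_changed or not needs_files)
```

With this shape, `validate("impl", True, workspace_changed(repo_dir))` turns a confident but empty "success" into a failure, while a review task passes on its text output alone.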
Timeouts should be data-driven
We started with a flat 300-second timeout for everything. But Codex finishes specs in 20 seconds while Claude needs 180 seconds for complex implementations. Now each provider gets a timeout of 2.5x its p90 duration, per task type. A Codex spec gets 50 seconds. A Claude impl gets 450 seconds. Tight enough to catch real hangs, loose enough to not kill slow-but-working tasks.
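The arithmetic can be sketched as follows. This is an illustration under assumptions, not the project's code: the nearest-rank percentile method, the `min_samples` threshold, and falling back to the original flat 300 seconds when history is thin are all choices made for the example. The sample durations are made up.

```python
def p90(durations):
    """Nearest-rank 90th percentile of a non-empty list of durations."""
    ordered = sorted(durations)
    # ceil(0.9 * n) computed with integers to avoid float rounding.
    rank = (9 * len(ordered) + 9) // 10
    return ordered[rank - 1]


def timeout_for(durations, multiplier=2.5, fallback=300.0, min_samples=5):
    """Timeout in seconds for one (provider, task type) pair.

    Uses multiplier * p90 once there is enough history, and the flat
    fallback otherwise (min_samples is an assumed threshold).
    """
    if len(durations) < min_samples:
        return fallback
    return multiplier * p90(durations)


# Made-up spec durations for a fast provider, in seconds.
spec_durations = [12, 14, 15, 16, 17, 18, 18, 19, 20, 31]
spec_timeout = timeout_for(spec_durations)  # p90 is 20, so the timeout is 50.0
```

Using p90 rather than the maximum keeps a single outlier (the 31-second run here) from inflating the timeout.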
How Thompson Sampling works here
Each provider is a "slot" in a multi-armed bandit. For every task, the system draws a random sample from each provider's Beta distribution (shaped by its success/failure history) and picks the highest draw. This naturally balances exploration (trying under-sampled providers) with exploitation (favoring proven winners).
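A minimal sketch of that selection step, assuming a uniform Beta(1, 1) prior for every provider (the prior the real system uses is not stated here). The provider names and counts are illustrative, not the measured data.

```python
import random


def pick_provider(history, rng=random):
    """Pick a provider by Thompson Sampling.

    history maps provider name -> (successes, failures).
    """
    draws = {}
    for name, (successes, failures) in history.items():
        # Beta(1 + s, 1 + f): an assumed uniform prior updated by outcomes.
        draws[name] = rng.betavariate(1 + successes, 1 + failures)
    # The provider with the highest sampled success probability wins this task.
    return max(draws, key=draws.get)


# Illustrative counts only.
history = {"fast_reliable": (90, 10), "flaky": (10, 90), "new": (0, 0)}
rng = random.Random(0)
picks = [pick_provider(history, rng) for _ in range(1000)]
```

Over many draws the proven provider wins most tasks, the flaky one almost none, and the provider with no data still gets picked occasionally because its distribution is wide. That occasional pick is the exploration.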
We added recency weighting: the last 5 runs count for 60% of the signal, all-time history for 40%. This means if a provider degrades (rate limit hit, API change, model update), the system reacts within a few runs instead of being anchored by old data.
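One way to read the 60/40 split is as a weighted average of two success rates. The sketch below assumes exactly that; how the real system feeds the blended signal back into the sampling step is not described here, and the function name is hypothetical.

```python
def blended_success_rate(outcomes, window=5, recent_weight=0.6):
    """Blend the recent success rate with the all-time success rate.

    outcomes is a list of booleans, oldest first. The all-time rate
    includes the recent runs (an assumption of this sketch).
    """
    if not outcomes:
        return None
    all_time_rate = sum(outcomes) / len(outcomes)
    recent = outcomes[-window:]
    recent_rate = sum(recent) / len(recent)
    return recent_weight * recent_rate + (1 - recent_weight) * all_time_rate


# A provider with 45 straight successes that then fails 5 times in a row.
outcomes = [True] * 45 + [False] * 5
rate = blended_success_rate(outcomes)
```

In this example the all-time rate is still 0.90, but the blended rate drops to 0.6 × 0 + 0.4 × 0.9 = 0.36, which is why a degraded provider loses selection probability after only a few bad runs.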
The infrastructure
Everything runs through a single abstraction called SlotSelector. It works for provider selection, prompt variant testing, model selection within a provider — any decision point where you want data to pick the winner instead of a human.
Measurements are stored locally per node and pushed to a federation hub. Multiple machines can run tasks independently, and the hub aggregates their data. A Mac running 6 providers and a VPS running 2 providers both contribute to the same picture.
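As a rough illustration of the aggregation idea only: if each node reports per-provider success and failure counts, the hub can sum them. The report shape, key names, and function name below are assumptions for the example, not the hub's actual schema or API.

```python
from collections import Counter


def merge_node_reports(reports):
    """Sum per-provider outcome counts across node reports.

    Each report maps provider name -> {"success": int, "failure": int}.
    """
    totals = {}
    for report in reports:
        for provider, counts in report.items():
            # Counter.update adds counts instead of overwriting them.
            totals.setdefault(provider, Counter()).update(counts)
    return totals


# Made-up reports from two nodes.
mac_report = {
    "claude": {"success": 10, "failure": 1},
    "codex": {"success": 20, "failure": 2},
}
vps_report = {"codex": {"success": 5, "failure": 0}}
merged = merge_node_reports([mac_report, vps_report])
```

Here `merged["codex"]` ends up with 25 successes and 2 failures, so a provider seen on both machines is judged on the combined evidence.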
Try it yourself
The entire system is open source. Clone the repo, run the local runner, and your machine joins the network. Thompson Sampling starts learning from your providers immediately.
git clone https://github.com/seeker71/Coherence-Network.git
cd Coherence-Network/api
pip install -e .
python scripts/local_runner.py --timeout 300
Auto-detects your providers. No config needed.
Or install the skill in any agent that supports the AgentSkills standard:
clawhub install coherence-network
The data is at coherencycoin.com/automation. The ideas are at coherencycoin.com/ideas. The code is at github.com/seeker71/Coherence-Network.