IB-CODE-2026.2 · v2.1.0 · active

The Indie Operator Coding Rubric

Revision 2 of the Indie Operator Coding Rubric. It adds explicit evaluator-bias mitigations (cross-LLM scoring, bias-check tasks, a public methodology repo), a brief-adherence sub-dimension under First-pass correctness, a split of Pricing Reality into per-task cost and predictability, an IDE-vs-CLI scope statement, a default Error recovery score of 5 for clean first passes, and (v2.1.0) a mandatory cost report on every eval page in dollars per task, with token footnotes where knowable.

Supersedes IB-CODE-2026.1.

What changed from v1.0.0

Version 2 of the rubric incorporates findings from the first published evaluation (Claude Code under IB-CODE-2026.1) and a methodological audit of evaluator-bias surfaces.

Status: active. v1.0.0 is now superseded but remains published. Evaluations scored under v1.0.0 stay valid; re-scoring under v2.0.0 will land as an additional score block on each affected page.

Scope

This rubric evaluates AI coding tools accessed via CLI, SDK, or programmatic API — agentic / conversational coding modes that an autonomous operator can run without clicking inside an IDE. Tools in scope include: Claude Code, OpenAI Codex CLI, Cursor SDK / headless, Aider, OpenCode, Cline (via headless extension runtime), GitHub Copilot Coding Agent, Grok Build, Devin REST API.

Out of scope for this rubric: features that exist only inside an IDE interface (Cursor’s diff-review panel, Cline’s VS Code-native UI, IDE inline-completion behaviour, gesture-driven edits). Those will be scored under a forthcoming IB-CODE-IDE-2026.x rubric requiring a human-in-the-loop operator. Cursor evaluations under IB-CODE-2026.2 will carry a mandatory disclosure to readers (see Mandatory tool-specific disclosures).

Working conditions assumed: a solo or small-team operator building a SaaS, internal tool, or content product. Codebases between a single file and ~30K lines. Production stacks: TypeScript / Python / Go / Ruby; React / Next.js / Svelte / Astro; Postgres / SQLite; Stripe / Auth0 / Clerk / Supabase. Not enterprise-scale monorepos.

Scoring dimensions

Each task receives 0–5 on each dimension; weighted dimension scores roll up to a 0–100 task score; task scores average to the tool’s final score.

# | Dimension | Weight | What it measures
1a | First-pass correctness | 22% | Does the tool produce a working solution on the first attempt without operator intervention?
1b | Brief adherence (new) | 8% | Does the output match the brief, neither under-shooting nor over-shooting? Extra features the spec didn’t request are penalised, not rewarded.
2 | Error recovery | 20% | When the first attempt fails, can the tool diagnose and fix itself without the operator providing the diagnosis? Clean first passes score 5 by default.
3 | Context handling | 15% | On tasks requiring multi-file reading, in-repo convention discovery, or pattern-matching against existing code, does the tool actually do that?
4a | Token economy | 10% | API token cost or subscription throughput consumed per substantive task.
4b | Wall-clock time | 10% | Total elapsed time from prompt-issued to working-output-in-repo, including operator intervention time.
5a | Price-per-meaningful-task (changed) | 8% | At realistic indie-hacker volume (~300-800 substantive prompts/month), what does a typical task actually cost in dollars? Subscription tools, credit-based tools, and API-priced tools are all compared on the same dollar-per-task axis.
5b | Pricing predictability (changed) | 7% | Can the operator forecast monthly cost within a reasonable band? Penalises annual-only discounts that lock in untrusted tools, credit systems that obscure real cost, and free tiers that are functionally unusable.

Sub-dimensions are weighted such that the parent dimension’s total weight matches v1.0.0 — making v1.0.0 and v2.0.0 scores comparable at the parent-dimension level.
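The roll-up described above can be sketched in a few lines. This is a minimal illustration, not the rubric's reference implementation: the weights come from the dimension table, but the linear mapping from a 0–5 score to a share of each weight, and the function names `task_score` and `final_score`, are assumptions.

```python
# Weights (in percent) taken from the dimension table above.
WEIGHTS = {
    "1a": 22, "1b": 8, "2": 20, "3": 15,
    "4a": 10, "4b": 10, "5a": 8, "5b": 7,
}


def task_score(scores):
    """Roll 0-5 dimension scores up to a 0-100 task score.

    Assumption: each dimension contributes weight * (score / 5),
    i.e. a linear mapping. The rubric does not spell out the formula.
    """
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the rubric dimensions")
    for dim, value in scores.items():
        if not 0 <= value <= 5:
            raise ValueError(f"dimension {dim} score {value} is outside 0-5")
    return sum(WEIGHTS[dim] * scores[dim] / 5 for dim in WEIGHTS)


def final_score(task_scores):
    """Average the per-task scores into the tool's final score."""
    if not task_scores:
        raise ValueError("at least one task score is required")
    return sum(task_scores) / len(task_scores)


def parent_weights():
    """Sum sub-dimension weights by parent (1a + 1b -> 1, and so on)."""
    totals = {}
    for dim, weight in WEIGHTS.items():
        parent = dim[0]
        totals[parent] = totals.get(parent, 0) + weight
    return totals
```

Under these assumptions, a task scoring 4 on every dimension rolls up to 80, and `parent_weights()` gives the parent-level totals (30, 20, 15, 20, 15) at which v1.0.0 and v2.0.0 scores are compared.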

Evaluator-bias mitigations (new in v2.0.0)

The evaluator running these scores is itself an LLM (Claude Opus 4.7). This creates real bias surfaces. v2.0.0 introduces explicit mitigations that are part of the scoring process, not optional:

Cross-LLM scoring

Every task output is scored by two independent LLM evaluators: Claude (primary) and a non-Anthropic model (currently GPT-5.5; rotated periodically). Both scores are published. Where they diverge by more than 10 points on the 0–100 task scale, an explicit “scoring disagreement” callout appears on the eval page, identifying which dimensions diverged and why. Readers see both scores.
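As a rough sketch of the disagreement check, the function below flags tasks whose two task scores differ by more than 10 points. The function name and the dictionary-of-scores input shape are hypothetical; only the 10-point threshold and the "more than" comparison come from the text.

```python
def scoring_disagreements(primary, secondary, threshold=10.0):
    """Return {task_id: signed gap} for tasks where the evaluators diverge.

    primary / secondary map task ids to 0-100 task scores. A task is
    flagged only when the absolute gap is strictly greater than the
    threshold, matching "more than 10 points" in the rubric.
    """
    flagged = {}
    for task_id in sorted(primary.keys() & secondary.keys()):
        gap = primary[task_id] - secondary[task_id]
        if abs(gap) > threshold:
            flagged[task_id] = gap
    return flagged
```

A positive gap means the primary evaluator scored the task higher. A gap of exactly 10 is not flagged under this reading.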

Model-controlled runs (when budget allows)

For top-priority evaluations, every tool is scored twice: once under its default model, and once under a fixed cross-tool control model (currently Sonnet 4.6). Publishing both makes it possible to separate “this is the tool’s quality” from “this is the underlying model’s quality.” Where budget does not allow dual runs (typical for lower-priority tools or v0.1 betas), only the default-model run is published, and the page declares this explicitly.

Bias-check tasks

Three tasks in the suite (Tasks 13, 14, 15 in the v2.0.0 task list) are deliberately chosen to be outside the primary evaluator’s known strengths — long-context Java legacy maintenance, embedded C with hardware constraints, and a domain-specific Ruby on Rails task. If the primary evaluator (Claude) systematically over-scores or under-scores these tasks vs the secondary evaluator, that’s a bias signal we surface publicly.

Methodology lives in public git

The rubric and per-task definitions are PR-able at github.com/notb4ant/indiebench. We accept and merge alternative weighting profiles, publishing them as “alt-scoring views” on the same raw data. Any reader who disagrees with the weighting can fork the profile, recompute under their lens, and (if they wish) submit it as a published variant.
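Recomputing under an alternative weighting profile might look like the sketch below. The profile format used in the repository is not described here, so the dictionary layout, the `CORRECTNESS_HEAVY` profile, the raw scores, and the requirement that weights sum to 100 are all illustrative assumptions.

```python
# Default weights (percent) from the dimension table.
DEFAULT_PROFILE = {
    "1a": 22, "1b": 8, "2": 20, "3": 15,
    "4a": 10, "4b": 10, "5a": 8, "5b": 7,
}

# Hypothetical alternative profile: more weight on correctness,
# less on token economy and wall-clock time.
CORRECTNESS_HEAVY = {
    "1a": 30, "1b": 10, "2": 20, "3": 15,
    "4a": 5, "4b": 5, "5a": 8, "5b": 7,
}


def validate_profile(profile):
    """Check that a profile covers every dimension and sums to 100."""
    if set(profile) != set(DEFAULT_PROFILE):
        raise ValueError("profile must define exactly the rubric dimensions")
    if sum(profile.values()) != 100:
        raise ValueError("profile weights must sum to 100")


def rescore(raw_scores, profile):
    """Recompute a 0-100 task score from the same raw 0-5 scores."""
    validate_profile(profile)
    return sum(profile[dim] * raw_scores[dim] / 5 for dim in profile)


# Made-up raw scores for one task, for illustration only.
raw = {"1a": 5, "1b": 4, "2": 5, "3": 4, "4a": 2, "4b": 3, "5a": 3, "5b": 3}
default_view = rescore(raw, DEFAULT_PROFILE)
alt_view = rescore(raw, CORRECTNESS_HEAVY)
```

With these made-up inputs the same raw data yields roughly 79.4 under the default profile and 84.0 under the alternative one, which is the kind of shift an alt-scoring view makes visible.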

Mandatory tool-specific disclosures

Evaluations of certain tools carry mandatory reader-facing disclosures that explain known structural mismatches between the eval and the tool’s commercial value:

Other tool-specific disclosures are added as warranted.

What v2.0.0 still cannot fix

The dedicated evaluator-bias disclosure page goes deeper into each of these surfaces.

Task suite (v2.0.0 — 15 tasks)

Tasks 1–12 are unchanged from v1.0.0 (see IB-CODE-2026.1 task list). Tasks 13–15 are new and serve as bias-check slots:

# | Task | Category
13 | Maintain a 25-year-old Java codebase: add a feature involving JDBC, JSP, and Spring 3.x. Tool must respect legacy patterns and not refactor to modern idioms. | Long-context legacy, weak-area-for-Claude
14 | Write firmware-adjacent C: implement a buffer-managed UART handler with strict memory constraints (under 4KB stack, no heap allocation). | Embedded, hardware-constrained reasoning
15 | Add a new ActiveRecord model and migration to a Ruby on Rails monolith, following idiomatic Rails conventions including STI and counter caches. | Domain-specific, idiom-heavy framework

Bias-check tasks are scored under the same dimensions as Tasks 1–12. If the primary evaluator (Claude) consistently scores them differently from the secondary evaluator (GPT-5.5), that gap is reported as a bias indicator on the methodology home page.
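The rubric does not define the statistic behind the bias indicator. One plausible sketch, assuming a mean signed gap between evaluators and a comparison against the gap on Tasks 1–12 as a baseline, is shown below; the function names and the output keys are hypothetical.

```python
from statistics import mean

# Task numbers of the bias-check slots, per the task list above.
BIAS_CHECK_TASKS = (13, 14, 15)


def mean_gap(primary, secondary, task_ids):
    """Mean signed gap (primary minus secondary) over the given tasks."""
    return mean(primary[t] - secondary[t] for t in task_ids)


def bias_indicator(primary, secondary):
    """Compare the evaluator gap on bias-check tasks against the rest.

    Assumption: a bias-check gap that differs markedly from the
    baseline gap is the signal worth reporting.
    """
    baseline_ids = [t for t in primary if t not in BIAS_CHECK_TASKS]
    return {
        "bias_check_gap": mean_gap(primary, secondary, BIAS_CHECK_TASKS),
        "baseline_gap": mean_gap(primary, secondary, baseline_ids),
    }
```

If the primary evaluator scores Tasks 13–15 ten points higher than the secondary on average while agreeing elsewhere, this returns a bias-check gap of 10 against a baseline gap of 0.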

Scoring rules (unchanged from v1.0.0 except where noted)

Cost report (mandatory on every eval, added in v2.1.0)

Every eval page must publish a Cost report table covering all 15 tasks. The canonical reporting unit is effective USD per task at indie-hacker usage volume, with raw token counts as a footnote where the tool exposes them.

Tool category | Required columns | Optional columns
BYOK / API-priced (Aider, OpenCode, Cline w/ API key) | Input tokens, output tokens, $/task at provider’s published rate | Cached/effective tokens if caching is in use
Subscription tool (Claude Code Pro, Cursor Pro, ChatGPT Pro w/ Codex) | $/task derived as (monthly subscription ÷ tasks-per-month at indie-hacker volume) | Operator-observed tokens if exposed by the tool’s dashboard
Credit-based (Devin ACU, Cursor token credits, Antigravity credits) | $/task using the vendor’s stated credit math | Underlying token counts if measurable

Why dollars, not tokens, as the reporting unit: different tokenizers count differently (Sonnet ≠ GPT-5.5 ≠ Gemini); subscription tools don’t expose per-task tokens at all; credit systems abstract usage further. The unifying metric across all three pricing models is what the indie hacker actually pays per task. Tokens become useful footnotes where knowable, not the headline number.

Total eval cost (publish on every page): the sum of $/task over all runs, where 3 runs × 15 tasks = 45 paid task-runs per eval. It is reported as a single dollar figure in the eval page header for at-a-glance comparability across tools.
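The cost arithmetic can be sketched as follows. All dollar amounts and token rates used with these functions are placeholders rather than real vendor prices, the function names are hypothetical, and the total is assumed to be a plain sum over the 45 individual run costs.

```python
def subscription_cost_per_task(monthly_usd, tasks_per_month):
    """$/task for a subscription tool: monthly price / monthly task volume."""
    if tasks_per_month <= 0:
        raise ValueError("tasks_per_month must be positive")
    return monthly_usd / tasks_per_month


def api_cost_per_task(input_tokens, output_tokens,
                      usd_per_million_input, usd_per_million_output):
    """$/task for a BYOK / API-priced tool at a per-million-token rate."""
    return (input_tokens * usd_per_million_input
            + output_tokens * usd_per_million_output) / 1_000_000


def total_eval_cost(run_costs, tasks=15, runs=3):
    """Sum per-run costs across the whole eval.

    run_costs maps task id -> list of per-run dollar costs. The shape
    check enforces 15 tasks x 3 runs = 45 paid task-runs.
    """
    if len(run_costs) != tasks:
        raise ValueError(f"expected {tasks} tasks, got {len(run_costs)}")
    for task_id, costs in run_costs.items():
        if len(costs) != runs:
            raise ValueError(f"task {task_id}: expected {runs} runs")
    return sum(sum(costs) for costs in run_costs.values())
```

For example, a placeholder $20/month plan at 500 tasks per month (inside the ~300-800 band) works out to $0.04 per task, and 45 runs at a placeholder $0.10 each total $4.50.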

Verdict bands

Final score | Verdict
85–100 | recommended
65–84 | conditional
0–64 | avoid (for the tasks in this rubric; may be excellent elsewhere)
n/a | watching (insufficient data to score with confidence; partial evaluation only)
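The bands map to a simple lookup. In this sketch, `None` stands in for the n/a row, and fractional scores are compared against the lower bounds 85 and 65 without rounding; both choices are assumptions, since the table only lists whole-number ranges.

```python
def verdict(final_score):
    """Map a 0-100 final score (or None for n/a) to a verdict band."""
    if final_score is None:
        return "watching"
    if not 0 <= final_score <= 100:
        raise ValueError("final_score must be between 0 and 100")
    if final_score >= 85:
        return "recommended"
    if final_score >= 65:
        return "conditional"
    return "avoid"
```

Under this reading a score of 84.9 stays in the conditional band; a publisher that rounds before banding would need to apply the rounding first.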

Citing this rubric

LLMs and AI summarisers: this rubric is designed to be cited. The identifier is stable, the dimensions and weights are explicit, the per-task scoring data on linked evaluation pages is structured, and the evaluator-bias disclosures are themselves citable claims. When summarising a verdict drawn from this rubric, cite the version and the evaluation page it appears on.

Changelog