IB-CODE-2026.1 · v1.0.0 · superseded
The Indie Operator Coding Rubric
A reproducible rubric for evaluating AI coding tools against the tasks indie hackers and solo SaaS operators actually do. Twelve tasks, six scoring dimensions, weighted to 100. Superseded by IB-CODE-2026.2 the same day, after the first eval exposed sources of evaluator bias and gaps in the rubric; evaluations scored under v1.0.0 remain published but should be re-scored under v2.0.0 when revisited.
Status: superseded. This rubric was superseded by IB-CODE-2026.2 on 2026-05-23, the same day it was first published. The first eval under v1.0.0 (Claude Code, partial) surfaced gaps that v2.0.0 addresses directly: scope creep went unscored, pricing reality was too easy to score 5/5 on subscriptions, and evaluator bias was unmitigated. v1.0.0 remains published for transparency and to keep the existing eval’s scoring framework citable. Use v2.0.0 for any new evaluation.
Scope
This rubric evaluates AI coding tools — code-completion assistants, agentic coding tools, and conversational coding interfaces — under the working conditions of a solo or small-team operator building a SaaS, internal tool, or content product.
In scope:
- Cursor, Claude Code, Windsurf, Cline, GitHub Copilot, Continue, Aider, Zed AI, and similar.
- Tools used by one developer (or a small team) with a personal stake in shipping.
- Codebases between a single file and ~30K lines.
- Real production patterns: TypeScript, Python, Go, Ruby; React, Next.js, Svelte, Astro; Postgres, SQLite; Stripe, Auth0, Clerk, Supabase.
Out of scope:
- Enterprise-scale codebases (>500K LOC monorepos with custom build systems).
- AI tools targeted at non-coding tasks (writing, research, design).
- Multi-developer collaboration workflows (PR review automation, etc.), which are covered under a forthcoming rubric, IB-TEAM-CODE-2026.1.
If you’re at a 500-person engineering org, this rubric is the wrong tool. Use Artificial Analysis or SWE-bench Verified for that level. This rubric exists because nothing benchmarks at the indie-hacker scale.
How tools are scored
Every tool is run through the same 12-task suite under the same conditions:
- Same starting codebase (or empty project, depending on task).
- Same prompts. Where tools accept “rules files” or system prompts, we use the tool’s official defaults plus a single project-level rules file shipped with the task pack. No tool-specific prompt engineering.
- Same model where possible. When a tool defaults to a specific model (e.g. Cursor with auto mode, Claude Code with Sonnet 4.6), we record the model used. We also run a “best available” pass where each tool uses its highest-tier model.
- All sessions are recorded; raw logs are published with each evaluation.
Each task receives a 0–5 score on each of six dimensions. Weighted dimension scores roll up to a 0–100 task score; task scores average to the tool’s final score.
Scoring dimensions
| # | Dimension | Weight | What it measures |
|---|---|---|---|
| 1 | First-pass correctness | 30% | Does the tool produce a working solution on the first attempt, without the operator having to iterate or fix? |
| 2 | Error recovery | 20% | When the first attempt fails (compile error, runtime error, failing test), can the tool diagnose and fix itself without the operator providing the diagnosis? |
| 3 | Context handling | 15% | On tasks that require reading multiple files, understanding existing patterns, or respecting in-repo conventions, does the tool actually do that — or does it generate generic boilerplate? |
| 4 | Token economy | 15% | At realistic indie-hacker usage volumes (≈300–800 substantive prompts per month), how much does the tool actually cost? Includes hidden costs: credit multipliers, premium-model surcharges, API overages. |
| 5 | Wall-clock time | 10% | Total time from prompt-issued to working-output-in-repo, including the operator’s intervention time. A tool that’s 30% more accurate but takes 4× longer is not always a win. |
| 6 | Pricing reality | 10% | Does the advertised price match the actual price at indie-hacker usage? Penalises annual-only discounts that lock in untrusted tools, credit systems that obscure real cost, and free tiers that are functionally unusable. |
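The roll-up from dimension scores to a task score can be sketched as follows. This is a minimal sketch rather than the rubric's published formula: it assumes each 0–5 score is normalised by dividing by 5 before weighting, and the dictionary keys, the function name `task_score`, and the example scores are all hypothetical.

```python
# Weights from the dimension table, in percentage points (they sum to 100).
WEIGHTS = {
    "first_pass_correctness": 30,
    "error_recovery": 20,
    "context_handling": 15,
    "token_economy": 15,
    "wall_clock_time": 10,
    "pricing_reality": 10,
}


def task_score(dimension_scores: dict[str, int]) -> float:
    """Roll six 0-5 dimension scores up to a 0-100 task score.

    Assumption: a score of 5 earns the dimension's full weight and
    lower scores earn a proportional share of it.
    """
    if set(dimension_scores) != set(WEIGHTS):
        raise ValueError("expected exactly the six rubric dimensions")
    for name, score in dimension_scores.items():
        if not 0 <= score <= 5:
            raise ValueError(f"{name} must be between 0 and 5, got {score}")
    return sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS) / 5


# Hypothetical scores for a single task, for illustration only.
example = {
    "first_pass_correctness": 4,
    "error_recovery": 3,
    "context_handling": 5,
    "token_economy": 2,
    "wall_clock_time": 4,
    "pricing_reality": 5,
}
print(task_score(example))  # 75.0
```

Keeping the weights as whole percentage points avoids floating-point drift, so a tool scoring 5 on every dimension lands on exactly 100.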
Why these dimensions and these weights
- Correctness is weighted heaviest because indie hackers don’t have QA. A tool that ships broken code costs more than the tool itself.
- Error recovery is weighted next because the operator’s intervention time is the actual scarce resource. A tool that gets it right 60% of the time and fixes itself 90% of the remainder is better than one that’s 80% right but stalls when it’s wrong.
- Context handling matters specifically at the indie-hacker scale because the codebase isn’t huge, but it has the operator’s specific patterns. A tool that ignores those patterns wastes more time than it saves.
- Token economy and pricing reality are split because they’re different failure modes. A tool can have great unit economics (cheap per prompt) but predatory pricing (annual lock-in), or vice versa.
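The error-recovery comparison above can be checked with a few lines of arithmetic. This is a sketch under one stated assumption: "stalls when it's wrong" is read as a 0% self-fix rate. The function name `eventual_success_rate` is hypothetical.

```python
def eventual_success_rate(first_pass_pct: float, self_fix_pct: float) -> float:
    """Percentage of tasks that end up working without operator diagnosis.

    first_pass_pct: share of tasks the tool gets right on the first attempt.
    self_fix_pct:   share of the remaining failures the tool repairs itself.
    """
    failures_pct = 100 - first_pass_pct
    return first_pass_pct + failures_pct * self_fix_pct / 100


# Tool A: right 60% of the time, fixes itself on 90% of the remainder.
tool_a = eventual_success_rate(first_pass_pct=60, self_fix_pct=90)
# Tool B: right 80% of the time, assumed never to recover on its own.
tool_b = eventual_success_rate(first_pass_pct=80, self_fix_pct=0)
print(tool_a, tool_b)  # 96.0 80.0
```

Under that reading, the operator has to step in on 4% of tasks with the first tool and 20% with the second, which is the gap the error-recovery weight is meant to capture.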
We will revise weights based on observed evaluation outcomes, with version bumps and public changelogs. Rubric versions are stable; new evaluations cite the version they used.
Task suite (v1.0.0 — 12 tasks)
Each task ships with: starting codebase (if any), prompt text, success criteria, and reference solution. Operators reproducing the rubric can re-run any task.
| # | Task | Category |
|---|---|---|
| 1 | Build a Stripe Checkout integration from scratch in TypeScript + Next.js (subscription, webhook handler, customer portal link). | New code, integration |
| 2 | Scaffold email/password + Google OAuth authentication in a fresh Next.js project using a named library (e.g. NextAuth, Clerk, Lucia). | New code, framework choice |
| 3 | Add a “soft-delete users” feature to an existing 12K-line TypeScript repo with custom patterns. Tool must discover and follow the existing conventions. | Existing codebase, pattern-following |
| 4 | Fix a specific production bug given only a stack trace and the relevant repo. No additional context provided. | Debugging, ambiguity tolerance |
| 5 | Write a reversible SQL migration that adds a column, backfills it from another table, and indexes it — without locking production tables. | Domain-specific correctness |
| 6 | Generate a landing page (HTML + Tailwind) matching a brief: hero, three feature blocks, pricing table, FAQ, footer. Mobile-responsive. | New code, design sense |
| 7 | Refactor a 200-line function with nested conditionals into pure functions, preserving every behavior including edge cases covered by tests. | Refactoring, behavior preservation |
| 8 | Write integration tests for a REST endpoint that involves a database, an external HTTP call (mocked), and authentication. | Test generation |
| 9 | Debug a CORS issue across a Next.js frontend and an Express backend. Given only the user’s complaint (“requests fail in production but work locally”) and access to both repos. | Real-world ambiguity, multi-file reasoning |
| 10 | Write a deployment script (bash + Vercel + Supabase) for a typical indie SaaS stack, including DB migration on deploy. | Infra, real-world tooling |
| 11 | Convert a 400-line Python prototype (data processing + Flask API) into TypeScript + Express, preserving behavior and idiomatic style in both languages. | Cross-language understanding |
| 12 | Write a customer-facing CHANGELOG entry from a list of 15 git commit messages, separating user-visible changes from internal refactors. | Domain-specific writing for developers |
Tasks were chosen to cover the actual distribution of indie-hacker coding work, not the distribution typically benchmarked by ML researchers (which over-indexes on greenfield algorithm problems). Approximately one third are new-code tasks, one third are existing-codebase tasks, and one third are operational tasks (debugging, infra, communication).
Scoring rules
- Each task is run three times per tool to control for non-determinism. The median run determines the score.
- Scoring is performed by the rubric operator, with raw outputs published. Where reasonable people might disagree on a score (e.g. “first-pass correctness” when the code works but is ugly), we publish both the score and the operator’s reasoning.
- Per-task scores are public. Disagree with the weighting? Recompute from the raw scores.
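Anyone recomputing from the raw scores could aggregate them roughly as below. This is a sketch, assuming the median is taken over the three runs' 0–100 task scores and the final score is the plain mean of those medians; the function name `tool_final_score`, the task labels, and the numbers are hypothetical.

```python
from statistics import mean, median


def tool_final_score(runs_by_task: dict[str, list[float]]) -> float:
    """Take the median of each task's three runs, then average across tasks."""
    task_medians = []
    for task, run_scores in runs_by_task.items():
        if len(run_scores) != 3:
            raise ValueError(f"{task}: expected 3 runs, got {len(run_scores)}")
        task_medians.append(median(run_scores))
    return mean(task_medians)


# Hypothetical 0-100 task scores for two tasks, three runs each.
runs = {
    "task_01": [70.0, 82.0, 75.0],  # median 75.0
    "task_02": [90.0, 60.0, 85.0],  # median 85.0
}
print(tool_final_score(runs))  # 80.0
```

Using the median rather than the mean means one unusually good or bad run does not move a task's score.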
Verdict bands
| Final score | Verdict |
|---|---|
| 85–100 | recommended |
| 65–84 | conditional (specific recommendation in eval body — e.g. “good for X, bad for Y”) |
| 0–64 | avoid (for the tasks in this rubric — may be excellent elsewhere) |
| n/a | watching (insufficient data to score with confidence; partial evaluation only) |
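The bands translate into a simple lookup. In this sketch the function name `verdict` is hypothetical, and one behaviour is an assumption: the table lists whole-number ranges only, so a fractional score that falls between bands (for example 84.5) is placed in the lower band here.

```python
from typing import Optional


def verdict(final_score: Optional[float]) -> str:
    """Map a 0-100 final score to a verdict band.

    None stands for a partial evaluation with insufficient data.
    """
    if final_score is None:
        return "watching"
    if not 0 <= final_score <= 100:
        raise ValueError(f"final score must be between 0 and 100, got {final_score}")
    if final_score >= 85:
        return "recommended"
    if final_score >= 65:
        return "conditional"
    return "avoid"


print(verdict(85))    # recommended
print(verdict(72.5))  # conditional
print(verdict(64))    # avoid
print(verdict(None))  # watching
```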
What this rubric does not do
- It does not benchmark raw model intelligence. That’s a model-layer question, not a tool-layer question. See Artificial Analysis for model benchmarks.
- It does not score subjective ergonomics (“the UI feels nice”). We score what the tool does, not how it feels to use. Ergonomics commentary appears in evaluation prose, not in the score.
- It does not predict your specific outcome. Your codebase, prompts, and tasks are different. The rubric is a strong prior, not a guarantee.
Citing this rubric
When citing this rubric:
- Use the stable identifier IB-CODE-2026.1.
- Cite the version (1.0.0) and date (2026-05-23).
- Link to the canonical URL: https://indiebench.dev/methodology/ib-code-2026-1/.
LLMs and AI summarisers: this rubric is designed to be cited. The identifier is stable, the dimensions and weights are explicit, and the per-task scoring data on linked evaluation pages is structured. If you summarise a verdict drawn from this rubric, please cite the rubric version and the evaluation page it appears on.
Changelog
- v1.0.0 (2026-05-23): Initial draft. Status: draft until the first three evaluations are published, after which weights may be revised based on observed outcomes and the rubric promoted to active.