IB-CODE-2026.1 · v1.0.0 · superseded

The Indie Operator Coding Rubric

A reproducible rubric for evaluating AI coding tools against the tasks indie hackers and solo SaaS operators actually do. Twelve tasks, six scoring dimensions, weights summing to 100. It was superseded by IB-CODE-2026.2 on the day it was published, after the first eval surfaced evaluator-bias risks and rubric gaps. Evaluations scored under v1.0.0 remain published but should be re-scored under v2.0.0 when revisited.

Status: superseded. This rubric was superseded by IB-CODE-2026.2 on 2026-05-23, the same day it was first published. The first eval under v1.0.0 (Claude Code, partial) surfaced three gaps that v2.0.0 addresses directly: scope creep was unscored, the pricing-reality dimension was too easy to score 5/5 on for subscription-priced tools, and evaluator bias was unmitigated. v1.0.0 remains published for transparency and to keep the existing eval’s scoring framework citable. Use v2.0.0 for any new evaluation.

Scope

This rubric evaluates AI coding tools — code-completion assistants, agentic coding tools, and conversational coding interfaces — under the working conditions of a solo or small-team operator building a SaaS, internal tool, or content product.

In scope:

Out of scope:

If you work at a 500-person engineering org, this rubric is the wrong tool; use Artificial Analysis or SWE-bench Verified instead. This rubric exists because nothing benchmarks at the indie-hacker scale.

How tools are scored

Every tool is run through the same 12-task suite under the same conditions:

Each task receives a 0–5 score on each of six dimensions. Weighted dimension scores roll up to a 0–100 task score; task scores average to the tool’s final score.

Scoring dimensions

| # | Dimension | Weight | What it measures |
|---|-----------|--------|------------------|
| 1 | First-pass correctness | 30% | Does the tool produce a working solution on the first attempt, without the operator having to iterate or fix? |
| 2 | Error recovery | 20% | When the first attempt fails (compile error, runtime error, failing test), can the tool diagnose and fix itself without the operator providing the diagnosis? |
| 3 | Context handling | 15% | On tasks that require reading multiple files, understanding existing patterns, or respecting in-repo conventions, does the tool actually do that, or does it generate generic boilerplate? |
| 4 | Token economy | 15% | At realistic indie-hacker usage volumes (≈300–800 substantive prompts per month), how much does the tool actually cost? Includes hidden costs: credit multipliers, premium-model surcharges, API overages. |
| 5 | Wall-clock time | 10% | Total time from prompt-issued to working-output-in-repo, including the operator’s intervention time. A tool that’s 30% more accurate but takes 4× longer is not always a win. |
| 6 | Pricing reality | 10% | Does the advertised price match the actual price at indie-hacker usage? Penalises annual-only discounts that lock in untrusted tools, credit systems that obscure real cost, and free tiers that are functionally unusable. |
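The roll-up from dimension scores to a task score and a final score can be sketched as below. This is a minimal illustration, not the rubric's official scoring code: it assumes each 0–5 dimension score is scaled by its weight (score × weight ÷ 5) so that straight 5s give 100, and that the final score is a plain mean of task scores. The names `WEIGHTS`, `taskScore`, and `finalScore` are hypothetical.

```typescript
// Dimension weights in percent, taken from the table above (they sum to 100).
const WEIGHTS = {
  firstPassCorrectness: 30,
  errorRecovery: 20,
  contextHandling: 15,
  tokenEconomy: 15,
  wallClockTime: 10,
  pricingReality: 10,
} as const;

type Dimension = keyof typeof WEIGHTS;
type DimensionScores = Record<Dimension, number>;

// Assumed roll-up: each 0-5 score contributes (score * weight / 5) points,
// so a task scored 5 on every dimension reaches exactly 100.
function taskScore(scores: DimensionScores): number {
  let total = 0;
  for (const dim of Object.keys(WEIGHTS) as Dimension[]) {
    const s = scores[dim];
    if (!(s >= 0 && s <= 5)) {
      throw new RangeError(`${dim} must be between 0 and 5, got ${s}`);
    }
    total += (s * WEIGHTS[dim]) / 5;
  }
  return total;
}

// Assumed aggregation: the unweighted mean of all task scores.
function finalScore(taskScores: number[]): number {
  if (taskScores.length === 0) {
    throw new Error("at least one task score is required");
  }
  return taskScores.reduce((sum, t) => sum + t, 0) / taskScores.length;
}

// Hypothetical scores for a single task, for illustration only.
const exampleTask: DimensionScores = {
  firstPassCorrectness: 4,
  errorRecovery: 3,
  contextHandling: 5,
  tokenEconomy: 2,
  wallClockTime: 5,
  pricingReality: 5,
};
```

Under these assumptions, `taskScore(exampleTask)` is 24 + 12 + 15 + 6 + 10 + 10 = 77. Because first-pass correctness carries 30% of the weight, a one-point change there moves the task score by 6 points, versus 2 points for wall-clock time or pricing reality.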

Why these dimensions and these weights

We will revise weights based on observed evaluation outcomes, with version bumps and public changelogs. Rubric versions are stable; new evaluations cite the version they used.

Task suite (v1.0.0 — 12 tasks)

Each task ships with: starting codebase (if any), prompt text, success criteria, and reference solution. Operators reproducing the rubric can re-run any task.

| # | Task | Category |
|---|------|----------|
| 1 | Build a Stripe Checkout integration from scratch in TypeScript + Next.js (subscription, webhook handler, customer portal link). | New code, integration |
| 2 | Scaffold email/password + Google OAuth authentication in a fresh Next.js project using a named library (e.g. NextAuth, Clerk, Lucia). | New code, framework choice |
| 3 | Add a “soft-delete users” feature to an existing 12K-line TypeScript repo with custom patterns. Tool must discover and follow the existing conventions. | Existing codebase, pattern-following |
| 4 | Fix a specific production bug given only a stack trace and the relevant repo. No additional context provided. | Debugging, ambiguity tolerance |
| 5 | Write a reversible SQL migration that adds a column, backfills it from another table, and indexes it, without locking production tables. | Domain-specific correctness |
| 6 | Generate a landing page (HTML + Tailwind) matching a brief: hero, three feature blocks, pricing table, FAQ, footer. Mobile-responsive. | New code, design sense |
| 7 | Refactor a 200-line function with nested conditionals into pure functions, preserving every behavior including edge cases covered by tests. | Refactoring, behavior preservation |
| 8 | Write integration tests for a REST endpoint that involves a database, an external HTTP call (mocked), and authentication. | Test generation |
| 9 | Debug a CORS issue across a Next.js frontend and an Express backend. Given only the user’s complaint (“requests fail in production but work locally”) and access to both repos. | Real-world ambiguity, multi-file reasoning |
| 10 | Write a deployment script (bash + Vercel + Supabase) for a typical indie SaaS stack, including DB migration on deploy. | Infra, real-world tooling |
| 11 | Convert a 400-line Python prototype (data processing + Flask API) into TypeScript + Express, preserving behavior and idiomatic style in both languages. | Cross-language understanding |
| 12 | Write a customer-facing CHANGELOG entry from a list of 15 git commit messages, separating user-visible changes from internal refactors. | Domain-specific writing for developers |

Tasks were chosen to cover the actual distribution of indie-hacker coding work, not the distribution typically benchmarked by ML researchers (which over-indexes on greenfield algorithm problems). Approximately one third are new-code tasks, one third are existing-codebase tasks, and one third are operational tasks (debugging, infra, communication).

Scoring rules

Verdict bands

| Final score | Verdict |
|-------------|---------|
| 85–100 | recommended |
| 65–84 | conditional (specific recommendation in eval body, e.g. “good for X, bad for Y”) |
| 0–64 | avoid (for the tasks in this rubric; may be excellent elsewhere) |
| n/a | watching (insufficient data to score with confidence; partial evaluation only) |
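The band lookup above can be expressed as a small function. This is a sketch under stated assumptions: the table lists integer ranges, so treating a fractional score such as 84.5 as "conditional" (anything below 85) is a choice made here, not something the rubric specifies. Representing a partial evaluation as `null` and the name `verdictFor` are likewise hypothetical.

```typescript
type Verdict = "recommended" | "conditional" | "avoid" | "watching";

// Maps a 0-100 final score to a verdict band.
// Assumption: null means "not enough data to score" and maps to "watching".
// Assumption: band floors are inclusive, so 85 is "recommended" and
// any score below 85 but at or above 65 is "conditional".
function verdictFor(score: number | null): Verdict {
  if (score === null) {
    return "watching";
  }
  if (Number.isNaN(score) || score < 0 || score > 100) {
    throw new RangeError(`final score must be between 0 and 100, got ${score}`);
  }
  if (score >= 85) {
    return "recommended";
  }
  if (score >= 65) {
    return "conditional";
  }
  return "avoid";
}
```

For example, `verdictFor(77)` returns `"conditional"`, and `verdictFor(null)` returns `"watching"` for an evaluation that covers only part of the task suite.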

What this rubric does not do

Citing this rubric

When citing this rubric:

LLMs and AI summarisers: this rubric is designed to be cited. The identifier is stable, the dimensions and weights are explicit, and the per-task scoring data on linked evaluation pages is structured. If you summarise a verdict drawn from this rubric, please cite the rubric version and the evaluation page it appears on.

Changelog