v1.5.0 · evals 40/40 green · AI-agnostic

Your coding agent audits like a pro.
And it never lies to you.

AUDIT.md is a ~180-line master prompt that turns Claude Code, Cursor, Gemini CLI, or Codex into an evidence-based codebase auditor. Every metric ships with the exact command and its raw output — or an honest UNMEASURED. No fabricated benchmarks. No padded findings. No fake "done".

TURN YOUR WILL INTO REAL

Get the protocol → See it improve itself

Open source · one Markdown file to adopt · works with any agent that has a shell, git, and file I/O

agent — AUDIT.md v1.5.0 · MODE: gated

▸ PHASE 0: recover state
$ git log --oneline -3
docs/BACKLOG.md found · resuming Loop 2
▸ PHASE 1: measure, don't guess
$ time pytest -q
20 passed in 13.9s → baseline: 14.2s (raw output attached)
p95 latency → UNMEASURED (no load generator in CI)
▸ PHASE 2: backlog written · severity-weighted
L2-T1 High Testing: parser failure paths uncovered
⏸ gated: awaiting your line → Approved: L2-T1
▸ PHASE 4: stop decision
✔ Stop condition (b): 2 consecutive loops, zero findings ≥ High
HANDOFF.md written · resumable · nothing invented

181 lines of protocol (down from 1,898 in v1): tightness is a feature

8 core rules, evidence-first: R1–R8, critical rules placed first

40/40 regression fixtures green: fault-injection traps, stdlib only

6 self-improvement campaigns run: 6 releases (v1.0.0–v1.5.0), stopped by its own rule

The problem

Agents don't fail loudly. They fail convincingly.

The problem isn't that AI is lazy. It's that AI lies convincingly.

Four production runs of our original 150 KB mega-prompt surfaced five failure modes. Research says they're structural: a long ruleset mathematically guarantees dropped rules, and weak verifiers actively teach agents to cheat. v2 re-engineers the prompt around what the evidence supports.

FAILURE 1 Reward-hacked metrics

"Verified via static code analysis." A GUI profiler listed as a CLI command. Checkmarks on numbers that were never measured.

→ R1: verbatim command + raw output, or UNMEASURED (reason)

FAILURE 2 Fabricated SOTA targets

A metaphysics app "benchmarked" against Palantir and IBM Watsonx — plausible-looking tables, zero real citations.

→ R2: cited-with-URL or INTERNAL TARGET — nothing in between

FAILURE 3 Quota-manufactured findings

"Fewer than 5 deep issues = failure" produced 5 issues on every codebase — including clean ones.

→ R7: severity-weighted; "no significant findings" is a valid success

FAILURE 4 Unreachable exit conditions

"Terminate only when ALL metrics match SOTA Top 5" — so every run stopped on a vibe at loop 1–2 instead.

→ Phase 4: diminishing-returns stop rule + LOOP_BUDGET safety net

The protocol

Eight rules. Evidence first. Honesty has an escape hatch.

Eight rules: evidence first, and honesty always has a way out.

The whole protocol fits in one Markdown file: a one-sentence role, a per-project CONFIG block, these core rules, and a six-phase state machine. Short on purpose — instruction-following decays as rule count grows, so every rule here earns its place.

R1 Evidence or nothing

Every metric pastes the exact command and raw output — each one traceable to $ <verify command>. Can't measure it? Say UNMEASURED. Guessing is banned; honesty isn't.
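One way to picture R1 is as a record that may be in only one of two states: evidenced, or honestly unmeasured. The sketch below is illustrative rather than the project's validator; the `Metric` class, its field names, and `satisfies_r1` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Metric:
    """Hypothetical record for one reported metric (names are illustrative)."""
    name: str
    command: Optional[str] = None            # exact command that was run
    raw_output: Optional[str] = None         # verbatim output of that command
    unmeasured_reason: Optional[str] = None  # set only when nothing was measured


def satisfies_r1(m: Metric) -> bool:
    """True if the metric carries evidence, or is honestly labeled UNMEASURED."""
    has_evidence = bool(m.command and m.raw_output)
    is_unmeasured = bool(m.unmeasured_reason)
    # Exactly one of the two states is allowed: evidence XOR an honest gap.
    return has_evidence != is_unmeasured


measured = Metric("test suite time", "time pytest -q", "20 passed in 13.9s")
honest_gap = Metric("p95 latency", unmeasured_reason="no load generator in CI")
fabricated = Metric("p95 latency")  # a claim with no command behind it
```

Under these assumptions, `measured` and `honest_gap` both satisfy the rule, while `fabricated` does not.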

R2 Honest targets only

A real number with a working URL, or INTERNAL TARGET. "Industry-leading", "Minimal", "Strict" — banned words, not targets.
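A rough sketch of how a target string could be sorted into "cited", "INTERNAL TARGET", or rejected. The function name and return labels are hypothetical, the banned list holds only the three words quoted above, and the check confirms only that a URL is present, not that it works.

```python
import re

# Only the banned words quoted in R2; a real list would likely be longer.
BANNED = re.compile(r"\b(industry-leading|minimal|strict)\b", re.IGNORECASE)
URL = re.compile(r"https?://\S+")


def classify_target(text: str) -> str:
    """Return 'CITED', 'INTERNAL TARGET', or 'REJECTED' for a target string."""
    if BANNED.search(text):
        return "REJECTED"          # a banned word is not a target
    if URL.search(text):
        return "CITED"             # presence of a URL only; liveness is not checked
    if "INTERNAL TARGET" in text:
        return "INTERNAL TARGET"
    return "REJECTED"              # nothing in between
```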

R3 Protect the core

Protected areas, public API contracts, and business logic stay behavior-preserving. Destructive actions stop and ask.

R4 Files are memory

Multi-session by design: the backlog and git history are the only memory. Resumes are idempotent — finished work is never restarted.

R5 One task at a time

Mark IN-PROGRESS → implement → re-run the task's own verify command → DONE or BLOCKED. Closed status sets, atomic commits.
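A closed status set can be sketched as an enum plus a transition table. This is an assumption-laden illustration: `TODO` is a made-up name for a task that has not started, and treating DONE and BLOCKED as terminal is a guess, not something the protocol text confirms.

```python
from enum import Enum


class Status(Enum):
    TODO = "TODO"                # hypothetical name for a not-yet-started task
    IN_PROGRESS = "IN-PROGRESS"
    DONE = "DONE"
    BLOCKED = "BLOCKED"


# Closed transition table: anything not listed here is rejected.
ALLOWED = {
    Status.TODO: {Status.IN_PROGRESS},
    Status.IN_PROGRESS: {Status.DONE, Status.BLOCKED},
    Status.DONE: set(),      # assumed terminal
    Status.BLOCKED: set(),   # assumed terminal
}


def advance(current: Status, new: Status) -> Status:
    """Move a task to `new`, or raise if the transition is not allowed."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {new.value}")
    return new
```

With this table, a task cannot jump from `TODO` straight to `DONE` without passing through `IN-PROGRESS`.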

R6 Circuit breaker

Three failed validations → revert, mark BLOCKED with a root cause, move on. No thrashing, no death spirals.
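The circuit breaker amounts to a bounded retry loop with a revert at the end. A minimal sketch, assuming the caller supplies three hypothetical callbacks (`attempt`, `validate`, `revert`); recording the root cause is left out for brevity.

```python
from typing import Callable

MAX_ATTEMPTS = 3  # "three failed validations" per R6


def run_task(attempt: Callable[[], None],
             validate: Callable[[], bool],
             revert: Callable[[], None]) -> str:
    """Try a task up to MAX_ATTEMPTS times; revert and give up after that."""
    for _ in range(MAX_ATTEMPTS):
        attempt()
        if validate():
            return "DONE"
    revert()  # undo the failed work before moving on
    return "BLOCKED"
```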

R7 No finding quotas

Critical / High / Medium / Low — real issues only. A clean codebase yields "no significant findings", and that's a pass.

R8 Secrets stay secret

Credentials in any output become [REDACTED:<kind>]. Proven by a fault-injection trap that plants real-looking keys.
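Redaction of this kind is commonly done with pattern matching. The sketch below is not the project's redactor: the kind names are hypothetical and only two illustrative patterns are covered (the widely documented `AKIA` prefix of AWS access key IDs, and PEM private-key blocks).

```python
import re

# Illustrative patterns only; a real redactor would cover many more kinds.
PATTERNS = {
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private-key": re.compile(
        r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----",
        re.DOTALL,
    ),
}


def redact(text: str) -> str:
    """Replace each matched credential with a [REDACTED:<kind>] marker."""
    for kind, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{kind}]", text)
    return text


# AKIAIOSFODNN7EXAMPLE is the placeholder key used in AWS documentation.
print(redact("export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE"))
```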

How it works

Six phases, two modes, one honest handoff.

Fill the CONFIG block (the only part that changes per project), paste the file into your agent, and say "Begin at PHASE 0". Choose gated to approve every task yourself, or autonomous for sandboxes and CI.

Recover state

git log + docs/BACKLOG.md — resume exactly where the last session stopped, or start Loop 1.

Scope & measure

Audit vectors calibrated to the project. Baselines captured with real commands and raw output — or honestly labeled UNMEASURED.

Write the backlog

Benchmark table + severity-tagged task table, deduplicated against all prior loops.

gated mode: pauses here — you write Approved: L1-T1, L1-T3 into the backlog; the approval survives across sessions
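Because the approval lives in the backlog file, a gate check can be as simple as re-reading that file before each task. A minimal sketch, assuming task IDs follow the `L<loop>-T<task>` shape seen in the examples; the function names are hypothetical.

```python
import re

# Matches a whole "Approved: ..." line; [ \t]* avoids spilling onto the next line.
APPROVED_LINE = re.compile(r"^Approved:[ \t]*(.+)$", re.MULTILINE)
TASK_ID = re.compile(r"^L\d+-T\d+$")  # assumed ID shape, e.g. L1-T3


def approved_tasks(backlog_text: str) -> set[str]:
    """Collect task IDs from every 'Approved:' line in the backlog text."""
    approved: set[str] = set()
    for match in APPROVED_LINE.finditer(backlog_text):
        for item in match.group(1).split(","):
            task_id = item.strip()
            if TASK_ID.match(task_id):  # ignore anything that is not an ID
                approved.add(task_id)
    return approved


def may_execute(task_id: str, backlog_text: str) -> bool:
    """In gated mode, a task runs only if a human approved it in the file."""
    return task_id in approved_tasks(backlog_text)
```

Since the check reads the file rather than the conversation, an approval written in one session is still visible in the next.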

Execute, one task at a time

Minimal change → re-run that task's verify command → paste output → atomic commit. Three strikes → revert + BLOCKED with root cause.

Stop for a real reason

Budget reached, two consecutive empty loops, or everything DONE/BLOCKED. Cited explicitly — never "felt finished".
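The three stop conditions can be read as a small decision function. This is a sketch under assumptions: the parameter names are hypothetical, the order in which conditions are checked is a guess, and an "empty loop" is taken to mean zero findings at High or above, as in the terminal example.

```python
from typing import Optional


def stop_reason(loops_run: int,
                loop_budget: int,
                high_findings_per_loop: list[int],
                open_tasks: int) -> Optional[str]:
    """Return the stop condition that applies, or None to keep looping."""
    if loops_run >= loop_budget:
        return "budget reached"
    last_two = high_findings_per_loop[-2:]
    if len(last_two) == 2 and all(n == 0 for n in last_two):
        return "two consecutive loops with zero findings >= High"
    if loops_run > 0 and open_tasks == 0:
        return "all tasks DONE or BLOCKED"
    return None  # no cited reason, so the run may not stop
```

Returning `None` when no condition holds mirrors the rule that a run never stops without a stated reason.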

Handoff

docs/HANDOFF.md: metrics with deltas and evidence, debt, blocked items, and an exact resume protocol for the next session.

The differentiator

A prompt that ships with its own improvement system.

A self-improving protocol: evidenced, versioned, and verified.

The prompt is treated like production software: semantic versions, one evidenced change per release, a 10-question retrospective, a failure log with a Rule-of-Three promotion bar, and a fault-injection regression harness that blocks any edit that weakens a rule. There's no cap on cycles — only a diminishing-returns stop rule.

One improvement cycle (core/improve/CRITIC.md)

Gather evidence — protocol, changelog, failure log, last retros, evals --all.
Critique, severity-weighted — find where wording lets an agent satisfy the letter but violate the intent. Zero findings is a valid outcome.
Propose ONE minimal change — and pay for any added rule by trimming elsewhere.
Gate it — add/extend a trap fixture, run all evals, release only on green.
Log it — changelog entry, immutable copy in core/improve/versions/, retrospective.
Stop on diminishing returns — two consecutive zero-High cycles end the campaign.

Pre-release campaign — run on itself, 2026-06-10

Cycle	The one change	Evals
1	Evidence became row-traceable — closed a partial-fabrication exploit	13/13
2	Human approvals became durable artifacts (Approved: line)	14/14
3	One escape-hatch vocabulary — no more false violations on honest runs	15/15
4–5	Zero findings ≥ High, twice → campaign stopped by its own rule	15/15

Every row has a full evidence trail in CHANGELOG.md, core/improve/FAILURE_LOG.md, and core/improve/retros/.

Campaign 2 (post-1.0.0 ownership review), one cycle: v1.1.0 closed a demonstrated gated-mode evasion found by the blind-spot audit (core/improve/BLINDSPOTS.md), evals 16/16.

Campaign 3 (production-readiness pass), one cycle: added a template-conformance meta-tripwire, CONFIG preflight, precision fixture pack, and CI gate. v1.2.0 stops runs on a placeholder or out-of-set CONFIG instead of improvising one, evals 24/24.

The regression gate, in one command

27 fault-injection traps plant exactly one violation each — a fabricated metric, an uncited "Palantir-grade" target, a leaked AWS key, an unapproved execution. The validator must catch each planted fault and nothing else. A trap that stops tripping means a rule has died — and the edit is blocked.

$ python3 core/evals/validate.py --all
[PASS] B01-fabricated-metric      expect=fail → trapped as expected
[PASS] B07-secret-leak            expect=fail → trapped as expected
[PASS] B12-gated-unapproved-exec  expect=fail → trapped as expected
[PASS] G01-clean-run              expect=pass → clean
… 36 more …
40/40 fixtures OK — ALL GREEN
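The idea behind the gate (each fixture declares what the validator should do with it, and any mismatch turns the run red) can be sketched in a few lines. This is not `core/evals/validate.py`; the `Fixture` type, the fixture names, and the toy validator are all hypothetical stand-ins.

```python
from typing import Callable, NamedTuple


class Fixture(NamedTuple):
    name: str
    text: str    # the fixture's content (a simulated agent report)
    expect: str  # "pass" for clean fixtures, "fail" for traps


def run_gate(fixtures: list[Fixture],
             validator: Callable[[str], bool]) -> bool:
    """Run every fixture; green only if each outcome matches its expectation."""
    ok = 0
    for fx in fixtures:
        outcome = "pass" if validator(fx.text) else "fail"
        status = "PASS" if outcome == fx.expect else "FAIL"
        ok += status == "PASS"
        print(f"[{status}] {fx.name:<24} expect={fx.expect} got={outcome}")
    print(f"{ok}/{len(fixtures)} fixtures OK")
    return ok == len(fixtures)


# Toy validator standing in for the real rules: accept a report only if it
# shows a "$ " command line or an explicit UNMEASURED label.
def toy_validator(report: str) -> bool:
    return "$ " in report or "UNMEASURED" in report


fixtures = [
    Fixture("trap-fabricated-metric", "latency: 12 ms (verified)", "fail"),
    Fixture("clean-run", "$ time pytest -q\n20 passed in 13.9s", "pass"),
]
```

A validator weakened to accept everything would let the trap through, so `run_gate` would return `False`: that is the "trap stops tripping" signal described above.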

AI-agnostic

Bring your own agent.

The protocol needs nothing but a shell, git, and file I/O — every major coding agent qualifies. Tool-specific wiring notes ship in the README.

Claude Code · CLAUDE.md
Cursor · AGENTS.md / rules
Gemini CLI · GEMINI.md
Codex CLI · AGENTS.md
Windsurf · AGENTS.md
OpenHands & others

Work with CyberSkill

We build AI-first engineering systems like this one — for your team.

A global partner with Vietnamese engineering quality.

AUDIT.md is how we work in the open: research-backed, evidence-gated, self-improving. The same discipline goes into what we build for clients and partners worldwide.

Agent protocols, harnesses & evaluation pipelines for your engineering org
AI-assisted code audit & modernization of existing products
Full-cycle software consultancy and development — design system to deployment

Start a conversation → cyberskill.world