Testing Agent Skills Systematically with Evals
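This article explains how to replace gut feeling with defined success criteria, automated evaluations (evals), and structured checks when iterating on a skill for the Codex agent, so that every change is measurable and reproducible.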
When you’re iterating on a skill for an agent like Codex, it’s hard to tell whether you’re actually improving it or just changing its behavior. One version feels faster, another seems more reliable, and then a regression slips in: the skill doesn’t trigger, it skips a required step, or it leaves extra files behind.
At its core, a skill is an organized collection of prompts and instructions for an LLM. The most reliable way to improve a skill over time is to evaluate it the same way you would any other prompt for LLM applications.
Evals (short for evaluations) check whether a model’s output, and the steps it took to produce it, match what you intended. Instead of asking “does this feel better?” (or relying on vibes), evals let you ask concrete questions like:
- Did the agent invoke the skill?
- Did it run the expected commands?
- Did it produce outputs that follow the conventions you care about?
Concretely, an eval is: a prompt → a captured run (trace + artifacts) → a small set of checks → a score you can compare over time.
In practice, evals for agent skills look a lot like lightweight end-to-end tests: you run the agent, record what happened, and score the result against a small set of rules.
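The prompt → captured run → checks → score shape can be sketched in a few lines. This is only an illustration under stated assumptions: `scoreRun`, the layout of the `run` object, and the check ids are hypothetical names, not part of Codex.

```javascript
// Hypothetical captured run: parsed trace events plus the files left on disk.
// None of these names come from Codex; they only illustrate the shape of an eval.
const run = {
  prompt: "Set up a minimal React demo app",
  events: [
    { type: "item.completed", item: { type: "command_execution", command: "npm install" } },
  ],
  files: ["package.json", "src/App.tsx"],
};

// Each check is a named predicate over the captured run.
const checks = [
  { id: "ran-npm-install", test: (r) => r.events.some((e) => e.item?.command?.includes("npm install")) },
  { id: "has-package-json", test: (r) => r.files.includes("package.json") },
  { id: "has-header", test: (r) => r.files.includes("src/components/Header.tsx") },
];

// Apply every check and reduce the results to a score you can compare across versions.
function scoreRun(r, checkList) {
  const results = checkList.map((c) => ({ id: c.id, pass: Boolean(c.test(r)) }));
  const passed = results.filter((x) => x.pass).length;
  return {
    score: results.length === 0 ? 0 : passed / results.length,
    failures: results.filter((x) => !x.pass).map((x) => x.id),
  };
}

const summary = scoreRun(run, checks);
console.log(summary);
```

Keeping the failing check ids next to the score is a deliberate choice here: a number alone shows that something regressed, while the ids show what.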
This post walks through a clear pattern for doing that with Codex, starting from defining success, then adding deterministic checks and rubric-based grading so improvements (and regressions) are clear.
1. Define success before you write the skill
Before writing the skill itself, write down what “success” means in terms you can actually measure. A useful way to think about this is to split your checks into a few categories:
- Outcome goals: Did the task complete? Does the app run?
- Process goals: Did Codex invoke the skill and follow the tools and steps you intended?
- Style goals: Does the output follow the conventions you asked for?
- Efficiency goals: Did it get there without thrashing (for example, unnecessary commands or excessive token use)?
Keep this list small and focused on must-pass checks. The goal isn’t to encode every preference up front, but to capture the behaviors you care about most.
For example, this post evaluates a skill that sets up a demo app. Some checks are concrete: Did it run npm install? Did it create package.json? Those are paired with a structured style rubric that evaluates conventions and layout.
This mix is intentional. You want fast, targeted signals that surface specific regressions early, rather than a single pass/fail verdict at the end.
2. Create the skill
A Codex skill is a directory with a SKILL.md file that includes YAML front matter (name, description), followed by the Markdown instructions that define the skill’s behavior and optional resources and scripts. The name and description matter more than they might seem. They’re the primary signals Codex uses to decide whether to invoke the skill at all, and when to inject the rest of SKILL.md into the agent’s context. If these are vague or overloaded, the skill won’t trigger reliably.
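Because the name and description carry so much weight, it can help to lint them before running any evals. The sketch below is an assumption-heavy illustration: `parseFrontMatter` and `lintSkill` are hypothetical helpers, the parser only handles flat `key: value` front matter, and the 20-character threshold is arbitrary rather than a Codex rule.

```javascript
// Extract flat `key: value` pairs from the YAML front matter of a SKILL.md string.
// Assumption: the front matter is simple enough that a full YAML parser is not needed.
function parseFrontMatter(markdown) {
  const match = markdown.match(/^---\r?\n([\s\S]*?)\r?\n---/);
  if (!match) return null;
  const fields = {};
  for (const line of match[1].split(/\r?\n/)) {
    const idx = line.indexOf(":");
    if (idx === -1) continue;
    fields[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
  }
  return fields;
}

// Hypothetical lint: flag missing fields and very short descriptions.
function lintSkill(markdown) {
  const fm = parseFrontMatter(markdown);
  if (!fm) return ["missing front matter"];
  const problems = [];
  if (!fm.name) problems.push("missing name");
  if (!fm.description) problems.push("missing description");
  else if (fm.description.length < 20) problems.push("description may be too vague");
  return problems;
}

const skillMd = [
  "---",
  "name: setup-demo-app",
  "description: Scaffold a Vite + React + Tailwind demo app with a small, consistent project structure.",
  "---",
  "## When to use this",
].join("\n");

console.log(lintSkill(skillMd));
```

A lint like this cannot tell whether a description triggers reliably; only the prompt-based evals can. It just catches the cheapest mistakes early.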
The fastest way to get started is to use Codex’s built-in skill creator (which itself is also a skill). It walks you through:
$skill-creator
The creator asks you what the skill does, when it should trigger, and whether it’s instruction-only or script-backed (instruction-only is the default recommendation). To learn more about creating a skill, check out the documentation.
A sample skill
This post uses an intentionally minimal example: a skill that sets up a small React demo app in a predictable, repeatable way.
This skill will:
- Scaffold a project using Vite’s React + TypeScript template
- Configure Tailwind CSS using the official Vite plugin approach
- Enforce a minimal, consistent file structure
- Define a clear “definition of done” so success is straightforward to evaluate
Below is a compact draft you can paste either into:
- .codex/skills/setup-demo-app/SKILL.md (repo-scoped), or
- ~/.codex/skills/setup-demo-app/SKILL.md (user-scoped).
---
name: setup-demo-app
description: Scaffold a Vite + React + Tailwind demo app with a small, consistent project structure.
---
## When to use this
Use when you need a fresh demo app for quick UI experiments or reproductions.
## What to build
Create a Vite React TypeScript app and configure Tailwind. Keep it minimal.
Project structure after setup:
- src/
  - main.tsx (entry)
  - App.tsx (root UI)
  - components/
    - Header.tsx
    - Card.tsx
  - index.css (Tailwind import)
- index.html
- package.json
Style requirements:
- TypeScript components
- Functional components only
- Tailwind classes for styling (no CSS modules)
- No extra UI libraries
## Steps
1. Scaffold with Vite using the React TS template:
   npm create vite@latest demo-app -- --template react-ts
2. Install dependencies:
   cd demo-app
   npm install
3. Install and configure Tailwind using the Vite plugin.
   - npm install tailwindcss @tailwindcss/vite
   - Add the tailwind plugin to vite.config.ts
   - In src/index.css, replace contents with:
     @import "tailwindcss";
4. Implement the minimal UI:
- Header: app title and short subtitle
- Card: reusable card container
- App: render Header + 2 Cards with placeholder text
## Definition of done
- npm run dev starts successfully
- package.json exists
- src/components/Header.tsx and src/components/Card.tsx exist
This sample skill takes an opinionated stance on purpose. Without clear constraints, there’s nothing concrete to evaluate.
3. Manually trigger the skill to expose hidden assumptions
Because skill invocation depends so much on the name and description in SKILL.md, the first thing to check is whether the setup-demo-app skill triggers when you expect it to.
Early on, explicitly activate the skill, either via the /skills slash command or by referencing it with the $ prefix, in a real repository or a scratch directory, and watch where it breaks. This is where you surface the misses: cases where the skill doesn’t trigger at all, triggers too eagerly, or runs but deviates from the intended steps.
At this stage, you’re not optimizing for speed or polish. You’re looking for hidden assumptions the skill is making, such as:
- Triggering assumptions: Prompts like “set up a quick React demo” that should invoke setup-demo-app but don’t, or more generic prompts (“add Tailwind styling”) that unintentionally trigger it.
- Environment assumptions: The skill assumes it’s running in an empty directory, or that npm is available and preferred over other package managers.
- Execution assumptions: The agent skips npm install because it assumes dependencies are already installed, or configures Tailwind before the Vite project exists.
Once you’re ready to make these runs repeatable, switch to codex exec. It’s designed for automation and CI: it streams progress to stderr and writes only the final result to stdout, which makes runs easier to script, capture, and inspect.
By default, codex exec runs in a restricted sandbox. If your task needs to write files, run it with --full-auto. As a general rule, especially when automating, use the least permissions needed to get the job done.
A basic manual run might look like:
codex exec --full-auto \
'Use the $setup-demo-app skill to create the project in this directory.'
This first hands-on pass is less about validating correctness and more about discovering edge cases. Every manual fix you make here, such as adding a missing npm install, correcting the Tailwind setup, or tightening the trigger description, is a candidate for a future eval, so you can lock in the intended behavior before evaluating at scale.
4. Use a small, targeted prompt set to catch regressions early
You don’t need a large benchmark to get value from evals. For a single skill, a small set of 10–20 prompts is enough to surface regressions and confirm improvements early.
Start with a small CSV and grow it over time as you encounter real failures during development or usage. Each row should represent a situation where you care whether the setup-demo-app skill does or does not activate, and what success looks like when it does.
For example, an initial evals/setup-demo-app.prompts.csv might look like this:
id,should_trigger,prompt
test-01,true,"Create a demo app named `devday-demo` using the $setup-demo-app skill"
test-02,true,"Set up a minimal React demo app with Tailwind for quick UI experiments"
test-03,true,"Create a small demo app to showcase the Responses API"
test-04,false,"Add Tailwind styling to my existing React app"
Each of these cases is testing something slightly different:
- Explicit invocation (test-01): This prompt names the skill directly. It ensures that Codex can invoke setup-demo-app when asked, and that changes to the skill’s name, description, or instructions don’t break direct usage.
- Implicit invocation (test-02): This prompt describes exactly the scenario the skill targets, setting up a minimal React + Tailwind demo, without mentioning the skill by name. It tests whether the name and description in SKILL.md are strong enough for Codex to select the skill on its own.
- Contextual invocation (test-03): This prompt adds domain context (the Responses API) but still requires the same underlying setup. It checks that the skill triggers in realistic, slightly noisy prompts, and that the resulting app still matches the expected structure and conventions.
- Negative control (test-04): This prompt should not invoke setup-demo-app. It’s a common adjacent request (“add Tailwind to an existing app”) that can unintentionally match the skill’s description (“React + Tailwind demo”). Including at least one should_trigger=false case helps catch false positives, where Codex selects the skill too eagerly and scaffolds a new project when the user wanted an incremental change to an existing one.
This mix is intentional. Some evals should confirm that the skill behaves correctly when invoked explicitly; others should check that it activates in real-world prompts where the user never mentions the skill at all.
As you discover misses, prompts that fail to trigger the skill, or cases where the output drifts from your expectations, add them as new rows. Over time, this small CSV becomes a living record of the scenarios the setup-demo-app skill must continue to get right.
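Loading the prompt CSV and grading the should_trigger column might be sketched as follows. This is a minimal illustration, not the harness used in this post: `parseCsvLine`, `parsePromptCsv`, and `gradeTrigger` are hypothetical helpers, the parser does not handle newlines inside quoted fields, and the `triggered` flag is assumed to come from whatever signal your harness uses to decide that the skill ran.

```javascript
// Split one CSV line into fields, honoring double-quoted fields and "" escapes.
function parseCsvLine(line) {
  const fields = [];
  let current = "";
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inQuotes) {
      if (ch === '"' && line[i + 1] === '"') { current += '"'; i++; }
      else if (ch === '"') inQuotes = false;
      else current += ch;
    } else if (ch === '"') inQuotes = true;
    else if (ch === ",") { fields.push(current); current = ""; }
    else current += ch;
  }
  fields.push(current);
  return fields;
}

// Turn the CSV text into row objects keyed by the header, with should_trigger as a boolean.
function parsePromptCsv(text) {
  const [header, ...lines] = text.split(/\r?\n/).filter((l) => l.trim() !== "");
  const columns = parseCsvLine(header);
  return lines.map((line) => {
    const values = parseCsvLine(line);
    const record = Object.fromEntries(columns.map((c, i) => [c, values[i] ?? ""]));
    return { ...record, should_trigger: record.should_trigger === "true" };
  });
}

// Compare the expectation in the row with what the harness observed.
function gradeTrigger(row, triggered) {
  if (row.should_trigger === triggered) return "pass";
  return row.should_trigger ? "missed trigger" : "false positive";
}

const csv = [
  "id,should_trigger,prompt",
  'test-02,true,"Set up a minimal React demo app with Tailwind for quick UI experiments"',
  'test-04,false,"Add Tailwind styling to my existing React app"',
].join("\n");

const rows = parsePromptCsv(csv);
console.log(rows.map((r) => `${r.id}: ${gradeTrigger(r, true)}`));
```

Reporting "missed trigger" and "false positive" separately keeps the two failure modes apart, since they usually call for opposite changes to the description.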
5. Get started with lightweight deterministic graders
This is the core of the evaluation step: use codex exec --json so your eval harness can score what actually happened, not just whether the final output looks right.
When you enable --json, stdout becomes a JSONL stream of structured events. That makes it straightforward to write deterministic checks tied directly to the behavior you care about, for example:
- Did it run npm install?
- Did it create package.json?
- Did it invoke the expected commands, in the expected order?
These checks are intentionally lightweight. They give you fast, explainable signals before you add any model-based grading.
A minimal Node.js runner
A “good enough” approach looks like this:
- For each prompt, run codex exec --json --full-auto "<prompt>"
- Save the JSONL trace to disk
- Parse the trace and run deterministic checks over the events
// evals/run-setup-demo-app-evals.mjs
import { spawnSync } from "node:child_process";
import { readFileSync, writeFileSync, existsSync, mkdirSync } from "node:fs";
import path from "node:path";

// Run one prompt through `codex exec` and save the JSONL trace to disk.
function runCodex(prompt, outJsonlPath) {
  const res = spawnSync(
    "codex",
    [
      "exec",
      "--json", // REQUIRED: emit structured events
      "--full-auto", // Allow file system changes
      prompt,
    ],
    { encoding: "utf8" }
  );
  mkdirSync(path.dirname(outJsonlPath), { recursive: true });
  // stdout is JSONL when --json is enabled; it is null if codex failed to spawn
  writeFileSync(outJsonlPath, res.stdout ?? "", "utf8");
  return { exitCode: res.status ?? 1, stderr: res.stderr };
}

// Parse a JSONL string into an array of event objects, skipping empty lines.
function parseJsonl(jsonlText) {
  return jsonlText
    .split("\n")
    .filter(Boolean)
    .map((line) => JSON.parse(line));
}

// deterministic check: did the agent run `npm install`?
function checkRanNpmInstall(events) {
  return events.some(
    (e) =>
      (e.type === "item.started" || e.type === "item.completed") &&
      e.item?.type === "command_execution" &&
      typeof e.item?.command === "string" &&
      e.item.command.includes("npm install")
  );
}

// deterministic check: did `package.json` get created?
function checkPackageJsonExists(projectDir) {
  return existsSync(path.join(projectDir, "package.json"));
}

// Example single-case run
const projectDir = process.cwd();
const tracePath = path.join(projectDir, "evals", "artifacts", "test-01.jsonl");
const prompt =
  "Create a demo app named demo-app using the $setup-demo-app skill";

runCodex(prompt, tracePath);
const events = parseJsonl(readFileSync(tracePath, "utf8"));

console.log({
  ranNpmInstall: checkRanNpmInstall(events),
  hasPackageJson: checkPackageJsonExists(path.join(projectDir, "demo-app")),
});
The value here is that everything is deterministic and debuggable.
If a check fails, you can open the JSONL file and see exactly what happened. Every command execution appears as an item.* event, in order. That makes regressions straightforward to explain and fix, which is exactly what you want at this stage.
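The runner above checks a single command. The "expected commands, in the expected order" check could be sketched as below, under the assumption that events have the same shape the runner already relies on (an item.completed event whose item has type command_execution and a string command). `completedCommands` and `ranInOrder` are hypothetical helper names.

```javascript
// Collect command strings from completed command_execution items, in trace order.
function completedCommands(events) {
  return events
    .filter((e) => e.type === "item.completed" && e.item?.type === "command_execution")
    .map((e) => e.item.command)
    .filter((c) => typeof c === "string");
}

// True when every expected fragment appears in some command, in the given order.
// Unrelated commands may be interleaved. A single command that chains several steps
// (for example with &&) can satisfy consecutive fragments.
function ranInOrder(events, expectedFragments) {
  let next = 0;
  for (const command of completedCommands(events)) {
    while (next < expectedFragments.length && command.includes(expectedFragments[next])) {
      next++;
    }
  }
  return next === expectedFragments.length;
}

// Hand-written sample trace for illustration only.
const sampleTrace = [
  { type: "item.completed", item: { type: "command_execution", command: "npm create vite@latest demo-app -- --template react-ts" } },
  { type: "item.completed", item: { type: "command_execution", command: "ls" } },
  { type: "item.completed", item: { type: "command_execution", command: "npm install" } },
];

console.log(ranInOrder(sampleTrace, ["npm create vite", "npm install"]));
```

Matching on substrings rather than exact commands is a deliberate trade-off: it tolerates shell wrappers and extra flags, at the cost of being looser than a strict comparison.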
6. Conduct qualitative checks with Codex and rubric-based grading
Deterministic checks answer “did it do the basics?” but they don’t answer “did it do it the way you wanted?”
For skills like setup-demo-app, many requirements are qualitative: component structure, styling conventions, or whether Tailwind follows the intended configuration. These are hard to capture with basic file existence checks or command counts alone.
A pragmatic solution is to add a second, model-assisted step to your eval pipeline:
- Run the setup skill (this writes code to disk)
- Run a read-only style check against the resulting repository
- Require a structured response that your harness can score consistently
Codex supports this directly via --output-schema, which constrains the final response to a JSON Schema you define.
A small rubric schema
Start by defining a small schema that captures the checks you care about. For example, create evals/style-rubric.schema.json:
{
  "type": "object",
  "properties": {
    "overall_pass": { "type": "boolean" },
    "score": { "type": "integer", "minimum": 0, "maximum": 100 },
    "checks": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "id": { "type": "string" },
          "pass": { "type": "boolean" },
          "notes": { "type": "string" }
        },
        "required": ["id", "pass", "notes"],
        "additionalProperties": false
      }
    }
  },
  "required": ["overall_pass", "score", "checks"],
  "additionalProperties": false
}
This schema gives you stable fields (overall_pass, score, per-check results) that you can combine, diff, and track over time.
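As one way to diff two rubric results, the sketch below compares per-check outcomes between an earlier and a later run. `diffRubrics` is a hypothetical helper, and the two sample objects are hand-written illustrations that follow the schema's fields, not real Codex output.

```javascript
// Compare two rubric results that follow the schema (overall_pass, score, checks[]).
function diffRubrics(before, after) {
  const previous = new Map(before.checks.map((c) => [c.id, c.pass]));
  const regressions = [];
  const fixes = [];
  for (const check of after.checks) {
    const was = previous.get(check.id);
    if (was === true && check.pass === false) regressions.push(check.id);
    if (was === false && check.pass === true) fixes.push(check.id);
  }
  return { regressions, fixes, scoreDelta: after.score - before.score };
}

// Illustrative data only.
const baseline = {
  overall_pass: false,
  score: 75,
  checks: [
    { id: "vite", pass: true, notes: "" },
    { id: "tailwind", pass: false, notes: "plugin missing from vite.config.ts" },
    { id: "structure", pass: true, notes: "" },
  ],
};
const candidate = {
  overall_pass: false,
  score: 80,
  checks: [
    { id: "vite", pass: true, notes: "" },
    { id: "tailwind", pass: true, notes: "" },
    { id: "structure", pass: false, notes: "Card.tsx not found" },
  ],
};

console.log(diffRubrics(baseline, candidate));
```

Note that the score went up in this example while one check regressed, which is why per-check ids are worth tracking alongside the aggregate.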
The style-check prompt
Next, run a second codex exec that only inspects the repository and emits a rubric-compliant JSON response:
codex exec \
"Evaluate the demo-app repository against these requirements:
- Vite + React + TypeScript project exists
- Tailwind is configured via @tailwindcss/vite and CSS imports tailwindcss
- src/components contains Header.tsx and Card.tsx
- Components are functional and styled with Tailwind utility classes (no CSS modules)
Return a rubric result as JSON with check ids: vite, tailwind, structure, style." \
--output-schema ./evals/style-rubric.schema.json \
-o ./evals/artifacts/test-01.style.json
This is where --output-schema is handy. Instead of free-form text that’s hard to parse or compare, you get a predictable JSON object that your eval harness can score across many runs.
If you later move this eval suite into CI, the Codex GitHub Action explicitly supports passing --output-schema through codex-args, so you can enforce the same structured output in automated workflows.
7. Extending your evals as the skill matures
Once you have the core loop in place, you can extend your evals in the directions that matter most for your skill. Start small, then layer in deeper checks only where they add real confidence.
Some examples include:
- Command count and thrashing: Count command_execution items in the JSONL trace to catch regressions where the agent starts looping or re-running commands. Token usage is also available in turn.completed events.
- Token budget: Track usage.input_tokens and usage.output_tokens to spot accidental prompt bloat and compare efficiency across versions.
- Build checks: Run npm run build after the skill completes. This acts as a stronger end-to-end signal and catches broken imports or incorrectly configured tooling.
- Runtime smoke checks: Start npm run dev and hit the dev server with curl, or run a lightweight Playwright check if you already have one. Use this selectively. It adds confidence but costs time.
- Repository cleanliness: Ensure the run generates no unwanted files and that git status --porcelain is empty (or matches an explicit allow list).
- Sandbox and permission regressions: Verify the skill still works without escalating permissions beyond what you intended. Least-privilege defaults matter most once you automate.
The pattern is consistent: begin with fast checks that explain behavior, then add slower, heavier checks only when they reduce risk.
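The first two extensions, command count and token budget, are among the fast checks and can be computed from the same parsed trace. The sketch below assumes token counts sit under a `usage` field on turn.completed events, following the field names mentioned above. `traceMetrics`, `overBudget`, and the budget numbers are hypothetical.

```javascript
// Summarize efficiency signals from a parsed JSONL trace.
// Assumption: each turn.completed event carries usage.input_tokens and usage.output_tokens.
function traceMetrics(events) {
  const metrics = { commandCount: 0, inputTokens: 0, outputTokens: 0 };
  for (const e of events) {
    if (e.type === "item.completed" && e.item?.type === "command_execution") {
      metrics.commandCount += 1;
    }
    if (e.type === "turn.completed") {
      metrics.inputTokens += e.usage?.input_tokens ?? 0;
      metrics.outputTokens += e.usage?.output_tokens ?? 0;
    }
  }
  return metrics;
}

// Return the names of the metrics that exceed their budget.
function overBudget(metrics, budget) {
  return Object.keys(budget).filter((key) => metrics[key] > budget[key]);
}

// Hand-written sample events for illustration only.
const sampleEvents = [
  { type: "item.completed", item: { type: "command_execution", command: "npm install" } },
  { type: "item.completed", item: { type: "command_execution", command: "npm install" } },
  { type: "item.completed", item: { type: "other" } }, // any non-command item is ignored
  { type: "turn.completed", usage: { input_tokens: 1200, output_tokens: 300 } },
];

const metrics = traceMetrics(sampleEvents);
console.log(metrics, overBudget(metrics, { commandCount: 1, inputTokens: 5000 }));
```

Budgets like these are most useful relative to a known-good baseline run rather than as absolute limits.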
8. Key takeaways
This small setup-demo-app example shows the shift from “it feels better” to “proof”: run the agent, record what happened, and grade it with a small set of checks. Once that loop exists, every tweak becomes easier to confirm, and every regression becomes clear. Here are the key takeaways:
- Measure what matters. Good evals make regressions clear and failures explainable.
- Start from a checkable definition of done. Use $skill-creator to bootstrap, then tighten the instructions until success is unambiguous.
- Ground evals in behavior. Capture JSONL with codex exec --json and write deterministic checks against command_execution events.
- Use Codex where rules fall short. Add a structured, rubric-based pass with --output-schema to grade style and conventions reliably.
- Let real failures drive coverage. Every manual fix is a signal. Turn it into a test so the skill keeps getting it right.