Skill Evals

Comparative evaluations of Claude skills. Each eval runs a realistic prompt with and without the skill, then records the differences.

The point is not to show that skills always win. It is to understand where they add signal and where baseline Claude already knows enough. Non-discriminating results, where the output with the skill passes and fails the same assertions as the output without it, are useful too: they tell us the assertions were too weak or the skill is redundant.
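As a rough illustration of the with/without comparison, the sketch below checks the same assertions against both outputs and labels each one. It is not the harness used for these evals: `Assertion`, `Verdict`, and `compare` are hypothetical names, and the two sample outputs are made-up strings rather than recorded results.

```typescript
// Hypothetical types for illustration; none of these names come from the eval code.
type Assertion = { name: string; check: (output: string) => boolean };
type Verdict = "skill-helps" | "skill-hurts" | "non-discriminating";

interface AssertionResult {
  name: string;
  baseline: boolean;
  withSkill: boolean;
  verdict: Verdict;
}

// Run every assertion against both outputs and classify the difference.
function compare(
  baselineOutput: string,
  skillOutput: string,
  assertions: Assertion[],
): AssertionResult[] {
  return assertions.map(({ name, check }) => {
    const baseline = check(baselineOutput);
    const withSkill = check(skillOutput);
    // Same outcome on both runs means the assertion tells us nothing about the skill.
    const verdict: Verdict =
      baseline === withSkill
        ? "non-discriminating"
        : withSkill
          ? "skill-helps"
          : "skill-hurts";
    return { name, baseline, withSkill, verdict };
  });
}

// Made-up sample outputs (assumption: stand-ins for the two model responses).
const sampleBaseline = "process.on('SIGTERM', () => process.exit(0));";
const sampleWithSkill =
  "process.on('SIGTERM', async () => { await sdk.shutdown(); process.exit(0); });";

const sampleAssertions: Assertion[] = [
  { name: "handles SIGTERM", check: (o) => o.includes("SIGTERM") },
  { name: "flushes before exit", check: (o) => o.includes("shutdown()") },
];

for (const r of compare(sampleBaseline, sampleWithSkill, sampleAssertions)) {
  console.log(`${r.name}: ${r.verdict}`);
}
```

In this sketch an assertion that comes out the same way on both runs is reported as non-discriminating rather than dropped, so weak assertions and redundant skills stay visible in the results.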

Evals

Observability skill: OpenTelemetry (OTEL) instrumentation patterns for Node.js/Bun. Tested on three prompts covering signal handling, child-process race conditions, and greenfield boilerplate.