Observability Skill Eval

Skill: /home/node/.claude/skills/observability/ — OTEL instrumentation checklist and patterns for Node.js/Bun

Three prompts. Each run with and without the skill. The skill's thesis is that Bun's async context propagation is unreliable and requires explicit ctx threading — knowledge unlikely to appear in baseline Claude without prompting.

Eval 0 — Spans lost on kill; async helpers appear as root spans

The exact production failure that motivated the skill.

Prompt: I'm building a Bun-based data pipeline that calls several LLM APIs in sequence. I've added @opentelemetry/sdk-node and wrapped each step in a span, but when the process gets killed mid-run I see nothing in Jaeger — not even an error span. Also the spans from async helper functions appear as root spans instead of children of the pipeline span. How do I fix this?

With skill

Correctly identifies both problems and gives Bun-specific fixes.

On spans lost on kill: registers SIGTERM/SIGINT handlers that end the root span with SpanStatusCode.ERROR, then await provider.shutdown() before process.exit(). Includes a finished flag to prevent double-end if a signal races a clean exit. Also adds an unhandledRejection handler for Promises that escape try/catch.

On async orphans: explicitly names Bun and explains that context.active() in async callbacks may resolve to ROOT_CONTEXT in Bun — not a Bun version bug, a fundamental reliability gap. Fix: capture context once via trace.setSpan(context.active(), rootSpan), then thread it as a ctx: Context parameter into every async helper, which passes it as the third argument to tracer.startSpan("name", {}, ctx).

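import { context, trace, SpanStatusCode } from "@opentelemetry/api";

// Assumes `provider` (the configured tracer provider) and `main(ctx)` are defined elsewhere.
async function run() {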
  const rootSpan = trace.getTracer("my-pipeline").startSpan("pipeline");
  const rootCtx  = trace.setSpan(context.active(), rootSpan);

  let finished = false;

  const shutdown = async (code: number, reason?: Error) => {
    if (finished) return;
    finished = true;
    if (reason) {
      rootSpan.setStatus({ code: SpanStatusCode.ERROR, message: reason.message });
      rootSpan.recordException(reason);
    } else {
      rootSpan.setStatus({ code: SpanStatusCode.OK });
    }
    rootSpan.end();
    await provider.shutdown();
    process.exit(code);
  };

  process.once("SIGTERM", () => shutdown(1, new Error("SIGTERM")));
  process.once("SIGINT",  () => shutdown(1, new Error("SIGINT")));
  process.once("unhandledRejection", (r: any) =>
    shutdown(1, r instanceof Error ? r : new Error(String(r)))
  );

  try {
    await main(rootCtx);
    await shutdown(0);
  } catch (err: any) {
    await shutdown(1, err);
  }
}
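
The run() function above hands rootCtx to main but does not show the helper side of the pattern. A minimal sketch of what a helper might look like follows, assuming only the @opentelemetry/api package; summarizeStep, fetchCompletion, and runPipeline are hypothetical names, and the uppercase transform is a placeholder for a real LLM call.

```typescript
import { context, trace, SpanStatusCode } from "@opentelemetry/api";
import type { Context } from "@opentelemetry/api";

const tracer = trace.getTracer("my-pipeline");

// Hypothetical leaf helper: stands in for a real LLM API call.
async function fetchCompletion(ctx: Context, prompt: string): Promise<string> {
  // The parent comes from ctx (third argument), not from context.active().
  const span = tracer.startSpan("fetch-completion", {}, ctx);
  try {
    return prompt.toUpperCase(); // placeholder for the network call
  } finally {
    span.end();
  }
}

// Hypothetical pipeline step that calls a nested helper.
async function summarizeStep(ctx: Context, prompt: string): Promise<string> {
  const span = tracer.startSpan("summarize-step", {}, ctx);
  // Derive a child context so nested helpers attach under this span.
  const stepCtx = trace.setSpan(ctx, span);
  try {
    const out = await fetchCompletion(stepCtx, prompt);
    span.setStatus({ code: SpanStatusCode.OK });
    return out;
  } catch (err) {
    const e = err instanceof Error ? err : new Error(String(err));
    span.setStatus({ code: SpanStatusCode.ERROR, message: e.message });
    span.recordException(e);
    throw e;
  } finally {
    span.end();
  }
}

// Capture the root context once, then pass it down explicitly.
async function runPipeline(): Promise<string> {
  const rootSpan = tracer.startSpan("pipeline");
  const rootCtx = trace.setSpan(context.active(), rootSpan);
  try {
    return await summarizeStep(rootCtx, "hello");
  } finally {
    rootSpan.end();
  }
}
```

Because each helper receives its parent through ctx, the parent/child link does not depend on what context.active() returns inside an async callback. The cost is an extra parameter on every helper signature.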

Without skill

Covers the flushing problem correctly — signal handlers calling sdk.shutdown(). Also suggests SimpleSpanProcessor during development, which is genuinely good advice.

On async orphans: recommends switching from startSpan to startActiveSpan, explaining that AsyncLocalStorage propagates context automatically. Then adds: "Check your Bun version — Bun's AsyncLocalStorage was substantially improved in 1.x."

This is wrong as a primary fix. startActiveSpan alone does not solve the problem in Bun. The unreliability is not a version bug that was patched; it's a structural issue requiring explicit context passing. A developer following this advice would upgrade Bun, find the same orphaned spans, and be stuck.

Also missing: finished flag, unhandledRejection handler.

Eval 1 — Double span.end() on child process timeout

A specific race condition in Promise-wrapped subprocesses.

Prompt: I'm wrapping a child process in a Promise for OTEL tracing. My code sets a timeout that kills the process and rejects the Promise, but sometimes I see the span ended twice and get double-rejection warnings. (Code provided: timeout fires proc.kill() + span.end() + reject(), then close event also fires span.end() + reject().)

With skill

Names the exact mechanism: proc.kill() sends a signal; the process exits asynchronously; the close event fires and runs the close handler. Both the timeout and the close handler call span.end() and reject(). Fix: a settled boolean flag via a shared fail() helper. Also adds span.setStatus() and span.recordException() on every code path — the original code left span status unset, which makes failures appear as OK in Jaeger.
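
The settled-flag fix can be sketched roughly as follows. This is an illustration, not the code either response produced: settleOnce and runTraced are hypothetical names, and SpanLike is a stand-in for the three OTEL Span methods used here so the sketch runs without the SDK (a real Span from @opentelemetry/api should fit the same shape).

```typescript
import { spawn } from "node:child_process";

// Stand-in for the subset of the OTEL Span interface used below (an assumption).
interface SpanLike {
  setStatus(status: { code: number; message?: string }): unknown;
  recordException(err: Error): void;
  end(): void;
}

const STATUS_OK = 1;    // mirrors SpanStatusCode.OK
const STATUS_ERROR = 2; // mirrors SpanStatusCode.ERROR

// Returns ok/fail callbacks that end the span and settle the Promise at most once.
function settleOnce<T>(
  span: SpanLike,
  resolve: (value: T) => void,
  reject: (err: Error) => void,
) {
  let settled = false;
  return {
    ok(value: T): void {
      if (settled) return;
      settled = true;
      span.setStatus({ code: STATUS_OK });
      span.end();
      resolve(value);
    },
    fail(err: Error): void {
      if (settled) return;
      settled = true;
      span.setStatus({ code: STATUS_ERROR, message: err.message });
      span.recordException(err);
      span.end();
      reject(err);
    },
  };
}

// Runs a subprocess under a span; a timeout and the later close event cannot both settle it.
function runTraced(span: SpanLike, cmd: string, args: string[], timeoutMs: number): Promise<number> {
  return new Promise<number>((resolve, reject) => {
    const { ok, fail } = settleOnce<number>(span, resolve, reject);
    const proc = spawn(cmd, args);
    const timer = setTimeout(() => {
      proc.kill(); // close still fires afterwards; the flag makes that second call a no-op
      fail(new Error(`timed out after ${timeoutMs} ms`));
    }, timeoutMs);
    proc.on("error", (err) => { clearTimeout(timer); fail(err); });
    proc.on("close", (code) => {
      clearTimeout(timer);
      if (code === 0) ok(0);
      else fail(new Error(`exited with code ${code}`));
    });
  });
}
```

Routing every exit through ok() or fail() is what puts a status on every code path; the flag alone would only stop the double end.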

Without skill

Also gets the settled flag correct — this is a well-known JavaScript pattern, not OTEL-specific. Adds clearTimeout() as a belt-and-suspenders complement to the flag. Reaches setStatus advice only as an "Additional Consideration" rather than leading with it.

Verdict on this eval: weak discrimination. Both responses solve the immediate bug. The with-skill response is more OTEL-aware (status on every path) but the difference is small.

Eval 2 — Greenfield production boilerplate

What does baseline Claude produce vs. skill-augmented Claude for a new service?

Prompt: I'm instrumenting a new Node.js service with OpenTelemetry from scratch. Give me a production-ready boilerplate for the entry point that handles graceful shutdown, signal interruption, and unhandled promise rejections — all with proper span error recording so nothing goes silent.

With skill

Single-file entry point. All five safety mechanisms present and explained:

SIGTERM + SIGINT handlers ending the root span as ERROR and awaiting provider.shutdown() before exit
unhandledRejection handler
finished flag preventing double-end
Explicit rootCtx storage and a note that every async helper must accept ctx: Context (Bun caveat named)
setStatus on every exit path

The explanatory comments name the failure mode each pattern prevents.

Without skill

A more elaborate two-file setup (telemetry.ts + index.ts). Has signal handlers and an isShuttingDown guard. Covers unhandledRejection and uncaughtException. Also includes metrics setup and auto-instrumentation, both genuinely useful for a long-lived service.

Double shutdown is handled: the isShuttingDown guard plays the same role as the finished flag. Missing: any mention of Bun context propagation. Uses startActiveSpan throughout without the caveat that it can lose the parent context in Bun.

For a Node.js service (not Bun), the without-skill boilerplate is arguably more complete — it adds metrics and auto-instrumentation the with-skill response omits. For Bun, the with-skill response is the correct one.

Summary

	Eval 0	Eval 1	Eval 2
With skill	Correct Bun fix; all safety mechanisms	Correct + status on all paths	All 5 mechanisms; Bun caveat named
Without skill	Wrong fix for Bun async; missing `finished` + `unhandledRejection`	Correct (well-known JS pattern)	Correct for Node.js; missing Bun caveat
Discriminating?	Yes	Weakly	Yes, for Bun

The skill earns its keep specifically on Bun projects. For standard Node.js, baseline Claude produces correct boilerplate. The Bun context propagation failure mode is niche enough that it won't surface in general advice, and specific enough that getting it wrong wastes days.

Assertion quality: All 12 assertions passed for both the with-skill and without-skill responses, so they did not discriminate between them; the assertions were too weak. A second iteration should add: "mentions Bun's unreliable async context propagation specifically," "includes finished flag," and "includes unhandledRejection handler."
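
As a rough illustration of what the three proposed assertions could look like as automated checks, here is a keyword-based sketch. The eval harness's actual assertion format is not shown in this report, so the Assertion type, the grade function, and the regular expressions are all assumptions; a real grader would likely need something less brittle than keyword matching.

```typescript
// Hypothetical shape for a response-level assertion.
type Assertion = { name: string; check: (response: string) => boolean };

const strongerAssertions: Assertion[] = [
  {
    name: "mentions Bun's unreliable async context propagation specifically",
    // Requires Bun by name plus a sign that implicit context is treated as unsafe.
    check: (r) =>
      /\bBun\b/.test(r) &&
      /(ROOT_CONTEXT|unreliable|explicit(ly)?\s+(ctx|context))/i.test(r),
  },
  {
    name: "includes finished flag",
    check: (r) => /\bfinished\b/.test(r),
  },
  {
    name: "includes unhandledRejection handler",
    check: (r) => /unhandledRejection/.test(r),
  },
];

// Runs every assertion against one response and reports pass/fail per assertion.
function grade(response: string): { name: string; passed: boolean }[] {
  return strongerAssertions.map(({ name, check }) => ({
    name,
    passed: check(response),
  }));
}
```

A response that only says "check your Bun version" and recommends startActiveSpan would fail the first check, which is the distinction the original 12 assertions missed.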