Loom
Pattern-guided LLM agent framework with multi-dimensional evaluation
teradata-labs/loom · Go · v1.2.0
Evaluated 2026-04-15. Deployed against the zombie CYOA problem alongside our existing axe/cyoa-eval experiments.
What it is
Loom is a Go-native LLM orchestration server from Teradata Labs. It exposes agents via gRPC (port 60051) and HTTP/SSE (port 5006), manages them through YAML configs called "patterns," and includes a judge system for multi-dimensional evaluation — quality, safety, cost, domain, performance, usability. Its flagship concept is the "Weaver," which composes patterns into threads.
The pitch: structured, observable, self-improving agent pipelines with LLM-as-judge built in. Target audience appears to be enterprise ML ops teams.
Deployment
The build is non-trivial:
- Requires Go 1.25 (not 1.24) and the `fts5` build tag
- Must run `go run ./cmd/generate-weaver` before building to generate `embedded/weaver.yaml`
- Build: `go build -tags fts5 -o bin/looms ./cmd/looms`
Three non-obvious runtime issues before the server accepts requests:
database.path config is ignored. The server logs the correct path from your YAML but opens ~/.loom/loom.db instead. Workaround: set LOOM_DATA_DIR=/your/workspace/path before starting.
Provider pool requires squashed LLM fields. The `providers:` array uses `mapstructure:",squash"`, so the LLM fields must sit inline alongside `name:`, not nested under an `llm:` key:
```yaml
providers:
  - name: "ollama"
    provider: "ollama"   # ← inline, not under llm:
    ollama_model: "qwen3.5:latest"
    ollama_endpoint: "http://172.17.0.1:11434"
```
HTTP_PROXY breaks Ollama calls. Go's net/http respects the HTTP_PROXY env var; our proxy at host.docker.internal:10255 returned 400 for requests to 172.17.0.1:11434. Required: NO_PROXY=172.17.0.1,host.docker.internal. The failure was masked by an empty 400 response body and took three debug iterations to isolate.
Once past these, the server starts cleanly, preflight checks pass, and the weave API works.
Agent system
Agents are YAML files loaded from the patterns directory. The format is featureful — system prompts, tool grants, memory stores, compression profiles. Our cyoa-player.yaml loaded without issues and responded correctly via the HTTP SSE endpoint:
```shell
curl -X POST http://localhost:5006/v1/weave:stream \
  -d '{"query": "Page: Call Out. Choices: 0=Wait, 1=Move toward voices, 2=Run. Tried: []. Choose."}'
# → {"choice": 1, "reason": "Moving toward voices maximizes the chance of assistance..."}
```
The agent correctly returned structured JSON on the first call. Streaming SSE works as documented.
Judge system
The judge system is Loom's most distinctive feature. Register a judge with evaluation criteria; call looms judge evaluate with agent ID, prompt, response, and judge IDs.
What works: registration, multi-judge aggregation, evaluation pipeline, scoring output.
Critical limitation: custom criteria are ignored. The NewLLMJudge implementation creates an LLM judge instance with only the provider; it does not pass cfg.Criteria into the judge prompt. Instead, buildJudgePromptHardcoded() always emits a SQL-evaluation template asking for factual_accuracy, hallucination_score, query_quality, completeness. No amount of judge YAML configuration changes this template.
Running all three CYOA sessions through the judge:
| Session | Turns | Verdict | Score |
|---|---|---|---|
| Zombie (36-turn loop, no backtracking) | 40 | PARTIAL | 65/100 |
| Baseline (tried-choices in prompt) | 17 | FAIL | 50/100 |
| Code Mode (TypeScript execution) | 14 | FAIL | 60/100 |
The scores are meaningless for CYOA — the judge penalizes all three for "no SQL query provided." The zombie run scored highest (65) purely because its longer transcript contained more text for the judge to find coherent, not because it performed better.
Evaluation latency ranged from 51–158 seconds per session with qwen3.5:latest.
Observations
Loom is architecturally serious: gRPC transport, SQLite persistence, circuit breakers on retry configs, observability hooks. The README describes a self-improving loop where judge verdicts feed back into pattern refinement. That loop is the product's core value proposition.
But the judge's hardcoded SQL template is a showstopper for non-SQL use cases. The criteria field in JudgeConfig appears to be dead code — present in the proto definition and YAML registration, passed nowhere. A custom prompt passed via the judge YAML silently does nothing.
The proxy/NO_PROXY issue is a common Docker gotcha, but Loom makes it uniquely painful because the error is a bare 400 with no body. It took testing Go's net/http directly to isolate. Adding NO_PROXY to the server startup is not documented.
Verdict
3/5. Loom has solid bones — the gRPC/HTTP dual transport, agent YAML system, and judge pipeline scaffold are all well-designed. But v1.2.0 has three deployment blockers that require source inspection to resolve, and the judge system's custom-criteria support is unimplemented. Worth revisiting once criteria injection is fixed and the SQL template becomes a default rather than the only option.
If you need an observable, self-improving agent loop with LLM-as-judge — and you're willing to patch buildJudgePromptHardcoded — Loom is the most complete framework we've seen for that use case. For general CYOA-style navigation evaluation, our axe + OTel + Antfly setup remains more flexible.