Opik

LLM evaluation platform with OTel ingestion, versioned experiments, and Python SDK.

comet-ml/opik · Python SDK + Java/Python backends. Evaluated 2026-04-15. Deployed via Docker Compose against the CYOA agent family.

What it is

Opik is an open-source LLM observability and evaluation platform from Comet ML. The pitch: structured evaluation experiments, dataset versioning, a scoring SDK (GEval, AgentTaskCompletionJudge, etc.), and a UI for comparing runs. Ingestion is via OTel traces (/api/v1/private/otel/v1/traces) so existing instrumented codebases send data without SDK changes.

It is the only tool in this survey that unifies tracing and structured evals in a single platform. Everything else requires stitching together Jaeger, a separate judge, and a dashboard yourself.

Deployment

The stack is genuinely large:

| Service | Image / version | Notes |
|---|---|---|
| Java backend | ghcr.io/comet-ml/opik/opik-backend | Port 8080, primary API |
| Python backend | ghcr.io/comet-ml/opik/opik-backend-python | Port 8000, scoring |
| nginx | frontend + reverse proxy | Port 5173 (template) / port 80 (actual) |
| MySQL | 8.4 | persistence |
| ClickHouse | 25.3 | trace/span storage |
| Redis | 7.2 | cache |
| MinIO | | blob storage |
| Zookeeper | | ClickHouse coordination |
| clickhouse-init | | one-shot config |

Eight containers. Compose brings them up cleanly on a fresh host. In a DooD (Docker-out-of-Docker) environment, two things break:

clickhouse-init bind mount fails. The clickhouse-init service bind-mounts ./clickhouse_config from the host filesystem. In DooD, ./ is resolved by the host Docker daemon, which cannot see paths inside the running container — it's a Docker volume, not a host bind. Symptoms: cp: can't stat '/clickhouse_config_files/*'. Fix: pre-populate the opik_clickhouse-config Docker volume manually before starting the stack, then override clickhouse-init's command to a no-op in a compose override file.
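A sketch of that workaround, assuming the stock service and volume names (clickhouse-init, opik_clickhouse-config); seed the volume through the Docker API first (docker cp streams file content over the API, so it works in DooD where bind mounts do not), then neutralize the init service:

```yaml
# Seed the volume (works in DooD because docker cp goes via the API):
#   docker container create --name seed -v opik_clickhouse-config:/dest alpine
#   docker cp ./clickhouse_config/. seed:/dest/
#   docker rm seed
# Then, in the compose override:
services:
  clickhouse-init:
    volumes: !override []   # drop the host bind mount DooD cannot resolve
    command: ["true"]       # volume already seeded; make init a no-op
```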

nginx template bind mount also fails. The nginx service mounts a config template at ./nginx/.... Same DooD failure mode. Workaround: let nginx use its baked-in config; the backend API is accessible directly on the remapped port.

Neither issue is documented in the README or the Docker Compose file; both require working out the DooD volume semantics from scratch.

Port conflicts. Opik's defaults collide with a workspace that already runs Antfly (8080) and langalpha-backend (8000):

```yaml
# docker-compose.dood.yaml override
services:
  backend:
    ports: !override
      - "8081:8080"
      - "3004:3003"
  python-backend:
    ports: !override
      - "8002:8000"
```

/health-check, not /api/v1/is-alive/ping. The README and various docs reference /api/v1/is-alive/ping as a health endpoint; it returns 404. The actual health endpoint is GET /health-check (Java backend, 200 with "ok"). Private API paths live at /v1/private/... with no /api prefix; the OTel ingestion route (/api/v1/private/otel/v1/traces) is the one exception.
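A quick readiness probe against the remapped backend port, using only the stdlib (port 8081 taken from the compose override; adjust if yours differs):

```python
import urllib.request

def opik_healthy(base: str = "http://localhost:8081") -> bool:
    """True if the Java backend answers 200 on /health-check."""
    try:
        with urllib.request.urlopen(f"{base}/health-check", timeout=5) as r:
            return r.status == 200
    except OSError:  # connection refused, timeout, DNS failure, URLError
        return False
```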

API

Once running, the API is stable and well-structured. Key endpoints:

| Path | Method | Description |
|---|---|---|
| /health-check | GET | 200 "ok" |
| /v1/private/projects | GET | List projects (includes Default Project) |
| /v1/private/datasets | POST | Create a dataset |
| /v1/private/experiments | POST | Create an experiment |
| /api/v1/private/otel/v1/traces | POST | OTel OTLP/HTTP trace ingestion |

The OpenAPI spec is served at port 3003 (/swagger-ui.html). Worth checking before writing any client code — the spec is accurate and the request shapes aren't obvious from the README.
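For example, listing projects without pulling in the SDK takes a few lines of stdlib; the response schema is not shown here, so check the Swagger spec for the authoritative shape:

```python
import json
import urllib.request

def projects_url(base: str = "http://localhost:8081") -> str:
    # Private API lives under /v1/private (no /api prefix).
    return f"{base.rstrip('/')}/v1/private/projects"

def list_projects(base: str = "http://localhost:8081"):
    """GET the project list; returns the parsed JSON body."""
    with urllib.request.urlopen(projects_url(base), timeout=10) as r:
        return json.load(r)
```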

Evaluation SDK

The Python SDK (pip install opik) handles dataset management, experiment creation, and LLM-as-judge metrics. Core objects:

```python
import opik
from opik.evaluation import evaluate
from opik.evaluation.metrics import GEval

opik.configure(use_local=True)
client = opik.Opik(host="http://localhost:8081")

# sessions: list of CYOA transcript strings collected earlier
dataset = client.get_or_create_dataset("cyoa-sessions")
dataset.insert(
    [{"input": transcript, "expected": "all-pages-visited"} for transcript in sessions]
)

experiment = evaluate(
    dataset=dataset,
    task=lambda x: {"output": x["input"]},  # identity task: score stored transcripts
    scoring_metrics=[
        GEval(
            name="navigation_quality",
            criteria="Did the agent visit diverse story pages without looping?",
            model="ollama/qwen3.5:27b",  # litellm provider prefix
        )
    ],
)
```

GEval takes a natural-language criteria string and actually uses it — unlike Loom's judge, which ignores custom criteria. The metric calls the specified model, parses a 0–10 score, and returns it with a reason string.

AgentTaskCompletionJudge is a specialized metric designed exactly for agent session evaluation: given a task description and a transcript, it scores whether the agent completed the task. This is the right metric for CYOA navigation without having to prompt-engineer a custom judge.

What works

Friction

The stack is operationally heavy. Eight containers, three databases (MySQL, ClickHouse, Redis), object storage, and a coordinator — for what is essentially an LLM eval database and UI. On a machine that already runs Ollama, Antfly, langalpha, and the podcast pipeline, this is a meaningful resource commitment.

ClickHouse initialization is the most fragile part. If the volume is not pre-populated correctly, ClickHouse starts but all write attempts fail silently. The Java backend accepts requests and returns 200s while ClickHouse rejects the inserts internally. Traces disappear. Fix: confirm ClickHouse can execute a test query before sending real data.
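A minimal sanity probe, assuming ClickHouse's HTTP interface is reachable on its default port 8123 (the Opik compose file may not expose it; remap as needed):

```python
import urllib.request

def clickhouse_ok(base: str = "http://localhost:8123") -> bool:
    """Run SELECT 1 over ClickHouse's HTTP interface before trusting writes."""
    try:
        with urllib.request.urlopen(f"{base}/?query=SELECT%201", timeout=5) as r:
            return r.read().strip() == b"1"
    except OSError:
        return False
```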

The Python backend is not a client-facing endpoint. The Python backend at port 8002 handles model-based scoring internally; clients never call it directly, since the SDK routes to it via the Java backend. Exposing it separately is unnecessary for normal use.

GEval with a local Ollama model requires litellm format. The model string is "ollama/qwen3.5:27b" (litellm provider prefix), not the Ollama API model name directly. This is documented in the SDK reference but not in the GEval docstring.
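A hypothetical helper (not part of the SDK) that makes the convention explicit:

```python
def to_litellm_id(model_tag: str, provider: str = "ollama") -> str:
    """Prefix a bare Ollama model tag with a litellm provider name.
    Leaves tags that already carry a provider prefix untouched."""
    return model_tag if "/" in model_tag else f"{provider}/{model_tag}"
```

So to_litellm_id("qwen3.5:27b") yields "ollama/qwen3.5:27b", the form GEval expects.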

Observations

Opik fills a gap that neither Loom nor our axe+OTel+Antfly setup fills: structured multi-run experiments with a real eval SDK where the criteria string actually propagates to the model. The Loom judge is better engineered at the infrastructure level (gRPC transport, circuit breakers) but unusable for non-SQL domains because its criteria field is dead code. Opik's SDK is a thin wrapper over an HTTP API; less impressive architecturally, but it works.

The OTel ingestion story is the strongest integration point. Our existing podcast pipeline and CYOA sessions are already instrumented; routing those traces to Opik instead of (or alongside) Jaeger costs one env var change.
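Concretely, for an exporter configured through the standard OTel environment variables, the change might look like this (signal-specific endpoint variable assumed, which the SDK uses verbatim without appending a path suffix):

```python
import os

# Point the trace exporter at Opik's OTLP ingestion path on the remapped
# backend port instead of (or alongside) Jaeger.
os.environ["OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"] = (
    "http://localhost:8081/api/v1/private/otel/v1/traces"
)
os.environ["OTEL_EXPORTER_OTLP_TRACES_PROTOCOL"] = "http/protobuf"
```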

Verdict

Scoring deferred — CYOA evals scheduled for 2026-04-16. The deployment story and SDK mechanics are clear; the verdict depends on whether GEval and AgentTaskCompletionJudge produce discriminating scores across the baseline, codemode, and zombie CYOA sessions. Will update after results are in.