Opik
LLM evaluation platform with OTel ingestion, versioned experiments, and Python SDK.
comet-ml/opik · Python SDK + Java/Python backends. Evaluated 2026-04-15. Deployed via Docker Compose against the CYOA agent family.
What it is
Opik is an open-source LLM observability and evaluation platform from Comet ML. The pitch: structured evaluation experiments, dataset versioning, a scoring SDK (GEval, AgentTaskCompletionJudge, etc.), and a UI for comparing runs. Ingestion is via OTel traces (/api/v1/private/otel/v1/traces), so existing instrumented codebases can send data without SDK changes.
It is the only tool in this survey that unifies tracing and structured evals in a single platform. Everything else requires stitching Jaeger + a separate judge + a dashboard yourself.
Deployment
The stack is genuinely large:
| Service | Image / role | Notes |
|---|---|---|
| Java backend | ghcr.io/comet-ml/opik/opik-backend | Port 8080, primary API |
| Python backend | ghcr.io/comet-ml/opik/opik-backend-python | Port 8000, scoring |
| nginx | frontend + reverse proxy | Port 5173 (template) / port 80 (actual) |
| MySQL 8.4 | persistence | |
| ClickHouse 25.3 | trace/span storage | |
| Redis 7.2 | cache | |
| MinIO | blob storage | |
| Zookeeper | ClickHouse coordination | |
| clickhouse-init | one-shot config | |
Eight containers. Compose brings them up cleanly on a fresh host. In a DooD (Docker-out-of-Docker) environment, two things break:
clickhouse-init bind mount fails. The clickhouse-init service bind-mounts ./clickhouse_config from the host filesystem. In DooD, ./ is resolved by the host Docker daemon, which cannot see paths inside the running container — it's a Docker volume, not a host bind. Symptoms: cp: can't stat '/clickhouse_config_files/*'. Fix: pre-populate the opik_clickhouse-config Docker volume manually before starting the stack, then override clickhouse-init's command to a no-op in a compose override file.
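The pre-population step can be sketched with `docker cp`, which streams content from the client side and therefore works where a host bind mount cannot. The volume name matches the Compose project prefix from the source; a no-op Compose override for clickhouse-init (e.g. `command: ["true"]`) then skips the broken copy step:

```shell
# DooD-safe pre-population of the ClickHouse config volume.
# `docker cp` reads ./clickhouse_config from the local (in-container)
# filesystem and streams it to the daemon, bypassing bind-mount resolution.
docker volume create opik_clickhouse-config
docker create --name cfg-seed -v opik_clickhouse-config:/dest alpine true
docker cp ./clickhouse_config/. cfg-seed:/dest/
docker rm cfg-seed
```

This is a sketch under the assumption that the Compose project names the volume `opik_clickhouse-config`; check `docker volume ls` for the actual prefix.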
nginx template bind mount also fails. The nginx service mounts a config template at ./nginx/.... Same DooD failure mode. Workaround: let nginx use its baked-in config; the backend API is accessible directly on the remapped port.
Neither failure is documented in the README or the Docker Compose file; both require working out DooD volume semantics from scratch.
Port conflicts. Opik's defaults collide with a workspace that already runs Antfly (8080) and langalpha-backend (8000):
```yaml
# docker-compose.dood.yaml override
services:
  backend:
    ports: !override
      - "8081:8080"
      - "3004:3003"
  python-backend:
    ports: !override
      - "8002:8000"
```
/health-check, not /api/v1/is-alive/ping. The README and various docs reference /api/v1/is-alive/ping as a health endpoint; it returns 404. The actual health endpoint is GET /health-check (Java backend, 200 with "ok"). All private API paths live under /v1/private/..., with no /api prefix; the OTel ingestion route is the one exception.
API
Once running, the API is stable and well-structured. Key endpoints:
| Path | Method | Description |
|---|---|---|
| `/health-check` | GET | 200 "ok" |
| `/v1/private/projects` | GET | List projects (includes Default Project) |
| `/v1/private/datasets` | POST | Create a dataset |
| `/v1/private/experiments` | POST | Create an experiment |
| `/api/v1/private/otel/v1/traces` | POST | OTel OTLP/HTTP trace ingestion |
The OpenAPI spec is served on port 3003 (/swagger-ui.html). Worth checking before writing any client code: the spec is accurate, and the request shapes aren't obvious from the README.
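The private API can also be exercised without the SDK. A minimal sketch: the `name`/`description` request fields below are assumptions to confirm against the Swagger spec on port 3003, and only the request object is built here — nothing is sent.

```python
import json
import urllib.request

BASE = "http://localhost:8081"  # remapped Java backend port from the DooD override


def create_dataset_request(name: str, description: str = "") -> urllib.request.Request:
    """Build (but don't send) a dataset-creation request.

    The body fields 'name' and 'description' are assumptions; verify
    them against the OpenAPI spec before relying on this.
    """
    body = json.dumps({"name": name, "description": description}).encode()
    return urllib.request.Request(
        f"{BASE}/v1/private/datasets",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = create_dataset_request("cyoa-sessions")
# urllib.request.urlopen(req)  # uncomment against a live stack
```

Against a live stack, `urllib.request.urlopen(req)` sends it; the same pattern applies to `/v1/private/experiments`.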
Evaluation SDK
The Python SDK (pip install opik) handles dataset management, experiment creation, and LLM-as-judge metrics. Core objects:
```python
import opik
from opik.evaluation.metrics import GEval, AgentTaskCompletionJudge

opik.configure(use_local=True)
client = opik.Opik(host="http://localhost:8081")

dataset = client.get_or_create_dataset("cyoa-sessions")
dataset.insert([
    {"input": transcript, "expected": "all-pages-visited"}
    for transcript in sessions
])

experiment = opik.evaluation.evaluate(
    dataset=dataset,
    task=lambda x: {"output": x["input"]},  # sessions are pre-recorded; no live task
    scoring_metrics=[GEval(
        name="navigation_quality",
        criteria="Did the agent visit diverse story pages without looping?",
        model="ollama/qwen3.5:27b",
    )],
)
```
GEval takes a natural-language criteria string and actually uses it, unlike Loom's judge, which ignores custom criteria. The metric calls the specified model, parses a 0–10 score, and returns it with a reason string.
AgentTaskCompletionJudge is a specialized metric designed exactly for agent session evaluation: given a task description and a transcript, it scores whether the agent completed the task. This is the right metric for CYOA navigation without having to prompt-engineer a custom judge.
What works
- Criteria is actually used. The single most important property for our use case. You write what you want to measure; the model scores against it.
- OTel ingestion is first-class. Existing OTel-instrumented pipelines send traces to Opik without modification: just point `OTEL_EXPORTER_OTLP_ENDPOINT` at `http://localhost:8081/api/v1/private/otel/v1/traces`. Spans appear as traces in the UI alongside experiment results.
- Dataset versioning. Datasets are named and versioned; experiments reference a dataset version. You can re-evaluate the same sessions against a new metric without re-importing data.
- UI is usable. Traces, experiments, and dataset items are all browsable. Score distributions, metric breakdowns by item, and experiment comparison are built in.
Friction
The stack is operationally heavy. Eight containers, three databases (MySQL, ClickHouse, Redis), object storage, and a coordinator — for what is essentially an LLM eval database and UI. On a machine that already runs Ollama, Antfly, langalpha, and the podcast pipeline, this is a meaningful resource commitment.
ClickHouse initialization is the most fragile part. If the volume is not pre-populated correctly, ClickHouse starts but all write attempts fail silently. The Java backend accepts requests and returns 200s while ClickHouse rejects the inserts internally. Traces disappear. Fix: confirm ClickHouse can execute a test query before sending real data.
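The pre-send sanity check can be sketched as a direct query through the ClickHouse container. The container name `opik-clickhouse` and database name `opik` are assumptions here; confirm both with `docker ps` and `SHOW DATABASES` before trusting the result:

```shell
# Verify ClickHouse is actually writable before trusting the Java
# backend's 200s. Both names below are assumptions for this sketch.
docker exec opik-clickhouse clickhouse-client --query "SELECT 1"
docker exec opik-clickhouse clickhouse-client --query \
  "SELECT count() FROM system.tables WHERE database = 'opik'"
```

If `SELECT 1` succeeds but the table count is zero, the init volume was not populated and inserts will be dropped.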
Python backend 8002 is not the scoring endpoint. The Python backend at port 8002 handles model-based scoring internally; it's not called directly by clients. The SDK routes to it via the Java backend. Exposing it separately is not necessary for normal use.
GEval with a local Ollama model requires litellm format. The model string is "ollama/qwen3.5:27b" (litellm provider prefix), not the Ollama API model name directly. This is documented in the SDK reference but not in the GEval docstring.
Observations
Opik fills a gap that neither Loom nor our axe+OTel+Antfly setup fills: structured multi-run experiments with a real eval SDK where criteria actually propagates to the model. The Loom judge is better engineered at the infrastructure level (gRPC transport, circuit breakers) but unusable for non-SQL domains because its criteria field is dead code. Opik's SDK is a thin wrapper over an HTTP API; less impressive architecturally but it works.
The OTel ingestion story is the strongest integration point. Our existing podcast pipeline and CYOA sessions are already instrumented; routing those traces to Opik instead of (or alongside) Jaeger costs one env var change.
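That one-variable change can be sketched as below. One hedge: per the OTLP exporter spec, the generic `OTEL_EXPORTER_OTLP_ENDPOINT` has the signal path (`/v1/traces`) appended by the SDK, while the signal-specific variable is used verbatim, so the latter is the safer choice for Opik's non-standard path:

```shell
# Route existing OTLP/HTTP traces to Opik instead of (or alongside) Jaeger.
# Signal-specific endpoint is taken verbatim by OTel SDKs.
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT="http://localhost:8081/api/v1/private/otel/v1/traces"
export OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf"
```

No code changes are needed in the instrumented pipelines; the exporter picks these up at startup.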
Verdict
Scoring deferred — CYOA evals scheduled for 2026-04-16. The deployment story and SDK mechanics are clear; the verdict depends on whether GEval and AgentTaskCompletionJudge produce discriminating scores across the baseline, codemode, and zombie CYOA sessions. Will update after results are in.