Ollama
Local LLM inference server running on the Docker host.
Goals
Run large language models locally — on hardware we control, without per-token API costs. Primary requirement: host the 27b model that all axe agents use for task decomposition and podcast generation.
Effectiveness
Effective as the inference backend. Once the 27b model is loaded it handles all agent workloads without intervention. The native request queue is sufficient for our sequential workload (one axe call at a time). Reliability issues are real but manageable — per-task fault isolation means a dropped connection doesn't abort an entire run.
What made it effective
- VRAM management is automatic: Ollama loads and evicts models as needed. The 27b model holds a permanent de facto lock (always the current model) — we don't need to explicitly pin it.
- nomic-embed-text co-exists alongside the primary model for Antfly embeddings, without requiring a separate embedding service or process.
- The /api/tags endpoint is a reliable health check: a fast GET with a 5s timeout confirms reachability before any task processing starts, rather than discovering a dead server on the third task (see the curl sketch below).
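A minimal sketch of that health check, assuming the bridge address used elsewhere in these notes:

```bash
# Abort before task processing if Ollama is unreachable.
# --max-time bounds the whole request at 5 seconds.
if ! curl --silent --fail --max-time 5 http://172.17.0.1:11434/api/tags > /dev/null; then
  echo "Ollama unreachable; refusing to start task processing" >&2
  exit 1
fi
```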
Bonus utility
OLLAMA_HOST=http://172.17.0.1:11434 ollama run <model> "prompt" works as a one-shot query from inside the container — useful for quick text processing tasks without writing a full axe agent.
Friction / pain points / surprises
ConnectionClosed on long generations. Ollama drops idle connections. axe uses the batch (non-streaming) endpoint; if generation takes long enough the socket closes mid-response. The entire request fails with no partial result. Streaming would eliminate this, but axe doesn't currently use it.
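For reference, a sketch of the streaming variant (which axe does not use today), assuming the same host address and model tag as elsewhere in these notes. Streaming is the /api/generate default; it's explicit here for clarity:

```bash
# Tokens arrive as newline-delimited JSON while generation runs,
# so the socket never sits idle long enough to be dropped.
curl --no-buffer http://172.17.0.1:11434/api/generate -d '{
  "model": "qwen3.5:27b",
  "prompt": "Summarize the following article: ...",
  "stream": true
}'
```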
VRAM contention is absolute. The 27b model occupies all available VRAM. qwen3.5:latest (9.7B) cannot be loaded simultaneously — any attempt queues until the 27b is evicted, which never happens. We assumed both models could be used interchangeably; they can't.
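ollama ps makes the contention visible: it lists resident models with their VRAM footprint, so you can confirm the 27b is occupying the card before wondering why another load hangs:

```bash
# Lists loaded models, their VRAM usage, and time until eviction.
OLLAMA_HOST=http://172.17.0.1:11434 ollama ps
```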
context deadline exceeded is Ollama's internal timeout, not axe's. The two are distinct failure modes with similar-sounding messages: Ollama's fires under heavy load or on very long prompts, and it is not configurable from the client.
Parallel requests on a single GPU serialize invisibly, causing timeout cascades. Submitting N concurrent LLM requests doesn't speed anything up — Ollama queues them one at a time on the GPU. With ~5–13 minutes per generation and 6 requests queued, the last ones in line hit the client timeout before Ollama even starts on them. Fix: sequential invocation. Same total GPU time, no timeouts, predictable progress.
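A sketch of the sequential pattern; run_generation is a hypothetical stand-in for the real axe invocation:

```bash
# One request at a time: same total GPU time as submitting all six at once,
# but nothing waits in Ollama's queue long enough to hit the client timeout.
for task in tasks/*.json; do
  run_generation "$task" || echo "task $task failed; continuing" >&2  # hypothetical wrapper
done
```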
server: response contains no content is a transient empty response, not a connection error. Ollama returns this when the model produces no tokens — typically right after a prior generation finishes and the model is briefly in a bad state. axe exits 0 with this string as stdout, making it easy to misread as a successful (if strange) response. On retry it recovers.
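Since the exit code is 0, a retry has to key off stdout instead; a sketch, with invoke_axe as a hypothetical wrapper around the real call:

```bash
# Exit code is useless here (axe exits 0), so match the error string in stdout.
# invoke_axe is a hypothetical wrapper around the real axe invocation.
for attempt in 1 2 3; do
  out=$(invoke_axe "$@")
  if [[ "$out" != *"response contains no content"* ]]; then
    printf '%s\n' "$out"
    break
  fi
  echo "empty response; retrying (attempt $attempt)" >&2
  sleep 5
done
```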
OLLAMA_NUM_PARALLEL defaults to 4, causing KV cache starvation on large models. With qwen3.5:27b loaded, Ollama reserves KV cache for 4 simultaneous contexts by default. This fragments the 24GB-minus-weights budget across slots that are never used (our workload is strictly sequential), leaving each individual generation with ≈¼ of available KV cache. For a 32K-token context model this matters. Set OLLAMA_NUM_PARALLEL=1 in the Ollama systemd service to dedicate the full KV budget to one generation at a time. Requires a host-level systemd edit.
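The host-level edit, assuming the ollama.service unit name from the standard Linux install:

```bash
# systemd drop-in override; survives package upgrades, unlike editing the unit itself.
sudo mkdir -p /etc/systemd/system/ollama.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_NUM_PARALLEL=1"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama
```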
Input context overflow produces silent context deadline exceeded, not an error about token count. When the prompt exceeds num_ctx, Ollama doesn't reject it — it times out internally while trying to process it. The caller gets a transient exit 3 that looks like a flaky connection. We discovered this after deriving PER_ARTICLE_CAP from a constant (40,000 chars per article) that was sized for 3 articles but applied to 6 — 240K chars into a ~128K-char window. Twenty-one minutes of retries, all failing identically, before the math surfaced the cause.
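A cheap guard that would have caught this before the first retry. The ~4 chars/token ratio and the 32K budget are assumptions for illustration, not measurements:

```bash
# Rough pre-flight check: ~4 chars/token is a rule of thumb, not a measurement.
MAX_PROMPT_CHARS=$((32768 * 4))   # assumed budget for a 32K-token num_ctx
prompt_chars=$(wc -c < prompt.txt)
if (( prompt_chars > MAX_PROMPT_CHARS )); then
  echo "prompt is ${prompt_chars} chars, over the ${MAX_PROMPT_CHARS} budget; refusing to send" >&2
  exit 1
fi
```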
"unable to load model" on VRAM exhaustion exits 3 with no distinguishing message in the OTEL trace. When a model blob is missing or VRAM is full, Ollama returns a 500 with "unable to load model: …" in the response body. axe propagates this as exit 3. If the caller uses stdio: "inherit" for the subprocess, this message reaches the container log but never the OTEL span — the trace shows only "axe essay-outliner failed (exit 3)". Fix: pipe axe's stderr and attach it to the span. The root cause (VRAM exhaustion, missing blob, wrong model tag) then surfaces in the trace without a separate log dive.
Model blobs can be partially downloaded and silently corrupt. Both hf.co/unsloth/gemma-4-31B-it-GGUF:UD-Q4_K_XL and :UD-Q5_K_XL reported "unable to load model" for specific blob SHA256 hashes. The model appears in ollama list but fails every load attempt. gemma4:31b-it-q4_K_M (the standard Ollama registry tag) loads correctly. Prefer official Ollama registry tags over HuggingFace GGUF URLs when available — they're more likely to be complete and tested.
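When a blob is suspect, the recovery we'd sketch is to remove the broken pull and fetch the registry tag instead (model tags as above):

```bash
# Drop the corrupt HuggingFace GGUF pull, then fetch the official registry tag.
ollama rm hf.co/unsloth/gemma-4-31B-it-GGUF:UD-Q4_K_XL
ollama pull gemma4:31b-it-q4_K_M
```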
num_ctx defaults are much smaller than the model's native context window. qwen3.5:27b supports 262,144 tokens natively, but Ollama's default num_ctx is a fraction of that (typically 2048–8192). The mismatch silently limits context without any warning — the model accepts prompts up to num_ctx and truncates beyond. To extend: create a derived model via POST /api/create with a parameters field:
{ "model": "qwen3.5-128k:27b", "from": "qwen3.5:27b", "parameters": { "num_ctx": 131072 } }
However, artificially extending num_ctx does not improve instruction-following — in our testing with LangAlpha's flash agent, it made hallucination worse. The model began rewriting user queries entirely (asked for AAPL, returned ASML data; asked for TSLA, invented a "top 5 AI stocks" list). Larger context gives the model more rope to wander. Reserve context expansion for workloads where the model is already following instructions correctly.
num_ctx expansion doubles KV cache usage. At num_ctx=131072, KV cache for a single generation is 6–8GB on top of model weights. With a 24GB GPU this is tight alongside the qwen3.5:27b weights (~17GB). The generation may succeed, but VRAM headroom disappears, making simultaneous embedding queries (nomic-embed-text) unsafe.
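A quick headroom check before allowing concurrent embedding traffic; the 2 GiB threshold is an assumed safety margin, and a single-GPU host is assumed:

```bash
# Free VRAM in MiB (single GPU assumed; multi-GPU hosts return one line per card).
free_mib=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits)
if (( free_mib < 2048 )); then  # 2 GiB margin is an assumption, not a measurement
  echo "only ${free_mib} MiB free; deferring concurrent embedding queries" >&2
fi
```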