Ollama

Local LLM inference server running on the Docker host.

Goals

Run large language models locally — on hardware we control, without per-token API costs. Primary requirement: host the 27b model that all axe agents use for task decomposition and podcast generation.

Effectiveness

Effective as the inference backend. Once the 27b model is loaded, it handles all agent workloads without intervention. The native request queue is sufficient for our sequential workload (one axe call at a time). Reliability issues are real but manageable — per-task fault isolation means a dropped connection fails one task rather than aborting an entire run.
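The fault-isolation pattern above can be sketched as follows. This is a minimal illustration, not axe's actual code: `run_task` and the task list are hypothetical stand-ins, and the only point is that a dropped connection marks one task failed and the loop continues.

```python
# Hypothetical sketch of per-task fault isolation in a sequential runner.
# A ConnectionError on one task records a failure for that task only;
# the remaining tasks still run.

def run_all(tasks, run_task):
    """Run (name, prompt) tasks sequentially, isolating per-task failures."""
    results = {}
    for name, prompt in tasks:
        try:
            results[name] = ("ok", run_task(prompt))
        except ConnectionError as exc:  # e.g. socket dropped mid-generation
            results[name] = ("failed", str(exc))
    return results
```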

What made it effective

Bonus utility

Running OLLAMA_HOST=http://172.17.0.1:11434 ollama run <model> "prompt" from inside the container works as a one-shot query — useful for quick text-processing tasks without writing a full axe agent.
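The same one-shot query can go over Ollama's HTTP API directly (POST /api/generate with "stream": false). A small sketch that builds the request, using the Docker bridge address from above; the commented-out send step requires a live server:

```python
import json

# Build a batch (non-streaming) request for Ollama's /api/generate endpoint.
# Host defaults to the Docker bridge IP used elsewhere in these notes.
def build_generate_request(model, prompt, host="http://172.17.0.1:11434"):
    url = f"{host}/api/generate"
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode()
    return url, payload

# To actually send it (requires a running Ollama):
#   import urllib.request
#   url, body = build_generate_request("<model>", "prompt")
#   req = urllib.request.Request(
#       url, data=body, headers={"Content-Type": "application/json"})
#   print(json.loads(urllib.request.urlopen(req).read())["response"])
```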

Friction / pain points / surprises

ConnectionClosed on long generations. Ollama drops idle connections, and axe uses the batch (non-streaming) endpoint, so the socket sits idle while the model generates. If generation runs long enough, the connection closes mid-response and the entire request fails with no partial result. Streaming would eliminate this, but axe doesn't currently use it.
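Why streaming avoids the drop: with "stream": true, Ollama returns newline-delimited JSON chunks, each carrying a "response" fragment, so traffic flows continuously and the client accumulates partial output even if the stream dies late. A sketch of the client-side accumulation (the chunk shape matches Ollama's documented streaming format; the input here is any iterable of NDJSON lines):

```python
import json

def accumulate_stream(lines):
    """Join Ollama streaming chunks (newline-delimited JSON) into full text.

    Each chunk looks like {"response": "...", "done": false}; the final
    chunk has "done": true. Partial text survives an early disconnect
    because fragments are appended as they arrive.
    """
    parts = []
    for line in lines:
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)
```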

VRAM contention is absolute. The 27b model occupies all available VRAM. qwen3.5:latest (9.7B) cannot be loaded simultaneously — any attempt queues until the 27b is evicted, which never happens. We assumed both models could be used interchangeably; they can't.
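One way to avoid queuing behind an evicted-never model is to check what is resident before dispatching. Newer Ollama versions expose GET /api/ps, which lists loaded models; assuming that endpoint and its {"models": [{"name": ...}]} shape, a pre-dispatch check might look like:

```python
# Hypothetical pre-dispatch guard: only send a request for `model` if it is
# already resident, since loading a second model queues behind the 27b.
def model_loaded(ps_json, model):
    """True if `model` appears in a parsed GET /api/ps response."""
    return any(m.get("name", "") == model for m in ps_json.get("models", []))
```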

context deadline exceeded is Ollama's internal timeout, not axe's. They're distinct failure modes with similar-sounding messages: Ollama's fires under heavy load or on very long prompts, and it is not configurable from the client.
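Since the two failure modes need different handling (retry vs. shorten the prompt / reduce load), it can help to classify them at the error-message level. A sketch, with the caveat that the substrings matched here are just the messages observed in these notes, not a stable Ollama contract:

```python
def classify_failure(message):
    """Rough triage of an Ollama request failure by its error text.

    Assumes the message substrings seen in practice: "context deadline
    exceeded" for Ollama's server-side timeout, ConnectionClosed/reset
    for an idle socket dropped mid-generation.
    """
    if "context deadline exceeded" in message:
        return "ollama-internal-timeout"  # server-side; not client-configurable
    if "ConnectionClosed" in message or "connection reset" in message.lower():
        return "connection-dropped"       # idle socket closed mid-generation
    return "unknown"
```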