# Ollama
Local LLM inference server running on the Docker host.
## Goals
Run large language models locally — on hardware we control, without per-token API costs. Primary requirement: host the 27b model that all axe agents use for task decomposition and podcast generation.
## Effectiveness
Effective as the inference backend. Once the 27b model is loaded it handles all agent workloads without intervention. The native request queue is sufficient for our sequential workload (one axe call at a time). Reliability issues are real but manageable — per-task fault isolation means a dropped connection doesn't abort an entire run.
## What made it effective
- VRAM management is automatic: Ollama loads and evicts models as needed. The 27b model holds a permanent de facto lock (always the current model) — we don't need to explicitly pin it.
- `nomic-embed-text` co-exists alongside the primary model for Antfly embeddings, without requiring a separate embedding service or separate process.
- The `/api/tags` endpoint is a reliable health check: a fast GET with a 5s timeout confirms reachability before any task processing starts, rather than discovering a dead server on the third task.
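That pre-flight check is small enough to sketch. This is a minimal illustration, not axe's actual code — the function name is mine, and the base URL reuses the Docker bridge address from the one-shot example:

```python
import urllib.request

OLLAMA_URL = "http://172.17.0.1:11434"  # Docker bridge address (assumed default)

def ollama_reachable(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> bool:
    """Fast GET against /api/tags; True only on HTTP 200 within the timeout.

    Run before any task processing starts, so a dead server is caught
    up front instead of on the third task.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, ConnectionRefusedError, TimeoutError
        return False
```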
## Bonus utility
`OLLAMA_HOST=http://172.17.0.1:11434 ollama run <model> "prompt"` works as a one-shot query from inside the container — useful for quick text processing tasks without writing a full axe agent.
## Friction / pain points / surprises
`ConnectionClosed` on long generations. Ollama drops idle connections. axe uses the batch (non-streaming) endpoint; if generation takes long enough, the socket closes mid-response. The entire request fails with no partial result. Streaming would eliminate this, but axe doesn't currently use it.
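For reference, a streamed call against `/api/generate` would look roughly like this — a sketch only (function names are mine), showing why a mid-stream drop loses just the tail rather than the whole response:

```python
import json
import urllib.request

def accumulate(lines) -> str:
    """Join the 'response' fields of streamed JSON lines, stopping at done:true.

    Ollama's streaming endpoint emits one JSON object per line, so any
    tokens received before a disconnect are already in hand.
    """
    parts = []
    for line in lines:
        obj = json.loads(line)
        parts.append(obj.get("response", ""))
        if obj.get("done"):
            break
    return "".join(parts)

def generate_streamed(base_url: str, model: str, prompt: str) -> str:
    """POST to /api/generate with stream=true and collect tokens as they arrive."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return accumulate(resp)
```

The socket never sits idle between tokens, which sidesteps the idle-connection drop entirely.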
VRAM contention is absolute. The 27b model occupies all available VRAM. `qwen3.5:latest` (9.7B) cannot be loaded simultaneously — any attempt queues until the 27b is evicted, which never happens. We assumed both models could be used interchangeably; they can't.
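One way to catch this before a request queues forever is to ask Ollama which models are resident. A hedged sketch using `/api/ps` (the API behind `ollama ps`; helper names are mine):

```python
import json
import urllib.request

def model_names(ps_payload: dict) -> list[str]:
    """Extract model names from an /api/ps response body."""
    return [m["name"] for m in ps_payload.get("models", [])]

def resident_models(base_url: str = "http://172.17.0.1:11434") -> list[str]:
    """GET /api/ps lists the models currently loaded into VRAM."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return model_names(json.load(resp))
```

A caller could then fail fast when asked for a second model while the 27b is loaded, instead of queuing behind an eviction that never comes.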
`context deadline exceeded` is Ollama's internal timeout, not axe's. They're distinct failure modes with similar-sounding messages. Ollama's fires under heavy load or very long prompts, and it isn't configurable from the client.
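Since the two failure modes call for different responses, it can help to triage them explicitly. A hypothetical helper (the name and return values are mine, illustrating the distinction above):

```python
def classify_timeout(body: str, client_timed_out: bool) -> str:
    """Distinguish the two look-alike failures.

    'client' = axe's own timeout fired (raise it, or shrink the task).
    'server' = Ollama's internal deadline fired; retrying the same prompt
               under the same load will likely fail again.
    """
    if client_timed_out:
        return "client"
    if "context deadline exceeded" in body:
        return "server"
    return "other"
```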