Hatchet
Durable task queue and workflow orchestration platform backed by PostgreSQL. TypeScript/Python/Go/Ruby SDKs.
hatchet-dev/hatchet · 6.8k ★ · MIT · evaluated 2026-04-03
Goals
Replace axe + the home-grown checkpoint.ts in the podcast pipeline. The pipeline takes ~90 minutes and gets killed mid-run by agent task timeouts. Need step-level durability: if a step completes, it stays completed across process restarts and rescheduling.
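To pin down the contract being asked for, here is a minimal sketch of step-level durability: a step runs at most once, and its result survives a process restart because it is persisted before the pipeline moves on. The names (runStep, the checkpoint path) are illustrative, not the actual checkpoint.ts API.

```typescript
// Sketch of the durability contract: a completed step is never
// re-executed, even after the process is killed and restarted.
import * as fs from "node:fs";

type Checkpoint = Record<string, unknown>;

function loadCheckpoint(path: string): Checkpoint {
  return fs.existsSync(path) ? JSON.parse(fs.readFileSync(path, "utf8")) : {};
}

async function runStep<T>(
  path: string,
  name: string,
  fn: () => Promise<T>
): Promise<T> {
  const ckpt = loadCheckpoint(path);
  if (name in ckpt) return ckpt[name] as T; // already completed: skip
  const result = await fn();
  ckpt[name] = result;
  fs.writeFileSync(path, JSON.stringify(ckpt)); // persist before continuing
  return result;
}
```

A second invocation of runStep with the same name (including from a fresh process reading the same file) returns the recorded result without calling fn again, which is exactly what a 90-minute pipeline needs when it gets killed mid-run.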
Effectiveness
Not adopted. The infrastructure cost outweighs the benefit for a single-machine, single-pipeline workload. Hatchet's durability model is correct and the TypeScript SDK is clean, but it requires a running server process plus PostgreSQL. Adding two services to a container that runs a cron job once per day is the wrong trade.
If the podcast pipeline ever runs on multiple workers, or if the number of pipelines grows to where a shared dashboard matters, Hatchet becomes the obvious answer.
What made it effective
- Step-level checkpoint recovery: completed Hatchet task steps are not re-executed on worker restart. The event log replay model is sound — it's the right way to do durable execution.
- DAG support: parent outputs are typed and routed to children via ctx.getParentOutput(task). More structured than the current "pass everything through a mutable ckpt object" approach.
- Rate limits and concurrency controls are first-class features. The podcast pipeline currently has neither; if Ollama requests were fanned out instead of serialized, Hatchet's per-key rate limiting would be useful.
- executionTimeout: "90m" on a task actually means something — the server tracks the clock, not the worker process.
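The replay model above can be illustrated with a toy, self-contained executor: each completed task appends a completion event to a log, and re-running the same DAG against that log skips anything already recorded, with parent outputs read back from the log. This models the semantics only — Hatchet keeps its event log server-side in Postgres, and the Replayer/parent names here are invented for the sketch.

```typescript
// Toy model of event-log replay for durable DAG execution.
type Event = { task: string; output: unknown };

class Replayer {
  constructor(private log: Event[] = []) {}

  async run<T>(
    task: string,
    fn: (parent: (t: string) => unknown) => Promise<T>
  ): Promise<T> {
    const done = this.log.find((e) => e.task === task);
    if (done) return done.output as T; // replay: no re-execution
    // parent() plays the role of ctx.getParentOutput in the sketch
    const parent = (t: string) => this.log.find((e) => e.task === t)?.output;
    const output = await fn(parent);
    this.log.push({ task, output }); // record completion
    return output;
  }
}
```

A crash between two tasks loses nothing: a new Replayer over the same persisted log re-runs only the tasks that never recorded a completion event.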
Friction / pain points / surprises
Requires a server. PostgreSQL + the Hatchet process (or Hatchet Cloud) must be reachable before a single workflow can register. There is no embedded mode. For scripting and local pipelines, this is a dealbreaker.
Workers connect over gRPC (port 7077). Not HTTP. Container networking needs to expose this port in addition to the dashboard API port (8888).
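In compose terms that means two mappings, not one. A sketch only — the image tag and the Postgres/env wiring are assumptions to check against the Hatchet docs:

```yaml
services:
  hatchet:
    image: ghcr.io/hatchet-dev/hatchet/hatchet-lite:latest  # assumed tag
    ports:
      - "7077:7077"   # gRPC: worker registration and task dispatch
      - "8888:8888"   # dashboard / REST API
    # database configuration omitted — see Hatchet Lite docs
```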
Hatchet Cloud free tier is 100k task runs/month. Each section-writer call would be a task run. The podcast pipeline is ~15 task runs per episode; at one episode per day that's roughly 450 runs/month, nowhere near the cap. Fine at current volume, but it's a third-party dependency for execution now, not just observability.
Hatchet Lite (single Docker image) still needs Postgres. "Lite" means one container for the server, not "no database."
NonRetryableError is the only way to stop retries mid-flight. If a step crashes in a way you'd want to investigate before retrying (e.g., Ollama OOM), you have to throw a NonRetryableError explicitly. Default behavior is to retry per the retries count.
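A self-contained model of those semantics (not the SDK internals) — retry up to the configured count unless the error is marked non-retryable, in which case it surfaces immediately. NonRetryable here is a stand-in class for the SDK's NonRetryableError:

```typescript
// Model of Hatchet-style retry semantics: retryable errors are retried
// up to `retries` times; a non-retryable error halts the task at once.
class NonRetryable extends Error {} // stand-in for the SDK's NonRetryableError

async function withRetries<T>(retries: number, fn: () => Promise<T>): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (err instanceof NonRetryable || attempt >= retries) throw err;
      // otherwise fall through and retry
    }
  }
}
```

The practical upshot for the pipeline: the step wrapping the Ollama call would have to classify an OOM as non-retryable itself, otherwise the default behavior keeps re-running the model until the retry budget is spent.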