html-to-markdown

HTML → Markdown converter written in Go.

Goals

Convert extracted article HTML to clean Markdown for storage in Antfly and consumption by the essay writer. It would pair with a content extractor (Readability, Defuddle, etc.) in a two-step pipeline: fetch + extract HTML body → convert to Markdown.
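
A minimal sketch of that two-step shape, to make the division of labor concrete. The `Extractor` and `Converter` types and the toy stand-ins are hypothetical, invented for illustration; none of these names come from Readability, Defuddle, or html-to-markdown, and a real pipeline would wrap those libraries instead of the regexes used here.

```typescript
// Hypothetical types: not part of any library's API.
type Extractor = (pageHtml: string) => string;    // full page HTML -> article HTML
type Converter = (articleHtml: string) => string; // article HTML -> Markdown

// Compose the two steps so callers see one page-to-Markdown function.
function makePipeline(extract: Extractor, convert: Converter): (pageHtml: string) => string {
  return (pageHtml) => convert(extract(pageHtml));
}

// Toy stand-in for an extractor: keep only what is inside <article>.
const toyExtract: Extractor = (html) => {
  const match = html.match(/<article>([\s\S]*?)<\/article>/);
  return match ? match[1] : html;
};

// Toy stand-in for a converter: handles only <h1> and <p>.
const toyConvert: Converter = (html) =>
  html
    .replace(/<h1>(.*?)<\/h1>/g, "# $1\n\n")
    .replace(/<p>(.*?)<\/p>/g, "$1\n\n")
    .trim();

const pageToMarkdown = makePipeline(toyExtract, toyConvert);
```

Swapping `toyConvert` for a converter that skips extraction shows the problem this note is about: without the first step, the nav and footer markup reach the converter untouched.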

Verdict

Excellent converter, wrong layer for our problem. html-to-markdown is a formatting tool, not an extractor: it faithfully converts whatever HTML it's given, including nav bars, footers, and cookie banners. It doesn't know what the article is. We need something that first identifies the main content, and Defuddle does both steps in one pass. Use this only if we already have clean article HTML from another source.

What makes it effective

Thorough format coverage. Tables, nested lists, footnotes, code blocks, strikethrough, and inline styles are all handled, including table alignment and escaping of Markdown special characters. It produces spec-valid Markdown, not a best-effort approximation.
Plugin architecture. Composable transformations; custom rules for domain-specific HTML patterns are straightforward to add.
Battle-tested. 3.5k stars, active maintenance, golden-file test suite. Not a weekend project.
Go library + REST API + CLI. Multiple integration surfaces; the REST API at html-to-markdown.com is handy for one-offs.

Friction / pain points / surprises

Go dependency. The library is Go-only. Calling it from a Bun/TypeScript pipeline means shelling out or using the REST API — neither is clean for an inline sources.ts function.
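
For reference, shelling out would look roughly like the sketch below. It assumes the CLI is installed on the PATH under the name `html2markdown` and reads HTML on stdin; both are assumptions to check against the actual install, which is why the binary name is a parameter. `convertViaCli` is a hypothetical helper name.

```typescript
import { spawnSync } from "node:child_process";

// Assumption: the CLI reads HTML on stdin and writes Markdown to stdout.
// The default binary name is also an assumption; pass another if needed.
function convertViaCli(html: string, binary = "html2markdown"): string {
  const result = spawnSync(binary, [], { input: html, encoding: "utf8" });
  if (result.error) {
    // Typically ENOENT: the binary is not installed or not on the PATH.
    throw new Error(`could not start ${binary}: ${result.error.message}`);
  }
  if (result.status !== 0) {
    throw new Error(`${binary} exited with status ${result.status}: ${result.stderr}`);
  }
  return result.stdout;
}
```

Every call spawns a process and blocks until it exits, and the binary becomes a deployment requirement, which is the friction described above.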

Not an extractor. It converts the whole page's HTML without filtering, so the output includes navigation, headers, footers, and every other element present in the source. A separate extraction step is mandatory.

Overkill for our use case. The podcast pipeline stores plain prose; we don't need table conversion, image links, or footnote formatting. Defuddle's built-in Markdown output handles the content we care about without a second pass.

When to reach for it

You have clean, pre-extracted article HTML from a structured data source (RSS full-text feed, API response) and need reliable Markdown output.
You need Markdown from pages with complex layouts — tables, deeply nested lists — that simpler converters mangle.
Go is already in the stack.