Defuddle

Article content extractor — converts web pages to clean Markdown.

Goals

Extract the full body text of articles from arbitrary URLs so the podcast pipeline has real source material to work with. The current pipeline only stores Raindrop excerpts (~35 words each), which leaves the essay writer hallucinating specifics rather than citing sources.

Verdict

Strong fit. It's the right tool for this slot: a focused, dependency-light content extractor built specifically for the web-clipper use case. The CLI makes integration trivial — npx defuddle parse <url> --markdown outputs exactly what we need. No browser required.

What makes it effective

Friction / pain points / surprises

Requires a DOM: Defuddle can't be handed a raw HTML string in Bun without a shim. In the browser it uses document; in Node/Bun it needs JSDOM or linkedom as a peer dependency. Not a blocker, but it means two packages instead of one.

Won't defeat paywalls or JavaScript-rendered content. Defuddle extracts from the fetched HTML, so if a page requires auth or client-side JavaScript to populate the article body, it gets the paywall page or a blank <main>. The current Raindrop articles come from Substack and a blog; Substack's free posts are server-rendered, so they should extract fine.
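Since extraction can silently return a paywall stub instead of erroring, a cheap guard helps. A minimal sketch, assuming a word-count heuristic (the function name and 100-word threshold are hypothetical, not anything Defuddle provides):

```typescript
// Hypothetical guard: treat a suspiciously short extraction as a
// paywall/JS-rendering failure. The 100-word threshold is an assumption
// to tune against real pages, not a Defuddle feature.
function extractionFailed(markdown: string, minWords = 100): boolean {
  const words = markdown.trim().split(/\s+/).filter(Boolean);
  return words.length < minWords;
}
```

When it returns true, the pipeline can fall back to storing item.excerpt as before.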

May need --content-selector tuning for unusual page structures. Auto-detection works well for standard article layouts but can misfire on bespoke blog templates.
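If a few domains consistently misfire, a small override table keeps the tuning in one place. A sketch under stated assumptions (the map, helper name, and the example selector are all hypothetical):

```typescript
// Hypothetical per-domain override table for sites where auto-detection
// misfires; the selector values here are placeholders to verify per site.
const contentSelectors: Record<string, string> = {
  "example-blog.com": "article .post-body",
};

// Returns an override to pass via --content-selector (or the library
// options), or undefined to let Defuddle auto-detect.
function selectorFor(url: string): string | undefined {
  return contentSelectors[new URL(url).hostname];
}
```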

Integration path

In src/sources.ts, after fetching the Raindrop items, fetch each article URL and pass the HTML to the Defuddle Node library, then store result.content (Markdown) instead of item.excerpt. This turns 103 words of metadata into potentially thousands of words of real source material.
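The wiring could look like the sketch below. The item shape and the injected extract function are assumptions; in the real code, extract would fetch the URL and run Defuddle's Node entry (with a JSDOM or linkedom shim) to produce Markdown:

```typescript
// Assumed minimal shape of a Raindrop item in src/sources.ts.
interface SourceItem {
  url: string;
  excerpt: string;
  content?: string;
}

// Enrich each item with extracted Markdown; `extract` is injected so the
// fetching/Defuddle details stay out of the pipeline wiring.
async function enrichItems(
  items: SourceItem[],
  extract: (url: string) => Promise<string | null>,
): Promise<SourceItem[]> {
  return Promise.all(
    items.map(async (item) => {
      const markdown = await extract(item.url).catch(() => null);
      // Fall back to the excerpt when extraction fails (paywall, blank page).
      return { ...item, content: markdown ?? item.excerpt };
    }),
  );
}
```

Injecting the extractor also makes the failure path easy to test with a stub before touching the network.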