Defuddle
Article content extractor — converts web pages to clean Markdown.
Goals
Extract the full body text of articles from arbitrary URLs so the podcast pipeline has real source material to work with. The current pipeline only stores Raindrop excerpts (~35 words each), which leaves the essay writer hallucinating specifics rather than citing sources.
Verdict
Strong fit. It's the right tool for this slot: a focused, dependency-light content extractor built specifically for the web-clipper use case. The CLI makes integration trivial — npx defuddle parse <url> --markdown outputs exactly what we need. No browser required.
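For prototyping, the CLI command above could be wrapped from TypeScript roughly as follows. This is a minimal sketch, not the project's code: the helper names `defuddleCliArgs` and `extractViaCli` are hypothetical, and it assumes `npx` is on the PATH of a POSIX-like shell environment and that the CLI writes the Markdown to stdout.

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// Build the argument list for `npx defuddle parse <url> --markdown`.
// Hypothetical helper; kept separate so it can be checked without running npx.
export function defuddleCliArgs(url: string): string[] {
  return ["defuddle", "parse", url, "--markdown"];
}

// Run the CLI and return whatever it printed to stdout.
// Assumption: the Markdown is written to stdout and a failure yields a non-zero exit,
// which execFile surfaces as a rejected promise.
export async function extractViaCli(url: string): Promise<string> {
  const { stdout } = await execFileAsync("npx", defuddleCliArgs(url), {
    maxBuffer: 16 * 1024 * 1024, // long articles can exceed the default buffer
  });
  return stdout.trim();
}
```

Passing the URL as a separate argument to `execFile`, rather than interpolating it into a shell string, avoids quoting problems with URLs that contain `&` or `?`.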
What makes it effective
- Readability-class extraction without Readability's aggression. Defuddle is more forgiving — it removes clutter (comments, sidebars, nav) while keeping more of the actual article body. Less risk of gutting a piece by over-trimming.
- Markdown output natively. Other extractors hand back HTML and require a second pass. Defuddle emits Markdown directly via --markdown, which is what Antfly and the essay writer expect.
- Rich metadata. Returns author, title, publish date, word count, and schema.org data alongside the text, all fields the podcast outline step could use.
- Node.js / Bun compatible. Works as a library with JSDOM or linkedom, so it can be called inline from sources.ts without shelling out.
- npx zero-install CLI. Easy to prototype without adding a dep.
Friction / pain points / surprises
- Requires a DOM. It can't be called with a raw HTML string in Bun without a shim. In the browser it uses document; in Node/Bun it needs JSDOM or linkedom as a peer dep. Not a blocker, but it means installing two packages.
- Won't defeat paywalls or JavaScript-rendered content. Defuddle extracts from fetched HTML, so if the page requires auth or JavaScript to populate the article body, it gets the paywall page or a blank <main>. The Raindrop articles are Substack and a blog; Substack free posts are server-rendered and should be fine.
- May need --content-selector tuning for unusual page structures. Auto-detection works well for standard article layouts but can misfire on bespoke blog templates.
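Because a paywalled or JavaScript-rendered page can yield very little text, the pipeline may want a guard that falls back to the Raindrop excerpt when extraction looks empty. The sketch below is one possible heuristic, not part of Defuddle: `countWords`, `chooseSourceText`, and the 200-word threshold are all assumptions chosen for illustration.

```typescript
// Count whitespace-separated words; an empty or blank string counts as zero.
export function countWords(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

export interface ChosenText {
  text: string;
  origin: "extracted" | "excerpt";
}

// Prefer the extracted Markdown, but only if it is long enough to plausibly be
// an article body. The default threshold is an arbitrary assumption.
export function chooseSourceText(
  extracted: string | undefined,
  excerpt: string,
  minWords = 200,
): ChosenText {
  if (extracted !== undefined && countWords(extracted) >= minWords) {
    return { text: extracted, origin: "extracted" };
  }
  return { text: excerpt, origin: "excerpt" };
}
```

Recording `origin` alongside the text lets a later step see which items still rest on an excerpt only, for example to flag them for selector tuning.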
Integration path
In src/sources.ts, after fetching Raindrop items, fetch each URL and pass the HTML to the Defuddle Node library, then store result.content (Markdown) instead of item.excerpt. This turns 103 words of metadata into potentially thousands of words of real source material.
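The integration step could be shaped roughly like this. It is a sketch under stated assumptions: the `RaindropItem` shape, the `enrichItems` name, and the injected `fetchHtml` and `extract` functions are hypothetical, and the actual Defuddle call is left behind the `extract` parameter because its exact Node signature should be checked against the library's documentation. Only `result.content` and `item.excerpt` come from the plan above.

```typescript
// Hypothetical shapes; the real types in src/sources.ts may differ.
export interface RaindropItem {
  url: string;
  excerpt: string;
}

export interface ExtractResult {
  content: string; // Markdown body, as in result.content
}

export type FetchHtml = (url: string) => Promise<string>;
export type Extract = (html: string, url: string) => Promise<ExtractResult>;

export interface SourceText {
  url: string;
  text: string;
  fromExtraction: boolean;
}

// For each item, try fetch + extract; keep the excerpt if either step fails
// or the extractor returns an empty body.
export async function enrichItems(
  items: RaindropItem[],
  fetchHtml: FetchHtml,
  extract: Extract,
): Promise<SourceText[]> {
  const out: SourceText[] = [];
  for (const item of items) {
    try {
      const html = await fetchHtml(item.url);
      const result = await extract(html, item.url);
      const text = result.content.trim();
      if (text.length > 0) {
        out.push({ url: item.url, text, fromExtraction: true });
        continue;
      }
    } catch {
      // Network or extraction error: fall through to the excerpt.
    }
    out.push({ url: item.url, text: item.excerpt, fromExtraction: false });
  }
  return out;
}
```

Injecting `fetchHtml` and `extract` keeps the function testable without network access; in the pipeline, `fetchHtml` could wrap the global `fetch` and `extract` could wrap the Defuddle Node entry point with JSDOM or linkedom.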