# Pydoll
Async Chromium automation library for Python — no WebDriver required.
## Goals
Potential scraper for paywalled or JavaScript-rendered articles that plain HTTP fetch can't reach. If Raindrop sources increasingly live behind JS gates, a headless browser becomes necessary.
## Verdict
Not the right tool for this use case right now. Pydoll is a full browser automation framework — the right answer when you need to log in, click through, fill forms, or extract data from a SPA. Our current sources (two Substacks and a blog) are server-rendered and fetchable without a browser. The weight of a Chromium binary + async Python process inside a Bun pipeline doesn't pay off for content that defuddle can handle with a single HTTP GET.
Keep it in mind if sources shift toward paywalled or heavily JS-rendered sites.
## What makes it effective
- No WebDriver binary. Uses Chrome DevTools Protocol directly — fewer dependencies and less breakage on Chromium version changes than Selenium/Playwright.
- Stealth features. Humanized mouse movements, realistic typing, fingerprint controls — useful for sites that actively block automation. This is its real differentiator over Playwright.
- Declarative extraction via Pydantic models. `tab.extract(QuoteModel)` returns typed, validated data without manual selector chaining. Clean for structured scraping targets.
- Shadow DOM + cross-origin iframe support. Handles modern front-end architectures that defeat simpler scrapers.
- Async-native. `asyncio`-based throughout, which integrates naturally with Python async pipelines.
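The declarative-extraction idea can be illustrated with a stdlib-only stand-in. Note that `Quote`, `extract`, and the selector map below are hypothetical sketches of the pattern (typed fields mapped onto page selectors), not Pydoll's real API or signatures:

```python
# Minimal sketch of declarative extraction: a typed model plus a
# field -> selector mapping, no manual selector chaining at call sites.
# Stand-in names only; Pydoll's actual tab.extract() drives a browser.
from dataclasses import dataclass, fields
from html.parser import HTMLParser


@dataclass
class Quote:
    text: str
    author: str


class _ClassTextParser(HTMLParser):
    """Collects the text content of elements, keyed by class attribute."""

    def __init__(self):
        super().__init__()
        self._stack = []   # class attributes of currently open elements
        self.by_class = {}

    def handle_starttag(self, tag, attrs):
        self._stack.append(dict(attrs).get("class"))

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        for cls in self._stack:
            if cls:
                self.by_class[cls] = self.by_class.get(cls, "") + data


def extract(model, html, selector_map):
    """Map class-name 'selectors' onto a dataclass, returning typed data."""
    parser = _ClassTextParser()
    parser.feed(html)
    kwargs = {
        f.name: parser.by_class[selector_map[f.name]].strip()
        for f in fields(model)
    }
    return model(**kwargs)


page = '<div class="quote">Know thyself.</div><span class="author">Socrates</span>'
q = extract(Quote, page, {"text": "quote", "author": "author"})
```

The payoff is the same as Pydoll's Pydantic approach: the call site names a model, not a chain of selectors, and a missing field fails loudly at construction time.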
## Friction / pain points / surprises
**Python in a Bun/TypeScript pipeline.** Calling Pydoll from `sources.ts` means subprocess or microservice overhead. A Python scraping sidecar is maintainable but adds a process boundary and restart surface.
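One way to picture that process boundary is a line-delimited JSON protocol between the Bun parent and a long-lived Python sidecar. Everything below (the protocol, the stub sidecar body) is an illustrative assumption, not existing pipeline code; a real sidecar would drive Pydoll where the stub returns canned HTML:

```python
# Sketch of a sidecar process boundary: the parent (standing in for the
# Bun pipeline) writes one JSON request per line on stdin and reads one
# JSON response per line on stdout. Hypothetical protocol, stub scraper.
import json
import subprocess
import sys

SIDECAR = r"""
import json, sys
for line in sys.stdin:                      # one request per line
    req = json.loads(line)
    # Stub: a real sidecar would launch Chromium via Pydoll here.
    resp = {"url": req["url"], "html": "<p>stub</p>"}
    sys.stdout.write(json.dumps(resp) + "\n")
    sys.stdout.flush()
"""

proc = subprocess.Popen(
    [sys.executable, "-c", SIDECAR],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
)
proc.stdin.write(json.dumps({"url": "https://example.com"}) + "\n")
proc.stdin.flush()
reply = json.loads(proc.stdout.readline())
proc.stdin.close()
proc.wait()
```

Keeping the sidecar alive across requests also amortizes the browser launch cost, but it is exactly the restart surface noted above: the parent must detect a dead child and respawn it.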
**Chromium binary weight.** A full Chromium install is 200–300 MB and requires specific system libraries. The container would need provisioning in `ensure-deps.sh`, and cold-start time for a headless browser launch is measured in seconds per URL.
**Stealth features are probably unnecessary for our sources.** Substack and ACOUP are not actively blocking scrapers. The stealth machinery is overhead for sites that don't require it.
**6.7k stars but young.** Active development is good, but a young project also means API churn; check the changelog before pinning a version.
## When to reach for it
- A source article is behind a JS-rendered paywall (e.g., Bloomberg, The Atlantic) and no other extraction path exists.
- Structured scraping of a site with consistent HTML patterns where Pydantic extraction is cleaner than regex.
- We're already running a Python process in the pipeline for TTS (KittenTTS), so the interpreter cost is already paid.