Question 1

How does docpull compare to Firecrawl, Jina Reader, or Crawl4AI?

Accepted Answer

Firecrawl and Jina Reader are hosted APIs: your URLs route through their infrastructure, and pricing scales past their free tiers. Base docpull runs locally with no API key, and --budget 0 blocks paid-capable provider and cloud calls before execution. Crawl4AI is the closest open-source peer, but it's a general-purpose agent toolkit; docpull is narrower, producing YAML-frontmatter Markdown and context packs tuned for public web-source ingestion, with rag, mirror, and quick profiles baked in.

Question 2

How clean is the Markdown? Does it preserve code blocks, tables, and images?

Accepted Answer

Yes. Fenced code blocks keep their language hints (Prism, highlight.js, Shiki, and GitHub conventions are all normalized), tables convert to Markdown pipes, and images keep their alt text. Nav bars, footers, sidebars, and common cookie/consent banners (OneTrust, Osano, GDPR walls, Cookiebot, Iubenda) are stripped before conversion via the extractor's remove-selector list.
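To make the language-hint normalization concrete, here is a minimal sketch of how highlighter markup could be mapped to a fenced-block info string. It is not docpull's extractor code: the function names `language_hint` and `fence` are hypothetical, and the class-name patterns are the commonly seen conventions (`language-*` and `lang-*` from Prism and highlight.js, `highlight-source-*` from GitHub-rendered HTML, and a `data-language` attribute that some Shiki integrations emit), not a complete list.

```python
import re

FENCE = "`" * 3  # built at runtime so this example nests cleanly in Markdown

# Class tokens that commonly carry a language name (illustrative, not exhaustive).
_LANG_PATTERNS = [
    re.compile(r"^language-(?P<lang>[\w+#-]+)$"),          # Prism, highlight.js
    re.compile(r"^lang-(?P<lang>[\w+#-]+)$"),              # short form of the same
    re.compile(r"^highlight-source-(?P<lang>[\w+#-]+)$"),  # GitHub-rendered HTML
]


def language_hint(class_attr, data_language=None):
    """Return a lowercase language hint for a code block, or None if absent."""
    # Assumption: an explicit data-language attribute wins over class names.
    if data_language and data_language.strip():
        return data_language.strip().lower()
    for token in class_attr.split():
        for pattern in _LANG_PATTERNS:
            match = pattern.match(token)
            if match:
                return match.group("lang").lower()
    return None


def fence(code, lang=None):
    """Wrap code in a Markdown fence, keeping the language hint when known."""
    return f"{FENCE}{lang or ''}\n{code.rstrip()}\n{FENCE}"
```

For example, `language_hint("hljs language-Python")` yields `python`, while a class list with no recognizable token yields `None` and the fence is emitted without an info string.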

Question 3

Does it render JavaScript?

Accepted Answer

Not by default. The normal crawler runs without a browser. Pages that require JavaScript to render content are detected and skipped, or hard-failed with --strict-js-required, so an agent can route elsewhere. For simple JavaScript-rendered public pages, use --render fallback or docpull render. For interaction-heavy pages, use a browser automation tool.

Question 4

Will it scale to a 10,000-page site, and can I re-run it on a schedule?

Accepted Answer

Yes. Measured against a synthetic 10,000-page site: about 309 s wall time, about 93 MB peak RSS delta, and 0 failed pages. Streaming deduplication keeps per-page memory constant. The cache sends If-None-Match / If-Modified-Since on every cached URL, so scheduled re-runs only transfer changed pages. Fetched and failed URL sets persist on disk, so a crash resumes from the discovered-URL list instead of restarting.
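The two mechanisms in this answer, conditional re-fetching and on-disk resume state, can be sketched in a few lines. This is an illustration under stated assumptions, not docpull's implementation: the cache-entry keys (`etag`, `last_modified`), the JSON state file layout, and every function name are hypothetical. Whether failed URLs are retried on resume is also an assumption, exposed here as a flag.

```python
import json
from pathlib import Path


def conditional_headers(cache_entry):
    """Build revalidation headers from validators saved on a previous fetch."""
    headers = {}
    if cache_entry.get("etag"):
        headers["If-None-Match"] = cache_entry["etag"]
    if cache_entry.get("last_modified"):
        headers["If-Modified-Since"] = cache_entry["last_modified"]
    return headers


def body_changed(status_code):
    """A 304 Not Modified response means the cached copy is still current."""
    return status_code != 304


def save_state(path, fetched, failed):
    """Persist the fetched and failed URL sets so a later run can resume."""
    payload = {"fetched": sorted(fetched), "failed": sorted(failed)}
    Path(path).write_text(json.dumps(payload), encoding="utf-8")


def pending_urls(path, discovered, retry_failed=False):
    """Return discovered URLs that still need work, in discovery order."""
    state_file = Path(path)
    if not state_file.exists():
        return list(discovered)  # first run: nothing to skip
    state = json.loads(state_file.read_text(encoding="utf-8"))
    done = set(state["fetched"])
    if not retry_failed:
        done |= set(state["failed"])  # assumption: failures are not retried
    return [url for url in discovered if url not in done]
```

On a scheduled re-run, each cached URL is requested with the headers from `conditional_headers`; a 304 response carries no body, which is why unchanged pages cost almost no transfer.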

Question 5

Does it handle auth-gated pages?

Accepted Answer

Yes. Pass credentials with --auth-bearer, --auth-basic, --auth-cookie, or --auth-header. They ride with every request, so internal docs, subscriber-only pages, customer portals, and corporate wikis all work.
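For readers unsure what each credential type puts on the wire, the sketch below maps the four kinds of credential to standard HTTP request headers. It shows the HTTP conventions only and is not docpull's code: the function `auth_headers` and its argument shapes are hypothetical, and the value format each CLI flag expects is not specified here.

```python
import base64


def auth_headers(bearer=None, basic=None, cookie=None, header=None):
    """Map credential inputs to the HTTP headers sent with each request.

    bearer: token string
    basic:  (username, password) tuple
    cookie: raw cookie string such as "session=abc123"
    header: (name, value) tuple for an arbitrary custom header
    """
    headers = {}
    if bearer:
        headers["Authorization"] = f"Bearer {bearer}"
    if basic:
        user, password = basic
        # Basic auth is base64("user:password"), which is encoding, not encryption.
        token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
        headers["Authorization"] = f"Basic {token}"
    if cookie:
        headers["Cookie"] = cookie
    if header:
        name, value = header
        headers[name] = value
    return headers
```

Because these headers accompany every request, credentials of this kind should only be used over HTTPS and only against hosts you intend to authenticate to.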

Question 6

Do Parallel workflows require an API key?

Accepted Answer

Base crawling, offline demo/import packs, and pack scoring do not require a Parallel API key. Live Parallel API workflows read the key from PARALLEL_API_KEY, user config, or the project's .env.local, once docpull parallel init or docpull parallel auth has checked that the SDK and a key are present locally. The auth check does not make a live key-validation call. docpull never writes the key into pack artifacts, but the artifacts can include source content, workflow inputs and outputs, selected URLs, and metadata. Every generated pack also includes AGENT_CONTEXT.md so agents have a local load plan before inspecting deeper metadata.

Question 7

Does the output drop straight into a Claude Code skill?

Accepted Answer

Yes. Run `docpull URL --skill name` and docpull writes a complete skill directory to .claude/skills/name/: a generated SKILL.md manifest with name and description fields derived from the source's OpenGraph metadata, plus hierarchically-named pages alongside it. No hand-editing required.
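As a rough picture of what such a manifest contains, the sketch below renders the YAML frontmatter of a SKILL.md from a name and a description. It is an assumption-laden illustration rather than docpull's generator: the helper names are hypothetical, and how the OpenGraph metadata is mapped onto the two fields is not shown.

```python
import json
from pathlib import Path


def skill_manifest_path(name, root=".claude/skills"):
    """Location of the manifest inside the skill directory for `name`."""
    return Path(root) / name / "SKILL.md"


def render_skill_frontmatter(name, description):
    """Render the frontmatter block with the name and description fields."""
    # Collapse whitespace so a multi-line description stays a single scalar.
    one_line = " ".join(description.split())
    # A JSON string is also a valid YAML double-quoted scalar, so json.dumps
    # safely escapes quotes and colons inside the description.
    lines = ["---", f"name: {name}", f"description: {json.dumps(one_line)}", "---", ""]
    return "\n".join(lines)
```

Quoting the description matters because page descriptions often contain colons or quotation marks, which would otherwise break the YAML frontmatter.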

Question 8

Can I use it as a Python library?

Accepted Answer

Yes. Import Fetcher and DocpullConfig, configure programmatically, and iterate over async events as pages are fetched. See the Python tab above for a minimal setup.

Dimension	DocPull	Exa	Parallel	Tavily
Best role	Local-first source pipeline for known URLs, source lists, pack audits, exports, and MCP-ready context.	AI search, contents extraction, deep research, structured outputs, and monitors.	LLM-optimized search and extract APIs plus repeatable task research with citations.	Search, extract, crawl, map, and research API for agent web access.
Starting point	A URL, sitemap, explicit source list, existing pack, or provider-discovered candidates.	A query, URL, category, structured schema, monitor, or agent research task.	A natural-language objective, search query set, URL list, or task spec.	A search query, known URL, website root, crawl instruction, or research prompt.
Primary output	Markdown, NDJSON, SQLite, OKF, manifests, citations, entity maps, briefs, and local pack routes.	Ranked results, highlights, full text, summaries, grounded answers, and JSON fields.	LLM-ready excerpts, clean markdown, structured task outputs, citations, and confidence signals.	Search results, extracted page content, site maps, crawls, and research reports.
Local corpus	First-class: cache, resume, refresh, diff, audit, answer-pack, monitor, export, and serve.	API-first. Persist results yourself when you need a durable local corpus.	API-first, with DocPull integration for local context packs from Parallel results.	API-first. Store crawl, extract, or research results in your own pipeline.
Agent surface	CLI, Python SDK, MCP tools, pack server, and agent skill or rule exports.	SDKs, API docs, OpenAI compatibility, MCP, and coding-agent integration guidance.	Python and TypeScript SDKs, API docs, MCP search tooling, and agent setup prompts.	REST API, Python and JavaScript SDKs, CLI, LangChain integration, and agent skills.
Choose it when	You need repeatable local artifacts an agent can inspect, cite, diff, refresh, and reuse offline.	You need a high-quality AI search layer with token-dense contents or structured web research.	You need web search or extraction shaped for model context, or long-running research tasks.	You need a broad hosted web API that covers search, extraction, crawl, map, and research.

Public web to agent-ready Markdown.