docpull
Public web to agent-ready Markdown.
Fetch public web sources locally, use explicit rendering only when needed, then hand clean Markdown, NDJSON, and context packs to coding agents, MCP clients, and RAG pipelines.
How it works
Three steps from URL to usable Markdown.
Point
Give docpull a public URL.
Fetch
It discovers pages, respects robots.txt, and converts server-rendered HTML to Markdown.
Use
Load the Markdown into your agent, search index, or skill directory.
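The robots.txt step above can be illustrated with Python's standard library. This is a minimal sketch of the general technique, not docpull's own implementation; the robots.txt body, the `example-bot` user agent, and the `allowed` helper are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/docs/intro", "https://example.com/private/keys"):
        print(url, allowed(ROBOTS_TXT, "example-bot", url))
```

Parsing from a string keeps the sketch offline; in practice the rules would be fetched once per host and reused for every candidate URL.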
Capability map
DocPull is more than a crawler: it is a local-first source pipeline for capture, policy, pack intelligence, exports, and agent tools.
Capture
Fetch, crawl, and render
Start from one public URL, a site section, or an explicit source list.
docpull URL --profile rag --render fallback
- Static and server-rendered HTML to Markdown, NDJSON, SQLite, or OKF
- Profiles for RAG, mirrors, quick samples, LLM chunks, OKF, and SEC filings
- Depth, page, path, concurrency, per-host, proxy, retry, and tokenizer controls
- Optional agent-browser rendering with domain, viewport, timeout, and HTML-size limits
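NDJSON output is one JSON object per line, so a consumer can process records incrementally. The sketch below shows that reading pattern in plain Python. The `url` and `text` field names are assumptions for illustration only; the actual record schema is not described on this page.

```python
import io
import json
from typing import Iterator, TextIO


def read_ndjson(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of the stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical records; the field names are assumed, not taken from docpull.
sample = io.StringIO(
    '{"url": "https://example.com/a", "text": "First chunk"}\n'
    "\n"
    '{"url": "https://example.com/b", "text": "Second chunk"}\n'
)
records = list(read_ndjson(sample))

if __name__ == "__main__":
    for record in records:
        print(record["url"], len(record["text"]))
```

The same function works on `sys.stdin`, so it can sit at the end of a shell pipeline in place of `jq`.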
Guardrails
Safety and repeatability
Designed for agent-driven fetches where URLs, credentials, and reruns need clear boundaries.
docpull policy validate source_policy.json
- HTTPS and SSRF validation, robots.txt handling, pinned-DNS checks, and strict TLS defaults
- Auth policy labels plus bearer, basic, cookie, and custom-header checks
- Cache, resume, conditional fetches, dry runs, and changed-only refreshes
- Doctor diagnostics, benchmark reports, and CI-friendly regression tests
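To make the HTTPS and SSRF bullet concrete, here is a minimal sketch of the kind of check such validation typically involves. It is not docpull's implementation, and the `is_public_https_url` helper is hypothetical. It only inspects the URL itself; a fuller check would also resolve hostnames and validate every resolved address.

```python
import ipaddress
from urllib.parse import urlsplit


def is_public_https_url(url: str) -> bool:
    """Accept only https URLs whose host is not an internal address literal."""
    parts = urlsplit(url)
    host = parts.hostname
    if parts.scheme != "https" or not host:
        return False
    if host == "localhost":
        return False
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. A real validator would resolve the name here,
        # check each resolved address, and pin it for the connection.
        return True
    return not (
        addr.is_private
        or addr.is_loopback
        or addr.is_link_local
        or addr.is_multicast
        or addr.is_reserved
        or addr.is_unspecified
    )


if __name__ == "__main__":
    for candidate in ("https://example.com/docs", "https://169.254.169.254/latest/"):
        print(candidate, is_public_https_url(candidate))
```

Rejecting link-local and private ranges matters most for agent-driven fetches, where a URL supplied by a model could otherwise point at a cloud metadata endpoint or an internal service.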
Pack intelligence
Local context packs
Turn saved sources into a local evidence set an agent can inspect before it writes.
docpull pack prepare ./pack
- Refresh, score, diff, audit, and source-inventory reports
- Citation maps, entity extraction, pack search, and research briefs
- answer-pack responses grounded in local Markdown with cited source files
- Monitor init, run, list, and report flows for scheduled pack updates
Interfaces
Agent and developer surfaces
The same core workflows are available to humans, Python code, and MCP clients.
docpull mcp
- CLI commands for full operator workflows and file outputs
- Python SDK exports for fetch, scrape, render, chunk, search, refresh, audit, answer, export, and serve
- MCP tools for fetch, render, ensure, list, search, read, packs, policy, and exports
- Local pack server plus JSONL, agent skill, and rule exports
Source finding
Discovery and provider research
Use local discovery first, then add provider-backed research when the agent needs to find sources.
docpull discover sitemap URL ./candidates
- Import URLs, read sitemaps, normalize candidates, select sources, and fetch chosen URLs
- Source policy explain and validate flows before a pack is built
- Optional Parallel context, API, discovery, extract, fallback, diff, and entity packs
- Provider auth, init, status, and batch workflows for larger research jobs
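Reading a sitemap and normalizing its candidates, as listed above, can be sketched with the standard library. This illustrates the general idea only; the sample XML and the `sitemap_urls` helper are hypothetical, and docpull's own normalization rules may differ.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap with one duplicate and one entry padded with whitespace.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/</loc></url>
  <url><loc> https://example.com/docs/install </loc></url>
  <url><loc>https://example.com/docs/</loc></url>
</urlset>
"""


def sitemap_urls(xml_text: str) -> list[str]:
    """Return the <loc> URLs in document order, trimmed and de-duplicated."""
    root = ET.fromstring(xml_text)
    urls: list[str] = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        if url and url not in urls:
            urls.append(url)
    return urls


if __name__ == "__main__":
    print(sitemap_urls(SAMPLE))
```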
Surface index
Core workflows are exposed where agents and developers need them.
Names differ by surface, but the durable capabilities stay aligned.
CLI
- fetch
- render
- discover
- refresh
- pack
- answer
- export
- serve
- monitor
- provider
SDK
- Fetcher
- Scraper
- RenderConfig
- PolicyConfig
- refresh_pack
- audit_pack
- answer_pack
- export_pack
- load_pack
- create_pack_app
MCP
- fetch_url
- render_url
- ensure_docs
- grep_docs
- read_doc
- source aliases
- pack_diff
- audit_pack
- answer_pack
- validate_policy
- export_pack
Outputs
- Markdown
- frontmatter
- NDJSON
- SQLite
- OKF
- chunks
- citations
- entities
- skills
- server routes
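Markdown output with frontmatter is easy to consume without extra dependencies. The sketch below splits simple `key: value` frontmatter from the body, using the same shape as the python-news example on this page. The `split_frontmatter` helper is hypothetical, and it assumes flat keys only; nested YAML would need a real YAML parser.

```python
def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split '---' delimited key: value frontmatter from a Markdown body."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    meta: dict[str, str] = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return meta, "\n".join(lines[index + 1:])
        # Split on the first colon only, so URL values stay intact.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    # No closing delimiter: treat the whole text as plain Markdown.
    return {}, text


DOC = """---
title: "Blogs"
source: https://www.python.org/blogs/
---
# Blogs
"""
meta, body = split_frontmatter(DOC)

if __name__ == "__main__":
    print(meta["title"], meta["source"])
    print(body)
```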
DocPull vs hosted web APIs
Exa, Parallel, and Tavily are strong hosted web-intelligence layers. DocPull is the local artifact layer that turns selected sources into repeatable context packs, audits, exports, and MCP tools.
| Dimension | DocPull | Exa | Parallel | Tavily |
|---|---|---|---|---|
| Best role | Local-first source pipeline for known URLs, source lists, pack audits, exports, and MCP-ready context. | AI search, contents extraction, deep research, structured outputs, and monitors. | LLM-optimized search and extract APIs plus repeatable task research with citations. | Search, extract, crawl, map, and research API for agent web access. |
| Starting point | A URL, sitemap, explicit source list, existing pack, or provider-discovered candidates. | A query, URL, category, structured schema, monitor, or agent research task. | A natural-language objective, search query set, URL list, or task spec. | A search query, known URL, website root, crawl instruction, or research prompt. |
| Primary output | Markdown, NDJSON, SQLite, OKF, manifests, citations, entity maps, briefs, and local pack routes. | Ranked results, highlights, full text, summaries, grounded answers, and JSON fields. | LLM-ready excerpts, clean markdown, structured task outputs, citations, and confidence signals. | Search results, extracted page content, site maps, crawls, and research reports. |
| Local corpus | First-class: cache, resume, refresh, diff, audit, answer-pack, monitor, export, and serve. | API-first. Persist results yourself when you need a durable local corpus. | API-first, with DocPull integration for local context packs from Parallel results. | API-first. Store crawl, extract, or research results in your own pipeline. |
| Agent surface | CLI, Python SDK, MCP tools, pack server, and agent skill or rule exports. | SDKs, API docs, OpenAI compatibility, MCP, and coding-agent integration guidance. | Python and TypeScript SDKs, API docs, MCP search tooling, and agent setup prompts. | REST API, Python and JavaScript SDKs, CLI, LangChain integration, and agent skills. |
| Choose it when | You need repeatable local artifacts an agent can inspect, cite, diff, refresh, and reuse offline. | You need a high-quality AI search layer with token-dense contents or structured web research. | You need web search or extraction shaped for model context, or long-running research tasks. | You need a broad hosted web API that covers search, extraction, crawl, map, and research. |
Use hosted APIs to find, rank, enrich, or research the web. Use DocPull when those selected sources need to become a durable local corpus with file paths, manifests, citations, and repeatable audits.
Profiles
Choose the output shape before you crawl.
RAG
Clean Markdown with metadata and deduplication for search and retrieval.
docpull URL --profile rag
Mirror
A full local archive with caching, resume on interrupt, and stable file paths.
docpull URL --profile mirror
Quick
A 50-page sample when you want to inspect output before committing to a full crawl.
docpull URL --profile quick
LLM
Chunked, streaming records sized for language model context windows. JavaScript-only pages are skipped unless strict mode is on.
docpull URL --profile llm --stream | jq .
Examples
See the command, then see the artifact it leaves behind.
docpull https://www.python.org/blogs/ -o ./python-news
./python-news/index.md:
---
title: "Blogs"
source: https://www.python.org/blogs/
---
# Blogs
News from the Python Software Foundation, Python core
developers, and the wider Python community.
Recent posts include release notes, governance updates,
events, and project announcements...
Parallel context packs
Parallel is an optional source-discovery layer. Use docpull when you already know the URL. Add Parallel when an agent needs to find sources, extract live content, and package everything into a local context pack before it starts work.
Use docpull for known URLs
Start here when you already have the URL and want a clean Markdown mirror, with no browser and no API key.
- static pages, blogs, docs, and API references
- search-ready or skill-ready Markdown
- repeatable, offline-friendly archives
Add Parallel for web research
Use the Parallel layer when you need to find sources first, extract live content, or run entity and batch research before writing local context.
- research packs from search queries
- ranked source discovery with crawl plans
- cited source bundles with a load plan
- API and vendor comparison research
- diffs, entity dossiers, and batch workflows
Discovery & research packs
Parallel finds and extracts current web sources. docpull saves them locally as Markdown, structured records, source indexes, and an AGENT_CONTEXT.md load plan.
API specs & entity research
Turn llms.txt files and OpenAPI specs into local packs, or build dossiers on companies, vendors, and research targets from Parallel's entity search.
Diffs & change briefs
Compare two snapshots of a pack to see what changed, or fall back to Parallel Extract only for pages your local crawl missed.
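Comparing two snapshots comes down to comparing content hashes per file. The following is a minimal sketch of that idea, not docpull's diff implementation; the `snapshot` and `diff_snapshots` helpers are hypothetical, and it assumes a pack is a directory tree of Markdown files.

```python
import hashlib
from pathlib import Path


def snapshot(root: Path) -> dict[str, str]:
    """Map each Markdown file's relative path to the SHA-256 of its bytes."""
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*.md"))
    }


def diff_snapshots(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Classify paths as added, removed, or changed between two snapshots."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


if __name__ == "__main__":
    before = {"index.md": "aaa", "guide.md": "bbb"}
    after = {"index.md": "aaa", "guide.md": "ccc", "faq.md": "ddd"}
    print(diff_snapshots(before, after))
```

Hashing bytes rather than comparing timestamps keeps the result stable across re-fetches that rewrite a file without changing its content.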
Install
Install once, then crawl from your terminal, scripts, or agent workflow. Requires Python 3.10 or newer.
pip install docpull
Why docpull?
Answers to questions people ask before installing.