docpull

Public web to agent-ready Markdown.

Fetch public web sources locally, use explicit rendering only when needed, then hand clean Markdown, NDJSON, and context packs to coding agents, MCP clients, and RAG pipelines.

pip install docpull
rag crawl
$ docpull https://www.python.org/blogs/ --profile rag -o ./python-news
robots.txt allowed; discovered 38 pages
fetching static HTML with conditional cache
[==============================] 38/38
wrote Markdown, NDJSON, manifest, and sources.md
done in 12s; 2.8 MB saved to ./python-news

How it works

Three steps from URL to usable Markdown.

[Diagram: Step 01, a URL; Step 02, 38 discovered pages converted from HTML to Markdown; Step 03, output to a vector store, RAG pipeline, or agent skill]

Point

Give docpull a public URL.

Fetch

It discovers pages, respects robots.txt, and converts server HTML.

Use

Load the Markdown into your agent, search index, or skill directory.
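
To make the Fetch step more concrete, here is a minimal sketch of a robots.txt check using only Python's standard library. It shows the general idea of respecting robots.txt and is not docpull's actual implementation; the `example-bot` user agent and the sample rules are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; a real crawler would download it from the site.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/blog/", "https://example.com/private/notes"):
        print(url, is_allowed(SAMPLE_ROBOTS, "example-bot", url))
```

A crawler would typically run a check like this once per candidate URL and skip anything the site disallows.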

Capability map

DocPull is more than a crawler: it is a local-first source pipeline for capture, policy, pack intelligence, exports, and agent tools.

Capture

Fetch, crawl, and render

Start from one public URL, a site section, or an explicit source list.

docpull URL --profile rag --render fallback
  • Static and server-rendered HTML to Markdown, NDJSON, SQLite, or OKF
  • Profiles for RAG, mirrors, quick samples, LLM chunks, OKF, and SEC filings
  • Depth, page, path, concurrency, per-host, proxy, retry, and tokenizer controls
  • Optional agent-browser rendering with domain, viewport, timeout, and HTML-size limits

Guardrails

Safety and repeatability

Designed for agent-driven fetches where URLs, credentials, and reruns need clear boundaries.

docpull policy validate source_policy.json
  • HTTPS and SSRF validation, robots.txt handling, pinned-DNS checks, and strict TLS defaults
  • Auth policy labels plus bearer, basic, cookie, and custom-header checks
  • Cache, resume, conditional fetches, dry runs, and changed-only refreshes
  • Doctor diagnostics, benchmark reports, and CI-friendly regression tests
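
As a rough illustration of what HTTPS and SSRF validation can look like, the sketch below rejects non-HTTPS URLs and IP literals that are not publicly routable. It is an assumption-laden example, not docpull's validator: the `check_url` name is hypothetical, and the DNS resolution and pinning step that a real check needs is deliberately left out.

```python
import ipaddress
from urllib.parse import urlsplit


def check_url(url: str) -> tuple[bool, str]:
    """Return (ok, reason) for a candidate URL before any network access."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False, "scheme must be https"
    host = parts.hostname
    if not host:
        return False, "missing host"
    if host == "localhost" or host.endswith(".localhost"):
        return False, "localhost is not a public host"
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. A real validator would also resolve the name and
        # check (and pin) the resolved addresses; that step is omitted here.
        return True, "hostname accepted (DNS not checked in this sketch)"
    if not addr.is_global:
        return False, f"{addr} is not a public address"
    return True, "public IP literal"


if __name__ == "__main__":
    for candidate in ("https://www.python.org/blogs/", "https://169.254.169.254/"):
        print(candidate, check_url(candidate))
```

Running the check before any request is made means a URL supplied by an agent cannot reach loopback, private, or link-local addresses through an IP literal.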

Pack intelligence

Local context packs

Turn saved sources into a local evidence set an agent can inspect before it writes.

docpull pack prepare ./pack
  • Refresh, score, diff, audit, and source-inventory reports
  • Citation maps, entity extraction, pack search, and research briefs
  • answer-pack responses grounded in local Markdown with cited source files
  • Monitor init, run, list, and report flows for scheduled pack updates
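
Because a pack is saved as local Markdown, an agent can cite a file path and line number for anything it finds. The following sketch shows that idea with a plain substring search over `*.md` files. It is not docpull's pack search; `search_pack` is a hypothetical helper, and it assumes only that the pack directory contains Markdown files.

```python
from pathlib import Path


def search_pack(pack_dir: str, term: str) -> list[tuple[str, int, str]]:
    """Return (relative path, line number, line) for each case-insensitive match."""
    root = Path(pack_dir)
    needle = term.lower()
    hits = []
    for path in sorted(root.rglob("*.md")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if needle in line.lower():
                rel = path.relative_to(root).as_posix()
                hits.append((rel, lineno, line.strip()))
    return hits


if __name__ == "__main__":
    # "./pack" matches the directory name used in the command above.
    for rel, lineno, line in search_pack("./pack", "release"):
        print(f"{rel}:{lineno}: {line}")
```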

Interfaces

Agent and developer surfaces

The same core workflows are available to humans, Python code, and MCP clients.

docpull mcp
  • CLI commands for full operator workflows and file outputs
  • Python SDK exports for fetch, scrape, render, chunk, search, refresh, audit, answer, export, and serve
  • MCP tools for fetch, render, ensure, list, search, read, packs, policy, and exports
  • Local pack server plus JSONL, agent skill, and rule exports

Source finding

Discovery and provider research

Use local discovery first, then add provider-backed research when the agent needs to find sources.

docpull discover sitemap URL ./candidates
  • Import URLs, read sitemaps, normalize candidates, select sources, and fetch chosen URLs
  • Source policy explain and validate flows before a pack is built
  • Optional Parallel context, API, discovery, extract, fallback, diff, and entity packs
  • Provider auth, init, status, and batch workflows for larger research jobs
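
Reading a sitemap and normalizing candidates could look roughly like the sketch below, which uses the standard sitemap XML namespace and Python's standard library. The normalization rules (lowercase scheme and host, drop fragments, default an empty path to `/`) are assumptions chosen for illustration; docpull's own rules may differ.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit, urlunsplit

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def normalize(url: str) -> str:
    """Lowercase scheme and host, drop the fragment, default an empty path to '/'."""
    parts = urlsplit(url.strip())
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def sitemap_candidates(xml_text: str) -> list[str]:
    """Return unique, normalized <loc> URLs in first-seen order."""
    root = ET.fromstring(xml_text)
    seen: dict[str, None] = {}
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        if loc.text:
            seen.setdefault(normalize(loc.text), None)
    return list(seen)
```

Deduplicating after normalization keeps URLs that differ only by case or fragment from being fetched twice.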

Surface index

Core workflows are exposed where agents and developers need them.

Names differ by surface, but the durable capabilities stay aligned.

CLI

  • fetch
  • render
  • discover
  • refresh
  • pack
  • answer
  • export
  • serve
  • monitor
  • provider

SDK

  • Fetcher
  • Scraper
  • RenderConfig
  • PolicyConfig
  • refresh_pack
  • audit_pack
  • answer_pack
  • export_pack
  • load_pack
  • create_pack_app

MCP

  • fetch_url
  • render_url
  • ensure_docs
  • grep_docs
  • read_doc
  • source aliases
  • pack_diff
  • audit_pack
  • answer_pack
  • validate_policy
  • export_pack

Outputs

  • Markdown
  • frontmatter
  • NDJSON
  • SQLite
  • OKF
  • chunks
  • citations
  • entities
  • skills
  • server routes

DocPull vs hosted web APIs

Exa, Parallel, and Tavily are strong hosted web-intelligence layers. DocPull is the local artifact layer that turns selected sources into repeatable context packs, audits, exports, and MCP tools.

Comparison of DocPull, Exa, Parallel, and Tavily across role, inputs, outputs, local corpus support, agent surfaces, and fit.
| Dimension | DocPull | Exa | Parallel | Tavily |
| --- | --- | --- | --- | --- |
| Best role | Local-first source pipeline for known URLs, source lists, pack audits, exports, and MCP-ready context. | AI search, contents extraction, deep research, structured outputs, and monitors. | LLM-optimized search and extract APIs plus repeatable task research with citations. | Search, extract, crawl, map, and research API for agent web access. |
| Starting point | A URL, sitemap, explicit source list, existing pack, or provider-discovered candidates. | A query, URL, category, structured schema, monitor, or agent research task. | A natural-language objective, search query set, URL list, or task spec. | A search query, known URL, website root, crawl instruction, or research prompt. |
| Primary output | Markdown, NDJSON, SQLite, OKF, manifests, citations, entity maps, briefs, and local pack routes. | Ranked results, highlights, full text, summaries, grounded answers, and JSON fields. | LLM-ready excerpts, clean markdown, structured task outputs, citations, and confidence signals. | Search results, extracted page content, site maps, crawls, and research reports. |
| Local corpus | First-class: cache, resume, refresh, diff, audit, answer-pack, monitor, export, and serve. | API-first. Persist results yourself when you need a durable local corpus. | API-first, with DocPull integration for local context packs from Parallel results. | API-first. Store crawl, extract, or research results in your own pipeline. |
| Agent surface | CLI, Python SDK, MCP tools, pack server, and agent skill or rule exports. | SDKs, API docs, OpenAI compatibility, MCP, and coding-agent integration guidance. | Python and TypeScript SDKs, API docs, MCP search tooling, and agent setup prompts. | REST API, Python and JavaScript SDKs, CLI, LangChain integration, and agent skills. |
| Choose it when | You need repeatable local artifacts an agent can inspect, cite, diff, refresh, and reuse offline. | You need a high-quality AI search layer with token-dense contents or structured web research. | You need web search or extraction shaped for model context, or long-running research tasks. | You need a broad hosted web API that covers search, extraction, crawl, map, and research. |

Use hosted APIs to find, rank, enrich, or research the web. Use DocPull when those selected sources need to become a durable local corpus with file paths, manifests, citations, and repeatable audits.

Profiles

Choose the output shape before you crawl.

RAG

Clean Markdown with metadata and deduplication for search and retrieval.

docpull URL --profile rag
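
Deduplication for retrieval usually means dropping pages whose content is identical even though their URLs differ. A minimal sketch of that idea, assuming a whitespace-insensitive SHA-256 hash of the Markdown body, is shown here. The helper names are hypothetical and this is not necessarily how the rag profile deduplicates.

```python
import hashlib


def content_key(markdown: str) -> str:
    """Hash the text after collapsing whitespace so trivial differences do not matter."""
    normalized = " ".join(markdown.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe(pages: dict[str, str]) -> dict[str, str]:
    """Keep the first URL seen for each distinct page body."""
    seen: set[str] = set()
    unique: dict[str, str] = {}
    for url, body in pages.items():
        key = content_key(body)
        if key not in seen:
            seen.add(key)
            unique[url] = body
    return unique
```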

Mirror

A full local archive with caching, resume on interrupt, and stable file paths.

docpull URL --profile mirror

Quick

A 50-page sample when you want to inspect output before committing to a full crawl.

docpull URL --profile quick

LLM

Chunked, streaming records sized for language model context windows. JavaScript-only pages are skipped unless strict mode is on.

docpull URL --profile llm --stream | jq .
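
If you would rather consume the stream from Python than from jq, a small NDJSON reader is enough. The sketch below parses one JSON object per line and skips anything else. The record schema is not documented on this page, so the sample records are hypothetical and the demo only lists keys.

```python
import json
from typing import Iterable, Iterator


def read_ndjson(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line; skip lines that are not JSON objects."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            yield record


if __name__ == "__main__":
    # Hypothetical sample records; real field names depend on docpull's schema.
    sample = ['{"id": 1}', "", '{"id": 2}']
    for record in read_ndjson(sample):
        print(sorted(record))
```

To read a live stream, pass `sys.stdin` to `read_ndjson` and pipe the docpull command above into the script instead of into jq.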

Examples

See the command, then see the artifact it leaves behind.

Input
docpull https://www.python.org/blogs/ -o ./python-news
Output
./python-news/index.md:

---
title: "Blogs"
source: https://www.python.org/blogs/
---

# Blogs

News from the Python Software Foundation, Python core
developers, and the wider Python community.

Recent posts include release notes, governance updates,
events, and project announcements...
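
The frontmatter in the output above is easy to separate from the body before indexing. This sketch handles only flat `key: value` lines like the ones shown; `split_frontmatter` is a hypothetical helper, and a YAML parser would be the safer choice if the real files carry richer metadata.

```python
def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a Markdown document into (frontmatter fields, body)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    try:
        # Find the closing delimiter, searching from the second line onward.
        end = next(
            i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---"
        )
    except StopIteration:
        return {}, text
    meta: dict[str, str] = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    body = "\n".join(lines[end + 1:]).lstrip("\n")
    return meta, body
```

The `source` field gives each chunk a citation back to the original URL, which is useful when the Markdown is loaded into a search index.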

Parallel context packs

Parallel is an optional source-discovery layer. Use docpull when you already know the URL. Add Parallel when an agent needs to find sources, extract live content, and package everything into a local context pack before it starts work.

Use docpull for known URLs

Start here when you already have the URL and want a clean Markdown mirror — no browser, no API key.

  • static pages, blogs, docs, and API references
  • search-ready or skill-ready Markdown
  • repeatable, offline-friendly archives

Add Parallel for web research

Use the Parallel layer when you need to find sources first, extract live content, or run entity and batch research before writing local context.

  • research packs from search queries
  • ranked source discovery with crawl plans
  • cited source bundles with a load plan
  • API and vendor comparison research
  • diffs, entity dossiers, and batch workflows

Discovery & research packs

context-pack / discover-docs

Parallel finds and extracts current web sources. docpull saves them locally as Markdown, structured records, source indexes, and an AGENT_CONTEXT.md load plan.

API specs & entity research

api-pack / entity-pack

Turn llms.txt files and OpenAPI specs into local packs, or build dossiers on companies, vendors, and research targets from Parallel's entity search.

Diffs & change briefs

diff-brief / fallback-pack

Compare two snapshots of a pack to see what changed, or fall back to Parallel Extract only for pages your local crawl missed.

Install

Install once, then crawl from your terminal, scripts, or agent workflow. Requires Python 3.10 or newer.

pip install docpull
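
For script-driven crawls, one option is to call the CLI through `subprocess`. The wrapper below is a sketch that uses only the flags shown on this page (`--profile` and `-o`); the function names are hypothetical, and it assumes `docpull` is on your PATH.

```python
import shlex
import subprocess


def build_command(url: str, out_dir: str, profile: str = "rag") -> list[str]:
    """Build the argv for a docpull crawl using only flags shown on this page."""
    return ["docpull", url, "--profile", profile, "-o", out_dir]


def crawl(url: str, out_dir: str, profile: str = "rag") -> int:
    """Run docpull and return its exit code; requires docpull on PATH."""
    cmd = build_command(url, out_dir, profile)
    print("running:", shlex.join(cmd))
    return subprocess.run(cmd, check=False).returncode
```

Passing the arguments as a list rather than a shell string avoids quoting problems when a URL contains characters such as `&` or `?`.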

Why docpull?

Answers to questions people ask before installing.