LangChain document loader for OpenDataLoader PDF — parse PDFs into structured Document objects for RAG pipelines.
For the full feature set of the core engine (hybrid AI mode, OCR, formula extraction, benchmarks, accessibility), see the OpenDataLoader PDF documentation.
- Accurate reading order: XY-Cut++ algorithm handles multi-column layouts correctly
- Table extraction: Preserves table structure in output
- Multiple formats: Text, Markdown, JSON (with bounding boxes), HTML
- Per-page splitting: Each page becomes a separate Document with page number metadata
- AI safety: Built-in prompt injection filtering (hidden text, off-page content, invisible layers)
- 100% local: No cloud APIs, your documents never leave your machine
- Fast: Rule-based extraction, no GPU required
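The reading-order feature names XY-Cut++, which is not described in this README. As a rough illustration of the general idea only, here is a minimal sketch of the classic recursive XY-cut, not the XY-Cut++ variant and not the engine's implementation. The box format, the function names `widest_gap` and `xy_cut`, and the "cut at the widest gap" rule are all assumptions made for the example.

```python
from typing import List, Optional, Tuple

# (x0, y0, x1, y1) with y increasing down the page; a hypothetical box format.
Box = Tuple[float, float, float, float]


def widest_gap(boxes: List[Box], lo: int, hi: int) -> Tuple[float, Optional[float]]:
    """Find the widest empty interval in the boxes' projection onto one axis.

    lo/hi are the tuple indices of the interval start and end
    (0/2 for the x axis, 1/3 for the y axis).
    Returns (gap_width, cut_position); cut_position is None when no gap exists.
    """
    intervals = sorted((b[lo], b[hi]) for b in boxes)
    best_width, best_cut = 0.0, None
    reach = intervals[0][1]  # furthest coordinate covered so far
    for start, end in intervals[1:]:
        if start - reach > best_width:
            best_width, best_cut = start - reach, (start + reach) / 2
        reach = max(reach, end)
    return best_width, best_cut


def xy_cut(boxes: List[Box]) -> List[Box]:
    """Return boxes in reading order by recursively cutting at the widest gap."""
    if len(boxes) <= 1:
        return list(boxes)
    gap_x, cut_x = widest_gap(boxes, 0, 2)
    gap_y, cut_y = widest_gap(boxes, 1, 3)
    if cut_x is None and cut_y is None:
        # No clean cut exists: fall back to top-to-bottom, left-to-right.
        return sorted(boxes, key=lambda b: (b[1], b[0]))
    if gap_x >= gap_y:
        # Vertical cut: everything left of the gap is read first.
        first = [b for b in boxes if b[2] <= cut_x]
        second = [b for b in boxes if b[0] >= cut_x]
    else:
        # Horizontal cut: everything above the gap is read first.
        first = [b for b in boxes if b[3] <= cut_y]
        second = [b for b in boxes if b[1] >= cut_y]
    return xy_cut(first) + xy_cut(second)


# A full-width title above two columns of two lines each.
title = (0, 0, 100, 10)
left_1, left_2 = (0, 20, 45, 30), (0, 32, 45, 42)
right_1, right_2 = (55, 20, 100, 30), (55, 32, 100, 42)
ordered = xy_cut([right_2, left_1, title, right_1, left_2])
print(ordered == [title, left_1, left_2, right_1, right_2])
```

In this toy layout the column gutter is wider than the line spacing, so the left column is read in full before the right one instead of the two being interleaved line by line.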
- Python >= 3.10
- Java 11+ available on system PATH
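Before installing, it can help to confirm both requirements from Python. The following is a minimal sketch using only the standard library; `check_prerequisites` is a hypothetical helper, not part of the package, and it only checks that a `java` executable is discoverable on PATH, not that it is version 11 or newer.

```python
import shutil
import sys
from typing import List, Tuple


def check_prerequisites(min_python: Tuple[int, int] = (3, 10)) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if sys.version_info < min_python:
        found = ".".join(str(n) for n in sys.version_info[:3])
        wanted = ".".join(str(n) for n in min_python)
        problems.append(f"Python {wanted}+ required, found {found}")
    # shutil.which searches the directories listed in PATH.
    if shutil.which("java") is None:
        problems.append("no java executable found on PATH")
    return problems


for problem in check_prerequisites():
    print("Missing prerequisite:", problem)
```

Running `java -version` in a terminal remains the simplest way to confirm the Java version itself.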
pip install -U langchain-opendataloader-pdf

from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

loader = OpenDataLoaderPDFLoader(
    file_path="document.pdf",
    format="text"
)
documents = loader.load()
print(documents[0].page_content)
print(documents[0].metadata)
# {'source': 'document.pdf', 'format': 'text', 'page': 1}

from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

# Single file, multiple files, or directories, all in one call
loader = OpenDataLoaderPDFLoader(
    file_path=["report1.pdf", "report2.pdf", "documents/"]
)
docs = loader.load()

# Plain text (default): best for simple RAG
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="text")
# Markdown: preserves headings, lists, tables
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="markdown")
# JSON: structured data with bounding boxes for source citations
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="json")
# HTML: styled output
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="html")

For accessible PDFs with structure tags (common in government/legal documents):

loader = OpenDataLoaderPDFLoader(
    file_path="accessible_document.pdf",
    use_struct_tree=True  # Use native PDF structure
)

loader = OpenDataLoaderPDFLoader(
    file_path="financial_report.pdf",
    format="markdown",
    table_method="cluster"  # Better for borderless tables
)

# Replace emails, phone numbers, IPs, credit cards, URLs with placeholders
loader = OpenDataLoaderPDFLoader(
    file_path="document.pdf",
    sanitize=True
)

# Extract only the listed pages and page ranges
loader = OpenDataLoaderPDFLoader(
    file_path="document.pdf",
    pages="1,3,5-10"
)

# By default, headers and footers are excluded for cleaner RAG output
loader = OpenDataLoaderPDFLoader(
    file_path="document.pdf",
    include_header_footer=True
)

# Open an encrypted PDF by supplying its password
loader = OpenDataLoaderPDFLoader(
    file_path="encrypted.pdf",
    password="secret123"
)

# Images are excluded by default (image_output="off")
# This is optimal for text-based RAG pipelines

# Embed images as Base64 (for multimodal RAG)
loader = OpenDataLoaderPDFLoader(
    file_path="doc.pdf",
    format="markdown",
    image_output="embedded",
    image_format="jpeg"  # or "png"
)

# Save images as files to a local directory
loader = OpenDataLoaderPDFLoader(
    file_path="doc.pdf",
    format="markdown",
    image_output="external",
    image_dir="./images",  # images saved here; defaults to temp dir if not set
    image_format="png"
)

For complex documents (tables, charts, scanned content), hybrid mode routes pages to an AI backend for better accuracy while keeping simple pages on the fast local engine:

# Requires a running docling-fast server (default: localhost:5002)
loader = OpenDataLoaderPDFLoader(
    file_path="complex_report.pdf",
    format="markdown",
    hybrid="docling-fast",  # Enable hybrid extraction
    hybrid_mode="auto",  # Auto-triage: only complex pages go to backend
    hybrid_url="http://localhost:5002",
)
documents = loader.load()
# Document metadata shows which backend was used
print(documents[0].metadata)
# {'source': 'complex_report.pdf', 'format': 'markdown', 'page': 1, 'hybrid': 'docling-fast'}

# Suppress console logging
loader = OpenDataLoaderPDFLoader(
    file_path="doc.pdf",
    quiet=True
)

from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader
# RecursiveCharacterTextSplitter lives in the langchain-text-splitters package
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Load PDF
loader = OpenDataLoaderPDFLoader(
    file_path="knowledge_base.pdf",
    format="markdown",
    quiet=True
)
documents = loader.load()

# Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = splitter.split_documents(documents)

# Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(chunks, embeddings)

# Query
results = vectorstore.similarity_search("What is the main topic?")

| Parameter | Type | Default | Description |
|---|---|---|---|
| file_path | str \| List[str] | - | (Required) PDF file path(s) or directories |
| format | str | "text" | Output format: "text", "markdown", "json", "html" |
| split_pages | bool | True | Split into separate Documents per page |
| quiet | bool | False | Suppress console logging |
| password | str | None | Password for encrypted PDFs |
| use_struct_tree | bool | False | Use PDF structure tree (tagged PDFs) |
| table_method | str | "default" | "default" (border-based) or "cluster" (border + clustering) |
| reading_order | str | "xycut" | "xycut" or "off" |
| keep_line_breaks | bool | False | Preserve original line breaks |
| image_output | str | "off" | "off", "embedded" (Base64), or "external" |
| image_format | str | "png" | "png" or "jpeg" |
| image_dir | str | None | Directory for extracted images when using image_output="external" |
| sanitize | bool | False | Sanitize sensitive data (emails, phone numbers, IPs, credit cards, URLs) |
| pages | str | None | Pages to extract (e.g., "1,3,5-7"). Default: all pages |
| include_header_footer | bool | False | Include page headers and footers in output |
| content_safety_off | List[str] | None | Disable safety filters: "hidden-text", "off-page", "tiny", "hidden-ocg", "all" |
| replace_invalid_chars | str | None | Replacement for invalid characters |
| hybrid | str | None | Hybrid AI backend: "docling-fast". Requires running backend server |
| hybrid_mode | str | None | "auto" (route complex pages) or "full" (route all pages) |
| hybrid_url | str | None | Backend server URL. Default: http://localhost:5002 |
| hybrid_timeout | str | None | Backend timeout in ms. Default: "30000" |
| hybrid_fallback | bool | False | Fall back to Java extraction on backend failure |
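The `pages` value is a compact string such as "1,3,5-7". To preview which pages a given string would select before running an extraction, a small parser can expand it. This is a minimal sketch that assumes comma-separated, 1-based page numbers with inclusive ranges, as the examples suggest; `expand_pages` is a hypothetical helper and the loader's own parsing rules may differ.

```python
from typing import List


def expand_pages(spec: str) -> List[int]:
    """Expand a page spec like "1,3,5-7" into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))  # ranges assumed inclusive
        else:
            pages.add(int(part))
    if any(page < 1 for page in pages):
        raise ValueError("page numbers are assumed to start at 1")
    return sorted(pages)


print(expand_pages("1,3,5-7"))
```

Malformed input such as "a-b" raises `ValueError` from `int`, which is usually a more convenient place to catch a typo than partway through a long extraction run.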
Each returned Document includes metadata:
doc.metadata
# {'source': 'document.pdf', 'format': 'text', 'page': 1}
# When hybrid mode is active:
# {'source': 'document.pdf', 'format': 'text', 'page': 1, 'hybrid': 'docling-fast'}

When split_pages=False, the page key is omitted.
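Because the page and hybrid keys are optional, code that builds source citations from metadata should not assume they are present. The following is a minimal sketch that works on plain metadata dictionaries shaped like the ones shown above; `describe_source` is a hypothetical helper, not part of the loader.

```python
from typing import Any, Dict


def describe_source(metadata: Dict[str, Any]) -> str:
    """Build a short citation label from a Document's metadata dictionary."""
    label = str(metadata.get("source", "unknown source"))
    page = metadata.get("page")  # absent when split_pages=False
    if page is not None:
        label += f", page {page}"
    backend = metadata.get("hybrid")  # present only when hybrid mode is active
    if backend:
        label += f" (via {backend})"
    return label


print(describe_source({"source": "document.pdf", "format": "text", "page": 1}))
print(describe_source({"source": "document.pdf", "format": "text"}))
```

With loaded documents, the same helper would be called as `describe_source(doc.metadata)`.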
Apache License 2.0. See LICENSE for details.
- Documentation — Full documentation (hybrid mode, benchmarks, accessibility)
- GitHub — Core engine source code
- LangChain Docs — LangChain integration reference
- PyPI Package