pdftomd.sh is a RAG workflow-friendly enhancement of Marker that converts a PDF into a single markdown file. It handles GPU and PyTorch configuration, document splitting and chunking, Base64 image embedding, LLM post-processing and cleanup, and consolidation of output.
For more on Marker, see https://github.com/datalab-to/marker
- Splits large PDFs into chunks (defaults configurable via CHUNK_PAGES_DEFAULT, CHUNK_PAGES_CLEAN, and CHUNK_PAGES_LLM in pdftomd.conf) and runs Marker once on the chunk folder (avoids repeated model loads).
- Consolidates all chunk markdown into a single .md file.
- Optionally embeds images as Base64 (no external asset folders needed).
- Optional text-only output that strips image links from the final markdown.
- Optional OCR pass via bundled ocr-pdf/ocr-pdf.sh before conversion (advanced PDF OCR pipeline script, GPU-aware via EasyOCR plugin).
- Optional LLM helper via Marker's built-in --use_llm.
- Automatically uses GPU when available and installs CUDA-enabled torch when needed.
- Cleans up intermediate files and attempts to stop spawned processes on exit.
- Optional supplemental LLM post-processing step with --clean.
The result is typically a cleaner, more streamlined document that is better suited to RAG pipeline ingestion.
Run pdftomd.sh as the ingestion step that turns source PDFs into markdown your splitter and embedder can consume. A typical flow is:
- (Optional) OCR the PDF with -o for scanned documents, or rely on Marker's built-in OCR.
- Convert to a single consolidated markdown file (and optionally embed images with -e, or ignore images altogether with -t).
- Feed the markdown into your chunker, add metadata (file name, page ranges), then index.
Example ingestion command:

    ./pdftomd.sh -e -o /path/to/source.pdf

Benefits over calling Marker directly:
- Handles large documents via chunking while keeping a single output file, which simplifies downstream chunking and metadata.
- Avoids repeated model loads by running Marker once across all chunks, improving throughput for big PDFs.
- Keeps assets self-contained with Base64 embedding or a single attachment bundle, reducing file management for ingestion jobs.
- Adds a wrapper-managed LLM cleanup pass (--clean) with explicit chunking via MAX_TOKENS, which can handle prompt-size limits and timeouts more predictably than Marker’s built-in LLM helper.
- Provides operational glue (GPU detection, torch install, cleanup on exit, consistent output location) so pipeline orchestration is simpler.

Basic usage:

    ./pdftomd.sh /path/to/file.pdf

You can also pass a directory to process all PDFs inside it sequentially:

    ./pdftomd.sh /path/to/folder

Add -r/--recurse to include PDFs in subdirectories:

    ./pdftomd.sh -r /path/to/folder

For an input named file.pdf, this produces file.md in the current directory. If you are not embedding images, it also produces a file_bundle.tar.xz archive with attachments.
- -c, --clean: Post-process the final markdown with the configured LLM to improve readability and fix OCR errors. Creates a .bak of the original markdown and appends footnotes with original text. This is a wrapper-level cleanup pass and can be used together with -l. Note that it can result in longer conversion times.
- --cpu: Force CPU processing (ignore GPU even if present).
- -e, --embed: Embed images as Base64 in the output markdown.
- -h, --help: Show usage.
- -l, --llm: Enable Marker LLM helper (--use_llm) during conversion. Copy pdftomd.conf.pub to pdftomd.conf and configure credentials (e.g., GOOGLE_API_KEY), then optionally set LLM_SERVICE. For OpenAI-compatible endpoints set LLM_SERVICE=marker.services.openai.OpenAIService and supply OPENAI_API_KEY, OPENAI_MODEL, and OPENAI_BASE_URL. When -l is enabled, the wrapper uses smaller PDF chunks controlled by CHUNK_PAGES_LLM (overriding CHUNK_PAGES_CLEAN if --clean is also enabled) to reduce prompt sizes, and it will abort and retry once without --use_llm if it detects a "Rate limit error" in Marker output.
- -n, --no-strip-ocr-layer: Disable OCR text layer stripping when -o is not used.
- -o, --ocr: Run OCR via bundled ocr-pdf/ocr-pdf.sh before conversion (produces <filename>_OCR.md).
- --preclean-copy: Save a copy of the merged markdown (before --clean) as <name>_preclean.md.
- -r, --recurse: Recursively process PDFs when a directory is provided.
- -s, --strip-ocr-layer: Always strip OCR text layer when -o is not used (skips detection).
- -t, --text: Remove image links from the final markdown (ignores --embed).
- -v, --verbose: Show verbose output.
- -w, --workers N: Number of worker processes for Marker (default is 1).
- Output is moved to the directory where the script is run.
- When -o/--ocr is used, the OCR pass writes <filename>_OCR.pdf in the current directory and the final markdown is named <filename>_OCR.md.
- When images are not embedded, the script creates an archive (*_bundle.tar.xz) with attachment directories and prints a reminder to extract it.
- When -t/--text is used, image links are removed from the final markdown and no attachment bundle is created.
- At the end, the script prints total conversion time (HH:MM:SS) and time per page (seconds, 2 decimals).
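
An orchestration job often needs to know which markdown file to pick up after a run. The helper below is a minimal sketch based only on the naming rules stated here; the function name expected_markdown_name and its second argument are hypothetical and not part of pdftomd.sh.

```sh
# Hypothetical helper (not part of pdftomd.sh): predict the markdown file name
# that a run should leave in the current directory.
#   $1 = path to the input PDF
#   $2 = "ocr" if -o/--ocr is used, omitted otherwise
expected_markdown_name() (
  base=$(basename "$1")
  base=${base%.pdf}                     # drop a lowercase .pdf extension
  if [ "${2:-}" = "ocr" ]; then
    printf '%s_OCR.md\n' "$base"        # -o/--ocr: <filename>_OCR.md
  else
    printf '%s.md\n' "$base"            # default: <filename>.md
  fi
)

expected_markdown_name /path/to/report.pdf        # prints report.md
expected_markdown_name /path/to/report.pdf ocr    # prints report_OCR.md
```

The bundle archive name is deliberately left out, since only the *_bundle.tar.xz pattern is documented.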
Copy pdftomd.conf.pub to pdftomd.conf, edit the values for your environment, and keep pdftomd.conf out of version control.
All tweakable defaults (paths, OCR stripping thresholds, LLM settings, etc.) can be overridden in pdftomd.conf; pdftomd.conf.pub contains the full list of supported parameters with defaults.
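
As an illustration, a pdftomd.conf for an OpenAI-compatible endpoint might look like the sketch below. The key names come from this README; every value is a placeholder assumption, as is the shell-style KEY=value syntax, so check pdftomd.conf.pub for the actual format and defaults.

```sh
# Illustrative values only (assumptions); pdftomd.conf.pub lists the real defaults.
CHUNK_PAGES_DEFAULT=50      # assumed: pages per chunk for a plain run
CHUNK_PAGES_CLEAN=20        # assumed: pages per chunk when --clean is used
CHUNK_PAGES_LLM=5           # assumed: smaller chunks when -l/--llm is used
MAX_TOKENS=4000             # assumed: budget used to split markdown for --clean

# OpenAI-compatible endpoint (placeholder credentials and URL)
LLM_SERVICE=marker.services.openai.OpenAIService
OPENAI_API_KEY="replace-me"
OPENAI_MODEL="your-model-name"
OPENAI_BASE_URL="https://llm.example.internal/v1"
```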
Run ./install.sh to clone/update the Marker repo into ./marker, create the Marker venv, install dependencies, and populate pdftomd.conf if it does not exist (by copying pdftomd.conf.pub). It also updates MARKER_DIRECTORY, MARKER_VENV, MARKER_RESULTS, and OCR_SCRIPT in pdftomd.conf to match the local install.
Use ./install.sh --force to overwrite pdftomd.conf by re-copying pdftomd.conf.pub before applying the path updates.
Run ./update-marker.sh to pull the latest Marker changes without modifying your local configuration or venv. It will refuse to update if the marker repo has local changes, and performs lightweight checks (entrypoints + wrapper syntax).
Marker already performs OCR on images during conversion, so -o/--ocr is optional. The bundled ocr-pdf/ocr-pdf.sh is a separate pre-processing pipeline that uses OCRmyPDF + Tesseract (optionally via the EasyOCR plugin for GPU) and adds steps like blank-page detection/removal, deskewing, autorotation, and size optimization before Marker runs. Use it if you want to experiment with alternate OCR engines/languages or extra pre-processing on scanned PDFs. In general, however, Marker's built-in OCR does a better job.
When -o/--ocr is enabled, the wrapper passes --disable_ocr to Marker so it does not override the pre-processed OCR layer. When -o/--ocr is not used, the wrapper forces Marker OCR and strips existing OCR text layers to prefer Marker’s own OCR.
When -o/--ocr is not used, the wrapper performs a fast PyPDF2 pass to detect OCR text layers and, if detected, physically strips text objects from the input PDF before running Marker. This helps prevent stale OCR layers from being reused. The pass uses the Marker venv and will install PyPDF2 there if missing.
Use -s/--strip-ocr-layer to force stripping without detection, or -n/--no-strip-ocr-layer to disable the stripping step. Detection thresholds are configurable in pdftomd.conf via OCR_DETECT_INVISIBLE_RATIO, OCR_DETECT_MIN_PAGE_RATIO, and OCR_DETECT_MIN_PAGES.
-l/--llm tells Marker to use its LLM helper during conversion. Marker does not enforce an input token cap for this helper; it sends the full prompt and relies on the backend model’s limits. --clean is a separate, wrapper-driven post-processing step that is more aggressive about fixing OCR errors and adds footnotes for traceability; it also chunk-splits the markdown based on MAX_TOKENS in pdftomd.conf.
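
To make the MAX_TOKENS idea concrete, here is a minimal sketch of budget-based splitting. It is not the wrapper's actual implementation: the function name split_markdown_by_budget, the output file naming, and the rough estimate of four characters per token are all assumptions made for illustration.

```sh
# Hypothetical sketch (not the wrapper's code): split a markdown file into
# pieces that stay under an approximate token budget.
#   $1 = input markdown, $2 = token budget, $3 = output path prefix
# Assumption: 1 token is roughly 4 characters. A single line longer than the
# budget is kept whole in its own piece rather than being cut.
split_markdown_by_budget() (
  awk -v max_chars="$(( $2 * 4 ))" -v prefix="$3" '
    BEGIN { n = 0; used = 0 }
    {
      len = length($0) + 1                       # +1 for the newline
      if (n == 0 || used + len > max_chars) {    # first line, or budget exceeded
        if (n > 0) close(out)
        n++
        used = 0
        out = sprintf("%s_%03d.md", prefix, n)
      }
      print > out
      used += len
    }
    END { print n }                              # report the number of pieces
  ' "$1"
)

# Example call (paths are placeholders):
#   pieces=$(split_markdown_by_budget file.md 4000 /tmp/file_part)
```

Splitting on line boundaries keeps each piece readable on its own; a real cleanup pass would probably also want to avoid cutting inside tables or code blocks.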
When -l is enabled, the wrapper monitors Marker output for "Rate limit error"; on a match it aborts the run and retries once without --use_llm. This prevents Marker from timing out repeatedly and eventually failing after a long delay. The detection is string-based, so it may break if Marker’s log messages change.
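
The fallback can be pictured with the following sketch. It is a simplified illustration, not the wrapper's code: convert_with_llm_fallback and fake_marker are hypothetical names, and the sketch inspects the log only after the first attempt finishes, whereas the wrapper is described as aborting the run when the message appears.

```sh
# Hypothetical sketch: run a converter command with --use_llm, then retry once
# without it if the captured output contains "Rate limit error".
convert_with_llm_fallback() (
  log=$(mktemp)
  if "$@" --use_llm >"$log" 2>&1; then status=0; else status=$?; fi
  if grep -q "Rate limit error" "$log"; then
    echo "Rate limit detected; retrying without --use_llm" >&2
    if "$@" >"$log" 2>&1; then status=0; else status=$?; fi
  fi
  cat "$log"          # show the output of the attempt that counted
  rm -f "$log"
  exit "$status"      # leaves only this subshell function
)

# Stand-in for the real Marker command, used only to demonstrate the flow.
fake_marker() {
  case " $* " in
    *" --use_llm "*) echo "Rate limit error"; return 1 ;;
    *) echo "converted $1 without LLM" ;;
  esac
}

convert_with_llm_fallback fake_marker input.pdf
# stdout: converted input.pdf without LLM
```

Matching on a fixed string keeps the check cheap, at the cost of the brittleness noted above.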
If the fallback keeps triggering (and time is being lost restarting the conversion), consider dropping -l while keeping --clean: the wrapper’s cleanup pass handles chunking more predictably, and still delivers improved readability after conversion.
If you want to strip the --clean footnotes and the OCR Corrections Notes section, use the bundled helper:

    ./remove-OCR-correction.sh /path/to/file.md

Requirements:
- qpdf and pxz
- Marker installed in the configured MARKER_DIRECTORY with an active venv
- Bundled ocr-pdf/ocr-pdf.sh (required for -o/--ocr)
- NVIDIA driver installed if you want GPU (torch will be auto-installed in the venv)
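
If a calling script wants to know in advance whether a GPU run is likely, a check along these lines may help. This is a sketch under stated assumptions, not how pdftomd.sh detects the GPU: pick_device is a hypothetical name, and it assumes that a working nvidia-smi implies a usable NVIDIA driver.

```sh
# Hypothetical sketch (not the wrapper's code): choose a device label.
#   $1 = "cpu" to force CPU, mirroring the --cpu option
pick_device() (
  if [ "${1:-}" = "cpu" ]; then
    echo cpu                                   # explicit override wins
  elif command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi -L >/dev/null 2>&1; then
    echo cuda                                  # driver present and at least one GPU listed
  else
    echo cpu                                   # no usable NVIDIA GPU found
  fi
)

pick_device cpu    # prints cpu
```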

Troubleshooting:
- CUDA OOM with multiple workers: reduce to -w 1.
- If a run is interrupted, stale Marker processes may hold GPU memory. Check with nvidia-smi.
- If Marker reports conversion errors (e.g., CUDA OOM), the script exits non-zero even if Marker itself returns 0.
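
Because the script exits non-zero on conversion errors, a batch job can rely on the exit status alone. The following is a minimal sketch of that pattern; convert_all is a hypothetical helper, not something shipped with pdftomd.sh.

```sh
# Hypothetical sketch: run a converter over several PDFs and list the failures.
#   $1 = converter command, remaining arguments = PDF paths
# Prints one failed path per line on stdout (converter output goes to stderr)
# and returns non-zero if any conversion failed.
convert_all() (
  converter=$1
  shift
  failures=0
  for pdf in "$@"; do
    if ! "$converter" "$pdf" >&2; then
      echo "$pdf"
      failures=$((failures + 1))
    fi
  done
  [ "$failures" -eq 0 ]
)

# Example call (paths are placeholders):
#   convert_all ./pdftomd.sh a.pdf b.pdf > failed.txt
```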
https://github.com/datalab-to/marker
Marker converts documents to markdown, JSON, and HTML with a focus on speed and layout fidelity.
- Supports PDFs plus common office and web formats (PPTX, DOCX, XLSX, HTML, EPUB, images)
- Preserves structure for tables, forms, equations, inline math, links, references, and code blocks
- Extracts images and reduces layout artifacts like headers and footers
- Extensible with custom formatting and post-processing logic
- Optional LLM-assisted mode for higher accuracy on complex layouts
- Runs on GPU, CPU, or MPS with batch-friendly processing
Marker is designed for high throughput and strong accuracy. Reported benchmarks show it outperforming many hosted services and other open source tools.
For the highest accuracy, pass the --use_llm flag to combine Marker with an LLM. This improves table structure, multi-page table merging, inline math, and form value extraction.