A lean Python tool for extracting clean, LLM-optimized markdown from web pages.
Perfect for AI applications that need high-quality text extraction from both static and dynamic web content. It combines Playwright for JavaScript rendering with Trafilatura for intelligent content extraction, delivering clean markdown ready for LLM processing.
Traditional tools extract everything: ads, cookie banners, navigation menus, social media widgets...
url2md4ai extracts only what matters: clean, structured content ready for LLM processing.
```shell
# Example: Extract a job posting from the Satispay careers page
url2md4ai convert "https://www.satispay.com/careers/job-posting" --show-metadata

# Result: 97% noise reduction (from 51KB to 9KB)
# ✅ Clean job title, description, requirements, benefits
# ❌ No cookie banners, ads, or navigation clutter
```

Perfect for:
- AI content analysis workflows
- LLM-based information extraction
- Web scraping for research and analysis
- Content preprocessing for RAG systems
- Automated content monitoring
- Smart Content Extraction: Powered by `trafilatura` for intelligent text extraction from HTML.
- Dynamic Content Support: Uses `playwright` to render JavaScript on web pages, ensuring content from SPAs and dynamic sites is captured.
- Clean Output: Removes ads, cookie banners, navigation, and other noise for a cleaner final output.
- Simple API: A straightforward Python API and CLI for easy integration into your workflows.
- Deterministic Filenames: Generates unique, hash-based filenames from URLs for consistent output.
- Focused Purpose: Built specifically for AI/LLM text extraction workflows.
- Fast Processing: Optional non-JavaScript mode for static content (3x faster).
- CLI-First: Simple command-line interface for batch processing and automation.
- Python API: Clean programmatic access for integration into AI pipelines.
- Smart Filenames: Generate unique, deterministic filenames using URL hashes.
- Batch Processing: Parallel processing support for multiple URLs.
- Configurable: Extensive configuration options for different content types.
- Reliable: Built-in retry logic and error handling.
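The hash-based filename idea can be sketched in a few lines. Note this is an illustrative sketch, not the library's actual implementation: the function name, hash algorithm, and digest length here are assumptions.

```python
import hashlib

def deterministic_filename(url: str, ext: str = "md") -> str:
    # Hash the URL so the same page always maps to the same file name.
    # (Hypothetical sketch; url2md4ai's actual scheme may differ.)
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return f"{digest}.{ext}"

print(deterministic_filename("https://example.com"))
```

Because the name depends only on the URL, re-running a conversion overwrites the previous output instead of accumulating duplicates.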
```shell
# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and install
git clone https://github.com/mazzasaverio/url2md4ai.git
cd url2md4ai
uv sync

# Install Playwright browsers
uv run playwright install chromium

# Convert your first URL
uv run url2md4ai convert "https://example.com"
```

Or install from PyPI:

```shell
pip install url2md4ai
playwright install chromium
url2md4ai convert "https://example.com"
```

See DOCKER_USAGE.md for instructions on how to use the provided Docker setup.
The CLI provides a simple way to convert URLs to markdown or extract raw HTML.
```shell
# Convert a single URL and print to console
url2md4ai convert "https://example.com" --no-save

# Save the markdown to the default 'output' directory
url2md4ai convert "https://example.com"

# Specify a custom output directory
url2md4ai convert "https://example.com" --output-dir my_markdown

# Get the raw HTML of a page and print it to the console
url2md4ai extract-html "https://example.com"

# Convert a local HTML file to markdown
url2md4ai convert-html my_page.html
```

For more options, use the --help flag with any command:

```shell
url2md4ai convert --help
```

The Python API provides programmatic access to the content extraction functionality.
```python
import asyncio

from url2md4ai import ContentExtractor

# Initialize the extractor
extractor = ContentExtractor()

async def main():
    url = "https://example.com"

    # Extract clean markdown from a URL
    markdown_result = await extractor.extract_markdown(url)
    if markdown_result:
        print("--- MARKDOWN ---")
        print(markdown_result["markdown"])
        print(f"\nSaved to: {markdown_result['output_path']}")

    # Extract raw HTML from a URL
    html_content = await extractor.extract_html(url)
    if html_content:
        print("\n--- HTML ---")
        print(html_content[:200] + "...")  # Print first 200 characters

asyncio.run(main())
```

For use cases where you can't use asyncio, synchronous wrappers are available:
```python
from url2md4ai import ContentExtractor

extractor = ContentExtractor()
url = "https://example.com"

# Synchronously extract markdown
markdown_result = extractor.extract_markdown_sync(url)
if markdown_result:
    print(markdown_result["markdown"])

# Synchronously extract HTML
html_content = extractor.extract_html_sync(url)
if html_content:
    print(html_content[:200] + "...")
```

The behavior of the ContentExtractor can be customized through a Config object or environment variables.
Example: Custom Configuration

```python
from url2md4ai import ContentExtractor, Config

# Customize configuration
config = Config(
    timeout=60,                    # Page load timeout in seconds
    user_agent="MyTestAgent/1.0",  # Custom User-Agent
    output_dir="custom_output",    # Default output directory
    browser_headless=True,         # Run Playwright in headless mode
    wait_for_network_idle=True,    # Wait for network to be idle
    page_wait_timeout=2000,        # Additional wait time in ms
)

extractor = ContentExtractor(config=config)

# This will use the custom configuration
extractor.extract_markdown_sync("https://example.com")
```

See src/url2md4ai/config.py for all available configuration options and their corresponding environment variables.
Contributions are welcome! Please feel free to submit a pull request or open an issue.
This project is licensed under the MIT License. See the LICENSE file for details.
```shell
# Complex job posting with cookie banners and ads
url2md4ai convert "https://company.com/careers/position" --show-metadata
```

Before (Raw HTML): 51KB, 797 lines
- ❌ Cookie consent banners
- ❌ Website navigation
- ❌ Social media widgets
- ❌ Advertising content
- ❌ Footer links and legal text
After (url2md4ai): 9KB, 69 lines
- ✅ Job title and description
- ✅ Key requirements
- ✅ Company benefits
- ✅ Application process
- ✅ 97% noise reduction!
| Content Type | Extraction Quality | Best Settings |
|---|---|---|
| News Articles | ⭐⭐⭐⭐⭐ | --no-js (faster) |
| Job Postings | ⭐⭐⭐⭐⭐ | --force-js (complete) |
| Product Pages | ⭐⭐⭐⭐ | --clean (essential) |
| Documentation | ⭐⭐⭐⭐⭐ | --raw (preserve structure) |
| Blog Posts | ⭐⭐⭐⭐⭐ | default settings |
| Social Media | ⭐⭐⭐ | --force-js required |
- Support for more output formats (PDF, DOCX)
- Custom CSS selector filtering
- Integration with popular LLM APIs
- Web UI interface
- Plugin system for custom processors
- Support for authentication-required pages