MinerU-HTML: An SLM-powered HTML main content extractor that outputs clean HTML bodies. Perfect for Deep Research Agents, RAG applications, and training data generation.
Updated Mar 27, 2026 - Python
pebkac Chrome Nonautomation - A Local LLM-Driven Web Co-Browser using Smolagents, Zendriver, Trafilatura.
Fast, accurate web content extraction in Rust. ML page-type classification, per-type extraction, confidence scoring. F1=0.966 on ScrapingHub (#1), F1=0.859 across 2,008 annotated pages (1,497 development + 511 held-out test).
Web scraper in Python
Telegram Mini App that saves internet articles to read them later
Lean Python tool for extracting clean, LLM-optimized markdown from web pages. Handles dynamic content with Playwright + Trafilatura for maximum information extraction efficiency.
ChatGPT AI Clone
A pipe-based news article scraping and metadata extraction library for Python
Real-time AI search and chat backend with WebSocket streaming, powered by Tavily web search and Google Gemini for Flutter apps.
Tools for LLMs to anonymously search and browse the web
🕷️ Clean, chunked documentation crawler optimized for RAG & AnythingLLM. Dockerized.
Trafilatura API for extracting content and information from HTML
This project is a Python-based web scraping tool that uses the Trafilatura library to extract and save text content from a list of specified websites. The program is designed to process multiple URLs, extract their main content, and save each website's content to a separate .txt file.
A web scraper with an LLM-powered document suggestion system that combines web crawling, data extraction, and advanced AI capabilities to recommend relevant documents.
🕵️‍♂️ Enable anonymous web searches for your LLM with the first-ever Model Context Protocol server utilizing Tor for secure and private information retrieval.
🤖 Collection of AI agents for web search, RAG, and multi-agent collaboration. Features phi-agent + Groq integration, Ollama support, DuckDuckGo/Google search, web scraping, and local knowledge base querying with vector embeddings.
Local-first search tool layer for AI agents, built with FastAPI, SearXNG, and Trafilatura.
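Several of the projects listed here follow the same basic pattern: take a list of URLs, extract the main content of each page with Trafilatura, and save each page's text to its own .txt file. A minimal sketch of that pattern is shown below. It is not taken from any of the listed repositories; the names `save_main_text` and `slug_for` and the file-naming scheme are assumptions made for illustration. Only `trafilatura.fetch_url` and `trafilatura.extract` are real library calls, and they are imported lazily so the fetch and extract steps can be swapped out.

```python
import re
from pathlib import Path
from urllib.parse import urlparse


def slug_for(url: str) -> str:
    """Turn a URL into a filesystem-safe file stem (hypothetical naming scheme)."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}"
    # Collapse every run of non-alphanumeric characters into one underscore.
    return re.sub(r"[^A-Za-z0-9]+", "_", raw).strip("_") or "page"


def save_main_text(urls, out_dir, fetch=None, extract=None):
    """Save the main text of each URL to <out_dir>/<slug>.txt.

    fetch and extract default to trafilatura.fetch_url and
    trafilatura.extract; both return None on failure, in which
    case the URL is skipped. Returns the list of written paths.
    """
    if fetch is None or extract is None:
        import trafilatura  # third-party: pip install trafilatura

        fetch = fetch or trafilatura.fetch_url
        extract = extract or trafilatura.extract

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    written = []
    for url in urls:
        html = fetch(url)  # raw HTML, or None if the download failed
        text = extract(html) if html else None  # main content, or None
        if not text:
            continue
        path = out / f"{slug_for(url)}.txt"
        path.write_text(text, encoding="utf-8")
        written.append(path)
    return written
```

With Trafilatura installed, calling `save_main_text(["https://example.com/"], "pages")` would attempt to download the page and write its extracted text under `pages/`. Two URLs that map to the same slug overwrite each other in this sketch, so a real tool would likely add a counter or a hash to the file name.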