A specialized web mining tool for Tunisian Arabic (Derja), designed to build high-quality NLP datasets for LLM training.
Derja Smart Web Scraper is a lightweight CLI utility that automates the collection of Tunisian Arabic text from the open web. It leverages Google Search (via SerpAPI) to discover authentic content, scrapes the resulting pages, and applies a heuristic Derja Detector to filter for relevant sentences.
It is specifically engineered to handle the nuances of the Tunisian dialect, including:
- Code-switching: Mixing French/English with Arabic.
- Arabizi: Latin-script Arabic (e.g., "chnowa a7welek").
- Arabic Script: Standard and dialectal Arabic text.
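The three cases above can be told apart at the script level with a rough check like the one below. This is only an illustrative sketch: the function name `classify_script`, the labels it returns, and the digit set are assumptions, not part of the tool. Arabizi written without digits (for example "nokat tounsiya") looks the same as French at this level, which is one reason a vocabulary-based score is still needed.

```python
import re

# Basic Arabic Unicode block; covers the standard Arabic letters.
ARABIC_CHARS = re.compile(r"[\u0600-\u06FF]")
LATIN_CHARS = re.compile(r"[A-Za-z]")
# A word that contains both a Latin letter and one of the digits that
# Arabizi commonly uses as letters (assumed set: 3, 5, 7, 8, 9).
ARABIZI_TOKEN = re.compile(r"\b(?=\w*[A-Za-z])(?=\w*[35789])\w+\b")


def classify_script(text: str) -> str:
    """Return 'arabic', 'arabizi', 'latin', 'mixed', or 'unknown'."""
    has_arabic = bool(ARABIC_CHARS.search(text))
    has_latin = bool(LATIN_CHARS.search(text))
    if has_arabic and has_latin:
        return "mixed"  # e.g. code-switching across scripts
    if has_arabic:
        return "arabic"
    if has_latin:
        # Digit-in-word is a crude cue; tokens like "mp3" are false positives.
        return "arabizi" if ARABIZI_TOKEN.search(text) else "latin"
    return "unknown"


for sample in ["chnowa a7welek", "كلام تونسي", "proverbes tunisiens"]:
    print(sample, "->", classify_script(sample))
```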
Key features:
- Intelligent Search: Automates Google queries to find niche forums and blogs.
- Smart Filtering: Uses a vocabulary-based scoring system (`score_derja`) to distinguish Tunisian Derja from MSA (Modern Standard Arabic) or other noise.
- Clean Output: Produces ready-to-train JSONL data, compatible with standard NLP pipelines.
- Robust Cleaning: Strips HTML, ads, and navigation elements using BeautifulSoup.
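To make the idea of vocabulary-based scoring concrete, here is a minimal sketch. It is not the tool's actual `score_derja`: the function name `score_derja_sketch`, the tiny marker list, and the hits-per-token formula are all assumptions chosen for illustration. The only property taken from this README is that the score falls between 0.0 and 1.0 so it can be compared against `--min_score`.

```python
import re

# Hypothetical marker vocabulary: a few common Tunisian words in Arabic
# script and in Arabizi. The real tool's word list is not shown here.
DERJA_MARKERS = {
    "برشا", "باهي", "شنوة", "توا",
    "barcha", "behi", "chnowa", "tawa", "famma",
}

TOKEN = re.compile(r"[\w']+")


def score_derja_sketch(text: str) -> float:
    """Fraction of tokens that appear in the marker vocabulary (0.0 to 1.0)."""
    tokens = [t.lower() for t in TOKEN.findall(text)]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in DERJA_MARKERS)
    return hits / len(tokens)


def keep_derja(sentences, min_score=0.12):
    """Keep only sentences whose score reaches the threshold."""
    return [s for s in sentences if score_derja_sketch(s) >= min_score]


print(score_derja_sketch("chnowa a7welek"))  # 1 marker out of 2 tokens -> 0.5
print(keep_derja(["chnowa a7welek", "hello world"]))
```

A ratio like this favors short sentences that contain a marker word, so a production heuristic would likely weight words or penalize MSA-only vocabulary as well.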
To install and configure the scraper:
- Clone the repository:

      git clone https://github.com/bahaeddinmselmi/derja-smart-scraper.git
      cd derja-smart-scraper
- Install dependencies:

      pip install -r requirements.txt
- Set up API Key: You need a SerpAPI key to perform searches.

      # PowerShell
      $env:SERPAPI_API_KEY = "YOUR_API_KEY"
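On the Python side, a script can pick up that environment variable and fail early when it is missing. The sketch below shows one way to do this and to assemble a search request URL without sending it. The helper names are hypothetical, and the endpoint and parameter names (`engine`, `q`, `num`, `api_key`) reflect SerpAPI's public HTTP interface as commonly documented; the scraper itself may use a client library instead.

```python
import os
from urllib.parse import urlencode


def get_serpapi_key() -> str:
    """Read the API key from the environment, raising a clear error if absent."""
    key = os.environ.get("SERPAPI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "SERPAPI_API_KEY is not set; define it before running the scraper."
        )
    return key


def build_search_url(query: str, api_key: str, num: int = 20) -> str:
    """Build (but do not send) a Google search request URL for SerpAPI."""
    params = {"engine": "google", "q": query, "num": num, "api_key": api_key}
    # urlencode percent-encodes non-ASCII queries (such as Arabic) as UTF-8.
    return "https://serpapi.com/search.json?" + urlencode(params)


# Placeholder key used only to show the resulting URL shape.
print(build_search_url("nokat tounsiya", "YOUR_API_KEY", num=5))
```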
Run the scraper with custom queries to start building your dataset:
    python -m smart_scraper.collect_web_derja_search \
      --queries "كلام تونسي" "proverbes tunisiens" "nokat tounsiya" \
      --max_results_per_query 20 \
      --min_score 0.15 \
      --out_file data/my_tunisian_dataset.jsonl

| Flag | Description | Default |
|---|---|---|
| `--queries` | List of search terms | (Internal List) |
| `--min_score` | Threshold for Derja detection (0.0 - 1.0) | 0.12 |
| `--max_segments` | Max sentences to collect | 50,000 |
| `--out_file` | Output JSONL path | `data/derja_segments_raw.jsonl` |
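For readers who want to see how these flags could map onto a command-line parser, the following `argparse` sketch mirrors the documented names and defaults. It is an assumption about structure, not the tool's source: `build_parser` and `unit_interval` are hypothetical names, `None` stands in for the internal query list, and the default for `--max_results_per_query` is left as `None` because this README does not state it.

```python
import argparse


def unit_interval(value: str) -> float:
    """Parse a float and require it to lie in [0.0, 1.0]."""
    number = float(value)
    if not 0.0 <= number <= 1.0:
        raise argparse.ArgumentTypeError("must be between 0.0 and 1.0")
    return number


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Collect Tunisian Derja segments from web search results."
    )
    # None is a placeholder for the tool's internal query list.
    parser.add_argument("--queries", nargs="+", default=None)
    # Default not documented in the README, so it is left unset here.
    parser.add_argument("--max_results_per_query", type=int, default=None)
    parser.add_argument("--min_score", type=unit_interval, default=0.12)
    parser.add_argument("--max_segments", type=int, default=50_000)
    parser.add_argument("--out_file", default="data/derja_segments_raw.jsonl")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["--queries", "كلام تونسي", "nokat tounsiya", "--min_score", "0.15"]
)
print(args.queries, args.min_score, args.max_segments, args.out_file)
```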
The tool provides rich metadata for every scraped sentence:
    {
      "text": "chnowa raykom fi hadha?",
      "source": "web_search",
      "score": 0.88,
      "meta": {
        "url": "https://example.com/forum/topic-123",
        "domain": "example.com",
        "query": "tunisian forum"
      }
    }

Contributions are welcome! If you have improved heuristics for Derja detection or new search keywords, please submit a Pull Request.
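For anyone inspecting the output or comparing a new heuristic against it, the JSONL records can be read back with the standard library. The sketch below assumes one JSON object per line with the `text`, `score`, and `meta.domain` fields described in this README; the helper names are hypothetical and the second sample record is made up for the demonstration.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str], min_score: float = 0.0) -> Iterator[dict]:
    """Yield parsed records whose score is at least min_score."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if record.get("score", 0.0) >= min_score:
            yield record


def domain_counts(records: Iterable[dict]) -> Counter:
    """Count records per source domain."""
    return Counter(r.get("meta", {}).get("domain", "unknown") for r in records)


# In-memory sample lines; with a real file, pass an open file handle instead:
#   with open("data/derja_segments_raw.jsonl", encoding="utf-8") as fh: ...
sample = [
    '{"text": "chnowa raykom fi hadha?", "source": "web_search", "score": 0.88,'
    ' "meta": {"url": "https://example.com/forum/topic-123",'
    ' "domain": "example.com", "query": "tunisian forum"}}',
    '{"text": "bonjour tout le monde", "source": "web_search", "score": 0.05,'
    ' "meta": {"url": "https://example.org/post", "domain": "example.org",'
    ' "query": "tunisian forum"}}',
]

high_confidence = list(iter_records(sample, min_score=0.5))
print(len(high_confidence), domain_counts(iter_records(sample)))
```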
License: MIT