bahaeddinmselmi/derja-smart-scraper

🇹🇳 Derja Smart Web Scraper

A specialized web mining tool for Tunisian Arabic (Derja), designed to build high-quality NLP datasets for LLM training.

📖 Overview

Derja Smart Web Scraper is a lightweight CLI utility that automates the collection of Tunisian Arabic text from the open web. It leverages Google Search (via SerpAPI) to discover authentic content, scrapes the resulting pages, and applies a heuristic Derja Detector to filter for relevant sentences.

It is specifically engineered to handle the nuances of the Tunisian dialect, including:

  • Code-switching: Mixing French/English with Arabic.
  • Arabizi: Latin-script Arabic (e.g., "chnowa a7welek").
  • Arabic Script: Standard and dialectal Arabic text.

✨ Key Features

  • Intelligent Search: Automates Google queries to find niche forums and blogs.
  • Smart Filtering: Uses a vocabulary-based scoring function (score_derja) to separate Tunisian Derja from Modern Standard Arabic (MSA) and from unrelated noise.
  • Clean Output: Produces ready-to-train JSONL data, compatible with standard NLP pipelines.
  • Robust Cleaning: Strips HTML, ads, and navigation elements using BeautifulSoup.
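
To make the idea of vocabulary-based scoring concrete, here is a minimal sketch of how such a score could be computed. It is not the project's actual score_derja: the function name score_derja_sketch, the tiny marker list, and the "fraction of matching tokens" formula are all assumptions chosen for illustration.

```python
import re

# Hypothetical marker vocabulary: a handful of common Derja words in Arabic
# script and in Arabizi. A real detector would need a far larger curated list.
DERJA_MARKERS = {
    "برشا", "شنوة", "باهي", "توا", "علاش",        # Arabic script
    "barcha", "chnowa", "behi", "tawa", "3lech",  # Arabizi
}

# A token is a run of characters from the Arabic Unicode block, or a run of
# Latin letters and digits (Arabizi uses digits such as 3, 7, 9 as letters).
TOKEN_RE = re.compile(r"[\u0600-\u06FF]+|[A-Za-z0-9]+")


def score_derja_sketch(text: str) -> float:
    """Return the fraction of tokens found in DERJA_MARKERS (0.0 to 1.0)."""
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in DERJA_MARKERS)
    return hits / len(tokens)


print(score_derja_sketch("chnowa a7welek"))       # one of two tokens is a marker
print(score_derja_sketch("The weather is nice"))  # no markers at all
```

A sentence would then be kept when its score reaches the --min_score threshold. A ratio like this favors short sentences that are dense in markers, which is one reason a low threshold can be reasonable for longer text.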

🛠 Installation

  1. Clone the repository:

    git clone https://github.com/bahaeddinmselmi/derja-smart-scraper.git
    cd derja-smart-scraper

  2. Install dependencies:

    pip install -r requirements.txt

  3. Set up the API key. A SerpAPI key is required to perform searches:

    # PowerShell
    $env:SERPAPI_API_KEY = "YOUR_API_KEY"

    # Bash / Zsh
    export SERPAPI_API_KEY="YOUR_API_KEY"
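
The README does not show how the scraper picks up the key, but the instruction above suggests it reads the SERPAPI_API_KEY environment variable. As a sketch under that assumption, a script could fail early with a clear message when the variable is missing. The helper name get_serpapi_key is hypothetical.

```python
import os


def get_serpapi_key(env=os.environ) -> str:
    """Return the SerpAPI key from the environment, or raise a clear error."""
    # `env` is injectable so the function can be exercised without touching
    # the real process environment.
    key = env.get("SERPAPI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "SERPAPI_API_KEY is not set; define it before running the scraper."
        )
    return key


# Example with an explicit mapping instead of the real environment.
print(get_serpapi_key({"SERPAPI_API_KEY": "YOUR_API_KEY"}))
```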

🚀 Usage

Run the scraper with custom queries to start building your dataset:

python -m smart_scraper.collect_web_derja_search \
  --queries "كلام تونسي" "proverbes tunisiens" "nokat tounsiya" \
  --max_results_per_query 20 \
  --min_score 0.15 \
  --out_file data/my_tunisian_dataset.jsonl

Options

Flag            Description                                 Default
--queries       List of search terms                        (Internal List)
--min_score     Threshold for Derja detection (0.0 - 1.0)   0.12
--max_segments  Max sentences to collect                    50,000
--out_file      Output JSONL path                           data/derja_segments_raw.jsonl
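
For readers who want to see how these flags fit together, the following argparse sketch mirrors the table. It is an illustration, not the project's actual parser: the defaults are copied from the table, the default of None for --queries stands in for the internal list, and the default for --max_results_per_query is not documented here, so it is left as None.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(description="Collect Derja segments (sketch).")
    # None means "fall back to the tool's internal query list".
    parser.add_argument("--queries", nargs="+", default=None)
    # Appears in the usage example; its default is not documented (assumption).
    parser.add_argument("--max_results_per_query", type=int, default=None)
    parser.add_argument("--min_score", type=float, default=0.12)
    parser.add_argument("--max_segments", type=int, default=50_000)
    parser.add_argument("--out_file", default="data/derja_segments_raw.jsonl")
    return parser


# Parse an explicit argument list rather than sys.argv so the example is
# reproducible.
args = build_parser().parse_args(
    ["--queries", "كلام تونسي", "nokat tounsiya", "--min_score", "0.15"]
)
print(args.queries, args.min_score, args.max_segments, args.out_file)
```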

📊 Output Format

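Each record in the output file is a JSON object holding one sentence, its Derja score, and metadata about where it was found. The record below is pretty-printed for readability; in a JSONL file each record occupies a single line: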

{
  "text": "chnowa raykom fi hadha?",
  "source": "web_search",
  "score": 0.88,
  "meta": {
    "url": "https://example.com/forum/topic-123",
    "domain": "example.com",
    "query": "tunisian forum"
  }
}
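
Because the output is JSONL, it can be consumed line by line with the standard library. The sketch below re-filters records at a stricter score than the one used during collection. The helper name iter_records and the second sample record are made up for illustration; the field names follow the record shown above.

```python
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str], min_score: float = 0.0) -> Iterator[dict]:
    """Yield parsed records whose "score" is at least min_score."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if record.get("score", 0.0) >= min_score:
            yield record


# Two in-memory lines standing in for the contents of a .jsonl file.
sample = [
    '{"text": "chnowa raykom fi hadha?", "source": "web_search", "score": 0.88, '
    '"meta": {"url": "https://example.com/forum/topic-123", '
    '"domain": "example.com", "query": "tunisian forum"}}',
    '{"text": "low scoring line", "source": "web_search", "score": 0.05, "meta": {}}',
]

kept = list(iter_records(sample, min_score=0.5))
print(len(kept), kept[0]["meta"]["domain"])
```

To read a real output file, pass an open file handle instead of the list, for example `open("data/derja_segments_raw.jsonl", encoding="utf-8")`. Opening with UTF-8 matters because the text field may contain Arabic script.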

🤝 Contributing

Contributions are welcome! If you have improved heuristics for Derja detection or new search keywords, please submit a Pull Request.

📄 License

MIT
