🇹🇳 Derja Smart Web Scraper

A specialized web mining tool for Tunisian Arabic (Derja), designed to build high-quality NLP datasets for LLM training.

📖 Overview

Derja Smart Web Scraper is a lightweight CLI utility that automates the collection of Tunisian Arabic text from the open web. It leverages Google Search (via SerpAPI) to discover authentic content, scrapes the resulting pages, and applies a heuristic Derja Detector to filter for relevant sentences.

It is specifically engineered to handle the nuances of the Tunisian dialect, including:

Code-switching: Mixing French/English with Arabic.
Arabizi: Latin-script Arabic (e.g., "chnowa a7welek").
Arabic Script: Standard and dialectal Arabic text.
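
The three cases above can be told apart with a rough script check. This is a minimal sketch, not the scraper's own logic: the function name `classify_script`, the label strings, and the digit heuristic (digits such as 3, 7, and 9 standing in for Arabic letters) are all assumptions made for illustration.

```python
import re

# Basic Arabic Unicode block.
ARABIC_RE = re.compile(r"[\u0600-\u06FF]")
# ASCII letters plus common accented Latin letters used in French.
LATIN_RE = re.compile(r"[A-Za-z\u00C0-\u00FF]")
# Assumption: a digit like 2/3/5/7/9 touching a Latin letter hints at Arabizi.
ARABIZI_DIGIT_RE = re.compile(r"[A-Za-z][23579]|[23579][A-Za-z]")


def classify_script(text: str) -> str:
    """Return a coarse label: 'mixed', 'arabic', 'arabizi', 'latin', or 'other'."""
    has_arabic = bool(ARABIC_RE.search(text))
    has_latin = bool(LATIN_RE.search(text))
    if has_arabic and has_latin:
        return "mixed"  # likely code-switching
    if has_arabic:
        return "arabic"
    if has_latin:
        # Heuristic only: tokens such as "mp3" would also trigger this branch.
        return "arabizi" if ARABIZI_DIGIT_RE.search(text) else "latin"
    return "other"
```

A label like `"latin"` does not rule out Arabizi, since many Arabizi sentences contain no digits; a vocabulary check is still needed on top of this.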

✨ Key Features

Intelligent Search: Automates Google queries to find niche forums and blogs.
Smart Filtering: Uses a vocabulary-based scoring function (`score_derja`) to separate Tunisian Derja from Modern Standard Arabic (MSA) and from unrelated text.
Clean Output: Produces ready-to-train JSONL data, compatible with standard NLP pipelines.
Robust Cleaning: Strips HTML, ads, and navigation elements using BeautifulSoup.
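
To make the vocabulary-based filtering idea concrete, here is a toy scorer. It is a sketch under stated assumptions, not the project's `score_derja`: the marker word list, the names `toy_derja_score` and `keep_sentence`, and the hits-per-token formula are all hypothetical.

```python
import re

# Hypothetical marker vocabulary: a few common Derja words in Arabizi and
# Arabic script. The real tool's word list is not reproduced here.
DERJA_MARKERS = {
    "chnowa", "barcha", "famma", "yezzi", "mouch",
    "برشا", "فما", "موش", "توا",
}

# Runs of letters and digits (digits matter for Arabizi, e.g. "a7welek").
TOKEN_RE = re.compile(r"[^\W_]+")


def toy_derja_score(sentence: str) -> float:
    """Fraction of tokens found in the marker vocabulary (0.0 to 1.0)."""
    tokens = [t.lower() for t in TOKEN_RE.findall(sentence)]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in DERJA_MARKERS)
    return hits / len(tokens)


def keep_sentence(sentence: str, min_score: float = 0.12) -> bool:
    """Keep a sentence when its score reaches the threshold."""
    return toy_derja_score(sentence) >= min_score
```

With a ratio like this, a longer sentence needs more marker words to pass the same threshold, which is one reason a low threshold may be preferable for forum-length text.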

🛠 Installation

Clone the repository:

```
git clone https://github.com/bahaeddinmselmi/derja-smart-scraper.git
cd derja-smart-scraper
```

Install dependencies:
```
pip install -r requirements.txt
```
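Set up the API key: searches require a SerpAPI key, supplied through the `SERPAPI_API_KEY` environment variable.
```
# PowerShell
$env:SERPAPI_API_KEY = "YOUR_API_KEY"

# bash / zsh
export SERPAPI_API_KEY="YOUR_API_KEY"
```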
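
For context on what the key is used for, the sketch below shows one way a SerpAPI Google query could be issued and its result links extracted, using only the standard library. The helper names (`build_search_url`, `extract_links`, `search`) are hypothetical, and the scraper itself may call SerpAPI differently, for example through a client package.

```python
import json
import os
import urllib.parse
import urllib.request

SERPAPI_ENDPOINT = "https://serpapi.com/search.json"


def build_search_url(query: str, api_key: str, num: int = 20) -> str:
    """Build a SerpAPI request URL for a Google search."""
    params = {"engine": "google", "q": query, "num": num, "api_key": api_key}
    return SERPAPI_ENDPOINT + "?" + urllib.parse.urlencode(params)


def extract_links(payload: dict) -> list:
    """Collect result URLs from the 'organic_results' list of a response."""
    return [r["link"] for r in payload.get("organic_results", []) if "link" in r]


def search(query: str, num: int = 20) -> list:
    """Run one query; raises KeyError if SERPAPI_API_KEY is not set."""
    api_key = os.environ["SERPAPI_API_KEY"]
    url = build_search_url(query, api_key, num)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return extract_links(json.load(resp))
```

Splitting URL construction and response parsing from the network call keeps the two pure helpers easy to test without spending API credits.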

🚀 Usage

Run the scraper with custom queries to start building your dataset:

```
python -m smart_scraper.collect_web_derja_search \
  --queries "كلام تونسي" "proverbes tunisiens" "nokat tounsiya" \
  --max_results_per_query 20 \
  --min_score 0.15 \
  --out_file data/my_tunisian_dataset.jsonl
```

Options

| Flag | Description | Default |
| --- | --- | --- |
| `--queries` | List of search terms | (Internal List) |
| `--min_score` | Threshold for Derja detection (0.0 - 1.0) | `0.12` |
| `--max_segments` | Max sentences to collect | `50,000` |
| `--out_file` | Output JSONL path | `data/derja_segments_raw.jsonl` |
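
The flags in the table map naturally onto an `argparse` parser. The sketch below is an assumption about how such a CLI could be declared, not the module's actual code; `build_parser` is a hypothetical name, and the default for `--max_results_per_query` is left unset because the table does not document it.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented flags with the defaults from the options table."""
    parser = argparse.ArgumentParser(prog="collect_web_derja_search")
    # None here stands for "fall back to the internal query list".
    parser.add_argument("--queries", nargs="+", default=None,
                        help="List of search terms")
    # Assumption: default not documented, so it is left as None.
    parser.add_argument("--max_results_per_query", type=int, default=None,
                        help="Search results to fetch per query")
    parser.add_argument("--min_score", type=float, default=0.12,
                        help="Threshold for Derja detection (0.0 - 1.0)")
    parser.add_argument("--max_segments", type=int, default=50_000,
                        help="Max sentences to collect")
    parser.add_argument("--out_file", default="data/derja_segments_raw.jsonl",
                        help="Output JSONL path")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```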

📊 Output Format

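Each record holds one scraped sentence together with its Derja score and source metadata. The example is pretty-printed for readability; in the JSONL file, each record occupies a single line: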

```
{
  "text": "chnowa raykom fi hadha?",
  "source": "web_search",
  "score": 0.88,
  "meta": {
    "url": "https://example.com/forum/topic-123",
    "domain": "example.com",
    "query": "tunisian forum"
  }
}
```
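
Records in this shape are straightforward to load back for inspection or further filtering. The following is a small sketch; `load_segments` is a hypothetical helper, not part of the tool, and it assumes one JSON object per line with the `score` field shown above.

```python
import json


def load_segments(lines, min_score=0.0):
    """Parse JSONL lines and keep records whose score reaches min_score."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if record.get("score", 0.0) >= min_score:
            records.append(record)
    return records
```

A file handle can be passed directly, for example `with open("data/derja_segments_raw.jsonl", encoding="utf-8") as f: segments = load_segments(f, min_score=0.5)`.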

🤝 Contributing

Contributions are welcome! If you have improved heuristics for Derja detection or new search keywords, please submit a Pull Request.

📄 License

MIT
