A tokenization evaluation suite with interactive visualization that allows quick comparison of tokenizers across languages.
- Comprehensive Metrics: Evaluate tokenizers using multiple metrics including entropy, bits per character, sequence length/ratio, token length, and divergence measures (Jensen-Shannon, Kullback-Leibler)
- Multi-Language Support: Built-in support for 200+ languages from the FLORES-200 dataset
- Interactive Visualization: Web-based interface for exploring and comparing tokenization metrics
- Flexible Configuration: YAML-based configuration for customizing evaluation runs
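To illustrate what two of these metrics measure, here is a minimal, self-contained sketch of token-distribution entropy and Jensen-Shannon divergence. This is not the suite's actual implementation, just an illustration of the math in plain Python:

```python
import math
from collections import Counter

def token_entropy(tokens):
    """Shannon entropy (in bits) of the token frequency distribution."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def freqs(tokens):
    """Relative frequency of each token in a tokenized text."""
    counts = Counter(tokens)
    n = len(tokens)
    return {tok: c / n for tok, c in counts.items()}

def jensen_shannon(p, q):
    """Jensen-Shannon divergence (in bits) between two token distributions.

    Symmetric and bounded in [0, 1] when logs are base 2; 0 means the
    two tokenizers produce identical token distributions."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability terms
        return sum(a[k] * math.log2(a[k] / b[k]) for k in keys if a.get(k, 0.0) > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

For example, two tokenizers that emit completely disjoint token sets score a Jensen-Shannon divergence of 1.0 bit, while identical tokenizations score 0.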
Download and tokenize FLORES-200 data with tokenizers from various LLMs:
```shell
cd data/tokenized/flores
snakemake
```

This will download the FLORES-200 dataset and tokenize it using multiple state-of-the-art tokenizers (Llama-4, Gemma-3, Command-A, Qwen, Mistral-3, DeepSeek-V3, and more).
Run the evaluation using the example configuration:
```shell
./go.py run --config-file example-config.yml
```

Warning: This analysis requires a significant amount of memory (several GB, depending on the number of tokenizers and languages).
The results will be saved to the directory specified in your config file (default: experiments/flores-example/).
Launch the interactive web interface to explore your results:
```shell
cd frontend
npm install    # install dependencies (first time only)
npm run build  # build for production
npm run start  # run the visualization server
```

The visualization interface will open at http://localhost:3000. From there:
- Click Import Data to load the directory containing your evaluation results
- Configure graphs by selecting tokenizers, metrics, and languages
- Click Generate Graph to create visualizations
- Export graphs as images or save your configuration for later
See frontend/README.md for detailed visualization instructions.
- Python 3.8+
- Node.js 16+ (for visualization)
- Snakemake (for data preparation)
- Install Python dependencies:

  ```shell
  pip install -r requirements.txt
  ```

- Install frontend dependencies (optional, only needed for visualization):

  ```shell
  cd frontend
  npm install
  ```

TokCollate uses YAML configuration files. See example-config.yml for a template that works with the FLORES-200 Snakemake pipeline.
Key configuration options:
- input_dir: Directory containing tokenized text files
- output_dir: Where to save evaluation results
- metrics: List of metrics to compute
- languages: Languages to evaluate (if omitted, all available languages are used)
- system_dataset_suffix: File extension for tokenized files (default: "txt")
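A hypothetical configuration might look like the sketch below. The directory paths follow the defaults mentioned above, but the metric names and language codes are illustrative assumptions; consult example-config.yml for the values the pipeline actually accepts:

```yaml
# Hypothetical config sketch -- see example-config.yml for the real template
input_dir: data/tokenized/flores
output_dir: experiments/flores-example
metrics:            # metric names here are assumptions, not verified keys
  - entropy
  - bits_per_character
  - jensen_shannon
languages:          # FLORES-200-style codes; omit to evaluate all languages
  - eng_Latn
  - fra_Latn
system_dataset_suffix: "txt"
```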
See the LICENSE file for details.