|
|
--- |
|
|
license: apache-2.0 |
|
|
datasets: |
|
|
- laurievb/OpenLID-v2 |
|
|
--- |
|
|
|
|
|
|
|
|
# WordLlama Detect |
|
|
|
|
|
**WordLlama Detect** is a [WordLlama](https://github.com/dleemiller/WordLlama)-like library focused on the task of language identification. |
|
|
It supports identification of **148 languages** with high accuracy and fast, NumPy-only CPU inference.
|
|
WordLlama Detect is trained from static token embeddings extracted from *Gemma3*-series LLMs.
|
|
|
|
|
<p align="center"> |
|
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/65ff92ea467d83751a727538/6xqLD9ciun2KIgCiC9w6T.png" alt="WordLlamaDetect" width="60%"> |
|
|
</p> |
|
|
|
|
|
## Overview |
|
|
|
|
|
**Features:** |
|
|
- NumPy-only inference with no PyTorch dependency |
|
|
- Pre-trained model covering 148 languages, 103 of them at >95% accuracy
|
|
- Sparse lookup table (13MB) |
|
|
- Fast inference: >70k texts/s on a single thread
|
|
- Simple interface |
|
|
|
|
|
## Installation |
|
|
|
|
|
```bash |
|
|
pip install wldetect |
|
|
``` |
|
|
|
|
|
Or install from source: |
|
|
```bash |
|
|
git clone https://github.com/dleemiller/WordLlamaDetect.git |
|
|
cd WordLlamaDetect |
|
|
uv sync |
|
|
``` |
|
|
|
|
|
## Quick Start |
|
|
|
|
|
### Python API |
|
|
|
|
|
```python |
|
|
from wldetect import WLDetect |
|
|
|
|
|
# Load bundled model (no path needed) |
|
|
wld = WLDetect.load() |
|
|
|
|
|
# Detect language for single text |
|
|
lang, confidence = wld.predict("Hello, how are you today?") |
|
|
# ('eng_Latn', 0.9564036726951599) |
|
|
``` |
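
To classify many strings, you can loop over them and apply a confidence cutoff. The sketch below uses only the `predict` call shown above; the 0.5 threshold is an arbitrary example, not a library default.

```python
from wldetect import WLDetect

wld = WLDetect.load()

texts = [
    "Hello, how are you today?",
    "Bonjour le monde",
    "¿Dónde está la biblioteca?",
]

# Keep only predictions above an arbitrary confidence cutoff.
for text in texts:
    lang, confidence = wld.predict(text)
    label = lang if confidence >= 0.5 else "unknown"
    print(f"{label:10s} {confidence:.3f} {text}")
```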
|
|
|
|
|
### CLI Usage |
|
|
|
|
|
```bash |
|
|
# Detect from text |
|
|
uv run wldetect detect --text "Bonjour le monde" |
|
|
|
|
|
# Detect from file |
|
|
uv run wldetect detect --file input.txt |
|
|
``` |
|
|
|
|
|
## Included Model |
|
|
|
|
|
WLDetect ships with a pre-trained model based on concatenated Gemma3-27B + Gemma3-4B token embeddings: |
|
|
- **Languages**: 148 (from OpenLID-v2 dataset) |
|
|
- **Accuracy**: 92.92% on the FLORES+ dev set
|
|
- **F1 (macro)**: 92.74% |
|
|
- **Language codes**: ISO 639-3 + ISO 15924 script (e.g., `eng_Latn`, `cmn_Hans`, `arb_Arab`) |
|
|
|
|
|
|
|
|
> [!TIP] |
|
|
> See [docs/languages.md](docs/languages.md) for the complete list of supported languages with performance metrics. |
|
|
|
|
|
> [!NOTE] |
|
|
> Gemma3 is a good choice for this application because it was trained on over 140 languages.


> Its tokenizer, vocabulary size (262k), and multilingual training are critical to performance.
|
|
|
|
|
## Architecture |
|
|
|
|
|
### Simple Inference Pipeline (NumPy-only) |
|
|
|
|
|
1. **Tokenize**: Use the HuggingFace fast tokenizer (truncation at 512 tokens)
|
|
2. **Lookup**: Index into pre-computed exponential lookup table (vocab_size × n_languages) |
|
|
3. **Pool**: Log-sum pooling over the token sequence (equivalent to logsumexp of the logits, since the table stores exponentiated values)
|
|
4. **Softmax**: Calculate language probabilities |
|
|
|
|
|
The lookup table is precomputed from the trained model as `exp((embeddings * token_weights) @ projection.T + bias)`,


where `embeddings` are frozen Gemma3 token embeddings and the projection is trained with focal loss on OpenLID-v2.


During training, per-token vectors are aggregated with *logsumexp* pooling along the sequence dimension.
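
As an illustration, the inference math boils down to the NumPy snippet below. It is a sketch only: `exp_table` stands in for a dense (vocab_size × n_languages) array of exponentiated logits, while the actual library loads the sparse safetensors artifact.

```python
import numpy as np

def detect(token_ids: np.ndarray, exp_table: np.ndarray, labels: list[str]):
    """Toy pipeline: table lookup -> log-sum pooling -> softmax."""
    rows = exp_table[token_ids]                    # (seq_len, n_languages) lookup
    pooled = np.log(rows.sum(axis=0) + 1e-12)      # log of summed exp == logsumexp of the logits
    pooled -= pooled.max()                         # numerical stability
    probs = np.exp(pooled) / np.exp(pooled).sum()  # softmax over languages
    best = int(probs.argmax())
    return labels[best], float(probs[best])
```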
|
|
|
|
|
|
|
|
> [!IMPORTANT] |
|
|
> To optimize artifact size and compute, we perform `exp(logits)` before saving the lookup table. |
|
|
> Then we apply a threshold to make the table *sparse*. |
|
|
> This reduces the artifact size by ~10x (~130 MB → 13 MB), with negligible performance degradation.
|
|
|
|
|
### Sparse Lookup Table |
|
|
|
|
|
The lookup table uses sparse COO (Coordinate) format with configurable sparsification threshold: |
|
|
- **Sparsity**: 97.15% (values below the threshold of 10 are set to zero)
|
|
- **Format**: COO (row, col, data) indices stored as int32, values as fp32 |
|
|
- **Performance impact**: Negligible (0.003% accuracy loss) |
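
For illustration, converting a dense exp table to COO arrays and back could look like the sketch below (not the library's actual serialization code; the threshold of 10 matches the default mentioned above).

```python
import numpy as np

def to_coo(exp_table: np.ndarray, threshold: float = 10.0):
    """Keep entries >= threshold; indices as int32, values as fp32."""
    rows, cols = np.nonzero(exp_table >= threshold)
    data = exp_table[rows, cols].astype(np.float32)
    return rows.astype(np.int32), cols.astype(np.int32), data

def to_dense(rows, cols, data, shape):
    """Rebuild a dense table; dropped entries become zeros and add nothing to the pooled sum."""
    dense = np.zeros(shape, dtype=np.float32)
    dense[rows, cols] = data
    return dense
```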
|
|
|
|
|
|
|
|
## Performance |
|
|
|
|
|
### FLORES+ Benchmark Results |
|
|
|
|
|
Evaluated on FLORES+ dataset (148 languages, ~1k sentences per language): |
|
|
|
|
|
| Split | Accuracy | F1 (macro) | F1 (weighted) | Samples | |
|
|
|---------|----------|------------|---------------|----------| |
|
|
| dev | 92.92% | 92.74% | 92.75% | 150,547 | |
|
|
| devtest | 92.86% | 92.71% | 92.69% | 153,824 | |
|
|
|
|
|
See [docs/languages.md](docs/languages.md) for detailed results. |
|
|
|
|
|
### Inference Speed |
|
|
|
|
|
Benchmarked on a 12th-gen Intel i9 (single thread):
|
|
|
|
|
- **Single text**: 71,500 texts/second (0.014 ms/text) |
|
|
- **Batch (1000)**: 82,500 texts/second (12.1 ms/batch) |
|
|
|
|
|
## Supported Languages |
|
|
|
|
|
The bundled model supports 148 languages from the OpenLID-v2 dataset. Languages use ISO 639-3 language codes with ISO 15924 script codes (e.g., `eng_Latn`, `cmn_Hans`, `arb_Arab`). |
|
|
|
|
|
See [model_config.yaml](src/wldetect/models/model_config.yaml) for the complete list of supported languages. |
|
|
|
|
|
## Training |
|
|
|
|
|
### Installation for Training |
|
|
|
|
|
```bash |
|
|
# CPU or default CUDA version |
|
|
uv sync --extra training |
|
|
|
|
|
# With CUDA 12.8 (Blackwell) |
|
|
uv sync --extra cu128 |
|
|
``` |
|
|
|
|
|
### Training Pipeline |
|
|
|
|
|
1. **Configure model** in `configs/models/custom-model.yaml`:
|
|
```yaml |
|
|
model: |
|
|
name: google/gemma-3-27b-pt |
|
|
hidden_dim: 5376 |
|
|
shard_pattern: model-00001-of-00012.safetensors |
|
|
embedding_layer_name: language_model.model.embed_tokens.weight |
|
|
|
|
|
languages: |
|
|
eng_Latn: 0 |
|
|
spa_Latn: 1 |
|
|
fra_Latn: 2 |
|
|
# ... add more languages |
|
|
|
|
|
inference: |
|
|
max_sequence_length: 512 |
|
|
pooling: logsumexp |
|
|
``` |
|
|
|
|
|
2. **Configure training** in `configs/training/custom-training.yaml`: |
|
|
```yaml |
|
|
model_config_path: "configs/models/custom-model.yaml" |
|
|
|
|
|
dataset: |
|
|
name: "laurievb/OpenLID-v2" |
|
|
filter_languages: true |
|
|
|
|
|
training: |
|
|
batch_size: 1536 |
|
|
learning_rate: 0.002 |
|
|
epochs: 2 |
|
|
``` |
|
|
|
|
|
3. **Train**: |
|
|
```bash |
|
|
uv run wldetect train --config configs/training/custom-training.yaml |
|
|
``` |
|
|
|
|
|
Artifacts saved to `artifacts/`: |
|
|
- `lookup_table_exp.safetensors` - Sparse exp lookup table (for inference) |
|
|
- `projection.safetensors` - Projection matrix (fp32, for fine-tuning) |
|
|
- `model_config.yaml` - Model configuration |
|
|
- `model.pt` - Full PyTorch checkpoint |
|
|
|
|
|
### Training Commands |
|
|
|
|
|
```bash |
|
|
# Train model |
|
|
uv run wldetect train --config configs/training/gemma3-27b.yaml |
|
|
|
|
|
# Evaluate on FLORES+ |
|
|
uv run wldetect eval --model-path artifacts/ --split dev |
|
|
|
|
|
# Generate sparse lookup table from checkpoint (default: threshold=10.0) |
|
|
uv run wldetect create-lookup \ |
|
|
--checkpoint artifacts/checkpoints/checkpoint_step_100000.pt \ |
|
|
--config configs/training/gemma3-27b.yaml \ |
|
|
--output-dir artifacts/ |
|
|
``` |
|
|
|
|
|
### Training Details |
|
|
|
|
|
- **Embedding extraction**: Downloads only embedding tensor shards from HuggingFace (not full models) |
|
|
- **Dataset**: OpenLID-v2 with configurable language filtering and balancing |
|
|
- **Model**: Simple linear projection (hidden_dim → n_languages) with dropout (see the sketch after this list)
|
|
- **Pooling**: LogSumExp or max pooling over token sequences |
|
|
- **Training time**: ~2-4 hours on GPU for 2 epochs (150 languages, 5000 samples/language) |
|
|
- **Evaluation**: Automatic FLORES+ evaluation after training |
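
For reference, the model and pooling described above can be sketched in PyTorch as follows. The class and attribute names are ours, not the library's, and the focal loss and data pipeline are omitted.

```python
import torch
import torch.nn as nn

class LangIDHead(nn.Module):
    """Sketch of the described architecture: frozen embeddings, learned per-token
    weights, linear projection to language logits, logsumexp pooling."""

    def __init__(self, embeddings: torch.Tensor, n_languages: int, dropout: float = 0.1):
        super().__init__()
        vocab_size, hidden_dim = embeddings.shape
        self.embed = nn.Embedding.from_pretrained(embeddings, freeze=True)  # frozen Gemma3 embeddings
        self.token_weights = nn.Parameter(torch.ones(vocab_size, 1))        # learned per-token scaling
        self.dropout = nn.Dropout(dropout)
        self.proj = nn.Linear(hidden_dim, n_languages)                      # hidden_dim -> n_languages

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(token_ids) * self.token_weights[token_ids]  # (batch, seq, hidden)
        logits = self.proj(self.dropout(x))                        # (batch, seq, n_languages)
        return torch.logsumexp(logits, dim=1)                      # pool over the sequence
```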
|
|
|
|
|
## License |
|
|
|
|
|
Apache 2.0 License |
|
|
|
|
|
## Citations |
|
|
|
|
|
If you use WordLlama Detect in your research or project, please consider citing it as follows: |
|
|
|
|
|
```bibtex |
|
|
@software{miller2025wordllamadetect, |
|
|
author = {Miller, D. Lee}, |
|
|
title = {WordLlama Detect: The Language of the Token}, |
|
|
year = {2025}, |
|
|
url = {https://github.com/dleemiller/WordLlamaDetect}, |
|
|
version = {0.1.0} |
|
|
} |
|
|
``` |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
- OpenLID-v2 dataset: [laurievb/OpenLID-v2](https://huggingface.co/datasets/laurievb/OpenLID-v2) |
|
|
- FLORES+ dataset: [openlanguagedata/flores_plus](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
|
|
- HuggingFace transformers and tokenizers libraries |
|
|
- Google Gemma model team |