dleemiller committed
Commit 8cc94a0 · verified · 1 Parent(s): 39440dc

Update README.md

Files changed (1):
  1. README.md +235 -2

README.md CHANGED
@@ -1,7 +1,240 @@
  ---
  license: apache-2.0
+ datasets:
+ - laurievb/OpenLID-v2
  ---

- # WordLlamaDetect
-
- Tokenizer configurations for [WordLlamaDetect](https://github.com/dleemiller/WordLlamaDetect)
+ # WordLlama Detect
+
+ **WordLlama Detect** is a [WordLlama](https://github.com/dleemiller/WordLlama)-like library focused on the task of language identification.
+ It supports identification of **148 languages** with high accuracy and fast, NumPy-only CPU inference.
+ WordLlama Detect was trained on static token embeddings extracted from *Gemma3*-series LLMs.
+
+ <p align="center">
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/65ff92ea467d83751a727538/6xqLD9ciun2KIgCiC9w6T.png" alt="WordLlamaDetect" width="60%">
+ </p>
+
+ ## Overview
+
+ **Features:**
+ - NumPy-only inference with no PyTorch dependency
+ - Pre-trained model covering 148 languages, 103 of them at >95% accuracy
+ - Sparse lookup table (13 MB)
+ - Fast inference: >70k texts/s on a single thread
+ - Simple interface
+
+ ## Installation
+
+ ```bash
+ pip install wldetect
+ ```
+
+ Or install from source:
+ ```bash
+ git clone https://github.com/dleemiller/WordLlamaDetect.git
+ cd WordLlamaDetect
+ uv sync
+ ```
+
+ ## Quick Start
+
+ ### Python API
+
+ ```python
+ from wldetect import WLDetect
+
+ # Load the bundled model (no path needed)
+ wld = WLDetect.load()
+
+ # Detect the language of a single text
+ lang, confidence = wld.predict("Hello, how are you today?")
+ # ('eng_Latn', 0.9564036726951599)
+ ```
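+
+ The same `predict` call works in a loop over many texts. As a small illustrative sketch (the texts and the confidence cutoff below are made up, not library defaults), you can route a mixed-language batch by detected language:
+
+ ```python
+ from wldetect import WLDetect
+
+ wld = WLDetect.load()
+
+ # Hypothetical inputs; any iterable of strings works the same way.
+ texts = ["Hello world", "Bonjour le monde", "Hola mundo"]
+
+ # Group texts by detected language, keeping only confident predictions.
+ by_lang = {}
+ for text in texts:
+     lang, confidence = wld.predict(text)
+     if confidence >= 0.9:  # illustrative cutoff
+         by_lang.setdefault(lang, []).append(text)
+
+ print(by_lang)  # e.g. {'eng_Latn': [...], 'fra_Latn': [...], 'spa_Latn': [...]}
+ ```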
+
+ ### CLI Usage
+
+ ```bash
+ # Detect from text
+ uv run wldetect detect --text "Bonjour le monde"
+
+ # Detect from file
+ uv run wldetect detect --file input.txt
+ ```
+
+ ## Included Model
+
+ WLDetect ships with a pre-trained model based on concatenated Gemma3-27B + Gemma3-4B token embeddings:
+ - **Languages**: 148 (from the OpenLID-v2 dataset)
+ - **Accuracy**: 92.92% on the FLORES+ dev set
+ - **F1 (macro)**: 92.74%
+ - **Language codes**: ISO 639-3 + ISO 15924 script (e.g., `eng_Latn`, `cmn_Hans`, `arb_Arab`)
+
+ > [!TIP]
+ > See [docs/languages.md](docs/languages.md) for the complete list of supported languages with performance metrics.
+
+ > [!NOTE]
+ > Gemma3 is a good choice for this application because it was trained on over 140 languages.
+ > The tokenizer, vocabulary size (262k), and multilingual training are critical for performance.
+
+ ## Architecture
+
+ ### Simple Inference Pipeline (NumPy-only)
+
+ 1. **Tokenize**: Run the HuggingFace fast tokenizer (truncation at 512 tokens)
+ 2. **Lookup**: Index into a pre-computed exponential lookup table (vocab_size × n_languages)
+ 3. **Pool**: Log-sum pooling over the token sequence (equivalent to logsumexp over the raw logits)
+ 4. **Softmax**: Calculate language probabilities
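+
+ To make these steps concrete, here is a minimal NumPy sketch of the pipeline. The names (`exp_table`, `detect`) are illustrative rather than the library's internals, and the table is shown dense for clarity (the shipped artifact is sparse):
+
+ ```python
+ import numpy as np
+
+ # Illustrative shapes: a (vocab_size x n_languages) table of exp(logits).
+ vocab_size, n_languages = 262_144, 148
+ rng = np.random.default_rng(0)
+ exp_table = rng.random((vocab_size, n_languages)).astype(np.float32)
+
+ def detect(token_ids: np.ndarray) -> np.ndarray:
+     # 2. Lookup: one row of exp(logits) per token.
+     rows = exp_table[token_ids]            # (seq_len, n_languages)
+     # 3. Pool: log of the summed exp values == logsumexp of the raw logits.
+     pooled = np.log(rows.sum(axis=0))      # (n_languages,)
+     # 4. Softmax over languages.
+     z = np.exp(pooled - pooled.max())
+     return z / z.sum()
+
+ probs = detect(np.array([101, 2023, 318]))  # made-up token ids
+ print(probs.argmax(), probs.max())
+ ```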
+
+ The lookup table is pre-computed as `exp((embeddings * token_weights) @ projection.T + bias)`,
+ where the embeddings are frozen Gemma3 token embeddings and the projection is trained with focal loss on OpenLID-v2.
+ During training, token vectors are aggregated using *logsumexp* pooling along the sequence dimension.
+
+ > [!IMPORTANT]
+ > To optimize artifact size and compute, we perform `exp(logits)` before saving the lookup table.
+ > Then we apply a threshold to make the table *sparse*.
+ > This reduces the artifact size 10x (~130 MB -> 13 MB) with negligible performance degradation.
+
+ ### Sparse Lookup Table
+
+ The lookup table uses a sparse COO (coordinate) format with a configurable sparsification threshold:
+ - **Sparsity**: 97.15% (values below the threshold of 10 are set to zero)
+ - **Format**: COO (row, col, data); indices stored as int32, values as fp32
+ - **Performance impact**: Negligible (0.003% accuracy loss)
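+
+ As a rough sketch of how such a table can be sparsified and queried (the function names and the dense row reconstruction are illustrative, not the library's internals):
+
+ ```python
+ import numpy as np
+
+ def sparsify(exp_table: np.ndarray, threshold: float = 10.0):
+     """Keep entries >= threshold; COO indices as int32, values as fp32."""
+     rows, cols = np.nonzero(exp_table >= threshold)
+     data = exp_table[rows, cols].astype(np.float32)
+     return rows.astype(np.int32), cols.astype(np.int32), data
+
+ def gather_row(rows, cols, data, token_id, n_languages):
+     """Rebuild one dense table row for a token id; dropped entries stay zero."""
+     out = np.zeros(n_languages, dtype=np.float32)
+     mask = rows == token_id
+     out[cols[mask]] = data[mask]
+     return out
+ ```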
+
+ ## Performance
+
+ ### FLORES+ Benchmark Results
+
+ Evaluated on the FLORES+ dataset (148 languages, ~1k sentences per language):
+
+ | Split   | Accuracy | F1 (macro) | F1 (weighted) | Samples |
+ |---------|----------|------------|---------------|---------|
+ | dev     | 92.92%   | 92.74%     | 92.75%        | 150,547 |
+ | devtest | 92.86%   | 92.71%     | 92.69%        | 153,824 |
+
+ See [docs/languages.md](docs/languages.md) for detailed results.
+
+ ### Inference Speed
+
+ Benchmarked on a 12th-gen Intel i9 (single thread):
+
+ - **Single text**: 71,500 texts/second (0.014 ms/text)
+ - **Batch (1000)**: 82,500 texts/second (12.1 ms/batch)
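+
+ Throughput will vary with hardware and text length. A simple way to reproduce a single-text measurement with the public API (the sample text and iteration count here are arbitrary):
+
+ ```python
+ import time
+ from wldetect import WLDetect
+
+ wld = WLDetect.load()
+ text = "Hello, how are you today?"
+
+ n = 10_000
+ start = time.perf_counter()
+ for _ in range(n):
+     wld.predict(text)
+ elapsed = time.perf_counter() - start
+ print(f"{n / elapsed:,.0f} texts/s ({1000 * elapsed / n:.3f} ms/text)")
+ ```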
+
+ ## Supported Languages
+
+ The bundled model supports 148 languages from the OpenLID-v2 dataset. Languages use ISO 639-3 language codes with ISO 15924 script codes (e.g., `eng_Latn`, `cmn_Hans`, `arb_Arab`).
+
+ See [model_config.yaml](src/wldetect/models/model_config.yaml) for the complete list of supported languages.
+
+ ## Training
+
+ ### Installation for Training
+
+ ```bash
+ # CPU or default CUDA version
+ uv sync --extra training
+
+ # With CUDA 12.8 (Blackwell)
+ uv sync --extra cu128
+ ```
+
+ ### Training Pipeline
+
+ 1. **Configure model** in `configs/models/custom-model.yaml`:
+ ```yaml
+ model:
+   name: google/gemma-3-27b-pt
+   hidden_dim: 5376
+   shard_pattern: model-00001-of-00012.safetensors
+   embedding_layer_name: language_model.model.embed_tokens.weight
+
+ languages:
+   eng_Latn: 0
+   spa_Latn: 1
+   fra_Latn: 2
+   # ... add more languages
+
+ inference:
+   max_sequence_length: 512
+   pooling: logsumexp
+ ```
+
+ 2. **Configure training** in `configs/training/custom-training.yaml`:
+ ```yaml
+ model_config_path: "configs/models/custom-model.yaml"
+
+ dataset:
+   name: "laurievb/OpenLID-v2"
+   filter_languages: true
+
+ training:
+   batch_size: 1536
+   learning_rate: 0.002
+   epochs: 2
+ ```
+
+ 3. **Train**:
+ ```bash
+ uv run wldetect train --config configs/training/custom-training.yaml
+ ```
+
+ Artifacts saved to `artifacts/`:
+ - `lookup_table_exp.safetensors` - Sparse exp lookup table (for inference)
+ - `projection.safetensors` - Projection matrix (fp32, for fine-tuning)
+ - `model_config.yaml` - Model configuration
+ - `model.pt` - Full PyTorch checkpoint
+
+ ### Training Commands
+
+ ```bash
+ # Train model
+ uv run wldetect train --config configs/training/gemma3-27b.yaml
+
+ # Evaluate on FLORES+
+ uv run wldetect eval --model-path artifacts/ --split dev
+
+ # Generate sparse lookup table from checkpoint (default: threshold=10.0)
+ uv run wldetect create-lookup \
+   --checkpoint artifacts/checkpoints/checkpoint_step_100000.pt \
+   --config configs/training/gemma3-27b.yaml \
+   --output-dir artifacts/
+ ```
+
+ ### Training Details
+
+ - **Embedding extraction**: Downloads only the embedding tensor shards from HuggingFace (not full models)
+ - **Dataset**: OpenLID-v2 with configurable language filtering and balancing
+ - **Model**: Simple linear projection (hidden_dim → n_languages) with dropout
+ - **Pooling**: LogSumExp or max pooling over token sequences
+ - **Training time**: ~2-4 hours on GPU for 2 epochs (150 languages, 5000 samples/language)
+ - **Evaluation**: Automatic FLORES+ evaluation after training
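+
+ A condensed PyTorch sketch of a head matching those details (frozen embeddings, linear projection with dropout, logsumexp pooling). The class name and hyperparameters are illustrative, and the per-token weighting from the lookup-table formula above is omitted for brevity:
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ class LangIDHead(nn.Module):
+     def __init__(self, embeddings: torch.Tensor, n_languages: int, p_drop: float = 0.1):
+         super().__init__()
+         # Frozen token embeddings: (vocab_size, hidden_dim).
+         self.embed = nn.Embedding.from_pretrained(embeddings, freeze=True)
+         self.dropout = nn.Dropout(p_drop)
+         self.proj = nn.Linear(embeddings.shape[1], n_languages)
+
+     def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
+         x = self.dropout(self.embed(token_ids))   # (batch, seq, hidden_dim)
+         logits = self.proj(x)                     # (batch, seq, n_languages)
+         return torch.logsumexp(logits, dim=1)     # pool over the sequence
+ ```
+
+ After training, taking `exp` of the projected embeddings once per vocabulary entry yields the per-token lookup table used at inference, as described above.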
+
+ ## License
+
+ Apache 2.0 License
+
+ ## Citations
+
+ If you use WordLlama Detect in your research or project, please consider citing it as follows:
+
+ ```bibtex
+ @software{miller2025wordllamadetect,
+   author = {Miller, D. Lee},
+   title = {WordLlama Detect: The Language of the Token},
+   year = {2025},
+   url = {https://github.com/dleemiller/WordLlamaDetect},
+   version = {0.1.0}
+ }
+ ```
+
+ ## Acknowledgments
+
+ - OpenLID-v2 dataset: [laurievb/OpenLID-v2](https://huggingface.co/datasets/laurievb/OpenLID-v2)
+ - FLORES+ dataset: [openlanguagedata/flores_plus](https://huggingface.co/datasets/openlanguagedata/flores_plus)
+ - HuggingFace transformers and tokenizers libraries
+ - Google Gemma model team