---
license: apache-2.0
datasets:
- laurievb/OpenLID-v2
---


# WordLlama Detect

**WordLlama Detect** is a [WordLlama](https://github.com/dleemiller/WordLlama)-like library focused on the task of language identification.
It identifies **148 languages** with high accuracy and offers fast, NumPy-only CPU inference.
WordLlama Detect was trained on static token embeddings extracted from *Gemma3*-series LLMs.

<p align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/65ff92ea467d83751a727538/6xqLD9ciun2KIgCiC9w6T.png" alt="WordLlamaDetect" width="60%">
</p>

## Overview

**Features:**
- NumPy-only inference with no PyTorch dependency
- Pre-trained model (148 languages; 103 at >95% accuracy)
- Sparse lookup table (13MB)
- Fast inference: >70k texts/s single thread
- Simple interface

## Installation

```bash
pip install wldetect
```

Or install from source:
```bash
git clone https://github.com/dleemiller/WordLlamaDetect.git
cd WordLlamaDetect
uv sync
```

## Quick Start

### Python API

```python
from wldetect import WLDetect

# Load bundled model (no path needed)
wld = WLDetect.load()

# Detect language for single text
lang, confidence = wld.predict("Hello, how are you today?")
# ('eng_Latn', 0.9564036726951599)
```
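
`predict` takes one string at a time, so a plain loop covers small batches (continuing from the snippet above; the sample texts are illustrative):

```python
texts = ["Hola, ¿cómo estás?", "Guten Morgen!", "こんにちは、お元気ですか？"]
for text in texts:
    lang, confidence = wld.predict(text)
    print(f"{lang}\t{confidence:.3f}\t{text}")
```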

### CLI Usage

```bash
# Detect from text
uv run wldetect detect --text "Bonjour le monde"

# Detect from file
uv run wldetect detect --file input.txt
```

## Included Model

WLDetect ships with a pre-trained model based on concatenated Gemma3-27B + Gemma3-4B token embeddings:
- **Languages**: 148 (from OpenLID-v2 dataset)
- **Accuracy**: 92.92% on FLORES+ dev set
- **F1 (macro)**: 92.74%
- **Language codes**: ISO 639-3 + ISO 15924 script (e.g., `eng_Latn`, `cmn_Hans`, `arb_Arab`)


> [!TIP]
> See [docs/languages.md](docs/languages.md) for the complete list of supported languages with performance metrics.

> [!NOTE]
> Gemma3 is a good choice for this application because it was trained on over 140 languages.
> The tokenizer, the 262k vocabulary, and the multilingual training are critical for performance.

## Architecture

### Simple Inference Pipeline (NumPy-only)

1. **Tokenize**: Use HuggingFace fast tokenizer (512-length truncation)
2. **Lookup**: Index into pre-computed exponential lookup table (vocab_size × n_languages)
3. **Pool**: LogSum pooling over token sequence
4. **Softmax**: Calculate language probabilities

The lookup table is pre-computed as `exp((embeddings * token_weights) @ projection.T + bias)`,
where `embeddings` are frozen Gemma3 token embeddings, and the token weights, projection, and bias are trained with focal loss on OpenLID-v2.
During training, per-token logits are aggregated with *logsumexp* pooling along the sequence dimension.


> [!IMPORTANT]
> To optimize artifact size and compute, we apply `exp(logits)` before saving the lookup table,
> then threshold the result to make the table *sparse*.
> This reduces the artifact size 10x (~130 MB → ~13 MB) with negligible performance degradation.
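
Putting the four steps together, here is a minimal NumPy sketch of the scoring math. The densified `lookup` array and pre-tokenized `token_ids` are assumed inputs; the real library loads the sparse table and tokenizer for you:

```python
import numpy as np

def score(lookup: np.ndarray, token_ids: np.ndarray) -> np.ndarray:
    """lookup: (vocab_size, n_languages) table holding exp(logits)."""
    rows = lookup[token_ids]           # (seq_len, n_languages)
    # Summing exp values and then taking the log is logsumexp pooling
    # over the original logits.
    logits = np.log(rows.sum(axis=0) + 1e-12)
    z = np.exp(logits - logits.max())  # numerically stable softmax
    return z / z.sum()
```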

### Sparse Lookup Table

The lookup table uses sparse COO (Coordinate) format with configurable sparsification threshold:
- **Sparsity**: 97.15% (values below the threshold of 10 are set to zero)
- **Format**: COO (row, col, data) indices stored as int32, values as fp32
- **Performance impact**: Negligible (0.003% accuracy loss)
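
A sketch of the round trip, with hypothetical function names (thresholding to COO triplets at save time, densifying at load time):

```python
import numpy as np

def to_coo(dense: np.ndarray, threshold: float = 10.0):
    # Keep only entries at or above the threshold; record their coordinates.
    rows, cols = np.nonzero(dense >= threshold)
    return rows.astype(np.int32), cols.astype(np.int32), dense[rows, cols].astype(np.float32)

def to_dense(rows, cols, data, shape):
    dense = np.zeros(shape, dtype=np.float32)
    dense[rows, cols] = data
    return dense
```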


## Performance

### FLORES+ Benchmark Results

Evaluated on the FLORES+ dataset (148 languages, ~1k sentences per language):

| Split   | Accuracy | F1 (macro) | F1 (weighted) | Samples  |
|---------|----------|------------|---------------|----------|
| dev     | 92.92%   | 92.74%     | 92.75%        | 150,547  |
| devtest | 92.86%   | 92.71%     | 92.69%        | 153,824  |

See [docs/languages.md](docs/languages.md) for detailed results.

### Inference Speed

Benchmarked on a 12th-gen Intel i9 (single thread):

- **Single text**: 71,500 texts/second (0.014 ms/text)
- **Batch (1000)**: 82,500 texts/second (12.1 ms/batch)
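
These numbers are easy to sanity-check with the public API (throughput will vary by machine):

```python
import time
from wldetect import WLDetect

wld = WLDetect.load()
n = 10_000
start = time.perf_counter()
for _ in range(n):
    wld.predict("Hello, how are you today?")
print(f"{n / (time.perf_counter() - start):,.0f} texts/second")
```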

## Supported Languages

The bundled model supports 148 languages from the OpenLID-v2 dataset. Languages use ISO 639-3 language codes with ISO 15924 script codes (e.g., `eng_Latn`, `cmn_Hans`, `arb_Arab`).

See [model_config.yaml](src/wldetect/models/model_config.yaml) for the complete list of supported languages.

## Training

### Installation for Training

```bash
# CPU or default CUDA version
uv sync --extra training

# With CUDA 12.8 (Blackwell)
uv sync --extra cu128
```

### Training Pipeline

1. **Configure model** in `configs/models/custom-config.yaml`:
```yaml
model:
  name: google/gemma-3-27b-pt
  hidden_dim: 5376
  shard_pattern: model-00001-of-00012.safetensors
  embedding_layer_name: language_model.model.embed_tokens.weight

languages:
  eng_Latn: 0
  spa_Latn: 1
  fra_Latn: 2
  # ... add more languages

inference:
  max_sequence_length: 512
  pooling: logsumexp
```
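
The `shard_pattern` and `embedding_layer_name` fields above are enough to pull the embedding tensor from a single safetensors shard without downloading the full model. A hedged sketch of the idea (the actual loader in wldetect may differ, the gated Gemma repos require accepting the license on Hugging Face, and for some models the embeddings may span multiple shards):

```python
from huggingface_hub import hf_hub_download
from safetensors import safe_open

# Values mirror the model config above.
path = hf_hub_download("google/gemma-3-27b-pt",
                       "model-00001-of-00012.safetensors")
with safe_open(path, framework="pt") as f:
    embeddings = f.get_tensor("language_model.model.embed_tokens.weight")
print(embeddings.shape)  # (vocab_size, hidden_dim)
```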

2. **Configure training** in `configs/training/custom-training.yaml`:
```yaml
model_config_path: "configs/models/custom-config.yaml"

dataset:
  name: "laurievb/OpenLID-v2"
  filter_languages: true

training:
  batch_size: 1536
  learning_rate: 0.002
  epochs: 2
```

3. **Train**:
```bash
uv run wldetect train --config configs/training/custom-training.yaml
```

Artifacts saved to `artifacts/`:
- `lookup_table_exp.safetensors` - Sparse exp lookup table (for inference)
- `projection.safetensors` - Projection matrix (fp32, for fine-tuning)
- `model_config.yaml` - Model configuration
- `model.pt` - Full PyTorch checkpoint
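
The safetensors artifacts can be inspected directly; this sketch makes no assumptions about the tensor names inside:

```python
from safetensors.numpy import load_file

tensors = load_file("artifacts/lookup_table_exp.safetensors")
for name, arr in tensors.items():
    print(name, arr.shape, arr.dtype)
```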

### Training Commands

```bash
# Train model
uv run wldetect train --config configs/training/gemma3-27b.yaml

# Evaluate on FLORES+
uv run wldetect eval --model-path artifacts/ --split dev

# Generate sparse lookup table from checkpoint (default: threshold=10.0)
uv run wldetect create-lookup \
  --checkpoint artifacts/checkpoints/checkpoint_step_100000.pt \
  --config configs/training/gemma3-27b.yaml \
  --output-dir artifacts/
```

### Training Details

- **Embedding extraction**: Downloads only embedding tensor shards from HuggingFace (not full models)
- **Dataset**: OpenLID-v2 with configurable language filtering and balancing
- **Model**: Simple linear projection (hidden_dim → n_languages) with dropout
- **Pooling**: LogSumExp or max pooling over token sequences
- **Training time**: ~2-4 hours on GPU for 2 epochs (150 languages, 5000 samples/language)
- **Evaluation**: Automatic FLORES+ evaluation after training
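
A minimal PyTorch sketch of the model shape described above (class and argument names are illustrative, not the library's actual API; the per-token weights from the lookup-table formula are omitted for brevity):

```python
import torch
import torch.nn as nn

class LangIDHead(nn.Module):
    def __init__(self, embeddings: torch.Tensor, n_languages: int, p_drop: float = 0.1):
        super().__init__()
        # Frozen Gemma3 token embeddings: (vocab_size, hidden_dim).
        self.embed = nn.Embedding.from_pretrained(embeddings, freeze=True)
        self.drop = nn.Dropout(p_drop)
        self.proj = nn.Linear(embeddings.shape[1], n_languages)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.drop(self.embed(token_ids))    # (batch, seq, hidden_dim)
        logits = self.proj(x)                   # (batch, seq, n_languages)
        return torch.logsumexp(logits, dim=1)   # pool over the sequence
```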

## License

Apache 2.0 License

## Citations

If you use WordLlama Detect in your research or project, please consider citing it as follows:

```bibtex
@software{miller2025wordllamadetect,
  author = {Miller, D. Lee},
  title = {WordLlama Detect: The Language of the Token},
  year = {2025},
  url = {https://github.com/dleemiller/WordLlamaDetect},
  version = {0.1.0}
}
```

## Acknowledgments

- OpenLID-v2 dataset: [laurievb/OpenLID-v2](https://huggingface.co/datasets/laurievb/OpenLID-v2)
- FLORES+ dataset: [openlanguagedata/flores_plus](https://huggingface.co/datasets/openlanguagedata/flores_plus)
- HuggingFace transformers and tokenizers libraries
- Google Gemma model team