dleemiller committed
Commit 8cc94a0 · verified · 1 Parent(s): 39440dc

Update README.md

Files changed (1):
  1. README.md +235 -2

README.md CHANGED
@@ -1,7 +1,240 @@
  ---
  license: apache-2.0
+ datasets:
+ - laurievb/OpenLID-v2
  ---

- # WordLlamaDetect
-
- Tokenizer configurations for [WordLlamaDetect](https://github.com/dleemiller/WordLlamaDetect)
+ # WordLlama Detect
+
+ **WordLlama Detect** is a [WordLlama](https://github.com/dleemiller/WordLlama)-like library focused on the task of language identification.
+ It supports identification of **148 languages** with high accuracy and fast, NumPy-only CPU inference.
+ WordLlama Detect was trained on static token embeddings extracted from *Gemma3*-series LLMs.
+
+ <p align="center">
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/65ff92ea467d83751a727538/6xqLD9ciun2KIgCiC9w6T.png" alt="WordLlamaDetect" width="60%">
+ </p>
+
+ ## Overview
+
+ **Features:**
+ - NumPy-only inference with no PyTorch dependency
+ - Pre-trained model covering 148 languages, 103 of them at >95% accuracy
+ - Sparse lookup table (13 MB)
+ - Fast inference: >70k texts/s on a single thread
+ - Simple interface
+
+ ## Installation
+
+ ```bash
+ pip install wldetect
+ ```
+
+ Or install from source:
+ ```bash
+ git clone https://github.com/dleemiller/WordLlamaDetect.git
+ cd WordLlamaDetect
+ uv sync
+ ```
+
+ ## Quick Start
+
+ ### Python API
+
+ ```python
+ from wldetect import WLDetect
+
+ # Load the bundled model (no path needed)
+ wld = WLDetect.load()
+
+ # Detect the language of a single text
+ lang, confidence = wld.predict("Hello, how are you today?")
+ # ('eng_Latn', 0.9564036726951599)
+ ```
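+
+ The same `predict` call works in a loop over many texts. As a small illustrative sketch (the texts and the confidence cutoff below are made up, not library defaults), you can route a mixed-language batch by detected language:
+
+ ```python
+ from wldetect import WLDetect
+
+ wld = WLDetect.load()
+
+ # Hypothetical inputs; any iterable of strings works the same way.
+ texts = ["Hello world", "Bonjour le monde", "Hola mundo"]
+
+ # Group texts by detected language, keeping only confident predictions.
+ by_lang = {}
+ for text in texts:
+     lang, confidence = wld.predict(text)
+     if confidence >= 0.9:  # illustrative cutoff
+         by_lang.setdefault(lang, []).append(text)
+
+ print(by_lang)  # e.g. {'eng_Latn': [...], 'fra_Latn': [...], 'spa_Latn': [...]}
+ ```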
+
+ ### CLI Usage
+
+ ```bash
+ # Detect from text
+ uv run wldetect detect --text "Bonjour le monde"
+
+ # Detect from file
+ uv run wldetect detect --file input.txt
+ ```
+
+ ## Included Model
+
+ WLDetect ships with a pre-trained model based on concatenated Gemma3-27B + Gemma3-4B token embeddings:
+ - **Languages**: 148 (from the OpenLID-v2 dataset)
+ - **Accuracy**: 92.92% on the FLORES+ dev set
+ - **F1 (macro)**: 92.74%
+ - **Language codes**: ISO 639-3 + ISO 15924 script (e.g., `eng_Latn`, `cmn_Hans`, `arb_Arab`)
+
+ > [!TIP]
+ > See [docs/languages.md](docs/languages.md) for the complete list of supported languages with performance metrics.
+
+ > [!NOTE]
+ > Gemma3 is a good choice for this application because it was trained on over 140 languages.
+ > The tokenizer, vocabulary size (262k), and multilingual training are critical for performance.
+
+ ## Architecture
+
+ ### Simple Inference Pipeline (NumPy-only)
+
+ 1. **Tokenize**: Run the HuggingFace fast tokenizer (truncation at 512 tokens)
+ 2. **Lookup**: Index into a pre-computed exponential lookup table (vocab_size × n_languages)
+ 3. **Pool**: Log-sum pooling over the token sequence (equivalent to logsumexp over the raw logits)
+ 4. **Softmax**: Calculate language probabilities
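+
+ To make these steps concrete, here is a minimal NumPy sketch of the pipeline. The names (`exp_table`, `detect`) are illustrative rather than the library's internals, and the table is shown dense for clarity (the shipped artifact is sparse):
+
+ ```python
+ import numpy as np
+
+ # Illustrative shapes: a (vocab_size x n_languages) table of exp(logits).
+ vocab_size, n_languages = 262_144, 148
+ rng = np.random.default_rng(0)
+ exp_table = rng.random((vocab_size, n_languages)).astype(np.float32)
+
+ def detect(token_ids: np.ndarray) -> np.ndarray:
+     # 2. Lookup: one row of exp(logits) per token.
+     rows = exp_table[token_ids]            # (seq_len, n_languages)
+     # 3. Pool: log of the summed exp values == logsumexp of the raw logits.
+     pooled = np.log(rows.sum(axis=0))      # (n_languages,)
+     # 4. Softmax over languages.
+     z = np.exp(pooled - pooled.max())
+     return z / z.sum()
+
+ probs = detect(np.array([101, 2023, 318]))  # made-up token ids
+ print(probs.argmax(), probs.max())
+ ```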
+
+ The lookup table is pre-computed as `exp((embeddings * token_weights) @ projection.T + bias)`,
+ where the embeddings are frozen Gemma3 token embeddings and the projection is trained with focal loss on OpenLID-v2.
+ During training, token vectors are aggregated using *logsumexp* pooling along the sequence dimension.
+
+ > [!IMPORTANT]
+ > To optimize artifact size and compute, we perform `exp(logits)` before saving the lookup table.
+ > Then we apply a threshold to make the table *sparse*.
+ > This reduces the artifact size 10x (~130 MB -> 13 MB) with negligible performance degradation.
+
+ ### Sparse Lookup Table
+
+ The lookup table uses a sparse COO (coordinate) format with a configurable sparsification threshold:
+ - **Sparsity**: 97.15% (values below the threshold of 10 are set to zero)
+ - **Format**: COO (row, col, data); indices stored as int32, values as fp32
+ - **Performance impact**: Negligible (0.003% accuracy loss)
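+
+ As a rough sketch of how such a table can be sparsified and queried (the function names and the dense row reconstruction are illustrative, not the library's internals):
+
+ ```python
+ import numpy as np
+
+ def sparsify(exp_table: np.ndarray, threshold: float = 10.0):
+     """Keep entries >= threshold; COO indices as int32, values as fp32."""
+     rows, cols = np.nonzero(exp_table >= threshold)
+     data = exp_table[rows, cols].astype(np.float32)
+     return rows.astype(np.int32), cols.astype(np.int32), data
+
+ def gather_row(rows, cols, data, token_id, n_languages):
+     """Rebuild one dense table row for a token id; dropped entries stay zero."""
+     out = np.zeros(n_languages, dtype=np.float32)
+     mask = rows == token_id
+     out[cols[mask]] = data[mask]
+     return out
+ ```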
+
+ ## Performance
+
+ ### FLORES+ Benchmark Results
+
+ Evaluated on the FLORES+ dataset (148 languages, ~1k sentences per language):
+
+ | Split   | Accuracy | F1 (macro) | F1 (weighted) | Samples |
+ |---------|----------|------------|---------------|---------|
+ | dev     | 92.92%   | 92.74%     | 92.75%        | 150,547 |
+ | devtest | 92.86%   | 92.71%     | 92.69%        | 153,824 |
+
+ See [docs/languages.md](docs/languages.md) for detailed results.
+
+ ### Inference Speed
+
+ Benchmarked on a 12th-gen Intel i9 (single thread):
+
+ - **Single text**: 71,500 texts/second (0.014 ms/text)
+ - **Batch (1000)**: 82,500 texts/second (12.1 ms/batch)
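+
+ Throughput will vary with hardware and text length. A simple way to reproduce a single-text measurement with the public API (the sample text and iteration count here are arbitrary):
+
+ ```python
+ import time
+ from wldetect import WLDetect
+
+ wld = WLDetect.load()
+ text = "Hello, how are you today?"
+
+ n = 10_000
+ start = time.perf_counter()
+ for _ in range(n):
+     wld.predict(text)
+ elapsed = time.perf_counter() - start
+ print(f"{n / elapsed:,.0f} texts/s ({1000 * elapsed / n:.3f} ms/text)")
+ ```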
+
+ ## Supported Languages
+
+ The bundled model supports 148 languages from the OpenLID-v2 dataset. Languages use ISO 639-3 language codes with ISO 15924 script codes (e.g., `eng_Latn`, `cmn_Hans`, `arb_Arab`).
+
+ See [model_config.yaml](src/wldetect/models/model_config.yaml) for the complete list of supported languages.
+
+ ## Training
+
+ ### Installation for Training
+
+ ```bash
+ # CPU or default CUDA version
+ uv sync --extra training
+
+ # With CUDA 12.8 (Blackwell)
+ uv sync --extra cu128
+ ```
+
+ ### Training Pipeline
+
+ 1. **Configure model** in `configs/models/custom-model.yaml`:
+ ```yaml
+ model:
+   name: google/gemma-3-27b-pt
+   hidden_dim: 5376
+   shard_pattern: model-00001-of-00012.safetensors
+   embedding_layer_name: language_model.model.embed_tokens.weight
+
+ languages:
+   eng_Latn: 0
+   spa_Latn: 1
+   fra_Latn: 2
+   # ... add more languages
+
+ inference:
+   max_sequence_length: 512
+   pooling: logsumexp
+ ```
+
+ 2. **Configure training** in `configs/training/custom-training.yaml`:
+ ```yaml
+ model_config_path: "configs/models/custom-model.yaml"
+
+ dataset:
+   name: "laurievb/OpenLID-v2"
+   filter_languages: true
+
+ training:
+   batch_size: 1536
+   learning_rate: 0.002
+   epochs: 2
+ ```
+
+ 3. **Train**:
+ ```bash
+ uv run wldetect train --config configs/training/custom-training.yaml
+ ```
+
+ Artifacts saved to `artifacts/`:
+ - `lookup_table_exp.safetensors` - Sparse exp lookup table (for inference)
+ - `projection.safetensors` - Projection matrix (fp32, for fine-tuning)
+ - `model_config.yaml` - Model configuration
+ - `model.pt` - Full PyTorch checkpoint
+
+ ### Training Commands
+
+ ```bash
+ # Train model
+ uv run wldetect train --config configs/training/gemma3-27b.yaml
+
+ # Evaluate on FLORES+
+ uv run wldetect eval --model-path artifacts/ --split dev
+
+ # Generate sparse lookup table from checkpoint (default: threshold=10.0)
+ uv run wldetect create-lookup \
+   --checkpoint artifacts/checkpoints/checkpoint_step_100000.pt \
+   --config configs/training/gemma3-27b.yaml \
+   --output-dir artifacts/
+ ```
+
+ ### Training Details
+
+ - **Embedding extraction**: Downloads only the embedding tensor shards from HuggingFace (not full models)
+ - **Dataset**: OpenLID-v2 with configurable language filtering and balancing
+ - **Model**: Simple linear projection (hidden_dim → n_languages) with dropout
+ - **Pooling**: LogSumExp or max pooling over token sequences
+ - **Training time**: ~2-4 hours on GPU for 2 epochs (150 languages, 5000 samples/language)
+ - **Evaluation**: Automatic FLORES+ evaluation after training
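+
+ A condensed PyTorch sketch of a head matching those details (frozen embeddings, linear projection with dropout, logsumexp pooling). The class name and hyperparameters are illustrative, and the per-token weighting from the lookup-table formula above is omitted for brevity:
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ class LangIDHead(nn.Module):
+     def __init__(self, embeddings: torch.Tensor, n_languages: int, p_drop: float = 0.1):
+         super().__init__()
+         # Frozen token embeddings: (vocab_size, hidden_dim).
+         self.embed = nn.Embedding.from_pretrained(embeddings, freeze=True)
+         self.dropout = nn.Dropout(p_drop)
+         self.proj = nn.Linear(embeddings.shape[1], n_languages)
+
+     def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
+         x = self.dropout(self.embed(token_ids))   # (batch, seq, hidden_dim)
+         logits = self.proj(x)                     # (batch, seq, n_languages)
+         return torch.logsumexp(logits, dim=1)     # pool over the sequence
+ ```
+
+ After training, taking `exp` of the projected embeddings once per vocabulary entry yields the per-token lookup table used at inference, as described above.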
+
+ ## License
+
+ Apache 2.0 License
+
+ ## Citations
+
+ If you use WordLlama Detect in your research or project, please consider citing it as follows:
+
+ ```bibtex
+ @software{miller2025wordllamadetect,
+   author = {Miller, D. Lee},
+   title = {WordLlama Detect: The Language of the Token},
+   year = {2025},
+   url = {https://github.com/dleemiller/WordLlamaDetect},
+   version = {0.1.0}
+ }
+ ```
+
+ ## Acknowledgments
+
+ - OpenLID-v2 dataset: [laurievb/OpenLID-v2](https://huggingface.co/datasets/laurievb/OpenLID-v2)
+ - FLORES+ dataset: [openlanguagedata/flores_plus](https://huggingface.co/datasets/openlanguagedata/flores_plus)
+ - HuggingFace transformers and tokenizers libraries
+ - Google Gemma model team