---
license: apache-2.0
language:
- multilingual
- en
- de
- fr
- es
- it
- pt
- nl
- pl
- sv
- da
- no
- fi
- cs
- ro
- hu
- ca
- ru
- uk
- bg
- sr
- el
- zh
- ja
- ko
- hi
- bn
- ur
- ta
- te
- mr
- th
- vi
- id
- ms
- fil
- ar
- fa
- tr
- he
- sw
- am
- yo
- lt
- sl
- et
- lv
- sk
- hr
- az
- kk
- uz
library_name: transformers
tags:
- dictionary
- translation
- multilingual
- bilingual
- glossing
- vocabulary
base_model: Qwen/Qwen3-0.6B
datasets:
- HuggingFaceFW/fineweb
- HuggingFaceFW/fineweb-2
pipeline_tag: text-generation
---

# Ikhou Dictionary Model (dict-xs)

A lightweight multilingual dictionary model based on Qwen3-0.6B, fine-tuned on 1.7M dictionary-style glosses across 50+ languages.

## Model Description

This model provides short, dictionary-style translations and glosses for words and phrases in context. It's designed for:
- Quick word lookups in reading applications
- Vocabulary learning tools
- Translation assistance
- Language learning applications

**Key Features:**
- 🌍 **50+ languages** supported (see list below)
- 📖 **Dictionary-style glosses** with grammatical markers
- ⚡ **Fast inference** (596M parameters, bfloat16)
- 🎯 **Context-aware** translations

## Supported Languages (50)

### Major European Languages
English, German, French, Spanish, Italian, Portuguese, Dutch, Polish, Swedish, Danish, Norwegian Bokmål, Finnish, Czech, Romanian, Hungarian, Catalan, Greek

### Cyrillic Script
Russian, Ukrainian, Bulgarian, Serbian

### Asian Languages
Chinese (Mandarin), Japanese, Korean, Hindi, Bengali, Urdu, Tamil, Telugu, Marathi, Thai, Vietnamese, Indonesian, Malay, Filipino

### Middle Eastern
Arabic, Persian, Turkish, Hebrew

### African
Swahili, Amharic, Yoruba

### Other
Lithuanian, Slovenian, Estonian, Latvian, Slovak, Croatian, Azerbaijani, Kazakh, Uzbek

## Grammar Markers Explained

The model outputs grammatical information using standard linguistic abbreviations:

### Noun Markers (Gender-based Languages)
- **nm.** = Masculine noun (e.g., "nm. roi, monarque" = king, monarch in French)
- **nf.** = Feminine noun (e.g., "nf. maison, demeure" = house, dwelling in French)
- **nn.** = Neuter noun (German, Russian) (e.g., "nn. Haus, Gebäude" = house, building)

### Noun Markers (Non-gendered Languages)
- **n.** = Noun (e.g., "n. house, home" in English)

### Other Parts of Speech
- **adj.** = Adjective (e.g., "adj. rapide, vite" = fast, quick)
- **adv.** = Adverb (e.g., "adv. rapidement, vite" = quickly, fast)
- **pp** = Past participle (e.g., "mangé → eaten, consumed (pp)")

### Verb Forms
For conjugated verbs, the model provides:
- Translation(s)
- Tense/mood information in parentheses
- Example: "venait → came, was coming (imparfait, il)" = French imperfect tense, third-person subject "il"

## Usage

### Basic Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "ikhou/dict-xs",
    torch_dtype="bfloat16",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("ikhou/dict-xs")

# Example: Get a French→English gloss
messages = [
    {
        "role": "system",
        "content": "You are a bilingual dictionary. Given a word/phrase in context, output a short gloss.\n\nRules:\n- One line only, no labels\n- Use grammar markers: nm./nf./nn. for gendered nouns, n. for others, adj., adv., verbs with tense info\n- 1-4 short translations, comma-separated\n- Apply markers based on definition language"
    },
    {
        "role": "user",
        "content": 'Expression: "maison"\nContext: Il habite dans une petite 【maison】 près de la mer.\nSource language: fra (French)\nDefinition language: eng (English)\n\nReturn the single-line gloss now.'
    }
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=50,
    temperature=0.3,
    do_sample=True,
    top_p=0.9
)

response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
print(response)  # Output: "nf. house, home"
```

### Input Format

The model expects:
1. **Expression**: The word/phrase to define
2. **Context**: Sentence with the expression (use 【】 to highlight)
3. **Source language**: ISO 639-3 code (e.g., fra, eng, deu)
4. **Definition language**: ISO 639-3 code
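
The four fields above can be assembled with a small helper (a sketch; the function name is illustrative, not part of the model's API):

```python
def build_dict_prompt(expression, context, src_lang, def_lang):
    """Build the user message for a dictionary lookup.

    `context` should contain the expression; the first occurrence is
    wrapped in 【】 brackets, as the model expects.
    """
    highlighted = context.replace(expression, f"【{expression}】", 1)
    return (
        f'Expression: "{expression}"\n'
        f"Context: {highlighted}\n"
        f"Source language: {src_lang}\n"
        f"Definition language: {def_lang}\n\n"
        "Return the single-line gloss now."
    )

prompt = build_dict_prompt(
    "maison",
    "Il habite dans une petite maison près de la mer.",
    "fra (French)",
    "eng (English)",
)
```

The resulting string can be passed as the `content` of the user message in the usage example above.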

### Output Format

The model returns a single line with:
- Grammar marker (nm./nf./nn./n./adj./adv./pp)
- 1-4 short translations/synonyms, comma-separated
- For verbs: glosses + grammatical info in parentheses
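
A gloss line in this shape can be split back into its parts with a small parser (a sketch under the format described above; field names are illustrative):

```python
import re

# Grammar markers the model emits at the start of a gloss
MARKERS = ("nm.", "nf.", "nn.", "n.", "adj.", "adv.")

def parse_gloss(gloss):
    """Split a one-line gloss into (marker, translations, verb_info)."""
    gloss = gloss.strip()
    marker = None
    for m in MARKERS:
        if gloss.startswith(m + " "):
            marker = m
            gloss = gloss[len(m):].strip()
            break
    # Verb glosses carry tense/mood info in trailing parentheses
    verb_info = None
    match = re.search(r"\(([^)]*)\)\s*$", gloss)
    if match:
        verb_info = match.group(1)
        gloss = gloss[: match.start()].strip()
    translations = [t.strip() for t in gloss.split(",") if t.strip()]
    return marker, translations, verb_info

print(parse_gloss("nf. house, home"))
# → ('nf.', ['house', 'home'], None)
```

A verb gloss such as `"came, was coming (imparfait, il)"` yields no marker, the two translations, and `"imparfait, il"` as the verb info.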

## Training Details

### Training Data
- **Dataset**: 1.7M synthetic dictionary entries
- **Sources**: FineWeb (English), FineWeb-2 (49 other languages)
- **Generation**: GPT-4-based teacher model for quality glosses
- **Filtering**: Proper noun filtering, quality scoring

### Training Configuration
- **Base Model**: Qwen/Qwen3-0.6B
- **Training Type**: Full fine-tuning (not LoRA)
- **Precision**: bfloat16
- **Batch Size**: 32 per device
- **Gradient Accumulation**: 8 steps
- **Total Steps**: 6,568
- **Optimizer**: AdamW with cosine learning-rate schedule
- **Hardware**: NVIDIA H100 (95GB)
- **Training Time**: ~6 hours

### Training Results
- **Final Loss**: 1.30
- **Eval Loss**: 1.34
- **Validation**: training runs were checked end to end, with no zero-loss anomalies

## Model Architecture

- **Architecture**: Qwen3ForCausalLM
- **Parameters**: 596M
- **Layers**: 28 transformer layers
- **Hidden Size**: 1024
- **Attention Heads**: 16 (8 KV heads)
- **Context Length**: 40,960 tokens (model maximum; fine-tuned on 512-token sequences)
- **Vocabulary**: 151,936 tokens

## Limitations

- **Context**: Works best with clear, simple contexts
- **Proper nouns**: May struggle with names, places, brands
- **Rare languages**: Better performance on high-resource languages
- **Multi-word phrases**: Best for 1-6 token phrases
- **Ambiguity**: Provides common meanings, may miss context-specific nuances

## Ethical Considerations

- **Bias**: Trained on web data which may contain biases
- **Not for sensitive applications**: Dictionary glosses may have errors
- **Educational use**: Best for learning and reference, not authoritative translation

## License

Apache 2.0

## Citation

```bibtex
@misc{ikhou-dict-xs,
  author = {Ikhou},
  title = {Ikhou Dictionary Model (dict-xs)},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ikhou/dict-xs}}
}
```

## Acknowledgments

- Based on [Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) by Alibaba Cloud
- Training data sourced from [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) and [FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2)
- Trained with [Hugging Face Transformers](https://github.com/huggingface/transformers)

## Contact

For issues or questions, please open an issue on the [model repository](https://huggingface.co/ikhou/dict-xs/discussions).