---
license: apache-2.0
language:
- multilingual
- en
- de
- fr
- es
- it
- pt
- nl
- pl
- sv
- da
- 'no'
- fi
- cs
- ro
- hu
- ca
- ru
- uk
- bg
- sr
- el
- zh
- ja
- ko
- hi
- bn
- ur
- ta
- te
- mr
- th
- vi
- id
- ms
- fil
- ar
- fa
- tr
- he
- sw
- am
- yo
- lt
- sl
- et
- lv
- sk
- hr
- az
- kk
- uz
library_name: transformers
tags:
- dictionary
- translation
- multilingual
- bilingual
- glossing
- vocabulary
base_model: Qwen/Qwen3-0.6B
datasets:
- HuggingFaceFW/fineweb
- HuggingFaceFW/fineweb-2
pipeline_tag: text-generation
---
# Ikhou Dictionary Model (dict-xs)
A lightweight multilingual dictionary model based on Qwen3-0.6B, fine-tuned on 1.7M dictionary-style glosses across 50+ languages.
## Model Description
This model provides short, dictionary-style translations and glosses for words and phrases in context. It's designed for:
- Quick word lookups in reading applications
- Vocabulary learning tools
- Translation assistance
- Language learning applications
**Key Features:**
- 🌍 **50+ languages** supported (see list below)
- 📖 **Dictionary-style glosses** with grammatical markers
- ⚡ **Fast inference** (596M parameters, bfloat16)
- 🎯 **Context-aware** translations
## Supported Languages (51)
### Major European Languages
English, German, French, Spanish, Italian, Portuguese, Dutch, Polish, Swedish, Danish, Norwegian Bokmål, Finnish, Czech, Romanian, Hungarian, Catalan, Greek
### Cyrillic Script
Russian, Ukrainian, Bulgarian, Serbian
### Asian Languages
Chinese (Mandarin), Japanese, Korean, Hindi, Bengali, Urdu, Tamil, Telugu, Marathi, Thai, Vietnamese, Indonesian, Malay, Filipino
### Middle Eastern
Arabic, Persian, Turkish, Hebrew
### African
Swahili, Amharic, Yoruba
### Other
Lithuanian, Slovenian, Estonian, Latvian, Slovak, Croatian, Azerbaijani, Kazakh, Uzbek
## Grammar Markers Explained
The model outputs grammatical information using standard linguistic abbreviations:
### Noun Markers (Gender-based Languages)
- **nm.** = Masculine noun (e.g., "nm. roi, monarque" = king, monarch in French)
- **nf.** = Feminine noun (e.g., "nf. maison, demeure" = house, dwelling in French)
- **nn.** = Neuter noun (German, Russian) (e.g., "nn. Haus, Gebäude" = house, building)
### Noun Markers (Non-gendered Languages)
- **n.** = Noun (e.g., "n. house, home" in English)
### Other Parts of Speech
- **adj.** = Adjective (e.g., "adj. rapide, vite" = fast, quick)
- **adv.** = Adverb (e.g., "adv. rapidement, vite" = quickly, fast)
- **pp** = Past participle (e.g., "mangé → eaten, consumed (pp)")
### Verb Forms
For conjugated verbs, the model provides:
- Translation(s)
- Tense/mood information in parentheses
- Example: "venait → came, was coming (imparfait, il)" = imperfect tense, he
## Usage
### Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load in bfloat16; device_map="auto" places the model on GPU if available.
model = AutoModelForCausalLM.from_pretrained(
    "ikhou/dict-xs",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("ikhou/dict-xs")

# Example: get a French→English gloss.
messages = [
    {
        "role": "system",
        "content": (
            "You are a bilingual dictionary. Given a word/phrase in context, "
            "output a short gloss.\n\nRules:\n"
            "- One line only, no labels\n"
            "- Use grammar markers: nm./nf./nn. for gendered nouns, n. for others, "
            "adj., adv., verbs with tense info\n"
            "- 1-4 short translations, comma-separated\n"
            "- Apply markers based on definition language"
        ),
    },
    {
        "role": "user",
        "content": (
            'Expression: "maison"\n'
            "Context: Il habite dans une petite 【maison】 près de la mer.\n"
            "Source language: fra (French)\n"
            "Definition language: eng (English)\n\n"
            "Return the single-line gloss now."
        ),
    },
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=50,
    temperature=0.3,
    do_sample=True,
    top_p=0.9,
)

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
print(response)  # e.g. "nf. house, home"
```
### Input Format
The model expects four fields (a minimal prompt-builder sketch follows this list):
1. **Expression**: The word/phrase to define
2. **Context**: Sentence with the expression (use 【】 to highlight)
3. **Source language**: ISO 639-3 code (e.g., fra, eng, deu)
4. **Definition language**: ISO 639-3 code
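The template from the usage example can be assembled with a small helper. This is a minimal sketch; the helper name and signature are our own, and only the template text mirrors the example above:

```python
def build_dict_prompt(expression: str, context: str, src: str, dst: str) -> str:
    """Format a lookup request in the template shown in the usage example.

    Hypothetical helper, not shipped with the model. `src` and `dst` are
    ISO 639-3 codes, optionally followed by the language name,
    e.g. "fra (French)".
    """
    # Highlight the first occurrence of the expression with 【】 brackets.
    highlighted = context.replace(expression, f"【{expression}】", 1)
    return (
        f'Expression: "{expression}"\n'
        f"Context: {highlighted}\n"
        f"Source language: {src}\n"
        f"Definition language: {dst}\n\n"
        "Return the single-line gloss now."
    )

prompt = build_dict_prompt(
    "maison",
    "Il habite dans une petite maison près de la mer.",
    "fra (French)",
    "eng (English)",
)
```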
### Output Format
The model returns a single line with the following parts (a parsing sketch follows this list):
- Grammar marker (nm./nf./nn./n./adj./adv./pp)
- 1-4 short translations/synonyms, comma-separated
- For verbs: glosses + grammatical info in parentheses
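A lightweight parser can split a gloss into marker, translations, and optional verb info. Here is a sketch, assuming the marker (when present) is the first whitespace-delimited token and that verb info arrives as a trailing parenthetical, as in the examples above; this helper is our own, not shipped with the model:

```python
import re

KNOWN_MARKERS = {"nm.", "nf.", "nn.", "n.", "adj.", "adv.", "pp"}

def parse_gloss(line: str):
    """Split 'nf. house, home' into (marker, translations, info).

    Hypothetical parser based on the format described above. A trailing
    "(...)" block (verb tense info) is captured separately so that commas
    inside it do not split the translation list.
    """
    line = line.strip()
    info = None
    match = re.search(r"\s*\(([^)]*)\)\s*$", line)
    if match:
        info = match.group(1)
        line = line[: match.start()]
    first, _, rest = line.partition(" ")
    marker, body = (first, rest) if first in KNOWN_MARKERS else (None, line)
    translations = [t.strip() for t in body.split(",") if t.strip()]
    return marker, translations, info

print(parse_gloss("nf. house, home"))
# ('nf.', ['house', 'home'], None)
print(parse_gloss("came, was coming (imparfait, il)"))
# (None, ['came', 'was coming'], 'imparfait, il')
```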
## Training Details
### Training Data
- **Dataset**: 1.7M synthetic dictionary entries
- **Sources**: FineWeb (English) and FineWeb-2 (the other supported languages)
- **Generation**: Glosses produced by a GPT-4-based teacher model
- **Filtering**: Proper-noun filtering and quality scoring
### Training Configuration
- **Base Model**: Qwen/Qwen3-0.6B
- **Training Type**: Full fine-tuning (not LoRA)
- **Precision**: bfloat16
- **Batch Size**: 32 per device
- **Gradient Accumulation**: 8 steps
- **Total Steps**: 6,568
- **Optimizer**: AdamW with a cosine learning-rate schedule
- **Hardware**: NVIDIA H100 (95GB)
- **Training Time**: ~6 hours
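These settings imply an effective batch of 32 × 8 = 256 examples per optimizer step (assuming the single listed GPU), so the 6,568 steps cover roughly 6,568 × 256 ≈ 1.68M examples, about one epoch over the 1.7M-entry dataset.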
### Training Results
- **Final Loss**: 1.30
- **Eval Loss**: 1.34
- **Validation**: Training run verified end to end, with no zero-loss (degenerate) steps observed
## Model Architecture
- **Architecture**: Qwen3ForCausalLM
- **Parameters**: 596M
- **Layers**: 28 transformer layers
- **Hidden Size**: 1024
- **Attention Heads**: 16 (8 KV heads)
- **Context Length**: 40,960 tokens (model maximum; fine-tuned on 512-token sequences)
- **Vocabulary**: 151,936 tokens
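These figures can be checked against the published config with standard Transformers APIs; the attribute names below follow the usual Qwen3 config fields:

```python
from transformers import AutoConfig

# Fetch the released config and print the headline architecture numbers.
config = AutoConfig.from_pretrained("ikhou/dict-xs")
print(config.num_hidden_layers)        # expected: 28
print(config.hidden_size)              # expected: 1024
print(config.num_attention_heads)      # expected: 16
print(config.num_key_value_heads)      # expected: 8
print(config.max_position_embeddings)  # expected: 40960
print(config.vocab_size)               # expected: 151936
```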
## Limitations
- **Context**: Works best with clear, simple contexts
- **Proper nouns**: May struggle with names, places, brands
- **Rare languages**: Better performance on high-resource languages
- **Multi-word phrases**: Best for 1-6 token phrases
- **Ambiguity**: Provides common meanings, may miss context-specific nuances
## Ethical Considerations
- **Bias**: Trained on web data which may contain biases
- **Not for sensitive applications**: Dictionary glosses may have errors
- **Educational use**: Best for learning and reference, not authoritative translation
## License
Apache 2.0
## Citation
```bibtex
@misc{ikhou-dict-xs,
  author = {Ikhou},
  title = {Ikhou Dictionary Model (dict-xs)},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ikhou/dict-xs}}
}
```
## Acknowledgments
- Based on [Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) by Alibaba Cloud
- Training data sourced from [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) and [FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2)
- Trained with [Hugging Face Transformers](https://github.com/huggingface/transformers)
## Contact
For issues or questions, please open an issue on the [model repository](https://huggingface.co/ikhou/dict-xs/discussions).