Ikhou Dictionary Model (dict-xs)
A lightweight multilingual dictionary model based on Qwen3-0.6B, fine-tuned on 1.7M dictionary-style glosses across 50+ languages.
Model Description
This model provides short, dictionary-style translations and glosses for words and phrases in context. It's designed for:
- Quick word lookups in reading applications
- Vocabulary learning tools
- Translation assistance
- Language learning applications
Key Features:
- 🌍 50+ languages supported (see list below)
- 📖 Dictionary-style glosses with grammatical markers
- ⚡ Fast inference (596M parameters, bfloat16)
- 🎯 Context-aware translations
Supported Languages
Major European Languages
English, German, French, Spanish, Italian, Portuguese, Dutch, Polish, Swedish, Danish, Norwegian Bokmål, Finnish, Czech, Romanian, Hungarian, Catalan, Greek
Cyrillic Script
Russian, Ukrainian, Bulgarian, Serbian
Asian Languages
Chinese (Mandarin), Japanese, Korean, Hindi, Bengali, Urdu, Tamil, Telugu, Marathi, Thai, Vietnamese, Indonesian, Malay, Filipino
Middle Eastern
Arabic, Persian, Turkish, Hebrew
African
Swahili, Amharic, Yoruba
Other
Lithuanian, Slovenian, Estonian, Latvian, Slovak, Croatian, Azerbaijani, Kazakh, Uzbek
Grammar Markers Explained
The model outputs grammatical information using standard linguistic abbreviations:
Noun Markers (Gender-based Languages)
- nm. = Masculine noun (e.g., "nm. roi, monarque" = king, monarch in French)
- nf. = Feminine noun (e.g., "nf. maison, demeure" = house, dwelling in French)
- nn. = Neuter noun (German, Russian) (e.g., "nn. Haus, Gebäude" = house, building)
Noun Markers (Non-gendered Languages)
- n. = Noun (e.g., "n. house, home" in English)
Other Parts of Speech
- adj. = Adjective (e.g., "adj. rapide" = fast, quick in French)
- adv. = Adverb (e.g., "adv. rapidement, vite" = quickly, fast)
- pp = Past participle (e.g., "mangé → eaten, consumed (pp)")
Verb Forms
For conjugated verbs, the model provides:
- Translation(s)
- Tense/mood information in parentheses
- Example: "venait → came, was coming (imparfait, il)", i.e. French imperfect tense, third-person singular "il" (he)
Usage
Basic Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "ikhou/dict-xs",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("ikhou/dict-xs")

# Example: get a French -> English gloss
messages = [
    {
        "role": "system",
        "content": (
            "You are a bilingual dictionary. Given a word/phrase in context, "
            "output a short gloss.\n\nRules:\n- One line only, no labels\n"
            "- Use grammar markers: nm./nf./nn. for gendered nouns, n. for others, "
            "adj., adv., verbs with tense info\n"
            "- 1-4 short translations, comma-separated\n"
            "- Apply markers based on definition language"
        ),
    },
    {
        "role": "user",
        "content": (
            'Expression: "maison"\n'
            "Context: Il habite dans une petite 【maison】 près de la mer.\n"
            "Source language: fra (French)\n"
            "Definition language: eng (English)\n\n"
            "Return the single-line gloss now."
        ),
    },
]

# Tokenize the chat prompt and move it to the model's device
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=50,
    temperature=0.3,
    do_sample=True,
    top_p=0.9,
)

# Decode only the newly generated tokens (skip the prompt)
response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
print(response)  # Output: "nf. house, home"
```
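Sampling with a low temperature keeps glosses short and mostly stable, but it is still stochastic. If you need reproducible output, greedy decoding is a standard alternative (a generic `generate` option, not a recommendation from the model card):

```python
# Deterministic decoding: disable sampling for reproducible glosses
outputs = model.generate(inputs, max_new_tokens=50, do_sample=False)
```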
Input Format
The model expects:
- Expression: The word/phrase to define
- Context: Sentence with the expression (use 【】 to highlight)
- Source language: ISO 639-3 code (e.g., fra, eng, deu)
- Definition language: ISO 639-3 code
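The four fields can also be assembled programmatically. The sketch below is a hypothetical helper (the function name and the use of `str.replace` to insert the 【】 markers are my own choices, not part of the model's API); it assumes the `model` and `tokenizer` from the Basic Usage example above.

```python
def build_messages(expression: str, context: str, src: str, dst: str,
                   src_name: str, dst_name: str) -> list[dict]:
    """Build the chat messages for a single dictionary lookup.

    `context` should already contain `expression`; the first occurrence
    is wrapped in 【】 so the model knows which span to gloss.
    """
    highlighted = context.replace(expression, f"【{expression}】", 1)
    system = (
        "You are a bilingual dictionary. Given a word/phrase in context, "
        "output a short gloss.\n\nRules:\n- One line only, no labels\n"
        "- Use grammar markers: nm./nf./nn. for gendered nouns, n. for others, "
        "adj., adv., verbs with tense info\n"
        "- 1-4 short translations, comma-separated\n"
        "- Apply markers based on definition language"
    )
    user = (
        f'Expression: "{expression}"\n'
        f"Context: {highlighted}\n"
        f"Source language: {src} ({src_name})\n"
        f"Definition language: {dst} ({dst_name})\n\n"
        "Return the single-line gloss now."
    )
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]

# Example: German -> English lookup
messages = build_messages("Haus", "Das Haus steht am See.",
                          "deu", "eng", "German", "English")
```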
Output Format
The model returns a single line with:
- Grammar marker (nm./nf./nn./n./adj./adv./pp)
- 1-4 short translations/synonyms, comma-separated
- For verbs: glosses + grammatical info in parentheses
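Because the output is a single line, it is easy to post-process. Here is a minimal parsing sketch; the marker set comes from the list above, while the assumption that tense/mood info always sits in a trailing parenthesis is mine:

```python
import re

# Known grammar markers from the section above
MARKERS = ("nm.", "nf.", "nn.", "n.", "adj.", "adv.")

def parse_gloss(line: str) -> dict:
    """Split a gloss line into an optional marker, the translations,
    and optional trailing grammatical info in parentheses."""
    line = line.strip()
    marker = None
    for m in MARKERS:
        if line.startswith(m + " "):
            marker, line = m, line[len(m) + 1:]
            break
    # A trailing "(...)" holds tense/mood info, e.g. "(imparfait, il)"
    info = None
    match = re.search(r"\(([^)]*)\)\s*$", line)
    if match:
        info = match.group(1)
        line = line[:match.start()].strip()
    translations = [t.strip() for t in line.split(",") if t.strip()]
    return {"marker": marker, "translations": translations, "info": info}

print(parse_gloss("nf. house, home"))
# {'marker': 'nf.', 'translations': ['house', 'home'], 'info': None}
print(parse_gloss("came, was coming (imparfait, il)"))
# {'marker': None, 'translations': ['came', 'was coming'], 'info': 'imparfait, il'}
```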
Training Details
Training Data
- Dataset: 1.7M synthetic dictionary entries
- Sources: FineWeb (English), FineWeb-2 (49 other languages)
- Generation: glosses produced by a GPT-4-based teacher model
- Filtering: Proper noun filtering, quality scoring
Training Configuration
- Base Model: Qwen/Qwen3-0.6B
- Training Type: Full fine-tuning (not LoRA)
- Precision: bfloat16
- Batch Size: 32 per device
- Gradient Accumulation: 8 steps
- Total Steps: 6,568
- Optimizer: AdamW with a cosine learning-rate schedule
- Hardware: NVIDIA H100 (95GB)
- Training Time: ~6 hours
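For readers who want to reproduce a similar run, here is a rough sketch of how the listed hyperparameters map onto `transformers.TrainingArguments`. Only the values marked "from card" come from the list above; the learning rate and output path are placeholders, and the actual training script is not published.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="dict-xs-run",        # placeholder
    per_device_train_batch_size=32,  # from card
    gradient_accumulation_steps=8,   # from card
    max_steps=6568,                  # from card
    bf16=True,                       # from card (bfloat16 precision)
    optim="adamw_torch",             # from card (AdamW)
    lr_scheduler_type="cosine",      # from card (cosine schedule)
    learning_rate=2e-5,              # placeholder: not stated in the card
)
```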
Training Results
- Final Loss: 1.30
- Eval Loss: 1.34
- Validation: training monitored end to end; no zero-loss anomalies observed
Model Architecture
- Architecture: Qwen3ForCausalLM
- Parameters: 596M
- Layers: 28 transformer layers
- Hidden Size: 1024
- Attention Heads: 16 (8 KV heads)
- Context Length: 40,960 tokens (model maximum; fine-tuning used 512-token sequences)
- Vocabulary: 151,936 tokens
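These numbers can be cross-checked against the released config without downloading the weights; the field names below are the standard ones for Qwen-style configs in `transformers`:

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("ikhou/dict-xs")
print(cfg.num_hidden_layers)        # expect 28
print(cfg.hidden_size)              # expect 1024
print(cfg.num_attention_heads)      # expect 16
print(cfg.num_key_value_heads)      # expect 8
print(cfg.vocab_size)               # expect 151936
print(cfg.max_position_embeddings)  # expect 40960
```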
Limitations
- Context: Works best with clear, simple contexts
- Proper nouns: May struggle with names, places, brands
- Rare languages: Better performance on high-resource languages
- Multi-word phrases: Best for 1-6 token phrases
- Ambiguity: Provides common meanings, may miss context-specific nuances
Ethical Considerations
- Bias: Trained on web data which may contain biases
- Not for sensitive applications: Dictionary glosses may have errors
- Educational use: Best for learning and reference, not authoritative translation
License
Apache 2.0
Citation
```bibtex
@misc{ikhou-dict-xs,
  author       = {Ikhou},
  title        = {Ikhou Dictionary Model (dict-xs)},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ikhou/dict-xs}}
}
```
Acknowledgments
- Based on Qwen3-0.6B by Alibaba Cloud
- Training data sourced from FineWeb and FineWeb-2
- Trained with Hugging Face Transformers
Contact
For issues or questions, please open an issue on the model repository.