---
license: apache-2.0
language:
- multilingual
- en
- de
- fr
- es
- it
- pt
- nl
- pl
- sv
- da
- no
- fi
- cs
- ro
- hu
- ca
- ru
- uk
- bg
- sr
- el
- zh
- ja
- ko
- hi
- bn
- ur
- ta
- te
- mr
- th
- vi
- id
- ms
- fil
- ar
- fa
- tr
- he
- sw
- am
- yo
- lt
- sl
- et
- lv
- sk
- hr
- az
- kk
- uz
library_name: transformers
tags:
- dictionary
- translation
- multilingual
- bilingual
- glossing
- vocabulary
base_model: Qwen/Qwen3-0.6B
datasets:
- HuggingFaceFW/fineweb
- HuggingFaceFW/fineweb-2
pipeline_tag: text-generation
---

# Ikhou Dictionary Model (dict-xs)

A lightweight multilingual dictionary model based on Qwen3-0.6B, fine-tuned on 1.7M dictionary-style glosses across 50+ languages.

## Model Description

This model provides short, dictionary-style translations and glosses for words and phrases in context. It's designed for:

- Quick word lookups in reading applications
- Vocabulary learning tools
- Translation assistance
- Language learning applications

**Key Features:**

- 🌍 **50+ languages** supported (see list below)
- 📖 **Dictionary-style glosses** with grammatical markers
- ⚡ **Fast inference** (596M parameters, bfloat16)
- 🎯 **Context-aware** translations

## Supported Languages (50)

### Major European Languages
English, German, French, Spanish, Italian, Portuguese, Dutch, Polish, Swedish, Danish, Norwegian Bokmål, Finnish, Czech, Romanian, Hungarian, Catalan, Greek

### Cyrillic Script
Russian, Ukrainian, Bulgarian, Serbian

### Asian Languages
Chinese (Mandarin), Japanese, Korean, Hindi, Bengali, Urdu, Tamil, Telugu, Marathi, Thai, Vietnamese, Indonesian, Malay, Filipino

### Middle Eastern
Arabic, Persian, Turkish, Hebrew

### African
Swahili, Amharic, Yoruba

### Other
Lithuanian, Slovenian, Estonian, Latvian, Slovak, Croatian, Azerbaijani, Kazakh, Uzbek

## Grammar Markers Explained

The model outputs grammatical information using standard linguistic abbreviations:

### Noun Markers (Gender-based Languages)
- **nm.** = Masculine noun (e.g., "nm. roi, monarque" = king, monarch in French)
- **nf.** = Feminine noun (e.g., "nf. maison, demeure" = house, dwelling in French)
- **nn.** = Neuter noun (German, Russian) (e.g., "nn. Haus, Gebäude" = house, building)

### Noun Markers (Non-gendered Languages)
- **n.** = Noun (e.g., "n. house, home" in English)

### Other Parts of Speech
- **adj.** = Adjective (e.g., "adj. rapide, vite" = fast, quick)
- **adv.** = Adverb (e.g., "adv. rapidement, vite" = quickly, fast)
- **pp** = Past participle (e.g., "mangé → eaten, consumed (pp)")

### Verb Forms
For conjugated verbs, the model provides:
- Translation(s)
- Tense/mood information in parentheses
- Example: "venait → came, was coming (imparfait, il)" = imperfect tense, he

## Usage

### Basic Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "ikhou/dict-xs",
    torch_dtype="bfloat16",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("ikhou/dict-xs")

# Example: Get a French→English gloss
messages = [
    {
        "role": "system",
        "content": "You are a bilingual dictionary. Given a word/phrase in context, output a short gloss.\n\nRules:\n- One line only, no labels\n- Use grammar markers: nm./nf./nn. for gendered nouns, n. for others, adj., adv., verbs with tense info\n- 1-4 short translations, comma-separated\n- Apply markers based on definition language"
    },
    {
        "role": "user",
        "content": 'Expression: "maison"\nContext: Il habite dans une petite 【maison】 près de la mer.\nSource language: fra (French)\nDefinition language: eng (English)\n\nReturn the single-line gloss now.'
    }
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=50,
    temperature=0.3,
    do_sample=True,
    top_p=0.9
)

response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
print(response)  # Output: "nf. house, home"
```
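### Programmatic Lookups

For repeated lookups, the one-off example above can be wrapped in a small helper that builds the prompt from the fields described under Input Format below. This is an illustrative sketch only (the `gloss` helper and its defaults are not part of the model's API); it reuses the `model` and `tokenizer` objects loaded in the previous example.

```python
def gloss(expression, context, src_lang, dst_lang, max_new_tokens=50):
    """Return a single-line dictionary gloss for `expression` as used in `context`.

    `src_lang` and `dst_lang` follow the prompt format shown above,
    e.g. "fra (French)" and "eng (English)". Illustrative helper only,
    not part of the model's API.
    """
    system = (
        "You are a bilingual dictionary. Given a word/phrase in context, output a short gloss.\n\n"
        "Rules:\n"
        "- One line only, no labels\n"
        "- Use grammar markers: nm./nf./nn. for gendered nouns, n. for others, adj., adv., verbs with tense info\n"
        "- 1-4 short translations, comma-separated\n"
        "- Apply markers based on definition language"
    )
    # Highlight the first occurrence of the expression with 【】, as in the prompts above.
    highlighted = context.replace(expression, f"【{expression}】", 1)
    user = (
        f'Expression: "{expression}"\n'
        f"Context: {highlighted}\n"
        f"Source language: {src_lang}\n"
        f"Definition language: {dst_lang}\n\n"
        "Return the single-line gloss now."
    )
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(
        inputs,
        max_new_tokens=max_new_tokens,
        temperature=0.3,
        do_sample=True,
        top_p=0.9,
    )
    # Decode only the newly generated tokens and keep the first line of the gloss.
    text = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True).strip()
    return text.splitlines()[0] if text else text


# Example: German verb in context (output not shown; expect a gloss with tense info)
print(gloss("kam", "Sie kam gestern spät nach Hause.", "deu (German)", "eng (English)"))
```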
### Input Format

The model expects:

1. **Expression**: The word/phrase to define
2. **Context**: A sentence containing the expression (wrap it in 【】 to highlight it)
3. **Source language**: ISO 639-3 code (e.g., fra, eng, deu)
4. **Definition language**: ISO 639-3 code

### Output Format

The model returns a single line with:

- Grammar marker (nm./nf./nn./n./adj./adv./pp)
- 1-4 short translations/synonyms, comma-separated
- For verbs: glosses + grammatical info in parentheses

## Training Details

### Training Data

- **Dataset**: 1.7M synthetic dictionary entries
- **Sources**: FineWeb (English), FineWeb-2 (49 other languages)
- **Generation**: GPT-4-based teacher model for quality glosses
- **Filtering**: Proper noun filtering, quality scoring

### Training Configuration

- **Base Model**: Qwen/Qwen3-0.6B
- **Training Type**: Full fine-tuning (not LoRA)
- **Precision**: bfloat16
- **Batch Size**: 32 per device
- **Gradient Accumulation**: 8 steps
- **Total Steps**: 6,568
- **Optimizer**: AdamW with a cosine learning-rate schedule
- **Hardware**: NVIDIA H100 (95GB)
- **Training Time**: ~6 hours

### Training Results

- **Final Loss**: 1.30
- **Eval Loss**: 1.34
- Training was validated end to end; no zero-loss issues were observed

## Model Architecture

- **Architecture**: Qwen3ForCausalLM
- **Parameters**: 596M
- **Layers**: 28 transformer layers
- **Hidden Size**: 1024
- **Attention Heads**: 16 (8 KV heads)
- **Context Length**: 40,960 tokens (model maximum; trained on 512-token sequences)
- **Vocabulary**: 151,936 tokens

## Limitations

- **Context**: Works best with clear, simple contexts
- **Proper nouns**: May struggle with names, places, brands
- **Rare languages**: Better performance on high-resource languages
- **Multi-word phrases**: Best for 1-6 token phrases
- **Ambiguity**: Provides common meanings; may miss context-specific nuances

## Ethical Considerations

- **Bias**: Trained on web data, which may contain biases
- **Not for sensitive applications**: Dictionary glosses may contain errors
- **Educational use**: Best for learning and reference, not authoritative translation

## License

Apache 2.0

## Citation

```bibtex
@misc{ikhou-dict-xs,
  author = {Ikhou},
  title = {Ikhou Dictionary Model (dict-xs)},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ikhou/dict-xs}}
}
```

## Acknowledgments

- Based on [Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) by Alibaba Cloud
- Training data sourced from [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) and [FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2)
- Trained with [Hugging Face Transformers](https://github.com/huggingface/transformers)

## Contact

For issues or questions, please open an issue on the [model repository](https://huggingface.co/ikhou/dict-xs/discussions).