---
license: apache-2.0
language:
- multilingual
- en
- de
- fr
- es
- it
- pt
- nl
- pl
- sv
- da
- "no"
- fi
- cs
- ro
- hu
- ca
- ru
- uk
- bg
- sr
- el
- zh
- ja
- ko
- hi
- bn
- ur
- ta
- te
- mr
- th
- vi
- id
- ms
- fil
- ar
- fa
- tr
- he
- sw
- am
- yo
- lt
- sl
- et
- lv
- sk
- hr
- az
- kk
- uz
library_name: transformers
tags:
- dictionary
- translation
- multilingual
- bilingual
- glossing
- vocabulary
base_model: Qwen/Qwen3-0.6B
datasets:
- HuggingFaceFW/fineweb
- HuggingFaceFW/fineweb-2
pipeline_tag: text-generation
---
# Ikhou Dictionary Model (dict-xs)
A lightweight multilingual dictionary model based on Qwen3-0.6B, fine-tuned on 1.7M dictionary-style glosses across 50+ languages.
## Model Description
This model provides short, dictionary-style translations and glosses for words and phrases in context. It's designed for:
- Quick word lookups in reading applications
- Vocabulary learning tools
- Translation assistance
- Language learning applications
**Key Features:**
- 🌍 **50+ languages** supported (see list below)
- 📖 **Dictionary-style glosses** with grammatical markers
- ⚡ **Fast inference** (596M parameters, bfloat16)
- 🎯 **Context-aware** translations
## Supported Languages (51)
### Major European Languages
English, German, French, Spanish, Italian, Portuguese, Dutch, Polish, Swedish, Danish, Norwegian Bokmål, Finnish, Czech, Romanian, Hungarian, Catalan, Greek
### Cyrillic Script
Russian, Ukrainian, Bulgarian, Serbian
### Asian Languages
Chinese (Mandarin), Japanese, Korean, Hindi, Bengali, Urdu, Tamil, Telugu, Marathi, Thai, Vietnamese, Indonesian, Malay, Filipino
### Middle Eastern
Arabic, Persian, Turkish, Hebrew
### African
Swahili, Amharic, Yoruba
### Other
Lithuanian, Slovenian, Estonian, Latvian, Slovak, Croatian, Azerbaijani, Kazakh, Uzbek
## Grammar Markers Explained
The model outputs grammatical information using standard linguistic abbreviations:
### Noun Markers (Gender-based Languages)
- **nm.** = Masculine noun (e.g., "nm. roi, monarque" = king, monarch in French)
- **nf.** = Feminine noun (e.g., "nf. maison, demeure" = house, dwelling in French)
- **nn.** = Neuter noun (German, Russian) (e.g., "nn. Haus, Gebäude" = house, building)
### Noun Markers (Non-gendered Languages)
- **n.** = Noun (e.g., "n. house, home" in English)
### Other Parts of Speech
- **adj.** = Adjective (e.g., "adj. rapide, vite" = fast, quick)
- **adv.** = Adverb (e.g., "adv. rapidement, vite" = quickly, fast)
- **pp** = Past participle (e.g., "mangé → eaten, consumed (pp)")
### Verb Forms
For conjugated verbs, the model provides:
- Translation(s)
- Tense/mood information in parentheses
- Example: "venait → came, was coming (imparfait, il)", where "imparfait" marks the imperfect tense and "il" the third-person subject ("he")
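For programmatic post-processing, the markers above can be kept in a small lookup table. This is only an illustrative sketch; the dictionary name `MARKER_MEANINGS` is not part of the model or its tooling:
```python
# Illustrative lookup table for the grammar markers described above.
MARKER_MEANINGS = {
    "nm.": "masculine noun",
    "nf.": "feminine noun",
    "nn.": "neuter noun",
    "n.": "noun (non-gendered languages)",
    "adj.": "adjective",
    "adv.": "adverb",
    "pp": "past participle",
}
```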
## Usage
### Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"ikhou/dict-xs",
torch_dtype="bfloat16",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("ikhou/dict-xs")
# Example: Get a French→English gloss
messages = [
{
"role": "system",
"content": "You are a bilingual dictionary. Given a word/phrase in context, output a short gloss.\n\nRules:\n- One line only, no labels\n- Use grammar markers: nm./nf./nn. for gendered nouns, n. for others, adj., adv., verbs with tense info\n- 1-4 short translations, comma-separated\n- Apply markers based on definition language"
},
{
"role": "user",
"content": 'Expression: "maison"\nContext: Il habite dans une petite 【maison】 près de la mer.\nSource language: fra (French)\nDefinition language: eng (English)\n\nReturn the single-line gloss now.'
}
]
inputs = tokenizer.apply_chat_template(
messages,
add_generation_prompt=True,
return_tensors="pt"
).to(model.device)
outputs = model.generate(
inputs,
max_new_tokens=50,
temperature=0.3,
do_sample=True,
top_p=0.9
)
response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
print(response) # Output: "nf. house, home"
```
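For repeated lookups, the prompt construction above can be wrapped in a small helper that reuses the `model` and `tokenizer` loaded earlier. This is an illustrative sketch, not an official API of the model; the function name `gloss` and its defaults simply mirror the example:
```python
def gloss(expression, context, src_lang, dst_lang, max_new_tokens=50):
    """Return a single-line dictionary gloss for `expression` as used in `context`.

    `src_lang` and `dst_lang` follow the card's format, e.g. "fra (French)".
    """
    messages = [
        {
            "role": "system",
            "content": (
                "You are a bilingual dictionary. Given a word/phrase in context, output a short gloss.\n\n"
                "Rules:\n- One line only, no labels\n"
                "- Use grammar markers: nm./nf./nn. for gendered nouns, n. for others, adj., adv., verbs with tense info\n"
                "- 1-4 short translations, comma-separated\n"
                "- Apply markers based on definition language"
            ),
        },
        {
            "role": "user",
            "content": (
                f'Expression: "{expression}"\n'
                f"Context: {context}\n"
                f"Source language: {src_lang}\n"
                f"Definition language: {dst_lang}\n\n"
                "Return the single-line gloss now."
            ),
        },
    ]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output_ids = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        temperature=0.3,
        do_sample=True,
        top_p=0.9,
    )
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(
        output_ids[0][input_ids.shape[1]:], skip_special_tokens=True
    ).strip()

print(gloss(
    "maison",
    "Il habite dans une petite 【maison】 près de la mer.",
    "fra (French)",
    "eng (English)",
))  # e.g. "nf. house, home"
```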
### Input Format
The model expects:
1. **Expression**: The word/phrase to define
2. **Context**: Sentence with the expression (use 【】 to highlight)
3. **Source language**: ISO 639-3 code (e.g., fra, eng, deu)
4. **Definition language**: ISO 639-3 code
### Output Format
The model returns a single line with:
- Grammar marker (nm./nf./nn./n./adj./adv./pp)
- 1-4 short translations/synonyms, comma-separated
- For verbs: glosses + grammatical info in parentheses
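Since the output is a single line, it can be split into marker, translations, and any trailing parenthetical with a few lines of string handling. The sketch below is illustrative only (the helper `parse_gloss` is not shipped with the model) and assumes the markers listed above:
```python
KNOWN_MARKERS = ("nm.", "nf.", "nn.", "adj.", "adv.", "n.", "pp")

def parse_gloss(line):
    """Split a one-line gloss into (marker, translations, parenthetical info)."""
    line = line.strip()

    # Verb glosses may carry tense/mood info in trailing parentheses.
    extra = None
    if line.endswith(")") and "(" in line:
        line, _, tail = line.rpartition("(")
        extra = tail.rstrip(")")
        line = line.strip()

    # Peel off a leading grammar marker, if present.
    marker = None
    for candidate in KNOWN_MARKERS:
        if line.startswith(candidate + " "):
            marker = candidate
            line = line[len(candidate):].strip()
            break

    translations = [t.strip() for t in line.split(",") if t.strip()]
    return marker, translations, extra

print(parse_gloss("nf. house, home"))
# ('nf.', ['house', 'home'], None)
print(parse_gloss("came, was coming (imparfait, il)"))
# (None, ['came', 'was coming'], 'imparfait, il')
```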
## Training Details
### Training Data
- **Dataset**: 1.7M synthetic dictionary entries
- **Sources**: FineWeb (English), FineWeb-2 (49 other languages)
- **Generation**: GPT-4-based teacher model for quality glosses
- **Filtering**: Proper noun filtering, quality scoring
### Training Configuration
- **Base Model**: Qwen/Qwen3-0.6B
- **Training Type**: Full fine-tuning (not LoRA)
- **Precision**: bfloat16
- **Batch Size**: 32 per device
- **Gradient Accumulation**: 8 steps
- **Total Steps**: 6,568
- **Optimizer**: AdamW with cosine learning-rate schedule
- **Hardware**: NVIDIA H100 (95GB)
- **Training Time**: ~6 hours
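For orientation, the reported setup roughly corresponds to the following Hugging Face `TrainingArguments`. This is a reconstruction for illustration only; the actual training script was not released, and the values marked as assumptions (peak learning rate, warmup) are not from this card:
```python
from transformers import TrainingArguments

# Illustrative reconstruction of the reported configuration.
training_args = TrainingArguments(
    output_dir="dict-xs-finetune",
    bf16=True,                        # bfloat16 precision
    per_device_train_batch_size=32,   # reported per-device batch size
    gradient_accumulation_steps=8,    # reported accumulation steps
    max_steps=6568,                   # reported total steps
    optim="adamw_torch",              # AdamW optimizer
    lr_scheduler_type="cosine",       # cosine learning-rate schedule
    learning_rate=2e-5,               # assumption: not reported on the card
    warmup_ratio=0.03,                # assumption: not reported on the card
    logging_steps=50,
)
```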
### Training Results
- **Final Loss**: 1.30
- **Eval Loss**: 1.34
- **Validation**: training and eval loss curves checked end to end; no zero-loss issues
## Model Architecture
- **Architecture**: Qwen3ForCausalLM
- **Parameters**: 596M
- **Layers**: 28 transformer layers
- **Hidden Size**: 1024
- **Attention Heads**: 16 (8 KV heads)
- **Context Length**: 40,960 tokens (model max, trained on 512)
- **Vocabulary**: 151,936 tokens
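These figures can be verified against the published configuration, for example (field names are the standard Qwen3 config keys):
```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("ikhou/dict-xs")
print(config.num_hidden_layers)        # 28 transformer layers
print(config.hidden_size)              # 1024
print(config.num_attention_heads)      # 16 attention heads
print(config.num_key_value_heads)      # 8 KV heads
print(config.max_position_embeddings)  # 40960 context length
print(config.vocab_size)               # 151936
```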
## Limitations
- **Context**: Works best with clear, simple contexts
- **Proper nouns**: May struggle with names, places, brands
- **Rare languages**: Better performance on high-resource languages
- **Multi-word phrases**: Best for 1-6 token phrases
- **Ambiguity**: Provides common meanings, may miss context-specific nuances
## Ethical Considerations
- **Bias**: Trained on web data which may contain biases
- **Not for sensitive applications**: Dictionary glosses may have errors
- **Educational use**: Best for learning and reference, not authoritative translation
## License
Apache 2.0
## Citation
```bibtex
@misc{ikhou-dict-xs,
author = {Ikhou},
title = {Ikhou Dictionary Model (dict-xs)},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/ikhou/dict-xs}}
}
```
## Acknowledgments
- Based on [Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) by Alibaba Cloud
- Training data sourced from [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) and [FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2)
- Trained with [Hugging Face Transformers](https://github.com/huggingface/transformers)
## Contact
For issues or questions, please open an issue on the [model repository](https://huggingface.co/ikhou/dict-xs/discussions).