---
license: apache-2.0
language:
- multilingual
- en
- de
- fr
- es
- it
- pt
- nl
- pl
- sv
- da
- 'no'
- fi
- cs
- ro
- hu
- ca
- ru
- uk
- bg
- sr
- el
- zh
- ja
- ko
- hi
- bn
- ur
- ta
- te
- mr
- th
- vi
- id
- ms
- fil
- ar
- fa
- tr
- he
- sw
- am
- yo
- lt
- sl
- et
- lv
- sk
- hr
- az
- kk
- uz
library_name: transformers
tags:
- dictionary
- translation
- multilingual
- bilingual
- glossing
- vocabulary
base_model: Qwen/Qwen3-0.6B
datasets:
- HuggingFaceFW/fineweb
- HuggingFaceFW/fineweb-2
pipeline_tag: text-generation
---
# Ikhou Dictionary Model (dict-xs)
A lightweight multilingual dictionary model based on Qwen3-0.6B, fine-tuned on 1.7M dictionary-style glosses across 50+ languages.
## Model Description
This model provides short, dictionary-style translations and glosses for words and phrases in context. It's designed for:
- Quick word lookups in reading applications
- Vocabulary learning tools
- Translation assistance
- Language learning applications
**Key Features:**
- 🌍 **50+ languages** supported (see list below)
- 📖 **Dictionary-style glosses** with grammatical markers
- ⚡ **Fast inference** (596M parameters, bfloat16)
- 🎯 **Context-aware** translations
## Supported Languages (51)
### Major European Languages
English, German, French, Spanish, Italian, Portuguese, Dutch, Polish, Swedish, Danish, Norwegian Bokmål, Finnish, Czech, Romanian, Hungarian, Catalan, Greek
### Cyrillic Script
Russian, Ukrainian, Bulgarian, Serbian
### Asian Languages
Chinese (Mandarin), Japanese, Korean, Hindi, Bengali, Urdu, Tamil, Telugu, Marathi, Thai, Vietnamese, Indonesian, Malay, Filipino
### Middle Eastern
Arabic, Persian, Turkish, Hebrew
### African
Swahili, Amharic, Yoruba
### Other
Lithuanian, Slovenian, Estonian, Latvian, Slovak, Croatian, Azerbaijani, Kazakh, Uzbek
## Grammar Markers Explained
The model outputs grammatical information using standard linguistic abbreviations:
### Noun Markers (Gender-based Languages)
- **nm.** = Masculine noun (e.g., "nm. roi, monarque" = king, monarch in French)
- **nf.** = Feminine noun (e.g., "nf. maison, demeure" = house, dwelling in French)
- **nn.** = Neuter noun (German, Russian) (e.g., "nn. Haus, Gebäude" = house, building)
### Noun Markers (Non-gendered Languages)
- **n.** = Noun (e.g., "n. house, home" in English)
### Other Parts of Speech
- **adj.** = Adjective (e.g., "adj. rapide, vite" = fast, quick)
- **adv.** = Adverb (e.g., "adv. rapidement, vite" = quickly, fast)
- **pp** = Past participle (e.g., "mangé → eaten, consumed (pp)")
### Verb Forms
For conjugated verbs, the model provides:
- Translation(s)
- Tense/mood information in parentheses
- Example: "venait → came, was coming (imparfait, il)" = imperfect tense, he
## Usage
### Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load in bfloat16; device_map="auto" places the model on GPU if available.
model = AutoModelForCausalLM.from_pretrained(
    "ikhou/dict-xs",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("ikhou/dict-xs")

# Example: get a French→English gloss.
messages = [
    {
        "role": "system",
        "content": (
            "You are a bilingual dictionary. Given a word/phrase in context, "
            "output a short gloss.\n\nRules:\n"
            "- One line only, no labels\n"
            "- Use grammar markers: nm./nf./nn. for gendered nouns, n. for others, "
            "adj., adv., verbs with tense info\n"
            "- 1-4 short translations, comma-separated\n"
            "- Apply markers based on definition language"
        ),
    },
    {
        "role": "user",
        "content": (
            'Expression: "maison"\n'
            "Context: Il habite dans une petite 【maison】 près de la mer.\n"
            "Source language: fra (French)\n"
            "Definition language: eng (English)\n\n"
            "Return the single-line gloss now."
        ),
    },
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=50,
    temperature=0.3,
    do_sample=True,
    top_p=0.9,
)

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
print(response)  # e.g. "nf. house, home"
```
### Input Format
The model expects four fields (a minimal prompt-builder sketch follows this list):
1. **Expression**: The word/phrase to define
2. **Context**: Sentence with the expression (use 【】 to highlight)
3. **Source language**: ISO 639-3 code (e.g., fra, eng, deu)
4. **Definition language**: ISO 639-3 code
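The template from the usage example can be assembled with a small helper. This is a minimal sketch; the helper name and signature are our own, and only the template text mirrors the example above:

```python
def build_dict_prompt(expression: str, context: str, src: str, dst: str) -> str:
    """Format a lookup request in the template shown in the usage example.

    Hypothetical helper, not shipped with the model. `src` and `dst` are
    ISO 639-3 codes, optionally followed by the language name,
    e.g. "fra (French)".
    """
    # Highlight the first occurrence of the expression with 【】 brackets.
    highlighted = context.replace(expression, f"【{expression}】", 1)
    return (
        f'Expression: "{expression}"\n'
        f"Context: {highlighted}\n"
        f"Source language: {src}\n"
        f"Definition language: {dst}\n\n"
        "Return the single-line gloss now."
    )

prompt = build_dict_prompt(
    "maison",
    "Il habite dans une petite maison près de la mer.",
    "fra (French)",
    "eng (English)",
)
```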
### Output Format
The model returns a single line with the following parts (a parsing sketch follows this list):
- Grammar marker (nm./nf./nn./n./adj./adv./pp)
- 1-4 short translations/synonyms, comma-separated
- For verbs: glosses + grammatical info in parentheses
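A lightweight parser can split a gloss into marker, translations, and optional verb info. Here is a sketch, assuming the marker (when present) is the first whitespace-delimited token and that verb info arrives as a trailing parenthetical, as in the examples above; this helper is our own, not shipped with the model:

```python
import re

KNOWN_MARKERS = {"nm.", "nf.", "nn.", "n.", "adj.", "adv.", "pp"}

def parse_gloss(line: str):
    """Split 'nf. house, home' into (marker, translations, info).

    Hypothetical parser based on the format described above. A trailing
    "(...)" block (verb tense info) is captured separately so that commas
    inside it do not split the translation list.
    """
    line = line.strip()
    info = None
    match = re.search(r"\s*\(([^)]*)\)\s*$", line)
    if match:
        info = match.group(1)
        line = line[: match.start()]
    first, _, rest = line.partition(" ")
    marker, body = (first, rest) if first in KNOWN_MARKERS else (None, line)
    translations = [t.strip() for t in body.split(",") if t.strip()]
    return marker, translations, info

print(parse_gloss("nf. house, home"))
# ('nf.', ['house', 'home'], None)
print(parse_gloss("came, was coming (imparfait, il)"))
# (None, ['came', 'was coming'], 'imparfait, il')
```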
## Training Details
### Training Data
- **Dataset**: 1.7M synthetic dictionary entries
- **Sources**: FineWeb (English) and FineWeb-2 (the other supported languages)
- **Generation**: Glosses produced by a GPT-4-based teacher model
- **Filtering**: Proper-noun filtering and quality scoring
### Training Configuration
- **Base Model**: Qwen/Qwen3-0.6B
- **Training Type**: Full fine-tuning (not LoRA)
- **Precision**: bfloat16
- **Batch Size**: 32 per device
- **Gradient Accumulation**: 8 steps
- **Total Steps**: 6,568
- **Optimizer**: AdamW with a cosine learning-rate schedule
- **Hardware**: NVIDIA H100 (95GB)
- **Training Time**: ~6 hours
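These settings imply an effective batch of 32 × 8 = 256 examples per optimizer step (assuming the single listed GPU), so the 6,568 steps cover roughly 6,568 × 256 ≈ 1.68M examples, about one epoch over the 1.7M-entry dataset.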
### Training Results
- **Final Loss**: 1.30
- **Eval Loss**: 1.34
- **Validation**: Training run verified end to end, with no zero-loss (degenerate) steps observed
## Model Architecture
- **Architecture**: Qwen3ForCausalLM
- **Parameters**: 596M
- **Layers**: 28 transformer layers
- **Hidden Size**: 1024
- **Attention Heads**: 16 (8 KV heads)
- **Context Length**: 40,960 tokens (model maximum; fine-tuned on 512-token sequences)
- **Vocabulary**: 151,936 tokens
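These figures can be checked against the published config with standard Transformers APIs; the attribute names below follow the usual Qwen3 config fields:

```python
from transformers import AutoConfig

# Fetch the released config and print the headline architecture numbers.
config = AutoConfig.from_pretrained("ikhou/dict-xs")
print(config.num_hidden_layers)        # expected: 28
print(config.hidden_size)              # expected: 1024
print(config.num_attention_heads)      # expected: 16
print(config.num_key_value_heads)      # expected: 8
print(config.max_position_embeddings)  # expected: 40960
print(config.vocab_size)               # expected: 151936
```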
## Limitations
- **Context**: Works best with clear, simple contexts
- **Proper nouns**: May struggle with names, places, brands
- **Rare languages**: Better performance on high-resource languages
- **Multi-word phrases**: Best for 1-6 token phrases
- **Ambiguity**: Provides common meanings, may miss context-specific nuances
## Ethical Considerations
- **Bias**: Trained on web data which may contain biases
- **Not for sensitive applications**: Dictionary glosses may have errors
- **Educational use**: Best for learning and reference, not authoritative translation
## License
Apache 2.0
## Citation
```bibtex
@misc{ikhou-dict-xs,
  author = {Ikhou},
  title = {Ikhou Dictionary Model (dict-xs)},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ikhou/dict-xs}}
}
```
## Acknowledgments
- Based on [Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) by Alibaba Cloud
- Training data sourced from [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) and [FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2)
- Trained with [Hugging Face Transformers](https://github.com/huggingface/transformers)
## Contact
For issues or questions, please open an issue on the [model repository](https://huggingface.co/ikhou/dict-xs/discussions).