---
license: apache-2.0
language:
- multilingual
- en
- de
- fr
- es
- it
- pt
- nl
- pl
- sv
- da
- "no"
- fi
- cs
- ro
- hu
- ca
- ru
- uk
- bg
- sr
- el
- zh
- ja
- ko
- hi
- bn
- ur
- ta
- te
- mr
- th
- vi
- id
- ms
- fil
- ar
- fa
- tr
- he
- sw
- am
- yo
- lt
- sl
- et
- lv
- sk
- hr
- az
- kk
- uz
library_name: transformers
tags:
- dictionary
- translation
- multilingual
- bilingual
- glossing
- vocabulary
base_model: Qwen/Qwen3-0.6B
datasets:
- HuggingFaceFW/fineweb
- HuggingFaceFW/fineweb-2
pipeline_tag: text-generation
---

# Ikhou Dictionary Model (dict-xs)

A lightweight multilingual dictionary model based on Qwen3-0.6B, fine-tuned on 1.7M dictionary-style glosses across 50+ languages.

## Model Description

This model provides short, dictionary-style translations and glosses for words and phrases in context. It's designed for:

- Quick word lookups in reading applications
- Vocabulary learning tools
- Translation assistance
- Language learning applications

**Key Features:**

- 🌍 **50+ languages** supported (see list below)
- 📖 **Dictionary-style glosses** with grammatical markers
- ⚡ **Fast inference** (596M parameters, bfloat16)
- 🎯 **Context-aware** translations

## Supported Languages (51)

### Major European Languages
English, German, French, Spanish, Italian, Portuguese, Dutch, Polish, Swedish, Danish, Norwegian Bokmål, Finnish, Czech, Romanian, Hungarian, Catalan, Greek

### Cyrillic Script
Russian, Ukrainian, Bulgarian, Serbian

### Asian Languages
Chinese (Mandarin), Japanese, Korean, Hindi, Bengali, Urdu, Tamil, Telugu, Marathi, Thai, Vietnamese, Indonesian, Malay, Filipino

### Middle Eastern
Arabic, Persian, Turkish, Hebrew

### African
Swahili, Amharic, Yoruba

### Other
Lithuanian, Slovenian, Estonian, Latvian, Slovak, Croatian, Azerbaijani, Kazakh, Uzbek

## Grammar Markers Explained

The model outputs grammatical information using standard linguistic abbreviations:

### Noun Markers (Gendered Languages)
- **nm.** = Masculine noun (e.g., "nm. roi, monarque" = king, monarch in French)
- **nf.** = Feminine noun (e.g., "nf. maison, demeure" = house, dwelling in French)
- **nn.** = Neuter noun (German, Russian; e.g., "nn. Haus, Gebäude" = house, building)

### Noun Markers (Non-gendered Languages)
- **n.** = Noun (e.g., "n. house, home" in English)

### Other Parts of Speech
- **adj.** = Adjective (e.g., "adj. rapide, vite" = fast, quick)
- **adv.** = Adverb (e.g., "adv. rapidement, vite" = quickly, fast)
- **pp** = Past participle (e.g., "mangé → eaten, consumed (pp)")

### Verb Forms
For conjugated verbs, the model provides:

- Translation(s)
- Tense/mood information in parentheses
- Example: "venait → came, was coming (imparfait, il)" = imperfect tense, subject "il" (he)
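When post-processing model output, the marker abbreviations above can be expanded into readable labels. A minimal sketch: the marker set is taken from this card, while the `describe` helper and its exact output format are illustrative:

```python
# Marker set as documented on this card; the helper itself is illustrative.
MARKERS = {
    "nm.": "masculine noun",
    "nf.": "feminine noun",
    "nn.": "neuter noun",
    "n.": "noun",
    "adj.": "adjective",
    "adv.": "adverb",
    "pp": "past participle",
}

def describe(gloss: str) -> str:
    """Expand the leading marker of a gloss, e.g. 'nf. maison' -> 'feminine noun: maison'."""
    for marker in sorted(MARKERS, key=len, reverse=True):  # longest match first
        if gloss.startswith(marker + " "):
            return f"{MARKERS[marker]}: {gloss[len(marker):].strip()}"
    return gloss  # verb glosses may carry no leading marker

print(describe("nf. maison, demeure"))  # feminine noun: maison, demeure
```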

## Usage

### Basic Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "ikhou/dict-xs",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("ikhou/dict-xs")

# Example: get a French→English gloss
messages = [
    {
        "role": "system",
        "content": (
            "You are a bilingual dictionary. Given a word/phrase in context, "
            "output a short gloss.\n\nRules:\n- One line only, no labels\n"
            "- Use grammar markers: nm./nf./nn. for gendered nouns, n. for others, "
            "adj., adv., verbs with tense info\n- 1-4 short translations, comma-separated\n"
            "- Apply markers based on definition language"
        ),
    },
    {
        "role": "user",
        "content": (
            'Expression: "maison"\n'
            "Context: Il habite dans une petite 【maison】 près de la mer.\n"
            "Source language: fra (French)\n"
            "Definition language: eng (English)\n\n"
            "Return the single-line gloss now."
        ),
    },
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.3,
    top_p=0.9,
)

# Decode only the newly generated tokens
response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
print(response)  # Output: "nf. house, home"
```

### Input Format

The model expects:

1. **Expression**: The word/phrase to define
2. **Context**: A sentence containing the expression (use 【】 to highlight it)
3. **Source language**: ISO 639-3 code (e.g., fra, eng, deu)
4. **Definition language**: ISO 639-3 code
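These four fields can be assembled into the expected user message with a small helper. A sketch, where `build_prompt` is a hypothetical name and the layout simply mirrors the format shown in the usage example:

```python
# Hypothetical helper; the field layout mirrors the card's usage example.
def build_prompt(expression: str, context: str, src_lang: str, def_lang: str) -> str:
    return (
        f'Expression: "{expression}"\n'
        f"Context: {context}\n"
        f"Source language: {src_lang}\n"
        f"Definition language: {def_lang}\n\n"
        "Return the single-line gloss now."
    )

prompt = build_prompt(
    "maison",
    "Il habite dans une petite 【maison】 près de la mer.",
    "fra (French)",
    "eng (English)",
)
print(prompt.splitlines()[0])  # Expression: "maison"
```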

### Output Format

The model returns a single line with:

- Grammar marker (nm./nf./nn./n./adj./adv./pp)
- 1-4 short translations/synonyms, comma-separated
- For verbs: glosses plus grammatical info in parentheses
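Because the output is a single structured line, it can be split back into a marker and a list of translations. A sketch: the marker list comes from this card, while the regex encodes our assumption that the gloss is well formed:

```python
import re

# Markers as listed on this card; the regex is an assumption about well-formed output.
GLOSS_RE = re.compile(r"^(nm\.|nf\.|nn\.|n\.|adj\.|adv\.|pp)\s+(.+)$")

def parse_gloss(line: str):
    """Split 'nf. house, home' into ('nf.', ['house', 'home'])."""
    m = GLOSS_RE.match(line.strip())
    if not m:
        return None, [line.strip()]  # verb glosses may carry no leading marker
    marker, rest = m.groups()
    # Drop a trailing parenthetical (tense/mood info) before splitting translations.
    rest = re.sub(r"\s*\([^)]*\)\s*$", "", rest)
    return marker, [t.strip() for t in rest.split(",")]

print(parse_gloss("nf. house, home"))  # ('nf.', ['house', 'home'])
```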

## Training Details

### Training Data

- **Dataset**: 1.7M synthetic dictionary entries
- **Sources**: FineWeb (English), FineWeb-2 (the 50 other languages)
- **Generation**: Glosses produced by a GPT-4-based teacher model
- **Filtering**: Proper-noun filtering and quality scoring

### Training Configuration

- **Base Model**: Qwen/Qwen3-0.6B
- **Training Type**: Full fine-tuning (not LoRA)
- **Precision**: bfloat16
- **Batch Size**: 32 per device
- **Gradient Accumulation**: 8 steps
- **Total Steps**: 6,568
- **Optimizer**: AdamW with a cosine learning-rate schedule
- **Hardware**: NVIDIA H100 (95GB)
- **Training Time**: ~6 hours
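The effective batch size implied by these settings, and how much of the 1.7M-entry dataset 6,568 steps cover, can be checked with quick arithmetic (the single-device assumption is ours):

```python
# Numbers from the Training Configuration above; single-device assumption is ours.
per_device_batch = 32
grad_accum = 8
total_steps = 6_568
dataset_size = 1_700_000

effective_batch = per_device_batch * grad_accum   # examples per optimizer step
examples_seen = effective_batch * total_steps     # examples processed overall

print(effective_batch)               # 256
print(examples_seen)                 # 1681408
print(examples_seen / dataset_size)  # ~0.99, i.e. roughly one epoch
```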

### Training Results

- **Final Training Loss**: 1.30
- **Eval Loss**: 1.34
- **Validation**: Loss curves monitored throughout training; no degenerate zero-loss behavior observed

## Model Architecture

- **Architecture**: Qwen3ForCausalLM
- **Parameters**: 596M
- **Layers**: 28 transformer layers
- **Hidden Size**: 1024
- **Attention Heads**: 16 (8 KV heads)
- **Context Length**: 40,960 tokens (model maximum; trained on 512-token sequences)
- **Vocabulary**: 151,936 tokens

## Limitations

- **Context**: Works best with clear, simple contexts
- **Proper nouns**: May struggle with names, places, and brands
- **Rare languages**: Performs better on high-resource languages
- **Multi-word phrases**: Best for phrases of 1-6 tokens
- **Ambiguity**: Provides common meanings; may miss context-specific nuances

## Ethical Considerations

- **Bias**: Trained on web data, which may contain biases
- **Not for sensitive applications**: Dictionary glosses may contain errors
- **Educational use**: Best for learning and reference, not authoritative translation

## License

Apache 2.0

## Citation

```bibtex
@misc{ikhou-dict-xs,
  author       = {Ikhou},
  title        = {Ikhou Dictionary Model (dict-xs)},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ikhou/dict-xs}}
}
```

## Acknowledgments

- Based on [Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) by Alibaba Cloud
- Training data sourced from [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) and [FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2)
- Trained with [Hugging Face Transformers](https://github.com/huggingface/transformers)

## Contact

For issues or questions, please open an issue on the [model repository](https://huggingface.co/ikhou/dict-xs/discussions).