# IkhouDict-s

## Model Description

IkhouDict-s is a small bilingual dictionary model fine-tuned from
Qwen/Qwen3-1.7B for single-line gloss generation. Given a word or short phrase
in context, the model returns one to four short translations or synonyms in a
target language. The rubric enforces a single line, no quotes or labels, no
trailing punctuation, and optional French grammatical hints when the target
language is French. For example, glossing the English noun "window" into
French should yield a line such as `nf. fenêtre`, while glossing it into
Spanish should yield a bare gloss such as `ventana`.

## Intended Use

This model is intended for lexicography support, language-learning tools, and
quick draft glossing. It is not a substitute for professional translation or
domain-specific terminology work. Outputs should be reviewed by a human in
high-stakes settings.

## Training Data

Training data are produced by the data generation pipeline in `training/` in
this repository. The pipeline creates synthetic dictionary examples from web
corpora, then filters and formats them for supervised fine-tuning (SFT).

Pipeline summary:

1. Extract sentences from multilingual web corpora (FineWeb-2 by default, with
   optional FineWeb for English-only supplementation).
2. Select a target word or phrase from each sentence (a single token or a
   short phrase of up to 5 tokens; `phrase_ratio` controls the mix).
3. Sample target languages, including cross-lingual targets. The default
   config uses 10 languages (`deu`, `eng`, `spa`, `fra`, `ita`, `jpn`, `kor`,
   `por`, `rus`, `cmn`) and generates multiple target languages per example.
4. Generate a short gloss with a teacher LLM (an OpenAI-compatible endpoint)
   under a strict rubric, then clean and validate the definitions.
5. Drop examples below a quality threshold, then de-duplicate the remainder
   by (source_lang, target_lang, selection, context).
6. Write each example to SFT JSONL format with a system prompt, a user
   prompt, and a `<final>...</final>` assistant answer (a sketch of one
   record follows this list).

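The field names below are assumptions rather than the pipeline's authoritative
schema (see the code under `training/`), but a minimal sketch of one SFT JSONL
record, together with the de-duplication key from step 5, looks roughly like
this:

```python
import json

# Illustrative record only: field names are assumed, not taken from the
# pipeline's actual schema.
record = {
    "messages": [
        {"role": "system", "content": "You are a bilingual dictionary assistant. ..."},
        {
            "role": "user",
            "content": (
                'Expression: "online"\n'
                "Context: He paid for the course online and started immediately.\n"
                "Source language: eng (English)\n"
                "Definition language: spa (Spanish)\n\n"
                "Return the single-line gloss now."
            ),
        },
        # The assistant target is wrapped in <final>...</final> (step 6).
        {"role": "assistant", "content": "<final>en línea, por internet</final>"},
    ],
    # Assumed metadata fields; step 5 de-duplicates on this tuple.
    "source_lang": "eng",
    "target_lang": "spa",
    "selection": "online",
    "context": "He paid for the course online and started immediately.",
}

dedup_key = (
    record["source_lang"],
    record["target_lang"],
    record["selection"],
    record["context"],
)
print(json.dumps(record, ensure_ascii=False))
```
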
The run used for this model produced:

- Train: 1,521,749 examples
- Eval: 15,366 examples
- Test: 15,011 examples

Splits are deterministic: examples are grouped on provenance metadata before
being assigned to a split, so related examples stay together and leakage is
reduced (see `sft/src/ikhou_sft/split.py`). A minimal sketch of the idea
follows.

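The sketch below is illustrative only: the `provenance` key and the split
fractions are assumptions, and the authoritative logic lives in
`sft/src/ikhou_sft/split.py`. The core idea is to hash the provenance key so
that every example sharing it lands in the same split, deterministically
across runs:

```python
import hashlib

def assign_split(provenance: str, eval_pct: float = 0.01, test_pct: float = 0.01) -> str:
    """Map a provenance key (e.g. a source-document ID) to a split.

    All examples sharing the key get the same split, and the mapping is
    stable across runs because it depends only on the key's hash.
    """
    digest = hashlib.sha256(provenance.encode("utf-8")).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    if u < test_pct:
        return "test"
    if u < test_pct + eval_pct:
        return "eval"
    return "train"

print(assign_split("fineweb2:doc:123456"))  # same answer every run
```
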
## Training Procedure

Fine-tuning was performed with the `ikhou_sft` pipeline in this repository:

- Base model: Qwen/Qwen3-1.7B
- Full fine-tuning (no LoRA)
- Supervised fine-tuning using the chat template
- Max sequence length: 512
- Optimizer: Muon
- 1 epoch with gradient accumulation

See `sft/src/ikhou_sft/train.py` for implementation details.

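For orientation, the settings above can be summarized as a configuration
sketch. Every field name here is hypothetical (the real configuration surface
is defined in `sft/src/ikhou_sft/train.py`), and the accumulation value is an
assumption, since this card does not state it:

```python
# Hypothetical configuration mirroring the hyperparameters listed above.
sft_config = {
    "base_model": "Qwen/Qwen3-1.7B",
    "method": "full_finetune",          # full fine-tuning, no LoRA adapters
    "use_chat_template": True,          # targets rendered via the chat template
    "max_seq_len": 512,
    "optimizer": "muon",
    "num_epochs": 1,
    "gradient_accumulation_steps": 8,   # ASSUMED: not stated in this card
}
```
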
## How To Use

The model expects a system prompt and a user prompt that mirror those used by
the data generation pipeline.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "ikhou/dict-s"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
model.eval()

system_prompt = (
    "You are a bilingual dictionary assistant.\n\n"
    "Your job: given a word/phrase in context, output a SHORT dictionary-style gloss line.\n\n"
    "Hard rules:\n"
    "- Output EXACTLY ONE LINE and nothing else.\n"
    "- No quotes, no bullets, no labels (no \"Definition:\", \"Meaning:\", etc).\n"
    "- Do NOT repeat the original word/phrase in the output.\n"
    "- Keep it short (ideally <= 120 characters).\n\n"
    "Gloss rules:\n"
    "- Output 1-4 translations/synonyms in the definition language, separated by \", \".\n"
    "- Each gloss should be short (1-3 words). Prefer common, user-friendly glosses.\n"
    "- Do NOT write full sentences. No trailing period.\n\n"
    "French grammar hints (only if confident):\n"
    "IMPORTANT: The French-only formatting hints below apply ONLY when the definition language is French (fr/fra).\n"
    "If the definition language is NOT French, do NOT use nm./nf./adj./adv., do NOT add French tense notes, and do NOT add (pp).\n"
    "- Noun: prefix with \"nm.\" (masc) or \"nf.\" (fem), then a space, then glosses.\n"
    "  Example: nm. face\n"
    "- Adjective: prefix with \"adj.\", then a space, then glosses.\n"
    "  Example: adj. fragile, delicate\n"
    "- Adverb: prefix with \"adv.\", then a space, then glosses.\n"
    "  Example: adv. extremely, exceedingly\n"
    "- Conjugated verb form: glosses, then add \"(tense, subject)\" in French.\n"
    "  Example: came back, used to come back (imparfait, il)\n"
    "- Past participle: glosses, then add \"(pp)\".\n"
    "  Example: watched over, supervised (pp)\n"
)

user_prompt = (
    'Expression: "online"\n'
    "Context: He paid for the course online and started immediately.\n"
    "Source language: eng (English)\n"
    "Definition language: spa (Spanish)\n\n"
    "Return the single-line gloss now."
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]

# Render the conversation through the model's chat template and tokenize.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        inputs,
        max_new_tokens=64,
        do_sample=False,  # greedy decoding for a deterministic gloss
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens, not the echoed prompt.
decoded = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)

# SFT targets wrap the answer in <final>...</final>; strip the tags if the
# model reproduces them.
gloss = decoded.strip()
if gloss.startswith("<final>") and gloss.endswith("</final>"):
    gloss = gloss[len("<final>") : -len("</final>")].strip()
print(gloss)
```

## Limitations and Risks

- Outputs can be inaccurate, overly general, or inconsistent with the rubric.
- The model inherits biases from the source corpora and the teacher model.
- Rare languages or specialized terminology may be handled poorly.

## Acknowledgements

Base model: Qwen/Qwen3-1.7B.