# IkhouDict-s

## Model Description

IkhouDict-s is a small bilingual dictionary model fine-tuned from Qwen/Qwen3-1.7B for single-line gloss generation. Given a word or short phrase in context, the model returns 1 to 4 short translations or synonyms in a target language. The rubric enforces a single line, no quotes or labels, no trailing punctuation, and optional French grammatical hints when the target language is French.

## Intended Use

This model is intended for lexicography support, language-learning tools, and first-pass draft glossing. It is not a substitute for professional translation or domain-specific terminology work. Outputs should be reviewed by humans in high-stakes settings.

## Training Data

Training data are produced by the data generation pipeline in `training/` in this repository. The pipeline creates synthetic dictionary examples from web corpora, then filters and formats them for supervised fine-tuning (SFT).

Pipeline summary:

1. Extract sentences from multilingual web corpora (FineWeb-2 by default; optional FineWeb for English-only supplementation).
2. Select a target word or phrase from each sentence (a single token or a short phrase of up to 5 tokens; `phrase_ratio` controls the mix).
3. Sample target languages, including cross-lingual targets. The default config uses 10 languages (`deu`, `eng`, `spa`, `fra`, `ita`, `jpn`, `kor`, `por`, `rus`, `cmn`) and generates multiple target languages per example.
4. A teacher LLM (OpenAI-compatible endpoint) generates a short gloss under a strict rubric. Definitions are cleaned and validated.
5. Examples below a quality threshold are dropped, then the remaining examples are de-duplicated by (source_lang, target_lang, selection, context).
6. Each example is written to SFT JSONL format with a system prompt, a user prompt, and a `...` assistant answer, as sketched below.
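For concreteness, here is a minimal sketch of what one SFT record could look like. It is illustrative only: the `messages` field name, the output path, and the example gloss are assumptions, not the pipeline's actual schema; the prompts mirror the ones shown in the How To Use section below.

```python
import json

# Illustrative sketch of one SFT JSONL record. The "messages" layout and the
# example gloss are assumptions; see training/ for the pipeline's real schema.
record = {
    "messages": [
        {
            "role": "system",
            # Full rubric, as shown in the How To Use section (truncated here).
            "content": "You are a bilingual dictionary assistant. ...",
        },
        {
            "role": "user",
            "content": (
                'Expression: "online"\n'
                "Context: He paid for the course online and started immediately.\n"
                "Source language: eng (English)\n"
                "Definition language: spa (Spanish)\n\n"
                "Return the single-line gloss now."
            ),
        },
        # Target: the single-line gloss the model should learn to produce.
        {"role": "assistant", "content": "en línea, por internet"},
    ],
}

# One JSON object per line, UTF-8, without ASCII-escaping non-Latin glosses.
with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```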
The run used for this model produced:

- Train: 1,521,749 examples
- Eval: 15,366 examples
- Test: 15,011 examples

Splits are deterministic and grouped on provenance metadata to reduce leakage (see `sft/src/ikhou_sft/split.py`).

## Training Procedure

Fine-tuning was performed with the `ikhou_sft` pipeline in this repository:

- Base model: Qwen/Qwen3-1.7B
- Full fine-tuning (no LoRA)
- Supervised fine-tuning using the chat template
- Max sequence length: 512
- Optimizer: Muon
- 1 epoch with gradient accumulation

See `sft/src/ikhou_sft/train.py` for implementation details.

## How To Use

The model expects a system prompt and a user prompt that mirror the data generation pipeline.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "ikhou/dict-s"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

system_prompt = (
    "You are a bilingual dictionary assistant.\n\n"
    "Your job: given a word/phrase in context, output a SHORT dictionary-style gloss line.\n\n"
    "Hard rules:\n"
    "- Output EXACTLY ONE LINE and nothing else.\n"
    "- No quotes, no bullets, no labels (no \"Definition:\", \"Meaning:\", etc).\n"
    "- Do NOT repeat the original word/phrase in the output.\n"
    "- Keep it short (ideally <= 120 characters).\n\n"
    "Gloss rules:\n"
    "- Output 1-4 translations/synonyms in the definition language, separated by \", \".\n"
    "- Each gloss should be short (1-3 words). Prefer common, user-friendly glosses.\n"
    "- Do NOT write full sentences. No trailing period.\n\n"
    "French grammar hints (only if confident):\n"
    "IMPORTANT: The French-only formatting hints below apply ONLY when the definition language is French (fr/fra).\n"
    "If the definition language is NOT French, do NOT use nm./nf./adj./adv., do NOT add French tense notes, and do NOT add (pp).\n"
    "- Noun: prefix with \"nm.\" (masc) or \"nf.\" (fem), then a space, then glosses.\n"
    " Example: nm. face\n"
    "- Adjective: prefix with \"adj.\", then a space, then glosses.\n"
    " Example: adj. fragile, delicate\n"
    "- Adverb: prefix with \"adv.\", then a space, then glosses.\n"
    " Example: adv. extremely, exceedingly\n"
    "- Conjugated verb form: glosses, then add \"(tense, subject)\" in French.\n"
    " Example: came back, used to come back (imparfait, il)\n"
    "- Past participle: glosses, then add \"(pp)\".\n"
    " Example: watched over, supervised (pp)\n"
)

user_prompt = (
    'Expression: "online"\n'
    "Context: He paid for the course online and started immediately.\n"
    "Source language: eng (English)\n"
    "Definition language: spa (Spanish)\n\n"
    "Return the single-line gloss now."
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
)

with torch.no_grad():
    outputs = model.generate(
        inputs,
        max_new_tokens=64,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens, not the prompt, so the printed
# result is just the single gloss line.
gloss = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True).strip()
print(gloss)
```

## Limitations and Risks

- Outputs can be inaccurate, overly general, or inconsistent with the rubric.
- The model inherits biases from the source corpora and the teacher model.
- Rare languages and specialized terminology may be handled poorly.

## Acknowledgements

Base model: Qwen/Qwen3-1.7B.