# IkhouDict-s
## Model Description
IkhouDict-s is a small bilingual dictionary model fine-tuned from
Qwen/Qwen3-1.7B for single-line gloss generation. Given a word or short phrase
in context, the model returns 1 to 4 short translations or synonyms in a target
language. The rubric enforces a single line, no quotes or labels, no trailing
punctuation, and optional French grammatical hints when the target language is
French.
## Intended Use
This model is intended for lexicography support, language-learning tools, and
first-pass draft glossing. It is not a substitute for professional translation
or domain-specific terminology work, and outputs should be reviewed by a human
in high-stakes settings.
## Training Data
Training data are produced by the data generation pipeline in `training/` in
this repository. The pipeline creates synthetic dictionary examples from web
corpora, then filters and formats them for supervised fine-tuning (SFT).
Pipeline summary:
1. Extract sentences from multilingual web corpora (FineWeb-2 by default; optional
FineWeb for English-only supplementation).
2. Select a target word or phrase from each sentence (single token or short
phrase up to 5 tokens; `phrase_ratio` controls the mix).
3. Sample target languages, including cross-lingual targets. The default config
uses 10 languages (`deu`, `eng`, `spa`, `fra`, `ita`, `jpn`, `kor`, `por`,
`rus`, `cmn`) and generates multiple target languages per example.
4. Generate a short gloss with a teacher LLM (served via an OpenAI-compatible
endpoint) under a strict rubric; clean and validate the definitions.
5. Drop examples below a quality threshold, then de-duplicate the remainder by
(source_lang, target_lang, selection, context).
6. Write each surviving example to SFT JSONL format with a system prompt, a user
prompt, and a `<final>...</final>` assistant answer.
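The filtering, de-duplication, and JSONL-writing steps can be sketched roughly as
follows. The record fields, the quality scale, and the exact JSONL schema shown
here are assumptions for illustration; the real pipeline in `training/` may
differ in its details.

```python
import json

# Hypothetical post-teacher records; the actual field names may differ.
examples = [
    {"source_lang": "eng", "target_lang": "spa", "selection": "online",
     "context": "He paid for the course online.", "gloss": "en línea, por internet",
     "quality": 0.92},
    {"source_lang": "eng", "target_lang": "spa", "selection": "online",
     "context": "He paid for the course online.", "gloss": "en línea",
     "quality": 0.88},  # same dedup key as above -> dropped
]

QUALITY_THRESHOLD = 0.8  # assumed value, not the pipeline's actual threshold


def dedup_and_format(examples, system_prompt="..."):
    """Drop low-quality examples, de-duplicate by the key described above,
    and emit chat-style SFT rows with a <final>...</final> answer."""
    seen = set()
    rows = []
    for ex in examples:
        if ex["quality"] < QUALITY_THRESHOLD:
            continue
        key = (ex["source_lang"], ex["target_lang"], ex["selection"], ex["context"])
        if key in seen:
            continue
        seen.add(key)
        user = (
            f'Expression: "{ex["selection"]}"\n'
            f'Context: {ex["context"]}\n'
            f'Source language: {ex["source_lang"]}\n'
            f'Definition language: {ex["target_lang"]}\n\n'
            "Return the single-line gloss now."
        )
        rows.append({"messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user},
            {"role": "assistant", "content": f"<final>{ex['gloss']}</final>"},
        ]})
    return rows


rows = dedup_and_format(examples)
# One JSON object per line, as in the SFT JSONL files.
jsonl = "\n".join(json.dumps(r, ensure_ascii=False) for r in rows)
```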
The run used for this model produced:
- Train: 1,521,749 examples
- Eval: 15,366 examples
- Test: 15,011 examples
Splits are deterministic by grouping on provenance metadata to reduce leakage
(see `sft/src/ikhou_sft/split.py`).
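A deterministic, provenance-grouped split is typically done by hashing a group
key. The sketch below illustrates the idea only; the function name, fractions,
and hashing scheme are assumptions, and the actual logic lives in
`sft/src/ikhou_sft/split.py`.

```python
import hashlib


def assign_split(group_key: str, eval_frac: float = 0.01, test_frac: float = 0.01) -> str:
    """Deterministically map a provenance group to a split.

    All examples sharing a group key (e.g. the same source document)
    land in the same split, which reduces train/eval leakage.
    The fractions here are illustrative, not the pipeline's real values.
    """
    h = int(hashlib.sha256(group_key.encode("utf-8")).hexdigest(), 16)
    u = (h % 10_000) / 10_000  # pseudo-uniform value in [0, 1)
    if u < test_frac:
        return "test"
    if u < test_frac + eval_frac:
        return "eval"
    return "train"
```

Because the assignment depends only on the key, re-running the pipeline
reproduces the same splits.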
## Training Procedure
Fine-tuning was performed with the `ikhou_sft` pipeline in this repository:
- Base model: Qwen/Qwen3-1.7B
- Full fine-tuning (no LoRA)
- Supervised fine-tuning using the chat template
- Max sequence length: 512
- Optimizer: Muon
- 1 epoch with gradient accumulation
See `sft/src/ikhou_sft/train.py` for implementation details.
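As a rough illustration of chat-template SFT, the standard approach (assumed
here, not confirmed from `train.py`) masks the prompt tokens so that only the
`<final>...</final>` assistant answer contributes to the loss, with sequences
capped at the 512-token limit:

```python
# Toy token ids stand in for real tokenizer output. The principle: prompt
# positions get the ignore index (-100) so cross-entropy only covers the
# assistant answer, and sequences are truncated to the max length.
IGNORE_INDEX = -100
MAX_LEN = 512


def build_labels(prompt_ids, answer_ids, max_len=MAX_LEN):
    """Concatenate prompt and answer ids; mask the prompt in the labels."""
    input_ids = (prompt_ids + answer_ids)[:max_len]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + answer_ids)[:max_len]
    return input_ids, labels


input_ids, labels = build_labels([1, 2, 3], [10, 11, 12])
```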
## How To Use
The model expects a system prompt and a user prompt that mirror the data
generation pipeline.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_id = "ikhou/dict-s"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
system_prompt = (
    "You are a bilingual dictionary assistant.\n\n"
    "Your job: given a word/phrase in context, output a SHORT dictionary-style gloss line.\n\n"
    "Hard rules:\n"
    "- Output EXACTLY ONE LINE and nothing else.\n"
    "- No quotes, no bullets, no labels (no \"Definition:\", \"Meaning:\", etc).\n"
    "- Do NOT repeat the original word/phrase in the output.\n"
    "- Keep it short (ideally <= 120 characters).\n\n"
    "Gloss rules:\n"
    "- Output 1-4 translations/synonyms in the definition language, separated by \", \".\n"
    "- Each gloss should be short (1-3 words). Prefer common, user-friendly glosses.\n"
    "- Do NOT write full sentences. No trailing period.\n\n"
    "French grammar hints (only if confident):\n"
    "IMPORTANT: The French-only formatting hints below apply ONLY when the definition language is French (fr/fra).\n"
    "If the definition language is NOT French, do NOT use nm./nf./adj./adv., do NOT add French tense notes, and do NOT add (pp).\n"
    "- Noun: prefix with \"nm.\" (masc) or \"nf.\" (fem), then a space, then glosses.\n"
    "  Example: nm. face\n"
    "- Adjective: prefix with \"adj.\", then a space, then glosses.\n"
    "  Example: adj. fragile, delicate\n"
    "- Adverb: prefix with \"adv.\", then a space, then glosses.\n"
    "  Example: adv. extremely, exceedingly\n"
    "- Conjugated verb form: glosses, then add \"(tense, subject)\" in French.\n"
    "  Example: came back, used to come back (imparfait, il)\n"
    "- Past participle: glosses, then add \"(pp)\".\n"
    "  Example: watched over, supervised (pp)\n"
)
user_prompt = (
    'Expression: "online"\n'
    "Context: He paid for the course online and started immediately.\n"
    "Source language: eng (English)\n"
    "Definition language: spa (Spanish)\n\n"
    "Return the single-line gloss now."
)
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
with torch.no_grad():
    outputs = model.generate(
        inputs,
        max_new_tokens=64,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
# Decode only the newly generated tokens, not the echoed prompt.
decoded = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(decoded)
```
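Because the training answers wrap the gloss in `<final>...</final>` tags, a
small post-processing helper can extract the gloss and check the rubric's hard
rules. This is a sketch, not part of the released code:

```python
import re


def extract_gloss(generated):
    """Pull the gloss out of a <final>...</final> answer (falling back to
    the raw text) and validate it against the rubric's hard rules.
    Returns None if the output does not validate."""
    m = re.search(r"<final>(.*?)</final>", generated, flags=re.DOTALL)
    gloss = (m.group(1) if m else generated).strip()
    if "\n" in gloss or len(gloss) > 120:
        return None  # must be a single short line
    if gloss.endswith((".", "!", "?")) or gloss.startswith(('"', "'")):
        return None  # no trailing punctuation, no quotes
    return gloss
```

In high-stakes settings, outputs that fail validation should be retried or
routed to human review rather than silently passed through.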
## Limitations and Risks
- Outputs can be inaccurate, overly general, or inconsistent with the rubric.
- The model inherits biases from source corpora and the teacher model.
- Rare languages or specialized terminology may be poorly handled.
## Acknowledgements
Base model: Qwen/Qwen3-1.7B.