# IkhouDict-s
## Model Description
IkhouDict-s is a small bilingual dictionary model fine-tuned from
Qwen/Qwen3-1.7B for single-line gloss generation. Given a word or short phrase
in context, the model returns 1 to 4 short translations or synonyms in a target
language. The rubric enforces a single line, no quotes or labels, no trailing
punctuation, and optional French grammatical hints when the target language is
French.
## Intended Use
This model is intended for lexicography support, language-learning tools, and
first-pass draft glossing. It is not a substitute for professional translation
or domain-specific terminology work, and outputs should be reviewed by a human
in high-stakes settings.
## Training Data
Training data are produced by the data generation pipeline in `training/` in
this repository. The pipeline creates synthetic dictionary examples from web
corpora, then filters and formats them for supervised fine-tuning (SFT).
Pipeline summary:
1. Extract sentences from multilingual web corpora (FineWeb-2 by default; optional
FineWeb for English-only supplementation).
2. Select a target word or phrase from each sentence (single token or short
phrase up to 5 tokens; `phrase_ratio` controls the mix).
3. Sample target languages, including cross-lingual targets. The default config
uses 10 languages (`deu`, `eng`, `spa`, `fra`, `ita`, `jpn`, `kor`, `por`,
`rus`, `cmn`) and generates multiple target languages per example.
4. Generate a short gloss with a teacher LLM (served via an OpenAI-compatible
endpoint) under a strict rubric; clean and validate the definitions.
5. Drop examples below a quality threshold, then de-duplicate the remainder by
(source_lang, target_lang, selection, context).
6. Write each surviving example to SFT JSONL format with a system prompt, a user
prompt, and a `<final>...</final>` assistant answer.
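The filtering, de-duplication, and JSONL-writing steps can be sketched roughly as
follows. The record fields, the quality scale, and the exact JSONL schema shown
here are assumptions for illustration; the real pipeline in `training/` may
differ in its details.

```python
import json

# Hypothetical post-teacher records; the actual field names may differ.
examples = [
    {"source_lang": "eng", "target_lang": "spa", "selection": "online",
     "context": "He paid for the course online.", "gloss": "en línea, por internet",
     "quality": 0.92},
    {"source_lang": "eng", "target_lang": "spa", "selection": "online",
     "context": "He paid for the course online.", "gloss": "en línea",
     "quality": 0.88},  # same dedup key as above -> dropped
]

QUALITY_THRESHOLD = 0.8  # assumed value, not the pipeline's actual threshold


def dedup_and_format(examples, system_prompt="..."):
    """Drop low-quality examples, de-duplicate by the key described above,
    and emit chat-style SFT rows with a <final>...</final> answer."""
    seen = set()
    rows = []
    for ex in examples:
        if ex["quality"] < QUALITY_THRESHOLD:
            continue
        key = (ex["source_lang"], ex["target_lang"], ex["selection"], ex["context"])
        if key in seen:
            continue
        seen.add(key)
        user = (
            f'Expression: "{ex["selection"]}"\n'
            f'Context: {ex["context"]}\n'
            f'Source language: {ex["source_lang"]}\n'
            f'Definition language: {ex["target_lang"]}\n\n'
            "Return the single-line gloss now."
        )
        rows.append({"messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user},
            {"role": "assistant", "content": f"<final>{ex['gloss']}</final>"},
        ]})
    return rows


rows = dedup_and_format(examples)
# One JSON object per line, as in the SFT JSONL files.
jsonl = "\n".join(json.dumps(r, ensure_ascii=False) for r in rows)
```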
The run used for this model produced:
- Train: 1,521,749 examples
- Eval: 15,366 examples
- Test: 15,011 examples
Splits are deterministic by grouping on provenance metadata to reduce leakage
(see `sft/src/ikhou_sft/split.py`).
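A deterministic, provenance-grouped split is typically done by hashing a group
key. The sketch below illustrates the idea only; the function name, fractions,
and hashing scheme are assumptions, and the actual logic lives in
`sft/src/ikhou_sft/split.py`.

```python
import hashlib


def assign_split(group_key: str, eval_frac: float = 0.01, test_frac: float = 0.01) -> str:
    """Deterministically map a provenance group to a split.

    All examples sharing a group key (e.g. the same source document)
    land in the same split, which reduces train/eval leakage.
    The fractions here are illustrative, not the pipeline's real values.
    """
    h = int(hashlib.sha256(group_key.encode("utf-8")).hexdigest(), 16)
    u = (h % 10_000) / 10_000  # pseudo-uniform value in [0, 1)
    if u < test_frac:
        return "test"
    if u < test_frac + eval_frac:
        return "eval"
    return "train"
```

Because the assignment depends only on the key, re-running the pipeline
reproduces the same splits.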
## Training Procedure
Fine-tuning was performed with the `ikhou_sft` pipeline in this repository:
- Base model: Qwen/Qwen3-1.7B
- Full fine-tuning (no LoRA)
- Supervised fine-tuning using the chat template
- Max sequence length: 512
- Optimizer: Muon
- 1 epoch with gradient accumulation
See `sft/src/ikhou_sft/train.py` for implementation details.
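As a rough illustration of chat-template SFT, the standard approach (assumed
here, not confirmed from `train.py`) masks the prompt tokens so that only the
`<final>...</final>` assistant answer contributes to the loss, with sequences
capped at the 512-token limit:

```python
# Toy token ids stand in for real tokenizer output. The principle: prompt
# positions get the ignore index (-100) so cross-entropy only covers the
# assistant answer, and sequences are truncated to the max length.
IGNORE_INDEX = -100
MAX_LEN = 512


def build_labels(prompt_ids, answer_ids, max_len=MAX_LEN):
    """Concatenate prompt and answer ids; mask the prompt in the labels."""
    input_ids = (prompt_ids + answer_ids)[:max_len]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + answer_ids)[:max_len]
    return input_ids, labels


input_ids, labels = build_labels([1, 2, 3], [10, 11, 12])
```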
## How To Use
The model expects a system prompt and a user prompt that mirror the data
generation pipeline.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_id = "ikhou/dict-s"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
system_prompt = (
    "You are a bilingual dictionary assistant.\n\n"
    "Your job: given a word/phrase in context, output a SHORT dictionary-style gloss line.\n\n"
    "Hard rules:\n"
    "- Output EXACTLY ONE LINE and nothing else.\n"
    "- No quotes, no bullets, no labels (no \"Definition:\", \"Meaning:\", etc).\n"
    "- Do NOT repeat the original word/phrase in the output.\n"
    "- Keep it short (ideally <= 120 characters).\n\n"
    "Gloss rules:\n"
    "- Output 1-4 translations/synonyms in the definition language, separated by \", \".\n"
    "- Each gloss should be short (1-3 words). Prefer common, user-friendly glosses.\n"
    "- Do NOT write full sentences. No trailing period.\n\n"
    "French grammar hints (only if confident):\n"
    "IMPORTANT: The French-only formatting hints below apply ONLY when the definition language is French (fr/fra).\n"
    "If the definition language is NOT French, do NOT use nm./nf./adj./adv., do NOT add French tense notes, and do NOT add (pp).\n"
    "- Noun: prefix with \"nm.\" (masc) or \"nf.\" (fem), then a space, then glosses.\n"
    "  Example: nm. face\n"
    "- Adjective: prefix with \"adj.\", then a space, then glosses.\n"
    "  Example: adj. fragile, delicate\n"
    "- Adverb: prefix with \"adv.\", then a space, then glosses.\n"
    "  Example: adv. extremely, exceedingly\n"
    "- Conjugated verb form: glosses, then add \"(tense, subject)\" in French.\n"
    "  Example: came back, used to come back (imparfait, il)\n"
    "- Past participle: glosses, then add \"(pp)\".\n"
    "  Example: watched over, supervised (pp)\n"
)
user_prompt = (
    'Expression: "online"\n'
    "Context: He paid for the course online and started immediately.\n"
    "Source language: eng (English)\n"
    "Definition language: spa (Spanish)\n\n"
    "Return the single-line gloss now."
)
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
with torch.no_grad():
    outputs = model.generate(
        inputs,
        max_new_tokens=64,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
# Decode only the newly generated tokens, not the echoed prompt.
decoded = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(decoded)
```
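Because the training answers wrap the gloss in `<final>...</final>` tags, a
small post-processing helper can extract the gloss and check the rubric's hard
rules. This is a sketch, not part of the released code:

```python
import re


def extract_gloss(generated):
    """Pull the gloss out of a <final>...</final> answer (falling back to
    the raw text) and validate it against the rubric's hard rules.
    Returns None if the output does not validate."""
    m = re.search(r"<final>(.*?)</final>", generated, flags=re.DOTALL)
    gloss = (m.group(1) if m else generated).strip()
    if "\n" in gloss or len(gloss) > 120:
        return None  # must be a single short line
    if gloss.endswith((".", "!", "?")) or gloss.startswith(('"', "'")):
        return None  # no trailing punctuation, no quotes
    return gloss
```

In high-stakes settings, outputs that fail validation should be retried or
routed to human review rather than silently passed through.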
## Limitations and Risks
- Outputs can be inaccurate, overly general, or inconsistent with the rubric.
- The model inherits biases from source corpora and the teacher model.
- Rare languages or specialized terminology may be poorly handled.
## Acknowledgements
Base model: Qwen/Qwen3-1.7B.