# IkhouDict-s

## Model Description

IkhouDict-s is a small bilingual dictionary model fine-tuned from
Qwen/Qwen3-1.7B for single-line gloss generation. Given a word or short phrase
in context, the model returns 1 to 4 short translations or synonyms in a target
language. The rubric enforces a single line with no quotes, labels, or trailing
punctuation, and allows optional French grammatical hints when the target
language is French. For example, a Spanish gloss for the English word "online"
might be the single line `en línea`.

## Intended Use

This model is intended for lexicography support, language-learning tools, and
first-pass draft glossing. It is not a substitute for professional translation
or domain-specific terminology work, and outputs should be reviewed by a human
in high-stakes settings.

## Training Data

Training data are produced by the data generation pipeline in `training/` in
this repository. The pipeline creates synthetic dictionary examples from web
corpora, then filters and formats them for supervised fine-tuning (SFT).

Pipeline summary:

1. Extract sentences from multilingual web corpora (FineWeb-2 by default; optional
   FineWeb for English-only supplementation).
2. Select a target word or phrase from each sentence (single token or short
   phrase up to 5 tokens; `phrase_ratio` controls the mix).
3. Sample target languages, including cross-lingual targets. The default config
   uses 10 languages (`deu`, `eng`, `spa`, `fra`, `ita`, `jpn`, `kor`, `por`,
   `rus`, `cmn`) and generates multiple target languages per example.
4. Generate a short gloss under a strict rubric using a teacher LLM served
   through an OpenAI-compatible endpoint, then clean and validate the
   definitions.
5. Drop examples below a quality threshold, then de-duplicate the remaining
   examples by (source_lang, target_lang, selection, context).
6. Write each example to SFT JSONL format with a system prompt, a user prompt,
   and a `<final>...</final>` assistant answer, as sketched below.
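
The sketch below illustrates step 5's de-duplication key and step 6's record
shape. The field names and output path are assumptions for illustration; the
actual schema is defined by the pipeline in `training/`.

```python
import json

# Hypothetical example record; field names are illustrative only.
examples = [
    {
        "source_lang": "eng",
        "target_lang": "spa",
        "selection": "online",
        "context": "He paid for the course online and started immediately.",
        "system_prompt": "...",
        "user_prompt": "...",
        "gloss": "en línea",
    },
]

# Step 5: de-duplicate on the (source_lang, target_lang, selection, context) key.
seen, unique = set(), []
for ex in examples:
    key = (ex["source_lang"], ex["target_lang"], ex["selection"], ex["context"])
    if key not in seen:
        seen.add(key)
        unique.append(ex)

# Step 6: write one chat-formatted JSONL record per surviving example, with
# the assistant answer wrapped in <final>...</final>.
with open("train.jsonl", "w", encoding="utf-8") as f:
    for ex in unique:
        record = {
            "messages": [
                {"role": "system", "content": ex["system_prompt"]},
                {"role": "user", "content": ex["user_prompt"]},
                {"role": "assistant", "content": f"<final>{ex['gloss']}</final>"},
            ]
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```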

The run used for this model produced:

- Train: 1,521,749 examples
- Eval: 15,366 examples
- Test: 15,011 examples

Splits are made deterministically by grouping on provenance metadata, which
reduces leakage between splits (see `sft/src/ikhou_sft/split.py`).
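
A minimal sketch of the general technique, assuming a hypothetical string
provenance key (the file above remains the authoritative implementation):

```python
import hashlib

def assign_split(provenance: str, eval_frac: float = 0.01, test_frac: float = 0.01) -> str:
    # Hash the group key so every example from the same source document
    # lands in the same split, limiting train/eval leakage.
    digest = hashlib.sha256(provenance.encode("utf-8")).hexdigest()
    u = int(digest, 16) % 10_000 / 10_000  # deterministic value in [0, 1)
    if u < test_frac:
        return "test"
    if u < test_frac + eval_frac:
        return "eval"
    return "train"

print(assign_split("fineweb-2/doc-12345"))  # hypothetical provenance key
```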

## Training Procedure

Fine-tuning was performed with the `ikhou_sft` pipeline in this repository:

- Base model: Qwen/Qwen3-1.7B
- Full fine-tuning (no LoRA)
- Supervised fine-tuning using the chat template
- Max sequence length: 512
- Optimizer: Muon
- 1 epoch with gradient accumulation

See `sft/src/ikhou_sft/train.py` for implementation details.
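
For reference, the settings above can be summarized as follows; the
`gradient_accumulation_steps` value is a placeholder, and
`sft/src/ikhou_sft/train.py` holds the authoritative configuration.

```python
# Summary of the run settings listed above.
train_config = {
    "base_model": "Qwen/Qwen3-1.7B",
    "finetune_mode": "full",           # full fine-tuning, no LoRA
    "use_chat_template": True,         # supervised fine-tuning on chat-formatted data
    "max_seq_len": 512,
    "optimizer": "muon",
    "num_epochs": 1,
    "gradient_accumulation_steps": 8,  # placeholder, not the run's actual value
}
```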

## How To Use

The model expects a system prompt and a user prompt that mirror the prompts
used by the data generation pipeline.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "ikhou/dict-s"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

system_prompt = (
    "You are a bilingual dictionary assistant.\n\n"
    "Your job: given a word/phrase in context, output a SHORT dictionary-style gloss line.\n\n"
    "Hard rules:\n"
    "- Output EXACTLY ONE LINE and nothing else.\n"
    "- No quotes, no bullets, no labels (no \"Definition:\", \"Meaning:\", etc).\n"
    "- Do NOT repeat the original word/phrase in the output.\n"
    "- Keep it short (ideally <= 120 characters).\n\n"
    "Gloss rules:\n"
    "- Output 1-4 translations/synonyms in the definition language, separated by \", \".\n"
    "- Each gloss should be short (1-3 words). Prefer common, user-friendly glosses.\n"
    "- Do NOT write full sentences. No trailing period.\n\n"
    "French grammar hints (only if confident):\n"
    "IMPORTANT: The French-only formatting hints below apply ONLY when the definition language is French (fr/fra).\n"
    "If the definition language is NOT French, do NOT use nm./nf./adj./adv., do NOT add French tense notes, and do NOT add (pp).\n"
    "- Noun: prefix with \"nm.\" (masc) or \"nf.\" (fem), then a space, then glosses.\n"
    "  Example: nm. face\n"
    "- Adjective: prefix with \"adj.\", then a space, then glosses.\n"
    "  Example: adj. fragile, delicate\n"
    "- Adverb: prefix with \"adv.\", then a space, then glosses.\n"
    "  Example: adv. extremely, exceedingly\n"
    "- Conjugated verb form: glosses, then add \"(tense, subject)\" in French.\n"
    "  Example: came back, used to come back (imparfait, il)\n"
    "- Past participle: glosses, then add \"(pp)\".\n"
    "  Example: watched over, supervised (pp)\n"
)

user_prompt = (
    'Expression: "online"\n'
    "Context: He paid for the course online and started immediately.\n"
    "Source language: eng (English)\n"
    "Definition language: spa (Spanish)\n\n"
    "Return the single-line gloss now."
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
)

with torch.no_grad():
    outputs = model.generate(
        inputs,
        max_new_tokens=64,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens; outputs[0] also contains the prompt.
decoded = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(decoded)
```
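
Because the SFT answers are wrapped in `<final>...</final>`, the raw
generation typically includes those tags. A minimal post-processing sketch,
assuming that output format:

```python
import re

# Extract the gloss from the <final>...</final> wrapper used during SFT;
# fall back to the raw text if the tags are absent.
match = re.search(r"<final>(.*?)</final>", decoded, flags=re.DOTALL)
gloss = match.group(1).strip() if match else decoded.strip()
print(gloss)
```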

## Limitations and Risks

- Outputs can be inaccurate, overly general, or inconsistent with the rubric.
- The model inherits biases from source corpora and the teacher model.
- Rare languages or specialized terminology may be poorly handled.

## Acknowledgements

Base model: Qwen/Qwen3-1.7B.