# IkhouDict-s

## Model Description

IkhouDict-s is a small bilingual dictionary model fine-tuned from
Qwen/Qwen3-1.7B for single-line gloss generation. Given a word or short phrase
in context, the model returns 1 to 4 short translations or synonyms in a target
language. The rubric enforces a single line with no quotes, labels, or trailing
punctuation, and allows optional French grammatical hints when the target
language is French. For example, a Spanish gloss for the English word "online"
might be the single line `en línea`.

## Intended Use

This model is intended for lexicography support, language-learning tools, and
first-pass draft glossing. It is not a substitute for professional translation
or domain-specific terminology work, and outputs should be reviewed by a human
in high-stakes settings.

## Training Data

Training data are produced by the data generation pipeline in `training/` in
this repository. The pipeline creates synthetic dictionary examples from web
corpora, then filters and formats them for supervised fine-tuning (SFT).

Pipeline summary:

1. Extract sentences from multilingual web corpora (FineWeb-2 by default; optional
   FineWeb for English-only supplementation).
2. Select a target word or phrase from each sentence (single token or short
   phrase up to 5 tokens; `phrase_ratio` controls the mix).
3. Sample target languages, including cross-lingual targets. The default config
   uses 10 languages (`deu`, `eng`, `spa`, `fra`, `ita`, `jpn`, `kor`, `por`,
   `rus`, `cmn`) and generates multiple target languages per example.
4. Generate a short gloss under a strict rubric using a teacher LLM served
   through an OpenAI-compatible endpoint, then clean and validate the
   definitions.
5. Drop examples below a quality threshold, then de-duplicate the remaining
   examples by (source_lang, target_lang, selection, context).
6. Write each example to SFT JSONL format with a system prompt, a user prompt,
   and a `<final>...</final>` assistant answer, as sketched below.
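
The sketch below illustrates step 5's de-duplication key and step 6's record
shape. The field names and output path are assumptions for illustration; the
actual schema is defined by the pipeline in `training/`.

```python
import json

# Hypothetical example record; field names are illustrative only.
examples = [
    {
        "source_lang": "eng",
        "target_lang": "spa",
        "selection": "online",
        "context": "He paid for the course online and started immediately.",
        "system_prompt": "...",
        "user_prompt": "...",
        "gloss": "en línea",
    },
]

# Step 5: de-duplicate on the (source_lang, target_lang, selection, context) key.
seen, unique = set(), []
for ex in examples:
    key = (ex["source_lang"], ex["target_lang"], ex["selection"], ex["context"])
    if key not in seen:
        seen.add(key)
        unique.append(ex)

# Step 6: write one chat-formatted JSONL record per surviving example, with
# the assistant answer wrapped in <final>...</final>.
with open("train.jsonl", "w", encoding="utf-8") as f:
    for ex in unique:
        record = {
            "messages": [
                {"role": "system", "content": ex["system_prompt"]},
                {"role": "user", "content": ex["user_prompt"]},
                {"role": "assistant", "content": f"<final>{ex['gloss']}</final>"},
            ]
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```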

The run used for this model produced:

- Train: 1,521,749 examples
- Eval: 15,366 examples
- Test: 15,011 examples

Splits are made deterministically by grouping on provenance metadata, which
reduces leakage between splits (see `sft/src/ikhou_sft/split.py`).
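
A minimal sketch of the general technique, assuming a hypothetical string
provenance key (the file above remains the authoritative implementation):

```python
import hashlib

def assign_split(provenance: str, eval_frac: float = 0.01, test_frac: float = 0.01) -> str:
    # Hash the group key so every example from the same source document
    # lands in the same split, limiting train/eval leakage.
    digest = hashlib.sha256(provenance.encode("utf-8")).hexdigest()
    u = int(digest, 16) % 10_000 / 10_000  # deterministic value in [0, 1)
    if u < test_frac:
        return "test"
    if u < test_frac + eval_frac:
        return "eval"
    return "train"

print(assign_split("fineweb-2/doc-12345"))  # hypothetical provenance key
```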

## Training Procedure

Fine-tuning was performed with the `ikhou_sft` pipeline in this repository:

- Base model: Qwen/Qwen3-1.7B
- Full fine-tuning (no LoRA)
- Supervised fine-tuning using the chat template
- Max sequence length: 512
- Optimizer: Muon
- 1 epoch with gradient accumulation

See `sft/src/ikhou_sft/train.py` for implementation details.
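
For reference, the settings above can be summarized as follows; the
`gradient_accumulation_steps` value is a placeholder, and
`sft/src/ikhou_sft/train.py` holds the authoritative configuration.

```python
# Summary of the run settings listed above.
train_config = {
    "base_model": "Qwen/Qwen3-1.7B",
    "finetune_mode": "full",           # full fine-tuning, no LoRA
    "use_chat_template": True,         # supervised fine-tuning on chat-formatted data
    "max_seq_len": 512,
    "optimizer": "muon",
    "num_epochs": 1,
    "gradient_accumulation_steps": 8,  # placeholder, not the run's actual value
}
```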

## How To Use

The model expects a system prompt and a user prompt that mirror the prompts
used by the data generation pipeline.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "ikhou/dict-s"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

system_prompt = (
    "You are a bilingual dictionary assistant.\n\n"
    "Your job: given a word/phrase in context, output a SHORT dictionary-style gloss line.\n\n"
    "Hard rules:\n"
    "- Output EXACTLY ONE LINE and nothing else.\n"
    "- No quotes, no bullets, no labels (no \"Definition:\", \"Meaning:\", etc).\n"
    "- Do NOT repeat the original word/phrase in the output.\n"
    "- Keep it short (ideally <= 120 characters).\n\n"
    "Gloss rules:\n"
    "- Output 1-4 translations/synonyms in the definition language, separated by \", \".\n"
    "- Each gloss should be short (1-3 words). Prefer common, user-friendly glosses.\n"
    "- Do NOT write full sentences. No trailing period.\n\n"
    "French grammar hints (only if confident):\n"
    "IMPORTANT: The French-only formatting hints below apply ONLY when the definition language is French (fr/fra).\n"
    "If the definition language is NOT French, do NOT use nm./nf./adj./adv., do NOT add French tense notes, and do NOT add (pp).\n"
    "- Noun: prefix with \"nm.\" (masc) or \"nf.\" (fem), then a space, then glosses.\n"
    "  Example: nm. face\n"
    "- Adjective: prefix with \"adj.\", then a space, then glosses.\n"
    "  Example: adj. fragile, delicate\n"
    "- Adverb: prefix with \"adv.\", then a space, then glosses.\n"
    "  Example: adv. extremely, exceedingly\n"
    "- Conjugated verb form: glosses, then add \"(tense, subject)\" in French.\n"
    "  Example: came back, used to come back (imparfait, il)\n"
    "- Past participle: glosses, then add \"(pp)\".\n"
    "  Example: watched over, supervised (pp)\n"
)

user_prompt = (
    'Expression: "online"\n'
    "Context: He paid for the course online and started immediately.\n"
    "Source language: eng (English)\n"
    "Definition language: spa (Spanish)\n\n"
    "Return the single-line gloss now."
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
)

with torch.no_grad():
    outputs = model.generate(
        inputs,
        max_new_tokens=64,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens; outputs[0] also contains the prompt.
decoded = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(decoded)
```
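
Because the SFT answers are wrapped in `<final>...</final>`, the raw
generation typically includes those tags. A minimal post-processing sketch,
assuming that output format:

```python
import re

# Extract the gloss from the <final>...</final> wrapper used during SFT;
# fall back to the raw text if the tags are absent.
match = re.search(r"<final>(.*?)</final>", decoded, flags=re.DOTALL)
gloss = match.group(1).strip() if match else decoded.strip()
print(gloss)
```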

## Limitations and Risks

- Outputs can be inaccurate, overly general, or inconsistent with the rubric.
- The model inherits biases from source corpora and the teacher model.
- Rare languages or specialized terminology may be poorly handled.

## Acknowledgements

Base model: Qwen/Qwen3-1.7B.