phi3-khmer-distilled

Model Summary

phi3-khmer-distilled is a Khmer-capable language model obtained by instruction distillation and supervised fine-tuning (SFT) of Microsoft’s Phi-3 Mini model on a curated English–Khmer distillation dataset.

The model is designed to improve:

  • Khmer language generation
  • English → Khmer translation
  • Instruction following in Khmer
  • Question answering and reasoning in Khmer
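As a quick illustration of how the model is meant to be prompted, here is a minimal prompt-building sketch in the instruction / input → response style used for training. The exact template is not disclosed, so `build_prompt` and its layout are illustrative assumptions; the resulting string would be tokenized and passed to the model (e.g. via the Hugging Face transformers library).

```python
def build_prompt(instruction: str, input_text: str = "") -> str:
    """Format an instruction/input pair in a generic SFT prompt layout.

    NOTE: the exact template used to train phi3-khmer-distilled is not
    disclosed; this layout is an illustrative assumption.
    """
    if input_text:
        return f"Instruction: {instruction}\nInput: {input_text}\nResponse:"
    return f"Instruction: {instruction}\nResponse:"


prompt = build_prompt("Translate the following sentence into Khmer.",
                      "Hello, how are you?")
# The prompt string would then be tokenized and passed to the model's
# generate() call, e.g. with the transformers AutoModelForCausalLM API.
```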

Base Model

  • Base model: microsoft/phi-3-mini
  • Architecture: Decoder-only transformer
  • Model type: Causal Language Model (CLM)
  • Parameters: ~4B (BF16 weights)

Training Dataset

The model was fine-tuned using:

  • Dataset: nphearum/khmer-distillation-2k26
  • Size: ~2,000 examples
  • Languages: English (en), Khmer (km)
  • Data characteristics:
    • Multiple English paraphrases per example
    • Multiple Khmer reference outputs
    • Instruction-style structured responses
    • Translation and reasoning tasks

Dataset fields (training view)

  Field      Description
  ---------  ----------------------------------------------------------
  english    One or more English inputs (flattened for training)
  khmer      Khmer outputs (flattened from structured references)
  label      Task direction (e.g. en_to_km)
  purpose    Task type (e.g. translate, instruction)
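The flattening described above can be sketched as follows. The record layout and the pairing rule (each English paraphrase paired with the first Khmer reference) are assumptions for illustration, since the actual preprocessing code is not published:

```python
def flatten_record(record: dict) -> list[tuple[str, str]]:
    """Expand one dataset record into (english, khmer) training pairs.

    Assumes 'english' holds a list of paraphrases and 'khmer' a list of
    reference outputs; each paraphrase is paired with the first reference.
    (Illustrative only -- the real preprocessing is not disclosed.)
    """
    english_variants = record["english"]
    reference = record["khmer"][0]  # one canonical target per example
    return [(en, reference) for en in english_variants]


record = {
    "english": ["Hello, how are you?", "Hi, how are you doing?"],
    "khmer": ["សួស្តី តើអ្នកសុខសប្បាយទេ?"],
    "label": "en_to_km",
    "purpose": "translate",
}
pairs = flatten_record(record)  # two (english, khmer) pairs, shared target
```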

Training Procedure

  • Training type: Supervised fine-tuning (SFT)
  • Objective: Next-token prediction
  • Prompting style: Instruction / input → response
  • Tokenizer: Base Phi-3 tokenizer
  • Data preprocessing:
    • English variants concatenated
    • Khmer structured outputs serialized to text
    • No synthetic chain-of-thought added

⚠️ Exact hyperparameters (epochs, LR, batch size) are not disclosed and may vary between runs.


Intended Use

✅ Supported Use Cases

  • English → Khmer translation
  • Khmer instruction following
  • Khmer question answering
  • Educational and research use
  • Low-resource language experimentation

❌ Out-of-Scope Uses

  • Safety-critical or medical applications
  • Legal or financial advice
  • Fully production-grade translation systems

Limitations

  • Trained on a small, synthetic dataset
  • Limited coverage of real-world Khmer domains
  • May hallucinate facts
  • Performance depends heavily on prompt quality

This model should be viewed as a research and experimentation model, not a final production system.


Ethical Considerations

  • Dataset contains synthetic and translated content
  • No personal or sensitive data intentionally included
  • Biases present in the base model may persist
  • Outputs should be reviewed before real-world use

License

  • Model license: MIT
  • Base model license: As provided by Microsoft for Phi-3

Please ensure compliance with the Phi-3 base model license when using or redistributing this model.


Citation

If you use this model, please cite:

@misc{nphearum_phi3_khmer_distilled,
  author = {Phearum Nop},
  title  = {phi3-khmer-distilled},
  year   = {2026},
  url    = {https://huggingface.co/nphearum/phi3-khmer-distilled},
  note   = {Base model: microsoft/phi-3-mini; dataset: nphearum/khmer-distillation-2k26}
}

Contact

Created by Phearum Nop. For questions or feedback, please open an issue on the Hugging Face model repository.
