# phi3-khmer-distilled

## Model Summary
phi3-khmer-distilled is a Khmer-capable language model obtained by instruction distillation and supervised fine-tuning (SFT) of Microsoft’s Phi-3 Mini model on a curated English–Khmer distillation dataset.
The model is designed to improve:
- Khmer language generation
- English → Khmer translation
- Instruction following in Khmer
- Question answering and reasoning in Khmer
## Base Model

- Base model: `microsoft/Phi-3-mini-4k-instruct`
- Architecture: Decoder-only transformer
- Model type: Causal language model (CLM)
## Training Dataset
The model was fine-tuned using:
- Dataset: `nphearum/khmer-distillation-2k26`
- Size: ~2,000 examples
- Languages: English (`en`), Khmer (`km`)
- Data characteristics:
  - Multiple English paraphrases per example
  - Multiple Khmer reference outputs
  - Instruction-style structured responses
  - Translation and reasoning tasks
### Dataset fields (training view)
| Field | Description |
|---|---|
| `english` | One or more English inputs (flattened for training) |
| `khmer` | Khmer outputs (flattened from structured references) |
| `label` | Task direction (e.g. `en_to_km`) |
| `purpose` | Task type (e.g. `translate`, `instruction`) |
## Training Procedure
- Training type: Supervised fine-tuning (SFT)
- Objective: Next-token prediction
- Prompting style: Instruction / input → response
- Tokenizer: Base Phi-3 tokenizer
- Data preprocessing:
  - English variants concatenated
  - Khmer structured outputs serialized to text
  - No synthetic chain-of-thought added
> ⚠️ Exact hyperparameters (epochs, learning rate, batch size) are not disclosed and may vary between runs.
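The preprocessing steps above can be sketched as follows. The exact record layout is an assumption based on the dataset table: `english` is taken to hold paraphrase variants and `khmer` to hold structured reference outputs.

```python
def flatten_example(example: dict) -> dict:
    """Collapse one structured dataset record into a flat training pair.

    Mirrors the preprocessing described in this card: English variants
    are concatenated and Khmer structured outputs are serialized to text.
    The input layout is an assumption, not the dataset's exact schema.
    """
    return {
        "english": " ".join(example["english"]),  # variants concatenated
        "khmer": "\n".join(example["khmer"]),     # references serialized to text
        "label": example["label"],
        "purpose": example["purpose"],
    }

record = {
    "english": ["Translate: Hello.", "Please translate: Hello."],
    "khmer": ["សួស្តី។"],
    "label": "en_to_km",
    "purpose": "translate",
}
flat = flatten_example(record)
print(flat["english"])
```

Each flattened pair can then be formatted into an instruction/response prompt for next-token-prediction SFT.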
## Intended Use

### ✅ Supported Use Cases
- English → Khmer translation
- Khmer instruction following
- Khmer question answering
- Educational and research use
- Low-resource language experimentation
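A minimal inference sketch using the standard `transformers` causal-LM API. The repo name comes from this card, but the exact prompt template is an assumption based on the "instruction / input → response" style described above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "nphearum/phi3-khmer-distilled"

def build_prompt(instruction: str) -> str:
    # Instruction -> response style, per the Training Procedure section.
    # The exact template is an assumption; adjust to match your training format.
    return f"Instruction: {instruction}\nResponse:"

def generate(instruction: str, max_new_tokens: int = 128) -> str:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    inputs = tokenizer(build_prompt(instruction), return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Strip the prompt tokens and decode only the completion.
    completion = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(completion, skip_special_tokens=True)

if __name__ == "__main__":
    print(generate("Translate to Khmer: Good morning."))
```

For translation, prefix the input with an explicit direction (e.g. "Translate to Khmer: …"), matching the `en_to_km` task direction in the dataset.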
### ❌ Out-of-Scope Uses
- Safety-critical or medical applications
- Legal or financial advice
- Fully production-grade translation systems
## Limitations
- Trained on a small, synthetic dataset
- Limited coverage of real-world Khmer domains
- May hallucinate facts
- Performance depends heavily on prompt quality
This model should be viewed as a research and experimentation model, not a final production system.
## Ethical Considerations
- Dataset contains synthetic and translated content
- No personal or sensitive data intentionally included
- Biases present in the base model may persist
- Outputs should be reviewed before real-world use
## License
- Model license: MIT
- Base model license: As provided by Microsoft for Phi-3
Please ensure compliance with the Phi-3 base model license when using or redistributing this model.
## Citation
If you use this model, please cite:
```bibtex
@misc{nphearum_phi3_khmer_distilled,
  author     = {Phearum Nop},
  title      = {phi3-khmer-distilled},
  year       = {2026},
  base_model = {microsoft/Phi-3-mini-4k-instruct},
  dataset    = {nphearum/khmer-distillation-2k26},
  url        = {https://huggingface.co/nphearum/phi3-khmer-distilled}
}
```
## Contact

Created by Phearum Nop. For questions or feedback, please open an issue on the Hugging Face model repository.