phi3-khmer-distilled

Model Summary

phi3-khmer-distilled is a Khmer-capable language model obtained by instruction distillation and supervised fine-tuning (SFT) of Microsoft’s Phi-3 Mini model on a curated English–Khmer distillation dataset.

The model is designed to improve:

  • Khmer language generation
  • English → Khmer translation
  • Instruction following in Khmer
  • Question answering and reasoning in Khmer
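As a quick illustration of how the model is meant to be prompted, here is a minimal prompt-building sketch in the instruction / input → response style used for training. The exact template is not disclosed, so `build_prompt` and its layout are illustrative assumptions; the resulting string would be tokenized and passed to the model (e.g. via the Hugging Face transformers library).

```python
def build_prompt(instruction: str, input_text: str = "") -> str:
    """Format an instruction/input pair in a generic SFT prompt layout.

    NOTE: the exact template used to train phi3-khmer-distilled is not
    disclosed; this layout is an illustrative assumption.
    """
    if input_text:
        return f"Instruction: {instruction}\nInput: {input_text}\nResponse:"
    return f"Instruction: {instruction}\nResponse:"


prompt = build_prompt("Translate the following sentence into Khmer.",
                      "Hello, how are you?")
# The prompt string would then be tokenized and passed to the model's
# generate() call, e.g. with the transformers AutoModelForCausalLM API.
```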

Base Model

  • Base model: microsoft/phi-3-mini
  • Architecture: Decoder-only transformer
  • Model type: Causal Language Model (CLM)
  • Parameters: ~4B (BF16 weights)

Training Dataset

The model was fine-tuned using:

  • Dataset: nphearum/khmer-distillation-2k26
  • Size: ~2,000 examples
  • Languages: English (en), Khmer (km)
  • Data characteristics:
    • Multiple English paraphrases per example
    • Multiple Khmer reference outputs
    • Instruction-style structured responses
    • Translation and reasoning tasks

Dataset fields (training view)

  Field      Description
  ---------  ----------------------------------------------------------
  english    One or more English inputs (flattened for training)
  khmer      Khmer outputs (flattened from structured references)
  label      Task direction (e.g. en_to_km)
  purpose    Task type (e.g. translate, instruction)
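The flattening described above can be sketched as follows. The record layout and the pairing rule (each English paraphrase paired with the first Khmer reference) are assumptions for illustration, since the actual preprocessing code is not published:

```python
def flatten_record(record: dict) -> list[tuple[str, str]]:
    """Expand one dataset record into (english, khmer) training pairs.

    Assumes 'english' holds a list of paraphrases and 'khmer' a list of
    reference outputs; each paraphrase is paired with the first reference.
    (Illustrative only -- the real preprocessing is not disclosed.)
    """
    english_variants = record["english"]
    reference = record["khmer"][0]  # one canonical target per example
    return [(en, reference) for en in english_variants]


record = {
    "english": ["Hello, how are you?", "Hi, how are you doing?"],
    "khmer": ["សួស្តី តើអ្នកសុខសប្បាយទេ?"],
    "label": "en_to_km",
    "purpose": "translate",
}
pairs = flatten_record(record)  # two (english, khmer) pairs, shared target
```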

Training Procedure

  • Training type: Supervised fine-tuning (SFT)
  • Objective: Next-token prediction
  • Prompting style: Instruction / input → response
  • Tokenizer: Base Phi-3 tokenizer
  • Data preprocessing:
    • English variants concatenated
    • Khmer structured outputs serialized to text
    • No synthetic chain-of-thought added

⚠️ Exact hyperparameters (epochs, LR, batch size) are not disclosed and may vary between runs.


Intended Use

✅ Supported Use Cases

  • English → Khmer translation
  • Khmer instruction following
  • Khmer question answering
  • Educational and research use
  • Low-resource language experimentation

❌ Out-of-Scope Uses

  • Safety-critical or medical applications
  • Legal or financial advice
  • Fully production-grade translation systems

Limitations

  • Trained on a small, synthetic dataset
  • Limited coverage of real-world Khmer domains
  • May hallucinate facts
  • Performance depends heavily on prompt quality

This model should be viewed as a research and experimentation model, not a final production system.


Ethical Considerations

  • Dataset contains synthetic and translated content
  • No personal or sensitive data intentionally included
  • Biases present in the base model may persist
  • Outputs should be reviewed before real-world use

License

  • Model license: MIT
  • Base model license: As provided by Microsoft for Phi-3

Please ensure compliance with the Phi-3 base model license when using or redistributing this model.


Citation

If you use this model, please cite:

@misc{nphearum_phi3_khmer_distilled,
  author = {Phearum Nop},
  title  = {phi3-khmer-distilled},
  year   = {2026},
  url    = {https://huggingface.co/nphearum/phi3-khmer-distilled},
  note   = {Base model: microsoft/phi-3-mini; dataset: nphearum/khmer-distillation-2k26}
}

Contact

Created by Phearum Nop. For questions or feedback, please open an issue on the Hugging Face model repository.
