# Typhoon 3.5B CPT (Distilled from typhoon-ai/typhoon-7b)
This model is a distilled, continued-pretrained version of typhoon-ai/typhoon-7b. It is a Thai small language model (SLM) with ~3.5B parameters (sub-4B), created by transferring knowledge from the larger 7B teacher model via knowledge distillation during continued pre-training (CPT).
## Model Description
| Property | Value |
|---|---|
| Base / Teacher Model | typhoon-ai/typhoon-7b (7B, 32 layers) |
| Architecture | LlamaForCausalLM (16 layers, even-layer pruned from teacher) |
| Parameters | ~3.5B |
| Language | Thai (th) |
| Training Method | Knowledge Distillation + Continued Pre-Training (CPT) |
| Context Length | 4,096 tokens |
| Precision | bfloat16 |
## How It Was Created
This model was created in three stages:
### 1. Layer Pruning (Structural Initialization)
The 32-layer teacher model was reduced to 16 layers by extracting the even-indexed layers (0, 2, 4, ..., 30). This retains the original vocabulary, token embeddings, and language modeling head while roughly halving the parameter count.
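The even-layer extraction can be sketched as follows. `prune_even_layers` is an illustrative helper, not the project's actual script; toy `nn.Identity` blocks stand in for the teacher's 32 decoder layers:

```python
import torch.nn as nn

def prune_even_layers(layers: nn.ModuleList) -> nn.ModuleList:
    """Keep only the even-indexed layers (0, 2, 4, ...)."""
    return nn.ModuleList(layers[i] for i in range(0, len(layers), 2))

# Toy stand-in: 32 identity blocks play the role of the teacher's decoder layers.
teacher_layers = nn.ModuleList(nn.Identity() for _ in range(32))
student_layers = prune_even_layers(teacher_layers)
print(len(student_layers))  # 16
```

On a real checkpoint the same idea would apply to `model.model.layers`, with the embeddings and LM head copied over unchanged.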
### 2. Knowledge Distillation (Training Objective)
The student model was trained to mimic the teacher's behavior using a 3-part loss function:
Total Loss = 0.3 × L_CE + 0.5 × L_KD + 0.2 × L_Hidden

- **L_CE**: cross-entropy loss (next-token prediction on the Thai corpus)
- **L_KD**: KL divergence between student and teacher logits (temperature = 2)
- **L_Hidden**: MSE loss between student and teacher hidden states, aligned via linear projectors
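A minimal PyTorch sketch of this combined objective, assuming the stated 0.3 / 0.5 / 0.2 weights and T = 2; the function name `distill_loss` and the single `projector` argument are illustrative, and the actual training code may organize this differently:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels,
                 student_hidden, teacher_hidden, projector, T=2.0):
    # L_CE: next-token cross-entropy against the corpus labels
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )
    # L_KD: temperature-scaled KL divergence (student vs. teacher logits)
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.log_softmax(teacher_logits / T, dim=-1),
        log_target=True, reduction="batchmean",
    ) * (T * T)
    # L_Hidden: MSE between projected student hidden states and teacher states
    hidden = F.mse_loss(projector(student_hidden), teacher_hidden)
    # Weighted sum matching the stated 0.3 / 0.5 / 0.2 mix
    return 0.3 * ce + 0.5 * kd + 0.2 * hidden

# Toy shapes: batch 2, sequence 4, vocab 10; hidden dims 8 (student) vs 12 (teacher)
torch.manual_seed(0)
student_logits = torch.randn(2, 4, 10)
teacher_logits = torch.randn(2, 4, 10)
labels = torch.randint(0, 10, (2, 4))
student_hidden = torch.randn(2, 4, 8)
teacher_hidden = torch.randn(2, 4, 12)
projector = torch.nn.Linear(8, 12)  # aligns student width to teacher width
loss = distill_loss(student_logits, teacher_logits, labels,
                    student_hidden, teacher_hidden, projector)
```

The `T * T` factor is the standard correction that keeps the KD gradient magnitude comparable across temperatures.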
### 3. Training Data
Trained on a high-quality Thai corpus (~1M documents) sourced from:
- uonlp/CulturaX (th) — Deduplicated multilingual web corpus
- wikimedia/wikipedia (th) — Thai Wikipedia
Data was filtered using heuristics from the Typhoon paper (§3.1): Thai character ratio ≥ 40%, document length 200–100k characters, and mean line length 20–1,500 characters.
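These heuristics could be implemented roughly as below. `keep_document` is a hypothetical helper, and the Thai-character test assumes the U+0E00–U+0E7F Unicode block:

```python
def keep_document(text: str) -> bool:
    # Document length: 200-100,000 characters
    if not (200 <= len(text) <= 100_000):
        return False
    # Thai character ratio >= 40% (assumes the Thai block U+0E00-U+0E7F)
    thai = sum("\u0e00" <= ch <= "\u0e7f" for ch in text)
    if thai / len(text) < 0.40:
        return False
    # Mean non-empty line length between 20 and 1,500 characters
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return False
    mean_len = sum(len(ln) for ln in lines) / len(lines)
    return 20 <= mean_len <= 1_500
```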
## How to Use
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "Phonsiri/typhoon-3.5b-cpt-ckpt"

# Update 'subfolder' to the latest checkpoint step
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder="step_0000275")
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    subfolder="step_0000275",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

prompt = "ประเทศไทย"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    repetition_penalty=1.3,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Training Infrastructure
- Hardware: 1x NVIDIA H100 80GB
- Training Mode: Continued Pre-Training with Auto-Resume state management
- Training Steps: Ongoing (CPT phase)
- Optimizer: AdamW with Cosine Warmup scheduler
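A cosine schedule with linear warmup of the kind listed above can be sketched with a `LambdaLR`; this is a generic sketch, not the project's actual training script, and the step counts are placeholder values:

```python
import math
import torch

def cosine_with_warmup(optimizer, warmup_steps, total_steps):
    """Linear warmup to the base LR, then cosine decay toward zero."""
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)          # linear warmup
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Toy usage: one parameter, AdamW, 10 warmup steps out of 100 total
param = torch.nn.Parameter(torch.zeros(1))
opt = torch.optim.AdamW([param], lr=3e-4)
sched = cosine_with_warmup(opt, warmup_steps=10, total_steps=100)
```

`transformers.get_cosine_schedule_with_warmup` provides an equivalent ready-made scheduler.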
## Limitations
- This model is an intermediate CPT checkpoint and is still in the training phase.
- Performance on downstream reasoning tasks has not yet been formally benchmarked.
- Planned evaluation benchmarks: XNLI-th (natural language inference), ThaiExam (factual knowledge), and Perplexity on a held-out Thai evaluation set.
## Citation
```bibtex
@misc{typhoon35b-cpt,
  author = {Pornsiri Thabunsri},
  title  = {Typhoon 3.5B CPT: A Distilled Thai Small Language Model},
  year   = {2025},
  url    = {https://huggingface.co/Phonsiri/typhoon-3.5b-cpt-ckpt}
}
```
Developed by Pornsiri — Suranaree University of Technology