Typhoon 3.5B CPT (Distilled from typhoon-ai/typhoon-7b)

This model is a distilled and continued-pretrained version of typhoon-ai/typhoon-7b. It is a Thai Small Language Model (SLM) with ~3.5B parameters (Sub-4B), created by transferring knowledge from the larger 7B teacher model through Knowledge Distillation during Continued Pre-Training (CPT).

Model Description

Property Value
Base / Teacher Model typhoon-ai/typhoon-7b (7B, 32 layers)
Architecture LlamaForCausalLM (16 layers, even-layer pruned from teacher)
Parameters ~3.5B
Language Thai (th)
Training Method Knowledge Distillation + Continued Pre-Training (CPT)
Context Length 4,096 tokens
Precision bfloat16

How It Was Created

This model was created in three stages:

1. Layer Pruning (Structural Initialization)

The 32-layer teacher model was reduced to 16 layers by extracting even-indexed layers (0, 2, 4, ..., 30). This retains the original vocabulary, token embeddings, and language modeling head while halving the parameter count.
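The layer-selection step above can be sketched in a few lines. This is a minimal, framework-free illustration of even-index pruning (in practice the copy would operate on the decoder blocks of the Llama checkpoint, e.g. `model.model.layers`); the stand-in layer names are placeholders.

```python
def prune_even_layers(teacher_layers):
    """Keep even-indexed layers (0, 2, ..., n-2), halving the depth."""
    return [teacher_layers[i] for i in range(0, len(teacher_layers), 2)]

# Stand-ins for the teacher's 32 decoder blocks.
teacher = [f"layer_{i}" for i in range(32)]
student = prune_even_layers(teacher)

assert len(student) == 16
assert student[0] == "layer_0" and student[-1] == "layer_30"
```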

2. Knowledge Distillation (Training Objective)

The student model was trained to mimic the teacher's behavior using a 3-part loss function:

Total Loss = 0.3 × L_CE + 0.5 × L_KD + 0.2 × L_Hidden

L_CE     = Cross-Entropy Loss (next-token prediction on Thai corpus)
L_KD     = KL-Divergence (student logits vs teacher logits, Temperature=2)
L_Hidden = MSE Loss (aligned student/teacher hidden states via linear projectors)
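A scalar sketch of the three loss terms and their weighted sum is shown below. The weights (0.3/0.5/0.2) and temperature T=2 come from the formula above; the KL direction (teacher as target) and the T² scaling follow the common Hinton-style convention and are assumptions, not confirmed details of this training run.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-softened softmax over a list of logits."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2
    (Hinton-style convention; the exact scaling here is an assumption)."""
    p = softmax(teacher_logits, T)  # teacher soft targets
    q = softmax(student_logits, T)  # student predictions
    return (T ** 2) * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def hidden_loss(student_h, teacher_h):
    """MSE between projector-aligned student/teacher hidden states."""
    return sum((s - t) ** 2 for s, t in zip(student_h, teacher_h)) / len(student_h)

def total_loss(l_ce, l_kd, l_hidden):
    """Weighted combination from the model card."""
    return 0.3 * l_ce + 0.5 * l_kd + 0.2 * l_hidden
```

When student and teacher logits match exactly, `kd_loss` is zero, so the total reduces to the weighted CE and hidden-state terms.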

3. Training Data

Trained on a high-quality Thai corpus (~1M documents) sourced from:

  • uonlp/CulturaX (th) — Deduplicated multilingual web corpus
  • wikimedia/wikipedia (th) — Thai Wikipedia

Data was filtered using heuristics from the Typhoon paper (§3.1): Thai character ratio ≥ 40%, document length 200–100k characters, and mean line length 20–1,500 characters.
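The filtering heuristics above can be expressed as a single predicate. This is a sketch under stated assumptions: the Thai character range is taken as the Unicode Thai block (U+0E00–U+0E7F), and empty lines are excluded from the mean-line-length computation; the paper's exact implementation may differ.

```python
def passes_typhoon_filters(doc: str) -> bool:
    """Apply the heuristic filters described above to one document."""
    # Document length: 200 to 100k characters.
    if not (200 <= len(doc) <= 100_000):
        return False
    # Thai character ratio >= 40% (Unicode Thai block, an assumption).
    thai = sum(1 for ch in doc if "\u0e00" <= ch <= "\u0e7f")
    if thai / len(doc) < 0.40:
        return False
    # Mean non-empty line length: 20 to 1,500 characters.
    lines = [ln for ln in doc.splitlines() if ln.strip()]
    mean_len = sum(len(ln) for ln in lines) / len(lines)
    return 20 <= mean_len <= 1500
```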

How to Use

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "Phonsiri/typhoon-3.5b-cpt-ckpt"
# Update 'subfolder' to the latest checkpoint step
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder="step_0000275")

model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    subfolder="step_0000275",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

prompt = "ประเทศไทย"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    repetition_penalty=1.3,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Training Infrastructure

  • Hardware: 1x NVIDIA H100 80GB
  • Training Mode: Continued Pre-Training with Auto-Resume state management
  • Training Steps: Ongoing (CPT phase)
  • Optimizer: AdamW with Cosine Warmup scheduler
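The warmup-then-cosine schedule mentioned above can be sketched as a plain learning-rate function. The peak/minimum learning rates and step counts below are placeholders, not the values used in this run.

```python
import math

def cosine_with_warmup(step, warmup_steps, total_steps, peak_lr=3e-4, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr.
    All hyperparameter values here are illustrative placeholders."""
    if step < warmup_steps:
        # Linear ramp from ~0 up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to min_lr.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```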

Limitations

  • This model is an intermediate CPT checkpoint and is still in the training phase.
  • Performance on downstream reasoning tasks has not yet been formally benchmarked.
  • Planned evaluation benchmarks: XNLI-th (natural language inference), ThaiExam (factual knowledge), and Perplexity on a held-out Thai evaluation set.

Citation

@misc{typhoon35b-cpt,
  author = {Pornsiri Thabunsri},
  title  = {Typhoon 3.5B CPT: A Distilled Thai Small Language Model},
  year   = {2025},
  url    = {https://huggingface.co/Phonsiri/typhoon-3.5b-cpt-ckpt}
}

Developed by Pornsiri — Suranaree University of Technology
