Typhoon 3.5B CPT (Distilled from typhoon-ai/typhoon-7b)

This model is a distilled and continued-pretrained version of typhoon-ai/typhoon-7b. It is a Thai Small Language Model (SLM) with ~3.5B parameters (Sub-4B), created by transferring knowledge from the larger 7B teacher model through Knowledge Distillation during Continued Pre-Training (CPT).

Model Description

Property Value
Base / Teacher Model typhoon-ai/typhoon-7b (7B, 32 layers)
Architecture LlamaForCausalLM (16 layers, even-layer pruned from teacher)
Parameters ~3.5B
Language Thai (th)
Training Method Knowledge Distillation + Continued Pre-Training (CPT)
Context Length 4,096 tokens
Precision bfloat16

How It Was Created

This model was created in three stages:

1. Layer Pruning (Structural Initialization)

The 32-layer teacher model was reduced to 16 layers by extracting even-indexed layers (0, 2, 4, ..., 30). This retains the original vocabulary, token embeddings, and language modeling head while halving the parameter count.
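The layer-selection step above can be sketched in a few lines. This is a minimal, framework-free illustration of even-index pruning (in practice the copy would operate on the decoder blocks of the Llama checkpoint, e.g. `model.model.layers`); the stand-in layer names are placeholders.

```python
def prune_even_layers(teacher_layers):
    """Keep even-indexed layers (0, 2, ..., n-2), halving the depth."""
    return [teacher_layers[i] for i in range(0, len(teacher_layers), 2)]

# Stand-ins for the teacher's 32 decoder blocks.
teacher = [f"layer_{i}" for i in range(32)]
student = prune_even_layers(teacher)

assert len(student) == 16
assert student[0] == "layer_0" and student[-1] == "layer_30"
```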

2. Knowledge Distillation (Training Objective)

The student model was trained to mimic the teacher's behavior using a 3-part loss function:

Total Loss = 0.3 × L_CE + 0.5 × L_KD + 0.2 × L_Hidden

L_CE     = Cross-Entropy Loss (next-token prediction on Thai corpus)
L_KD     = KL-Divergence (student logits vs teacher logits, Temperature=2)
L_Hidden = MSE Loss (aligned student/teacher hidden states via linear projectors)
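A scalar sketch of the three loss terms and their weighted sum is shown below. The weights (0.3/0.5/0.2) and temperature T=2 come from the formula above; the KL direction (teacher as target) and the T² scaling follow the common Hinton-style convention and are assumptions, not confirmed details of this training run.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-softened softmax over a list of logits."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2
    (Hinton-style convention; the exact scaling here is an assumption)."""
    p = softmax(teacher_logits, T)  # teacher soft targets
    q = softmax(student_logits, T)  # student predictions
    return (T ** 2) * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def hidden_loss(student_h, teacher_h):
    """MSE between projector-aligned student/teacher hidden states."""
    return sum((s - t) ** 2 for s, t in zip(student_h, teacher_h)) / len(student_h)

def total_loss(l_ce, l_kd, l_hidden):
    """Weighted combination from the model card."""
    return 0.3 * l_ce + 0.5 * l_kd + 0.2 * l_hidden
```

When student and teacher logits match exactly, `kd_loss` is zero, so the total reduces to the weighted CE and hidden-state terms.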

3. Training Data

Trained on a high-quality Thai corpus (~1M documents) sourced from:

  • uonlp/CulturaX (th) — Deduplicated multilingual web corpus
  • wikimedia/wikipedia (th) — Thai Wikipedia

Data was filtered using heuristics from the Typhoon paper (§3.1): Thai character ratio ≥ 40%, document length 200–100k characters, and mean line length 20–1,500 characters.
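The filtering heuristics above can be expressed as a single predicate. This is a sketch under stated assumptions: the Thai character range is taken as the Unicode Thai block (U+0E00–U+0E7F), and empty lines are excluded from the mean-line-length computation; the paper's exact implementation may differ.

```python
def passes_typhoon_filters(doc: str) -> bool:
    """Apply the heuristic filters described above to one document."""
    # Document length: 200 to 100k characters.
    if not (200 <= len(doc) <= 100_000):
        return False
    # Thai character ratio >= 40% (Unicode Thai block, an assumption).
    thai = sum(1 for ch in doc if "\u0e00" <= ch <= "\u0e7f")
    if thai / len(doc) < 0.40:
        return False
    # Mean non-empty line length: 20 to 1,500 characters.
    lines = [ln for ln in doc.splitlines() if ln.strip()]
    mean_len = sum(len(ln) for ln in lines) / len(lines)
    return 20 <= mean_len <= 1500
```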

How to Use

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "Phonsiri/typhoon-3.5b-cpt-ckpt"
# Update 'subfolder' to the latest checkpoint step
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder="step_0000275")

model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    subfolder="step_0000275",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

prompt = "ประเทศไทย"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    repetition_penalty=1.3,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Training Infrastructure

  • Hardware: 1x NVIDIA H100 80GB
  • Training Mode: Continued Pre-Training with Auto-Resume state management
  • Training Steps: Ongoing (CPT phase)
  • Optimizer: AdamW with Cosine Warmup scheduler
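The warmup-then-cosine schedule mentioned above can be sketched as a plain learning-rate function. The peak/minimum learning rates and step counts below are placeholders, not the values used in this run.

```python
import math

def cosine_with_warmup(step, warmup_steps, total_steps, peak_lr=3e-4, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr.
    All hyperparameter values here are illustrative placeholders."""
    if step < warmup_steps:
        # Linear ramp from ~0 up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to min_lr.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```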

Limitations

  • This model is an intermediate CPT checkpoint and is still in the training phase.
  • Performance on downstream reasoning tasks has not yet been formally benchmarked.
  • Planned evaluation benchmarks: XNLI-th (natural language inference), ThaiExam (factual knowledge), and Perplexity on a held-out Thai evaluation set.

Citation

@misc{typhoon35b-cpt,
  author = {Pornsiri Thabunsri},
  title  = {Typhoon 3.5B CPT: A Distilled Thai Small Language Model},
  year   = {2025},
  url    = {https://huggingface.co/Phonsiri/typhoon-3.5b-cpt-ckpt}
}

Developed by Pornsiri — Suranaree University of Technology
