# Chinese Classical GPT-2

A 335M-parameter GPT-2 model trained from scratch for style-conditioned classical Chinese text generation. The model generates text in five distinct literary styles spanning more than 2,000 years of Chinese literature.
## Model Description

This model generates classical Chinese text conditioned on one of five style personas, each representing a major literary genre:
| Persona | Genre | Era | Example |
|---|---|---|---|
| 李白 (Li Bai) | Poetry (诗) | Tang Dynasty | Five/seven-character verse |
| 苏轼 (Su Shi) | Ci Poetry (词) | Song Dynasty | Lyric poetry |
| 蒲松龄 (Pu Songling) | Fiction (小说) | Qing Dynasty | Supernatural tales |
| 韩愈 (Han Yu) | Prose (散文) | Tang Dynasty | Argumentative essays |
| 司马迁 (Sima Qian) | History (史传) | Han Dynasty | Biographical records |
## Architecture
| Parameter | Value |
|---|---|
| Architecture | GPT-2 (Decoder-only Transformer) |
| Parameters | 335,609,856 |
| Layers | 24 |
| Attention Heads | 16 |
| Hidden Dimension | 1024 |
| Max Sequence Length | 512 tokens |
| Vocabulary | 32,000 (SentencePiece BPE) |
| Style Conditioning | Learnable style embedding added to token representations |
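The style-conditioning mechanism in the last row can be sketched as follows. This is a minimal illustration of adding a learnable per-style vector to the token embeddings; the module and variable names are hypothetical and may differ from the project's actual `GPT2` implementation.

```python
import torch
import torch.nn as nn

class StyleConditionedEmbedding(nn.Module):
    """Token embedding plus a learnable per-style vector (illustrative sketch)."""

    def __init__(self, vocab_size=32000, hidden_dim=1024, num_styles=5):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden_dim)
        self.style_emb = nn.Embedding(num_styles, hidden_dim)

    def forward(self, token_ids, style_id):
        # token_ids: (batch, seq_len); style_id: (batch,)
        x = self.tok_emb(token_ids)                    # (batch, seq_len, hidden)
        # Broadcast one style vector across every position in the sequence.
        x = x + self.style_emb(style_id).unsqueeze(1)  # (batch, 1, hidden)
        return x

emb = StyleConditionedEmbedding()
tokens = torch.randint(0, 32000, (2, 16))
styles = torch.tensor([0, 4])  # hypothetical style IDs, e.g. 李白 and 司马迁
out = emb(tokens, styles)
print(out.shape)  # torch.Size([2, 16, 1024])
```

Because the style vector is added at the embedding layer, every subsequent attention layer sees the conditioning signal without any architectural changes to the transformer blocks.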
## Training

### Two-Stage Curriculum Learning

Stage 1 — General Chinese Pretraining
- Data: 1.68M samples (classical + modern Chinese, ~860M tokens)
- Epochs: 1 | Steps: 35,109 | LR: 3e-4 (cosine decay)
- Result: Loss 10.36 → 4.0, Accuracy 2.5% → 33.5%
Stage 2 — Classical Chinese Specialization
- Data: 1.60M samples (classical Chinese only, with style labels)
- Epochs: 2 | Steps: 66,522 | LR: 1e-4 (cosine decay)
- Result: Loss 4.0 → 3.85, Accuracy 33.5% → 35.3%
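Both stages use cosine learning-rate decay from the stated peak. A minimal sketch of such a schedule (warmup-free and decaying to zero, which may differ in detail from the actual training script):

```python
import math

def cosine_lr(step, total_steps, peak_lr):
    """Cosine decay from peak_lr at step 0 down to 0 at total_steps."""
    progress = min(step / total_steps, 1.0)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# Stage 1: peak 3e-4 over 35,109 steps; Stage 2: peak 1e-4 over 66,522 steps.
print(cosine_lr(0, 35109, 3e-4))      # 0.0003 (peak at the start)
print(cosine_lr(35109, 35109, 3e-4))  # 0.0 (fully decayed)
```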
### Training Infrastructure
- Hardware: NVIDIA A100 (Google Colab)
- Precision: Mixed (FP16)
- Optimizer: AdamW (beta1=0.9, beta2=0.95)
- Total training time: ~14 hours
## Evaluation

### Perplexity
| Model | Perplexity | Improvement |
|---|---|---|
| Stage 1 (pretrain only) | 52.69 | — |
| Stage 2 (+ classical finetune) | 42.43 | -19.5% |
Evaluated on 2,000 held-out classical Chinese samples.
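Perplexity here is the exponential of the mean per-token cross-entropy on the held-out set. A pure-Python sanity check (not the project's evaluation script):

```python
import math

def perplexity(mean_ce_loss_nats):
    """Perplexity is the exponential of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_ce_loss_nats)

# The reported PPL of 42.43 corresponds to a held-out loss of about 3.75 nats/token.
print(round(math.log(42.43), 2))      # 3.75
print(round(perplexity(3.7479), 1))   # 42.4
```

Note that the held-out loss implied by the perplexity can differ from the final training loss, since they are measured on different data.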
### Style Distinguishability
A linear SVM classifier trained on 14 text-level features achieves 45.5% accuracy in classifying generated text by style (random baseline: 20.0%, 5 classes).
| Style | Precision | Recall | F1-Score |
|---|---|---|---|
| 李白 (Poetry) | 0.41 | 0.55 | 0.47 |
| 苏轼 (Ci) | 0.52 | 0.49 | 0.50 |
| 蒲松龄 (Fiction) | 0.44 | 0.46 | 0.45 |
| 韩愈 (Prose) | 0.43 | 0.41 | 0.42 |
| 司马迁 (History) | 0.52 | 0.36 | 0.43 |
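The evaluation above can be reproduced in outline with scikit-learn. The sketch below uses synthetic 14-dimensional feature vectors purely to show the setup; the actual feature set and data pipeline are not specified here, so treat everything but the classifier choice as an assumption.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
NUM_STYLES, NUM_FEATURES, PER_STYLE = 5, 14, 60

# Synthetic stand-in for per-text features (sentence length, marker rates, ...):
# each style gets its own cluster centre so the classes are separable.
centres = rng.normal(0, 3, size=(NUM_STYLES, NUM_FEATURES))
X = np.vstack([centres[s] + rng.normal(0, 1, (PER_STYLE, NUM_FEATURES))
               for s in range(NUM_STYLES)])
y = np.repeat(np.arange(NUM_STYLES), PER_STYLE)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = LinearSVC(random_state=0).fit(X_tr, y_tr)
print(clf.score(X_te, y_te) > 0.2)  # well above the 5-class random baseline
```

A linear classifier is a deliberately weak probe: any accuracy above 20% means the generated styles are separable on surface statistics alone.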
### Statistical Analysis (150 samples)
| Style | Avg Sentence Len | Poetry Line % | Narrative % | Historical % |
|---|---|---|---|---|
| 李白 | 20.0 | 7.7% | 0.83% | 0.68% |
| 苏轼 | 21.2 | 7.2% | 1.33% | 0.55% |
| 蒲松龄 | 25.7 | 2.7% | 1.82% | 0.80% |
| 韩愈 | 19.4 | 13.9% | 1.26% | 0.78% |
| 司马迁 | 24.7 | 4.7% | 1.61% | 1.63% |
Key findings:
- Sentence length separates the poetry, ci, and prose styles (19-21) from the narrative styles (24-26)
- Historical vocabulary concentrates in Sima Qian (1.63% vs ≤0.80% for the others)
- Narrative markers peak in Pu Songling (1.82% vs 0.83% for Li Bai)
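The sentence-length statistic can be computed by splitting on classical Chinese end punctuation. A minimal sketch, noting that the exact segmentation and units used for the table above are an assumption:

```python
import re

def avg_sentence_length(text):
    """Mean number of characters per sentence, splitting on 。！？；."""
    sentences = [s for s in re.split(r"[。！？；]", text) if s.strip()]
    return sum(len(s) for s in sentences) / len(sentences)

sample = "秋风送雁度孤村,万里秋声落暮烟。惟有黄花当夕照,可怜白露湿西风。"
print(avg_sentence_length(sample))  # 15.0
```

The marker percentages in the table would analogously be character-level frequencies of small hand-picked vocabularies (poetic, narrative, and historical terms) over each sample.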
## Usage

```python
import torch
import sentencepiece as spm

from model import GPT2
from config import ProjectConfig, STYLE_ID_MAP

# Load the tokenizer and the Stage 2 checkpoint
config = ProjectConfig()
sp = spm.SentencePieceProcessor()
sp.load("tokenizer/chinese_sp.model")

model = GPT2(config.model, pad_token_id=config.tokenizer.pad_id)
state = torch.load("checkpoints/stage2_final.pt", map_location="cpu", weights_only=True)
model.load_state_dict(state)
model.eval()

# Generate in the Sima Qian (historical) style
prompt = "太史公曰"
style_id = STYLE_ID_MAP["司马迁"]
ids = torch.tensor([sp.encode(prompt)])
output = model.generate(
    ids, max_new_tokens=100, style_id=style_id,
    temperature=0.85, top_k=40, top_p=0.9,
    repetition_penalty=1.3,
)
print(sp.decode(output[0].tolist()))
```
## Generation Examples

Li Bai style (Poetry) — Prompt: "月下独酌"

> 月下独酌 秋风送雁度孤村,万里秋声落暮烟。惟有黄花当夕照,可怜白露湿西风。

Pu Songling style (Fiction) — Prompt: "邑有王生"

> 邑有王生,名王生。年十九岁,曾游太学为诸生,因患心疾而殁于京师,家贫无资葬亲,佣书以供衣食者二十余年。其友怜其有行谊,每致赙钱一百余缗以葬之...

Sima Qian style (History) — Prompt: "太史公曰"

> 太史公曰:汉文帝初即位,置丞相官。高帝元年,置御史大夫,秩中二千石;惠帝三年,更名大司农。武帝太初二年省太仆、廷尉及中尉,复为太仆...
## Datasets
- ClassicChineseTexts — 13,459 classical texts across 10 categories
- Classical-Chinese — 950,000 classical-modern Chinese pairs
- CLUE tnews — 33,462 modern Chinese news articles (Stage 1 only)
## Limitations

- Local coherence only: the model generates locally fluent text but lacks long-range narrative coherence, as expected at 335M parameters
- Style bleeding: generated text sometimes drifts between styles, especially in longer outputs
- Data imbalance: the Han Yu style is overrepresented in the training data (52% of samples)
- Not instruction-tuned: the model continues text; it does not follow instructions
## Citation

```bibtex
@misc{chinese-classical-gpt-2026,
  title={Chinese Classical GPT-2: Style-Conditioned Classical Chinese Text Generation},
  author={Wei, Zichao and Wang, Entang and Feng, Zhenyu},
  year={2026},
  howpublished={Software Project Neural Networks, Saarland University}
}
```
## License
MIT