Chinese Classical GPT-2

A 335M parameter GPT-2 model trained from scratch for style-conditioned classical Chinese text generation. The model can generate text in 5 distinct literary styles spanning over 2000 years of Chinese literature.

Model Description

This model generates classical Chinese text conditioned on one of five style personas, each representing a major literary genre:

| Persona | Genre | Era | Example |
|---|---|---|---|
| 李白 (Li Bai) | Poetry (诗) | Tang Dynasty | Five/seven-character verse |
| 苏轼 (Su Shi) | Ci Poetry (词) | Song Dynasty | Lyric poetry |
| 蒲松龄 (Pu Songling) | Fiction (小说) | Qing Dynasty | Supernatural tales |
| 韩愈 (Han Yu) | Prose (散文) | Tang Dynasty | Argumentative essays |
| 司马迁 (Sima Qian) | History (史传) | Han Dynasty | Biographical records |
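
At generation time each persona is mapped to an integer style ID. A minimal sketch of what such a mapping might look like (the real `STYLE_ID_MAP` lives in the project's `config.py`; the specific IDs below are assumptions for illustration):

```python
# Hypothetical persona -> style-ID mapping; the actual values come from
# STYLE_ID_MAP in the project's config.py.
STYLE_ID_MAP = {
    "李白": 0,    # Tang poetry (诗)
    "苏轼": 1,    # Song ci poetry (词)
    "蒲松龄": 2,  # Qing supernatural fiction (小说)
    "韩愈": 3,    # Tang argumentative prose (散文)
    "司马迁": 4,  # Han biographical history (史传)
}

def style_id_for(persona: str) -> int:
    """Look up the integer style ID used to condition generation."""
    return STYLE_ID_MAP[persona]
```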

Architecture

| Parameter | Value |
|---|---|
| Architecture | GPT-2 (decoder-only Transformer) |
| Parameters | 335,609,856 |
| Layers | 24 |
| Attention Heads | 16 |
| Hidden Dimension | 1024 |
| Max Sequence Length | 512 tokens |
| Vocabulary | 32,000 (SentencePiece BPE) |
| Style Conditioning | Learnable style embedding added to token representations |
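
The style-conditioning row above can be illustrated with a short PyTorch sketch: a learnable per-style vector is broadcast-added to the token representations before the transformer blocks. This is an illustrative reconstruction using the dimensions from the table, not the project's actual `model.py`:

```python
import torch
import torch.nn as nn

class StyleConditionedEmbedding(nn.Module):
    """Token + position embeddings plus an additive learnable style vector.

    Sketch of the conditioning described above; the project's actual
    implementation may differ in details.
    """
    def __init__(self, vocab_size=32000, d_model=1024, max_len=512, n_styles=5):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        self.style = nn.Embedding(n_styles, d_model)  # one vector per persona

    def forward(self, ids, style_id):
        # ids: (batch, seq), style_id: (batch,)
        positions = torch.arange(ids.size(1), device=ids.device)
        x = self.tok(ids) + self.pos(positions)
        # broadcast the (batch, 1, d_model) style vector over the sequence
        return x + self.style(style_id).unsqueeze(1)
```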

Training

Two-Stage Curriculum Learning

Stage 1 — General Chinese Pretraining

  • Data: 1.68M samples (classical + modern Chinese, ~860M tokens)
  • Epochs: 1 | Steps: 35,109 | LR: 3e-4 (cosine decay)
  • Result: Loss 10.36 → 4.0, Accuracy 2.5% → 33.5%

Stage 2 — Classical Chinese Specialization

  • Data: 1.60M samples (classical Chinese only, with style labels)
  • Epochs: 2 | Steps: 66,522 | LR: 1e-4 (cosine decay)
  • Result: Loss 4.0 → 3.85, Accuracy 33.5% → 35.3%
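Both stages use a cosine-decay learning-rate schedule. A minimal sketch of such a schedule (the warmup length and floor LR are assumptions; the project's exact scheduler may differ):

```python
import math

def cosine_lr(step, total_steps, peak_lr, warmup_steps=1000, min_lr=0.0):
    """Linear warmup followed by cosine decay to min_lr.

    Illustrative sketch of the schedule described above; warmup_steps
    and min_lr are assumptions.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Stage 1: peak 3e-4 over 35,109 steps; Stage 2: peak 1e-4 over 66,522 steps.
```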

Training Infrastructure

  • Hardware: NVIDIA A100 (Google Colab)
  • Precision: Mixed (FP16)
  • Optimizer: AdamW (beta1=0.9, beta2=0.95)
  • Total training time: ~14 hours

Evaluation

Perplexity

| Model | Perplexity | Improvement |
|---|---|---|
| Stage 1 (pretrain only) | 52.69 | (baseline) |
| Stage 2 (+ classical finetune) | 42.43 | -19.5% |

Evaluated on 2,000 held-out classical Chinese samples.
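
Perplexity is the exponential of the mean token-level cross-entropy (in nats) over the held-out set:

```python
import math

def perplexity(total_nll: float, n_tokens: int) -> float:
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    return math.exp(total_nll / n_tokens)

# A mean cross-entropy of ~3.75 nats/token corresponds to exp(3.75) ≈ 42.5,
# in line with the Stage 2 perplexity reported above.
```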

Style Distinguishability

A linear SVM classifier trained on 14 text-level features achieves 45.5% accuracy in classifying generated text by style (random baseline: 20.0%, 5 classes).

| Style | Precision | Recall | F1-Score |
|---|---|---|---|
| 李白 (Poetry) | 0.41 | 0.55 | 0.47 |
| 苏轼 (Ci) | 0.52 | 0.49 | 0.50 |
| 蒲松龄 (Fiction) | 0.44 | 0.46 | 0.45 |
| 韩愈 (Prose) | 0.43 | 0.41 | 0.42 |
| 司马迁 (History) | 0.52 | 0.36 | 0.43 |

Statistical Analysis (150 samples)

| Style | Avg Sentence Len (chars) | Poetry Line % | Narrative % | Historical % |
|---|---|---|---|---|
| 李白 | 20.0 | 7.7% | 0.83% | 0.68% |
| 苏轼 | 21.2 | 7.2% | 1.33% | 0.55% |
| 蒲松龄 | 25.7 | 2.7% | 1.82% | 0.80% |
| 韩愈 | 19.4 | 13.9% | 1.26% | 0.78% |
| 司马迁 | 24.7 | 4.7% | 1.61% | 1.63% |

Key findings:

  • Sentence length separates poetic styles (19-21) from narrative styles (24-26)
  • Historical vocabulary concentrates in Sima Qian (1.63% vs <0.80% for others)
  • Narrative markers peak in Pu Songling (1.82% vs 0.83% for Li Bai)
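
Statistics like those above can be reproduced with simple surface features. A sketch of two of the 14 text-level features (the project's exact feature definitions may differ; the punctuation sets and the 5/7-character heuristic are assumptions):

```python
import re

def sentence_lengths(text: str) -> list[int]:
    """Split on classical Chinese end punctuation; return character counts."""
    sentences = [s for s in re.split(r"[。!?]", text) if s]
    return [len(s) for s in sentences]

def poetry_line_fraction(text: str) -> float:
    """Fraction of punctuation-delimited clauses that are exactly 5 or 7
    characters long, i.e. shaped like regulated-verse lines."""
    clauses = [c for c in re.split(r"[,。!?、]", text) if c]
    if not clauses:
        return 0.0
    return sum(1 for c in clauses if len(c) in (5, 7)) / len(clauses)
```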

Usage

```python
import torch
import sentencepiece as spm
from model import GPT2
from config import ProjectConfig, STYLE_ID_MAP

# Load tokenizer and model checkpoint
config = ProjectConfig()
sp = spm.SentencePieceProcessor()
sp.load("tokenizer/chinese_sp.model")

model = GPT2(config.model, pad_token_id=config.tokenizer.pad_id)
state = torch.load("checkpoints/stage2_final.pt", map_location="cpu", weights_only=True)
model.load_state_dict(state)
model.eval()

# Generate with a style persona
prompt = "太史公曰"
style_id = STYLE_ID_MAP["司马迁"]
ids = torch.tensor([sp.encode(prompt)])

output = model.generate(
    ids, max_new_tokens=100, style_id=style_id,
    temperature=0.85, top_k=40, top_p=0.9,
    repetition_penalty=1.3,
)
print(sp.decode(output[0].tolist()))
```
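
The sampling parameters above combine temperature scaling with top-k and nucleus (top-p) filtering. A pure-Python sketch of the top-p step, independent of the project's `generate` implementation:

```python
def top_p_filter(probs: list[float], top_p: float = 0.9) -> list[float]:
    """Keep the smallest set of tokens whose cumulative probability reaches
    top_p, zero out the rest, and renormalize. Illustrative sketch only."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, cum = set(), 0.0
    for i in order:
        kept.add(i)
        cum += probs[i]
        if cum >= top_p:
            break
    filtered = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]
```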

Generation Examples

Li Bai style (Poetry) — Prompt: "月下独酌"

月下独酌 秋风送雁度孤村,万里秋声落暮烟。惟有黄花当夕照,可怜白露湿西风。

Pu Songling style (Fiction) — Prompt: "邑有王生"

邑有王生,名王生。年十九岁,曾游太学为诸生,因患心疾而殁于京师,家贫无资葬亲,佣书以供衣食者二十余年。其友怜其有行谊,每致赙钱一百余缗以葬之...

Sima Qian style (History) — Prompt: "太史公曰"

太史公曰:汉文帝初即位,置丞相官。高帝元年,置御史大夫,秩中二千石;惠帝三年,更名大司农。武帝太初二年省太仆、廷尉及中尉,复为太仆...

Datasets

  • ClassicChineseTexts — 13,459 classical texts across 10 categories
  • Classical-Chinese — 950,000 classical-modern Chinese pairs
  • CLUE tnews — 33,462 modern Chinese news articles (Stage 1 only)

Limitations

  • Local coherence only: The model generates locally fluent text but lacks long-range narrative coherence (expected for 335M parameters)
  • Style bleeding: Generated text sometimes transitions between styles, especially for longer outputs
  • Data imbalance: Han Yu style is overrepresented in training data (52% of samples)
  • Not instruction-tuned: The model continues text, it does not follow instructions

Citation

@misc{chinese-classical-gpt-2026,
  title={Chinese Classical GPT-2: Style-Conditioned Classical Chinese Text Generation},
  author={Zichao Wei and Entang Wang and Zhenyu Feng},
  year={2026},
  howpublished={Software Project Neural Networks, Saarland University}
}

License

MIT
