A 335M parameter GPT-2 model trained from scratch for style-conditioned classical Chinese text generation, with post-training for Li Bai persona emulation.
## Overview

This project implements a complete pipeline from pre-training to persona-based dialogue:

- **Pre-training**: two-stage curriculum learning (general Chinese → classical Chinese) with a Style Embedding covering 5 literary genres
- **Post-training**: Continual Pre-training (CPT) + Supervised Fine-Tuning (SFT) for the Li Bai persona
## Model Variants

This repository contains multiple checkpoints from different training stages and ablation experiments (a checkpoint-inspection sketch follows the table):

| Model | Checkpoint | Description |
|-------|------------|-------------|
| GPT2-SE | `checkpoints_post/sft_final.pt` | Full model: post-trained with Style Embedding (primary) |
| GPT2-Base | (available on request) | Post-trained without Style Embedding (fair ablation) |
| GPT2-Raw | `checkpoints/stage2_final.pt` | Pre-trained only, no post-training (baseline) |
| Stage 1 | `checkpoints/stage1_final.pt` | General Chinese pre-training checkpoint |
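Before wiring up the model class, the released `.pt` files can be inspected directly. This is a minimal sketch, assuming the checkpoints are standard `torch.save` dumps (either a bare `state_dict` or a dict nesting one under a `"model"` key; the nesting key is an assumption):

```python
# Inspect a released checkpoint. Assumes a standard torch.save dump:
# either a bare state_dict, or a dict nesting it under "model" (assumed key).
import torch

ckpt = torch.load("checkpoints_post/sft_final.pt", map_location="cpu")
state = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt
for name, tensor in list(state.items())[:8]:
    print(name, tuple(tensor.shape))  # e.g. embedding and attention shapes
```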
## Architecture

| Parameter | Value |
|-----------|-------|
| Architecture | GPT-2 (Decoder-only Transformer) |
| Parameters | 335,609,856 |
| Layers | 24 |
| Attention Heads | 16 |
| Hidden Dimension | 1024 |
| Max Sequence Length | 512 tokens |
| Vocabulary | 32,000 (SentencePiece BPE) |
| Style Conditioning | Learnable embedding (6 styles × 1024 dim; sketched below) |
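A hedged sketch of how such a style embedding can be injected into a decoder-only stack: one learnable 1024-dim vector per style, broadcast-added to the token and position embeddings before the Transformer blocks. The injection point and module name are assumptions, not this repo's exact code:

```python
# Sketch: style-conditioned input embeddings for a GPT-2-style decoder.
# Dimensions follow the Architecture table; adding the style vector at the
# input, at every position, is an assumed design choice.
import torch
import torch.nn as nn

class StyleConditionedEmbedding(nn.Module):
    def __init__(self, vocab_size=32000, block_size=512, n_embd=1024, n_styles=6):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, n_embd)    # token table
        self.pos_emb = nn.Embedding(block_size, n_embd)    # learned positions
        self.style_emb = nn.Embedding(n_styles, n_embd)    # 6 x 1024 style table

    def forward(self, idx, style_id):
        # idx: (B, T) token ids; style_id: (B,) style labels in [0, n_styles)
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)          # (B, T, C)
        x = x + self.style_emb(style_id).unsqueeze(1)      # broadcast over T
        return x
```

A quick shape check: `StyleConditionedEmbedding()(torch.zeros(2, 16, dtype=torch.long), torch.tensor([0, 3]))` yields a `(2, 16, 1024)` tensor for the 24 Transformer blocks to consume.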
## Style Personas (Pre-training)

| Persona | Genre | Era |
|---------|-------|-----|
| Li Bai (李白) | Poetry (诗) | Tang Dynasty |
| Su Shi (苏轼) | Ci Poetry (词) | Song Dynasty |
| Pu Songling (蒲松龄) | Fiction (小说) | Qing Dynasty |
| Han Yu (韩愈) | Prose (散文) | Tang Dynasty |
| Sima Qian (司马迁) | History (史传) | Han Dynasty |
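At generation time, a persona is selected by holding its style id fixed at every decoding step. Below is a minimal greedy-decoding sketch under stated assumptions: the persona-to-id mapping is purely illustrative (the real ids are fixed by the data pipeline), and `model(idx, style_id)` returning per-position logits is an assumed interface:

```python
# Greedy decoding with a fixed style id. STYLE_IDS is illustrative only;
# model(idx, style_id) -> (B, T, vocab) logits is an assumed interface.
import torch

STYLE_IDS = {"li_bai": 0, "su_shi": 1, "pu_songling": 2,
             "han_yu": 3, "sima_qian": 4}  # hypothetical numbering

@torch.no_grad()
def generate(model, idx, style_id, max_new_tokens=128, block_size=512):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]            # crop to context window
        logits = model(idx_cond, style_id)         # (B, T, vocab)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        idx = torch.cat([idx, next_id], dim=1)     # append and continue
    return idx
```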
## Training

### Stage 1 – General Chinese Pre-training
- Data: 1.68M samples (classical + modern Chinese)
- Result: Loss 10.36 → 4.0, Accuracy 2.5% → 33.5%

### Stage 2 – Classical Chinese Specialization
- Data: 1.60M samples (classical Chinese only, with style labels)
- Result: Loss 4.0 → 3.85, Perplexity 42.43

### Post-training – Li Bai Persona (CPT + SFT)
- CPT: 1,329 Li Bai texts (poems, prose, biographies), Loss 4.30 → 1.34
- SFT: 1,000 multi-turn dialogues in Li Bai's voice, Loss 3.76 → 0.58 (loss-masking sketch below)
- Hardware: NVIDIA RTX 4080 SUPER (16GB), ~10 min total
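For SFT on multi-turn dialogues, a common recipe is to apply the next-token cross-entropy only to the assistant (Li Bai) tokens, masking user turns out of the loss. Whether this repo masks prompt tokens is an assumption; a minimal sketch:

```python
# SFT loss with prompt masking (a common recipe; assumed, not confirmed
# for this repo). loss_mask is 1.0 on assistant tokens, 0.0 elsewhere.
import torch
import torch.nn.functional as F

def sft_loss(logits, targets, loss_mask):
    # logits: (B, T, V); targets: (B, T) next-token ids; loss_mask: (B, T)
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    )
    mask = loss_mask.reshape(-1)
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```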
## Evaluation

Each total below is the sum of its five per-dimension scores.

### LLM-Judge Quality (Tasks 1-5, scored 0-100)

| Model | Fluency | Coherence | Completeness | Style | Literary | Total |
|-------|---------|-----------|--------------|-------|----------|-------|
| GPT2-Raw | 7.47 | 3.81 | 3.88 | 2.65 | 1.91 | 19.72 |
| GPT2-Base | 16.03 | 14.01 | 13.20 | 15.27 | 10.50 | 69.01 |
| GPT2-SE | 16.30 | 14.32 | 13.39 | 15.74 | 10.52 | 70.27 |
### Adversarial Robustness (Task 6, scored 0-100)

| Model | Boundary | Refusal | Persona | Coherence | Fluency | Total |
|-------|----------|---------|---------|-----------|---------|-------|
| GPT2-Raw | 2.35 | 2.18 | 5.35 | 4.88 | 11.18 | 25.94 |
| GPT2-Base | 10.94 | 10.71 | 17.00 | 15.94 | 18.71 | 73.30 |
| GPT2-SE | 14.35 | 13.94 | 18.18 | 16.18 | 18.41 | 81.06 |
### Persona Identification (open-ended, by DeepSeek judge)

## Limitations

- Local coherence only: 335M parameters cannot maintain long-range narrative logic
- Style bleeding: the style signal attenuates in longer outputs (>200 tokens)
- Potential SFT overfitting: low SFT loss (0.58) on 1,000 examples × 10 epochs
- No explicit prosodic supervision: tonal patterns are learned incidentally through statistical co-occurrence
## Citation

```bibtex
@misc{chinese-classical-gpt-2026,
  title={Cross-Era Alignment for Emulating Ancient Chinese Literati},
  author={Zichao Wei and Entang Wang and Zhenyu Feng},
  year={2026},
  howpublished={Software Project Neural Networks, Saarland University}
}
```