---
datasets:
- RUCAIBox/Erya-dataset
language:
- en
base_model:
- google/gemma-2-2b-it
tags:
- ancient-chinese
- chinese
- literature
- unsloth
- trl
- sft
---
Ancient Chinese Translator + Phonology Model (SimaQian)

Name Origin:

The model is named after Sima Qian (司馬遷), the famous ancient Chinese historian known for his Records of the Grand Historian, a general history of China covering more than two thousand years.

This model combines two key functionalities for Ancient Chinese texts:

	1.	Translation: Converts Ancient Chinese passages into modern Chinese.
    
	2.	Phonological Reconstruction: Provides historical pronunciations for characters or entire sentences across multiple eras (e.g., Middle Tang, Song, Yuan, Ming/Qing).


Model Description

	•	Architecture: Google’s Gemma 2 2B Instruct model (google/gemma-2-2b-it), fine-tuned with LoRA.
    
	•	Input Format: Gemma chat format, in which the special tokens <start_of_turn> and <end_of_turn> delimit the user and model turns.
    
	•	Output: Era identification (optional), phonetic renderings, and modern Chinese translations.
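
The input format above can be sketched as a small prompt builder. This is a minimal sketch, not the author's code: build_prompt is a hypothetical helper, and the instruction wording and line layout are assumed to mirror the Usage example in this card rather than a documented training template.

```python
def build_prompt(ancient_text: str) -> str:
    """Wrap a request in Gemma-style turn markers (hypothetical helper)."""
    # Assumption: the three-step instruction wording follows this card's Usage example.
    user_message = (
        f"Given the ancient text: 「{ancient_text}」\n"
        "1) Identify the era\n"
        "2) Provide the phonetic reading\n"
        "3) Translate into modern Chinese\n"
    )
    # A closed user turn, then an open model turn that the model completes.
    return (
        "<start_of_turn>user\n"
        f"{user_message}"
        "<end_of_turn>\n"
        "<start_of_turn>model\n"
    )


print(build_prompt("子曰：學而時習之，不亦說乎？"))
```

The returned string can be passed to the tokenizer directly; leaving the model turn open is what prompts the model to produce its answer.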

Training Data
	•	Translation: Erya dataset from RUCAIBox/Erya-dataset.
	•	Phonology: Ancient-Chinese-Phonology (ACP) for multi-era reconstructions.
	•	Fine-Tuning: LoRA-based parameter-efficient approach on Gemma 2 Instruct.
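
To illustrate why a LoRA-based approach counts as parameter-efficient, the sketch below compares the trainable parameters LoRA adds to a single weight matrix with the size of the frozen matrix itself. The dimensions and rank are illustrative assumptions; this card does not state the values used in training.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # LoRA learns two low-rank factors: A (rank x d_in) and B (d_out x rank).
    return rank * d_in + d_out * rank


def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters in the original (frozen) weight matrix."""
    return d_in * d_out


# Hypothetical sizes for illustration only, not this model's actual configuration.
d_in, d_out, rank = 2048, 2048, 16
lora = lora_param_count(d_in, d_out, rank)
full = full_param_count(d_in, d_out)
print(f"LoRA adds {lora:,} trainable parameters next to {full:,} frozen ones")
```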


Usage


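  from transformers import AutoTokenizer, AutoModelForCausalLM

  # Load the fine-tuned tokenizer and model from the Hugging Face Hub
  tokenizer = AutoTokenizer.from_pretrained("lordChipotle/SimaQian")
  model = AutoModelForCausalLM.from_pretrained("lordChipotle/SimaQian")

  # One user turn, followed by an open model turn for the model to complete
  prompt = """<start_of_turn>user
  Given the ancient text: 「子曰：學而時習之，不亦說乎？」
  1) Identify the era
  2) Provide the phonetic reading
  3) Translate into modern Chinese
  <end_of_turn>
  <start_of_turn>model
  """

  inputs = tokenizer(prompt, return_tensors="pt")

  # max_new_tokens caps only the generated tokens; max_length would also count the prompt
  outputs = model.generate(**inputs, max_new_tokens=256)

  # The decoded text contains the prompt followed by the model's answer
  print(tokenizer.decode(outputs[0]))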
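
Because the decoded output contains the prompt as well as the answer, it can help to keep only the model's turn. The helper below is a minimal sketch: extract_model_reply is a hypothetical name, and treating <eos> as the end-of-sequence token string is an assumption about the Gemma tokenizer.

```python
def extract_model_reply(decoded: str) -> str:
    """Return the text of the last model turn from a decoded generation."""
    marker = "<start_of_turn>model"
    # Keep only what follows the last model-turn marker (or everything if absent).
    reply = decoded.rsplit(marker, 1)[-1]
    # Cut at the first end-of-turn or end-of-sequence token.
    # Assumption: "<eos>" is the end-of-sequence token string.
    for stop in ("<end_of_turn>", "<eos>"):
        reply = reply.split(stop, 1)[0]
    return reply.strip()


# Placeholder text stands in for a real generation here.
sample = (
    "<bos><start_of_turn>user\nquestion\n<end_of_turn>\n"
    "<start_of_turn>model\n(model answer)<end_of_turn>"
)
print(extract_model_reply(sample))
```

With the Usage code above, this would be called as extract_model_reply(tokenizer.decode(outputs[0])).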


Limitations and Biases

	•	Era Estimation: The model may misidentify the historical era of a passage.
    
	•	Pronunciations: Reconstructions are approximate, and scholars do not always agree on a single reading.
    
	•	Contextual Accuracy: For highly contextual Ancient Chinese passages, translations may need further review by domain experts.