--- datasets: - RUCAIBox/Erya-dataset language: - en base_model: - google/gemma-2-2b-it tags: - ancient-chinese - chinese - literature - unsloth - trl - sft --- Ancient Chinese Translator + Phonology Model (SimaQian) Name Origin: The origin of the model name comes from famous ancient chinese historian Qian Sima (司馬遷), known for his Records of the Grand Historian, a general history of China covering more than two thousand years. This model combines two key functionalities for Ancient Chinese texts: 1. Translation: Converts Ancient Chinese passages into modern Chinese. 2. Phonological Reconstruction: Provides historical pronunciations for characters or entire sentences across multiple eras (e.g., Middle Tang, Song, Yuan, Ming/Qing). Model Description • Architecture: Fine-tuned on top of Google’s Gemma 2 model using LoRA. • Input Format: Special tokens / define user vs. model turns. • Output: Era identification (optional), phonetic renderings, and modern Chinese translations. Training Data • Translation: Erya dataset from RUCAIBox/Erya-dataset. • Phonology: Ancient-Chinese-Phonology (ACP) for multi-era reconstructions. • Fine-Tuning: LoRA-based parameter-efficient approach on Gemma 2 Instruct. Usage from transformers import AutoTokenizer, AutoModelForCausalLM tokenizer = AutoTokenizer.from_pretrained("lordChipotle/SimaQian") model = AutoModelForCausalLM.from_pretrained("lordChipotle/SimaQian") prompt = """ user Given the ancient text: 「子曰:學而時習之,不亦說乎?」 1) Identify the era 2) Provide the phonetic reading 3) Translate into modern Chinese model """ inputs = tokenizer(prompt, return_tensors="pt") outputs = model.generate(**inputs, max_length=256) print(tokenizer.decode(outputs[0])) Limitations and Biases • Era Estimation: Model may not always correctly guess the historical era. • Pronunciations: Reconstructions are approximate and can vary by scholarly consensus. • Contextual Accuracy: For highly contextual Ancient Chinese passages, translations may need further review by domain experts.