lordChipotle/SimaQian

lordChipotle committed on Jan 13, 2025

Commit aa8f620 · verified · 1 Parent(s): a54160f

Create README.md

Files changed (1)

README.md +52 -0

README.md ADDED

	@@ -0,0 +1,52 @@

+---
+datasets:
+- RUCAIBox/Erya-dataset
+language:
+- en
+base_model:
+- google/gemma-2-2b-it
+tags:
+- ancient-chinese
+- chinese
+- literature
+---
+Ancient Chinese Translator + Phonology Model (SimaQian)
+Name Origin:
+The model is named after the famous ancient Chinese historian Sima Qian (司馬遷), known for his Records of the Grand Historian, a general history of China covering more than two thousand years.
+This model combines two key functionalities for Ancient Chinese texts:
+	1.	Translation: Converts Ancient Chinese passages into modern Chinese.
+	2.	Phonological Reconstruction: Provides historical pronunciations for characters or entire sentences across multiple eras (e.g., Middle Tang, Song, Yuan, Ming/Qing).
+Model Description
+	•	Architecture: Fine-tuned from Google’s instruction-tuned Gemma 2 model (google/gemma-2-2b-it) using LoRA.
+	•	Input Format: Special tokens <start_of_turn> / <end_of_turn> define user vs. model turns.
+	•	Output: Era identification (optional), phonetic renderings, and modern Chinese translations.
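The turn markers named under Input Format can be assembled programmatically. The following is a minimal sketch, assuming the model expects the same turn layout as the Usage prompt in this card; `build_prompt` is a hypothetical helper name and is not part of the model or of transformers.

```python
def build_prompt(ancient_text: str) -> str:
    """Wrap a three-part request in Gemma-style turn markers (hypothetical helper)."""
    user_turn = (
        f"Given the ancient text: 「{ancient_text}」\n"
        "1) Identify the era\n"
        "2) Provide the phonetic reading\n"
        "3) Translate into modern Chinese\n"
    )
    # Close the user turn, then open the model turn so generation continues there.
    return (
        "<start_of_turn>user\n"
        f"{user_turn}"
        "<end_of_turn>\n"
        "<start_of_turn>model\n"
    )


prompt = build_prompt("子曰：學而時習之，不亦說乎？")
print(prompt)
```

Keeping the markers in one helper avoids stray whitespace differences between prompts, which can matter if the fine-tuning data used a fixed layout.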
+Training Data
+	•	Translation: Erya dataset from RUCAIBox/Erya-dataset.
+	•	Phonology: Ancient-Chinese-Phonology (ACP) for multi-era reconstructions.
+	•	Fine-Tuning: LoRA-based parameter-efficient approach on Gemma 2 Instruct.
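To illustrate why a LoRA-based approach is parameter-efficient: instead of updating a full weight matrix of shape d × k, LoRA trains two low-rank factors of shapes d × r and r × k. The sketch below compares the two parameter counts. The dimensions and rank are hypothetical illustration values, since this card does not state the LoRA configuration that was used.

```python
def lora_param_counts(d: int, k: int, r: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one d x k weight matrix."""
    full = d * k        # parameters updated by full fine-tuning
    lora = r * (d + k)  # parameters in the low-rank factors B (d x r) and A (r x k)
    return full, lora


# Hypothetical values for illustration only; not the model's actual configuration.
full, lora = lora_param_counts(d=2048, k=2048, r=8)
print(f"full: {full:,}  lora: {lora:,}  fraction trained: {lora / full:.2%}")
```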
+Usage
+  from transformers import AutoTokenizer, AutoModelForCausalLM
+  # Placeholder repository ID; replace it with the ID of the published model.
+  model_id = "username/ancient-chinese-phonology"
+  tokenizer = AutoTokenizer.from_pretrained(model_id)
+  model = AutoModelForCausalLM.from_pretrained(model_id)
+  prompt = """
+  <start_of_turn>user
+  Given the ancient text: 「子曰：學而時習之，不亦說乎？」
+  1) Identify the era
+  2) Provide the phonetic reading
+  3) Translate into modern Chinese
+  <end_of_turn>
+  <start_of_turn>model
+  """
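+  inputs = tokenizer(prompt, return_tensors="pt")
+  # max_new_tokens bounds only the generated continuation; max_length would also count the prompt tokens.
+  outputs = model.generate(**inputs, max_new_tokens=256)
+  # The decoded string contains the prompt followed by the model's reply.
+  print(tokenizer.decode(outputs[0]))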
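Because `generate` returns the prompt tokens followed by the continuation, the decoded string repeats the prompt. A minimal sketch of isolating the model's reply follows, assuming the decoded text still contains the turn markers (that is, it was decoded without `skip_special_tokens=True`). `extract_reply` is a hypothetical helper, not part of transformers, and the sample string is made up for illustration.

```python
def extract_reply(decoded: str) -> str:
    """Return the text of the last model turn in a decoded generation."""
    marker = "<start_of_turn>model"
    # Keep only what follows the final model-turn marker.
    _, found, tail = decoded.rpartition(marker)
    if not found:
        return decoded.strip()  # no marker present; return the stripped text
    # Drop everything from the end-of-turn marker onward, if the model emitted one.
    reply = tail.split("<end_of_turn>", 1)[0]
    return reply.strip()


# Made-up decoded output, used only to exercise the helper.
sample = (
    "<bos><start_of_turn>user\nhello\n<end_of_turn>\n"
    "<start_of_turn>model\nworld<end_of_turn>"
)
print(extract_reply(sample))
```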
+Limitations and Biases
+	•	Era Estimation: The model may misidentify the historical era of a text.
+	•	Pronunciations: Reconstructions are approximate and differ between scholarly reconstruction systems.
+	•	Contextual Accuracy: For highly contextual Ancient Chinese passages, translations may need further review by domain experts.