---
datasets:
- RUCAIBox/Erya-dataset
language:
- en
base_model:
- google/gemma-2-2b-it
tags:
- ancient-chinese
- chinese
- literature
---

# Ancient Chinese Translator + Phonology Model (SimaQian)

**Name Origin:**
The model is named after the famous ancient Chinese historian Sima Qian (司馬遷), known for his *Records of the Grand Historian*, a general history of China covering more than two thousand years.

This model combines two key functionalities for Ancient Chinese texts:

1. **Translation**: Converts Ancient Chinese passages into modern Chinese.
2. **Phonological Reconstruction**: Provides historical pronunciations for characters or entire sentences across multiple eras (e.g., Middle Tang, Song, Yuan, Ming/Qing).

## Model Description

- **Architecture**: Fine-tuned on top of Google's Gemma 2 model using LoRA.
- **Input Format**: Special tokens `<start_of_turn>` / `<end_of_turn>` delimit user vs. model turns.
- **Output**: Era identification (optional), phonetic renderings, and modern Chinese translations.
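The turn format can be sketched as a small prompt-building helper. `build_prompt` is a hypothetical convenience function, not part of the model's API; it simply reproduces the `<start_of_turn>` / `<end_of_turn>` layout written out by hand in the Usage section:

```python
def build_prompt(user_text: str) -> str:
    """Wrap a user request in Gemma 2 style turn markers (sketch)."""
    return (
        "<start_of_turn>user\n"
        f"{user_text}\n"
        "<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

# Example: a translation request wrapped in the expected turn markers.
prompt = build_prompt("Translate 「學而時習之」 into modern Chinese.")
print(prompt)
```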

## Training Data

- **Translation**: Erya dataset from RUCAIBox/Erya-dataset.
- **Phonology**: Ancient-Chinese-Phonology (ACP) for multi-era reconstructions.
- **Fine-Tuning**: LoRA-based parameter-efficient approach on Gemma 2 Instruct.
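A minimal sketch of what such a LoRA setup might look like with the `peft` library. The rank, alpha, dropout, and target modules below are illustrative assumptions, not the recipe actually used for this model:

```python
from peft import LoraConfig

# Illustrative hyperparameters -- the actual training configuration
# for SimaQian is not documented in this card.
lora_config = LoraConfig(
    r=16,                      # low-rank adapter dimension (assumed)
    lora_alpha=32,             # scaling factor (assumed)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections (assumed)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
```

Such a config would then be attached to the base model with `peft.get_peft_model` before training.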

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("username/ancient-chinese-phonology")
model = AutoModelForCausalLM.from_pretrained("username/ancient-chinese-phonology")

prompt = """<start_of_turn>user
Given the ancient text: 「子曰:學而時習之,不亦說乎?」
1) Identify the era
2) Provide the phonetic reading
3) Translate into modern Chinese
<end_of_turn>
<start_of_turn>model
"""

inputs = tokenizer(prompt, return_tensors="pt")
# max_new_tokens bounds the generated reply; max_length would also
# count the prompt tokens and can truncate generation unexpectedly.
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))
```
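Because `decode` returns the full sequence, prompt included, the model's answer can be isolated with a small post-processing step. `extract_reply` is a hypothetical helper; it assumes `decode()` was called without `skip_special_tokens=True`, so the Gemma turn markers are still present in the text:

```python
def extract_reply(decoded: str) -> str:
    """Return only the model's reply from a decoded sequence (sketch)."""
    # Keep everything after the last model-turn marker...
    reply = decoded.rsplit("<start_of_turn>model", 1)[-1]
    # ...and drop the closing marker, if the model emitted one.
    return reply.split("<end_of_turn>", 1)[0].strip()

# Hand-written stand-in for a decoded sequence (not real model output):
decoded = (
    "<start_of_turn>user\n"
    "Translate 「學而時習之」 into modern Chinese.\n"
    "<end_of_turn>\n"
    "<start_of_turn>model\n"
    "要学习并时常温习它。\n"
    "<end_of_turn>"
)
print(extract_reply(decoded))  # → 要学习并时常温习它。
```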

## Limitations and Biases

- **Era Estimation**: The model may not always correctly guess the historical era.
- **Pronunciations**: Reconstructions are approximate and can vary by scholarly consensus.
- **Contextual Accuracy**: For highly contextual Ancient Chinese passages, translations may need further review by domain experts.