cbdb committed · Commit 87a461c · verified · 1 Parent(s): b43684c

Initialize README file

Files changed (1)
  1. README.md +94 -0
README.md ADDED
@@ -0,0 +1,94 @@
+ ---
+ language:
+ - zh
+ tags:
+ - Seq2SeqLM
+ - 古文
+ - 文言文
+ - 中国古代官职地名拆分
+ - ancient
+ - classical
+ license: cc-by-nc-sa-4.0
+ ---
+
+ # <font color="IndianRed"> OTAS (Office Title Address Splitter)</font>
+ [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1UoG3QebyBlK6diiYckiQv-5dRB9dA4iv?usp=sharing/)
+
+ Our model <font color="cornflowerblue">OTAS (Office Title Address Splitter) </font> is a named entity recognition model for Classical Chinese that is intended to <font color="IndianRed">split the address portion from Classical Chinese office titles</font>. The model is initialized from the raynardj/classical-chinese-punctuation-guwen-biaodian Classical Chinese punctuation model and fine-tuned on over 25,000 high-quality punctuation pairs collected by the CBDB (China Biographical Database) group.
+
+ ### <font color="IndianRed"> How to use </font>
+
+ Here is how to run this model on a list of office titles in PyTorch:
+
+ <font color="cornflowerblue"> 1. Import model and packages </font>
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForTokenClassification
+
+ # Use the GPU when one is available, otherwise fall back to the CPU
+ device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+ model_name = 'cbdb/OfficeTitleAddressSplitter'
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForTokenClassification.from_pretrained(model_name).to(device)
+ ```
+
+ <font color="cornflowerblue"> 2. Load Data </font>
+ ```python
+ # Load your data here
+ tobe_splitted = ['湖南常德協中軍都司','廣東鹽運使','漢軍鑲黃旗副都統']
+ ```
+
+ work-in-progress
+
+ <font color="cornflowerblue"> 3. Make a prediction </font>
+ ```python
+ import numpy as np
+ import torch
+
+ # Assumed maximum sequence length; the original snippet left this undefined
+ max_seq_len = 128
+
+ # Tokenize the whole batch of office titles at once
+ tokens_test = tokenizer(
+     tobe_splitted,
+     add_special_tokens=True,
+     return_attention_mask=True,
+     padding=True,
+     max_length=max_seq_len,
+     return_tensors='pt',
+     truncation=True
+ )
+
+ test_seq = tokens_test['input_ids'].to(device)
+ test_mask = tokens_test['attention_mask'].to(device)
+
+ # get predictions for test data
+ with torch.no_grad():
+     outputs = model(input_ids=test_seq, attention_mask=test_mask)
+ logits = outputs.logits.detach().cpu().numpy()
+
+ # Highest-scoring label index for every token of every title
+ # (argmax over the logits equals argmax over the softmax scores)
+ pred_label_ids = np.argmax(logits, axis=2)
+
+ # Print each title with its (token, predicted label) pairs
+ for title, ids, label_ids in zip(tobe_splitted, test_seq.cpu().tolist(), pred_label_ids.tolist()):
+     tokens = tokenizer.convert_ids_to_tokens(ids)
+     labels = [model.config.id2label[i] for i in label_ids]
+     print(f'{title}: {list(zip(tokens, labels))}')
+ ```
+ 講筵官: Lecturer<br>
+ 判司簿尉: Supervisor of the Commandant of Records<br>
+ 散騎常侍: Policy Advisor<br>
+ 殿中省尚輦奉御: Chief Steward of the Palace Administration<br>
+
+ work-in-progress
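
The prediction step yields one label per token. As a minimal sketch of how such labels could be turned into an address/title split, the helper below cuts a string after every character that carries a boundary label. The function name `split_by_labels` and the label names `SPLIT` and `O` are hypothetical; the model's actual label set is not documented here, and the example labels are written by hand rather than produced by the model.

```python
from typing import List


def split_by_labels(text: str, labels: List[str], boundary_label: str = "SPLIT") -> List[str]:
    """Cut `text` after every character whose label equals `boundary_label`.

    `boundary_label` is a hypothetical label name used for illustration only.
    """
    if len(text) != len(labels):
        raise ValueError("expected exactly one label per character")

    pieces: List[str] = []
    current = ""
    for char, label in zip(text, labels):
        current += char
        if label == boundary_label:
            # This character closes the current piece (e.g. the address part)
            pieces.append(current)
            current = ""
    if current:
        # Whatever follows the last boundary (e.g. the office title itself)
        pieces.append(current)
    return pieces


# Hand-written labels for illustration only; these are not model output.
example_labels = ["O", "SPLIT", "O", "O", "O"]
print(split_by_labels("廣東鹽運使", example_labels))
```

With real predictions, the labels for the special tokens and padding positions would need to be dropped first so that exactly one label remains per character.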
+
+
+ ### <font color="IndianRed">Authors </font>
+ Queenie Luo (queenieluo[at]g.harvard.edu)
+ <br>
+ Hongsu Wang
+ <br>
+ Peter Bol
+ <br>
+ CBDB Group
+
+ ### <font color="IndianRed">License </font>
+ Copyright (c) 2023 CBDB
+
+ Except where otherwise noted, content on this repository is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).
+ To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/ or
+ send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.