alever_sn commited on
Commit
275a48e
·
1 Parent(s): 6c4fd03
README.md CHANGED
@@ -1,3 +1,349 @@
1
- ---
2
- license: mit
3
- ---
1
+ # Improving Low-Resource Sequence Labeling with Knowledge Fusion and Contextual Label Explanations
2
+
3
+ ![workflow](docs/assets/workflow.png)
4
+
5
+ ## 🌍 Overview
6
+
7
+ This repository provides the official implementation of our paper:
8
+
9
+ > **Improving Low-Resource Sequence Labeling with Knowledge Fusion and Contextual Label Explanations**
10
+ > [arXiv:2501.19093](https://arxiv.org/abs/2501.19093)
11
+
12
+ Low-resource sequence labeling often suffers from data sparsity and limited contextual generalization.
13
+ We propose **KnowFREE (Knowledge-Fused Representation Enhancement Framework)** — a framework that integrates **external linguistic knowledge** and **contextual label explanations** into the model’s representation space to enhance low-resource performance.
14
+
15
+ **Key Highlights:**
16
+
17
+ We combine an **LLM-based knowledge enhancement workflow** with a **span-based KnowFREE model** to address these challenges.
18
+
19
+ **Pipeline 1: Label Extension Annotation**
20
+ * Objective: Use LLMs to generate extension entity labels, word segmentation tags, and POS tags for the original samples.
21
+ * Effect:
22
+ * Enhances the model's understanding of fine-grained contextual semantics.
23
+ * Improves the ability to distinguish entity boundaries in character-dense languages.
24
+
25
+ **Pipeline 2: Enriched Explanation Synthesis**
26
+
27
+ * Objective: Use LLMs to generate detailed, context-aware explanations for target entities, synthesizing new, high-quality training samples.
28
+ * Effect:
29
+ * Effectively mitigates semantic distribution bias between synthetic samples and the target domain.
30
+ * Significantly expands the number of samples and improves model performance in extremely low-resource settings.
31
+
32
+
33
+
34
+ ---
35
+
36
+ ## 🔗 Quick Links
37
+
38
+ - [Model Checkpoints](#♠️-model-checkpoints)
39
+ - [Data Augmentation Workflow](#📊-data-augmentation-workflow)
40
+ - [Train KnowFREE](#🔥-run-knowfree-models)
41
+ - [Citation](#📚-citation)
42
+
43
+ ## ♠️ Model Checkpoints
44
+
45
+ Due to the large number of experiments, the architectural differences between the initial and reconstructed models, and the limited practical value of low-resource checkpoints sampled from the full dataset, we release only a few representative checkpoints (e.g., Weibo) on Hugging Face for reference:
46
+
47
+ | Model | F1 |
48
+ | :------------------------------------------------------------------------------------------------------------------- | :---: |
49
+ | [aleversn/KnowFREE-Weibo-BERT-base (Many shots 1000 with ChatGLM3)](https://huggingface.co/aleversn/GCSE-BERT-base) | 76.78 |
50
+ | [aleversn/KnowFREE-Youku-BERT-base (Many shots 1000 with ChatGLM3)](https://huggingface.co/aleversn/GCSE-BERT-large) | 84.50 |
51
+
52
+ ---
53
+
54
+ ## 🧩 KnowFREE Framework
55
+
56
+ ![KnowFREE](docs/assets/knowfree.png)
57
+
58
+ **Architecture**: A Biaffine-based span model that supports **nested entity** annotation.
59
+
60
+ **Core Innovations:**
61
+
62
+ * Introduces a **Local Multi-head Attention Layer** to efficiently fuse the multi-type extension label features generated in Pipeline 1.
63
+ * **No External Knowledge Needed for Inference:** the model learns to fuse knowledge during training; the logits of extension labels are masked at inference time.
64
+
65
+ ---
66
+
67
+ ## ⚙️ Installation Guide
68
+
69
+ ### Core Dependencies
70
+
71
+ Create an environment and install dependencies:
72
+
73
+ ```bash
74
+ conda create -n knowfree python=3.8
75
+ conda activate knowfree
76
+ ```
77
+
78
+ ```bash
79
+ pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html
80
+ pip install transformers==4.18.0 fastNLP==1.0.1 PrettyTable
81
+ pip install torch-scatter==2.0.8 -f https://data.pyg.org/whl/torch-1.8.0+cu111.html
82
+ ```
83
+
84
+ ## 📊 Data Augmentation Workflow
85
+
86
+ See the detailed data synthesis pipeline in [Syn_Pipelines](docs/Syn_Pipelines.md).
87
+
88
+ In KnowFREE, we employ **contextual paraphrasing and label explanation synthesis** to augment low-resource datasets.
89
+ For each entity label, LLMs generate descriptive explanations that are integrated into the learning process to mitigate label semantic sparsity.
90
+
91
+ ---
92
+
93
+ ## 🔥 Run KnowFREE Models
94
+
95
+ ### Training with `KnowFREE`
96
+
97
+ #### Dataset Format
98
+
99
+ Specify the dataset path using the `data_present_path` argument (default: `./datasets/present.json`). The file should be a JSON object with the following format:
100
+
101
+ ```json
102
+ {
103
+ "weibo": {
104
+ "train": "./datasets/weibo/train.jsonl",
105
+ "dev": "./datasets/weibo/dev.jsonl",
106
+ "test": "./datasets/weibo/test.jsonl",
107
+ "labels": "./datasets/weibo/labels.txt"
108
+ }
109
+ }
110
+ ```
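As a quick sanity check, the `present.json` registry can be loaded and each referenced file verified before training. This is a minimal sketch (the helper name `load_present` is ours, not part of the repository); the default path is the one documented above:

```python
import json
import os

def load_present(path="./datasets/present.json"):
    """Load the dataset registry and report any referenced files that are missing."""
    with open(path, encoding="utf-8") as f:
        present = json.load(f)
    missing = []
    for name, splits in present.items():
        for split, file_path in splits.items():
            if not os.path.exists(file_path):
                missing.append((name, split, file_path))
    return present, missing
```

Running this once before launching a job catches broken `train`/`dev`/`test`/`labels` paths early.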
111
+
112
+ **Train Samples of Different Languages:**
113
+
114
+ - Chinese
115
+
116
+ ```jsonl
117
+ {"text": ["科", "技", "全", "方", "位", "资", "讯", "智", "能", ",", "快", "捷", "的", "汽", "车", "生", "活", "需", "要", "有", "三", "屏", "一", "云", "爱", "你"], "label": ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"], "entities": []}
118
+ {"text": ["对", ",", "输", "给", "一", "个", "女", "人", ",", "的", "成", "绩", "。", "失", "望"], "label": ["O", "O", "O", "O", "O", "O", "B-PER.NOM", "E-PER.NOM", "O", "O", "O", "O", "O", "O", "O"], "entities": [{"start": 6, "entity": "PER.NOM", "end": 8, "text": ["女", "人"]}]}
119
+ {"text": ["今", "天", "下", "午", "起", "来", "看", "到", "外", "面", "的", "太", "阳", "。", "。", "。", "。", "我", "第", "一", "反", "应", "竟", "然", "是", "强", "烈", "的", "想", "回", "家", "泪", "想", "我", "们", "一", "起", "在", "嘉", "鱼", "个", "时", "候", "了", "。", "。", "。", "。", "有", "好", "多", "好", "多", "的", "话", "想", "对", "你", "说", "李", "巾", "凡", "想", "要", "瘦", "瘦", "瘦", "成", "李", "帆", "我", "是", "想", "切", "开", "云", "朵", "的", "心"], "label": ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "B-LOC.NAM", "E-LOC.NAM", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "B-PER.NAM", "I-PER.NAM", "E-PER.NAM", "O", "O", "O", "O", "O", "O", "B-PER.NAM", "E-PER.NAM", "O", "O", "O", "O", "O", "O", "O", "O", "O"], "entities": [{"start": 38, "entity": "LOC.NAM", "end": 40, "text": ["嘉", "鱼"]}, {"start": 59, "entity": "PER.NAM", "end": 62, "text": ["李", "巾", "凡"]}, {"start": 68, "entity": "PER.NAM", "end": 70, "text": ["李", "帆"]}]}
120
+ ```
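In these samples, `start` is inclusive and `end` is exclusive, so each entity's `text` must equal the corresponding slice of the sentence tokens. A minimal consistency check (not part of the released code) for one `.jsonl` line:

```python
import json

def check_sample(line):
    """Verify each entity span matches the token slice it annotates."""
    sample = json.loads(line)
    tokens = sample["text"]
    for ent in sample.get("entities", []):
        # `start` is inclusive and `end` is exclusive, as in the samples above
        assert tokens[ent["start"]:ent["end"]] == ent["text"], ent
    return sample
```

For the second Weibo sample above, `tokens[6:8] == ["女", "人"]`, matching its `PER.NOM` entity; running this over a whole file catches off-by-one span errors before training.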
121
+
122
+ - English
123
+
124
+ ```jsonl
125
+ {"text": ["im", "thinking", "of", "a", "comedy", "where", "a", "group", "of", "husbands", "receive", "one", "chance", "from", "their", "wives", "to", "engage", "with", "other", "women"], "entities": [{"start": 4, "end": 5, "entity": "GENRE", "text": ["comedy"]}, {"start": 6, "end": 21, "entity": "PLOT", "text": ["a", "group", "of", "husbands", "receive", "one", "chance", "from", "their", "wives", "to", "engage", "with", "other", "women"]}]}
126
+ {"text": ["another", "sequel", "of", "an", "action", "movie", "about", "drag", "street", "car", "races", "alcohol", "and", "gun", "violence"], "entities": [{"start": 1, "end": 2, "entity": "RELATIONSHIP", "text": ["sequel"]}, {"start": 4, "end": 5, "entity": "GENRE", "text": ["action"]}, {"start": 7, "end": 15, "entity": "PLOT", "text": ["drag", "street", "car", "races", "alcohol", "and", "gun", "violence"]}]}
127
+ {"text": ["what", "is", "the", "name", "of", "the", "movie", "in", "which", "a", "group", "of", "criminals", "begin", "to", "suspect", "that", "one", "of", "them", "is", "a", "police", "informant", "after", "a", "simple", "jewelery", "heist", "goes", "terribly", "wrong"], "entities": [{"start": 9, "end": 32, "entity": "PLOT", "text": ["a", "group", "of", "criminals", "begin", "to", "suspect", "that", "one", "of", "them", "is", "a", "police", "informant", "after", "a", "simple", "jewelery", "heist", "goes", "terribly", "wrong"]}]}
128
+ {"text": ["a", "movie", "with", "vin", "diesel", "in", "world", "war", "2", "in", "a", "foreign", "country", "shooting", "people"], "entities": [{"start": 3, "end": 5, "entity": "ACTOR", "text": ["vin", "diesel"]}, {"start": 6, "end": 9, "entity": "GENRE", "text": ["world", "war", "2"]}, {"start": 11, "end": 15, "entity": "PLOT", "text": ["foreign", "country", "shooting", "people"]}]}
129
+ {"text": ["what", "is", "the", "1991", "disney", "animated", "movie", "that", "featured", "angela", "lansbury", "as", "the", "voice", "of", "a", "teapot"], "entities": [{"start": 3, "end": 4, "entity": "YEAR", "text": ["1991"]}, {"start": 5, "end": 6, "entity": "GENRE", "text": ["animated"]}, {"start": 9, "end": 11, "entity": "ACTOR", "text": ["angela", "lansbury"]}, {"start": 16, "end": 17, "entity": "CHARACTER_NAME", "text": ["teapot"]}]}
130
+ ```
131
+
132
+ - Japanese
133
+
134
+ ```jsonl
135
+ {"text": ["I", "n", "f", "o", "r", "m", "i", "x", "の", "動", "き", "を", "み", "て", "、", "オ", "ラ", "ク", "ル", "と", "I", "B", "M", "も", "追", "随", "し", "た", "。"], "entities": [{"start": 0, "end": 8, "entity": "法人名", "text": ["I", "n", "f", "o", "r", "m", "i", "x"]}, {"start": 15, "end": 19, "entity": "法人名", "text": ["オ", "ラ", "ク", "ル"]}, {"start": 20, "end": 23, "entity": "法人名", "text": ["I", "B", "M"]}]}
136
+ {"text": ["現", "在", "は", "ア", "ニ", "メ", "ー", "シ", "ョ", "ン", "業", "界", "か", "ら", "退", "い", "て", "お", "り", "、", "水", "彩", "画", "家", "と", "し", "て", "も", "活", "動", "し", "て", "い", "る", "。"], "entities": []}
137
+ {"text": ["大", "野", "東", "イ", "ン", "タ", "ー", "チ", "ェ", "ン", "ジ", "は", "、", "大", "分", "県", "豊", "後", "大", "野", "市", "大", "野", "町", "後", "田", "に", "あ", "る", "中", "九", "州", "横", "断", "道", "路", "の", "イ", "ン", "タ", "ー", "チ", "ェ", "ン", "ジ", "で", "あ", "る", "。"], "entities": [{"start": 0, "end": 11, "entity": "施設名", "text": ["大", "野", "東", "イ", "ン", "タ", "ー", "チ", "ェ", "ン", "ジ"]}, {"start": 13, "end": 26, "entity": "地名", "text": ["大", "分", "県", "豊", "後", "大", "野", "市", "大", "野", "町", "後", "田"]}, {"start": 29, "end": 36, "entity": "施設名", "text": ["中", "九", "州", "横", "断", "道", "路"]}]}
138
+ {"text": ["2", "0", "1", "4", "年", "1", "月", "1", "5", "日", "、", "マ", "バ", "タ", "は", "ミ", "ャ", "ン", "マ", "ー", "の", "上", "座", "部", "仏", "教", "を", "擁", "護", "す", "る", "使", "命", "を", "持", "っ", "て", "、", "マ", "ン", "ダ", "レ", "ー", "の", "仏", "教", "僧", "の", "大", "規", "模", "な", "会", "議", "で", "正", "式", "に", "設", "立", "さ", "れ", "た", "。"], "entities": [{"start": 11, "end": 14, "entity": "法人名", "text": ["マ", "バ", "タ"]}, {"start": 15, "end": 20, "entity": "地名", "text": ["ミ", "ャ", "ン", "マ", "ー"]}, {"start": 38, "end": 43, "entity": "地名", "text": ["マ", "ン", "ダ", "レ", "ー"]}]}
139
+ {"text": ["永", "泰", "荘", "駅", "は", "、", "中", "華", "人", "民", "共", "和", "国", "北", "京", "市", "海", "淀", "区", "に", "位", "置", "す", "る", "北", "京", "地", "下", "鉄", "8", "号", "線", "の", "駅", "で", "あ", "る", "。"], "entities": [{"start": 0, "end": 4, "entity": "施設名", "text": ["永", "泰", "荘", "駅"]}, {"start": 6, "end": 19, "entity": "地名", "text": ["中", "華", "人", "民", "共", "和", "国", "北", "京", "市", "海", "淀", "区"]}]}
140
+ ```
141
+
142
+ - Korean
143
+
144
+ ```jsonl
145
+ {"text": ["그", "모습", "을", "보", "ㄴ", "민이", "는", "할아버지", "가", "마치", "전쟁터", "에서", "이기", "고", "돌아오", "ㄴ", "장군", "처럼", "의젓", "하", "아", "보이", "ㄴ다고", "생각", "하", "았", "습니다", "."], "entities": [{"start": 5, "end": 6, "entity": "PS", "text": ["민이"]}]}
146
+ {"text": ["내달", "18", "일", "부터", "내년", "2", "월", "20", "일", "까지", "는", "서울역", "에서", "무주리조트", "부근", "까지", "스키관광", "열차", "를", "운행", "하", "ㄴ다", "."], "entities": [{"start": 0, "end": 10, "entity": "DT", "text": ["내달", "18", "일", "부터", "내년", "2", "월", "20", "일", "까지"]}, {"start": 11, "end": 12, "entity": "LC", "text": ["서울역"]}, {"start": 13, "end": 14, "entity": "OG", "text": ["무주리조트"]}]}
147
+ {"text": ["호소력", "있", "고", "선동", "적", "이", "ㄴ", "주제", "를", "잡아내", "는", "데", "능하", "ㄴ", "즈윅", "이", "지만", "이", "영화", "에서", "는", "무엇", "이", "호소력", "이", "있", "을지", "결정", "하", "지", "못하", "고", "망설이", "ㄴ다", "."], "entities": [{"start": 14, "end": 15, "entity": "PS", "text": ["즈윅"]}]}
148
+ {"text": ["그래서", "세호", "는", "밤", "이", "면", "친구", "네", "집", "을", "돌아다니", "며", "아버지", "몰래", "연습", "을", "하", "았", "습니다", "."], "entities": [{"start": 1, "end": 2, "entity": "PS", "text": ["세호"]}, {"start": 3, "end": 4, "entity": "TI", "text": ["밤"]}]}
149
+ {"text": ["황씨", "는", "자신", "이", "어리", "어서", "듣", "은", "이", "이야기", "가", "어린이", "들", "에게", "소박", "하", "ㄴ", "효자", "의", "마음", "을", "전하", "아", "주", "ㄹ", "수", "있", "을", "것", "같", "아", "5", "분", "짜리", "구연동화", "로", "각색", "하", "았", "다고", "말", "하", "ㄴ다", "."], "entities": [{"start": 0, "end": 1, "entity": "PS", "text": ["황씨"]}, {"start": 31, "end": 33, "entity": "TI", "text": ["5", "분"]}]}
150
+ {"text": ["아버지", "가", "돌아가", "시", "ㄴ", "뒤", "어머니", "의", "편애", "를", "배경", "으로", "승주", "는", "집안", "에서", "만", "은", "대단", "하", "ㄴ", "권세", "를", "누리", "었", "다", "."], "entities": [{"start": 12, "end": 13, "entity": "PS", "text": ["승주"]}]}
151
+ ```
152
+
153
+ **Labels**
154
+
155
+ - `.txt`
156
+
157
+ ```
158
+ O
159
+ GPE.NAM
160
+ GPE.NOM
161
+ LOC.NAM
162
+ LOC.NOM
163
+ ORG.NAM
164
+ ORG.NOM
165
+ PER.NAM
166
+ PER.NOM
167
+ ```
168
+
169
+ - `.json` / `.jsonl`
170
+
171
+ ```json
172
+ {
173
+ "O": {
174
+ "idx": 0,
175
+ "count": -1,
176
+ "is_target": true
177
+ },
178
+ "GPE.NAM": {
179
+ "idx": 1,
180
+ "count": -1,
181
+ "is_target": true
182
+ },
183
+ "GPE.NOM": {
184
+ "idx": 2,
185
+ "count": -1,
186
+ "is_target": true
187
+ },
188
+ "LOC.NAM": {
189
+ "idx": 3,
190
+ "count": -1,
191
+ "is_target": true
192
+ },
193
+ "LOC.NOM": {
194
+ "idx": 4,
195
+ "count": -1,
196
+ "is_target": true
197
+ },
198
+ "ORG.NAM": {
199
+ "idx": 5,
200
+ "count": -1,
201
+ "is_target": true
202
+ },
203
+ "ORG.NOM": {
204
+ "idx": 6,
205
+ "count": -1,
206
+ "is_target": true
207
+ },
208
+ "PER.NAM": {
209
+ "idx": 7,
210
+ "count": -1,
211
+ "is_target": true
212
+ },
213
+ "PER.NOM": {
214
+ "idx": 8,
215
+ "count": -1,
216
+ "is_target": true
217
+ },
218
+ "ADJECTIVE": {
219
+ "idx": 9,
220
+ "count": 1008,
221
+ "is_target": false
222
+ },
223
+ "ADPOSITION": {
224
+ "idx": 10,
225
+ "count": 41,
226
+ "is_target": false
227
+ },
228
+ "ADVERB": {
229
+ "idx": 11,
230
+ "count": 1147,
231
+ "is_target": false
232
+ },
233
+ "APP": {
234
+ "idx": 12,
235
+ "count": 3,
236
+ "is_target": false
237
+ },
238
+ "AUXILIARY": {
239
+ "idx": 13,
240
+ "count": 4,
241
+ "is_target": false
242
+ },...
243
+ }
244
+ ```
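The `.json` form can be derived from a plain `.txt` label list. A minimal sketch (the helper name `txt_to_json` is ours, not part of the repository): target labels get `count: -1` and `is_target: true`, while extension labels produced by the augmentation pipeline would be appended afterwards with their observed counts and `is_target: false`:

```python
import json

def txt_to_json(txt_path, json_path):
    """Convert a flat label list (.txt) into the indexed label dictionary (.json)."""
    with open(txt_path, encoding="utf-8") as f:
        names = [line.strip() for line in f if line.strip()]
    # target labels: sequential idx, count -1, is_target true
    labels = {
        name: {"idx": i, "count": -1, "is_target": True}
        for i, name in enumerate(names)
    }
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(labels, f, ensure_ascii=False, indent=4)
    return labels
```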
245
+
246
+ * **Model**: BERT / RoBERTa
247
+
248
+ ```python
249
+ from main.trainers.knowfree_trainer import Trainer
250
+ from transformers import BertTokenizer, BertConfig
251
+
252
+ MODEL_PATH = "<MODEL_PATH>"
253
+ tokenizer = BertTokenizer.from_pretrained(MODEL_PATH)
254
+ config = BertConfig.from_pretrained(MODEL_PATH)
255
+ trainer = Trainer(tokenizer=tokenizer, config=config, from_pretrained=MODEL_PATH,
256
+ data_name='<DATASET_NAME>',
257
+ batch_size=4,
258
+ batch_size_eval=8,
259
+ task_name='<TASK_NAME>')
260
+
261
+ for i in trainer(num_epochs=120, other_lr=1e-3, weight_decay=0.01, remove_clashed=True, nested=False, eval_call_step=lambda x: x % 125 == 0):
262
+ a = i
263
+ ```
264
+
265
+ **Key Params**
266
+
267
+ - `other_lr`: learning rate for the non-PLM parameters.
268
+ - `remove_clashed`: remove overlapping entity spans, keeping only the span with the smallest start position.
269
+ - `nested`: whether to support nested entities. For nested datasets such as `CMeEE`, set it to `true` and disable `remove_clashed`.
270
+ - `eval_call_step`: a callable that receives the current step `x` and returns `True` when evaluation should run.
271
+
272
+ #### Evaluation Only
273
+
274
+ Comment out the training loop to evaluate directly:
275
+
276
+ ```python
277
+ trainer.eval(0, is_eval=True)
278
+ ```
279
+
280
+ ### Training with `CNN Nested NER`
281
+
282
+ ```python
283
+ from main.trainers.cnnner_trainer import Trainer
284
+ from transformers import BertTokenizer, BertConfig
285
+
286
+ MODEL_PATH = "<MODEL_PATH>"
287
+ tokenizer = BertTokenizer.from_pretrained(MODEL_PATH)
288
+ config = BertConfig.from_pretrained(MODEL_PATH)
289
+ trainer = Trainer(tokenizer=tokenizer, config=config, from_pretrained=MODEL_PATH,
290
+ data_name='<DATASET_NAME>',
291
+ batch_size=4,
292
+ batch_size_eval=8,
293
+ task_name='<TASK_NAME>')
294
+
295
+ for i in trainer(num_epochs=120, other_lr=1e-3, weight_decay=0.01, remove_clashed=True, nested=False, eval_call_step=lambda x: x % 125 == 0):
296
+ a = i
297
+ ```
298
+
299
+ #### Prediction
300
+
301
+ ```python
302
+ from main.predictor.knowfree_predictor import KnowFREEPredictor
303
+ from transformers import BertTokenizer, BertConfig
304
+
305
+ MODEL_PATH = "<MODEL_PATH>"
306
+ LABEL_FILE = '<LABEL_PATH>'
307
+ tokenizer = BertTokenizer.from_pretrained(MODEL_PATH)
308
+ config = BertConfig.from_pretrained(MODEL_PATH)
309
+ pred = KnowFREEPredictor(tokenizer=tokenizer, config=config, from_pretrained=MODEL_PATH, label_file=LABEL_FILE, batch_size=4)
310
+
311
+ for entities in pred(['叶赟葆:全球时尚财运滚滚而来钱', '我要去我要去花心花心花心耶分手大师贵仔邓超四大名捕围观话筒转发邓超贴吧微博号外话筒望周知。邓超四大名捕']):
312
+ print(entities)
313
+ ```
314
+
315
+ **Result**
316
+
317
+ ```python
+ [
+     [
+         {'start': 0, 'end': 3, 'entity': 'PER.NAM', 'text': ['叶', '赟', '葆']}
+     ],
+     [
+         {'start': 45, 'end': 47, 'entity': 'PER.NAM', 'text': ['邓', '超']},
+         {'start': 19, 'end': 21, 'entity': 'PER.NAM', 'text': ['邓', '超']},
+         {'start': 31, 'end': 33, 'entity': 'PER.NAM', 'text': ['邓', '超']}
+     ]
+ ]
+ ```
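The predictor returns character-level spans; joining each entity's `text` list yields the surface string. A small post-processing sketch (the helper `to_surface` is illustrative, not part of the released code):

```python
def to_surface(entity_lists):
    """Collapse each predicted entity's token list into (label, surface, start, end)."""
    return [
        [(e["entity"], "".join(e["text"]), e["start"], e["end"]) for e in ents]
        for ents in entity_lists
    ]

preds = [[{'start': 0, 'end': 3, 'entity': 'PER.NAM', 'text': ['叶', '赟', '葆']}]]
print(to_surface(preds))  # [[('PER.NAM', '叶赟葆', 0, 3)]]
```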
337
+
338
+ ## 📚 Citation
339
+ ```bibtex
340
+ @misc{lai2025improvinglowresourcesequencelabeling,
341
+ title={Improving Low-Resource Sequence Labeling with Knowledge Fusion and Contextual Label Explanations},
342
+ author={Peichao Lai and Jiaxin Gan and Feiyang Ye and Yilei Wang and Bin Cui},
343
+ year={2025},
344
+ eprint={2501.19093},
345
+ archivePrefix={arXiv},
346
+ primaryClass={cs.CL},
347
+ url={https://arxiv.org/abs/2501.19093},
348
+ }
349
+ ```
config.json ADDED
@@ -0,0 +1,181 @@
1
+ {
2
+ "_name_or_path": "/home/lpc/models/chinese-bert-wwm-ext/",
3
+ "architectures": [
4
+ "CNNNerv1"
5
+ ],
6
+ "attention_probs_dropout_prob": 0.1,
7
+ "biaffine_size": 200,
8
+ "classifier_dropout": null,
9
+ "cnn_depth": 3,
10
+ "cnn_dim": 200,
11
+ "directionality": "bidi",
12
+ "hidden_act": "gelu",
13
+ "hidden_dropout_prob": 0.1,
14
+ "hidden_size": 768,
15
+ "id2label": {
16
+ "0": "LABEL_0",
17
+ "1": "LABEL_1",
18
+ "2": "LABEL_2",
19
+ "3": "LABEL_3",
20
+ "4": "LABEL_4",
21
+ "5": "LABEL_5",
22
+ "6": "LABEL_6",
23
+ "7": "LABEL_7",
24
+ "8": "LABEL_8",
25
+ "9": "LABEL_9",
26
+ "10": "LABEL_10",
27
+ "11": "LABEL_11",
28
+ "12": "LABEL_12",
29
+ "13": "LABEL_13",
30
+ "14": "LABEL_14",
31
+ "15": "LABEL_15",
32
+ "16": "LABEL_16",
33
+ "17": "LABEL_17",
34
+ "18": "LABEL_18",
35
+ "19": "LABEL_19",
36
+ "20": "LABEL_20",
37
+ "21": "LABEL_21",
38
+ "22": "LABEL_22",
39
+ "23": "LABEL_23",
40
+ "24": "LABEL_24",
41
+ "25": "LABEL_25",
42
+ "26": "LABEL_26",
43
+ "27": "LABEL_27",
44
+ "28": "LABEL_28",
45
+ "29": "LABEL_29",
46
+ "30": "LABEL_30",
47
+ "31": "LABEL_31",
48
+ "32": "LABEL_32",
49
+ "33": "LABEL_33",
50
+ "34": "LABEL_34",
51
+ "35": "LABEL_35",
52
+ "36": "LABEL_36",
53
+ "37": "LABEL_37",
54
+ "38": "LABEL_38",
55
+ "39": "LABEL_39",
56
+ "40": "LABEL_40",
57
+ "41": "LABEL_41",
58
+ "42": "LABEL_42",
59
+ "43": "LABEL_43",
60
+ "44": "LABEL_44",
61
+ "45": "LABEL_45",
62
+ "46": "LABEL_46",
63
+ "47": "LABEL_47",
64
+ "48": "LABEL_48",
65
+ "49": "LABEL_49",
66
+ "50": "LABEL_50",
67
+ "51": "LABEL_51",
68
+ "52": "LABEL_52",
69
+ "53": "LABEL_53",
70
+ "54": "LABEL_54",
71
+ "55": "LABEL_55",
72
+ "56": "LABEL_56",
73
+ "57": "LABEL_57",
74
+ "58": "LABEL_58",
75
+ "59": "LABEL_59",
76
+ "60": "LABEL_60",
77
+ "61": "LABEL_61",
78
+ "62": "LABEL_62",
79
+ "63": "LABEL_63",
80
+ "64": "LABEL_64",
81
+ "65": "LABEL_65",
82
+ "66": "LABEL_66",
83
+ "67": "LABEL_67"
84
+ },
85
+ "initializer_range": 0.02,
86
+ "intermediate_size": 3072,
87
+ "kernel_size": 3,
88
+ "label2id": {
89
+ "LABEL_0": 0,
90
+ "LABEL_1": 1,
91
+ "LABEL_10": 10,
92
+ "LABEL_11": 11,
93
+ "LABEL_12": 12,
94
+ "LABEL_13": 13,
95
+ "LABEL_14": 14,
96
+ "LABEL_15": 15,
97
+ "LABEL_16": 16,
98
+ "LABEL_17": 17,
99
+ "LABEL_18": 18,
100
+ "LABEL_19": 19,
101
+ "LABEL_2": 2,
102
+ "LABEL_20": 20,
103
+ "LABEL_21": 21,
104
+ "LABEL_22": 22,
105
+ "LABEL_23": 23,
106
+ "LABEL_24": 24,
107
+ "LABEL_25": 25,
108
+ "LABEL_26": 26,
109
+ "LABEL_27": 27,
110
+ "LABEL_28": 28,
111
+ "LABEL_29": 29,
112
+ "LABEL_3": 3,
113
+ "LABEL_30": 30,
114
+ "LABEL_31": 31,
115
+ "LABEL_32": 32,
116
+ "LABEL_33": 33,
117
+ "LABEL_34": 34,
118
+ "LABEL_35": 35,
119
+ "LABEL_36": 36,
120
+ "LABEL_37": 37,
121
+ "LABEL_38": 38,
122
+ "LABEL_39": 39,
123
+ "LABEL_4": 4,
124
+ "LABEL_40": 40,
125
+ "LABEL_41": 41,
126
+ "LABEL_42": 42,
127
+ "LABEL_43": 43,
128
+ "LABEL_44": 44,
129
+ "LABEL_45": 45,
130
+ "LABEL_46": 46,
131
+ "LABEL_47": 47,
132
+ "LABEL_48": 48,
133
+ "LABEL_49": 49,
134
+ "LABEL_5": 5,
135
+ "LABEL_50": 50,
136
+ "LABEL_51": 51,
137
+ "LABEL_52": 52,
138
+ "LABEL_53": 53,
139
+ "LABEL_54": 54,
140
+ "LABEL_55": 55,
141
+ "LABEL_56": 56,
142
+ "LABEL_57": 57,
143
+ "LABEL_58": 58,
144
+ "LABEL_59": 59,
145
+ "LABEL_6": 6,
146
+ "LABEL_60": 60,
147
+ "LABEL_61": 61,
148
+ "LABEL_62": 62,
149
+ "LABEL_63": 63,
150
+ "LABEL_64": 64,
151
+ "LABEL_65": 65,
152
+ "LABEL_66": 66,
153
+ "LABEL_67": 67,
154
+ "LABEL_7": 7,
155
+ "LABEL_8": 8,
156
+ "LABEL_9": 9
157
+ },
158
+ "layer_norm_eps": 1e-12,
159
+ "logit_drop": 0,
160
+ "max_position_embeddings": 512,
161
+ "model_type": "bert",
162
+ "n_head": 4,
163
+ "num_attention_heads": 12,
164
+ "num_hidden_layers": 12,
165
+ "num_target_labels": 8,
166
+ "output_past": true,
167
+ "pad_token_id": 0,
168
+ "pooler_fc_size": 768,
169
+ "pooler_num_attention_heads": 12,
170
+ "pooler_num_fc_layers": 3,
171
+ "pooler_size_per_head": 128,
172
+ "pooler_type": "first_token_transform",
173
+ "position_embedding_type": "absolute",
174
+ "size_embed_dim": 25,
175
+ "span_threshold": 0.5,
176
+ "torch_dtype": "float32",
177
+ "transformers_version": "4.18.0",
178
+ "type_vocab_size": 2,
179
+ "use_cache": true,
180
+ "vocab_size": 21128
181
+ }
enhance_data_info/entity_train_1000.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
enhance_data_info/entity_train_1000_synthetic.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
enhance_data_info/labels_fusion.json ADDED
@@ -0,0 +1,342 @@
1
+ {
2
+ "O": {
3
+ "idx": 0,
4
+ "count": -1,
5
+ "is_target": true
6
+ },
7
+ "GPE.NAM": {
8
+ "idx": 1,
9
+ "count": -1,
10
+ "is_target": true
11
+ },
12
+ "GPE.NOM": {
13
+ "idx": 2,
14
+ "count": -1,
15
+ "is_target": true
16
+ },
17
+ "LOC.NAM": {
18
+ "idx": 3,
19
+ "count": -1,
20
+ "is_target": true
21
+ },
22
+ "LOC.NOM": {
23
+ "idx": 4,
24
+ "count": -1,
25
+ "is_target": true
26
+ },
27
+ "ORG.NAM": {
28
+ "idx": 5,
29
+ "count": -1,
30
+ "is_target": true
31
+ },
32
+ "ORG.NOM": {
33
+ "idx": 6,
34
+ "count": -1,
35
+ "is_target": true
36
+ },
37
+ "PER.NAM": {
38
+ "idx": 7,
39
+ "count": -1,
40
+ "is_target": true
41
+ },
42
+ "PER.NOM": {
43
+ "idx": 8,
44
+ "count": -1,
45
+ "is_target": true
46
+ },
47
+ "ADJECTIVE": {
48
+ "idx": 9,
49
+ "count": 425,
50
+ "is_target": false
51
+ },
52
+ "ADPOSITION": {
53
+ "idx": 10,
54
+ "count": 50,
55
+ "is_target": false
56
+ },
57
+ "ADVERB": {
58
+ "idx": 11,
59
+ "count": 523,
60
+ "is_target": false
61
+ },
62
+ "Action": {
63
+ "idx": 12,
64
+ "count": 25,
65
+ "is_target": false
66
+ },
67
+ "Activity": {
68
+ "idx": 13,
69
+ "count": 16,
70
+ "is_target": false
71
+ },
72
+ "CONJUNCTION": {
73
+ "idx": 14,
74
+ "count": 52,
75
+ "is_target": false
76
+ },
77
+ "Country": {
78
+ "idx": 15,
79
+ "count": 4,
80
+ "is_target": false
81
+ },
82
+ "DETERMINER": {
83
+ "idx": 16,
84
+ "count": 55,
85
+ "is_target": false
86
+ },
87
+ "EXCLAMATION": {
88
+ "idx": 17,
89
+ "count": 2,
90
+ "is_target": false
91
+ },
92
+ "Emotion": {
93
+ "idx": 18,
94
+ "count": 18,
95
+ "is_target": false
96
+ },
97
+ "Event": {
98
+ "idx": 19,
99
+ "count": 22,
100
+ "is_target": false
101
+ },
102
+ "INTERJECTION": {
103
+ "idx": 20,
104
+ "count": 2,
105
+ "is_target": false
106
+ },
107
+ "Location": {
108
+ "idx": 21,
109
+ "count": 144,
110
+ "is_target": false
111
+ },
112
+ "MODAL": {
113
+ "idx": 22,
114
+ "count": 2,
115
+ "is_target": false
116
+ },
117
+ "NOUN": {
118
+ "idx": 23,
119
+ "count": 1974,
120
+ "is_target": false
121
+ },
122
+ "OCCUPATION": {
123
+ "idx": 24,
124
+ "count": 40,
125
+ "is_target": false
126
+ },
127
+ "ONOMATOPOEIA": {
128
+ "idx": 25,
129
+ "count": 2,
130
+ "is_target": false
131
+ },
132
+ "OTHER": {
133
+ "idx": 26,
134
+ "count": 2439,
135
+ "is_target": false
136
+ },
137
+ "Object": {
138
+ "idx": 27,
139
+ "count": 40,
140
+ "is_target": false
141
+ },
142
+ "Organization": {
143
+ "idx": 28,
144
+ "count": 140,
145
+ "is_target": false
146
+ },
147
+ "PARTICLE": {
148
+ "idx": 29,
149
+ "count": 6,
150
+ "is_target": false
151
+ },
152
+ "PERSON": {
153
+ "idx": 30,
154
+ "count": 635,
155
+ "is_target": false
156
+ },
157
+ "PRONOUN": {
158
+ "idx": 31,
159
+ "count": 209,
160
+ "is_target": false
161
+ },
162
+ "PROPER_NOUN": {
163
+ "idx": 32,
164
+ "count": 476,
165
+ "is_target": false
166
+ },
167
+ "PUNCTUATION": {
168
+ "idx": 33,
169
+ "count": 47,
170
+ "is_target": false
171
+ },
172
+ "Product": {
173
+ "idx": 34,
174
+ "count": 2,
175
+ "is_target": false
176
+ },
177
+ "QUANTIFIER": {
178
+ "idx": 35,
179
+ "count": 234,
180
+ "is_target": false
181
+ },
182
+ "Sympton": {
183
+ "idx": 36,
184
+ "count": 1,
185
+ "is_target": false
186
+ },
187
+ "UNKNOWN": {
188
+ "idx": 37,
189
+ "count": 1,
190
+ "is_target": false
191
+ },
192
+ "URL": {
193
+ "idx": 38,
194
+ "count": 9,
195
+ "is_target": false
196
+ },
197
+ "VERB": {
198
+ "idx": 39,
199
+ "count": 993,
200
+ "is_target": false
201
+ },
202
+ "WORD": {
203
+ "idx": 40,
204
+ "count": 9999,
205
+ "is_target": false
206
+ },
207
+ "Weather": {
208
+ "idx": 41,
209
+ "count": 2,
210
+ "is_target": false
211
+ },
212
+ "animal": {
213
+ "idx": 42,
214
+ "count": 8,
215
+ "is_target": false
216
+ },
217
+ "award": {
218
+ "idx": 43,
219
+ "count": 1,
220
+ "is_target": false
221
+ },
222
+ "color": {
223
+ "idx": 44,
224
+ "count": 1,
225
+ "is_target": false
226
+ },
227
+ "description": {
228
+ "idx": 45,
229
+ "count": 1,
230
+ "is_target": false
231
+ },
232
+ "food": {
233
+ "idx": 46,
234
+ "count": 10,
235
+ "is_target": false
236
+ },
237
+ "time": {
238
+ "idx": 47,
239
+ "count": 43,
240
+ "is_target": false
241
+ },
242
+ "Media": {
243
+ "idx": 48,
244
+ "count": 6,
245
+ "is_target": false
246
+ },
247
+ "MEASURE": {
248
+ "idx": 49,
249
+ "count": 8,
250
+ "is_target": false
251
+ },
252
+ "APP": {
253
+ "idx": 50,
254
+ "count": 3,
255
+ "is_target": false
256
+ },
257
+ "AUXILIARY": {
258
+ "idx": 51,
259
+ "count": 4,
260
+ "is_target": false
261
+ },
262
+ "EMOJI": {
263
+ "idx": 52,
264
+ "count": 5,
265
+ "is_target": false
266
+ },
267
+ "Group": {
268
+ "idx": 53,
269
+ "count": 4,
270
+ "is_target": false
271
+ },
272
+ "INTERROGATIVE": {
273
+ "idx": 54,
274
+ "count": 3,
275
+ "is_target": false
276
+ },
277
+ "Media_Data": {
278
+ "idx": 55,
279
+ "count": 1,
280
+ "is_target": false
281
+ },
282
+ "NUMERAL": {
283
+ "idx": 56,
284
+ "count": 26,
285
+ "is_target": false
286
+ },
287
+ "SYMBOL": {
288
+ "idx": 57,
289
+ "count": 4,
290
+ "is_target": false
291
+ },
292
+ "Symbol": {
293
+ "idx": 58,
294
+ "count": 2,
295
+ "is_target": false
296
+ },
297
+ "article": {
298
+ "idx": 59,
299
+ "count": 2,
300
+ "is_target": false
301
+ },
302
+ "element": {
303
+ "idx": 60,
304
+ "count": 2,
305
+ "is_target": false
306
+ },
307
+ "price": {
308
+ "idx": 61,
309
+ "count": 5,
310
+ "is_target": false
311
+ },
312
+ "news": {
313
+ "idx": 62,
314
+ "count": 1000,
315
+ "is_target": false
316
+ },
317
+ "建筑": {
318
+ "idx": 63,
319
+ "count": 1000,
320
+ "is_target": false
321
+ },
322
+ "Feature": {
323
+ "idx": 64,
324
+ "count": 1000,
325
+ "is_target": false
326
+ },
327
+ "Art": {
328
+ "idx": 65,
329
+ "count": 1000,
330
+ "is_target": false
331
+ },
332
+ "Class": {
333
+ "idx": 66,
334
+ "count": 1000,
335
+ "is_target": false
336
+ },
337
+ "CLASSIFIER": {
338
+ "idx": 67,
339
+ "count": 1000,
340
+ "is_target": false
341
+ }
342
+ }
enhance_data_info/pos_train_1000.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
enhance_data_info/pos_train_1000_synthetic.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
enhance_data_info/train_1000_fusion.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
enhance_data_info/train_1000_synthetic.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
enhance_data_info/train_1000_synthetic_fusion.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6ba0fb801977f08de152eca7c94b51506af50055492cab8a9eb8ecd0a8661e29
3
+ size 418927535
special_tokens_map.json ADDED
@@ -0,0 +1 @@
1
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer_config.json ADDED
@@ -0,0 +1 @@
1
+ {"do_lower_case": true, "do_basic_tokenize": true, "never_split": null, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "special_tokens_map_file": "/home/lpc/models/chinese-bert-wwm-ext/special_tokens_map.json", "name_or_path": "/home/lpc/models/chinese-bert-wwm-ext/", "tokenizer_class": "BertTokenizer"}
vocab.txt ADDED
The diff for this file is too large to render. See raw diff