alever_sn committed · Commit bb727b8 · 1 parent: 85b3f16
README.md CHANGED
@@ -1,3 +1,349 @@
- ---
- license: mit
- ---
# Improving Low-Resource Sequence Labeling with Knowledge Fusion and Contextual Label Explanations

![workflow](docs/assets/workflow.png)

## 🌍 Overview

This repository provides the official implementation of our paper:

> **Improving Low-Resource Sequence Labeling with Knowledge Fusion and Contextual Label Explanations**
> [arXiv:2501.19093](https://arxiv.org/abs/2501.19093)

Low-resource sequence labeling often suffers from data sparsity and limited contextual generalization.
We propose **KnowFREE (Knowledge-Fused Representation Enhancement Framework)**, which integrates **external linguistic knowledge** and **contextual label explanations** into the model's representation space to improve low-resource performance.

**Key Highlights:**

We combine an **LLM-based knowledge enhancement workflow** with a **span-based KnowFREE model** to address these challenges.

**Pipeline 1: Label Extension Annotation**

* Objective: leverage LLMs to generate extension entity labels, word segmentation tags, and POS tags for the original samples.
* Effect:
  * Enhances the model's understanding of fine-grained contextual semantics.
  * Improves the ability to distinguish entity boundaries in character-dense languages.

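For illustration only (a hypothetical record, not taken from the released data), Pipeline 1 turns a plain sample into one that carries extension spans alongside the target entities; extension labels carry `is_target: false` in the label file, so they steer training but are filtered out of the final predictions:

```python
# Hypothetical Pipeline-1 output: the original target entities plus
# LLM-generated extension spans (here, POS tags).
sample = {
    "text": ["邓", "超", "在", "北", "京"],
    "entities": [
        {"start": 0, "end": 2, "entity": "PER.NAM", "text": ["邓", "超"]},  # target label
        {"start": 0, "end": 2, "entity": "NOUN", "text": ["邓", "超"]},     # extension POS span
        {"start": 3, "end": 5, "entity": "LOC.NAM", "text": ["北", "京"]},  # target label
        {"start": 2, "end": 3, "entity": "ADPOSITION", "text": ["在"]},     # extension POS span
    ],
}

# Keep only the target spans, as inference-time decoding does.
extension_types = {"NOUN", "ADPOSITION"}
target_only = [e for e in sample["entities"] if e["entity"] not in extension_types]
```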
**Pipeline 2: Enriched Explanation Synthesis**

* Objective: use LLMs to generate detailed, context-aware explanations for target entities, thereby synthesizing new, high-quality training samples.
* Effect:
  * Effectively mitigates the semantic distribution bias between synthetic samples and the target domain.
  * Significantly expands the number of samples and improves model performance in extremely low-resource settings.

---

## 🔗 Quick Links

- [Model Checkpoints](#♠️-model-checkpoints)
- [Data Augmentation Workflow](#📊-data-augmentation-workflow)
- [Train KnowFREE](#🔥-run-knowfree-models)
- [Citation](#📚-citation)

## ♠️ Model Checkpoints

Because of the large number of experiments, the architectural differences between the initial and reconstructed models, and the limited practical value of low-resource checkpoints sampled from the full dataset, we release only a few representative checkpoints (e.g., Weibo) on Hugging Face for reference:

| Model | F1 |
| :--- | :---: |
| [aleversn/KnowFREE-Weibo-BERT-base (Many shots 1000 with ChatGLM3)](https://huggingface.co/aleversn/GCSE-BERT-base) | 76.78 |
| [aleversn/KnowFREE-Youku-BERT-base (Many shots 1000 with ChatGLM3)](https://huggingface.co/aleversn/GCSE-BERT-large) | 84.50 |

---

## 🧩 KnowFREE Framework

![KnowFREE](docs/assets/knowfree.png)

**Architecture**: a Biaffine-based span model that supports **nested entity** annotation.

**Core Innovations:**

* Introduces a **Local Multi-head Attention Layer** to efficiently fuse the multi-type extension label features generated in Pipeline 1.
* **No external knowledge needed at inference:** the model learns to fuse knowledge during training; the logits of extension labels are masked at inference time.

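A minimal sketch of that inference-time masking (assuming, as in the released `fusion_ner.py`, that scores have shape `[batch, L, L, num_labels]` with extension labels occupying the trailing channels from `ext_labels_start_idx` onward):

```python
import numpy as np

def mask_extension_labels(scores: np.ndarray, ext_start: int = 8) -> np.ndarray:
    """Zero the extension-label channels so only target labels can be decoded."""
    masked = scores.copy()
    masked[..., ext_start:] = 0.0  # extension labels can never exceed the span threshold
    return masked

# Toy example: 1 sample, a 2x2 span grid, 10 label channels (8 target + 2 extension).
scores = np.full((1, 2, 2, 10), 0.9)
masked = mask_extension_labels(scores)
```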
---

## ⚙️ Installation Guide

### Core Dependencies

Create an environment and install the dependencies:

```bash
conda create -n knowfree python=3.8
conda activate knowfree
```

```bash
pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install transformers==4.18.0 fastNLP==1.0.1 PrettyTable
pip install torch-scatter==2.0.8 -f https://data.pyg.org/whl/torch-1.8.0+cu111.html
```

## 📊 Data Augmentation Workflow

See the detailed data synthesis pipeline in [Syn_Pipelines](docs/Syn_Pipelines.md).

In KnowFREE, we employ **contextual paraphrasing and label explanation synthesis** to augment low-resource datasets.
For each entity label, LLMs generate descriptive explanations that are integrated into the learning process to mitigate label semantic sparsity.

---

## 🔥 Run KnowFREE Models

### Training with `KnowFREE`

#### Dataset Format

Specify the dataset path with the `data_present_path` argument (default: `./datasets/present.json`). The file should be a JSON object with the following format:

```json
{
    "weibo": {
        "train": "./datasets/weibo/train.jsonl",
        "dev": "./datasets/weibo/dev.jsonl",
        "test": "./datasets/weibo/test.jsonl",
        "labels": "./datasets/weibo/labels.txt"
    }
}
```

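The registry above can be read with a small helper (a sketch; `load_present` and `iter_jsonl` are illustrative names, not part of the released code):

```python
import json

def load_present(path="./datasets/present.json"):
    """Parse the dataset registry: {name: {train/dev/test/labels paths}}."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def iter_jsonl(path):
    """Yield one sample dict per non-empty line of a .jsonl file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)
```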
**Train Samples of Different Languages:**

- Chinese

```jsonl
{"text": ["科", "技", "全", "方", "位", "资", "讯", "智", "能", ",", "快", "捷", "的", "汽", "车", "生", "活", "需", "要", "有", "三", "屏", "一", "云", "爱", "你"], "label": ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"], "entities": []}
{"text": ["对", ",", "输", "给", "一", "个", "女", "人", ",", "的", "成", "绩", "。", "失", "望"], "label": ["O", "O", "O", "O", "O", "O", "B-PER.NOM", "E-PER.NOM", "O", "O", "O", "O", "O", "O", "O"], "entities": [{"start": 6, "entity": "PER.NOM", "end": 8, "text": ["女", "人"]}]}
{"text": ["今", "天", "下", "午", "起", "来", "看", "到", "外", "面", "的", "太", "阳", "。", "。", "。", "。", "我", "第", "一", "反", "应", "竟", "然", "是", "强", "烈", "的", "想", "回", "家", "泪", "想", "我", "们", "一", "起", "在", "嘉", "鱼", "个", "时", "候", "了", "。", "。", "。", "。", "有", "好", "多", "好", "多", "的", "话", "想", "对", "你", "说", "李", "巾", "凡", "想", "要", "瘦", "瘦", "瘦", "成", "李", "帆", "我", "是", "想", "切", "开", "云", "朵", "的", "心"], "label": ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "B-LOC.NAM", "E-LOC.NAM", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "B-PER.NAM", "I-PER.NAM", "E-PER.NAM", "O", "O", "O", "O", "O", "O", "B-PER.NAM", "E-PER.NAM", "O", "O", "O", "O", "O", "O", "O", "O", "O"], "entities": [{"start": 38, "entity": "LOC.NAM", "end": 40, "text": ["嘉", "鱼"]}, {"start": 59, "entity": "PER.NAM", "end": 62, "text": ["李", "巾", "凡"]}, {"start": 68, "entity": "PER.NAM", "end": 70, "text": ["李", "帆"]}]}
```

- English

```jsonl
{"text": ["im", "thinking", "of", "a", "comedy", "where", "a", "group", "of", "husbands", "receive", "one", "chance", "from", "their", "wives", "to", "engage", "with", "other", "women"], "entities": [{"start": 4, "end": 5, "entity": "GENRE", "text": ["comedy"]}, {"start": 6, "end": 21, "entity": "PLOT", "text": ["a", "group", "of", "husbands", "receive", "one", "chance", "from", "their", "wives", "to", "engage", "with", "other", "women"]}]}
{"text": ["another", "sequel", "of", "an", "action", "movie", "about", "drag", "street", "car", "races", "alcohol", "and", "gun", "violence"], "entities": [{"start": 1, "end": 2, "entity": "RELATIONSHIP", "text": ["sequel"]}, {"start": 4, "end": 5, "entity": "GENRE", "text": ["action"]}, {"start": 7, "end": 15, "entity": "PLOT", "text": ["drag", "street", "car", "races", "alcohol", "and", "gun", "violence"]}]}
{"text": ["what", "is", "the", "name", "of", "the", "movie", "in", "which", "a", "group", "of", "criminals", "begin", "to", "suspect", "that", "one", "of", "them", "is", "a", "police", "informant", "after", "a", "simple", "jewelery", "heist", "goes", "terribly", "wrong"], "entities": [{"start": 9, "end": 32, "entity": "PLOT", "text": ["a", "group", "of", "criminals", "begin", "to", "suspect", "that", "one", "of", "them", "is", "a", "police", "informant", "after", "a", "simple", "jewelery", "heist", "goes", "terribly", "wrong"]}]}
{"text": ["a", "movie", "with", "vin", "diesel", "in", "world", "war", "2", "in", "a", "foreign", "country", "shooting", "people"], "entities": [{"start": 3, "end": 5, "entity": "ACTOR", "text": ["vin", "diesel"]}, {"start": 6, "end": 9, "entity": "GENRE", "text": ["world", "war", "2"]}, {"start": 11, "end": 15, "entity": "PLOT", "text": ["foreign", "country", "shooting", "people"]}]}
{"text": ["what", "is", "the", "1991", "disney", "animated", "movie", "that", "featured", "angela", "lansbury", "as", "the", "voice", "of", "a", "teapot"], "entities": [{"start": 3, "end": 4, "entity": "YEAR", "text": ["1991"]}, {"start": 5, "end": 6, "entity": "GENRE", "text": ["animated"]}, {"start": 9, "end": 11, "entity": "ACTOR", "text": ["angela", "lansbury"]}, {"start": 16, "end": 17, "entity": "CHARACTER_NAME", "text": ["teapot"]}]}
```

- Japanese

```jsonl
{"text": ["I", "n", "f", "o", "r", "m", "i", "x", "の", "動", "き", "を", "み", "て", "、", "オ", "ラ", "ク", "ル", "と", "I", "B", "M", "も", "追", "随", "し", "た", "。"], "entities": [{"start": 0, "end": 8, "entity": "法人名", "text": ["I", "n", "f", "o", "r", "m", "i", "x"]}, {"start": 15, "end": 19, "entity": "法人名", "text": ["オ", "ラ", "ク", "ル"]}, {"start": 20, "end": 23, "entity": "法人名", "text": ["I", "B", "M"]}]}
{"text": ["現", "在", "は", "ア", "ニ", "メ", "ー", "シ", "ョ", "ン", "業", "界", "か", "ら", "退", "い", "て", "お", "り", "、", "水", "彩", "画", "家", "と", "し", "て", "も", "活", "動", "し", "て", "い", "る", "。"], "entities": []}
{"text": ["大", "野", "東", "イ", "ン", "タ", "ー", "チ", "ェ", "ン", "ジ", "は", "、", "大", "分", "県", "豊", "後", "大", "野", "市", "大", "野", "町", "後", "田", "に", "あ", "る", "中", "九", "州", "横", "断", "道", "路", "の", "イ", "ン", "タ", "ー", "チ", "ェ", "ン", "ジ", "で", "あ", "る", "。"], "entities": [{"start": 0, "end": 11, "entity": "施設名", "text": ["大", "野", "東", "イ", "ン", "タ", "ー", "チ", "ェ", "ン", "ジ"]}, {"start": 13, "end": 26, "entity": "地名", "text": ["大", "分", "県", "豊", "後", "大", "野", "市", "大", "野", "町", "後", "田"]}, {"start": 29, "end": 36, "entity": "施設名", "text": ["中", "九", "州", "横", "断", "道", "路"]}]}
{"text": ["2", "0", "1", "4", "年", "1", "月", "1", "5", "日", "、", "マ", "バ", "タ", "は", "ミ", "ャ", "ン", "マ", "ー", "の", "上", "座", "部", "仏", "教", "を", "擁", "護", "す", "る", "使", "命", "を", "持", "っ", "て", "、", "マ", "ン", "ダ", "レ", "ー", "の", "仏", "教", "僧", "の", "大", "規", "模", "な", "会", "議", "で", "正", "式", "に", "設", "立", "さ", "れ", "た", "。"], "entities": [{"start": 11, "end": 14, "entity": "法人名", "text": ["マ", "バ", "タ"]}, {"start": 15, "end": 20, "entity": "地名", "text": ["ミ", "ャ", "ン", "マ", "ー"]}, {"start": 38, "end": 43, "entity": "地名", "text": ["マ", "ン", "ダ", "レ", "ー"]}]}
{"text": ["永", "泰", "荘", "駅", "は", "、", "中", "華", "人", "民", "共", "和", "国", "北", "京", "市", "海", "淀", "区", "に", "位", "置", "す", "る", "北", "京", "地", "下", "鉄", "8", "号", "線", "の", "駅", "で", "あ", "る", "。"], "entities": [{"start": 0, "end": 4, "entity": "施設名", "text": ["永", "泰", "荘", "駅"]}, {"start": 6, "end": 19, "entity": "地名", "text": ["中", "華", "人", "民", "共", "和", "国", "北", "京", "市", "海", "淀", "区"]}]}
```

- Korean

```jsonl
{"text": ["그", "모습", "을", "보", "ㄴ", "민이", "는", "할아버지", "가", "마치", "전쟁터", "에서", "이기", "고", "돌아오", "ㄴ", "장군", "처럼", "의젓", "하", "아", "보이", "ㄴ다고", "생각", "하", "았", "습니다", "."], "entities": [{"start": 5, "end": 6, "entity": "PS", "text": ["민이"]}]}
{"text": ["내달", "18", "일", "부터", "내년", "2", "월", "20", "일", "까지", "는", "서울역", "에서", "무주리조트", "부근", "까지", "스키관광", "열차", "를", "운행", "하", "ㄴ다", "."], "entities": [{"start": 0, "end": 10, "entity": "DT", "text": ["내달", "18", "일", "부터", "내년", "2", "월", "20", "일", "까지"]}, {"start": 11, "end": 12, "entity": "LC", "text": ["서울역"]}, {"start": 13, "end": 14, "entity": "OG", "text": ["무주리조트"]}]}
{"text": ["호소력", "있", "고", "선동", "적", "이", "ㄴ", "주제", "를", "잡아내", "는", "데", "능하", "ㄴ", "즈윅", "이", "지만", "이", "영화", "에서", "는", "무엇", "이", "호소력", "이", "있", "을지", "결정", "하", "지", "못하", "고", "망설이", "ㄴ다", "."], "entities": [{"start": 14, "end": 15, "entity": "PS", "text": ["즈윅"]}]}
{"text": ["그래서", "세호", "는", "밤", "이", "면", "친구", "네", "집", "을", "돌아다니", "며", "아버지", "몰래", "연습", "을", "하", "았", "습니다", "."], "entities": [{"start": 1, "end": 2, "entity": "PS", "text": ["세호"]}, {"start": 3, "end": 4, "entity": "TI", "text": ["밤"]}]}
{"text": ["황씨", "는", "자신", "이", "어리", "어서", "듣", "은", "이", "이야기", "가", "어린이", "들", "에게", "소박", "하", "ㄴ", "효자", "의", "마음", "을", "전하", "아", "주", "ㄹ", "수", "있", "을", "것", "같", "아", "5", "분", "짜리", "구연동화", "로", "각색", "하", "았", "다고", "말", "하", "ㄴ다", "."], "entities": [{"start": 0, "end": 1, "entity": "PS", "text": ["황씨"]}, {"start": 31, "end": 33, "entity": "TI", "text": ["5", "분"]}]}
{"text": ["아버지", "가", "돌아가", "시", "ㄴ", "뒤", "어머니", "의", "편애", "를", "배경", "으로", "승주", "는", "집안", "에서", "만", "은", "대단", "하", "ㄴ", "권세", "를", "누리", "었", "다", "."], "entities": [{"start": 12, "end": 13, "entity": "PS", "text": ["승주"]}]}
```

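For the Chinese samples, the `entities` field is consistent with the BIOES-style `label` sequence (spans use end-exclusive indices). A sketch of that conversion (`bioes_to_entities` is an illustrative helper, not part of the released code):

```python
def bioes_to_entities(labels, text):
    """Convert B-/I-/E-/S-/O tags into {start, end, entity, text} spans (end-exclusive)."""
    entities, start = [], None
    for i, tag in enumerate(labels):
        if tag == "O":
            start = None
            continue
        prefix, etype = tag.split("-", 1)
        if prefix in ("B", "S"):       # span begins here
            start = i
        if prefix in ("S", "E") and start is not None:  # span ends here
            entities.append({"start": start, "end": i + 1, "entity": etype,
                             "text": text[start:i + 1]})
            start = None
    return entities
```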
**Labels**

- `.txt`

```
O
GPE.NAM
GPE.NOM
LOC.NAM
LOC.NOM
ORG.NAM
ORG.NOM
PER.NAM
PER.NOM
```

- `.json` / `.jsonl`

```json
{
    "O": {
        "idx": 0,
        "count": -1,
        "is_target": true
    },
    "GPE.NAM": {
        "idx": 1,
        "count": -1,
        "is_target": true
    },
    "GPE.NOM": {
        "idx": 2,
        "count": -1,
        "is_target": true
    },
    "LOC.NAM": {
        "idx": 3,
        "count": -1,
        "is_target": true
    },
    "LOC.NOM": {
        "idx": 4,
        "count": -1,
        "is_target": true
    },
    "ORG.NAM": {
        "idx": 5,
        "count": -1,
        "is_target": true
    },
    "ORG.NOM": {
        "idx": 6,
        "count": -1,
        "is_target": true
    },
    "PER.NAM": {
        "idx": 7,
        "count": -1,
        "is_target": true
    },
    "PER.NOM": {
        "idx": 8,
        "count": -1,
        "is_target": true
    },
    "ADJECTIVE": {
        "idx": 9,
        "count": 1008,
        "is_target": false
    },
    "ADPOSITION": {
        "idx": 10,
        "count": 41,
        "is_target": false
    },
    "ADVERB": {
        "idx": 11,
        "count": 1147,
        "is_target": false
    },
    "APP": {
        "idx": 12,
        "count": 3,
        "is_target": false
    },
    "AUXILIARY": {
        "idx": 13,
        "count": 4,
        "is_target": false
    }, ...
}
```

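The plain `.txt` list can be turned into the richer `.json` map with a one-liner (illustrative sketch; the `count: -1` and `is_target: true` defaults mirror the target labels in the example above):

```python
def labels_txt_to_map(lines):
    """Build {label: {idx, count, is_target}} from labels.txt lines, in file order."""
    return {
        label.strip(): {"idx": i, "count": -1, "is_target": True}
        for i, label in enumerate(lines) if label.strip()
    }

label_map = labels_txt_to_map(["O", "GPE.NAM", "GPE.NOM", "LOC.NAM"])
```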
* **Model**: BERT / RoBERTa

```python
from main.trainers.knowfree_trainer import Trainer
from transformers import BertTokenizer, BertConfig

MODEL_PATH = "<MODEL_PATH>"
tokenizer = BertTokenizer.from_pretrained(MODEL_PATH)
config = BertConfig.from_pretrained(MODEL_PATH)
trainer = Trainer(tokenizer=tokenizer, config=config, from_pretrained=MODEL_PATH,
                  data_name='<DATASET_NAME>',
                  batch_size=4,
                  batch_size_eval=8,
                  task_name='<TASK_NAME>')

for i in trainer(num_epochs=120, other_lr=1e-3, weight_decay=0.01, remove_clashed=True, nested=False, eval_call_step=lambda x: x % 125 == 0):
    a = i
```

**Key Params**

- `other_lr`: learning rate for the non-PLM parameters.
- `remove_clashed`: remove predicted spans that overlap with an already-accepted span, so each clash is resolved to a single span.
- `nested`: whether nested entities are supported; for nested datasets such as `CMeEE`, set it to `True` and disable `remove_clashed`.
- `eval_call_step`: a callable that receives the current step `x` and returns whether to run evaluation at that step (e.g., `lambda x: x % 125 == 0`).

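The `remove_clashed` filtering can be sketched in plain Python, mirroring the greedy, score-ordered loop in `fusion_ner.py` (span tuples are `(start, end, label, score)` with inclusive boundaries):

```python
def intersects(a, b):
    """Whether two closed intervals overlap."""
    return a[0] <= b[1] and b[0] <= a[1]

def nested_in(a, b):
    """Whether one closed interval is contained in the other."""
    return (b[0] <= a[0] and a[1] <= b[1]) or (a[0] <= b[0] and b[1] <= a[1])

def remove_clashed(spans, nested=False):
    """Accept spans greedily by descending score, dropping clashing ones."""
    kept = []
    for span in sorted(spans, key=lambda s: s[3], reverse=True):
        clash = any(intersects(span, k) and (not nested or not nested_in(span, k))
                    for k in kept)
        if not clash:
            kept.append(span)
    return kept
```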
#### Evaluation Only

Comment out the training loop to evaluate directly:

```python
trainer.eval(0, is_eval=True)
```

### Training with `CNN Nested NER`

```python
from main.trainers.cnnner_trainer import Trainer
from transformers import BertTokenizer, BertConfig

MODEL_PATH = "<MODEL_PATH>"
tokenizer = BertTokenizer.from_pretrained(MODEL_PATH)
config = BertConfig.from_pretrained(MODEL_PATH)
trainer = Trainer(tokenizer=tokenizer, config=config, from_pretrained=MODEL_PATH,
                  data_name='<DATASET_NAME>',
                  batch_size=4,
                  batch_size_eval=8,
                  task_name='<TASK_NAME>')

for i in trainer(num_epochs=120, other_lr=1e-3, weight_decay=0.01, remove_clashed=True, nested=False, eval_call_step=lambda x: x % 125 == 0):
    a = i
```

#### Prediction

```python
from main.predictor.knowfree_predictor import KnowFREEPredictor
from transformers import BertTokenizer, BertConfig

MODEL_PATH = "<MODEL_PATH>"
LABEL_FILE = '<LABEL_PATH>'
tokenizer = BertTokenizer.from_pretrained(MODEL_PATH)
config = BertConfig.from_pretrained(MODEL_PATH)
pred = KnowFREEPredictor(tokenizer=tokenizer, config=config, from_pretrained=MODEL_PATH, label_file=LABEL_FILE, batch_size=4)

for entities in pred(['叶赟葆:全球时尚财运滚滚而来钱', '我要去我要去花心花心花心耶分手大师贵仔邓超四大名捕围观话筒转发邓超贴吧微博号外话筒望周知。邓超四大名捕']):
    print(entities)
```

**Result**

```python
[
    [{'start': 0, 'end': 3, 'entity': 'PER.NAM', 'text': ['叶', '赟', '葆']}],
    [{'start': 45, 'end': 47, 'entity': 'PER.NAM', 'text': ['邓', '超']},
     {'start': 19, 'end': 21, 'entity': 'PER.NAM', 'text': ['邓', '超']},
     {'start': 31, 'end': 33, 'entity': 'PER.NAM', 'text': ['邓', '超']}]
]
```

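The `start`/`end` offsets are end-exclusive character indices into the input string, so recovering the surface forms is a slice (a sketch; `surface_forms` is an illustrative helper):

```python
def surface_forms(text, entities):
    """Map predictor spans back to substrings of the input sentence."""
    return [(e['entity'], text[e['start']:e['end']]) for e in entities]

forms = surface_forms('叶赟葆:全球时尚财运滚滚而来钱',
                      [{'start': 0, 'end': 3, 'entity': 'PER.NAM', 'text': ['叶', '赟', '葆']}])
```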
## 📚 Citation

```bibtex
@misc{lai2025improvinglowresourcesequencelabeling,
  title={Improving Low-Resource Sequence Labeling with Knowledge Fusion and Contextual Label Explanations},
  author={Peichao Lai and Jiaxin Gan and Feiyang Ye and Yilei Wang and Bin Cui},
  year={2025},
  eprint={2501.19093},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2501.19093},
}
```
config.json ADDED
@@ -0,0 +1,100 @@
{
  "_name_or_path": "/home/lpc/models/chinese-bert-wwm-ext/",
  "architectures": [
    "CNNNerv1"
  ],
  "attention_probs_dropout_prob": 0.1,
  "biaffine_size": 200,
  "classifier_dropout": null,
  "cnn_depth": 3,
  "cnn_dim": 200,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6",
    "7": "LABEL_7",
    "8": "LABEL_8",
    "9": "LABEL_9",
    "10": "LABEL_10",
    "11": "LABEL_11",
    "12": "LABEL_12",
    "13": "LABEL_13",
    "14": "LABEL_14",
    "15": "LABEL_15",
    "16": "LABEL_16",
    "17": "LABEL_17",
    "18": "LABEL_18",
    "19": "LABEL_19",
    "20": "LABEL_20",
    "21": "LABEL_21",
    "22": "LABEL_22",
    "23": "LABEL_23",
    "24": "LABEL_24",
    "25": "LABEL_25",
    "26": "LABEL_26",
    "27": "LABEL_27"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "kernel_size": 3,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_10": 10,
    "LABEL_11": 11,
    "LABEL_12": 12,
    "LABEL_13": 13,
    "LABEL_14": 14,
    "LABEL_15": 15,
    "LABEL_16": 16,
    "LABEL_17": 17,
    "LABEL_18": 18,
    "LABEL_19": 19,
    "LABEL_2": 2,
    "LABEL_20": 20,
    "LABEL_21": 21,
    "LABEL_22": 22,
    "LABEL_23": 23,
    "LABEL_24": 24,
    "LABEL_25": 25,
    "LABEL_26": 26,
    "LABEL_27": 27,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5,
    "LABEL_6": 6,
    "LABEL_7": 7,
    "LABEL_8": 8,
    "LABEL_9": 9
  },
  "layer_norm_eps": 1e-12,
  "logit_drop": 0,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "n_head": 4,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "position_embedding_type": "absolute",
  "size_embed_dim": 25,
  "span_threshold": 0.5,
  "torch_dtype": "float32",
  "transformers_version": "4.18.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 21128
}
enhance_data_info/entity_train_1000.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
enhance_data_info/entity_train_1000_synthetic.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
enhance_data_info/pos_train_1000.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
enhance_data_info/pos_train_1000_synthetic.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
enhance_data_info/train_1000_fusion.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
enhance_data_info/train_1000_synthetic.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
fusion_ner.py ADDED
@@ -0,0 +1,394 @@
import torch
from torch import nn
import torch.nn.functional as F
from torch_scatter import scatter_max
import numpy as np
from transformers import BertModel, BertPreTrainedModel, BertConfig
from fastNLP import seq_len_to_mask
from typing import Tuple


class CNNNerv1(BertPreTrainedModel):
    def __init__(self, config: BertConfig):
        # model_name, num_ner_tag, cnn_dim=200, biaffine_size=200,
        # size_embed_dim=0, logit_drop=0, kernel_size=3, n_head=4, cnn_depth=3
        super().__init__(config)
        self.hidden_size = config.hidden_size
        self.size_embed_dim = config.size_embed_dim
        self.cnn_dim = config.cnn_dim
        self.biaffine_size = config.biaffine_size
        self.logit_drop = config.logit_drop
        self.kernel_size = config.kernel_size
        self.n_head = config.n_head
        self.cnn_depth = config.cnn_depth
        self.num_labels = config.num_labels
        self.span_threshold = config.span_threshold
        self.ext_labels_start_idx = 8

        self.bert = BertModel(config, add_pooling_layer=False)

        if self.size_embed_dim != 0:
            n_pos = 30  # span-size position ids cover distances between -n_pos and n_pos
            self.size_embedding = torch.nn.Embedding(
                n_pos, self.size_embed_dim)
            # `512 - 512`: subtracting the two ranges yields a matrix whose
            # entries are the signed distance between two positions, e.g.
            # [[0, 1, 2, ..., 511],
            #  [-1, 0, 1, ..., 510],
            #  [...],
            #  [-511, -510, ..., 0]]
            _span_size_ids = torch.arange(
                512) - torch.arange(512).unsqueeze(-1)
            # clamp the span distance to n_pos / 2 on both sides
            _span_size_ids.masked_fill_(_span_size_ids < -n_pos/2, -n_pos/2)
            _span_size_ids = _span_size_ids.masked_fill(
                _span_size_ids >= n_pos/2, n_pos/2-1) + n_pos/2
            # register as a non-trainable buffer
            self.register_buffer('span_size_ids', _span_size_ids.long())
            hsz = self.biaffine_size*2 + self.size_embed_dim + 2
        else:
            hsz = self.biaffine_size*2+2
        biaffine_input_size = self.hidden_size

        self.head_mlp = nn.Sequential(
            nn.Dropout(0.4),
            nn.Linear(biaffine_input_size, self.biaffine_size),
            nn.LeakyReLU(),
        )
        self.tail_mlp = nn.Sequential(
            nn.Dropout(0.4),
            nn.Linear(biaffine_input_size, self.biaffine_size),
            nn.LeakyReLU(),
        )

        self.dropout = nn.Dropout(0.4)
        if self.n_head > 0:
            self.multi_head_biaffine = MultiHeadBiaffine(
                self.biaffine_size, self.cnn_dim, n_head=self.n_head)
        else:
            self.U = nn.Parameter(torch.randn(
                self.cnn_dim, self.biaffine_size, self.biaffine_size))
            torch.nn.init.xavier_normal_(self.U.data)
        self.W = torch.nn.Parameter(torch.empty(self.cnn_dim, hsz))
        torch.nn.init.xavier_normal_(self.W.data)
        if self.cnn_depth > 0:
            self.cnn = MaskCNN(self.cnn_dim, self.cnn_dim,
                               kernel_size=self.kernel_size, depth=self.cnn_depth)
        self.attn = LocalAttentionModel(self.cnn_dim, self.kernel_size)

        self.down_fc = nn.Linear(self.cnn_dim, self.num_labels-1)

    def decode_labels(self, labels: torch.Tensor, indexes: torch.Tensor):
        # labels contain no special tokens here, so no offset needs to be subtracted
        length: np.ndarray = indexes.detach().cpu().numpy()
        length = length.max(-1)
        labels[:, :, :, self.ext_labels_start_idx:] = 0
        labels: np.ndarray = labels.detach().cpu().numpy()
        span_mask = (labels.max(-1) > self.span_threshold)
        labels = labels.argmax(-1)
        indexes = np.where(span_mask)
        entities = [set() for _ in range(labels.shape[0])]
        for batch, x, y in zip(*indexes):
            if x <= y and x >= 0 and y >= 0 and x < length[batch] and y < length[batch]:
                entities[batch].add(
                    (x, y, labels[batch, x, y] + 1))  # +1 because of the O label
        return entities

    def is_span_intersect(self, a: Tuple[int, int], b: Tuple[int, int]):
        """
        Whether two intervals intersect; both ends are inclusive.
        """
        return a[0] <= b[1] and b[0] <= a[1]

    def is_span_nested(self, a: Tuple[int, int], b: Tuple[int, int]):
        """
        Whether one interval is nested in the other; both ends are inclusive.
        """
        return (b[0] <= a[0] and a[1] <= b[1]) or (a[0] <= b[0] and b[1] <= a[1])

    def decode_logits(self, scores: torch.Tensor, indexes: torch.Tensor, remove_clashed: bool = False, nested: bool = True):
        scores = scores.sigmoid()
        # scores also exclude special tokens
        # the decoding in the paper's reference code averages the upper and lower triangles:
        # scores = (scores.transpose(1, 2) + scores)/2
        scores: np.ndarray = scores.detach().cpu().numpy()

        length: np.ndarray = indexes.detach().cpu().numpy()
        length = length.max(-1)

        scores[:, :, :, self.ext_labels_start_idx:] = 0
        span_mask = (scores.max(-1) > self.span_threshold)
        argmax = scores.argmax(-1)
        indexes = np.where(span_mask)
        entities = [[] for _ in range(scores.shape[0])]
        # as with the gold labels, there are no special labels here;
        # append the predicted entities to `entities`
        for batch_idx, x, y in zip(*indexes):
            if x >= 0 and x < length[batch_idx] and y >= 0 and y < length[batch_idx] and x <= y:
                # (start, end, label_idx, score)
                entities[batch_idx].append(
                    (x, y, argmax[batch_idx, x, y] + 1, scores[batch_idx, x, y, argmax[batch_idx, x, y]]))
        # for each batch, sort by label score in descending order
        for batch_idx in range(len(entities)):
            entities[batch_idx].sort(key=lambda x: x[-1], reverse=True)
        if remove_clashed:
            for batch_idx in range(len(entities)):
                new_entities = []
                for entity in entities[batch_idx]:
                    add = True
                    for pre_entity in new_entities:
                        if self.is_span_intersect(entity, pre_entity) and (not nested or not self.is_span_nested(entity, pre_entity)):
                            add = False
                            break
                    if add:
                        new_entities.append(entity)
                entities[batch_idx] = new_entities
        # convert to sets
        for batch_idx in range(len(entities)):
            entities[batch_idx] = set(
                map(lambda x: (x[0], x[1], x[2]), entities[batch_idx]))
        return entities

152
+ def forward(self, input_ids: torch.Tensor, bpe_len: torch.Tensor, indexes: torch.Tensor, labels: torch.Tensor = None, is_synthetic: torch.Tensor = None, **kwargs):
153
+ # input_ids 就是常规的input_ids, [batch_size, seq_length, hidden_dim]
154
+ # bpe_len 是flat tokens和[CLS]和[SEP]的长度, 不包括[PAD] [batch_size]
155
+ # indexes 是每个字的坐标[0,1,...], [batch_size, seq_length, hidden_dim]
156
+ # matrix [batch_size, seq_length, seq_length, num_labels] 的0,1矩阵
157
+ attention_mask = seq_len_to_mask(bpe_len) # bsz x length x length
158
+ outputs = self.bert(
159
+ input_ids, attention_mask=attention_mask, return_dict=True)
160
+ last_hidden_states = outputs['last_hidden_state']
161
+ # 这里的效果其实跟W2NER是一样的,就是pieces2word
162
+ # 所有index为0的标签会被选取包含最大的hidden_dim的token, 放置在第0位, 即[CLS], [SEP]和[PAD]的标签
163
+ # 所有index相同的标签会被选取包含最大的hidden_dim的token, 放置在第index位
164
+ # 其余位置补0
165
+ # WARN: 这里会去除前后两个token,因此labels要提前去除前后两个token
166
+ state = scatter_max(last_hidden_states, index=indexes, dim=1)[
167
+ 0][:, 1:] # bsz x word_len x hidden_size
168
+ # 真实的文本-标签对长度
169
+ lengths, _ = indexes.max(dim=-1)
170
+
171
+ # 1. state先传进head和tail的MLP压一下维度得到头尾特征
172
+         head_state = self.head_mlp(state)
+         tail_state = self.tail_mlp(state)
+
+         # 2. Single-head or multi-head biaffine scoring
+         if hasattr(self, 'U'):
+             scores1 = torch.einsum(
+                 'bxi, oij, byj -> boxy', head_state, self.U, tail_state)  # [batch_size, out_dim, word_len, word_len]
+         else:
+             # [batch_size, out_dim, word_len, word_len]
+             scores1 = self.multi_head_biaffine(head_state, tail_state)
+
+         # 3. Append a bias feature to the head and tail states, concatenate them
+         # pairwise over all (head, tail) positions, and add the relative-distance
+         # (span-size) positional embedding.
+         head_state = torch.cat(
+             [head_state, torch.ones_like(head_state[..., :1])], dim=-1)
+         tail_state = torch.cat(
+             [tail_state, torch.ones_like(tail_state[..., :1])], dim=-1)
+         affined_cat = torch.cat([self.dropout(head_state).unsqueeze(2).expand(-1, -1, tail_state.size(1), -1),
+                                  self.dropout(tail_state).unsqueeze(1).expand(-1, head_state.size(1), -1, -1)], dim=-1)
+
+         if hasattr(self, 'size_embedding'):
+             size_embedded = self.size_embedding(
+                 self.span_size_ids[:state.size(1), :state.size(1)])
+             affined_cat = torch.cat([affined_cat,
+                                      self.dropout(size_embedded).unsqueeze(0).expand(state.size(0), -1, -1, -1)], dim=-1)
+
+         scores2 = torch.einsum('bmnh,kh->bkmn', affined_cat,
+                                self.W)  # bsz x dim x L x L
+         scores = scores2 + scores1  # bsz x dim x L x L
+
+         if hasattr(self, 'cnn'):
+             mask = seq_len_to_mask(lengths)  # bsz x length
+             mask = mask[:, None] * mask.unsqueeze(-1)  # bsz x length x length
+             pad_mask = mask[:, None].eq(0)
+             u_scores = scores.masked_fill(pad_mask, 0)
+             if self.logit_drop != 0:
+                 u_scores = F.dropout(
+                     u_scores, p=self.logit_drop, training=self.training)
+             # bsz, num_label, max_len, max_len = u_scores.size()
+             # u_scores = self.cnn(u_scores, pad_mask)
+             u_scores = self.attn(u_scores.permute(0, 2, 3, 1), pad_mask=pad_mask.permute(0, 2, 3, 1))
+             scores = u_scores.permute(0, 3, 1, 2) + scores
+
+         # Move the label dimension to the last axis for the down-projection layer
+         scores = self.down_fc(scores.permute(0, 2, 3, 1))
+
+         loss = None
+         if labels is not None:
+             assert scores.size(-1) == labels.size(-1)
+             flat_scores = scores.reshape(-1)
+             flat_matrix = labels.reshape(-1)
+             # Down-weight the extension (auxiliary) labels relative to target labels
+             decay_weights = torch.ones(labels.size()).to(flat_matrix.device)
+             decay_weights[:, :, :, self.ext_labels_start_idx:] *= 0.13
+             decayed_weights = decay_weights.reshape(input_ids.size(0), -1)
+             # Down-weight synthetic (LLM-generated) samples
+             synthetic_mask = torch.ones(labels.size()).to(flat_matrix.device)
+             synthetic_mask[:, is_synthetic] *= 0.15
+             synthetic_weights = synthetic_mask.reshape(input_ids.size(0), -1)
+             mask = flat_matrix.ne(-100).float().view(input_ids.size(0), -1)
+             flat_loss = F.binary_cross_entropy_with_logits(
+                 flat_scores, flat_matrix.float(), reduction='none')
+             loss = ((flat_loss.view(input_ids.size(0), -1) * synthetic_weights * decayed_weights * mask).sum(dim=-1)).mean()
+
+         return loss, scores
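The loss above scales element-wise BCE by per-cell decay weights (0.13 for extension labels, 0.15 for synthetic samples) and a valid-position mask, sums per sample, then averages over the batch. A minimal NumPy sketch of that reduction (shapes and weight values are illustrative, not the model's real ones):

```python
import math
import numpy as np

def weighted_bce(logits, targets, weights, mask):
    # Element-wise binary cross-entropy on raw logits
    p = 1.0 / (1.0 + np.exp(-logits))
    elem = -(targets * np.log(p) + (1.0 - targets) * np.log(1.0 - p))
    # Scale by per-cell weights and the valid-position mask,
    # sum per sample, then average over the batch
    return (elem * weights * mask).sum(axis=-1).mean()

logits = np.zeros((1, 2))           # sigmoid(0) = 0.5, so each cell costs ln 2
targets = np.array([[1.0, 0.0]])
weights = np.array([[1.0, 0.13]])   # second cell plays the "extension label" role
mask = np.ones((1, 2))
loss = weighted_bce(logits, targets, weights, mask)  # = (1 + 0.13) * ln 2
```

Because the sum runs before the batch mean, samples with more valid cells contribute proportionally larger losses, which is what the code above does as well.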
+
+ class LocalSpanAttention(nn.Module):
+     def __init__(self, dim):
+         super(LocalSpanAttention, self).__init__()
+         self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=10)
+
+     def forward(self, x, span_mask):
+         """
+         :param x: [bsz, len, len, dim] input feature matrix
+         :param span_mask: [bsz, len, len] mask matrix that bounds the attention receptive field
+         """
+         bsz, length, _, dim = x.shape  # unpack the input shape
+
+         # Reshape the input to [bsz * len, len, dim] so it fits MultiheadAttention
+         x = x.view(bsz * length, length, dim)  # flatten the first two dims for attention
+
+         # Transpose to [len, bsz * len, dim], as MultiheadAttention expects
+         x = x.transpose(0, 1)
+
+         # The mask must be passed into the attention computation
+         attention_output, _ = self.attn(x, x, x, attn_mask=span_mask)
+
+         # Restore the [bsz, len, len, dim] shape
+         attention_output = attention_output.transpose(0, 1).view(bsz, length, length, dim)
+
+         return attention_output
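The reshape trick above folds the batch and head-position axes together so that each row of spans becomes an independent "sequence" for attention. A standalone NumPy round-trip check of the same folding (sizes are illustrative):

```python
import numpy as np

bsz, length, dim = 2, 4, 6
x = np.arange(bsz * length * length * dim, dtype=float).reshape(bsz, length, length, dim)

# Fold [bsz, len, len, dim] -> [bsz*len, len, dim], then move the sequence
# axis to the front, mirroring the view/transpose pair in the forward pass
flat = x.reshape(bsz * length, length, dim).transpose(1, 0, 2)

# Inverse: undo the transpose and reshape to recover the original tensor
back = flat.transpose(1, 0, 2).reshape(bsz, length, length, dim)
```

The round trip is lossless, which is what lets the module restore `[bsz, len, len, dim]` exactly after attention.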
+
+ class LocalAttentionModel(nn.Module):
+     def __init__(self, dim, window_size=3):
+         super(LocalAttentionModel, self).__init__()
+         self.local_attention = LocalSpanAttention(dim)
+         self.norm = nn.LayerNorm(dim)
+         self.window_size = window_size
+
+     def generate_local_mask(self, seq_len, window_size):
+         # Build the local-attention mask: only tokens within the window may interact
+         mask = torch.full((seq_len, seq_len), float('-inf'))  # initialize to all -inf
+         for i in range(seq_len):
+             start = max(0, i - window_size)
+             end = min(seq_len, i + window_size + 1)
+             mask[i, start:end] = 0  # allow interaction within the local window
+         return mask
+
+     def forward(self, x, pad_mask):
+         """
+         :param x: [bsz, len, len, dim] input features
+         """
+         bsz, length, _, dim = x.shape
+
+         # Generate the local mask that bounds each span's attention range
+         local_mask = self.generate_local_mask(length, self.window_size)
+         local_mask = local_mask.to(x.device)  # keep the mask on the same device as the input
+
+         # Apply attention over each sample's local spans
+         x = x.masked_fill(pad_mask, 0)
+         attn_output = self.local_attention(x, local_mask)
+
+         # Normalize with LayerNorm
+         output = self.norm(attn_output)
+
+         return output
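`generate_local_mask` builds a banded additive mask: entries outside a ±`window_size` band are `-inf`, so softmax attention assigns them zero weight. A standalone NumPy sketch of the same banding:

```python
import numpy as np

def local_mask(seq_len, window_size):
    # -inf blocks attention; 0 allows it, matching generate_local_mask above
    mask = np.full((seq_len, seq_len), -np.inf)
    for i in range(seq_len):
        start = max(0, i - window_size)
        end = min(seq_len, i + window_size + 1)
        mask[i, start:end] = 0.0
    return mask

m = local_mask(5, 1)
# Row i has zeros only at columns i-1 .. i+1 (clipped at the edges)
```

With `seq_len=5` and `window_size=1` the band holds 13 finite entries (3 per interior row, 2 per edge row), so each position attends to at most three neighbors.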
+
+
+ class LayerNorm(nn.Module):
+     def __init__(self, shape=(1, 7, 1, 1), dim_index=1):
+         super(LayerNorm, self).__init__()
+         self.weight = nn.Parameter(torch.ones(shape))
+         self.bias = nn.Parameter(torch.zeros(shape))
+         self.dim_index = dim_index
+         self.eps = 1e-6
+
+     def forward(self, x):
+         """
+         :param x: bsz x dim x max_len x max_len
+         :return: the input normalized over the ``dim_index`` dimension
+         """
+         u = x.mean(dim=self.dim_index, keepdim=True)
+         s = (x - u).pow(2).mean(dim=self.dim_index, keepdim=True)
+         x = (x - u) / torch.sqrt(s + self.eps)
+         x = self.weight * x + self.bias
+         return x
+
+
+ class MaskConv2d(nn.Module):
+     def __init__(self, in_ch, out_ch, kernel_size=3, padding=1, groups=1):
+         super(MaskConv2d, self).__init__()
+         self.conv2d = nn.Conv2d(in_ch, out_ch, kernel_size=kernel_size, padding=padding,
+                                 bias=False, groups=groups)
+
+     def forward(self, x, mask):
+         """
+         :param x: bsz x in_ch x max_len x max_len
+         :param mask: boolean pad mask; True positions are zeroed before the convolution
+         :return: bsz x out_ch x max_len x max_len
+         """
+         x = x.masked_fill(mask, 0)
+         _x = self.conv2d(x)
+         return _x
+
+
+ class MaskCNN(nn.Module):
+     def __init__(self, input_channels, output_channels, kernel_size=3, depth=3):
+         super(MaskCNN, self).__init__()
+
+         layers = []
+         for i in range(depth):
+             layers.extend([
+                 MaskConv2d(input_channels, input_channels,
+                            kernel_size=kernel_size, padding=kernel_size // 2),
+                 LayerNorm((1, input_channels, 1, 1), dim_index=1),
+                 nn.GELU()])
+         layers.append(MaskConv2d(input_channels, output_channels,
+                                  kernel_size=3, padding=3 // 2))
+         self.cnns = nn.ModuleList(layers)
+
+     def forward(self, x, mask):
+         _x = x  # kept as the residual
+         for layer in self.cnns:
+             if isinstance(layer, LayerNorm):
+                 x = x + _x  # residual connection before normalization
+                 x = layer(x)
+                 _x = x
+             elif not isinstance(layer, nn.GELU):
+                 x = layer(x, mask)  # masked convolutions take the pad mask
+             else:
+                 x = layer(x)
+         return _x
+
+
+ class MultiHeadBiaffine(nn.Module):
+     def __init__(self, dim, out=None, n_head=4):
+         super(MultiHeadBiaffine, self).__init__()
+         assert dim % n_head == 0
+         in_head_dim = dim // n_head
+         out = dim if out is None else out
+         assert out % n_head == 0
+         out_head_dim = out // n_head
+         self.n_head = n_head
+         self.W = nn.Parameter(nn.init.xavier_normal_(torch.randn(
+             self.n_head, out_head_dim, in_head_dim, in_head_dim)))
+         self.out_dim = out
+
+     def forward(self, h, v):
+         """
+         :param h: bsz x max_len x dim
+         :param v: bsz x max_len x dim
+         :return: bsz x out_dim x max_len x max_len
+         """
+         bsz, max_len, dim = h.size()
+         h = h.reshape(bsz, max_len, self.n_head, -1)
+         v = v.reshape(bsz, max_len, self.n_head, -1)
+         # b: bsz, l: seq_len (head position), k: seq_len (tail position),
+         # h: head_num, x/y: in_head_dim, d: out_head_dim
+         w = torch.einsum('blhx,hdxy,bkhy->bhdlk', h, self.W, v)
+         # [batch_size, out_dim, seq_len, seq_len]
+         w = w.reshape(bsz, self.out_dim, max_len, max_len)
+         return w
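The einsum in `MultiHeadBiaffine.forward` contracts each (head position, tail position) pair through a per-head bilinear form, producing one score vector per span. A NumPy shape sketch of the same contraction (all sizes are illustrative, not the model's real hyperparameters):

```python
import numpy as np

bsz, max_len, n_head = 2, 5, 4
in_head_dim, out_head_dim = 3, 3
out_dim = n_head * out_head_dim

# Head/tail states already split into per-head chunks, plus the bilinear weights
h = np.random.randn(bsz, max_len, n_head, in_head_dim)
v = np.random.randn(bsz, max_len, n_head, in_head_dim)
W = np.random.randn(n_head, out_head_dim, in_head_dim, in_head_dim)

# Same contraction as the torch.einsum above: one bilinear score per
# (head position l, tail position k) pair within each attention head
w = np.einsum('blhx,hdxy,bkhy->bhdlk', h, W, v)

# Merge the head and per-head-output axes into a single out_dim axis
w = w.reshape(bsz, out_dim, max_len, max_len)
```

Splitting `dim` into `n_head` chunks keeps the parameter count at `n_head * out_head_dim * in_head_dim^2` instead of the `out * dim^2` a single full biaffine would need.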
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fb3fe79261f6683f838a0de670ca8425e9536826423cd6cc91c24851ff4e422c
+ size 418892592
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"do_lower_case": true, "do_basic_tokenize": true, "never_split": null, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "special_tokens_map_file": "/home/lpc/models/chinese-bert-wwm-ext/special_tokens_map.json", "name_or_path": "/home/lpc/models/chinese-bert-wwm-ext/", "tokenizer_class": "BertTokenizer"}
vocab.txt ADDED
The diff for this file is too large to render. See raw diff