Improving Low-Resource Sequence Labeling with Knowledge Fusion and Contextual Label Explanations
Overview
This repository provides the official implementation of our paper:
Improving Low-Resource Sequence Labeling with Knowledge Fusion and Contextual Label Explanations (arXiv:2501.19093)
Low-resource sequence labeling often suffers from data sparsity and limited contextual generalization. We propose KnowFREE (Knowledge-Fused Representation Enhancement Framework), a framework that integrates external linguistic knowledge and contextual label explanations into the model's representation space to enhance low-resource performance.
Key Highlights:
We combine an LLM-based knowledge enhancement workflow with a span-based KnowFREE model to effectively address these challenges.
Pipeline 1: Label Extension Annotation
- Objective: To leverage LLMs to generate extension entity labels, word segmentation tags, and POS tags for the original samples.
- Effect:
- Enhances the model's understanding of fine-grained contextual semantics.
- Improves the ability to distinguish entity boundaries in character-dense languages.
Pipeline 2: Enriched Explanation Synthesis
- Objective: To use LLMs to generate detailed, context-aware explanations for target entities, thereby synthesizing new, high-quality training samples.
- Effect:
- Effectively mitigates semantic distribution bias between synthetic samples and the target domain.
- Significantly expands the number of samples and improves model performance in extremely low-resource settings.
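To make Pipeline 2 concrete, here is a minimal sketch of how an explanation-synthesis prompt could be assembled. The actual prompts used in the paper are not reproduced here; `build_explanation_prompt` is a hypothetical helper, and its wording is illustrative only.

```python
# Hypothetical Pipeline 2 prompt builder: ask an LLM for a context-aware
# explanation of a target entity, to be woven into a synthetic sample.
def build_explanation_prompt(text: str, entity: str, label: str) -> str:
    """Build an (illustrative) explanation-synthesis prompt for one entity."""
    return (
        f"Sentence: {text}\n"
        f"Entity: {entity} (type: {label})\n"
        "Explain in one sentence what this entity refers to in the sentence "
        "above, so the explanation can serve as extra training context."
    )

prompt = build_explanation_prompt(
    "im thinking of a comedy where a group of husbands ...",
    "comedy", "GENRE",
)
print(prompt.splitlines()[1])  # Entity: comedy (type: GENRE)
```

The returned string would then be sent to the LLM of choice (e.g., ChatGLM3 in our checkpoints), and the generated explanation attached to the sample.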
Quick Links
Model Checkpoints
Given the large number of experiments, the architectural differences between the initial and reconstructed models, and the limited practical value of low-resource checkpoints sampled from the full dataset, we only release a few representative checkpoints (e.g., Weibo) on Hugging Face for reference:
| Model | F1 |
|---|---|
| aleversn/KnowFREE-Weibo-BERT-base (many-shot, 1000 samples, with ChatGLM3) | 76.78 |
| aleversn/KnowFREE-Youku-BERT-base (many-shot, 1000 samples, with ChatGLM3) | 84.50 |
KnowFREE Framework
Architecture: A Biaffine-based span model that supports nested entity annotation.
Core Innovations:
- Introduces a Local Multi-head Attention Layer to efficiently fuse the multi-type extension label features generated in Pipeline 1.
- No External Knowledge Needed for Inference: the model learns to fuse knowledge during training; at inference, the logits of the extension labels are masked.
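The inference-time masking above can be sketched in plain Python (this is an illustration of the idea, not the repository's implementation): extension-label logits are set to negative infinity before decoding, so only target labels can be predicted.

```python
# Sketch of inference-time masking: extension labels (POS tags, segmentation
# tags, ...) help only during training; their logits are masked at inference.
NEG_INF = float("-inf")

def mask_extension_logits(logits, is_target):
    """Keep logits of target labels; mask extension-label logits."""
    return [x if keep else NEG_INF for x, keep in zip(logits, is_target)]

# 3 target labels followed by 2 extension labels.
logits = [0.2, 1.5, 0.1, 3.0, 2.4]
is_target = [True, True, True, False, False]
masked = mask_extension_logits(logits, is_target)
best = max(range(len(masked)), key=masked.__getitem__)
print(best)  # 1 -- the extension label with the highest raw logit is ignored
```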
Installation Guide
Core Dependencies
Create an environment and install dependencies:
```bash
conda create -n knowfree python=3.8
conda activate knowfree
pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install transformers==4.18.0 fastNLP==1.0.1 PrettyTable
pip install torch-scatter==2.0.8 -f https://data.pyg.org/whl/torch-1.8.0+cu111.html
```
Data Augmentation Workflow
See the detailed data synthesis pipeline in Syn_Pipelines.
In KnowFREE, we employ contextual paraphrasing and label explanation synthesis to augment low-resource datasets. For each entity label, LLMs generate descriptive explanations that are integrated into the learning process to mitigate label semantic sparsity.
Run KnowFREE Models
Training with KnowFREE
Dataset Format
Specify the dataset path using the data_present_path argument (Default: ./datasets/present.json). The file should be a JSON object with the following format:
```json
{
    "weibo": {
        "train": "./datasets/weibo/train.jsonl",
        "dev": "./datasets/weibo/dev.jsonl",
        "test": "./datasets/weibo/test.jsonl",
        "labels": "./datasets/weibo/labels.txt"
    }
}
```
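A small sanity check for the `present.json` format above can catch missing paths early. This helper is illustrative, not part of the repository; it only verifies that every dataset entry provides the four required keys.

```python
# Validate a present.json-style config: each dataset needs train/dev/test/labels.
import json
import os
import tempfile

REQUIRED_KEYS = {"train", "dev", "test", "labels"}

def load_present(path: str) -> dict:
    with open(path, encoding="utf-8") as f:
        present = json.load(f)
    for name, entry in present.items():
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"dataset '{name}' is missing {sorted(missing)}")
    return present

with tempfile.TemporaryDirectory() as d:
    cfg = os.path.join(d, "present.json")
    with open(cfg, "w", encoding="utf-8") as f:
        json.dump({"weibo": {
            "train": "./datasets/weibo/train.jsonl",
            "dev": "./datasets/weibo/dev.jsonl",
            "test": "./datasets/weibo/test.jsonl",
            "labels": "./datasets/weibo/labels.txt",
        }}, f)
    present = load_present(cfg)
    print(present["weibo"]["train"])  # ./datasets/weibo/train.jsonl
```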
Train Samples of Different Languages:
- Chinese
{"text": ["็ง", "ๆ", "ๅจ", "ๆน", "ไฝ", "่ต", "่ฎฏ", "ๆบ", "่ฝ", "๏ผ", "ๅฟซ", "ๆท", "็", "ๆฑฝ", "่ฝฆ", "็", "ๆดป", "้", "่ฆ", "ๆ", "ไธ", "ๅฑ", "ไธ", "ไบ", "็ฑ", "ไฝ "], "label": ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"], "entities": []}
{"text": ["ๅฏน", "๏ผ", "่พ", "็ป", "ไธ", "ไธช", "ๅฅณ", "ไบบ", "๏ผ", "็", "ๆ", "็ปฉ", "ใ", "ๅคฑ", "ๆ"], "label": ["O", "O", "O", "O", "O", "O", "B-PER.NOM", "E-PER.NOM", "O", "O", "O", "O", "O", "O", "O"], "entities": [{"start": 6, "entity": "PER.NOM", "end": 8, "text": ["ๅฅณ", "ไบบ"]}]}
{"text": ["ไป", "ๅคฉ", "ไธ", "ๅ", "่ตท", "ๆฅ", "็", "ๅฐ", "ๅค", "้ข", "็", "ๅคช", "้ณ", "ใ", "ใ", "ใ", "ใ", "ๆ", "็ฌฌ", "ไธ", "ๅ", "ๅบ", "็ซ", "็ถ", "ๆฏ", "ๅผบ", "็", "็", "ๆณ", "ๅ", "ๅฎถ", "ๆณช", "ๆณ", "ๆ", "ไปฌ", "ไธ", "่ตท", "ๅจ", "ๅ", "้ฑผ", "ไธช", "ๆถ", "ๅ", "ไบ", "ใ", "ใ", "ใ", "ใ", "ๆ", "ๅฅฝ", "ๅค", "ๅฅฝ", "ๅค", "็", "่ฏ", "ๆณ", "ๅฏน", "ไฝ ", "่ฏด", "ๆ", "ๅทพ", "ๅก", "ๆณ", "่ฆ", "็ฆ", "็ฆ", "็ฆ", "ๆ", "ๆ", "ๅธ", "ๆ", "ๆฏ", "ๆณ", "ๅ", "ๅผ", "ไบ", "ๆต", "็", "ๅฟ"], "label": ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "B-LOC.NAM", "E-LOC.NAM", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "B-PER.NAM", "I-PER.NAM", "E-PER.NAM", "O", "O", "O", "O", "O", "O", "B-PER.NAM", "E-PER.NAM", "O", "O", "O", "O", "O", "O", "O", "O", "O"], "entities": [{"start": 38, "entity": "LOC.NAM", "end": 40, "text": ["ๅ", "้ฑผ"]}, {"start": 59, "entity": "PER.NAM", "end": 62, "text": ["ๆ", "ๅทพ", "ๅก"]}, {"start": 68, "entity": "PER.NAM", "end": 70, "text": ["ๆ", "ๅธ"]}]}
- English
{"text": ["im", "thinking", "of", "a", "comedy", "where", "a", "group", "of", "husbands", "receive", "one", "chance", "from", "their", "wives", "to", "engage", "with", "other", "women"], "entities": [{"start": 4, "end": 5, "entity": "GENRE", "text": ["comedy"]}, {"start": 6, "end": 21, "entity": "PLOT", "text": ["a", "group", "of", "husbands", "receive", "one", "chance", "from", "their", "wives", "to", "engage", "with", "other", "women"]}]}
{"text": ["another", "sequel", "of", "an", "action", "movie", "about", "drag", "street", "car", "races", "alcohol", "and", "gun", "violence"], "entities": [{"start": 1, "end": 2, "entity": "RELATIONSHIP", "text": ["sequel"]}, {"start": 4, "end": 5, "entity": "GENRE", "text": ["action"]}, {"start": 7, "end": 15, "entity": "PLOT", "text": ["drag", "street", "car", "races", "alcohol", "and", "gun", "violence"]}]}
{"text": ["what", "is", "the", "name", "of", "the", "movie", "in", "which", "a", "group", "of", "criminals", "begin", "to", "suspect", "that", "one", "of", "them", "is", "a", "police", "informant", "after", "a", "simple", "jewelery", "heist", "goes", "terribly", "wrong"], "entities": [{"start": 9, "end": 32, "entity": "PLOT", "text": ["a", "group", "of", "criminals", "begin", "to", "suspect", "that", "one", "of", "them", "is", "a", "police", "informant", "after", "a", "simple", "jewelery", "heist", "goes", "terribly", "wrong"]}]}
{"text": ["a", "movie", "with", "vin", "diesel", "in", "world", "war", "2", "in", "a", "foreign", "country", "shooting", "people"], "entities": [{"start": 3, "end": 5, "entity": "ACTOR", "text": ["vin", "diesel"]}, {"start": 6, "end": 9, "entity": "GENRE", "text": ["world", "war", "2"]}, {"start": 11, "end": 15, "entity": "PLOT", "text": ["foreign", "country", "shooting", "people"]}]}
{"text": ["what", "is", "the", "1991", "disney", "animated", "movie", "that", "featured", "angela", "lansbury", "as", "the", "voice", "of", "a", "teapot"], "entities": [{"start": 3, "end": 4, "entity": "YEAR", "text": ["1991"]}, {"start": 5, "end": 6, "entity": "GENRE", "text": ["animated"]}, {"start": 9, "end": 11, "entity": "ACTOR", "text": ["angela", "lansbury"]}, {"start": 16, "end": 17, "entity": "CHARACTER_NAME", "text": ["teapot"]}]}
- Japanese
{"text": ["I", "n", "f", "o", "r", "m", "i", "x", "ใฎ", "ๅ", "ใ", "ใ", "ใฟ", "ใฆ", "ใ", "ใช", "ใฉ", "ใฏ", "ใซ", "ใจ", "I", "B", "M", "ใ", "่ฟฝ", "้", "ใ", "ใ", "ใ"], "entities": [{"start": 0, "end": 8, "entity": "ๆณไบบๅ", "text": ["I", "n", "f", "o", "r", "m", "i", "x"]}, {"start": 15, "end": 19, "entity": "ๆณไบบๅ", "text": ["ใช", "ใฉ", "ใฏ", "ใซ"]}, {"start": 20, "end": 23, "entity": "ๆณไบบๅ", "text": ["I", "B", "M"]}]}
{"text": ["็พ", "ๅจ", "ใฏ", "ใข", "ใ", "ใก", "ใผ", "ใท", "ใง", "ใณ", "ๆฅญ", "็", "ใ", "ใ", "้", "ใ", "ใฆ", "ใ", "ใ", "ใ", "ๆฐด", "ๅฝฉ", "็ป", "ๅฎถ", "ใจ", "ใ", "ใฆ", "ใ", "ๆดป", "ๅ", "ใ", "ใฆ", "ใ", "ใ", "ใ"], "entities": []}
{"text": ["ๅคง", "้", "ๆฑ", "ใค", "ใณ", "ใฟ", "ใผ", "ใ", "ใง", "ใณ", "ใธ", "ใฏ", "ใ", "ๅคง", "ๅ", "็", "่ฑ", "ๅพ", "ๅคง", "้", "ๅธ", "ๅคง", "้", "็บ", "ๅพ", "็ฐ", "ใซ", "ใ", "ใ", "ไธญ", "ไน", "ๅท", "ๆจช", "ๆญ", "้", "่ทฏ", "ใฎ", "ใค", "ใณ", "ใฟ", "ใผ", "ใ", "ใง", "ใณ", "ใธ", "ใง", "ใ", "ใ", "ใ"], "entities": [{"start": 0, "end": 11, "entity": "ๆฝ่จญๅ", "text": ["ๅคง", "้", "ๆฑ", "ใค", "ใณ", "ใฟ", "ใผ", "ใ", "ใง", "ใณ", "ใธ"]}, {"start": 13, "end": 26, "entity": "ๅฐๅ", "text": ["ๅคง", "ๅ", "็", "่ฑ", "ๅพ", "ๅคง", "้", "ๅธ", "ๅคง", "้", "็บ", "ๅพ", "็ฐ"]}, {"start": 29, "end": 36, "entity": "ๆฝ่จญๅ", "text": ["ไธญ", "ไน", "ๅท", "ๆจช", "ๆญ", "้", "่ทฏ"]}]}
{"text": ["2", "0", "1", "4", "ๅนด", "1", "ๆ", "1", "5", "ๆฅ", "ใ", "ใ", "ใ", "ใฟ", "ใฏ", "ใ", "ใฃ", "ใณ", "ใ", "ใผ", "ใฎ", "ไธ", "ๅบง", "้จ", "ไป", "ๆ", "ใ", "ๆ", "่ญท", "ใ", "ใ", "ไฝฟ", "ๅฝ", "ใ", "ๆ", "ใฃ", "ใฆ", "ใ", "ใ", "ใณ", "ใ", "ใฌ", "ใผ", "ใฎ", "ไป", "ๆ", "ๅง", "ใฎ", "ๅคง", "่ฆ", "ๆจก", "ใช", "ไผ", "่ญฐ", "ใง", "ๆญฃ", "ๅผ", "ใซ", "่จญ", "็ซ", "ใ", "ใ", "ใ", "ใ"], "entities": [{"start": 11, "end": 14, "entity": "ๆณไบบๅ", "text": ["ใ", "ใ", "ใฟ"]}, {"start": 15, "end": 20, "entity": "ๅฐๅ", "text": ["ใ", "ใฃ", "ใณ", "ใ", "ใผ"]}, {"start": 38, "end": 43, "entity": "ๅฐๅ", "text": ["ใ", "ใณ", "ใ", "ใฌ", "ใผ"]}]}
{"text": ["ๆฐธ", "ๆณฐ", "่", "้ง", "ใฏ", "ใ", "ไธญ", "่ฏ", "ไบบ", "ๆฐ", "ๅฑ", "ๅ", "ๅฝ", "ๅ", "ไบฌ", "ๅธ", "ๆตท", "ๆท", "ๅบ", "ใซ", "ไฝ", "็ฝฎ", "ใ", "ใ", "ๅ", "ไบฌ", "ๅฐ", "ไธ", "้", "8", "ๅท", "็ท", "ใฎ", "้ง", "ใง", "ใ", "ใ", "ใ"], "entities": [{"start": 0, "end": 4, "entity": "ๆฝ่จญๅ", "text": ["ๆฐธ", "ๆณฐ", "่", "้ง"]}, {"start": 6, "end": 19, "entity": "ๅฐๅ", "text": ["ไธญ", "่ฏ", "ไบบ", "ๆฐ", "ๅฑ", "ๅ", "ๅฝ", "ๅ", "ไบฌ", "ๅธ", "ๆตท", "ๆท", "ๅบ"]}]}
- Korean
{"text": ["๊ทธ", "๋ชจ์ต", "์", "๋ณด", "ใด", "๋ฏผ์ด", "๋", "ํ ์๋ฒ์ง", "๊ฐ", "๋ง์น", "์ ์ํฐ", "์์", "์ด๊ธฐ", "๊ณ ", "๋์์ค", "ใด", "์ฅ๊ตฐ", "์ฒ๋ผ", "์์ ", "ํ", "์", "๋ณด์ด", "ใด๋ค๊ณ ", "์๊ฐ", "ํ", "์", "์ต๋๋ค", "."], "entities": [{"start": 5, "end": 6, "entity": "PS", "text": ["๋ฏผ์ด"]}]}
{"text": ["๋ด๋ฌ", "18", "์ผ", "๋ถํฐ", "๋ด๋", "2", "์", "20", "์ผ", "๊น์ง", "๋", "์์ธ์ญ", "์์", "๋ฌด์ฃผ๋ฆฌ์กฐํธ", "๋ถ๊ทผ", "๊น์ง", "์คํค๊ด๊ด", "์ด์ฐจ", "๋ฅผ", "์ดํ", "ํ", "ใด๋ค", "."], "entities": [{"start": 0, "end": 10, "entity": "DT", "text": ["๋ด๋ฌ", "18", "์ผ", "๋ถํฐ", "๋ด๋", "2", "์", "20", "์ผ", "๊น์ง"]}, {"start": 11, "end": 12, "entity": "LC", "text": ["์์ธ์ญ"]}, {"start": 13, "end": 14, "entity": "OG", "text": ["๋ฌด์ฃผ๋ฆฌ์กฐํธ"]}]}
{"text": ["ํธ์๋ ฅ", "์", "๊ณ ", "์ ๋", "์ ", "์ด", "ใด", "์ฃผ์ ", "๋ฅผ", "์ก์๋ด", "๋", "๋ฐ", "๋ฅํ", "ใด", "์ฆ์", "์ด", "์ง๋ง", "์ด", "์ํ", "์์", "๋", "๋ฌด์", "์ด", "ํธ์๋ ฅ", "์ด", "์", "์์ง", "๊ฒฐ์ ", "ํ", "์ง", "๋ชปํ", "๊ณ ", "๋ง์ค์ด", "ใด๋ค", "."], "entities": [{"start": 14, "end": 15, "entity": "PS", "text": ["์ฆ์"]}]}
{"text": ["๊ทธ๋์", "์ธํธ", "๋", "๋ฐค", "์ด", "๋ฉด", "์น๊ตฌ", "๋ค", "์ง", "์", "๋์๋ค๋", "๋ฉฐ", "์๋ฒ์ง", "๋ชฐ๋", "์ฐ์ต", "์", "ํ", "์", "์ต๋๋ค", "."], "entities": [{"start": 1, "end": 2, "entity": "PS", "text": ["์ธํธ"]}, {"start": 3, "end": 4, "entity": "TI", "text": ["๋ฐค"]}]}
{"text": ["ํฉ์จ", "๋", "์์ ", "์ด", "์ด๋ฆฌ", "์ด์", "๋ฃ", "์", "์ด", "์ด์ผ๊ธฐ", "๊ฐ", "์ด๋ฆฐ์ด", "๋ค", "์๊ฒ", "์๋ฐ", "ํ", "ใด", "ํจ์", "์", "๋ง์", "์", "์ ํ", "์", "์ฃผ", "ใน", "์", "์", "์", "๊ฒ", "๊ฐ", "์", "5", "๋ถ", "์ง๋ฆฌ", "๊ตฌ์ฐ๋ํ", "๋ก", "๊ฐ์", "ํ", "์", "๋ค๊ณ ", "๋ง", "ํ", "ใด๋ค", "."], "entities": [{"start": 0, "end": 1, "entity": "PS", "text": ["ํฉ์จ"]}, {"start": 31, "end": 33, "entity": "TI", "text": ["5", "๋ถ"]}]}
{"text": ["์๋ฒ์ง", "๊ฐ", "๋์๊ฐ", "์", "ใด", "๋ค", "์ด๋จธ๋", "์", "ํธ์ ", "๋ฅผ", "๋ฐฐ๊ฒฝ", "์ผ๋ก", "์น์ฃผ", "๋", "์ง์", "์์", "๋ง", "์", "๋๋จ", "ํ", "ใด", "๊ถ์ธ", "๋ฅผ", "๋๋ฆฌ", "์", "๋ค", "."], "entities": [{"start": 12, "end": 13, "entity": "PS", "text": ["์น์ฃผ"]}]}
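For the sample format above (token lists with exclusive `end` offsets), a quick consistency check can catch misaligned spans before training. The validator below is illustrative, not part of the repository:

```python
# Check that every entity's [start, end) slice of the token list matches
# the entity's own "text" field (end is exclusive).
def validate_sample(sample: dict) -> bool:
    tokens = sample["text"]
    return all(
        tokens[e["start"]:e["end"]] == e["text"] for e in sample["entities"]
    )

sample = {
    "text": ["im", "thinking", "of", "a", "comedy", "where", "a", "group",
             "of", "husbands", "receive", "one", "chance", "from", "their",
             "wives", "to", "engage", "with", "other", "women"],
    "entities": [{"start": 4, "end": 5, "entity": "GENRE", "text": ["comedy"]}],
}
print(validate_sample(sample))  # True
```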
Labels
.txt

```text
O
GPE.NAM
GPE.NOM
LOC.NAM
LOC.NOM
ORG.NAM
ORG.NOM
PER.NAM
PER.NOM
```
.json/.jsonl

```
{
    "O":          {"idx": 0,  "count": -1,   "is_target": true},
    "GPE.NAM":    {"idx": 1,  "count": -1,   "is_target": true},
    "GPE.NOM":    {"idx": 2,  "count": -1,   "is_target": true},
    "LOC.NAM":    {"idx": 3,  "count": -1,   "is_target": true},
    "LOC.NOM":    {"idx": 4,  "count": -1,   "is_target": true},
    "ORG.NAM":    {"idx": 5,  "count": -1,   "is_target": true},
    "ORG.NOM":    {"idx": 6,  "count": -1,   "is_target": true},
    "PER.NAM":    {"idx": 7,  "count": -1,   "is_target": true},
    "PER.NOM":    {"idx": 8,  "count": -1,   "is_target": true},
    "ADJECTIVE":  {"idx": 9,  "count": 1008, "is_target": false},
    "ADPOSITION": {"idx": 10, "count": 41,   "is_target": false},
    "ADVERB":     {"idx": 11, "count": 1147, "is_target": false},
    "APP":        {"idx": 12, "count": 3,    "is_target": false},
    "AUXILIARY":  {"idx": 13, "count": 4,    "is_target": false},
    ...
}
```
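The `is_target` flag in the label file separates the original task labels from the extension labels added in Pipeline 1. A small helper (assumed for illustration, not repository code) makes the split explicit:

```python
# Split a labels.json-style mapping into target labels (scored at inference)
# and extension labels (used only during training, masked at inference).
def split_labels(label_map: dict):
    target = [k for k, v in label_map.items() if v["is_target"]]
    extension = [k for k, v in label_map.items() if not v["is_target"]]
    return target, extension

label_map = {
    "O":         {"idx": 0, "count": -1,   "is_target": True},
    "PER.NAM":   {"idx": 7, "count": -1,   "is_target": True},
    "ADJECTIVE": {"idx": 9, "count": 1008, "is_target": False},
}
target, extension = split_labels(label_map)
print(target)     # ['O', 'PER.NAM']
print(extension)  # ['ADJECTIVE']
```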
- Model: BERT / RoBERTa
```python
from main.trainers.knowfree_trainer import Trainer
from transformers import BertTokenizer, BertConfig

MODEL_PATH = "<MODEL_PATH>"
tokenizer = BertTokenizer.from_pretrained(MODEL_PATH)
config = BertConfig.from_pretrained(MODEL_PATH)
trainer = Trainer(tokenizer=tokenizer, config=config, from_pretrained=MODEL_PATH,
                  data_name='<DATASET_NAME>',
                  batch_size=4,
                  batch_size_eval=8,
                  task_name='<TASK_NAME>')

for i in trainer(num_epochs=120, other_lr=1e-3, weight_decay=0.01,
                 remove_clashed=True, nested=False,
                 eval_call_step=lambda x: x % 125 == 0):
    a = i
```
Key Params

- `other_lr`: learning rate for the non-PLM parameters.
- `remove_clashed`: remove clashing labels when predicted spans overlap, keeping only the span with the minimum start position.
- `nested`: whether nested entities are supported. For nested sequence labeling datasets such as CMeEE, set it to `True` and disable `remove_clashed`.
- `eval_call_step`: a callable that decides at which training steps evaluation runs (e.g., `lambda x: x % 125 == 0` evaluates every 125 steps).
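One plausible reading of the clash-removal behavior (an assumption for illustration, not the repository implementation) is a greedy sweep that keeps the earliest-starting span and drops any span overlapping an already-kept one:

```python
# Greedy clash removal: keep spans in order of start position; drop any span
# that overlaps a span already kept. Spans are (start, end, label), end exclusive.
def remove_clashed(spans):
    kept = []
    for span in sorted(spans, key=lambda s: (s[0], s[1])):
        if all(span[0] >= k[1] or span[1] <= k[0] for k in kept):
            kept.append(span)
    return kept

spans = [(0, 3, "PER.NAM"), (2, 5, "LOC.NAM"), (6, 8, "ORG.NAM")]
print(remove_clashed(spans))  # [(0, 3, 'PER.NAM'), (6, 8, 'ORG.NAM')]
```

With `nested=True`, overlapping (containment) spans would instead be allowed, which is why the two options are mutually exclusive.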
Evaluation Only
Comment out the training loop to evaluate directly:
```python
trainer.eval(0, is_eval=True)
```
Train with CNN Nested NER
```python
from main.trainers.cnnner_trainer import Trainer
from transformers import BertTokenizer, BertConfig

MODEL_PATH = "<MODEL_PATH>"
tokenizer = BertTokenizer.from_pretrained(MODEL_PATH)
config = BertConfig.from_pretrained(MODEL_PATH)
trainer = Trainer(tokenizer=tokenizer, config=config, from_pretrained=MODEL_PATH,
                  data_name='<DATASET_NAME>',
                  batch_size=4,
                  batch_size_eval=8,
                  task_name='<TASK_NAME>')

for i in trainer(num_epochs=120, other_lr=1e-3, weight_decay=0.01,
                 remove_clashed=True, nested=False,
                 eval_call_step=lambda x: x % 125 == 0):
    a = i
```
Prediction
```python
from main.predictor.knowfree_predictor import KnowFREEPredictor
from transformers import BertTokenizer, BertConfig

MODEL_PATH = "<MODEL_PATH>"
LABEL_FILE = '<LABEL_PATH>'
tokenizer = BertTokenizer.from_pretrained(MODEL_PATH)
config = BertConfig.from_pretrained(MODEL_PATH)
pred = KnowFREEPredictor(tokenizer=tokenizer, config=config, from_pretrained=MODEL_PATH,
                         label_file=LABEL_FILE, batch_size=4)

for entities in pred(['ๅถ่ต่๏ผๅจ็ๆถๅฐ่ดข่ฟๆปๆป่ๆฅ้ฑ', 'ๆ่ฆๅปๆ่ฆๅป่ฑๅฟ่ฑๅฟ่ฑๅฟ่ถๅๆๅคงๅธ่ดตไป้่ถๅๅคงๅๆๅด่ง่ฏ็ญ่ฝฌๅ้่ถ่ดดๅงๅพฎๅๅทๅค่ฏ็ญๆๅจ็ฅใ้่ถๅๅคงๅๆ']):
    print(entities)
```
Result
```
[
    [
        {'start': 0, 'end': 3, 'entity': 'PER.NAM', 'text': ['ๅถ', '่ต', '่']}
    ],
    [
        {'start': 45, 'end': 47, 'entity': 'PER.NAM', 'text': ['้', '่ถ']},
        {'start': 19, 'end': 21, 'entity': 'PER.NAM', 'text': ['้', '่ถ']},
        {'start': 31, 'end': 33, 'entity': 'PER.NAM', 'text': ['้', '่ถ']}
    ]
]
```
Citation
```bibtex
@misc{lai2025improvinglowresourcesequencelabeling,
      title={Improving Low-Resource Sequence Labeling with Knowledge Fusion and Contextual Label Explanations},
      author={Peichao Lai and Jiaxin Gan and Feiyang Ye and Yilei Wang and Bin Cui},
      year={2025},
      eprint={2501.19093},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.19093},
}
```
Base model for aleversn/KnowFREE-Weibo-BERT-base: hfl/chinese-bert-wwm