Shorten entity types section

acd5187 verified 25 days ago

5.74 kB

	---
	language:
	- vi
	license: mit
	tags:
	- token-classification
	- ner
	- vietnamese
	- legal
	- electra
	- transformers
	base_model: NlpHUST/ner-vietnamese-electra-base
	datasets:
	- NTA1802/NER-Completing-With-Legal-Dataset
	metrics:
	- seqeval
	pipeline_tag: token-classification
	---

	# NER — Vietnamese Legal Domain (Electra-base)

	Mô hình Named Entity Recognition (NER) cho văn bản pháp lý tiếng Việt, fine-tune từ [`NlpHUST/ner-vietnamese-electra-base`](https://huggingface.co/NlpHUST/ner-vietnamese-electra-base) trên bộ dữ liệu pháp lý tự xây dựng với BIO tagging scheme, 63 loại thực thể và 125 nhãn (1 `O` + các tag B-/I-; riêng `ARTICLE` và `LOCATION` chỉ xuất hiện dạng `B-`).

	Phạm vi: 3 luật — Luật Viễn thông, Luật An ninh mạng, Luật Công nghệ thông tin (cùng các văn bản hướng dẫn liên quan).

	## Các loại thực thể nhận dạng

	125 nhãn BIO (1 `O` + 62 loại × 2 prefix B/I) thuộc 7 nhóm: văn bản pháp lý, chủ thể, khái niệm pháp lý, an ninh mạng, hạ tầng & kỹ thuật, tiêu chuẩn & quy định, và các thực thể khác (địa điểm, thời gian, mức độ...).

	## Kỹ thuật fine-tune

	- Subword label alignment — chỉ gán nhãn cho first subtoken, các subtoken còn lại dùng `-100`
	- Mixed precision (bf16) + TF32 — tối ưu tốc độ trên GPU Ampere/Hopper
	- Layer-wise Learning Rate Decay (LLRD) với decay factor `0.9` — backbone LR `3e-5`, classifier head LR `1e-4`
	- Linear warmup + decay (`warmup_ratio=0.1`), weight decay `0.01`, gradient clipping `1.0`
	- Label smoothing `0.1`
	- Early stopping theo `eval_f1` (seqeval entity-level), patience `5`
	- Dynamic padding với `DataCollatorForTokenClassification`
	- Reproducible seed `42`

	## Dữ liệu huấn luyện

	\| Split \| Số câu \|
	\|---\|---\|
	\| Train \| 6 403 \|
	\| Validation \| 1 279 \|
	\| Test \| 1 288 \|

	Dataset: JSONL format, mỗi dòng gồm `tokens` (list) và `labels` (list BIO labels) — ngoài ra có thêm `id` và `sentence`. Lưu ý `tokens` đã được word-tokenize sẵn (token dạng `Bộ_Công_an`, `an_ninh_mạng`).

	## Kết quả đánh giá

	Train 30 epoch (early stopping theo `eval_f1`), GPU NVIDIA RTX PRO 6000 Blackwell, ~262 s — `train_loss ≈ 1.09` (đã gồm label smoothing 0.1). Metric seqeval entity-level:

	\| Split \| Precision \| Recall \| F1 \|
	\|---\|---\|---\|---\|
	\| Validation \| 0.9965 \| 0.9970 \| 0.9968 \|
	\| Test \| 0.9960 \| 0.9976 \| 0.9968 \|

	Trên tập test (7 804 thực thể), theo `classification_report`:

	\| \| Precision \| Recall \| F1 \|
	\|---\|---\|---\|---\|
	\| micro avg \| 0.9960 \| 0.9976 \| 0.9968 \|
	\| macro avg \| 0.9896 \| 0.9963 \| 0.9926 \|
	\| weighted avg \| 0.9961 \| 0.9976 \| 0.9968 \|

	Phần lớn entity type đạt F1 = 1.0. Các type có support lớn: `OBLIGATION` (1079, F1 1.00), `LEGAL_ACTOR` (1025, F1 0.996), `LEGAL_ACTION` (940, F1 0.997), `LAW` (690, F1 1.00), `TIME` (556, F1 0.998), `SANCTION` (485, F1 0.998), `SYSTEM` (327, F1 0.989). Các type F1 thấp nhất đều thuộc nhóm support rất nhỏ: `AUTHORIZATION` (0.857, support 3), `AUDIT` (0.933, 7), `LICENSE` (0.947, 19), `LEGAL_CONCEPT` (0.952, 31), `PERMISSION` (0.964, 28).

	> ⚠️ F1 ≈ 0.997 phản ánh dữ liệu được sinh khá theo khuôn mẫu (template) — số liệu trên domain văn bản pháp lý thực tế, đa dạng hơn, sẽ thấp hơn đáng kể.

	## Cách sử dụng

	> ⚠️ Quan trọng: model được train trên dữ liệu đã word-tokenize (token dạng
	> `Bộ_Công_an`, `an_ninh_mạng`, ...). Khi inference bắt buộc phải word-tokenize trước
	> rồi truyền với `is_split_into_words=True` — nếu truyền raw text thẳng vào tokenizer/`pipeline`,
	> entity sẽ lệch hoàn toàn.

	```python
	import torch
	from pyvi import ViTokenizer
	from transformers import AutoTokenizer, AutoModelForTokenClassification

	MODEL_DIR = "NTA1802/NER-Completing-With-Legal-Dataset"
	tok = AutoTokenizer.from_pretrained(MODEL_DIR, use_fast=True)
	mdl = AutoModelForTokenClassification.from_pretrained(MODEL_DIR).eval()
	id2label = mdl.config.id2label

	@torch.no_grad()
	def extract_entities(text: str):
	words = ViTokenizer.tokenize(text).split()
	enc = tok(words, is_split_into_words=True, return_tensors="pt", truncation=True)
	pred_ids = mdl(**enc).logits[0].argmax(-1).tolist()
	word_ids = enc.word_ids(0)

	word_label = [None] * len(words)
	seen = set()
	for j, wid in enumerate(word_ids):
	if wid is None or wid in seen:
	continue
	seen.add(wid)
	word_label[wid] = id2label[pred_ids[j]]

	entities, cur = [], None
	for w, lab in zip(words, word_label):
	if lab in (None, "O"):
	if cur: entities.append(cur); cur = None
	continue
	prefix, _, etype = lab.partition("-")
	if prefix == "B" or cur is None or cur["type"] != etype:
	if cur: entities.append(cur)
	cur = {"type": etype, "text": w.replace("_", " ")}
	else:
	cur["text"] += " " + w.replace("_", " ")
	if cur: entities.append(cur)
	return entities

	print(extract_entities("Khoản 1 Điều 46 Luật An ninh mạng quy định về bảo vệ dữ liệu cá nhân."))
	```

	## Thông tin mô hình

	- Base model: `NlpHUST/ner-vietnamese-electra-base` (ELECTRA-base, 12 layers, hidden 768)
	- Số nhãn: 125 (BIO scheme — 63 entity types)
	- Max sequence length: 512
	- Framework: PyTorch + HuggingFace Transformers ≥ 4.41
	- Ngôn ngữ: Tiếng Việt (văn bản pháp lý — viễn thông, an ninh mạng, công nghệ thông tin)