	---
	language: vi
	tags:
	- ner
	- phobert
	- vietnamese
	- document-ai
	- cccd
	- synthetic-data
	license: mit
	base_model: vinai/phobert-base
	---

	# VietNerm - Căn cước công dân NER Model

	PhoBERT-based Named Entity Recognition (NER) model for Vietnamese Căn cước công dân (CCCD, the citizen identity card) documents.

	## ⚠️ DISCLAIMER: SYNTHETIC / MOCKUP DATA

	> This model was trained entirely on synthetic (mockup) data. NO real personal data was used.

	- All training data was generated automatically by a template + generator system
	- No real documents, real personal information, or data collected from users were used
	- Identifiers (ID, CCCD...) are randomly generated and designed not to collide with real data
	- OCR noise is injected into the data to simulate real-world conditions
	- Purpose: AI research, Document AI, OCR/NER pipelines
	- Must not be used to forge documents, create fake papers, or commit fraud or deception

	## Model Description

	This model is fine-tuned from [`vinai/phobert-base`](https://huggingface.co/vinai/phobert-base) for token-level NER on Vietnamese Căn cước công dân documents. It takes the text produced by an OCR step and extracts structured fields such as full name, ID number, and dates.

	- Base model: vinai/phobert-base
	- Task: Token Classification (NER)
	- Language: Vietnamese (vi)
	- Document type: Căn cước công dân
	- Number of labels: 13 (the 12 entity tags listed under Labels, plus presumably an `O` tag for non-entity tokens)
	- Training data: Synthetic/Mockup (not real personal data)

	## Labels

	- `B-date_of_birth`
	- `B-date_of_expiry`
	- `B-full_name`
	- `B-gender`
	- `B-id_number`
	- `B-nationality`
	- `B-place_of_origin`
	- `B-place_of_residence`
	- `I-full_name`
	- `I-nationality`
	- `I-place_of_origin`
	- `I-place_of_residence`
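
The tags above follow the BIO scheme: a `B-` tag opens a field and matching `I-` tags continue it. Fields without an `I-` variant (`date_of_birth`, `date_of_expiry`, `gender`, `id_number`) appear to be treated as single-token values. Below is a minimal sketch of how per-token tags could be grouped into fields. The helper `group_bio` and the sample tokens are hypothetical and not part of the VietNerm SDK, and an `O` tag for non-entity tokens is assumed.

```python
from typing import List, Tuple


def group_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (field, text) pairs."""
    spans: List[Tuple[str, List[str]]] = []
    current_field = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new field.
            current_field = tag[2:]
            spans.append((current_field, [token]))
        elif tag.startswith("I-") and current_field == tag[2:]:
            # An I- tag extends the field that is currently open.
            spans[-1][1].append(token)
        else:
            # "O", or an I- tag that does not continue the open field.
            current_field = None
    return [(field, " ".join(parts)) for field, parts in spans]


# Made-up example input (assumed tokenization, synthetic values).
tokens = ["Họ", "tên", "NGUYỄN", "VĂN", "A", "Giới", "tính", "Nam"]
tags = ["O", "O", "B-full_name", "I-full_name", "I-full_name", "O", "O", "B-gender"]

fields = group_bio(tokens, tags)
print(fields)
```

Stray `I-` tags that do not follow a matching `B-` are dropped in this sketch; a production pipeline might prefer to keep them as their own span.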

	## Usage

	### With VietNerm SDK

	```python
	from vietnerm import VietNerm

	ner = VietNerm(doc_type="cccd", model_path="phatdatpq/phobert-cccd-ner")
	result = ner.extract("your document text here")
	print(result)
	```

	### With Transformers

	```python
	from transformers import AutoTokenizer, AutoModelForTokenClassification
	import torch

	tokenizer = AutoTokenizer.from_pretrained("phatdatpq/phobert-cccd-ner")
	model = AutoModelForTokenClassification.from_pretrained("phatdatpq/phobert-cccd-ner")

	text = "your document text here"
	inputs = tokenizer(text, return_tensors="pt")

	# Run inference without tracking gradients
	with torch.no_grad():
	    outputs = model(**inputs)
	# Pick the highest-scoring label id for every token
	predictions = torch.argmax(outputs.logits, dim=-1)
	```
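
The Transformers snippet stops at raw label ids. The remaining step, turning ids into label names, could look like the sketch below. It assumes the checkpoint's config carries an `id2label` mapping (the usual place for it in Transformers token-classification models) and that special tokens such as `<s>` and `</s>` should be skipped. The helper `ids_to_labels` and the toy mapping are hypothetical stand-ins so the example runs without downloading the model.

```python
from typing import Dict, List, Sequence, Tuple


def ids_to_labels(
    tokens: Sequence[str],
    pred_ids: Sequence[int],
    id2label: Dict[int, str],
    special_tokens: Sequence[str] = ("<s>", "</s>", "<pad>"),
) -> List[Tuple[str, str]]:
    """Pair each non-special token with its predicted label name."""
    return [
        (token, id2label[pred_id])
        for token, pred_id in zip(tokens, pred_ids)
        if token not in special_tokens
    ]


# Toy values standing in for real model output (hypothetical mapping).
id2label = {0: "O", 1: "B-id_number", 2: "B-gender"}
tokens = ["<s>", "Số", "012345678901", "Nam", "</s>"]
pred_ids = [0, 0, 1, 2, 0]

pairs = ids_to_labels(tokens, pred_ids, id2label)
print(pairs)
```

With the real model, `tokens` would come from `tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())`, `pred_ids` from `predictions[0].tolist()`, and `id2label` from `model.config.id2label`.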

	## Training

	- Dataset: Synthetically generated (mockup data) with OCR noise simulation
	- Data source: Auto-generated from Jinja2 templates + random generators (no real personal data)
	- Framework: HuggingFace Transformers + Trainer API
	- Optimizer: AdamW (lr=2e-5)
	- Epochs: 5-7 (with early stopping)
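
To make the data pipeline concrete, here is a minimal sketch of template-based generation with OCR-style noise. It is not the VietNerm generator: it uses plain Python rather than Jinja2 to stay dependency-free, and the field values, the confusion table `OCR_CONFUSIONS`, and the functions `add_ocr_noise` and `make_sample` are all hypothetical.

```python
import random
from typing import List, Tuple

# Hypothetical look-alike substitutions an OCR engine might make.
OCR_CONFUSIONS = {"0": "O", "1": "l", "5": "S", "8": "B"}


def add_ocr_noise(token: str, rng: random.Random, rate: float = 0.1) -> str:
    """Swap each confusable character with probability `rate`."""
    return "".join(
        OCR_CONFUSIONS[ch] if ch in OCR_CONFUSIONS and rng.random() < rate else ch
        for ch in token
    )


def make_sample(rng: random.Random, noise_rate: float = 0.1) -> Tuple[List[str], List[str]]:
    """Build one noisy (tokens, BIO tags) pair from a fixed template."""
    # 12 random digits: a made-up identifier, not a real ID number.
    id_number = "".join(rng.choice("0123456789") for _ in range(12))
    gender = rng.choice(["Nam", "Nữ"])
    # Each entry is (text, field); field None marks a plain caption.
    template = [
        ("Số", None),
        (id_number, "id_number"),
        ("Giới tính", None),
        (gender, "gender"),
    ]
    tokens: List[str] = []
    tags: List[str] = []
    for text, field in template:
        for i, word in enumerate(text.split()):
            tokens.append(add_ocr_noise(word, rng, noise_rate))
            if field is None:
                tags.append("O")
            else:
                tags.append(("B-" if i == 0 else "I-") + field)
    return tokens, tags


rng = random.Random(0)  # fixed seed so the run is reproducible
tokens, tags = make_sample(rng)
print(list(zip(tokens, tags)))
```

Because the labels are assigned from the template before noise is applied, the tags stay correct even when the token text is corrupted, which is the main appeal of generating the data rather than annotating it.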

	## Ethical Use

	This model is built for research and development purposes only:

	- ✅ AI/NLP research
	- ✅ Document AI development
	- ✅ OCR/NER pipeline prototyping
	- ✅ Educational purposes
	- ❌ Forging documents
	- ❌ Creating fake identity papers
	- ❌ Fraud or deception

	## About VietNerm

	VietNerm is a Document AI Factory for Vietnamese documents. It provides a complete pipeline
	from template-based synthetic data generation to model training and deployment.

	- Repository: [Devhub-Solutions/VietNerm](https://github.com/Devhub-Solutions/VietNerm)
	- Training dataset: [ngocthanhdoan/vietnerm-cccd-dataset](https://huggingface.co/datasets/ngocthanhdoan/vietnerm-cccd-dataset)
	- SDK: `pip install vietnerm`
	- License: MIT — Copyright (c) 2026 Devhub Solutions