---
language: vi
tags:
  - ner
  - phobert
  - vietnamese
  - document-ai
  - cccd
  - synthetic-data
license: mit
base_model: vinai/phobert-base
---

# VietNerm - Căn cước công dân NER Model

PhoBERT-based Named Entity Recognition model for Vietnamese **Căn cước công dân** (CCCD, the citizen identity card) documents.

## ⚠️ DISCLAIMER: SYNTHETIC / MOCKUP DATA

> **This model was trained entirely on synthetic (mockup) data and does NOT use real personal data.**

- All training data is **generated automatically** by a template + generator system
- **No** real documents, real personal information, or data collected from users were used
- Identification numbers (ID, CCCD, ...) are randomly generated and designed **not to match** real data
- OCR noise is injected into the data to simulate real-world conditions
- Purpose: **AI research, Document AI, OCR/NER pipelines**
- Must **not** be used to forge documents, create fake documents, or commit fraud or deception

## Model Description

This model is fine-tuned from [`vinai/phobert-base`](https://huggingface.co/vinai/phobert-base) for token-level NER on Vietnamese Căn cước công dân (CCCD) documents. It takes the text produced by an OCR step as input and tags the tokens that belong to each structured field (name, ID number, dates, addresses).

- **Base model**: vinai/phobert-base
- **Task**: Token Classification (NER)
- **Language**: Vietnamese (vi)
- **Document type**: Căn cước công dân
- **Number of labels**: 13 (the 12 entity tags listed under Labels; the remaining one is presumably the outside tag `O`)
- **Training data**: Synthetic/Mockup (not real personal data)

## Labels

- `B-date_of_birth`
- `B-date_of_expiry`
- `B-full_name`
- `B-gender`
- `B-id_number`
- `B-nationality`
- `B-place_of_origin`
- `B-place_of_residence`
- `I-full_name`
- `I-nationality`
- `I-place_of_origin`
- `I-place_of_residence`
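
The tags above follow the usual BIO convention: `B-` opens a field and `I-` continues it. Fields without an `I-` tag (`date_of_birth`, `date_of_expiry`, `gender`, `id_number`) are expected to fit in a single token. As a minimal sketch of how such tags could be grouped back into fields, assuming a standard `O` tag for non-entity tokens (not shown in the list) and a hypothetical helper name `decode_bio` that is not part of the VietNerm SDK:

```python
from typing import List, Optional, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (field, text) pairs."""
    entities: List[Tuple[str, List[str]]] = []
    current_field: Optional[str] = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity.
            current_field = tag[2:]
            entities.append((current_field, [token]))
        elif tag.startswith("I-") and tag[2:] == current_field:
            # An I- tag extends the entity that is currently open.
            entities[-1][1].append(token)
        else:
            # "O", or an I- tag that does not continue the open entity:
            # close the entity and drop the stray token.
            current_field = None
    return [(field, " ".join(words)) for field, words in entities]


# Illustrative, made-up input (not taken from the training data).
tokens = ["Họ", "và", "tên", ":", "NGUYỄN", "VĂN", "A", "Giới", "tính", ":", "Nam"]
tags = ["O", "O", "O", "O", "B-full_name", "I-full_name", "I-full_name",
        "O", "O", "O", "B-gender"]
fields = decode_bio(tokens, tags)
print(fields)
```

For this input the sketch returns `[('full_name', 'NGUYỄN VĂN A'), ('gender', 'Nam')]`. Dropping orphan `I-` tags is one possible policy; a more lenient decoder could treat them as the start of a new entity.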

## Usage

### With VietNerm SDK

```python
from vietnerm import VietNerm

ner = VietNerm(doc_type="cccd", model_path="phatdatpq/phobert-cccd-ner")
result = ner.extract("your document text here")
print(result)
```

### With Transformers

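```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("phatdatpq/phobert-cccd-ner")
model = AutoModelForTokenClassification.from_pretrained("phatdatpq/phobert-cccd-ner")

text = "your document text here"
# PhoBERT accepts at most 256 subword tokens per sequence, so truncate longer inputs.
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

with torch.no_grad():
    outputs = model(**inputs)
    # Pick the highest-scoring class id for every subword token.
    predictions = torch.argmax(outputs.logits, dim=-1)

# Map class ids to label names and print them next to their subword tokens.
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
labels = [model.config.id2label[i] for i in predictions[0].tolist()]
for token, label in zip(tokens, labels):
    print(token, label)
```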
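
The model predicts one label per subword token, and PhoBERT's BPE tokenizer marks every non-final piece of a word with a trailing `@@`. The following is a rough sketch of how subword predictions could be merged back into word-level predictions. It assumes the label of the first piece is kept for the whole word, which is a common convention but not necessarily what the VietNerm SDK does; `merge_subwords` is a hypothetical helper.

```python
from typing import List, Optional, Tuple

# Special tokens used by the PhoBERT tokenizer; they carry no document text.
SPECIAL_TOKENS = {"<s>", "</s>", "<pad>"}


def merge_subwords(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Join '@@'-marked BPE pieces into words, keeping the first piece's label."""
    words: List[Tuple[str, str]] = []
    pending = ""                          # text of the word being assembled
    pending_label: Optional[str] = None   # label of that word's first piece
    for token, label in zip(tokens, labels):
        if token in SPECIAL_TOKENS:
            continue
        if pending_label is None:
            pending_label = label
        if token.endswith("@@"):
            # Non-final piece: strip the marker and keep assembling.
            pending += token[:-2]
        else:
            # Final piece: emit the completed word.
            words.append((pending + token, pending_label))
            pending, pending_label = "", None
    if pending_label is not None:
        # Input ended mid-word (for example after truncation).
        words.append((pending, pending_label))
    return words


# Illustrative, made-up tokenizer output with an obviously fake ID number.
tokens = ["<s>", "Số", ":", "0123@@", "4567@@", "8901", "</s>"]
labels = ["O", "O", "O", "B-id_number", "O", "O", "O"]
merged = merge_subwords(tokens, labels)
print(merged)
```

The merged `(word, label)` pairs can then be passed to whatever BIO decoding step the pipeline uses.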

## Training

- **Dataset**: Synthetically generated (mockup data) with OCR noise simulation
- **Data source**: Auto-generated from Jinja2 templates + random generators (no real personal data)
- **Framework**: HuggingFace Transformers + Trainer API
- **Optimizer**: AdamW (lr=2e-5)
- **Epochs**: 5-7 (with early stopping)
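
To illustrate the idea of template-based generation with OCR noise, here is a minimal, dependency-free sketch. It is not the VietNerm generator: plain Python lists stand in for the Jinja2 templates, the name pools and the `OCR_CONFUSIONS` table are hypothetical, and the ID is just 12 random digits with no attempt to reproduce the collision-avoidance described above.

```python
import random
from typing import List, Tuple

# Hypothetical value pools, for illustration only.
FAMILY_NAMES = ["NGUYỄN", "TRẦN", "LÊ"]
MIDDLE_NAMES = ["VĂN", "THỊ"]
GIVEN_NAMES = ["AN", "BÌNH", "CHI"]

# Hypothetical look-alike substitutions of the kind an OCR engine might make.
OCR_CONFUSIONS = {"O": "0", "0": "O", "I": "1", "1": "I", "B": "8"}


def make_sample(rng: random.Random) -> Tuple[List[str], List[str]]:
    """Fill a fixed token template with random values and matching BIO tags."""
    name = [rng.choice(FAMILY_NAMES), rng.choice(MIDDLE_NAMES), rng.choice(GIVEN_NAMES)]
    id_number = "".join(rng.choice("0123456789") for _ in range(12))
    tokens = ["Số", ":", id_number, "Họ", "và", "tên", ":"] + name
    tags = ["O", "O", "B-id_number", "O", "O", "O", "O",
            "B-full_name", "I-full_name", "I-full_name"]
    return tokens, tags


def add_ocr_noise(tokens: List[str], rng: random.Random, rate: float = 0.05) -> List[str]:
    """Swap look-alike characters with probability `rate` per eligible character.

    Substitutions are one character for one character, so the number of
    tokens is unchanged and the tags stay aligned.
    """
    noisy = []
    for token in tokens:
        chars = [
            OCR_CONFUSIONS[c] if c in OCR_CONFUSIONS and rng.random() < rate else c
            for c in token
        ]
        noisy.append("".join(chars))
    return noisy


rng = random.Random(0)  # fixed seed so the sketch is reproducible
tokens, tags = make_sample(rng)
noisy_tokens = add_ocr_noise(tokens, rng, rate=0.3)
for token, tag in zip(noisy_tokens, tags):
    print(token, tag)
```

Applying noise at the character level, after the tags are assigned, is a design choice that keeps labels aligned without re-annotation. Noise that splits or merges tokens would need the tags to be adjusted as well.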

## Ethical Use

This model is built for **research and development purposes only**:

- ✅ AI/NLP research
- ✅ Document AI development
- ✅ OCR/NER pipeline prototyping
- ✅ Educational purposes
- ❌ Forging documents
- ❌ Creating fake identity papers
- ❌ Fraud or deception

## About VietNerm

VietNerm is a Document AI Factory for Vietnamese documents. It provides a complete pipeline
from template-based synthetic data generation to model training and deployment.

- **Repository**: [Devhub-Solutions/VietNerm](https://github.com/Devhub-Solutions/VietNerm)
- **Training dataset**: [ngocthanhdoan/vietnerm-cccd-dataset](https://huggingface.co/datasets/ngocthanhdoan/vietnerm-cccd-dataset)
- **SDK**: `pip install vietnerm`
- **License**: MIT — Copyright (c) 2026 Devhub Solutions