---
language:
- vi
license: apache-2.0
tags:
- pos
- part-of-speech
- vietnamese
- crf
- nlp
- token-classification
datasets:
- undertheseanlp/UDD-v0.1
library_name: python-crfsuite
pipeline_tag: token-classification
---

# Vietnamese POS Tagger (TRE-1)

A Conditional Random Field (CRF) based part-of-speech (POS) tagger for Vietnamese, trained on the Universal Dependencies Dataset (UDD-v0.1).

## Model Description

This model is a CRF (Conditional Random Field) tagger built on handcrafted features inspired by the underthesea NLP library. It reaches roughly 94% accuracy on a held-out UDD-v0.1 test set (see Performance).

### Features

- **Architecture**: CRF (python-crfsuite)
- **Language**: Vietnamese
- **Tagset**: Universal POS tags (UPOS)
- **Training Data**: undertheseanlp/UDD-v0.1

### Feature Templates

The model uses the following feature templates:

- Current token features: word form, lowercase form, prefixes and suffixes (2 to 3 characters), character type checks
- Context features: the previous and next 1 to 2 tokens
- Bigram features: adjacent token combinations
- Dictionary features: in-vocabulary checks
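
The templates above can be sketched as a feature function of the kind python-crfsuite accepts (one dict per token, one list of dicts per sentence). This is a minimal illustration, not the model's actual extractor: the function names, the feature keys, and the `vocabulary` argument are all assumptions made for the example.

```python
def token_features(tokens, i, vocabulary=frozenset()):
    """Build an illustrative feature dict for tokens[i].

    Feature names are hypothetical; the real templates may differ.
    """
    word = tokens[i]
    lower = word.lower()
    features = {
        "bias": 1.0,
        # Current token: word form, lowercase form, affixes, character types
        "word": word,
        "word.lower": lower,
        "prefix2": word[:2],
        "prefix3": word[:3],
        "suffix2": word[-2:],
        "suffix3": word[-3:],
        "is_title": word.istitle(),
        "is_upper": word.isupper(),
        "is_digit": word.isdigit(),
        # Dictionary feature: is the lowercase form in a known word list?
        "in_vocab": lower in vocabulary,
    }
    # Context features: up to two tokens on each side, when they exist
    for offset in (-2, -1, 1, 2):
        j = i + offset
        if 0 <= j < len(tokens):
            features[f"word[{offset:+d}].lower"] = tokens[j].lower()
    # Bigram features: the current token paired with each neighbour
    if i > 0:
        features["bigram[-1,0]"] = f"{tokens[i - 1].lower()} {lower}"
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(tokens) - 1:
        features["bigram[0,+1]"] = f"{lower} {tokens[i + 1].lower()}"
    else:
        features["EOS"] = True  # end of sentence
    return features


def sentence_features(tokens, vocabulary=frozenset()):
    """Return one feature dict per token in the sentence."""
    return [token_features(tokens, i, vocabulary) for i in range(len(tokens))]
```

Dict-valued features keep the templates readable, and string, boolean, and numeric values can be mixed freely in one dict.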

## Usage

### Using the Inference API

```python
import requests

API_URL = "https://api-inference.huggingface.co/models/undertheseanlp/tre-1"
headers = {"Authorization": "Bearer YOUR_TOKEN"}  # replace YOUR_TOKEN with your access token


def query(payload):
    """POST the payload to the Inference API and return the decoded JSON body."""
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()


output = query({"inputs": "Tôi yêu Việt Nam"})
print(output)
# [{"token": "Tôi", "tag": "PRON"}, {"token": "yêu", "tag": "VERB"}, ...]
```

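### Local Usage

```python
import pycrfsuite  # fails early if python-crfsuite is not installed
from handler import EndpointHandler  # assumes handler.py is importable from the working directory

# Load the model files from the current directory
handler = EndpointHandler(path="./")

# The handler takes the same payload shape as the Inference API
result = handler({"inputs": "Tôi yêu Việt Nam"})
print(result)
```
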
## Training

The model was trained with the following hyperparameters:

- L1 regularization (c1): 1.0
- L2 regularization (c2): 1e-3
- Max iterations: 100
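
For orientation, the hyperparameters above map onto python-crfsuite's `Trainer` roughly as follows. This is a sketch under assumptions, not the original training script: the L-BFGS algorithm (CRFsuite's default, and the one that takes `c1`, `c2`, and `max_iterations`), the `train_crf` helper, and the `model.crfsuite` file name are all illustrative choices.

```python
# Hyperparameters from the list above, keyed by CRFsuite parameter name
TRAINING_PARAMS = {
    "c1": 1.0,              # L1 regularization coefficient
    "c2": 1e-3,             # L2 regularization coefficient
    "max_iterations": 100,  # upper bound on optimizer iterations
}


def train_crf(feature_seqs, tag_seqs, model_path="model.crfsuite"):
    """Train a CRF from parallel lists of feature sequences and tag sequences.

    Hypothetical helper: `model_path` is an assumed output file name.
    """
    # Imported here so the constants above load without the dependency
    import pycrfsuite

    # Assumption: L-BFGS, the CRFsuite default algorithm
    trainer = pycrfsuite.Trainer(algorithm="lbfgs", verbose=False)
    for xseq, yseq in zip(feature_seqs, tag_seqs):
        trainer.append(xseq, yseq)  # one sentence per call
    trainer.set_params(TRAINING_PARAMS)
    trainer.train(model_path)  # writes the model file to disk
    return model_path
```

Each `xseq` is a list of per-token feature dicts and each `yseq` is the matching list of UPOS tags.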

## Performance

Evaluated on a held-out test set from UDD-v0.1:

- Accuracy: ~94%
- F1 (macro): ~90%
- F1 (weighted): ~94%
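
The three metrics differ in how they average over tags. The sketch below computes them in plain Python on a toy example; `pos_metrics` is a hypothetical helper, not the evaluation code behind the figures above, and it averages only over tags present in the gold data, which may differ from other implementations.

```python
from collections import Counter


def pos_metrics(gold, pred):
    """Return (accuracy, macro F1, weighted F1) for two flat tag lists."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag gets a false positive
            fn[g] += 1  # gold tag gets a false negative
    support = Counter(gold)  # number of gold tokens per tag
    f1 = {}
    for tag in support:
        denom = 2 * tp[tag] + fp[tag] + fn[tag]
        f1[tag] = 2 * tp[tag] / denom if denom else 0.0
    accuracy = sum(tp.values()) / len(gold)
    macro = sum(f1.values()) / len(f1)  # every tag counts equally
    weighted = sum(f1[t] * support[t] for t in f1) / len(gold)  # frequent tags count more
    return accuracy, macro, weighted


gold = ["PRON", "VERB", "PROPN", "PROPN"]
pred = ["PRON", "VERB", "PROPN", "NOUN"]
acc, macro, weighted = pos_metrics(gold, pred)
print(round(acc, 3), round(macro, 3), round(weighted, 3))  # 0.75 0.889 0.833
```

Because macro F1 gives rare tags the same weight as frequent ones, it can sit below accuracy and weighted F1 when a tagger does worse on rare tags, which is one plausible reading of the gap reported above.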

## Limitations

- Requires pre-tokenized input (whitespace-separated tokens)
- Performance may vary on out-of-domain text
- Does not handle Vietnamese word segmentation
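
To illustrate the first and third points: whitespace splitting is the only tokenization described here, so a multi-syllable word reaches the tagger as separate tokens unless an upstream segmenter has already grouped it. A minimal sketch, where `whitespace_tokens` is a hypothetical helper and not part of the model's handler:

```python
def whitespace_tokens(text):
    """Split text on runs of whitespace; no word segmentation is performed."""
    return text.split()


tokens = whitespace_tokens("Tôi yêu Việt Nam")
print(tokens)  # ['Tôi', 'yêu', 'Việt', 'Nam']
```

Here "Việt Nam" is tagged as two tokens, "Việt" and "Nam", rather than as one word.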

## Citation

If you use this model, please cite:

```bibtex
@misc{tre1-pos-tagger,
  author = {undertheseanlp},
  title = {Vietnamese POS Tagger TRE-1},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/undertheseanlp/tre-1}
}
```

## License

Apache 2.0