---
language:
- vi
license: apache-2.0
tags:
- pos
- part-of-speech
- vietnamese
- crf
- nlp
- token-classification
datasets:
- undertheseanlp/UDD-v0.1
library_name: python-crfsuite
pipeline_tag: token-classification
---

# Vietnamese POS Tagger (TRE-1)

A Conditional Random Field (CRF) part-of-speech (POS) tagger for Vietnamese, trained on the Universal Dependencies Dataset (UDD-v0.1).

## Model Description

The tagger is a Conditional Random Field (CRF) model over handcrafted features inspired by the underthesea NLP library. On a held-out test set from UDD-v0.1 it reaches roughly 94% accuracy (see Performance).

### Features

- **Architecture**: CRF (python-crfsuite)
- **Language**: Vietnamese
- **Tagset**: Universal POS tags (UPOS)
- **Training Data**: undertheseanlp/UDD-v0.1

### Feature Templates

The model uses the following feature templates:
- Current token features: word form, lowercase, prefix/suffix (2-3 chars), character type checks
- Context features: previous and next 1-2 tokens
- Bigram features: adjacent token combinations
- Dictionary features: in-vocabulary checks
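
The templates above can be sketched as a per-token feature function. This is a minimal illustration, not the model's actual extractor: the function names, feature keys, and the `vocab` argument are all assumptions made for the example.

```python
def token_features(tokens, i, vocab=frozenset()):
    """Build a feature dict for tokens[i].

    All feature names here are hypothetical; the released model may use
    different keys and additional templates.
    """
    word = tokens[i]
    lower = word.lower()
    feats = {
        "bias": 1.0,
        # Current token: word form, lowercase, 2-3 char prefix/suffix
        "word": word,
        "lower": lower,
        "prefix2": word[:2],
        "prefix3": word[:3],
        "suffix2": word[-2:],
        "suffix3": word[-3:],
        # Character type checks
        "is_digit": word.isdigit(),
        "is_title": word.istitle(),
        "is_upper": word.isupper(),
        # Dictionary feature: is the lowercased token in a known vocabulary?
        "in_vocab": lower in vocab,
    }
    # Context features: previous and next 1-2 tokens, when they exist
    for offset in (-2, -1, 1, 2):
        j = i + offset
        if 0 <= j < len(tokens):
            feats[f"word[{offset:+d}]"] = tokens[j].lower()
    # Sentence boundary markers
    if i == 0:
        feats["BOS"] = True
    if i == len(tokens) - 1:
        feats["EOS"] = True
    # Bigram features: current token combined with each neighbour
    if i > 0:
        feats["bigram[-1,0]"] = tokens[i - 1].lower() + "|" + lower
    if i < len(tokens) - 1:
        feats["bigram[0,+1]"] = lower + "|" + tokens[i + 1].lower()
    return feats


def sentence_features(tokens, vocab=frozenset()):
    """Return one feature dict per token, the shape python-crfsuite accepts."""
    return [token_features(tokens, i, vocab) for i in range(len(tokens))]
```

Each sentence becomes a list of dicts with string, boolean, or float values, which python-crfsuite can consume directly as an item sequence.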

## Usage

### Using the Inference API

```python
import requests

API_URL = "https://api-inference.huggingface.co/models/undertheseanlp/tre-1"
headers = {"Authorization": "Bearer YOUR_TOKEN"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({"inputs": "Tôi yêu Việt Nam"})
print(output)
# [{"token": "Tôi", "tag": "PRON"}, {"token": "yêu", "tag": "VERB"}, ...]
```

### Local Usage

```python
import pycrfsuite
from handler import EndpointHandler

handler = EndpointHandler(path="./")
result = handler({"inputs": "Tôi yêu Việt Nam"})
print(result)
```

## Training

The model was trained using:
- L1 regularization (c1): 1.0
- L2 regularization (c2): 1e-3
- Max iterations: 100
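
As a rough sketch of how these hyperparameters map onto python-crfsuite, the snippet below passes them to a `pycrfsuite.Trainer`. It assumes the library's default L-BFGS training algorithm, which is the one that accepts both `c1` and `c2`; the `train_crf` helper, the `CRF_PARAMS` name, and the `model.crfsuite` output path are hypothetical and not taken from the original training script.

```python
# Hyperparameters listed above, in python-crfsuite's parameter names.
CRF_PARAMS = {
    "c1": 1.0,              # L1 regularization coefficient
    "c2": 1e-3,             # L2 regularization coefficient
    "max_iterations": 100,  # upper bound on optimizer iterations
}


def train_crf(train_sents, model_path="model.crfsuite", params=CRF_PARAMS):
    """Train a CRF model and write it to model_path (hypothetical helper).

    train_sents: iterable of (xseq, yseq) pairs, where xseq is a list of
    per-token feature dicts and yseq is the matching list of POS tags.
    """
    # Imported lazily so the parameters can be inspected without the library.
    import pycrfsuite

    trainer = pycrfsuite.Trainer(verbose=False)
    for xseq, yseq in train_sents:
        trainer.append(xseq, yseq)
    trainer.set_params(params)
    trainer.train(model_path)
    return model_path
```

The resulting file can then be loaded for inference with `pycrfsuite.Tagger().open(model_path)`.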

## Performance

Evaluated on a held-out test set from UDD-v0.1:
- Accuracy: ~94%
- F1 (macro): ~90%
- F1 (weighted): ~94%
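
The gap between macro and weighted F1 comes from how per-tag scores are averaged: macro F1 gives every tag equal weight, while weighted F1 scales each tag by its frequency in the gold data, so rare tags pull the macro score down more. The sketch below shows the three metrics in plain Python over flat lists of gold and predicted tags; `pos_metrics` is an illustrative helper, not the evaluation script used for the numbers above.

```python
from collections import Counter


def pos_metrics(gold, pred):
    """Compute token accuracy, macro F1, and weighted F1 for tag sequences."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag p where it was wrong
            fn[g] += 1  # missed the gold tag g

    labels = sorted(set(gold) | set(pred))
    support = Counter(gold)  # number of gold tokens per tag

    f1 = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1[label] = 2 * tp[label] / denom if denom else 0.0

    return {
        "accuracy": sum(tp.values()) / len(gold),
        # Unweighted mean over tags: rare tags count as much as frequent ones
        "f1_macro": sum(f1.values()) / len(labels),
        # Mean over tags weighted by gold frequency
        "f1_weighted": sum(f1[l] * support[l] for l in labels) / len(gold),
    }
```

For sentence-level data, flatten the per-sentence tag lists into two parallel lists before calling the helper.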

## Limitations

- Requires pre-tokenized input (whitespace-separated tokens); the model does not perform Vietnamese word segmentation itself
- Performance may vary on out-of-domain text

## Citation

If you use this model, please cite:

```bibtex
@misc{tre1-pos-tagger,
  author = {undertheseanlp},
  title = {Vietnamese POS Tagger TRE-1},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/undertheseanlp/tre-1}
}
```

## License

Apache 2.0