	---
	language:
	- vi
	license: apache-2.0
	tags:
	- pos
	- part-of-speech
	- vietnamese
	- crf
	- nlp
	- token-classification
	datasets:
	- undertheseanlp/UDD-v0.1
	library_name: python-crfsuite
	pipeline_tag: token-classification
	---

	# Vietnamese POS Tagger (TRE-1)

	A Conditional Random Field (CRF) based Part-of-Speech tagger for Vietnamese, trained on the Universal Dependencies Dataset (UDD-v0.1).

	## Model Description

	This model is a Conditional Random Field tagger built with python-crfsuite, using handcrafted features inspired by the underthesea NLP library. On a held-out test set from UDD-v0.1 it reaches roughly 94% accuracy (see Performance).

	### Features

	- Architecture: CRF (python-crfsuite)
	- Language: Vietnamese
	- Tagset: Universal POS tags (UPOS)
	- Training Data: undertheseanlp/UDD-v0.1

	### Feature Templates

	The model uses the following feature templates:
	- Current token features: word form, lowercase, prefix/suffix (2-3 chars), character type checks
	- Context features: previous and next 1-2 tokens
	- Bigram features: adjacent token combinations
	- Dictionary features: in-vocabulary checks
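The templates above can be sketched as a feature function that turns each token into a dictionary. This is only an illustration under stated assumptions: the function names, the feature keys, and the `vocabulary` argument are hypothetical and are not taken from the model's actual feature extractor.

```python
def token_features(tokens, i, vocabulary=frozenset()):
    """Build an illustrative feature dict for tokens[i].

    Feature names are hypothetical; the released model may use different ones.
    """
    word = tokens[i]
    lower = word.lower()
    features = {
        "bias": 1.0,
        # Current token: word form, lowercase, 2-3 character prefixes/suffixes.
        "word": word,
        "lower": lower,
        "prefix2": word[:2],
        "prefix3": word[:3],
        "suffix2": word[-2:],
        "suffix3": word[-3:],
        # Character type checks.
        "is_digit": word.isdigit(),
        "is_title": word.istitle(),
        "is_upper": word.isupper(),
        # Dictionary feature: is the lowercased token in a known vocabulary?
        "in_vocab": lower in vocabulary,
    }
    # Context features: up to two tokens on each side of the current token.
    for offset in (-2, -1, 1, 2):
        j = i + offset
        if 0 <= j < len(tokens):
            features[f"word[{offset:+d}]"] = tokens[j].lower()
    # Bigram features: the current token paired with each adjacent token.
    if i > 0:
        features["bigram[-1,0]"] = f"{tokens[i - 1].lower()} {lower}"
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(tokens) - 1:
        features["bigram[0,+1]"] = f"{lower} {tokens[i + 1].lower()}"
    else:
        features["EOS"] = True  # end of sentence
    return features


def sentence_features(tokens, vocabulary=frozenset()):
    """Feature dicts for every token in a pre-tokenized sentence."""
    return [token_features(tokens, i, vocabulary) for i in range(len(tokens))]
```

A list of such dictionaries, one per token, is a format that python-crfsuite accepts as an item sequence.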

	## Usage

	### Using the Inference API

	```python
	import requests

	API_URL = "https://api-inference.huggingface.co/models/undertheseanlp/tre-1"
	headers = {"Authorization": "Bearer YOUR_TOKEN"}

	def query(payload):
	    # POST the input text to the hosted model and return the parsed JSON reply.
	    response = requests.post(API_URL, headers=headers, json=payload)
	    return response.json()

	output = query({"inputs": "Tôi yêu Việt Nam"})
	print(output)
	# [{"token": "Tôi", "tag": "PRON"}, {"token": "yêu", "tag": "VERB"}, ...]
	```

	### Local Usage

	```python
	import pycrfsuite  # provided by the python-crfsuite package
	from handler import EndpointHandler  # assumes handler.py from this repository is importable

	handler = EndpointHandler(path="./")
	result = handler({"inputs": "Tôi yêu Việt Nam"})
	print(result)
	```

	## Training

	The model was trained using:
	- L1 regularization (c1): 1.0
	- L2 regularization (c2): 1e-3
	- Max iterations: 100
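As a rough sketch, these hyperparameters map onto the python-crfsuite `Trainer` as shown below. The names `TRAINING_PARAMS` and `train_crf` and the output path are hypothetical, and the sketch assumes CRFsuite's default L-BFGS algorithm, which is the trainer that takes `c1`, `c2`, and `max_iterations`; the original training script may differ.

```python
# Hyperparameters listed above, keyed by their CRFsuite parameter names.
TRAINING_PARAMS = {
    "c1": 1.0,              # L1 regularization coefficient
    "c2": 1e-3,             # L2 regularization coefficient
    "max_iterations": 100,  # upper bound on L-BFGS iterations
}


def train_crf(feature_sequences, tag_sequences, model_path):
    """Train a CRF and write it to model_path (hypothetical helper).

    feature_sequences: one list of per-token feature dicts per sentence.
    tag_sequences: one list of UPOS tags per sentence, aligned with the features.
    """
    import pycrfsuite  # imported here so the constants above load without it

    trainer = pycrfsuite.Trainer(verbose=False)
    trainer.set_params(TRAINING_PARAMS)
    for xseq, yseq in zip(feature_sequences, tag_sequences):
        trainer.append(xseq, yseq)
    trainer.train(model_path)
```

Calling `train_crf(features, tags, "model.crfsuite")` would write a model file that `pycrfsuite.Tagger` can open; the file name here is a placeholder, not the name used in this repository.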

	## Performance

	Evaluated on a held-out test set from UDD-v0.1:
	- Accuracy: ~94%
	- F1 (macro): ~90%
	- F1 (weighted): ~94%
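The gap between macro and weighted F1 comes from how per-tag scores are averaged: macro F1 weights every tag equally, while weighted F1 weights each tag by its frequency in the gold data, so rare tags with lower scores pull the macro figure down more. The following is a minimal sketch of the three metrics on toy data. The function name `tagging_metrics` is hypothetical, the averages run over tags that occur in the gold data, and this is not the evaluation script used for the numbers above.

```python
from collections import Counter


def tagging_metrics(gold, pred):
    """Return (accuracy, macro F1, weighted F1) for two aligned, non-empty tag lists."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag was wrong
            fn[g] += 1  # gold tag was missed
    support = Counter(gold)  # number of gold tokens per tag
    # Per-tag F1 = 2*TP / (2*TP + FP + FN); support > 0 keeps the denominator positive.
    f1 = {t: 2 * tp[t] / (2 * tp[t] + fp[t] + fn[t]) for t in support}
    accuracy = sum(tp.values()) / len(gold)
    macro_f1 = sum(f1.values()) / len(f1)
    weighted_f1 = sum(f1[t] * support[t] for t in f1) / len(gold)
    return accuracy, macro_f1, weighted_f1


# Toy example: one of the two PROPN tokens is mis-tagged as NOUN.
gold_tags = ["PRON", "VERB", "PROPN", "PROPN"]
pred_tags = ["PRON", "VERB", "PROPN", "NOUN"]
accuracy, macro_f1, weighted_f1 = tagging_metrics(gold_tags, pred_tags)
```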

	## Limitations

	- Requires pre-tokenized input (whitespace-separated tokens)
	- Performance may vary on out-of-domain text
	- Does not handle Vietnamese word segmentation

	## Citation

	If you use this model, please cite:

	```bibtex
	@misc{tre1-pos-tagger,
	  author = {undertheseanlp},
	  title = {Vietnamese POS Tagger TRE-1},
	  year = {2025},
	  publisher = {Hugging Face},
	  url = {https://huggingface.co/undertheseanlp/tre-1}
	}
	```

	## License

	Apache 2.0