---
language:
- vi
license: apache-2.0
tags:
- pos
- part-of-speech
- vietnamese
- crf
- nlp
- token-classification
datasets:
- undertheseanlp/UDD-v0.1
library_name: python-crfsuite
pipeline_tag: token-classification
---

# Vietnamese POS Tagger (TRE-1)

A Conditional Random Field (CRF) based part-of-speech tagger for Vietnamese, trained on the Universal Dependencies Dataset (UDD-v0.1).

## Model Description

This model uses a CRF with handcrafted features inspired by the underthesea NLP library. It reaches about 94% accuracy on a held-out UDD-v0.1 test set (see Performance).

### Features

- **Architecture**: CRF (python-crfsuite)
- **Language**: Vietnamese
- **Tagset**: Universal POS tags (UPOS)
- **Training Data**: undertheseanlp/UDD-v0.1

### Feature Templates

The model uses the following feature templates:

- Current token features: word form, lowercase, prefix/suffix (2-3 chars), character type checks
- Context features: previous and next 1-2 tokens
- Bigram features: adjacent token combinations
- Dictionary features: in-vocabulary checks

## Usage

### Using the Inference API

```python
import requests

API_URL = "https://api-inference.huggingface.co/models/undertheseanlp/tre-1"
# Replace YOUR_TOKEN with your own Hugging Face access token.
headers = {"Authorization": "Bearer YOUR_TOKEN"}


def query(payload):
    """POST the payload to the Inference API and return the decoded JSON."""
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()


output = query({"inputs": "Tôi yêu Việt Nam"})
print(output)
# [{"token": "Tôi", "tag": "PRON"}, {"token": "yêu", "tag": "VERB"}, ...]
```

### Local Usage

```python
import pycrfsuite  # provided by the python-crfsuite package
from handler import EndpointHandler

# path points at the directory that holds the model files.
handler = EndpointHandler(path="./")
result = handler({"inputs": "Tôi yêu Việt Nam"})
print(result)
```

## Training

The model was trained with the following settings:

- L1 regularization (c1): 1.0
- L2 regularization (c2): 1e-3
- Max iterations: 100

## Performance

Evaluated on a held-out test set from UDD-v0.1:

- Accuracy: ~94%
- F1 (macro): ~90%
- F1 (weighted): ~94%

## Limitations

- Requires pre-tokenized input (whitespace-separated tokens)
- Performance may vary on out-of-domain text
- Does not handle Vietnamese word segmentation

## Citation

If you use this model, please cite:

```bibtex
@misc{tre1-pos-tagger,
  author = {undertheseanlp},
  title = {Vietnamese POS Tagger TRE-1},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/undertheseanlp/tre-1}
}
```

## License

Apache 2.0
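
## Example: Feature Template Sketch

The feature templates listed under Feature Templates can be illustrated with a small extractor. This is a minimal sketch and not the model's actual feature code: the function name `token_features`, the feature keys, the padding markers, and the optional `vocab` set are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Set, Union

FeatureValue = Union[str, bool, float]


def token_features(
    tokens: List[str], i: int, vocab: Optional[Set[str]] = None
) -> Dict[str, FeatureValue]:
    """Build a feature dict for tokens[i]. All feature names are hypothetical."""
    word = tokens[i]

    # Current token: word form, lowercase form, character type checks.
    feats: Dict[str, FeatureValue] = {
        "bias": 1.0,
        "w": word,
        "w.lower": word.lower(),
        "w.istitle": word.istitle(),
        "w.isupper": word.isupper(),
        "w.isdigit": word.isdigit(),
    }

    # Prefixes and suffixes of 2 and 3 characters.
    for n in (2, 3):
        feats[f"prefix{n}"] = word[:n]
        feats[f"suffix{n}"] = word[-n:]

    # Context: up to two tokens on each side, with a marker at sentence edges.
    for offset in (-2, -1, 1, 2):
        j = i + offset
        if 0 <= j < len(tokens):
            feats[f"w[{offset:+d}].lower"] = tokens[j].lower()
        else:
            feats[f"pad[{offset:+d}]"] = True

    # Bigrams: the current token joined with each direct neighbour.
    if i > 0:
        feats["w[-1]|w"] = f"{tokens[i - 1].lower()}|{word.lower()}"
    if i < len(tokens) - 1:
        feats["w|w[+1]"] = f"{word.lower()}|{tokens[i + 1].lower()}"

    # Dictionary: in-vocabulary check (only if a vocabulary is supplied).
    if vocab is not None:
        feats["in_vocab"] = word.lower() in vocab

    return feats


def sentence_features(tokens: List[str], vocab: Optional[Set[str]] = None):
    """Feature dicts for every token in a pre-tokenized sentence."""
    return [token_features(tokens, i, vocab) for i in range(len(tokens))]
```

Calling `sentence_features("Tôi yêu Việt Nam".split())` yields one dict per whitespace-separated token, which is the per-token input shape that python-crfsuite accepts.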
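
## Example: Training Configuration Sketch

The settings in the Training section map onto python-crfsuite's `Trainer` parameters roughly as follows. This is a sketch under assumptions: the function name `train_crf`, the output path `model.crfsuite`, and the shape of `X` and `y` (one list of feature dicts and one list of UPOS tags per sentence) are illustrative and not taken from the model's training script. It relies on python-crfsuite's default training algorithm (L-BFGS), which accepts `c1`, `c2`, and `max_iterations`.

```python
from typing import Dict, List

# Hyperparameters as stated in the Training section.
CRF_PARAMS = {
    "c1": 1.0,              # L1 regularization
    "c2": 1e-3,             # L2 regularization
    "max_iterations": 100,  # upper bound on training iterations
}


def train_crf(
    X: List[List[Dict[str, object]]],
    y: List[List[str]],
    model_path: str = "model.crfsuite",  # hypothetical output path
) -> None:
    """Train a CRF on per-sentence feature sequences X and tag sequences y."""
    # Imported here so this module can be loaded without python-crfsuite.
    import pycrfsuite

    trainer = pycrfsuite.Trainer(verbose=False)
    for xseq, yseq in zip(X, y):
        trainer.append(xseq, yseq)  # one sentence at a time
    trainer.set_params(CRF_PARAMS)
    trainer.train(model_path)       # writes the model file to model_path
```

A model trained this way can be loaded with `pycrfsuite.Tagger()`, `tagger.open(model_path)`, and `tagger.tag(xseq)`.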
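
## Example: Computing the Reported Metrics

The three figures in the Performance section (accuracy, macro F1, weighted F1) can be computed from flat lists of gold and predicted tags. The sketch below is one conventional way to do it and is not the evaluation script used for this model; `tagging_scores` is a hypothetical helper. Macro F1 averages per-tag F1 equally over every tag seen in either list, while weighted F1 weights each tag by its gold frequency.

```python
from collections import Counter
from typing import Dict, List


def tagging_scores(gold: List[str], pred: List[str]) -> Dict[str, float]:
    """Token-level accuracy, macro F1, and weighted F1 for two tag lists."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("gold and pred must not be empty")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag p was wrong
            fn[g] += 1  # gold tag g was missed

    support = Counter(gold)
    labels = sorted(set(gold) | set(pred))

    # Per-tag F1 = 2*TP / (2*TP + FP + FN), defined as 0 when undefined.
    f1 = {}
    for tag in labels:
        denom = 2 * tp[tag] + fp[tag] + fn[tag]
        f1[tag] = 2 * tp[tag] / denom if denom else 0.0

    return {
        "accuracy": sum(tp.values()) / len(gold),
        "f1_macro": sum(f1.values()) / len(labels),
        "f1_weighted": sum(f1[t] * support[t] for t in labels) / len(gold),
    }
```

Because macro F1 gives rare tags the same weight as frequent ones, it drops below weighted F1 whenever infrequent tags are tagged less reliably. That is one plausible reading of the gap between the ~90% and ~94% figures above.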