metadata
title: 'Underthesea: Vietnamese NLP Toolkit'
type: resource
url: https://github.com/undertheseanlp/underthesea
Overview
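Underthesea is an open-source Python library for Vietnamese natural language processing. It covers tasks ranging from sentence segmentation and word tokenization to dependency parsing, text classification, and sentiment analysis.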
Key Information
| Field | Value |
|---|---|
| Website | https://undertheseanlp.com/ |
| GitHub | https://github.com/undertheseanlp/underthesea |
| Documentation | https://undertheseanlp.github.io/underthesea/ |
| PyPI | https://pypi.org/project/underthesea/ |
| License | GPL-3.0 |
| Language | Python 3.6+ |
Features
Core NLP Tasks
| Task | Description |
|---|---|
| Sentence Segmentation | Split text into sentences |
| Text Normalization | Normalize Vietnamese text |
| Word Tokenization | Segment text into words, keeping multi-syllable compounds together |
| POS Tagging | Part-of-speech tagging |
| Chunking | Phrase grouping |
| NER | Named Entity Recognition |
| Dependency Parsing | Syntactic dependency analysis |
| Text Classification | Document classification |
| Sentiment Analysis | Sentiment detection |
Advanced Features
- Machine Translation (Vietnamese ↔ English)
- Language Detection
- Text-to-Speech
- Conversational AI Agent
Installation
# Basic installation
pip install underthesea
# With deep learning support
pip install "underthesea[deep]"
# With text-to-speech
pip install "underthesea[voice]"
# With AI agent
pip install "underthesea[agent]"
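After installing, it can help to confirm that the package is visible to the interpreter you intend to use. The following is a minimal sketch using only the standard library; `package_status` is a hypothetical helper name, and the sketch assumes Python 3.8 or later for `importlib.metadata`. It checks the base package only, not whether any optional extras are present.

```python
from importlib import metadata
from importlib.util import find_spec


def package_status(name: str) -> str:
    """Return a short description of whether `name` can be imported."""
    # find_spec locates a top-level module without importing it.
    if find_spec(name) is None:
        return f"{name}: not installed"
    try:
        version = metadata.version(name)
    except metadata.PackageNotFoundError:
        # Importable, but no distribution metadata under this name
        # (for example, a standard-library module).
        version = "unknown version"
    return f"{name}: installed ({version})"


print(package_status("underthesea"))
```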
Usage Examples
Word Segmentation
from underthesea import word_tokenize
text = "Chàng trai 9X Quảng Trị khởi nghiệp từ nấm sò"
tokens = word_tokenize(text)
# ['Chàng trai', '9X', 'Quảng Trị', 'khởi nghiệp', 'từ', 'nấm sò']
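Because a single token can span several syllables, downstream tools that split on whitespace would break compounds such as `Chàng trai` apart. A common convention is to join the syllables of each token with underscores. The sketch below does this in plain Python on the token list shown above; `join_tokens` is a hypothetical helper, not part of the library. The project README also shows a `format="text"` argument to `word_tokenize` that produces a similar underscore-joined string, so this helper is mainly useful when you already hold a token list.

```python
from typing import List


def join_tokens(tokens: List[str]) -> str:
    """Join word tokens into one string, linking the syllables of each token with '_'."""
    return " ".join(token.replace(" ", "_") for token in tokens)


# Token list copied from the example above.
tokens = ['Chàng trai', '9X', 'Quảng Trị', 'khởi nghiệp', 'từ', 'nấm sò']
print(join_tokens(tokens))
# Chàng_trai 9X Quảng_Trị khởi_nghiệp từ nấm_sò
```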
POS Tagging
from underthesea import pos_tag
text = "Tôi yêu Việt Nam"
tagged = pos_tag(text)
# [('Tôi', 'P'), ('yêu', 'V'), ('Việt Nam', 'Np')]
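The `(word, tag)` pairs are easy to filter. As an illustration, the sketch below keeps only the words whose tag is in a chosen set. `words_with_tags` is a hypothetical helper, and the input is the tagged list from the example above rather than a fresh call to the library.

```python
from typing import Iterable, List, Set, Tuple


def words_with_tags(tagged: Iterable[Tuple[str, str]], wanted: Set[str]) -> List[str]:
    """Return the words whose POS tag is in `wanted`, preserving order."""
    return [word for word, tag in tagged if tag in wanted]


# Tagged output copied from the example above.
tagged = [('Tôi', 'P'), ('yêu', 'V'), ('Việt Nam', 'Np')]
print(words_with_tags(tagged, {'N', 'Np'}))
# ['Việt Nam']
```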
Named Entity Recognition
from underthesea import ner
text = "Bộ Công Thương xóa một tổng cục"
entities = ner(text)
# Returns one (word, pos_tag, chunk_tag, ner_tag) tuple per token;
# tokens that belong to an organization carry tags such as 'B-ORG'
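NER tags follow the BIO scheme: `B-` starts an entity, `I-` continues it, and `O` marks tokens outside any entity. To obtain whole entities, consecutive tagged tokens need to be merged. The sketch below shows one way to do that. `group_entities` is a hypothetical helper, and the sample input is hand-written for illustration, not actual model output. It assumes only that the word is the first element of each tuple and the NER tag is the last.

```python
from typing import List, Optional, Sequence, Tuple


def group_entities(tagged: Sequence[Sequence[str]]) -> List[Tuple[str, str]]:
    """Merge BIO-tagged tokens into (entity text, entity type) pairs."""
    entities: List[Tuple[str, str]] = []
    words: List[str] = []
    label: Optional[str] = None
    for item in tagged:
        word, tag = item[0], item[-1]  # word first, NER tag last (assumption)
        starts_new = tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label)
        if starts_new:
            if words:
                entities.append((" ".join(words), label))
            words, label = [word], tag[2:]
        elif tag.startswith("I-"):
            words.append(word)  # continues the current entity
        else:
            if words:
                entities.append((" ".join(words), label))
            words, label = [], None
    if words:
        entities.append((" ".join(words), label))
    return entities


# Hand-written sample in (word, NER tag) form, for illustration only.
sample = [("Bộ", "B-ORG"), ("Công Thương", "I-ORG"), ("xóa", "O"), ("một", "O"), ("tổng cục", "O")]
print(group_entities(sample))
# [('Bộ Công Thương', 'ORG')]
```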
Text Classification
from underthesea import classify
text = "HLV đầu tiên ở Premier League bị sa thải"
category = classify(text)
# A sports label such as 'The thao'; the exact label format depends on the installed version
POS Tag Set
The `pos_tag` output uses a Vietnamese-specific tag set. Common tags include:
| Tag | Description |
|---|---|
| N | Noun |
| V | Verb |
| A | Adjective |
| P | Pronoun |
| Np | Proper noun |
| E | Preposition |
| C | Conjunction |
| R | Adverb |
| M | Numeral |
| L | Determiner |
| T | Particle |
| X | Unknown |
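For readable output, the table above can be turned into a lookup dictionary. The sketch below annotates `(word, tag)` pairs with their descriptions. `POS_TAG_DESCRIPTIONS` and `describe_tags` are hypothetical names, and the dictionary contains only the tags listed in this table, so any other tag the tagger emits is reported as unlisted rather than guessed.

```python
from typing import Iterable, List, Tuple

# Descriptions copied from the table above; other tags may exist.
POS_TAG_DESCRIPTIONS = {
    "N": "Noun",
    "V": "Verb",
    "A": "Adjective",
    "P": "Pronoun",
    "Np": "Proper noun",
    "E": "Preposition",
    "C": "Conjunction",
    "R": "Adverb",
    "M": "Numeral",
    "L": "Determiner",
    "T": "Particle",
    "X": "Unknown",
}


def describe_tags(tagged: Iterable[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
    """Attach a human-readable description to each (word, tag) pair."""
    return [
        (word, tag, POS_TAG_DESCRIPTIONS.get(tag, f"Unlisted tag: {tag}"))
        for word, tag in tagged
    ]


# Tagged output copied from the POS tagging example.
tagged = [('Tôi', 'P'), ('yêu', 'V'), ('Việt Nam', 'Np')]
for word, tag, description in describe_tags(tagged):
    print(f"{word}\t{tag}\t{description}")
```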
Related Projects
- underthesea-core - Core algorithms
- NLP-Vietnamese-progress - Vietnamese NLP benchmarks
Citation
@misc{underthesea,
author = {Underthesea Team},
title = {Underthesea: Vietnamese NLP Toolkit},
year = {2018},
publisher = {GitHub},
url = {https://github.com/undertheseanlp/underthesea}
}