title: 'Underthesea: Vietnamese NLP Toolkit'
type: resource
url: https://github.com/undertheseanlp/underthesea

Overview

Underthesea is an open-source Python library providing a suite of tools for Vietnamese natural language processing.

Key Information

Features

Core NLP Tasks

| Task | Description |
| --- | --- |
| Sentence Segmentation | Split text into sentences |
| Text Normalization | Normalize Vietnamese text |
| Word Tokenization | Segment words (with compound word support) |
| POS Tagging | Part-of-speech tagging |
| Chunking | Phrase grouping |
| NER | Named Entity Recognition |
| Dependency Parsing | Syntactic dependency analysis |
| Text Classification | Document classification |
| Sentiment Analysis | Sentiment detection |

Advanced Features

  • Machine Translation (Vietnamese ↔ English)
  • Language Detection
  • Text-to-Speech
  • Conversational AI Agent

Installation

# Basic installation
pip install underthesea

# With deep learning support
pip install "underthesea[deep]"

# With text-to-speech
pip install "underthesea[voice]"

# With AI agent
pip install "underthesea[agent]"
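
Before running the examples, it can help to confirm that the package is importable in the current environment. The following is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper name, not part of Underthesea, and it only checks the top-level package, not which optional extras were installed.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


# Prints True once `pip install underthesea` has succeeded in this environment.
print(is_installed("underthesea"))
```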

Usage Examples

Word Segmentation

from underthesea import word_tokenize

text = "Chàng trai 9X Quảng Trị khởi nghiệp từ nấm sò"
tokens = word_tokenize(text)
# ['Chàng trai', '9X', 'Quảng Trị', 'khởi nghiệp', 'từ', 'nấm', 'sò']
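
Segmented Vietnamese text is often written with the syllables of each word joined by underscores. As a rough sketch, a token list like the one `word_tokenize` returns can be converted to that form in plain Python. `to_underscore_text` is a hypothetical helper, and the token list below is hand-written for illustration rather than captured library output.

```python
def to_underscore_text(tokens):
    """Join tokens with spaces, linking the syllables inside each token with '_'."""
    return " ".join(token.replace(" ", "_") for token in tokens)


# Hand-written token list (assumption), shaped like a word_tokenize result.
tokens = ["Tôi", "yêu", "Việt Nam"]
print(to_underscore_text(tokens))  # Tôi yêu Việt_Nam
```

The library's own `word_tokenize(text, format="text")` option produces this style directly; the helper is only useful when a token list is already in hand.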

POS Tagging

from underthesea import pos_tag

text = "Tôi yêu Việt Nam"
tagged = pos_tag(text)
# [('Tôi', 'P'), ('yêu', 'V'), ('Việt Nam', 'Np')]
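
A common next step is to group words by their tag, for example to pull out all proper nouns. The sketch below assumes the `(word, tag)` pair format shown above; `group_by_tag` is a hypothetical helper and the input is copied from the example rather than produced by calling the library.

```python
from collections import defaultdict


def group_by_tag(tagged):
    """Map each POS tag to the list of words carrying it, in order of appearance."""
    groups = defaultdict(list)
    for word, tag in tagged:
        groups[tag].append(word)
    return dict(groups)


# Pairs copied from the POS tagging example above.
tagged = [("Tôi", "P"), ("yêu", "V"), ("Việt Nam", "Np")]
groups = group_by_tag(tagged)
print(groups.get("Np", []))  # ['Việt Nam']
```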

Named Entity Recognition

from underthesea import ner

text = "Bộ Công Thương xóa một tổng cục"
entities = ner(text)
# List of (word, POS tag, chunk tag, NER tag) tuples; the NER tag uses BIO labels such as 'B-ORG'
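
BIO labels mark the beginning (`B-`) and continuation (`I-`) of each entity, so downstream code usually merges them into whole spans. The following is a minimal sketch of that merge, assuming each item is a tuple whose first element is the token and whose last element is the BIO label. `bio_to_spans` is a hypothetical helper, and the sample input is hand-written, not real `ner` output.

```python
def bio_to_spans(tagged):
    """Merge BIO-labelled tokens into (entity_text, entity_type) spans.

    An I- label that does not continue an open entity of the same type
    is treated as outside any entity.
    """
    spans = []
    words, entity_type = [], None
    for item in tagged:
        word, label = item[0], item[-1]
        if label.startswith("B-"):
            # A new entity starts: close the open one, if any.
            if words:
                spans.append((" ".join(words), entity_type))
            words, entity_type = [word], label[2:]
        elif label.startswith("I-") and entity_type == label[2:]:
            words.append(word)
        else:
            # 'O' or an unmatched I- label ends the open entity.
            if words:
                spans.append((" ".join(words), entity_type))
            words, entity_type = [], None
    if words:
        spans.append((" ".join(words), entity_type))
    return spans


# Hand-written illustration (assumption), one syllable per token.
sample = [("Bộ", "B-ORG"), ("Công", "I-ORG"), ("Thương", "I-ORG"),
          ("xóa", "O"), ("một", "O"), ("tổng cục", "O")]
print(bio_to_spans(sample))  # [('Bộ Công Thương', 'ORG')]
```

Because only the first and last tuple elements are read, the same function works on two-element pairs and on longer tuples that carry extra tags in between.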

Text Classification

from underthesea import classify

text = "HLV đầu tiên ở Premier League bị sa thải"
category = classify(text)
# e.g. ['The thao'] (sports), returned as a list of labels

POS Tag Set

Underthesea uses Vietnamese-specific POS tags:

| Tag | Description |
| --- | --- |
| N | Noun |
| V | Verb |
| A | Adjective |
| P | Pronoun |
| Np | Proper noun |
| E | Preposition |
| C | Conjunction |
| R | Adverb |
| M | Numeral |
| L | Determiner |
| T | Particle |
| X | Unknown |
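
The tag table can be kept as a small lookup so that tagged output is easier to read. This is a sketch only: `POS_TAGS` and `describe` are hypothetical names, the dictionary covers just the tags listed above, and any other tag the tagger emits falls back to a placeholder.

```python
# Tag descriptions copied from the table above.
POS_TAGS = {
    "N": "Noun", "V": "Verb", "A": "Adjective", "P": "Pronoun",
    "Np": "Proper noun", "E": "Preposition", "C": "Conjunction",
    "R": "Adverb", "M": "Numeral", "L": "Determiner",
    "T": "Particle", "X": "Unknown",
}


def describe(tagged):
    """Attach a human-readable description to each (word, tag) pair."""
    return [(word, tag, POS_TAGS.get(tag, "(not in table)")) for word, tag in tagged]


for word, tag, meaning in describe([("Tôi", "P"), ("yêu", "V"), ("Việt Nam", "Np")]):
    print(f"{word}\t{tag}\t{meaning}")
```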

Related Projects

Citation

@misc{underthesea,
  author = {Underthesea Team},
  title = {Underthesea: Vietnamese NLP Toolkit},
  year = {2018},
  publisher = {GitHub},
  url = {https://github.com/undertheseanlp/underthesea}
}