Update README.md
README.md
CHANGED
@@ -7,54 +7,52 @@ sdk: static
 pinned: false
 ---

-# 📦

-**
-Resources (datasets, models) are stored and managed directly on [Hugging Face Hub](https://huggingface.co/visolex).

 ---

-

-

-
-* **Tone normalization**: normalize Vietnamese tone marks.
-* **Basic preprocessing**: remove extra whitespace and special characters, and format sentences.

-

-
-* **Split emoji text**: separate emojis from the sentence.
-* **Remove emojis**: remove all emojis.

-

-
-* `load_dataset()` — Load a dataset from Hugging Face.
-* `get_dataset_info()` — View detailed dataset information.

-

-
-* **HateSpeechDetection** — Hate speech detection.
-* **EmotionRecognition** — Emotion recognition.
-* **AspectSentimentAnalysis** — Aspect-based sentiment analysis.

-
-
-*
-*
-
-### 6. ✏ **Lexical Normalization** — Social media text normalization
-
-* `detect_nsw()` — Detect non-standard words.
-* `normalize_sentence()` — Normalize sentences that contain non-standard words.

 ---

-## 📥

 ```bash
-pip install
 ```
 pinned: false
 ---

+# 📦 ViSoNorm Toolkit — Vietnamese Text Normalization & Processing

+**ViSoNorm** is a specialized toolkit for **Vietnamese text normalization and processing**, optimized for **NLP** environments and easily installable via **PyPI**. Resources (datasets, models) are stored and managed directly on **Hugging Face Hub** and **GitHub Releases**.

 ---
+## 🚀 Key Features

+### 1. 🔧 **BasicNormalizer** — Basic Text Normalization

+* **Case folding**: convert entire text to lowercase/uppercase/capitalize.
+* **Tone normalization**: normalize Vietnamese tone marks.
+* **Basic preprocessing**: remove extra whitespace, special characters, sentence formatting.
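To make the three BasicNormalizer steps listed above concrete, here is a minimal standard-library sketch. It is not ViSoNorm's implementation: the function name `basic_normalize` and its `case` parameter are hypothetical, and Unicode NFC composition is used only as a simple stand-in for tone normalization, which in the toolkit may involve more than this.

```python
import re
import unicodedata


def basic_normalize(text: str, case: str = "lower") -> str:
    """Illustrative stand-in for basic normalization (hypothetical helper)."""
    # NFC composes base letters and combining tone marks into single code
    # points, so visually identical Vietnamese strings compare equal.
    text = unicodedata.normalize("NFC", text)

    # Case folding: lowercase, uppercase, or capitalize the whole text.
    if case == "lower":
        text = text.lower()
    elif case == "upper":
        text = text.upper()
    elif case == "capitalize":
        text = text.capitalize()
    else:
        raise ValueError(f"unknown case mode: {case!r}")

    # Basic preprocessing: collapse whitespace runs and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    # Input uses decomposed tone marks and irregular spacing.
    print(basic_normalize("  Tie\u0302\u0301ng   Vie\u0323\u0302t "))
```

Composing to NFC before any comparison or lookup is a common first step, because the same Vietnamese syllable can be encoded either as one precomposed code point or as a base letter plus combining marks.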

+### 2. 😀 **EmojiHandler** — Emoji Processing

+* **Detect emojis**: detect emojis in text.
+* **Split emoji text**: separate emojis from sentences.
+* **Remove emojis**: remove all emojis.
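The three emoji operations listed above can be sketched with a regular expression over a few Unicode blocks. This is an illustration under stated assumptions, not the EmojiHandler code: the function names are hypothetical, and the character ranges are a rough approximation that ignores variation selectors and zero-width-joiner sequences.

```python
import re

# Rough emoji pattern (an assumption for illustration, not an exhaustive
# definition of what counts as an emoji).
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "]"
)


def detect_emojis(text: str) -> list[str]:
    """Return every emoji character found, in order of appearance."""
    return EMOJI_RE.findall(text)


def split_emoji_text(text: str) -> str:
    """Put spaces around each emoji so it becomes its own token."""
    spaced = EMOJI_RE.sub(lambda m: f" {m.group(0)} ", text)
    return " ".join(spaced.split())


def remove_emojis(text: str) -> str:
    """Drop all emojis and tidy the whitespace left behind."""
    return " ".join(EMOJI_RE.sub(" ", text).split())


if __name__ == "__main__":
    sample = "Hay quá\U0001F60D\U0001F60D cảm ơn"
    print(detect_emojis(sample))
    print(split_emoji_text(sample))
    print(remove_emojis(sample))
```

Splitting emojis into separate tokens before tokenization keeps them from fusing with neighboring words, which matters when a downstream model treats whitespace as the token boundary.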

+### 3. ✏️ **Lexical Normalization** — Social Media Text Normalization

+* **ViSoLexNormalizer**: Normalize text using deep learning models from HuggingFace.
+* **NswDetector**: Detect non-standard words (NSW).
+* **detect_nsw()**: Utility function to detect NSW.
+* **normalize_sentence()**: Utility function to normalize sentences.
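To show what detecting and normalizing non-standard words means in practice, here is a toy dictionary-lookup sketch. It deliberately swaps in a much simpler technique than the deep learning models the list above mentions: the lexicon, `toy_detect_nsw`, and `toy_normalize_sentence` are all hypothetical and are not part of the toolkit's API.

```python
# Hypothetical mini-lexicon mapping non-standard forms to standard ones.
# A real system would learn these mappings rather than hard-code them.
TOY_LEXICON = {
    "ko": "không",
    "dc": "được",
    "bt": "bình thường",
}


def toy_detect_nsw(sentence: str) -> list[str]:
    """Return whitespace-separated tokens that appear in the toy lexicon."""
    return [tok for tok in sentence.split() if tok.lower() in TOY_LEXICON]


def toy_normalize_sentence(sentence: str) -> str:
    """Replace each known non-standard token with its standard form."""
    return " ".join(TOY_LEXICON.get(tok.lower(), tok) for tok in sentence.split())


if __name__ == "__main__":
    sentence = "hôm nay ko đi dc"
    print(toy_detect_nsw(sentence))
    print(toy_normalize_sentence(sentence))
```

A fixed lexicon cannot handle unseen spellings or context-dependent abbreviations, which is the gap a model-based normalizer is meant to close.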

+### 4. 📊 **Resource Management** — Dataset Management

+* `list_datasets()` — List available datasets.
+* `load_dataset()` — Load dataset from GitHub Releases.
+* `get_dataset_info()` — View detailed dataset information.

+### 5. 🧠 **Task Models** — Task Processing Models

+* **SpamReviewDetection** — Spam detection.
+* **HateSpeechDetection** — Hate speech detection.
+* **HateSpeechSpanDetection** — Hate speech span detection.
+* **EmotionRecognition** — Emotion recognition.
+* **AspectSentimentAnalysis** — Aspect-based sentiment analysis.

 ---

+## 📥 Installation
+
+### Install from PyPI (Recommended)

 ```bash
+pip install visonorm
 ```
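After running the install command above, one way to confirm that the package is visible to the current interpreter is to query its metadata with the standard library. This is a minimal sketch: the helper name `installed_version` is hypothetical, and only the distribution name `visonorm` is taken from the install command.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "visonorm") -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(found if found is not None else "visonorm is not installed")
```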