---
language:
- vi
- en
---

# NLPT

| Language | Dataset     | Source                                                      | Download                                                                                         |
|----------|-------------|-------------------------------------------------------------|--------------------------------------------------------------------------------------------------|
| `all`    | Punctuation |                                                             | [`PUNCTUATION.txt`](https://huggingface.co/onelevelstudio/NLPT/raw/main/PUNCTUATION.txt)         |
| `vi`     | Synonyms    | [source](https://tudiendongnghia.com)                       | [`VI_SYNONYMS.json`](https://huggingface.co/onelevelstudio/NLPT/raw/main/VI_SYNONYMS.json)       |
| `vi`     | Vocab       | [source](https://github.com/duyet/vietnamese-wordlist)      | [`VI_VOCAB.txt`](https://huggingface.co/onelevelstudio/NLPT/raw/main/VI_VOCAB.txt)               |
| `vi`     | Diacritics  |                                                             | [`VI_DIACRITICS.txt`](https://huggingface.co/onelevelstudio/NLPT/raw/main/VI_DIACRITICS.txt)     |
| `vi`     | Stopwords   | [source](https://github.com/stopwords/vietnamese-stopwords) | [`VI_STOPWORDS.txt`](https://huggingface.co/onelevelstudio/NLPT/raw/main/VI_STOPWORDS.txt)       |
| `en`     | Stopwords   | NLTK                                                        | [`EN_STOPWORDS.txt`](https://huggingface.co/onelevelstudio/NLPT/raw/main/EN_STOPWORDS.txt)       |

## Short-term Usage

```python
import requests

# Fetches the file over HTTP on every run (nothing is cached locally).
punctuation = requests.get("https://huggingface.co/onelevelstudio/NLPT/raw/main/PUNCTUATION.txt").text.splitlines()
```
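The one-liner above can be generalised to any file in the table. The following is only a sketch: the helper names `raw_url`, `parse_lines`, and `fetch_lines` are hypothetical, the 10-second timeout is an arbitrary choice, and dropping empty lines assumes they carry no meaning in these files.

```python
BASE_URL = "https://huggingface.co/onelevelstudio/NLPT/raw/main/"


def raw_url(filename: str) -> str:
    """Build the raw-download URL for a file listed in the table."""
    return BASE_URL + filename


def parse_lines(text: str) -> list[str]:
    """Split a downloaded text file into entries, skipping empty lines (assumption)."""
    return [line for line in text.splitlines() if line]


def fetch_lines(filename: str, timeout: float = 10.0) -> list[str]:
    """Download one of the .txt datasets and return its entries."""
    # Imported lazily so the pure helpers above have no third-party dependency.
    import requests

    response = requests.get(raw_url(filename), timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return parse_lines(response.text)
```

For example, `fetch_lines("EN_STOPWORDS.txt")` would download the English stopword list. Checking the status code matters here because a failed request would otherwise be split into lines and silently treated as data.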

## Long-term Usage

```python
import json

from huggingface_hub import hf_hub_download as HF_Download

# hf_hub_download stores each file in the local Hugging Face cache and returns
# the cached path, so repeated runs do not download the file again.

# Punctuation marks (all languages), kept as a set for fast membership tests.
with open(HF_Download(repo_id="onelevelstudio/NLPT", filename="PUNCTUATION.txt"), mode="r", encoding="utf-8") as f:
    DATASET_punctuation = set(f.read().splitlines())

# Vietnamese diacritics, one entry per line.
with open(HF_Download(repo_id="onelevelstudio/NLPT", filename="VI_DIACRITICS.txt"), mode="r", encoding="utf-8") as f:
    DATASET_diacritics_vi = f.read().splitlines()

# Vietnamese vocabulary, one entry per line.
with open(HF_Download(repo_id="onelevelstudio/NLPT", filename="VI_VOCAB.txt"), mode="r", encoding="utf-8") as f:
    DATASET_vocab_vi = f.read().splitlines()

# Vietnamese stopwords, one entry per line.
with open(HF_Download(repo_id="onelevelstudio/NLPT", filename="VI_STOPWORDS.txt"), mode="r", encoding="utf-8") as f:
    DATASET_stopwords_vi = f.read().splitlines()

# English stopwords, one entry per line.
with open(HF_Download(repo_id="onelevelstudio/NLPT", filename="EN_STOPWORDS.txt"), mode="r", encoding="utf-8") as f:
    DATASET_stopwords_en = f.read().splitlines()

# Vietnamese synonyms, stored as JSON.
with open(HF_Download(repo_id="onelevelstudio/NLPT", filename="VI_SYNONYMS.json"), mode="r", encoding="utf-8") as f:
    DATASET_synonyms_vi = json.load(f)
```
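Once loaded, the punctuation and stopword lists might be used along these lines. This is a minimal sketch, not part of the repository: the functions `remove_punctuation` and `remove_stopwords` are hypothetical, the tiny sample sets stand in for the real files so the snippet runs offline, and it assumes every punctuation entry is a single character and every stopword is a single lowercase token.

```python
def remove_punctuation(text: str, punctuation: set[str]) -> str:
    """Replace every character found in `punctuation` with a space."""
    return "".join(" " if ch in punctuation else ch for ch in text)


def remove_stopwords(text: str, stopwords: set[str], punctuation: set[str]) -> list[str]:
    """Lowercase, strip punctuation, split on whitespace, and drop stopwords."""
    tokens = remove_punctuation(text.lower(), punctuation).split()
    return [token for token in tokens if token not in stopwords]


# Stand-in data (assumption); in practice pass DATASET_punctuation and
# set(DATASET_stopwords_en) loaded as shown above.
sample_punctuation = {",", ".", "!"}
sample_stopwords = {"the", "is", "a"}

tokens = remove_stopwords("The cat is a hunter, the end!", sample_stopwords, sample_punctuation)
print(tokens)  # ['cat', 'hunter', 'end']
```

Converting a stopword list to a `set` before filtering keeps each membership test constant-time, which matters for long texts. Multi-word stopwords would need phrase matching rather than this token-by-token filter.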