---
language:
- vi
- en
---
# NLPT
| Language | Dataset | Source | Download |
|----------|-------------|-------------------------------------------------------------|--------------------------------------------------------------------------------------------|
| `all` | Punctuation | | [`PUNCTUATION.txt`](https://huggingface.co/onelevelstudio/NLPT/raw/main/PUNCTUATION.txt) |
| `vi` | Synonyms | [source](https://tudiendongnghia.com) | [`VI_SYNONYMS.json`](https://huggingface.co/onelevelstudio/NLPT/raw/main/VI_SYNONYMS.json) |
| `vi` | Vocab | [source](https://github.com/duyet/vietnamese-wordlist) | [`VI_VOCAB.txt`](https://huggingface.co/onelevelstudio/NLPT/raw/main/VI_VOCAB.txt) |
| `vi` | Diacritics | | [`VI_DIACRITICS.txt`](https://huggingface.co/onelevelstudio/NLPT/raw/main/VI_DIACRITICS.txt) |
| `vi` | Stopwords | [source](https://github.com/stopwords/vietnamese-stopwords) | [`VI_STOPWORDS.txt`](https://huggingface.co/onelevelstudio/NLPT/raw/main/VI_STOPWORDS.txt) |
| `en`     | Stopwords   | NLTK                                                        | [`EN_STOPWORDS.txt`](https://huggingface.co/onelevelstudio/NLPT/raw/main/EN_STOPWORDS.txt) |
## Short-term Usage
Fetch a file directly over HTTP. Nothing is cached, so the file is downloaded again on every run.
```python
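import requests

# Each line of the file is one entry; the timeout avoids hanging on a stalled connection.
punctuation = requests.get("https://huggingface.co/onelevelstudio/NLPT/raw/main/PUNCTUATION.txt", timeout=10).text.splitlines()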
```
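The same pattern can be generalized to any file in the table above. The following is a minimal sketch, not part of this repository: the helper names `raw_url`, `parse_dataset`, and `fetch_dataset` are hypothetical, and it assumes every `.txt` file holds one entry per line and every `.json` file is valid JSON encoded as UTF-8.

```python
import json

# Base of the raw-download links listed in the table above.
BASE_URL = "https://huggingface.co/onelevelstudio/NLPT/raw/main/"


def raw_url(filename: str) -> str:
    """Build the raw-download URL for one file (hypothetical helper)."""
    return BASE_URL + filename


def parse_dataset(filename: str, text: str):
    """Parse file contents: JSON for .json files, one entry per line otherwise."""
    if filename.endswith(".json"):
        return json.loads(text)
    return text.splitlines()


def fetch_dataset(filename: str, timeout: float = 10.0):
    """Download and parse one file, raising on HTTP errors."""
    import requests  # imported here so the pure helpers above work without it

    response = requests.get(raw_url(filename), timeout=timeout)
    response.raise_for_status()  # fail loudly instead of parsing an error page
    response.encoding = "utf-8"  # assumption: files are UTF-8, as in the Long-term Usage block
    return parse_dataset(filename, response.text)
```

With this in place, `fetch_dataset("VI_STOPWORDS.txt")` would return a list of lines and `fetch_dataset("VI_SYNONYMS.json")` the decoded JSON object.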
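## Long-term Usage
Download through `huggingface_hub`, which stores each file in the local Hugging Face cache and reuses it on later runs.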
```python
import json

from huggingface_hub import hf_hub_download as HF_Download

# hf_hub_download returns the local path of the downloaded file.
# Plain-text datasets: one entry per line.
with open(HF_Download(repo_id="onelevelstudio/NLPT", filename="PUNCTUATION.txt"), mode="r", encoding="utf-8") as f:
    DATASET_punctuation = set(f.read().splitlines())
with open(HF_Download(repo_id="onelevelstudio/NLPT", filename="VI_DIACRITICS.txt"), mode="r", encoding="utf-8") as f:
    DATASET_diacritics_vi = f.read().splitlines()
with open(HF_Download(repo_id="onelevelstudio/NLPT", filename="VI_VOCAB.txt"), mode="r", encoding="utf-8") as f:
    DATASET_vocab_vi = f.read().splitlines()
with open(HF_Download(repo_id="onelevelstudio/NLPT", filename="VI_STOPWORDS.txt"), mode="r", encoding="utf-8") as f:
    DATASET_stopwords_vi = f.read().splitlines()
with open(HF_Download(repo_id="onelevelstudio/NLPT", filename="EN_STOPWORDS.txt"), mode="r", encoding="utf-8") as f:
    DATASET_stopwords_en = f.read().splitlines()

# JSON dataset.
with open(HF_Download(repo_id="onelevelstudio/NLPT", filename="VI_SYNONYMS.json"), mode="r", encoding="utf-8") as f:
    DATASET_synonyms_vi = json.load(f)
```
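Once loaded, the stopword and punctuation lists can be used to clean text. The sketch below is one possible approach, not something this repository ships: `remove_stopwords` is a hypothetical helper, it assumes each punctuation entry is a single character, and it splits on whitespace, which suits English better than Vietnamese (where a stopword may span several syllables). Small stand-in sets are used so the example runs offline.

```python
def remove_stopwords(text: str, stopwords: set[str], punctuation: set[str]) -> list[str]:
    """Lowercase the text, replace punctuation characters with spaces, drop stopwords."""
    cleaned = "".join(" " if ch in punctuation else ch for ch in text.lower())
    return [token for token in cleaned.split() if token not in stopwords]


# Stand-in data for illustration only; in practice pass
# set(DATASET_stopwords_en) and DATASET_punctuation from the block above.
sample_stopwords = {"the", "is", "a"}
sample_punctuation = {",", ".", "!"}

tokens = remove_stopwords("The cat, is a friend!", sample_stopwords, sample_punctuation)
print(tokens)  # ['cat', 'friend']
```

Converting the stopword list to a `set` first keeps each membership check constant-time, which matters for longer lists such as the vocabulary.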