---
language:
- vi
- en
---

# NLPT

| Language | Dataset     | Source                                                      | Download                                                                                   |
|----------|-------------|-------------------------------------------------------------|--------------------------------------------------------------------------------------------|
| `all`    | Punctuation |                                                             | [`PUNCTUATION.txt`](https://huggingface.co/onelevelstudio/NLPT/raw/main/PUNCTUATION.txt)     |
| `vi`     | Synonyms    | [source](https://tudiendongnghia.com)                       | [`VI_SYNONYMS.json`](https://huggingface.co/onelevelstudio/NLPT/raw/main/VI_SYNONYMS.json)   |
| `vi`     | Vocab       | [source](https://github.com/duyet/vietnamese-wordlist)      | [`VI_VOCAB.txt`](https://huggingface.co/onelevelstudio/NLPT/raw/main/VI_VOCAB.txt)           |
| `vi`     | Diacritics  |                                                             | [`VI_DIACRITICS.txt`](https://huggingface.co/onelevelstudio/NLPT/raw/main/VI_DIACRITICS.txt) |
| `vi`     | Stopwords   | [source](https://github.com/stopwords/vietnamese-stopwords) | [`VI_STOPWORDS.txt`](https://huggingface.co/onelevelstudio/NLPT/raw/main/VI_STOPWORDS.txt)   |
| `en`     | Stopwords   | nltk                                                        | [`EN_STOPWORDS.txt`](https://huggingface.co/onelevelstudio/NLPT/raw/main/EN_STOPWORDS.txt)   |

## Short-term Usage

Fetch a file over HTTP each time the script runs; nothing is cached locally.
```python
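import requests
response = requests.get("https://huggingface.co/onelevelstudio/NLPT/raw/main/PUNCTUATION.txt", timeout=30)
response.raise_for_status()  # fail on HTTP errors instead of parsing an error page
punctuation = response.text.splitlines()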
```
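
The same pattern extends to the other `.txt` files in the table. The following is a minimal sketch, not part of the repository: it swaps `requests` for the standard library's `urllib`, decodes the response explicitly as UTF-8, and assumes each file holds one entry per line. The names `BASE_URL`, `dataset_url`, `parse_lines`, and `fetch_lines` are hypothetical helpers introduced for illustration.

```python
from urllib.request import urlopen

# Raw-file prefix taken from the download links in the table above.
BASE_URL = "https://huggingface.co/onelevelstudio/NLPT/raw/main/"


def dataset_url(filename):
    """Build the raw download URL for one file of the repository."""
    return BASE_URL + filename


def parse_lines(raw_bytes):
    """Decode bytes as UTF-8 and return the non-empty lines.

    Assumption: one entry per line; blank lines are dropped.
    """
    return [line for line in raw_bytes.decode("utf-8").splitlines() if line]


def fetch_lines(filename):
    """Download one text file and return its entries (needs network access)."""
    with urlopen(dataset_url(filename), timeout=30) as response:
        return parse_lines(response.read())
```

For example, `fetch_lines("VI_STOPWORDS.txt")` would download the Vietnamese stopword list on every call, since nothing is cached.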

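## Long-term Usage

`hf_hub_download` stores each file in the local Hugging Face cache and returns its path, so later runs reuse the cached copy instead of downloading again.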
```python
from huggingface_hub import hf_hub_download as HF_Download
import json

with open(HF_Download(repo_id="onelevelstudio/NLPT", filename="PUNCTUATION.txt"), mode="r", encoding="utf-8") as f:
    DATASET_punctuation = set(f.read().splitlines())

with open(HF_Download(repo_id="onelevelstudio/NLPT", filename="VI_DIACRITICS.txt"), mode="r", encoding="utf-8") as f:
    DATASET_diacritics_vi = f.read().splitlines()

with open(HF_Download(repo_id="onelevelstudio/NLPT", filename="VI_VOCAB.txt"), mode="r", encoding="utf-8") as f:
    DATASET_vocab_vi = f.read().splitlines()

with open(HF_Download(repo_id="onelevelstudio/NLPT", filename="VI_STOPWORDS.txt"), mode="r", encoding="utf-8") as f:
    DATASET_stopwords_vi = f.read().splitlines()

with open(HF_Download(repo_id="onelevelstudio/NLPT", filename="EN_STOPWORDS.txt"), mode="r", encoding="utf-8") as f:
    DATASET_stopwords_en = f.read().splitlines()

with open(HF_Download(repo_id="onelevelstudio/NLPT", filename="VI_SYNONYMS.json"), mode="r", encoding="utf-8") as f:
    DATASET_synonyms_vi = json.load(f)
```
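
Once loaded, the lists can be combined for simple text cleaning. The sketch below is one possible use, not something the repository prescribes: `filter_tokens` is a hypothetical helper, and the small inline lists stand in for `DATASET_stopwords_en` and `DATASET_punctuation` so the example runs without a download. It assumes each punctuation entry is a literal string to strip and each stopword is a single whitespace-free token.

```python
def filter_tokens(text, stopwords, punctuation):
    """Strip punctuation marks, lowercase, split on whitespace, drop stopwords."""
    stopword_set = {word.lower() for word in stopwords}
    # Replace each punctuation mark with a space so adjacent words stay separated.
    for mark in punctuation:
        if mark:  # skip empty entries, which str.replace would mishandle
            text = text.replace(mark, " ")
    return [token for token in text.lower().split() if token not in stopword_set]


# Tiny stand-ins for DATASET_stopwords_en and DATASET_punctuation.
sample_stopwords = ["the", "is", "a"]
sample_punctuation = {",", ".", "!"}

print(filter_tokens("The cat is on a mat, sleeping.", sample_stopwords, sample_punctuation))
# → ['cat', 'on', 'mat', 'sleeping']
```

Multi-word stopword entries, if a list contains any, would not be matched by this token-level filter and would need phrase matching instead.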