---
language:
- vi
- en
---

# NLPT

| Language | Dataset     | Source                                                      | Download                                                                                     |
|----------|-------------|-------------------------------------------------------------|----------------------------------------------------------------------------------------------|
| `all`    | Punctuation |                                                             | [`PUNCTUATION.txt`](https://huggingface.co/onelevelstudio/NLPT/raw/main/PUNCTUATION.txt)     |
| `vi`     | Synonyms    | [source](https://tudiendongnghia.com)                       | [`VI_SYNONYMS.json`](https://huggingface.co/onelevelstudio/NLPT/raw/main/VI_SYNONYMS.json)   |
| `vi`     | Vocab       | [source](https://github.com/duyet/vietnamese-wordlist)      | [`VI_VOCAB.txt`](https://huggingface.co/onelevelstudio/NLPT/raw/main/VI_VOCAB.txt)           |
| `vi`     | Diacritics  |                                                             | [`VI_DIACRITICS.txt`](https://huggingface.co/onelevelstudio/NLPT/raw/main/VI_DIACRITICS.txt) |
| `vi`     | Stopwords   | [source](https://github.com/stopwords/vietnamese-stopwords) | [`VI_STOPWORDS.txt`](https://huggingface.co/onelevelstudio/NLPT/raw/main/VI_STOPWORDS.txt)   |
| `en`     | Stopwords   | nltk                                                        | [`EN_STOPWORDS.txt`](https://huggingface.co/onelevelstudio/NLPT/raw/main/EN_STOPWORDS.txt)   |

## Short-term Usage

```python
import requests

punctuation = requests.get("https://huggingface.co/onelevelstudio/NLPT/raw/main/PUNCTUATION.txt").text.splitlines()
```

## Long-term Usage

```python
from huggingface_hub import hf_hub_download as HF_Download
import json

with open(HF_Download(repo_id="onelevelstudio/NLPT", filename="PUNCTUATION.txt"), mode="r", encoding="utf-8") as f:
    DATASET_punctuation = set(f.read().splitlines())

with open(HF_Download(repo_id="onelevelstudio/NLPT", filename="VI_DIACRITICS.txt"), mode="r", encoding="utf-8") as f:
    DATASET_diacritics_vi = f.read().splitlines()

with open(HF_Download(repo_id="onelevelstudio/NLPT", filename="VI_VOCAB.txt"), mode="r", encoding="utf-8") as f:
    DATASET_vocab_vi = f.read().splitlines()

with open(HF_Download(repo_id="onelevelstudio/NLPT", filename="VI_STOPWORDS.txt"), mode="r", encoding="utf-8") as f:
    DATASET_stopwords_vi = f.read().splitlines()

with open(HF_Download(repo_id="onelevelstudio/NLPT", filename="EN_STOPWORDS.txt"), mode="r", encoding="utf-8") as f:
    DATASET_stopwords_en = f.read().splitlines()

with open(HF_Download(repo_id="onelevelstudio/NLPT", filename="VI_SYNONYMS.json"), mode="r", encoding="utf-8") as f:
    DATASET_synonyms_vi = json.load(f)
```
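
Once loaded, the stopword and punctuation lists can be used for simple token filtering. The following is a minimal sketch, not part of the dataset itself: the `clean_tokens` helper is hypothetical, and the small inline lists are stand-ins for `DATASET_stopwords_en` and `DATASET_punctuation` so the example runs without a download. It assumes each text file holds one entry per line, as the `splitlines()` calls suggest.

```python
def clean_tokens(tokens, stopwords, punctuation):
    """Drop tokens that are stopwords (case-insensitive) or punctuation marks."""
    stopword_set = {w.lower() for w in stopwords}
    punctuation_set = set(punctuation)
    return [
        t for t in tokens
        if t.lower() not in stopword_set and t not in punctuation_set
    ]


# Hypothetical stand-ins; in practice pass DATASET_stopwords_en and
# DATASET_punctuation loaded from the repository files.
sample_stopwords = ["the", "is", "a"]
sample_punctuation = [".", ",", "!"]

tokens = ["The", "cat", "is", "on", "a", "mat", "."]
cleaned = clean_tokens(tokens, sample_stopwords, sample_punctuation)
print(cleaned)  # ['cat', 'on', 'mat']
```

Converting the lists to sets inside the helper keeps each membership check constant-time, which matters more for the larger vocabulary and stopword files than for this toy input.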