---
license: mit
datasets:
- billingsmoore/84000-bo-en
- billingsmoore/LotsawaHouse-bo-en
- openpecha/cleaned_MT_v1.0.3
language:
- bo
- en
tags:
- Tibetan
- nlp
- dharma
- Buddhism
---
# Getok: Custom Tokenizer for Tibetan Buddhist Texts

This is a custom Byte Pair Encoding (BPE) tokenizer specifically for Tibetan Buddhist texts. It was trained using the SentencePiece / Hugging Face Tokenizers library. It is designed to tokenize text data efficiently for downstream NLP tasks. The tokenizer supports Unicode text in both Tibetan and English and was trained on a domain-specific corpus of Tibetan Buddhist texts.

This model was developed as part of the MLotsawa project. [More information can be found here.](https://github.com/billingsmoore/MLotsawa)

*Special thanks to Andres Montano for suggesting the name of this tokenizer.*

## Use Cases

- Preprocessing for text classification, translation, summarization, or language modeling tasks
- Training or fine-tuning language models for Tibetan Buddhism-related tasks


## Details

**Tokenizer Type**: BPE (Byte Pair Encoding)

**Vocabulary Size**: 32,000

**Normalization**: None

**Special Tokens**: "[PAD]", "[BOS]", "[EOS]", "\<unk>"

**Tokenization Level**: Subword

**Languages**: Tibetan, English

## Training Data

The tokenizer was trained on a corpus consisting of:

- 879,132 sentences of Tibetan from Buddhist texts
- 879,132 sentences of English from translations of the Tibetan texts


## Usage

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained('billingsmoore/getok-v0')

tokenizer.encode('འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཚལ་ལོ༔')
```

## Limitations

The tokenizer currently supports Unicode text in Tibetan or Latin script. However, it was trained only on Tibetan and English text and should not be expected to perform well on other languages that use those scripts (e.g., Dzongkha or French).

This tokenizer is not suitable for languages written in other scripts (e.g., Greek or Russian).

Fine-tuning a pretrained model with this tokenizer should be expected to take longer than fine-tuning with the model's own tokenizer, because the model must adapt to the new encodings.

## Author & Contact

**Author**: billingsmoore

**Contact**: billingsmoore [at] gmail [dot] com