This tokenizer was trained on Kurdish Kurmanji text using Byte-Pair Encoding (BPE). It was trained on this dataset.
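For readers unfamiliar with BPE, the core idea is to repeatedly merge the most frequent adjacent pair of symbols in the corpus into a single new symbol. The following is a minimal, self-contained sketch of that merge loop; the toy corpus, word frequencies, and merge count are illustrative only and are not taken from this tokenizer's actual training script:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of (word, frequency) items."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: word frequencies, with characters as the initial symbols.
words = {tuple("betlane"): 3, tuple("betlaneyê"): 2, tuple("bi"): 5}
for _ in range(4):  # learn 4 merges
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print("merged:", pair)
```

Real BPE trainers (e.g. the Hugging Face `tokenizers` library) add byte-level fallback, a vocabulary-size budget, and special tokens on top of this loop, but the merge procedure is the same in spirit.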
```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "muzaffercky/kurdish-kurmanji-tokenizer", revision="v1.0"
)

text = "Havîn bi gelemperî bi betlane, derya û tavê tê naskirin, lê her kes nikare biçe betlaneyê"

tokens = tokenizer.tokenize(text)
print(tokens)
# Output: ['Hav', 'în bi ', 'gelemperî bi ', 'bet', 'l', 'ane, ', 'der', 'ya û ', 'tavê ', 'tê nas', 'kirin, lê ', 'her kes', ' nikare ', 'biçe ', 'bet', 'lan', 'ey', 'ê']

ids = tokenizer.encode(text)
print(ids)
# Output: [19889, 6494, 19754, 2055, 82, 8227, 337, 3349, 16407, 5790, 11235, 2584, 5479, 5732, 2055, 9479, 287, 140]
```
This tokenizer includes spaces within some tokens (e.g., 'Ez ê ', 'di vê '), which causes the default tokenizer.decode() method from the transformers library to add extra spaces between tokens during decoding. To decode text correctly and preserve the original formatting, decode each token ID individually and join the results without spaces.
Use the following code example:
```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "muzaffercky/kurdish-kurmanji-tokenizer", revision="v1.0"
)

text = "Havîn bi gelemperî bi betlane, derya û tavê tê naskirin, lê her kes nikare biçe betlaneyê"
ids = tokenizer.encode(text)

# Decode each ID on its own and concatenate; the tokens already carry their spacing.
individual_tokens = [tokenizer.decode([token_id]) for token_id in ids]
decoded_text = "".join(individual_tokens)
print(decoded_text)
# Output: Havîn bi gelemperî bi betlane, derya û tavê tê naskirin, lê her kes nikare biçe betlaneyê
```