---
license: mit
task_categories:
- text-generation
language:
- tr
tags:
- tokenizer
- bpe
- turkish
- wikipedia
size_categories:
- 10K<n<100K
---

# Turkish BPE Tokenizer for Mamba & GPT Models

This is a custom **Byte-Pair Encoding (BPE)** tokenizer trained specifically for the Turkish language to support the `mamba-tr-project` research.

## Model Details

- **Vocabulary Size:** 32,000 tokens
- **Training Corpus:** Turkish Wikipedia (2023 dump)
- **Algorithm:** BPE (byte-level)
- **Special Tokens:** `<|endoftext|>`, `<|padding|>`
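
For intuition, the core BPE training loop can be sketched in plain Python: count adjacent symbol pairs across the corpus, merge the most frequent pair, and repeat. This is a character-level toy on made-up word frequencies, not the byte-level implementation or training script actually used for this tokenizer:

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across all words (each word is a tuple of symbols)."""
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Rewrite every word, replacing occurrences of `pair` with one merged symbol."""
    a, b = pair
    merged = {}
    for word, freq in words.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and word[i] == a and word[i + 1] == b:
                out.append(a + b)  # fuse the pair into a single symbol
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: hypothetical word frequencies, each word split into characters.
corpus = {tuple("gelişti"): 3, tuple("gelişme"): 2, tuple("geliyor"): 2}
merges = []
for _ in range(3):  # a real vocabulary needs thousands of merges
    pairs = get_pair_counts(corpus)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    merges.append(best)
    corpus = merge_pair(corpus, best)
print(merges)  # the shared stem "gel" is learned first
```

The learned merge list is exactly what a BPE tokenizer stores: applying the merges in order to new text reproduces the training-time segmentation.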

## Usage

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("oguzatas/mamba-tr-project-tokenizer")

text = "Yapay zeka günümüzde çok gelişti."
tokens = tokenizer.encode(text)
decoded = tokenizer.decode(tokens)

print(tokens)
# Output: [Token IDs...]
print(decoded)
# Output: "Yapay zeka günümüzde çok gelişti."
```