# Keural Tokenizer
**Keural Tokenizer** is the official tokenizer used for training the **Keural Foundation Model**, a large-scale Mixture-of-Experts (MoE) language model architecture designed for enterprise AI, long-context reasoning, and multilingual language understanding.
This repository provides the tokenizer used during the **pretraining stage of the Keural model**, including configuration files, vocabulary, and metadata required to reproduce tokenization behavior during training and inference.
---
# Overview
Large Language Models rely heavily on efficient tokenization.
The Keural tokenizer was designed with the following goals:
* Efficient token representation for large-scale training
* Balanced multilingual support
* Compatibility with scientific, web, and code corpora
* High vocabulary capacity for long-context modeling
* Robust normalization and byte fallback support
The tokenizer was trained using the **SentencePiece Unigram model** on a curated multilingual corpus.
---
# Tokenizer Specifications
| Property | Value |
| --------------- | --------------------- |
| Tokenizer Type | SentencePiece Unigram |
| Vocabulary Size | 131072 tokens |
| Normalization | NFKC |
| Byte Fallback | Enabled |
| Digit Splitting | Enabled |
| Unknown Token | `<unk>` |
| Padding Token | `<pad>` |
| BOS Token | `<bos>` |
| EOS Token | `<eos>` |
The tokenizer covers multiple languages and text domains, including:
* English
* Korean
* Scientific documents
* Literature
* Programming languages
* Web-scale data
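The NFKC normalization listed in the specifications can be previewed with the standard library alone. This sketch (independent of the tokenizer model file) shows the kind of canonicalization NFKC applies to text before tokenization:

```python
import unicodedata

# NFKC folds compatibility characters into canonical forms:
# fullwidth Latin, circled numbers, and Roman-numeral codepoints
# all collapse to their plain ASCII equivalents.
samples = ["Ｋｅｕｒａｌ", "Ⅻ", "①"]
normalized = [unicodedata.normalize("NFKC", s) for s in samples]
print(normalized)  # ['Keural', 'XII', '1']
```

Byte fallback then guarantees that any remaining character outside the vocabulary is still representable as raw UTF-8 bytes rather than collapsing to `<unk>`.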
---
# Training Corpus
The tokenizer was trained on a **54.77 GB multilingual corpus** consisting of multiple domains to ensure robust token coverage.
### Domain Distribution
| Domain | Description |
| ------------------- | ------------------------------ |
| Web Text | Large-scale English web corpus |
| Scientific Papers | ArXiv and PubMed datasets |
| Literature | PG19 and BookCorpus |
| Wikipedia | Clean Korean Wikipedia |
| Source Code | Large-scale code repositories |
| Korean Web Data | Korean web text corpora |
| Multilingual Corpus | CC100 Korean |
The dataset pipeline was designed to reduce noise while preserving linguistic diversity across domains.
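One common way to balance domains of very different sizes is temperature-scaled sampling, where each domain is drawn with probability proportional to its size raised to a power below 1. The sketch below uses hypothetical per-domain sizes (illustrative only, chosen to sum to the stated 54.77 GB) and is not claimed to be the exact Keural weighting scheme:

```python
# Hypothetical per-domain sizes in GB (illustrative, not the real split).
sizes = {"web": 20.0, "science": 10.0, "literature": 8.0,
         "wikipedia": 4.0, "code": 7.0, "korean_web": 4.0, "cc100_ko": 1.77}

def sampling_probs(sizes, alpha=0.5):
    """Temperature-scaled sampling: p_i proportional to size_i ** alpha.

    alpha < 1 upweights smaller domains, a standard technique for
    keeping coverage of low-resource corpora in a mixed dataset.
    """
    scaled = {k: v ** alpha for k, v in sizes.items()}
    z = sum(scaled.values())
    return {k: v / z for k, v in scaled.items()}

probs = sampling_probs(sizes)
print({k: round(p, 3) for k, p in probs.items()})
```

With `alpha=1.0` this reduces to sampling proportional to raw size; lowering `alpha` shifts probability mass toward the smaller Korean and scientific domains.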
---
# Tokenizer Files
This repository contains the following tokenizer artifacts:
```text
keural_tokenizer.model
keural_tokenizer.vocab
tokenizer_config.json
tokenizer_metadata.json
tokenizer.sha256
```
### File Description
**keural_tokenizer.model**
Binary SentencePiece tokenizer model used for tokenization.
**keural_tokenizer.vocab**
Vocabulary mapping tokens to IDs.
**tokenizer_config.json**
Tokenizer configuration used during model training.
**tokenizer_metadata.json**
Metadata including training corpus information.
**tokenizer.sha256**
Checksum file for verifying tokenizer integrity.
---
# Example Usage
### Using SentencePiece
```python
import sentencepiece as spm

# Load the binary SentencePiece model shipped in this repository.
sp = spm.SentencePieceProcessor()
sp.load("keural_tokenizer.model")

# Encode to token IDs; use out_type=str to get the subword pieces instead.
tokens = sp.encode("Keural is a foundation model.", out_type=int)
print(tokens)

# Round-trip the IDs back to text.
print(sp.decode(tokens))
```
### Using HuggingFace Transformers
```python
from transformers import AutoTokenizer

# Loads the HF-compatible files (tokenizer_config.json plus the
# SentencePiece model) directly from the Hub repository.
tokenizer = AutoTokenizer.from_pretrained("mkd-ai/keural-tokenizer")
tokens = tokenizer("Keural foundation model tokenizer example")
print(tokens)  # dict-like output with input_ids and attention_mask
```
---
# Model Compatibility
This tokenizer is used for training the **Keural Foundation Model**, which uses the following architecture:
| Parameter | Value |
| ------------------------ | ------------------------------ |
| Architecture | Transformer Mixture-of-Experts |
| Hidden Size | 4096 |
| Layers | 32 |
| Attention Heads | 32 |
| Experts per Layer | 32 |
| Active Experts per Token | 4 |
| Context Length | 4096 (scalable) |
| Vocabulary Size | 131072 |
Estimated model capacity:
* Total parameters: ~120B
* Active parameters per token: ~13B
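The capacity figures above can be sanity-checked with a back-of-envelope estimator. This sketch assumes architecture details the README does not state (SwiGLU experts, a hypothetical FFN width, tied embeddings, no router/norm/bias terms), so it will not reproduce the headline ~120B/~13B figures exactly:

```python
def moe_params(hidden, layers, n_experts, active_experts, vocab, d_ff):
    """Back-of-envelope MoE parameter count.

    Assumes four hidden x hidden attention projections per layer,
    SwiGLU experts (three hidden x d_ff matrices each), tied
    embeddings, and ignores router, norm, and bias parameters.
    """
    attn = 4 * hidden * hidden
    expert = 3 * hidden * d_ff
    embed = vocab * hidden
    total = layers * (attn + n_experts * expert) + embed
    active = layers * (attn + active_experts * expert) + embed
    return total, active

# The table's numbers with a hypothetical FFN width of 9216; the
# result is only indicative, since d_ff and any expert-shared
# parameter layout are not stated in this README.
total, active = moe_params(4096, 32, 32, 4, 131072, 9216)
print(f"total ~ {total / 1e9:.1f}B, active ~ {active / 1e9:.1f}B")
```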
---
# Context Length Roadmap
The Keural model is designed to scale context length progressively using **YaRN positional scaling**.
| Stage   | Context Length |
| ------- | -------------- |
| Stage 1 | 4096 |
| Stage 2 | 8192 |
| Stage 3 | 32768 |
| Stage 4 | 131072 |
| Stage 5 | 262144 |
| Stage 6 | 524288 |
| Stage 7 | 1048576 |
This staged context expansion enables efficient training while supporting ultra-long context inference.
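YaRN-style positional scaling is parameterized by a scale factor, the ratio of the target context length to the original training context. Under that standard definition (the README does not give Keural's exact YaRN hyperparameters), the factor implied by each roadmap stage is:

```python
BASE_CONTEXT = 4096
stages = [4096, 8192, 32768, 131072, 262144, 524288, 1048576]

# Scale factor s = target_context / original_context per stage.
factors = [n // BASE_CONTEXT for n in stages]
for n, s in zip(stages, factors):
    print(f"{n:>9} tokens -> scale factor {s}x")
```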
---
# Training Pipeline
The tokenizer was trained as part of the Keural dataset pipeline, which includes:
* Streaming dataset ingestion
* Text normalization and cleaning
* Multithreaded tokenization
* Domain-based token balancing
* Fault-tolerant dataset checkpointing
* Large-scale corpus collection
The dataset preparation pipeline is available in the Keural model repository.
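The normalization and cleaning stages above can be sketched as a lazy generator pipeline. This is illustrative only (the real pipeline lives in the model repository); it shows streaming NFKC normalization with empty-line filtering:

```python
import unicodedata

def cleaned_stream(lines):
    """Lazily NFKC-normalize a text stream and drop empty lines.

    Illustrative stand-in for the Keural ingestion pipeline, which
    additionally handles sharding, balancing, and checkpointing.
    """
    for line in lines:
        text = unicodedata.normalize("NFKC", line).strip()
        if text:  # drop lines that are empty after cleaning
            yield text

raw = ["  Keural  ", "", "①  circled", "\u00a0"]
print(list(cleaned_stream(raw)))  # ['Keural', '1  circled']
```

Because the pipeline is a generator, it processes one line at a time, which is what makes streaming ingestion of a multi-gigabyte corpus feasible without loading it into memory.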
---
# Roadmap
The Keural project roadmap includes the following stages.
### Stage 1 β€” Tokenizer Development
* Multilingual tokenizer training
* Vocabulary optimization
* Token coverage validation
### Stage 2 β€” Dataset Preparation
* Large-scale corpus collection
* Domain balancing
* Token budget enforcement
### Stage 3 β€” Foundation Model Training
* Mixture-of-Experts transformer architecture
* Long-context support
* Distributed GPU training
### Stage 4 β€” Instruction Tuning
* Alignment with instruction datasets
* Conversational fine-tuning
* Domain adaptation
### Stage 5 β€” Deployment
* vLLM inference support
* Enterprise deployment
* Retrieval-augmented reasoning
---
# Hardware Environment
Tokenizer development and dataset processing were performed on a high-performance server environment:
* CPU: 32 cores
* RAM: ~480 GB
* Storage: NVMe SSD
* GPU: 2× H200-class GPUs (planned for model training; not yet used)
---
# License
This tokenizer is part of the **Keural Foundation Model project**.
Usage and distribution may be subject to project licensing terms.
---
# Organization
Developed by
**MKD Corp AI Research**
Republic of Korea
---
# Citation
If you use the Keural tokenizer in research, please cite the Keural project repository.
```bibtex
@misc{keural_tokenizer,
  title={Keural Tokenizer},
  author={{MKD Corp AI Research} and Hossain, Md. Najmul},
  year={2026},
  url={https://huggingface.co/mkd-ai/keural-tokenizer}
}
```