# Keural Tokenizer

**Keural Tokenizer** is the official tokenizer used for training the **Keural Foundation Model**, a large-scale Mixture-of-Experts (MoE) language model architecture designed for enterprise AI, long-context reasoning, and multilingual language understanding.

This repository provides the tokenizer used during the **pretraining stage of the Keural model**, including the configuration files, vocabulary, and metadata required to reproduce tokenization behavior during training and inference.

---

# Overview

Large language models rely heavily on efficient tokenization. The Keural tokenizer was designed with the following goals:

* Efficient token representation for large-scale training
* Balanced multilingual support
* Compatibility with scientific, web, and code corpora
* High vocabulary capacity for long-context modeling
* Robust normalization and byte-fallback support

The tokenizer was trained using the **SentencePiece Unigram model** on a curated multilingual corpus.

---

# Tokenizer Specifications

| Property        | Value                 |
| --------------- | --------------------- |
| Tokenizer Type  | SentencePiece Unigram |
| Vocabulary Size | 131,072 tokens        |
| Normalization   | NFKC                  |
| Byte Fallback   | Enabled               |
| Digit Splitting | Enabled               |
| Unknown Token   | `<unk>`               |
| Padding Token   | `<pad>`               |
| BOS Token       | `<s>`                 |
| EOS Token       | `</s>`                |

The tokenizer supports multilingual text including:

* English
* Korean
* Scientific documents
* Literature
* Programming languages
* Web-scale data

---

# Training Corpus

The tokenizer was trained on a **54.77 GB multilingual corpus** spanning multiple domains to ensure robust token coverage.
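Token coverage on a text sample can be spot-checked with a small script. The sketch below is illustrative only: it uses plain whitespace splitting as a stand-in for the real SentencePiece model, and the `coverage_stats` helper and the `<unk>` symbol name are assumptions, not artifacts of this repository.

```python
import unicodedata

def coverage_stats(text, tokenize):
    """Rough coverage metrics for a text sample: characters per token
    and the fraction of tokens that fall back to the unknown symbol."""
    # NFKC normalization mirrors the tokenizer's configured normalizer.
    normalized = unicodedata.normalize("NFKC", text)
    tokens = tokenize(normalized)
    unk_rate = tokens.count("<unk>") / max(len(tokens), 1)
    chars_per_token = len(normalized) / max(len(tokens), 1)
    return {"chars_per_token": chars_per_token, "unk_rate": unk_rate}

# Whitespace split as a stand-in; a real check would pass something like
# lambda t: sp.encode(t, out_type=str) backed by keural_tokenizer.model.
stats = coverage_stats("Keural is a foundation model.", str.split)
print(stats)
```

A low unknown-token rate and a stable characters-per-token ratio across domains are quick signals that the vocabulary covers the corpus well.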
### Domain Distribution

| Domain              | Description                    |
| ------------------- | ------------------------------ |
| Web Text            | Large-scale English web corpus |
| Scientific Papers   | ArXiv and PubMed datasets      |
| Literature          | PG19 and BookCorpus            |
| Wikipedia           | Clean Korean Wikipedia         |
| Source Code         | Large-scale code repositories  |
| Korean Web Data     | Korean web text corpora        |
| Multilingual Corpus | CC100 Korean                   |

The dataset pipeline was designed to reduce noise while preserving linguistic diversity across domains.

---

# Tokenizer Files

This repository contains the following tokenizer artifacts:

```text
keural_tokenizer.model
keural_tokenizer.vocab
tokenizer_config.json
tokenizer_metadata.json
tokenizer.sha256
```

### File Descriptions

**keural_tokenizer.model**
Binary SentencePiece model used for tokenization.

**keural_tokenizer.vocab**
Vocabulary file mapping tokens to IDs.

**tokenizer_config.json**
Tokenizer configuration used during model training.

**tokenizer_metadata.json**
Metadata, including training corpus information.

**tokenizer.sha256**
Checksum file for verifying tokenizer integrity.
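Since the exact line format of `tokenizer.sha256` is not specified here, the sketch below shows only the core verification step: streaming a file through SHA-256 and comparing the digest against an expected hex string. The `sha256_of` and `verify_checksum` helper names are hypothetical, not part of this repository.

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_checksum(path, expected_hex):
    """Compare a file's SHA-256 digest against the expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()

# Example (digest elided -- read the expected hex from tokenizer.sha256):
# verify_checksum("keural_tokenizer.model", expected_hex)
```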
---

# Example Usage

### Using SentencePiece

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("keural_tokenizer.model")

tokens = sp.encode("Keural is a foundation model.", out_type=int)
print(tokens)
```

### Using Hugging Face Transformers

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mkd-ai/keural-tokenizer")

tokens = tokenizer("Keural foundation model tokenizer example")
print(tokens)
```

---

# Model Compatibility

This tokenizer is used for training the **Keural Foundation Model**, which uses the following architecture:

| Parameter                | Value                          |
| ------------------------ | ------------------------------ |
| Architecture             | Transformer Mixture-of-Experts |
| Hidden Size              | 4096                           |
| Layers                   | 32                             |
| Attention Heads          | 32                             |
| Experts per Layer        | 32                             |
| Active Experts per Token | 4                              |
| Context Length           | 4096 (scalable)                |
| Vocabulary Size          | 131,072                        |

Estimated model capacity:

* Total parameters: ~120B
* Active parameters per token: ~13B

---

# Context Length Roadmap

The Keural model is designed to scale context length progressively using **YaRN positional scaling**.

| Stage   | Context Length |
| ------- | -------------- |
| Stage 1 | 4,096          |
| Stage 2 | 8,192          |
| Stage 3 | 32,768         |
| Stage 4 | 131,072        |
| Stage 5 | 262,144        |
| Stage 6 | 524,288        |
| Stage 7 | 1,048,576      |

This staged context expansion enables efficient training while supporting ultra-long-context inference.

---

# Training Pipeline

The tokenizer was trained as part of the Keural dataset pipeline, which includes:

* Streaming dataset ingestion
* Text normalization and cleaning
* Multithreaded tokenization
* Domain-based token balancing
* Fault-tolerant dataset checkpointing
* Large-scale corpus collection

The dataset preparation pipeline is available in the Keural model repository.

---

# Roadmap

The Keural project roadmap includes the following stages.
### Stage 1 — Tokenizer Development

* Multilingual tokenizer training
* Vocabulary optimization
* Token coverage validation

### Stage 2 — Dataset Preparation

* Large-scale corpus collection
* Domain balancing
* Token budget enforcement

### Stage 3 — Foundation Model Training

* Mixture-of-Experts transformer architecture
* Long-context support
* Distributed GPU training

### Stage 4 — Instruction Tuning

* Alignment with instruction datasets
* Conversational fine-tuning
* Domain adaptation

### Stage 5 — Deployment

* vLLM inference support
* Enterprise deployment
* Retrieval-augmented reasoning

---

# Hardware Environment

Tokenizer development and dataset processing were performed on a high-performance server environment:

* CPU: 32 cores
* RAM: ~480 GB
* Storage: NVMe SSD
* GPU: 2× H200-class GPUs (planned for model training; not yet used)

---

# License

This tokenizer is part of the **Keural Foundation Model project**. Usage and distribution may be subject to project licensing terms.

---

# Organization

Developed by **MKD Corp AI Research**
Republic of Korea

---

# Citation

If you use the Keural tokenizer in research, please cite the Keural project repository.

```bibtex
@misc{keural_tokenizer,
  title={Keural Tokenizer},
  author={MKD Corp AI Research, Md. Najmul Hossain},
  year={2026},
  url={https://huggingface.co/mkd-ai/keural-tokenizer}
}
```