# Keural Tokenizer

**Keural Tokenizer** is the official tokenizer used for training the **Keural Foundation Model**, a large-scale Mixture-of-Experts (MoE) language model architecture designed for enterprise AI, long-context reasoning, and multilingual language understanding.

This repository provides the tokenizer used during the **pretraining stage of the Keural model**, including the configuration files, vocabulary, and metadata required to reproduce tokenization behavior during training and inference.

---

# Overview

Large language models rely heavily on efficient tokenization. The Keural tokenizer was designed with the following goals:

* Efficient token representation for large-scale training
* Balanced multilingual support
* Compatibility with scientific, web, and code corpora
* High vocabulary capacity for long-context modeling
* Robust normalization and byte-fallback support

The tokenizer was trained using the **SentencePiece Unigram model** on a curated multilingual corpus.

---

# Tokenizer Specifications

| Property        | Value                 |
| --------------- | --------------------- |
| Tokenizer Type  | SentencePiece Unigram |
| Vocabulary Size | 131,072 tokens        |
| Normalization   | NFKC                  |
| Byte Fallback   | Enabled               |
| Digit Splitting | Enabled               |
| Unknown Token   | `<unk>`               |
| Padding Token   | `<pad>`               |
| BOS Token       | `<s>`                 |
| EOS Token       | `</s>`                |

The tokenizer supports multilingual text including:

* English
* Korean
* Scientific documents
* Literature
* Programming languages
* Web-scale data

---

# Training Corpus

The tokenizer was trained on a **54.77 GB multilingual corpus** spanning multiple domains to ensure robust token coverage.
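Token coverage on a text sample can be spot-checked with a small script. The sketch below is illustrative only: it uses plain whitespace splitting as a stand-in for the real SentencePiece model, and the `coverage_stats` helper and the `<unk>` symbol name are assumptions, not artifacts of this repository.

```python
import unicodedata

def coverage_stats(text, tokenize):
    """Rough coverage metrics for a text sample: characters per token
    and the fraction of tokens that fall back to the unknown symbol."""
    # NFKC normalization mirrors the tokenizer's configured normalizer.
    normalized = unicodedata.normalize("NFKC", text)
    tokens = tokenize(normalized)
    unk_rate = tokens.count("<unk>") / max(len(tokens), 1)
    chars_per_token = len(normalized) / max(len(tokens), 1)
    return {"chars_per_token": chars_per_token, "unk_rate": unk_rate}

# Whitespace split as a stand-in; a real check would pass something like
# lambda t: sp.encode(t, out_type=str) backed by keural_tokenizer.model.
stats = coverage_stats("Keural is a foundation model.", str.split)
print(stats)
```

A low unknown-token rate and a stable characters-per-token ratio across domains are quick signals that the vocabulary covers the corpus well.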
### Domain Distribution

| Domain              | Description                    |
| ------------------- | ------------------------------ |
| Web Text            | Large-scale English web corpus |
| Scientific Papers   | ArXiv and PubMed datasets      |
| Literature          | PG19 and BookCorpus            |
| Wikipedia           | Clean Korean Wikipedia         |
| Source Code         | Large-scale code repositories  |
| Korean Web Data     | Korean web text corpora        |
| Multilingual Corpus | CC100 Korean                   |

The dataset pipeline was designed to reduce noise while preserving linguistic diversity across domains.

---

# Tokenizer Files

This repository contains the following tokenizer artifacts:

```text
keural_tokenizer.model
keural_tokenizer.vocab
tokenizer_config.json
tokenizer_metadata.json
tokenizer.sha256
```

### File Descriptions

**keural_tokenizer.model**
Binary SentencePiece model used for tokenization.

**keural_tokenizer.vocab**
Vocabulary file mapping tokens to IDs.

**tokenizer_config.json**
Tokenizer configuration used during model training.

**tokenizer_metadata.json**
Metadata, including training corpus information.

**tokenizer.sha256**
Checksum file for verifying tokenizer integrity.
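Since the exact line format of `tokenizer.sha256` is not specified here, the sketch below shows only the core verification step: streaming a file through SHA-256 and comparing the digest against an expected hex string. The `sha256_of` and `verify_checksum` helper names are hypothetical, not part of this repository.

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_checksum(path, expected_hex):
    """Compare a file's SHA-256 digest against the expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()

# Example (digest elided -- read the expected hex from tokenizer.sha256):
# verify_checksum("keural_tokenizer.model", expected_hex)
```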
---

# Example Usage

### Using SentencePiece

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("keural_tokenizer.model")

tokens = sp.encode("Keural is a foundation model.", out_type=int)
print(tokens)
```

### Using Hugging Face Transformers

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mkd-ai/keural-tokenizer")

tokens = tokenizer("Keural foundation model tokenizer example")
print(tokens)
```

---

# Model Compatibility

This tokenizer is used for training the **Keural Foundation Model**, which uses the following architecture:

| Parameter                | Value                          |
| ------------------------ | ------------------------------ |
| Architecture             | Transformer Mixture-of-Experts |
| Hidden Size              | 4096                           |
| Layers                   | 32                             |
| Attention Heads          | 32                             |
| Experts per Layer        | 32                             |
| Active Experts per Token | 4                              |
| Context Length           | 4096 (scalable)                |
| Vocabulary Size          | 131,072                        |

Estimated model capacity:

* Total parameters: ~120B
* Active parameters per token: ~13B

---

# Context Length Roadmap

The Keural model is designed to scale context length progressively using **YaRN positional scaling**.

| Stage   | Context Length |
| ------- | -------------- |
| Stage 1 | 4,096          |
| Stage 2 | 8,192          |
| Stage 3 | 32,768         |
| Stage 4 | 131,072        |
| Stage 5 | 262,144        |
| Stage 6 | 524,288        |
| Stage 7 | 1,048,576      |

This staged context expansion enables efficient training while supporting ultra-long-context inference.

---

# Training Pipeline

The tokenizer was trained as part of the Keural dataset pipeline, which includes:

* Streaming dataset ingestion
* Text normalization and cleaning
* Multithreaded tokenization
* Domain-based token balancing
* Fault-tolerant dataset checkpointing
* Large-scale corpus collection

The dataset preparation pipeline is available in the Keural model repository.

---

# Roadmap

The Keural project roadmap includes the following stages.
### Stage 1 — Tokenizer Development

* Multilingual tokenizer training
* Vocabulary optimization
* Token coverage validation

### Stage 2 — Dataset Preparation

* Large-scale corpus collection
* Domain balancing
* Token budget enforcement

### Stage 3 — Foundation Model Training

* Mixture-of-Experts transformer architecture
* Long-context support
* Distributed GPU training

### Stage 4 — Instruction Tuning

* Alignment with instruction datasets
* Conversational fine-tuning
* Domain adaptation

### Stage 5 — Deployment

* vLLM inference support
* Enterprise deployment
* Retrieval-augmented reasoning

---

# Hardware Environment

Tokenizer development and dataset processing were performed on a high-performance server environment:

* CPU: 32 cores
* RAM: ~480 GB
* Storage: NVMe SSD
* GPU: 2× H200-class GPUs (planned for model training; not yet used)

---

# License

This tokenizer is part of the **Keural Foundation Model project**. Usage and distribution may be subject to project licensing terms.

---

# Organization

Developed by **MKD Corp AI Research**
Republic of Korea

---

# Citation

If you use the Keural tokenizer in research, please cite the Keural project repository.

```bibtex
@misc{keural_tokenizer,
  title={Keural Tokenizer},
  author={MKD Corp AI Research, Md. Najmul Hossain},
  year={2026},
  url={https://huggingface.co/mkd-ai/keural-tokenizer}
}
```