| # Keural Tokenizer |
|
|
| **Keural Tokenizer** is the official tokenizer used for training the **Keural Foundation Model**, a large-scale Mixture-of-Experts (MoE) language model architecture designed for enterprise AI, long-context reasoning, and multilingual language understanding. |
|
|
| This repository provides the tokenizer used during the **pretraining stage of the Keural model**, including configuration files, vocabulary, and metadata required to reproduce tokenization behavior during training and inference. |
|
|
| --- |
|
|
| # Overview |
|
|
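| Tokenization efficiency directly affects how much text fits into a model's context window and how many tokens the model must process during training and inference. |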
| The Keural tokenizer was designed with the following goals: |
|
|
| * Efficient token representation for large-scale training |
| * Balanced multilingual support |
| * Compatibility with scientific, web, and code corpora |
| * High vocabulary capacity for long-context modeling |
| * Robust normalization and byte fallback support |
|
|
| The tokenizer was trained using the **SentencePiece Unigram model** on a curated multilingual corpus. |
|
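As a rough illustration of how the settings listed under Tokenizer Specifications map onto SentencePiece trainer options, the sketch below builds the argument set for `SentencePieceTrainer.train`. It is not the project's actual training script: the corpus path, output prefix, special-token IDs, and `character_coverage` value are assumptions chosen for the example.

```python
def build_trainer_args(corpus_path: str, model_prefix: str) -> dict:
    """Map the documented tokenizer settings onto SentencePiece trainer options."""
    return dict(
        input=corpus_path,            # hypothetical path: plain text, one sentence per line
        model_prefix=model_prefix,    # hypothetical prefix for the .model/.vocab outputs
        model_type="unigram",         # SentencePiece Unigram model
        vocab_size=131072,            # vocabulary size from the specification table
        normalization_rule_name="nfkc",
        byte_fallback=True,           # unseen characters decompose into byte pieces
        split_digits=True,            # every digit becomes its own piece
        unk_piece="<unk>",
        pad_piece="<pad>",
        bos_piece="<bos>",
        eos_piece="<eos>",
        # Assumed IDs: the README does not state them. SentencePiece disables
        # the padding token by default (pad_id=-1), so it must be set explicitly.
        unk_id=0,
        bos_id=1,
        eos_id=2,
        pad_id=3,
        character_coverage=0.9995,    # assumption, not taken from the README
    )


def train_tokenizer(corpus_path: str, model_prefix: str) -> None:
    """Run training; sentencepiece is imported lazily so the helper above has no dependencies."""
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(**build_trainer_args(corpus_path, model_prefix))


if __name__ == "__main__":
    args = build_trainer_args("corpus.txt", "keural_tokenizer")
    print(args["model_type"], args["vocab_size"])
```

Keeping the arguments in a plain dictionary makes it easy to log or diff the configuration before starting a long training run.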
|
| --- |
|
|
| # Tokenizer Specifications |
|
|
| | Property | Value | |
| | --------------- | --------------------- | |
| | Tokenizer Type | SentencePiece Unigram | |
| | Vocabulary Size | 131072 tokens | |
| | Normalization | NFKC | |
| | Byte Fallback | Enabled | |
| | Digit Splitting | Enabled | |
| | Unknown Token | `<unk>` | |
| | Padding Token | `<pad>` | |
| | BOS Token | `<bos>` | |
| | EOS Token | `<eos>` | |
|
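To make the normalization, digit-splitting, and byte-fallback rows above more concrete, here is a small standalone sketch of what each setting means. It uses only the Python standard library and approximates the behavior; the real tokenizer applies these steps inside SentencePiece, and the helper names below are hypothetical.

```python
import re
import unicodedata


def nfkc(text: str) -> str:
    """Apply Unicode NFKC normalization (compatibility forms fold to canonical ones)."""
    return unicodedata.normalize("NFKC", text)


def split_digits(text: str) -> list:
    """Simplified view of digit splitting: each digit is isolated as its own piece."""
    return re.findall(r"\d|\D+", text)


def byte_fallback_pieces(char: str) -> list:
    """Format a character's UTF-8 bytes the way SentencePiece names byte pieces."""
    return [f"<0x{b:02X}>" for b in char.encode("utf-8")]


if __name__ == "__main__":
    print(nfkc("Ｋｅｕｒａｌ ﬁle"))        # full-width letters and the "fi" ligature fold to ASCII
    print(split_digits("year 2026"))    # ['year ', '2', '0', '2', '6']
    print(byte_fallback_pieces("€"))    # ['<0xE2>', '<0x82>', '<0xAC>']
```

Byte fallback only applies to characters that have no piece in the vocabulary, so common characters are never split this way in practice.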
|
| The tokenizer covers the following languages and content types: |
|
|
| * English |
| * Korean |
| * Scientific documents |
| * Literature |
| * Programming languages |
| * Web-scale data |
|
|
| --- |
|
|
| # Training Corpus |
|
|
| The tokenizer was trained on a **54.77 GB multilingual corpus** consisting of multiple domains to ensure robust token coverage. |
|
|
| ### Domain Distribution |
|
|
| | Domain | Description | |
| | ------------------- | ------------------------------ | |
| | Web Text | Large-scale English web corpus | |
| | Scientific Papers | ArXiv and PubMed datasets | |
| | Literature | PG19 and BookCorpus | |
| | Wikipedia | Clean Korean Wikipedia | |
| | Source Code | Large-scale code repositories | |
| | Korean Web Data | Korean web text corpora | |
| | Multilingual Corpus | CC100 Korean | |
|
|
| The dataset pipeline was designed to reduce noise while preserving linguistic diversity across domains. |
|
|
| --- |
|
|
| # Tokenizer Files |
|
|
| This repository contains the following tokenizer artifacts: |
|
|
| ```text |
| keural_tokenizer.model |
| keural_tokenizer.vocab |
| tokenizer_config.json |
| tokenizer_metadata.json |
| tokenizer.sha256 |
| ``` |
|
|
| ### File Description |
|
|
| **keural_tokenizer.model** |
| Binary SentencePiece tokenizer model used for tokenization. |
| |
| **keural_tokenizer.vocab** |
| Vocabulary mapping tokens to IDs. |
|
|
| **tokenizer_config.json** |
| Tokenizer configuration used during model training. |
| |
| **tokenizer_metadata.json** |
| Metadata including training corpus information. |
|
|
| **tokenizer.sha256** |
| Checksum file for verifying tokenizer integrity. |
|
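One way to check the downloaded artifacts against `tokenizer.sha256` is sketched below. The layout of the checksum file is not documented here, so the code assumes the common `sha256sum` format (one `<hex digest>  <filename>` entry per line); adjust the parsing if the file is structured differently.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large model files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(checksum_file: Path) -> dict:
    """Return {filename: bool} for every entry in a sha256sum-style file (assumed format)."""
    results = {}
    base = checksum_file.parent
    for line in checksum_file.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.strip().lstrip("*")  # sha256sum prefixes binary-mode names with '*'
        results[name] = sha256_of(base / name) == expected.lower()
    return results


if __name__ == "__main__":
    checksum_path = Path("tokenizer.sha256")
    if checksum_path.exists():
        for name, ok in verify_checksums(checksum_path).items():
            print(f"{name}: {'OK' if ok else 'MISMATCH'}")
```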
|
| --- |
|
|
| # Example Usage |
|
|
| ### Using SentencePiece |
|
|
| ```python |
| import sentencepiece as spm |
| |
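| # Load the binary SentencePiece model shipped in this repository. |
| sp = spm.SentencePieceProcessor() |
| sp.load("keural_tokenizer.model") |
| |
| # Encode text to token IDs (pass out_type=str to see the pieces instead). |
| token_ids = sp.encode("Keural is a foundation model.", out_type=int) |
| print(token_ids) |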
| ``` |
|
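The vocabulary can also be inspected without loading the model. The sketch below assumes `keural_tokenizer.vocab` follows the standard SentencePiece export layout, where each line holds a piece and its score separated by a tab and the zero-based line index is the token ID. The `load_vocab` helper is hypothetical and not part of this repository.

```python
def load_vocab(vocab_path: str) -> dict:
    """Build a piece -> ID mapping from a SentencePiece-style .vocab file."""
    piece_to_id = {}
    with open(vocab_path, encoding="utf-8") as fh:
        for token_id, line in enumerate(fh):
            # Split on the last tab so the score column is dropped.
            piece, sep, _score = line.rstrip("\n").rpartition("\t")
            if not sep:  # tolerate lines without a score column
                piece = line.rstrip("\n")
            piece_to_id[piece] = token_id
    return piece_to_id


if __name__ == "__main__":
    import os

    if os.path.exists("keural_tokenizer.vocab"):
        vocab = load_vocab("keural_tokenizer.vocab")
        print(len(vocab), "pieces")
```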
|
| ### Using Hugging Face Transformers |
|
|
| ```python |
| from transformers import AutoTokenizer |
| |
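| # Download the tokenizer files from the Hugging Face Hub. |
| tokenizer = AutoTokenizer.from_pretrained("mkd-ai/keural-tokenizer") |
| |
| # Calling the tokenizer returns a dict-like encoding; print its token IDs. |
| encoding = tokenizer("Keural foundation model tokenizer example") |
| print(encoding["input_ids"]) |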
| ``` |
|
|
| --- |
|
|
| # Model Compatibility |
|
|
| This tokenizer is used for training the **Keural Foundation Model**, which uses the following architecture: |
|
|
| | Parameter | Value | |
| | ------------------------ | ------------------------------ | |
| | Architecture | Transformer Mixture-of-Experts | |
| | Hidden Size | 4096 | |
| | Layers | 32 | |
| | Attention Heads | 32 | |
| | Experts per Layer | 32 | |
| | Active Experts per Token | 4 | |
| | Context Length | 4096 (scalable) | |
| | Vocabulary Size | 131072 | |
|
|
| Estimated model capacity: |
|
|
| * Total parameters: ~120B |
| * Active parameters per token: ~13B |
|
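A few quantities follow directly from the table above and can be checked with simple arithmetic. The sketch below computes the size of the token embedding matrix and the fraction of experts routed per token. The per-head dimension assumes the usual `hidden_size / attention_heads` split, which is not stated in this README, and the totals in the list above cannot be reproduced without the feed-forward dimensions.

```python
VOCAB_SIZE = 131072
HIDDEN_SIZE = 4096
ATTENTION_HEADS = 32
EXPERTS_PER_LAYER = 32
ACTIVE_EXPERTS = 4

# One embedding vector of HIDDEN_SIZE values per vocabulary entry.
embedding_params = VOCAB_SIZE * HIDDEN_SIZE

# Share of each layer's experts that process a given token.
active_expert_fraction = ACTIVE_EXPERTS / EXPERTS_PER_LAYER

# Assumption: heads evenly partition the hidden dimension.
head_dim = HIDDEN_SIZE // ATTENTION_HEADS

if __name__ == "__main__":
    print(f"embedding parameters: {embedding_params:,}")      # 536,870,912
    print(f"active expert fraction: {active_expert_fraction}")  # 0.125
    print(f"head dimension (assumed): {head_dim}")            # 128
```

If the output projection is not tied to the input embedding (also not stated here), the vocabulary accounts for twice that many parameters.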
|
| --- |
|
|
| # Context Length Roadmap |
|
|
| The Keural model is designed to scale context length progressively using **YaRN positional scaling**. |
|
|
| | Stage | Context Length | |
| | ------- | -------------- | |
| | Stage 1 | 4096 | |
| | Stage 2 | 8192 | |
| | Stage 3 | 32768 | |
| | Stage 4 | 131072 | |
| | Stage 5 | 262144 | |
| | Stage 6 | 524288 | |
| | Stage 7 | 1048576 | |
|
|
| This staged context expansion enables efficient training while supporting ultra-long context inference. |
|
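To give a sense of the scaling involved, the sketch below computes the YaRN scale factor `s = target_length / base_length` for each stage, taking 4096 as the base. It also evaluates the attention scaling `0.1 * ln(s) + 1` that the YaRN paper recommends for LLaMA-family models. Both choices are assumptions: this README does not say whether each stage scales from 4096 or from the previous stage, nor which attention factor Keural uses.

```python
import math

BASE_CONTEXT = 4096
STAGE_CONTEXT = {1: 4096, 2: 8192, 3: 32768, 4: 131072, 5: 262144, 6: 524288, 7: 1048576}


def yarn_scale(target_length: int, base_length: int = BASE_CONTEXT) -> float:
    """Context extension ratio s used by YaRN."""
    return target_length / base_length


def yarn_attention_factor(scale: float) -> float:
    """Attention scaling from the YaRN paper's fit for LLaMA models (assumed to apply here)."""
    if scale <= 1.0:
        return 1.0  # no extension, no extra scaling
    return 0.1 * math.log(scale) + 1.0


if __name__ == "__main__":
    for stage, length in STAGE_CONTEXT.items():
        s = yarn_scale(length)
        print(f"Stage {stage}: s = {s:g}, attention factor = {yarn_attention_factor(s):.3f}")
```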
|
| --- |
|
|
| # Training Pipeline |
|
|
| The tokenizer was trained as part of the Keural dataset pipeline, which includes: |
|
|
| * Streaming dataset ingestion |
| * Text normalization and cleaning |
| * Multithreaded tokenization |
| * Domain-based token balancing |
| * Fault-tolerant dataset checkpointing |
| * Large-scale corpus collection |
|
|
| The dataset preparation pipeline is available in the Keural model repository. |
|
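As an illustration of what domain-based token balancing could look like, the sketch below filters a stream of `(domain, text)` pairs so that no domain exceeds a fixed token budget. This is a hypothetical example, not the code from the Keural model repository: the function name, the budgets, and the whitespace token counter are all stand-ins.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Iterator, Tuple


def enforce_token_budgets(
    stream: Iterable[Tuple[str, str]],
    budgets: Dict[str, int],
    count_tokens: Callable[[str], int],
) -> Iterator[Tuple[str, str]]:
    """Yield documents in order, skipping any that would push a domain past its budget."""
    used = Counter()
    for domain, text in stream:
        budget = budgets.get(domain)
        if budget is None:
            continue  # domain without a budget is dropped
        n_tokens = count_tokens(text)
        if used[domain] + n_tokens > budget:
            continue  # budget exhausted for this domain
        used[domain] += n_tokens
        yield domain, text


if __name__ == "__main__":
    docs = [("web", "a b c"), ("web", "d e f"), ("code", "x y")]
    budgets = {"web": 4, "code": 5}  # hypothetical per-domain token budgets
    kept = list(enforce_token_budgets(docs, budgets, lambda t: len(t.split())))
    print(kept)  # [('web', 'a b c'), ('code', 'x y')]
```

Because the function is a generator, it works with streaming ingestion and never needs the whole corpus in memory.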
|
| --- |
|
|
| # Roadmap |
|
|
| The Keural project roadmap includes the following stages. |
|
|
| ### Stage 1: Tokenizer Development |
|
|
| * Multilingual tokenizer training |
| * Vocabulary optimization |
| * Token coverage validation |
|
|
| ### Stage 2: Dataset Preparation |
|
|
| * Large-scale corpus collection |
| * Domain balancing |
| * Token budget enforcement |
|
|
| ### Stage 3: Foundation Model Training |
|
|
| * Mixture-of-Experts transformer architecture |
| * Long-context support |
| * Distributed GPU training |
|
|
| ### Stage 4: Instruction Tuning |
|
|
| * Alignment with instruction datasets |
| * Conversational fine-tuning |
| * Domain adaptation |
|
|
| ### Stage 5: Deployment |
|
|
| * vLLM inference support |
| * Enterprise deployment |
| * Retrieval-augmented reasoning |
|
|
| --- |
|
|
| # Hardware Environment |
|
|
| Tokenizer development and dataset processing were performed on a high-performance server environment: |
|
|
| * CPU: 32 cores |
| * RAM: ~480 GB |
| * Storage: NVMe SSD |
| * GPU: 2 H200-class GPUs allocated for model training (not yet in use) |
|
|
| --- |
|
|
| # License |
|
|
| This tokenizer is part of the **Keural Foundation Model project**. |
|
|
| Usage and distribution may be subject to project licensing terms. |
|
|
| --- |
|
|
| # Organization |
|
|
| Developed by |
|
|
| **MKD Corp AI Research** |
|
|
| Republic of Korea |
|
|
| --- |
|
|
| # Citation |
|
|
| If you use the Keural tokenizer in research, please cite the Keural project repository. |
|
|
| ```bibtex |
| @misc{keural_tokenizer, |
| title={Keural Tokenizer}, |
| author={{MKD Corp AI Research} and Hossain, Md. Najmul}, |
| year={2026}, |
| url={https://huggingface.co/mkd-ai/keural-tokenizer} |
| } |
| ``` |
|
|