---
language:
- th
- en
license: cc-by-nc-4.0
library_name: transformers
pipeline_tag: text-generation
tags:
- thai
- text-generation
- Hanuman
- pytorch
- reasoning
datasets:
- HelpingAI/Dhanishtha-2.0-SUPERTHINKER
- HuggingFaceH4/no_robots
model-index:
- name: ZombitX64/Hanuman
  results:
  - task:
      name: Text Generation
      type: text-generation
    dataset:
      name: HelpingAI/Dhanishtha-2.0-SUPERTHINKER
      type: text
    metrics: []
  - task:
      name: Text Generation
      type: text-generation
    dataset:
      name: HuggingFaceH4/no_robots
      type: text
    metrics: []
widget:
- text: Hello
  example_title: Simple greeting
- text: Thailand is located in
  example_title: Geography
- text: Artificial intelligence technology is
  example_title: Technology
inference:
  parameters:
    max_length: 100
    temperature: 0.7
    top_p: 0.9
    do_sample: true
---
# Hanuman
Hanuman: A Small Language Model for Thai
Tokenizer advisor: Koichi Yasuoka
---
## Model Details
### Overview
- **Name**: Hanuman
- **Language**: Thai (th)
- **Task**: Text Generation (Causal LM)
- **Framework**: PyTorch + 🤗 Transformers
- **License**: CC BY-NC 4.0 (Non-commercial use only)
### Training Datasets
- [HelpingAI/Dhanishtha-2.0-SUPERTHINKER](https://huggingface.co/datasets/HelpingAI/Dhanishtha-2.0-SUPERTHINKER)
- [HuggingFaceH4/no_robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots)
### Architecture
- Based on a **Small Language Model (SLM) with Mixture of Experts**
- Context length: **4,096 tokens** (extended via RoPE scaling)
- Custom tokenizer for the Thai language (handles whitespace, newlines, tabs, etc.)
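
The card says the 4,096-token context was reached via RoPE scaling but does not name the variant. As a rough illustration only, the sketch below shows linear position interpolation, one common form of RoPE scaling, in plain Python. The function names, the base of 10000, the head dimension of 8, and the 2,048-token original window are all assumptions made for the example, not values taken from this model's configuration.

```python
import math

def rope_angles(position, head_dim, base=10000.0, scale=1.0):
    """Rotation angles for one position under linear RoPE scaling.

    scale > 1 compresses positions (position / scale), so longer sequences
    are mapped back into the position range seen during training.
    """
    if head_dim % 2:
        raise ValueError("head_dim must be even")
    scaled_pos = position / scale
    return [scaled_pos * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

def apply_rope(vec, angles):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by the given angles."""
    out = []
    for i, theta in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

# Hypothetical numbers: a 2,048-token training window stretched to 4,096.
scale = 4096 / 2048
# With scaling, position 4095 gets the same angles as position 2047.5 without it.
same = rope_angles(4095, 8, scale=scale) == rope_angles(2047.5, 8)
print(same)  # prints True
```

Libraries differ in how they pair vector components for the rotation (interleaved pairs here versus split halves in some implementations), so treat this as a conceptual picture rather than a drop-in replacement.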
---
## Intended Use
### Primary Use Cases
- Thai text generation (blogs, articles, captions, chatbots)
- Creative and reasoning-oriented text assistance
- Thai NLP research
### Limitations
- This model is **research-oriented** and may require additional fine-tuning for production use.
- May generate incorrect or biased outputs. Human verification is recommended.
---
## Tokenizer & Context
- Custom fast tokenizer (no `trust_remote_code` needed)
- Ensures **round-trip encode/decode correctness**
- Unicode NFC normalization included
- Handles Thai-Latin spacing consistently
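
One way to spot-check the round-trip and NFC properties listed above is sketched below. The helpers `nfc` and `roundtrip_ok` are hypothetical names introduced for this example; `roundtrip_ok` only assumes the standard `encode`/`decode` methods of a Transformers tokenizer, and comparing against the NFC-normalized input is an assumption based on the normalization bullet above.

```python
import unicodedata

def nfc(text):
    """Return the NFC-normalized form of text."""
    return unicodedata.normalize("NFC", text)

def roundtrip_ok(tokenizer, text):
    """True if decode(encode(text)) reproduces the NFC-normalized input."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    decoded = tokenizer.decode(ids, skip_special_tokens=True)
    return decoded == nfc(text)

# NFC puts stacked Thai marks into canonical order:
# tone mark MAI EK (U+0E48) typed before SARA U (U+0E38) is reordered.
typed = "ก\u0e48\u0e38"
print(nfc(typed) == "ก\u0e38\u0e48")  # prints True
```

With the real tokenizer loaded via `AutoTokenizer.from_pretrained("ZombitX64/Hanuman")`, a call such as `roundtrip_ok(tokenizer, "ภาษาไทย and English\n\tmixed")` would exercise whitespace, newline, tab, and Thai-Latin spacing in one string.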
---
## Usage Examples
### Basic Text Generation
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "ZombitX64/Hanuman"

# Load the tokenizer and model weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

def generate_thai_text(prompt, max_length=100):
    """Sample a continuation; max_length counts prompt plus generated tokens."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=max_length,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generate_thai_text("Artificial intelligence technology"))
```
### Batch Processing
```python
prompts = ["Hello", "Thailand has an area of", "Education in the digital era"]

# Generate one prompt at a time, reusing generate_thai_text from above
for p in prompts:
    print(generate_thai_text(p, max_length=80))
    print("-" * 50)
```
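
The loop above calls the model once per prompt. If throughput matters, the prompts can instead be padded into a single batch. The helper below is a minimal sketch of that idea, not part of the original card: `generate_batch` is a hypothetical name, and reusing EOS as the pad token is an assumption that applies only if the tokenizer defines no pad token of its own.

```python
def generate_batch(model, tokenizer, prompts, max_new_tokens=80):
    """Generate continuations for several prompts in one padded call."""
    # Decoder-only models are usually left-padded so that every prompt ends
    # right before the first generated token. Note: this mutates the tokenizer.
    tokenizer.padding_side = "left"
    if tokenizer.pad_token is None:
        # Assumption: no dedicated pad token exists, so EOS is reused.
        tokenizer.pad_token = tokenizer.eos_token
    inputs = tokenizer(list(prompts), return_tensors="pt", padding=True)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        pad_token_id=tokenizer.pad_token_id,
    )
    # Decoded strings include the prompt followed by the continuation.
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

It would be called as `generate_batch(model, tokenizer, prompts)` with the `model`, `tokenizer`, and `prompts` objects defined earlier. Unlike `max_length`, `max_new_tokens` limits only the generated part, which keeps the budget independent of prompt length.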
---
## Training Process
### Dataset Preparation
* Source: Thai Wikipedia and reasoning-style datasets
* Preprocessing: Cleaning, Unicode normalization, tokenization
* Training mode: Streaming
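
The card does not publish the actual preprocessing code. As an illustration of what cleaning and Unicode normalization in a streaming setup might look like, here is a small generator-based sketch using only the standard library. The function names, the set of stripped control characters, and the `min_chars` filter are assumptions for this example.

```python
import re
import unicodedata

# Control characters except tab, newline, and carriage return.
_CONTROL = re.compile(r"[\u0000-\u0008\u000b\u000c\u000e-\u001f\u007f]")
# Runs of spaces, tabs, and no-break spaces.
_SPACES = re.compile(r"[ \t\u00a0]+")

def clean_text(text):
    """Normalize to NFC, drop control characters, collapse horizontal whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = _CONTROL.sub("", text)
    text = _SPACES.sub(" ", text)
    return text.strip()

def stream_clean(records, min_chars=1):
    """Lazily yield cleaned records, skipping those that end up too short."""
    for raw in records:
        cleaned = clean_text(raw)
        if len(cleaned) >= min_chars:
            yield cleaned

sample = ["  สวัสดี\u00a0 ครับ \x00", "   ", "e\u0301"]
for line in stream_clean(sample):
    print(repr(line))
```

Because `stream_clean` is a generator, it can wrap any iterable of strings, including a streamed dataset, without loading everything into memory first.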
### Example Training Configuration
```python
training_args = {
    "per_device_train_batch_size": 2,
    "per_device_eval_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "num_train_epochs": 2,
    "learning_rate": 5e-5,
    "warmup_steps": 10,
    "logging_steps": 10,
    "eval_steps": 50,
    "save_steps": 50,
    "fp16": False,  # CPU training
    "dataloader_num_workers": 0,
}
```
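
To make the interaction between batch size and gradient accumulation in this configuration concrete, the sketch below computes the effective batch size and an approximate number of optimizer steps per epoch. The helper names, the single-device default, and the 10,000-example dataset size are hypothetical; with a streaming dataset the true example count may not be known in advance.

```python
import math

# Values copied from the example configuration above.
cfg = {
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 4,
}

def effective_batch_size(cfg, num_devices=1):
    """Examples that contribute to one optimizer update."""
    return (
        cfg["per_device_train_batch_size"]
        * cfg["gradient_accumulation_steps"]
        * num_devices
    )

def optimizer_steps_per_epoch(num_examples, cfg, num_devices=1):
    """Approximate number of optimizer updates in one pass over the data."""
    return math.ceil(num_examples / effective_batch_size(cfg, num_devices))

print(effective_batch_size(cfg))               # prints 8
print(optimizer_steps_per_epoch(10_000, cfg))  # prints 1250 (hypothetical dataset size)
```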
---
## Evaluation
The model is currently in the **research phase**.
Formal evaluation results (perplexity, Thai downstream benchmarks) will be added in the future.
---
## Contributing
This project is part of ongoing Thai NLP research.
Feedback, issues, and contributions are welcome!
---
## Citation
```bibtex
@misc{Hanuman2025,
title = {Hanuman: Thai Small Language Model},
author = {JonusNattapong and Koichi Yasuoka},
year = {2025},
howpublished = {\url{https://huggingface.co/ZombitX64/Hanuman}},
note = {Tokenizer advisor: Koichi Yasuoka}
}
```
---
> ⚠️ **Disclaimer**: This model is intended for research and educational purposes only.
> Commercial use is not covered by the CC BY-NC 4.0 license and requires prior permission.