---
language:
- th
- en
license: cc-by-nc-4.0
library_name: transformers
pipeline_tag: text-generation
tags:
- thai
- text-generation
- Hanuman
- pytorch
- reasoning
datasets:
- HelpingAI/Dhanishtha-2.0-SUPERTHINKER
- HuggingFaceH4/no_robots
widget:
- text: Hello
example_title: Simple greeting
- text: Thailand is located in
example_title: Geography
- text: Artificial intelligence technology is
example_title: Technology
inference:
parameters:
max_length: 100
temperature: 0.7
top_p: 0.9
do_sample: true
model-index:
- name: ZombitX64/Hanuman
results: []
---
# Hanuman
<div align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/673eef9c4edfc6d3b58ba3aa/KTtdrLMU89iCuMU9jzuhL.png" width="300" alt="Hanuman">
<strong>Hanuman – A Small Language Model for Thai</strong>
<em>Tokenizer advisor: <a href="https://huggingface.co/KoichiYasuoka">Koichi Yasuoka</a></em>
<a href="https://creativecommons.org/licenses/by-nc/4.0/"><img src="https://img.shields.io/badge/License-CC_BY--NC_4.0-lightgrey.svg"></a>
<a href="https://huggingface.co/JonusNattapong/Hanuman"><img src="https://img.shields.io/badge/🤗%20HF-Model-yellow"></a>
</div>
---
## Model Details
### Overview
- **Name**: Hanuman
- **Language**: Thai (th)
- **Task**: Text Generation (Causal LM)
- **Framework**: PyTorch + ๐ค Transformers
- **License**: CC BY-NC 4.0 (Non-commercial use only)
### Training Datasets
- [HelpingAI/Dhanishtha-2.0-SUPERTHINKER](https://huggingface.co/datasets/HelpingAI/Dhanishtha-2.0-SUPERTHINKER)
- [HuggingFaceH4/no_robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots)
### Architecture
- Custom tokenizer for the Thai language (handles whitespace, newlines, and tabs via special tokens such as `<NL>`, `<SPACE>`, and `<TAB>`)
---
## Intended Use
### Primary Use Cases
- Thai text generation (blogs, articles, captions, chatbots)
- Creative and reasoning-oriented text assistance
- Thai NLP research
### Limitations
- This model is **research-oriented** and may require additional fine-tuning for production use.
- May generate incorrect or biased outputs. Human verification is recommended.
---
## Tokenizer & Context
- Custom fast tokenizer (no `trust_remote_code` needed)
- Ensures **round-trip encode/decode correctness**
- Unicode NFC normalization included
- Handles Thai–Latin spacing consistently
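The NFC normalization and round-trip properties above can be sanity-checked with a small script. The snippet below uses only the standard library to illustrate what NFC recomposition does; the commented lines sketch the round-trip check against the actual tokenizer (requires downloading the model):

```python
import unicodedata

def normalize_nfc(text: str) -> str:
    # The tokenizer applies Unicode NFC normalization before encoding
    return unicodedata.normalize("NFC", text)

# NFC recomposes decomposed sequences into their canonical form,
# e.g. "e" + combining acute accent becomes the single code point "é"
assert normalize_nfc("e\u0301") == "\u00e9"

# Round-trip check against the tokenizer (needs network access):
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("ZombitX64/Hanuman")
# text = "สวัสดีครับ Hello"
# assert tokenizer.decode(tokenizer.encode(text), skip_special_tokens=True) == text
```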
---
## Usage Examples
### Basic Text Generation
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "ZombitX64/Hanuman"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

def generate_thai_text(prompt, max_length=100):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=max_length,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generate_thai_text("Artificial intelligence technology"))
```
### Batch Processing
```python
prompts = ["Hello", "Thailand has an area of", "Education in the digital era"]

for p in prompts:
    print(generate_thai_text(p, max_length=80))
    print("-" * 50)
```
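The `top_p=0.9` setting in the examples above enables nucleus sampling: at each step the model samples only from the smallest set of tokens whose cumulative probability reaches 0.9. A minimal pure-Python sketch of that filtering step (illustrative only, not the library's internal implementation):

```python
def top_p_filter(probs, top_p=0.9):
    # Keep the smallest set of token indices whose cumulative
    # probability (in descending order) reaches top_p
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    kept, cum = [], 0.0
    for idx, p in ranked:
        kept.append(idx)
        cum += p
        if cum >= top_p:
            break
    return kept

# With probs [0.5, 0.3, 0.15, 0.05], the nucleus at top_p=0.9
# contains the three most likely tokens
print(top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.9))  # [0, 1, 2]
```

Sampling is then restricted to the kept indices after renormalizing their probabilities.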
---
## Training Process
### Dataset Preparation
* Source: Wikipedia Thai and reasoning-style datasets
* Preprocessing: Cleaning, Unicode normalization, tokenization
* Training mode: Streaming
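The exact preprocessing pipeline is not published; the sketch below is an illustrative guess at what the cleaning and normalization step might look like (`clean_text` is a hypothetical helper, not part of the released code):

```python
import unicodedata

def clean_text(text: str) -> str:
    # Illustrative cleaning step: NFC-normalize, drop control characters,
    # and collapse runs of whitespace (the actual pipeline may differ)
    text = unicodedata.normalize("NFC", text)
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cc")
    return " ".join(text.split())

print(clean_text("สวัสดี   \u0007ครับ\n"))  # "สวัสดี ครับ"
```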
### Example Training Configuration
```python
training_args = {
    "per_device_train_batch_size": 2,
    "per_device_eval_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "num_train_epochs": 2,
    "learning_rate": 5e-5,
    "warmup_steps": 10,
    "logging_steps": 10,
    "eval_steps": 50,
    "save_steps": 50,
    "fp16": False,  # CPU training
    "dataloader_num_workers": 0,
}
```
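With `per_device_train_batch_size=2` and `gradient_accumulation_steps=4`, gradients are accumulated over four micro-batches before each optimizer update, so the effective batch size is 8 examples per device per step:

```python
cfg = {
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 4,
}

# Effective batch = micro-batch size × number of accumulated micro-batches
effective_batch = cfg["per_device_train_batch_size"] * cfg["gradient_accumulation_steps"]
print(effective_batch)  # 8
```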
---
## Evaluation
The model is currently in the **research phase**.
Formal evaluation results (perplexity, Thai downstream benchmarks) will be added in the future.
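Perplexity, the metric mentioned above, is the exponential of the mean per-token negative log-likelihood; a model that is uniformly uncertain over V candidate tokens has perplexity V. A minimal reference implementation of the formula:

```python
import math

def perplexity(nll_per_token):
    # Perplexity = exp of the mean negative log-likelihood per token
    return math.exp(sum(nll_per_token) / len(nll_per_token))

# A model assigning probability 1/50 to every target token:
print(perplexity([math.log(50)] * 4))  # ≈ 50
```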
---
## Contributing
This project is part of ongoing Thai NLP research.
Feedback, issues, and contributions are welcome!
---
## Citation
```bibtex
@misc{Hanuman2025,
  title        = {Hanuman: Thai Small Language Model},
  author       = {JonusNattapong and Koichi Yasuoka},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/ZombitX64/Hanuman}},
  note         = {Tokenizer advisor: Koichi Yasuoka}
}
```
---
> ⚠️ **Disclaimer**: This model is intended for research and educational purposes only.
> Use in commercial applications requires prior permission under the CC BY-NC 4.0 license. |