---
language:
- th
- en
license: cc-by-nc-4.0
library_name: transformers
pipeline_tag: text-generation
tags:
- thai
- text-generation
- Hanuman
- pytorch
- reasoning
datasets:
- HelpingAI/Dhanishtha-2.0-SUPERTHINKER
- HuggingFaceH4/no_robots
widget:
- text: Hello
example_title: Simple greeting
- text: Thailand is located in
example_title: Geography
- text: Artificial intelligence technology is
example_title: Technology
inference:
parameters:
max_length: 100
temperature: 0.7
top_p: 0.9
do_sample: true
model-index:
- name: ZombitX64/Hanuman
results: []
---
# Hanuman
<div align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/673eef9c4edfc6d3b58ba3aa/KTtdrLMU89iCuMU9jzuhL.png" width="300" alt="Hanuman">
<strong>Hanuman – A Small Language Model for Thai</strong>
<em>Tokenizer advisor: <a href="https://huggingface.co/KoichiYasuoka">Koichi Yasuoka</a></em>
<a href="https://creativecommons.org/licenses/by-nc/4.0/"><img src="https://img.shields.io/badge/License-CC_BY--NC_4.0-lightgrey.svg"></a>
<a href="https://huggingface.co/JonusNattapong/Hanuman"><img src="https://img.shields.io/badge/🤗%20HF-Model-yellow"></a>
</div>
---
## Model Details
### Overview
- **Name**: Hanuman
- **Language**: Thai (th)
- **Task**: Text Generation (Causal LM)
- **Framework**: PyTorch + ๐ค Transformers
- **License**: CC BY-NC 4.0 (Non-commercial use only)
### Training Datasets
- [HelpingAI/Dhanishtha-2.0-SUPERTHINKER](https://huggingface.co/datasets/HelpingAI/Dhanishtha-2.0-SUPERTHINKER)
- [HuggingFaceH4/no_robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots)
### Architecture
- Custom tokenizer for the Thai language (handles whitespace, newlines, and tabs via special tokens such as `<NL>`, `<SPACE>`, and `<TAB>`)
---
## Intended Use
### Primary Use Cases
- Thai text generation (blogs, articles, captions, chatbots)
- Creative and reasoning-oriented text assistance
- Thai NLP research
### Limitations
- This model is **research-oriented** and may require additional fine-tuning for production use.
- May generate incorrect or biased outputs. Human verification is recommended.
---
## Tokenizer & Context
- Custom fast tokenizer (no `trust_remote_code` needed)
- Ensures **round-trip encode/decode correctness**
- Unicode NFC normalization included
- Handles Thai–Latin spacing consistently
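The NFC normalization and round-trip properties above can be sanity-checked with a small script. The snippet below uses only the standard library to illustrate what NFC recomposition does; the commented lines sketch the round-trip check against the actual tokenizer (requires downloading the model):

```python
import unicodedata

def normalize_nfc(text: str) -> str:
    # The tokenizer applies Unicode NFC normalization before encoding
    return unicodedata.normalize("NFC", text)

# NFC recomposes decomposed sequences into their canonical form,
# e.g. "e" + combining acute accent becomes the single code point "é"
assert normalize_nfc("e\u0301") == "\u00e9"

# Round-trip check against the tokenizer (needs network access):
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("ZombitX64/Hanuman")
# text = "สวัสดีครับ Hello"
# assert tokenizer.decode(tokenizer.encode(text), skip_special_tokens=True) == text
```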
---
## Usage Examples
### Basic Text Generation
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "ZombitX64/Hanuman"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

def generate_thai_text(prompt, max_length=100):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=max_length,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generate_thai_text("Artificial intelligence technology"))
```
### Batch Processing
```python
prompts = ["Hello", "Thailand has an area of", "Education in the digital era"]

for p in prompts:
    print(generate_thai_text(p, max_length=80))
    print("-" * 50)
```
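The `top_p=0.9` setting in the examples above enables nucleus sampling: at each step the model samples only from the smallest set of tokens whose cumulative probability reaches 0.9. A minimal pure-Python sketch of that filtering step (illustrative only, not the library's internal implementation):

```python
def top_p_filter(probs, top_p=0.9):
    # Keep the smallest set of token indices whose cumulative
    # probability (in descending order) reaches top_p
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    kept, cum = [], 0.0
    for idx, p in ranked:
        kept.append(idx)
        cum += p
        if cum >= top_p:
            break
    return kept

# With probs [0.5, 0.3, 0.15, 0.05], the nucleus at top_p=0.9
# contains the three most likely tokens
print(top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.9))  # [0, 1, 2]
```

Sampling is then restricted to the kept indices after renormalizing their probabilities.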
---
## Training Process
### Dataset Preparation
* Source: Wikipedia Thai and reasoning-style datasets
* Preprocessing: Cleaning, Unicode normalization, tokenization
* Training mode: Streaming
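The exact preprocessing pipeline is not published; the sketch below is an illustrative guess at what the cleaning and normalization step might look like (`clean_text` is a hypothetical helper, not part of the released code):

```python
import unicodedata

def clean_text(text: str) -> str:
    # Illustrative cleaning step: NFC-normalize, drop control characters,
    # and collapse runs of whitespace (the actual pipeline may differ)
    text = unicodedata.normalize("NFC", text)
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cc")
    return " ".join(text.split())

print(clean_text("สวัสดี   \u0007ครับ\n"))  # "สวัสดี ครับ"
```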
### Example Training Configuration
```python
training_args = {
    "per_device_train_batch_size": 2,
    "per_device_eval_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "num_train_epochs": 2,
    "learning_rate": 5e-5,
    "warmup_steps": 10,
    "logging_steps": 10,
    "eval_steps": 50,
    "save_steps": 50,
    "fp16": False,  # CPU training
    "dataloader_num_workers": 0,
}
```
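With `per_device_train_batch_size=2` and `gradient_accumulation_steps=4`, gradients are accumulated over four micro-batches before each optimizer update, so the effective batch size is 8 examples per device per step:

```python
cfg = {
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 4,
}

# Effective batch = micro-batch size × number of accumulated micro-batches
effective_batch = cfg["per_device_train_batch_size"] * cfg["gradient_accumulation_steps"]
print(effective_batch)  # 8
```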
---
## Evaluation
The model is currently in the **research phase**.
Formal evaluation results (perplexity, Thai downstream benchmarks) will be added in the future.
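Perplexity, the metric mentioned above, is the exponential of the mean per-token negative log-likelihood; a model that is uniformly uncertain over V candidate tokens has perplexity V. A minimal reference implementation of the formula:

```python
import math

def perplexity(nll_per_token):
    # Perplexity = exp of the mean negative log-likelihood per token
    return math.exp(sum(nll_per_token) / len(nll_per_token))

# A model assigning probability 1/50 to every target token:
print(perplexity([math.log(50)] * 4))  # ≈ 50
```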
---
## Contributing
This project is part of ongoing Thai NLP research.
Feedback, issues, and contributions are welcome!
---
## Citation
```bibtex
@misc{Hanuman2025,
  title        = {Hanuman: Thai Small Language Model},
  author       = {JonusNattapong and Koichi Yasuoka},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/ZombitX64/Hanuman}},
  note         = {Tokenizer advisor: Koichi Yasuoka}
}
```
---
> ⚠️ **Disclaimer**: This model is intended for research and educational purposes only.
> Use in commercial applications requires prior permission under the CC BY-NC 4.0 license. |