---
language:
- th
- en
license: cc-by-nc-4.0
library_name: transformers
pipeline_tag: text-generation
tags:
- thai
- text-generation
- Hanuman
- pytorch
- reasoning
datasets:
- HelpingAI/Dhanishtha-2.0-SUPERTHINKER
- HuggingFaceH4/no_robots
model-index:
- name: ZombitX64/Hanuman
  results:
  - task:
      name: Text Generation
      type: text-generation
    dataset:
      name: HelpingAI/Dhanishtha-2.0-SUPERTHINKER
      type: text
    metrics: []
  - task:
      name: Text Generation
      type: text-generation
    dataset:
      name: HuggingFaceH4/no_robots
      type: text
    metrics: []
widget:
- text: Hello
  example_title: Simple greeting
- text: Thailand is located in
  example_title: Geography
- text: Artificial intelligence technology is
  example_title: Technology
inference:
  parameters:
    max_length: 100
    temperature: 0.7
    top_p: 0.9
    do_sample: true
---
# Hanuman

<div align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/673eef9c4edfc6d3b58ba3aa/phqwy_ASNiDUo0DVqW30x.png" width="300" alt="Hanuman">

<strong>Hanuman: A Small Language Model for Thai</strong>

<em>Tokenizer advisor: <a href="https://huggingface.co/KoichiYasuoka">Koichi Yasuoka</a></em>

<a href="https://creativecommons.org/licenses/by-nc/4.0/"><img src="https://img.shields.io/badge/License-CC_BY--NC_4.0-lightgrey.svg" alt="License: CC BY-NC 4.0"></a>
<a href="https://huggingface.co/JonusNattapong/Hanuman"><img src="https://img.shields.io/badge/🤗%20HF-Model-yellow" alt="Hugging Face model"></a>
</div>

---
## 🔎 Model Details

### Overview

- **Name**: Hanuman
- **Language**: Thai (th)
- **Task**: Text generation (causal LM)
- **Framework**: PyTorch + 🤗 Transformers
- **License**: CC BY-NC 4.0 (non-commercial use only)

### Training Datasets

- [HelpingAI/Dhanishtha-2.0-SUPERTHINKER](https://huggingface.co/datasets/HelpingAI/Dhanishtha-2.0-SUPERTHINKER)
- [HuggingFaceH4/no_robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots)

### Architecture

- **Small language model (SLM)** with a **Mixture-of-Experts** design
- Context length: **4,096 tokens** (extended via RoPE scaling)
- Custom Thai tokenizer that represents whitespace, newlines, and tabs as explicit tokens such as `<SPACE>`, `<NL>`, and `<TAB>` (see the sketch below)
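To see the whitespace handling in practice, here is a minimal sketch that prints the tokens for a mixed Thai–Latin string. It only assumes the tokenizer loads via `AutoTokenizer`; the sample text is illustrative.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ZombitX64/Hanuman")

# Thai and Latin text mixed with a space, a newline, and a tab.
sample = "สวัสดี world\nนี่คือ\tตัวอย่าง"
print(tokenizer.tokenize(sample))
# Whitespace should surface as explicit tokens such as <SPACE>, <NL>, <TAB>.
```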
---
## ✅ Intended Use

### Primary Use Cases

- Thai text generation (blogs, articles, captions, chatbots)
- Creative and reasoning-oriented text assistance
- Thai NLP research

### Limitations

- The model is **research-oriented** and may require additional fine-tuning for production use.
- It may generate incorrect or biased outputs; human verification is recommended.

---

## 🧰 Tokenizer & Context

- Custom fast tokenizer (no `trust_remote_code` needed)
- Ensures **round-trip encode/decode correctness** (see the sketch below)
- Applies Unicode NFC normalization
- Handles Thai–Latin spacing consistently
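A minimal sketch of the round-trip property, assuming the tokenizer preserves text through encode/decode as stated above; the sample string is illustrative.

```python
import unicodedata

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ZombitX64/Hanuman")

text = "ภาษาไทยปนกับ English\tและขึ้นบรรทัดใหม่"
ids = tokenizer.encode(text, add_special_tokens=False)
decoded = tokenizer.decode(ids, skip_special_tokens=True)

# The card states NFC normalization is applied, so compare NFC forms.
assert decoded == unicodedata.normalize("NFC", text)
```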
---

## 🚀 Usage Examples

### Basic Text Generation
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "ZombitX64/Hanuman"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

def generate_thai_text(prompt, max_length=100):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=max_length,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generate_thai_text("Artificial intelligence technology"))
```
### Batch Processing
```python
prompts = ["Hello", "Thailand has an area of", "Education in the digital era"]
for p in prompts:
    print(generate_thai_text(p, max_length=80))
    print("-" * 50)
```
---

## 🏗️ Training Process

### Dataset Preparation

* Source: Thai Wikipedia and reasoning-style datasets
* Preprocessing: cleaning, Unicode normalization, tokenization
* Training mode: streaming (see the sketch below)
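As a reference point, here is a minimal sketch of loading one of the listed datasets in streaming mode with 🤗 Datasets; the `train` split name is an assumption.

```python
import itertools

from datasets import load_dataset

# streaming=True iterates over records without downloading the full dataset.
stream = load_dataset("HuggingFaceH4/no_robots", split="train", streaming=True)

for example in itertools.islice(stream, 2):
    print(example)
```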
### Example Training Configuration
```python
training_args = {
    "per_device_train_batch_size": 2,
    "per_device_eval_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "num_train_epochs": 2,
    "learning_rate": 5e-5,
    "warmup_steps": 10,
    "logging_steps": 10,
    "eval_steps": 50,
    "save_steps": 50,
    "fp16": False,  # CPU training
    "dataloader_num_workers": 0,
}
```
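A minimal sketch of wiring this configuration into 🤗 `Trainer`. The toy corpus and `output_dir` are illustrative, and `model` and `tokenizer` are assumed to be loaded as in the usage example above.

```python
from datasets import Dataset
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

# Tiny toy corpus so the sketch runs end to end; real training streams the
# datasets listed above.
ds = Dataset.from_dict({"text": ["สวัสดีครับ", "ประเทศไทยอยู่ในเอเชียตะวันออกเฉียงใต้"]})
ds = ds.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
    remove_columns=["text"],
)

args = TrainingArguments(output_dir="hanuman-checkpoints", **training_args)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ds,
    eval_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```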
---

## 📊 Evaluation

The model is currently in the **research phase**.
Formal evaluation results (perplexity, Thai downstream benchmarks) will be added in the future; a simple perplexity check is sketched below in the meantime.
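A minimal sketch of a perplexity measurement, assuming `model` and `tokenizer` from the usage example above; the sentence is an illustrative placeholder, not an official benchmark.

```python
import math

import torch

def perplexity(text):
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # For a causal LM, passing labels returns the mean cross-entropy loss.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

print(perplexity("ประเทศไทยมีประชากรประมาณเจ็ดสิบล้านคน"))
```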
---

## 🤝 Contributing

This project is part of ongoing Thai NLP research.
Feedback, issues, and contributions are welcome!

---

## 📄 Citation
```bibtex
@misc{Hanuman2025,
  title        = {Hanuman: Thai Small Language Model},
  author       = {JonusNattapong and Koichi Yasuoka},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/ZombitX64/Hanuman}},
  note         = {Tokenizer advisor: Koichi Yasuoka}
}
```
---

> ⚠️ **Disclaimer**: This model is intended for research and educational purposes only.
> Use in commercial applications requires prior permission under the CC BY-NC 4.0 license.