+---
+language:
+- th
+- en
+license: cc-by-nc-4.0
+library_name: transformers
+pipeline_tag: text-generation
+tags:
+- thai
+- text-generation
+- Hanuman
+- pytorch
+- reasoning
+datasets:
+- HelpingAI/Dhanishtha-2.0-SUPERTHINKER
+- HuggingFaceH4/no_robots
+model-index:
+- name: ZombitX64/Hanuman
+  results:
+  - task:
+      name: Text Generation
+      type: text-generation
+    dataset:
+      name: HelpingAI/Dhanishtha-2.0-SUPERTHINKER
+      type: text
+    metrics: []
+  - task:
+      name: Text Generation
+      type: text-generation
+    dataset:
+      name: HuggingFaceH4/no_robots
+      type: text
+    metrics: []
+widget:
+- text: Hello
+  example_title: Simple greeting
+- text: Thailand is located in
+  example_title: Geography
+- text: Artificial intelligence technology is
+  example_title: Technology
+inference:
+  parameters:
+    max_length: 100
+    temperature: 0.7
+    top_p: 0.9
+    do_sample: true
+---
+# Hanuman
+<div align="center">
+  <img src="https://cdn-uploads.huggingface.co/production/uploads/673eef9c4edfc6d3b58ba3aa/phqwy_ASNiDUo0DVqW30x.png" width="300" alt="Hanuman">
+<strong>Hanuman — A Small Language Model for Thai</strong>
+<em>Tokenizer advisor: <a href="https://huggingface.co/KoichiYasuoka">Koichi Yasuoka</a></em>
+<a href="https://creativecommons.org/licenses/by-nc/4.0/"><img src="https://img.shields.io/badge/License-CC_BY--NC_4.0-lightgrey.svg"></a>
+<a href="https://huggingface.co/JonusNattapong/Hanuman"><img src="https://img.shields.io/badge/🤗%20HF-Model-yellow"></a>
+</div>
+---
+## 🔎 Model Details
+### Overview
+- **Name**: Hanuman
+- **Language**: Thai (th)
+- **Task**: Text Generation (Causal LM)
+- **Framework**: PyTorch + 🤗 Transformers
+- **License**: CC BY-NC 4.0 (Non-commercial use only)
+### Training Datasets
+- [HelpingAI/Dhanishtha-2.0-SUPERTHINKER](https://huggingface.co/datasets/HelpingAI/Dhanishtha-2.0-SUPERTHINKER)
+- [HuggingFaceH4/no_robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots)
+### Architecture
+- A **Small Language Model (SLM) with Mixture of Experts (MoE)**
+- Context length: **4,096 tokens** (extended via RoPE scaling)
+- Custom tokenizer for Thai that handles whitespace, newlines, and tabs through dedicated tokens such as `<NL>`, `<SPACE>`, and `<TAB>`
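The model card does not say which RoPE scaling variant extends the context to 4,096 tokens. As a rough illustration only, the sketch below assumes linear position interpolation, where each position is divided by a scale factor before the rotary angles are computed. The function name `rope_angles`, the `head_dim` and `base` defaults, and the 2,048-token original length are hypothetical values chosen for the example, not settings taken from Hanuman.

```python
def rope_angles(position, head_dim=8, base=10000.0, scale=1.0):
    """Rotary angles for one token position.

    scale > 1 divides the position first (linear position interpolation),
    which is an assumed variant, not a confirmed detail of this model.
    """
    scaled_position = position / scale
    # One angle per pair of dimensions, with geometrically decreasing frequency.
    return [scaled_position / base ** (2 * i / head_dim) for i in range(head_dim // 2)]


# Hypothetical setup: trained on 2,048 positions, extended to 4,096 with scale 2.
extended = rope_angles(4095, scale=2.0)
trained_range = rope_angles(2047.5)
print(extended == trained_range)
```

Under this assumption, the last position of the extended window lands inside the position range seen during training, which is the idea behind interpolation-style scaling.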
+---
+## ✅ Intended Use
+### Primary Use Cases
+- Thai text generation (blogs, articles, captions, chatbots)
+- Creative and reasoning-oriented text assistance
+- Thai NLP research
+### Limitations
+- This model is **research-oriented** and may require additional fine-tuning for production use.
+- May generate incorrect or biased outputs. Human verification is recommended.
+---
+## 🧰 Tokenizer & Context
+- Custom fast tokenizer (no `trust_remote_code` needed)
+- Ensures **round-trip encode/decode correctness**
+- Unicode NFC normalization included
+- Handles Thai–Latin spacing consistently
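The round-trip and normalization properties listed above can be illustrated without loading the real tokenizer. The sketch below is a toy stand-in, assuming that whitespace maps one-to-one onto the `<SPACE>`, `<NL>`, and `<TAB>` tokens named in the Architecture section. The helpers `encode_whitespace` and `decode_whitespace` are hypothetical and do not reflect the tokenizer's actual internals.

```python
import unicodedata

# Assumed one-to-one mapping between whitespace characters and placeholder tokens.
PLACEHOLDERS = {" ": "<SPACE>", "\n": "<NL>", "\t": "<TAB>"}
REVERSE = {token: char for char, token in PLACEHOLDERS.items()}


def encode_whitespace(text):
    """NFC-normalize the text, then replace whitespace with placeholder tokens."""
    normalized = unicodedata.normalize("NFC", text)
    return "".join(PLACEHOLDERS.get(char, char) for char in normalized)


def decode_whitespace(encoded):
    """Restore the original whitespace characters from placeholder tokens."""
    for token, char in REVERSE.items():
        encoded = encoded.replace(token, char)
    return encoded


sample = "สวัสดี\tHello world\nCafe\u0301"  # last word uses a combining accent
restored = decode_whitespace(encode_whitespace(sample))
# The round trip returns the NFC form of the input, not necessarily the raw input.
print(restored == unicodedata.normalize("NFC", sample))
```

Because normalization happens before encoding, the round-trip guarantee in this sketch holds against the NFC form of the input. The toy decoder would also misread text that literally contains a placeholder string such as `<NL>`; a real tokenizer has to handle that case separately.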
+---
+## 🚀 Usage Examples
+### Basic Text Generation
+```python
+import torch
+from transformers import AutoTokenizer, AutoModelForCausalLM
+
+MODEL_ID = "ZombitX64/Hanuman"
+
+# Load the tokenizer and model weights from the Hugging Face Hub.
+tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
+
+
+def generate_thai_text(prompt, max_length=100):
+    """Sample a continuation of `prompt`; `max_length` counts the prompt tokens too."""
+    inputs = tokenizer(prompt, return_tensors="pt")
+    with torch.no_grad():  # gradients are not needed for generation
+        outputs = model.generate(
+            **inputs,
+            max_length=max_length,
+            temperature=0.7,
+            top_p=0.9,
+            do_sample=True,
+            # Reuse EOS as the padding token so generate() has a pad id.
+            pad_token_id=tokenizer.eos_token_id,
+        )
+    return tokenizer.decode(outputs[0], skip_special_tokens=True)
+
+
+print(generate_thai_text("Artificial intelligence technology"))
+```
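To show what `temperature=0.7` and `top_p=0.9` do during sampling, here is a minimal pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering. It is a simplified illustration rather than the Transformers implementation, and the helper names `top_p_filter` and `sample_token` are hypothetical.

```python
import math
import random


def top_p_filter(logits, temperature=0.7, top_p=0.9):
    """Return {token_index: probability} for the tokens kept by top-p filtering."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most probable tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, temperature=0.7, top_p=0.9, rng=random):
    """Draw one token index from the filtered distribution."""
    nucleus = top_p_filter(logits, temperature, top_p)
    tokens = list(nucleus)
    weights = [nucleus[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.0, -5.0]  # made-up scores for a four-token vocabulary
print(sorted(top_p_filter(toy_logits)))
print(sample_token(toy_logits))
```

A temperature below 1 sharpens the distribution before filtering, so fewer tokens are needed to reach the `top_p` mass and the output becomes less varied.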
+### Batch Processing
+```python
+# Reuses generate_thai_text from the previous example; prompts run one at a time.
+prompts = ["Hello", "Thailand has an area of", "Education in the digital era"]
+for p in prompts:
+    print(generate_thai_text(p, max_length=80))
+    print("-" * 50)
+```
+---
+## 🏗️ Training Process
+### Dataset Preparation
+* Source: Thai Wikipedia and reasoning-style datasets
+* Preprocessing: Cleaning, Unicode normalization, tokenization
+* Training mode: Streaming
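Streaming means records are cleaned and normalized one at a time rather than loaded into memory all at once. The generator below is a minimal sketch of that pattern, assuming plain-text records. The cleaning rules and the name `stream_clean` are illustrative and are not the project's actual preprocessing code; tokenization is left out.

```python
import re
import unicodedata


def stream_clean(records):
    """Lazily yield cleaned, NFC-normalized text records."""
    for raw in records:
        text = unicodedata.normalize("NFC", raw)
        # Collapse runs of spaces and tabs, then trim the ends.
        text = re.sub(r"[ \t]+", " ", text).strip()
        if text:  # drop records that are empty after cleaning
            yield text


# Any iterable works as a source, including a file handle or a streamed dataset.
source = iter(["  ประเทศไทย   ตั้งอยู่ใน  ", "", "\t\t", "Hello\tworld"])
for record in stream_clean(source):
    print(record)
```

Because `stream_clean` is a generator, each record is processed only when the consumer asks for it, so memory use stays flat regardless of corpus size.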
+### Example Training Configuration
+```python
+training_args = {
+    "per_device_train_batch_size": 2,
+    "per_device_eval_batch_size": 2,
+    "gradient_accumulation_steps": 4,
+    "num_train_epochs": 2,
+    "learning_rate": 5e-5,
+    "warmup_steps": 10,
+    "logging_steps": 10,
+    "eval_steps": 50,
+    "save_steps": 50,
+    "fp16": False,  # CPU training
+    "dataloader_num_workers": 0
+}
+```
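With gradient accumulation, the optimizer updates once per several forward passes, so the effective batch size is larger than the per-device value. The short calculation below uses the values from the configuration above. The single-device assumption follows the CPU-training comment, and the dataset size of 1,000 examples is a made-up number for illustration.

```python
import math

per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_train_epochs = 2
num_devices = 1       # assumption: a single CPU process
num_examples = 1000   # hypothetical dataset size, not from the model card

# Examples consumed per optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Approximate optimizer updates per epoch and in total (last partial batch rounded up).
updates_per_epoch = math.ceil(num_examples / effective_batch_size)
total_updates = updates_per_epoch * num_train_epochs

print(effective_batch_size, updates_per_epoch, total_updates)  # 8 125 250
```

Exact step counts depend on how a given trainer rounds partial batches, so treat these numbers as estimates.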
+---
+## 📊 Evaluation
+The model is currently in the **research phase**.
+Formal evaluation results (perplexity, Thai downstream benchmarks) are not yet available and will be added later.
+---
+## 🤝 Contributing
+This project is part of ongoing Thai NLP research.
+Feedback, issues, and contributions are welcome!
+---
+## 📄 Citation
+```bibtex
+@misc{Hanuman2025,
+  title        = {Hanuman: Thai Small Language Model},
+  author       = {JonusNattapong and Koichi Yasuoka},
+  year         = {2025},
+  howpublished = {\url{https://huggingface.co/ZombitX64/Hanuman}},
+  note         = {Tokenizer advisor: Koichi Yasuoka}
+}
+```
+---
+> ⚠️ **Disclaimer**: This model is intended for research and educational purposes only.
+> The CC BY-NC 4.0 license does not cover commercial use; commercial applications require prior permission from the model authors.