---
language:
- th
- en
license: cc-by-nc-4.0
library_name: transformers
pipeline_tag: text-generation
tags:
- thai
- text-generation
- Hanuman
- pytorch
- reasoning
datasets:
- HelpingAI/Dhanishtha-2.0-SUPERTHINKER
- HuggingFaceH4/no_robots
model-index:
- name: ZombitX64/Hanuman
  results:
  - task:
      name: Text Generation
      type: text-generation
    dataset:
      name: HelpingAI/Dhanishtha-2.0-SUPERTHINKER
      type: text
    metrics: []
  - task:
      name: Text Generation
      type: text-generation
    dataset:
      name: HuggingFaceH4/no_robots
      type: text
    metrics: []
widget:
- text: Hello
  example_title: Simple greeting
- text: Thailand is located in
  example_title: Geography
- text: Artificial intelligence technology is
  example_title: Technology
inference:
  parameters:
    max_length: 100
    temperature: 0.7
    top_p: 0.9
    do_sample: true
---
# Hanuman

<div align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/673eef9c4edfc6d3b58ba3aa/phqwy_ASNiDUo0DVqW30x.png" width="300" alt="Hanuman">

<strong>Hanuman: A Small Language Model for Thai</strong>

<em>Tokenizer advisor: <a href="https://huggingface.co/KoichiYasuoka">Koichi Yasuoka</a></em>

<a href="https://creativecommons.org/licenses/by-nc/4.0/"><img src="https://img.shields.io/badge/License-CC_BY--NC_4.0-lightgrey.svg" alt="License: CC BY-NC 4.0"></a>
<a href="https://huggingface.co/JonusNattapong/Hanuman"><img src="https://img.shields.io/badge/🤗%20HF-Model-yellow" alt="Hugging Face model"></a>
</div>

---
## 🔎 Model Details

### Overview

- **Name**: Hanuman
- **Language**: Thai (th)
- **Task**: Text generation (causal LM)
- **Framework**: PyTorch + 🤗 Transformers
- **License**: CC BY-NC 4.0 (non-commercial use only)

### Training Datasets

- [HelpingAI/Dhanishtha-2.0-SUPERTHINKER](https://huggingface.co/datasets/HelpingAI/Dhanishtha-2.0-SUPERTHINKER)
- [HuggingFaceH4/no_robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots)

### Architecture

- **Small language model (SLM)** with a **Mixture-of-Experts** design
- Context length: **4,096 tokens** (extended via RoPE scaling)
- Custom Thai tokenizer that represents whitespace, newlines, and tabs as explicit tokens such as `<SPACE>`, `<NL>`, and `<TAB>` (see the sketch below)
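To see the whitespace handling in practice, here is a minimal sketch that prints the tokens for a mixed Thai–Latin string. It only assumes the tokenizer loads via `AutoTokenizer`; the sample text is illustrative.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ZombitX64/Hanuman")

# Thai and Latin text mixed with a space, a newline, and a tab.
sample = "สวัสดี world\nนี่คือ\tตัวอย่าง"
print(tokenizer.tokenize(sample))
# Whitespace should surface as explicit tokens such as <SPACE>, <NL>, <TAB>.
```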
---
## ✅ Intended Use

### Primary Use Cases

- Thai text generation (blogs, articles, captions, chatbots)
- Creative and reasoning-oriented text assistance
- Thai NLP research

### Limitations

- The model is **research-oriented** and may require additional fine-tuning for production use.
- It may generate incorrect or biased outputs; human verification is recommended.

---

## 🧰 Tokenizer & Context

- Custom fast tokenizer (no `trust_remote_code` needed)
- Ensures **round-trip encode/decode correctness** (see the sketch below)
- Applies Unicode NFC normalization
- Handles Thai–Latin spacing consistently
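A minimal sketch of the round-trip property, assuming the tokenizer preserves text through encode/decode as stated above; the sample string is illustrative.

```python
import unicodedata

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ZombitX64/Hanuman")

text = "ภาษาไทยปนกับ English\tและขึ้นบรรทัดใหม่"
ids = tokenizer.encode(text, add_special_tokens=False)
decoded = tokenizer.decode(ids, skip_special_tokens=True)

# The card states NFC normalization is applied, so compare NFC forms.
assert decoded == unicodedata.normalize("NFC", text)
```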
---

## 🚀 Usage Examples

### Basic Text Generation
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "ZombitX64/Hanuman"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

def generate_thai_text(prompt, max_length=100):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=max_length,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generate_thai_text("Artificial intelligence technology"))
```
### Batch Processing
```python
prompts = ["Hello", "Thailand has an area of", "Education in the digital era"]
for p in prompts:
    print(generate_thai_text(p, max_length=80))
    print("-" * 50)
```
---

## 🏗️ Training Process

### Dataset Preparation

* Source: Thai Wikipedia and reasoning-style datasets
* Preprocessing: cleaning, Unicode normalization, tokenization
* Training mode: streaming (see the sketch below)
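As a reference point, here is a minimal sketch of loading one of the listed datasets in streaming mode with 🤗 Datasets; the `train` split name is an assumption.

```python
import itertools

from datasets import load_dataset

# streaming=True iterates over records without downloading the full dataset.
stream = load_dataset("HuggingFaceH4/no_robots", split="train", streaming=True)

for example in itertools.islice(stream, 2):
    print(example)
```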
### Example Training Configuration
```python
training_args = {
    "per_device_train_batch_size": 2,
    "per_device_eval_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "num_train_epochs": 2,
    "learning_rate": 5e-5,
    "warmup_steps": 10,
    "logging_steps": 10,
    "eval_steps": 50,
    "save_steps": 50,
    "fp16": False,  # CPU training
    "dataloader_num_workers": 0,
}
```
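A minimal sketch of wiring this configuration into 🤗 `Trainer`. The toy corpus and `output_dir` are illustrative, and `model` and `tokenizer` are assumed to be loaded as in the usage example above.

```python
from datasets import Dataset
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

# Tiny toy corpus so the sketch runs end to end; real training streams the
# datasets listed above.
ds = Dataset.from_dict({"text": ["สวัสดีครับ", "ประเทศไทยอยู่ในเอเชียตะวันออกเฉียงใต้"]})
ds = ds.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
    remove_columns=["text"],
)

args = TrainingArguments(output_dir="hanuman-checkpoints", **training_args)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ds,
    eval_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```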
---

## 📊 Evaluation

The model is currently in the **research phase**.
Formal evaluation results (perplexity, Thai downstream benchmarks) will be added in the future; a simple perplexity check is sketched below in the meantime.
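A minimal sketch of a perplexity measurement, assuming `model` and `tokenizer` from the usage example above; the sentence is an illustrative placeholder, not an official benchmark.

```python
import math

import torch

def perplexity(text):
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # For a causal LM, passing labels returns the mean cross-entropy loss.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

print(perplexity("ประเทศไทยมีประชากรประมาณเจ็ดสิบล้านคน"))
```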
---

## 🤝 Contributing

This project is part of ongoing Thai NLP research.
Feedback, issues, and contributions are welcome!

---

## 📄 Citation
```bibtex
@misc{Hanuman2025,
  title        = {Hanuman: Thai Small Language Model},
  author       = {JonusNattapong and Koichi Yasuoka},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/ZombitX64/Hanuman}},
  note         = {Tokenizer advisor: Koichi Yasuoka}
}
```
---

> ⚠️ **Disclaimer**: This model is intended for research and educational purposes only.
> Use in commercial applications requires prior permission under the CC BY-NC 4.0 license.