---
language:
- th
- en
license: cc-by-nc-4.0
library_name: transformers
pipeline_tag: text-generation
tags:
- thai
- text-generation
- Hanuman
- pytorch
- reasoning
datasets:
- HelpingAI/Dhanishtha-2.0-SUPERTHINKER
- HuggingFaceH4/no_robots
model-index:
- name: ZombitX64/Hanuman
  results:
  - task:
      name: Text Generation
      type: text-generation
    dataset:
      name: HelpingAI/Dhanishtha-2.0-SUPERTHINKER
      type: text
    metrics: []
  - task:
      name: Text Generation
      type: text-generation
    dataset:
      name: HuggingFaceH4/no_robots
      type: text
    metrics: []
widget:
- text: Hello
  example_title: Simple greeting
- text: Thailand is located in
  example_title: Geography
- text: Artificial intelligence technology is
  example_title: Technology
inference:
  parameters:
    max_length: 100
    temperature: 0.7
    top_p: 0.9
    do_sample: true
---

# Hanuman

Hanuman: A Small Language Model for Thai

Tokenizer advisor: Koichi Yasuoka

---

## 🔎 Model Details

### Overview

- **Name**: Hanuman
- **Language**: Thai (th)
- **Task**: Text Generation (Causal LM)
- **Framework**: PyTorch + 🤗 Transformers
- **License**: CC BY-NC 4.0 (Non-commercial use only)

### Training Datasets

- [HelpingAI/Dhanishtha-2.0-SUPERTHINKER](https://huggingface.co/datasets/HelpingAI/Dhanishtha-2.0-SUPERTHINKER)
- [HuggingFaceH4/no_robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots)

### Architecture

- Based on a **Small Language Model (SLM) with Mixture of Experts**
- Context length: **4,096 tokens** (extended via RoPE scaling)
- Custom tokenizer for Thai language (handles whitespace, newline, tab, etc.)

---

## ✅ Intended Use

### Primary Use Cases

- Thai text generation (blogs, articles, captions, chatbots)
- Creative and reasoning-oriented text assistance
- Thai NLP research

### Limitations

- This model is **research-oriented** and may require additional fine-tuning for production use.
- May generate incorrect or biased outputs. Human verification is recommended.
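A further practical constraint follows from the 4,096-token context length listed under Architecture: the prompt and the generated continuation have to fit in that window together. The sketch below shows one way to budget for this. The helper name `clamp_new_tokens` is hypothetical and not part of this repository; it only does the arithmetic and does not load the model.

```python
CONTEXT_LENGTH = 4096  # context length stated in the Architecture section


def clamp_new_tokens(prompt_tokens: int,
                     requested_new_tokens: int,
                     context_length: int = CONTEXT_LENGTH) -> int:
    """Return how many new tokens can be generated without exceeding the window.

    Hypothetical helper for illustration; not part of the Hanuman repository.
    """
    if prompt_tokens < 0 or requested_new_tokens < 0:
        raise ValueError("token counts must be non-negative")
    # Tokens left in the window once the prompt is accounted for.
    remaining = context_length - prompt_tokens
    # Never return a negative budget, and never exceed the request.
    return max(0, min(requested_new_tokens, remaining))


# A 4,000-token prompt leaves room for only part of a 200-token request.
print(clamp_new_tokens(prompt_tokens=4000, requested_new_tokens=200))
# A short prompt leaves the request unchanged.
print(clamp_new_tokens(prompt_tokens=10, requested_new_tokens=100))
```

The prompt length can be read from `inputs["input_ids"].shape[1]` after tokenizing, and the result can be passed to `model.generate` as `max_new_tokens`. A prompt that already exceeds the window yields a budget of 0 and would need to be truncated first.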
---

## 🧰 Tokenizer & Context

- Custom fast tokenizer (no `trust_remote_code` needed)
- Ensures **round-trip encode/decode correctness**
- Unicode NFC normalization included
- Handles Thai–Latin spacing consistently

---

## 🚀 Usage Examples

### Basic Text Generation

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "ZombitX64/Hanuman"

# Load the tokenizer and model weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)


def generate_thai_text(prompt, max_length=100):
    """Sample a continuation; max_length counts prompt plus generated tokens."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model.generate(
            **inputs,
            max_length=max_length,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


print(generate_thai_text("Artificial intelligence technology"))
```

### Batch Processing

```python
# Reuses generate_thai_text from the previous example.
# Prompts are generated one at a time in a loop.
prompts = ["Hello", "Thailand has an area of", "Education in the digital era"]

for p in prompts:
    print(generate_thai_text(p, max_length=80))
    print("-" * 50)
```

---

## 🏗️ Training Process

### Dataset Preparation

* Source: Thai Wikipedia and reasoning-style datasets
* Preprocessing: Cleaning, Unicode normalization, tokenization
* Training mode: Streaming

### Example Training Configuration

```python
training_args = {
    "per_device_train_batch_size": 2,
    "per_device_eval_batch_size": 2,
    "gradient_accumulation_steps": 4,  # effective train batch per device: 2 * 4 = 8
    "num_train_epochs": 2,
    "learning_rate": 5e-5,
    "warmup_steps": 10,
    "logging_steps": 10,
    "eval_steps": 50,
    "save_steps": 50,
    "fp16": False,  # CPU training
    "dataloader_num_workers": 0,
}
```

---

## 📊 Evaluation

The model is currently in the **research phase**. Formal evaluation results (perplexity, Thai downstream benchmarks) will be added in the future.

---

## 🤝 Contributing

This project is part of ongoing Thai NLP research. Feedback, issues, and contributions are welcome!
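For contributors who want to report tokenizer issues, the round-trip and NFC properties listed under Tokenizer & Context can be phrased as a small check. The sketch below is illustrative only: `round_trips`, `codepoint_encode`, and `codepoint_decode` are hypothetical names, and the code-point encoder is a stand-in so the example runs without downloading the model. It is not Hanuman's tokenizer.

```python
import unicodedata
from typing import Callable, List


def codepoint_encode(text: str) -> List[int]:
    """Stand-in encoder (NOT Hanuman's tokenizer): NFC-normalize, then use code points."""
    return [ord(ch) for ch in unicodedata.normalize("NFC", text)]


def codepoint_decode(ids: List[int]) -> str:
    """Stand-in decoder matching codepoint_encode."""
    return "".join(chr(i) for i in ids)


def round_trips(text: str,
                encode: Callable[[str], List[int]],
                decode: Callable[[List[int]], str]) -> bool:
    """True if decode(encode(text)) equals the NFC-normalized input."""
    expected = unicodedata.normalize("NFC", text)
    return decode(encode(text)) == expected


# Thai, mixed Thai-Latin, whitespace, and a decomposed Latin character.
samples = ["สวัสดี", "ภาษาไทย and English", "line one\n\tline two", "e\u0301"]
for sample in samples:
    print(repr(sample), round_trips(sample, codepoint_encode, codepoint_decode))
```

To run the same check against the real tokenizer, one option is to pass `lambda s: tokenizer.encode(s, add_special_tokens=False)` and `lambda ids: tokenizer.decode(ids, skip_special_tokens=True)` as the two callables. Inputs that fail the check would make useful issue reports.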
---

## 📄 Citation

```bibtex
@misc{Hanuman2025,
  title        = {Hanuman: Thai Small Language Model},
  author       = {JonusNattapong and Koichi Yasuoka},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/ZombitX64/Hanuman}},
  note         = {Tokenizer advisor: Koichi Yasuoka}
}
```

---

> ⚠️ **Disclaimer**: This model is intended for research and educational purposes only.
> The CC BY-NC 4.0 license does not cover commercial use; commercial applications require prior permission.