---
language:
- th
- en
license: cc-by-nc-4.0
library_name: transformers
pipeline_tag: text-generation
tags:
- thai
- text-generation
- Hanuman
- pytorch
- reasoning
datasets:
- HelpingAI/Dhanishtha-2.0-SUPERTHINKER
- HuggingFaceH4/no_robots
model-index:
- name: ZombitX64/Hanuman
  results:
  - task:
      name: Text Generation
      type: text-generation
    dataset:
      name: HelpingAI/Dhanishtha-2.0-SUPERTHINKER
      type: text
    metrics: []
  - task:
      name: Text Generation
      type: text-generation
    dataset:
      name: HuggingFaceH4/no_robots
      type: text
    metrics: []
widget:
- text: Hello
  example_title: Simple greeting
- text: Thailand is located in
  example_title: Geography
- text: Artificial intelligence technology is
  example_title: Technology
inference:
  parameters:
    max_length: 100
    temperature: 0.7
    top_p: 0.9
    do_sample: true
---

# Hanuman

<div align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/673eef9c4edfc6d3b58ba3aa/phqwy_ASNiDUo0DVqW30x.png" width="300" alt="Hanuman">

<strong>Hanuman — A Small Language Model for Thai</strong>

<em>Tokenizer advisor: <a href="https://huggingface.co/KoichiYasuoka">Koichi Yasuoka</a></em>

<a href="https://creativecommons.org/licenses/by-nc/4.0/"><img src="https://img.shields.io/badge/License-CC_BY--NC_4.0-lightgrey.svg"></a>
<a href="https://huggingface.co/JonusNattapong/Hanuman"><img src="https://img.shields.io/badge/🤗%20HF-Model-yellow"></a>
</div>

---

## 🔎 Model Details

### Overview
- **Name**: Hanuman
- **Language**: Thai (th)
- **Task**: Text Generation (Causal LM)
- **Framework**: PyTorch + 🤗 Transformers
- **License**: CC BY-NC 4.0 (non-commercial use only)

### Training Datasets
- [HelpingAI/Dhanishtha-2.0-SUPERTHINKER](https://huggingface.co/datasets/HelpingAI/Dhanishtha-2.0-SUPERTHINKER)
- [HuggingFaceH4/no_robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots)

### Architecture
- **Small Language Model (SLM) with a Mixture-of-Experts design**
- Context length: **4,096 tokens** (extended via RoPE scaling)
- Custom tokenizer for Thai (handles whitespace, newlines, and tabs, with special tokens such as `<SPACE>`, `<NL>`, and `<TAB>`)
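
Linear RoPE scaling of the kind mentioned above can be sketched as follows. The pre-extension window of 2,048 tokens and the head dimension of 64 are illustrative assumptions, not published details of Hanuman:

```python
def rope_angles(position, dim, base=10000.0, scale=1.0):
    """Rotary-embedding angles for one token position.
    scale > 1 compresses positions, so a model trained on a shorter
    window can address longer sequences (linear RoPE scaling)."""
    pos = position / scale
    return [pos / (base ** (2 * i / dim)) for i in range(dim // 2)]

# Hypothetical: stretching a 2,048-token training window to the
# 4,096-token context above with a linear scale factor of 2.
scale = 4096 / 2048
angles = rope_angles(4095, dim=64, scale=scale)
# Scaled position 4095 behaves like unscaled position 2047.5,
# i.e. inside the range the model saw during training.
```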

---

## ✅ Intended Use

### Primary Use Cases
- Thai text generation (blogs, articles, captions, chatbots)
- Creative and reasoning-oriented writing assistance
- Thai NLP research

### Limitations
- This model is **research-oriented** and may require additional fine-tuning for production use.
- It may generate incorrect or biased output; human review is recommended.

---

## 🧰 Tokenizer & Context

- Custom fast tokenizer (no `trust_remote_code` needed)
- Ensures **round-trip encode/decode correctness**
- Unicode NFC normalization included
- Handles Thai–Latin spacing consistently
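
The value of NFC normalization can be illustrated with Python's standard library; this is a general Unicode example, not Hanuman-specific code:

```python
import unicodedata

# Visually identical strings can differ at the code-point level. NFC maps
# canonically equivalent sequences to one form, so tokenization and
# round-trip decoding are deterministic regardless of how input was typed.
decomposed = "e\u0301"                      # 'e' + combining acute accent
composed = unicodedata.normalize("NFC", decomposed)
assert composed == "\u00e9" and len(composed) == 1
```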

---

## 🚀 Usage Examples

### Basic Text Generation
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "ZombitX64/Hanuman"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

def generate_thai_text(prompt, max_length=100):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=max_length,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generate_thai_text("Artificial intelligence technology"))
```

### Batch Processing

```python
prompts = ["Hello", "Thailand has an area of", "Education in the digital era"]
for p in prompts:
    print(generate_thai_text(p, max_length=80))
    print("-" * 50)
```

---

## 🏗️ Training Process

### Dataset Preparation

* Source: Thai Wikipedia and reasoning-style datasets
* Preprocessing: cleaning, Unicode normalization, tokenization
* Training mode: streaming

### Example Training Configuration

```python
training_args = {
    "per_device_train_batch_size": 2,
    "per_device_eval_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "num_train_epochs": 2,
    "learning_rate": 5e-5,
    "warmup_steps": 10,
    "logging_steps": 10,
    "eval_steps": 50,
    "save_steps": 50,
    "fp16": False,  # CPU training
    "dataloader_num_workers": 0,
}
```
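
With these values, the effective batch size per optimizer step works out as follows (assuming a single device, consistent with the CPU-training comment in the config):

```python
cfg = {
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 4,
}
n_devices = 1  # assumed single CPU, per the fp16=False comment above

# Gradients are accumulated over 4 micro-batches of 2 examples before
# each optimizer step, so every weight update sees 8 examples.
effective_batch = (cfg["per_device_train_batch_size"]
                   * cfg["gradient_accumulation_steps"]
                   * n_devices)
print(effective_batch)  # 8
```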

---

## 📊 Evaluation

The model is currently in the **research phase**.
Formal evaluation results (perplexity, Thai downstream benchmarks) will be added in the future.
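
When perplexity numbers are reported, they will follow the standard definition, the exponential of the mean per-token negative log-likelihood. The loss values below are placeholders for illustration, not measured results:

```python
import math

def perplexity(nll_per_token):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(nll_per_token) / len(nll_per_token))

# Placeholder per-token losses, for illustration only.
ppl = perplexity([2.1, 1.9, 2.3])  # exp(2.1), roughly 8.17
```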

---

## 🤝 Contributing

This project is part of ongoing Thai NLP research.
Feedback, issues, and contributions are welcome!

---

## 📄 Citation

```bibtex
@misc{Hanuman2025,
  title        = {Hanuman: Thai Small Language Model},
  author       = {JonusNattapong and Koichi Yasuoka},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/ZombitX64/Hanuman}},
  note         = {Tokenizer advisor: Koichi Yasuoka}
}
```

---

> ⚠️ **Disclaimer**: This model is intended for research and educational purposes only.
> Use in commercial applications requires prior permission under the CC BY-NC 4.0 license.