# GPT-2 Mongolian Language Model
This is a GPT-2 model fine-tuned for Mongolian text generation. It was trained on the mn-mgltol-data dataset available on Kaggle, specifically using the Cleaned_Description field from cleaned_data.csv.
## Model Description
The model uses the GPT-2 architecture with the following configuration:

- `vocab_size`: 32000 (derived from the goryden/gpt2-mn-tokenizer)
- `n_positions`: 512
- `n_ctx`: 512
- `n_embd`: 512
- `n_layer`: 6
- `n_head`: 8
It is designed for causal language modeling, meaning it predicts the next token in a sequence.
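As a sketch, the architecture above can be written out with the `transformers` `GPT2Config` class (values are taken from the list above; this reconstructs only the configuration, not the released weights):

```python
from transformers import GPT2Config

# Architecture values as listed in the model description above.
config = GPT2Config(
    vocab_size=32000,
    n_positions=512,
    n_ctx=512,
    n_embd=512,
    n_layer=6,
    n_head=8,
)
print(config)
```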
## Usage
You can use this model for various text generation tasks in Mongolian. Here's how to use it with the transformers library:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "goryden/gpt2-mn"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Set pad token for GPT-2 (it has none by default)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

prompt = "Монгол хэлний үүсэл хөгжил нь"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,
        temperature=0.9,
        top_p=0.95,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Output

A sample completion of the prompt above:

```
Монгол хэлний үүсэл хөгжил нь тэдгээрийн хөгжил нэгтэй элементийг судалдаг шүлгийн судлал эртний ухаан угсаатны шинжлэх ухаан урлагийн өвөрмөц цогц зүй дэлгэрэнгүй хэл шинжлэлийн салбар ухаан угсаатны холбоотой төрөл хэл шинжлэлийн салбар ухаан түүхэн хэл шинжлэл хэл зүйн онцлог хэлбэрийн хэлбэр хэлний хэлбэр хэл шинжлэл үгийн тайлбар толь бичиг хэлний тогтолцоо хэл шинжлэл үгийн язгуур хэлний тогтолцоо нь түүхэн язгуур махбодын хөгжлийн тухай түүхэн хөгжлийгуурын үгийн нэг хэлбэр хэл шинжлэл хэлний зүй болон үндсэн үг үгийн гарал үгийн зүй үгийн хэл зүй үгийн язгуур махбод зүй тогтлыг ийнхүү нэг аргаианы нэг арга ухаан үзэгдэл
```
## Dataset
The model was trained on the mn-mgltol-data dataset from Kaggle, specifically using the cleaned_data.csv file. The Cleaned_Description column was used as the primary text source.
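A minimal sketch of extracting that column as the training corpus with the standard-library `csv` module (the inline sample stands in for `cleaned_data.csv`, whose file and column names are taken from the text above; in practice you would open the Kaggle file directly):

```python
import csv
import io

# Inline stand-in for cleaned_data.csv; the real file has the same header.
sample = io.StringIO(
    "ID,Cleaned_Description\n"
    "1,Монгол хэл\n"
    "2,туршилтын өгүүлбэр\n"
)

# Keep only non-empty Cleaned_Description values as training texts.
texts = [
    row["Cleaned_Description"]
    for row in csv.DictReader(sample)
    if row["Cleaned_Description"]
]
print(texts)
```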
## Tokenizer
The model uses the goryden/gpt2-mn-tokenizer tokenizer, which was specifically designed for Mongolian text.
## Training

### Training Hyperparameters
The model was trained with the following TrainingArguments:
- `output_dir`: ./gpt2-mn
- `eval_strategy`: "steps"
- `eval_steps`: 2000
- `logging_steps`: 500
- `save_steps`: 2000
- `save_total_limit`: 2
- `num_train_epochs`: 20
- `per_device_train_batch_size`: 2
- `per_device_eval_batch_size`: 2
- `gradient_accumulation_steps`: 8
- `learning_rate`: 5e-4
- `warmup_steps`: 500
- `weight_decay`: 0.01
- `fp16`: True
- `report_to`: "wandb"
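Written out as a `TrainingArguments` call (a sketch; the `eval_strategy` name follows recent `transformers` releases, where `evaluation_strategy` was renamed). Note that with a per-device batch size of 2 and 8 gradient-accumulation steps, the effective batch size is 16 sequences per optimizer step:

```python
from transformers import TrainingArguments

# Direct transcription of the hyperparameters listed above.
training_args = TrainingArguments(
    output_dir="./gpt2-mn",
    eval_strategy="steps",
    eval_steps=2000,
    logging_steps=500,
    save_steps=2000,
    save_total_limit=2,
    num_train_epochs=20,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-4,
    warmup_steps=500,
    weight_decay=0.01,
    fp16=True,
    report_to="wandb",
)
```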
### Training Metrics
| Step | Training Loss | Validation Loss |
|---|---|---|
| 2000 | 3.865199 | 4.184773 |
| 4000 | 3.780115 | 4.176939 |
| 6000 | 3.754177 | 4.168733 |
| 8000 | 3.665107 | 4.161174 |
| 10000 | 3.642579 | 4.157396 |
| 12000 | 3.563909 | 4.152925 |
| 14000 | 3.569792 | 4.147260 |
| 16000 | 3.500197 | 4.144615 |
| 18000 | 3.633731 | 4.136218 |
| 20000 | 3.617897 | 4.124136 |
| 22000 | 3.649087 | 4.117109 |
| 24000 | 3.589488 | 4.112113 |
| 26000 | 3.629827 | 4.105901 |
| 28000 | 3.565568 | 4.101184 |
| 30000 | 3.586900 | 4.095827 |
| 32000 | 3.537389 | 4.091309 |
| 34000 | 3.557306 | 4.087384 |
| 36000 | 3.531081 | 4.084917 |
| 38000 | 3.528582 | 4.079904 |
| 40000 | 3.480876 | 4.078234 |
| 42000 | 3.489738 | 4.073891 |
| 44000 | 3.470903 | 4.073234 |
| 46000 | 3.465326 | 4.069700 |
| 48000 | 3.431285 | 4.068748 |
| 50000 | 3.455316 | 4.065001 |
| 52000 | 3.413579 | 4.064515 |
| 54000 | 3.426512 | 4.061472 |
| 56000 | 3.386969 | 4.060637 |
| 58000 | 3.395722 | 4.058526 |
| 60000 | 3.386943 | 4.057773 |
| 62000 | 3.385369 | 4.056948 |
| 64000 | 3.377199 | 4.055539 |
| 66000 | 3.368386 | 4.054784 |
| 68000 | 3.354294 | 4.054374 |
| 70000 | 3.343501 | 4.053655 |
| 72000 | 3.355486 | 4.053338 |
| 74000 | 3.368476 | 4.053087 |
| 76000 | 3.362126 | 4.052906 |
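For reference, a causal language model's validation loss can be converted to perplexity by exponentiating the cross-entropy. Using the final value from the table above:

```python
import math

# Perplexity = exp(cross-entropy loss); final validation loss at step 76000.
final_val_loss = 4.052906
perplexity = math.exp(final_val_loss)
print(f"Validation perplexity: {perplexity:.1f}")  # ≈ 57.6
```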
## Model Tree

Base model: openai-community/gpt2