# Gujarati Language Model (GPT-2 Small)
This is a compact Gujarati language model based on the GPT-2 architecture. It was trained from scratch on a dataset of Gujarati news headlines to demonstrate language modeling in a low-resource setting.
## Model Description
- Architecture: GPT-2 (Decoder-only Transformer)
- Model Size: ~52 Million Parameters
- Context Window: 256 tokens
- Vocabulary Size: 64,000
- Tokenizer: SentencePiece (Unigram)
- Training Framework: Hugging Face Transformers & Tokenizers
## Intended Use
This model is intended for:
- Text generation in Gujarati.
- Educational purposes to understand LLM training from scratch.
- Fine-tuning on specific Gujarati downstream tasks.
Note: Due to the small model size and limited training data (headlines), the generated text may not always be coherent for long paragraphs.
## How to Use
You can use this model directly with the Hugging Face `pipeline` API or with the `AutoModel` classes.
### Using Pipeline
```python
from transformers import pipeline

generator = pipeline("text-generation", model="Naman0807/gujarati-lm")

prompt = "સરકાર"  # "government"
output = generator(prompt, max_length=50, num_return_sequences=1)
print(output[0]["generated_text"])
```
### Using Model and Tokenizer
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Naman0807/gujarati-lm")
model = AutoModelForCausalLM.from_pretrained("Naman0807/gujarati-lm")

input_text = "ગુજરાત રાજ્ય"  # "Gujarat state"
input_ids = tokenizer.encode(input_text, return_tensors="pt")
output = model.generate(input_ids, max_length=50, num_return_sequences=1)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
```
## Training Data
The model was trained on a dataset of Gujarati news headlines.
- Source: Custom CSV dataset (`train.csv`, `valid.csv`)
- Content: News headlines and short text snippets.
- Preprocessing: Text was normalized using NFKC and tokenized using a custom SentencePiece tokenizer trained on the corpus.
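The NFKC step above can be sketched with Python's standard library; the `preprocess` helper name and the `.strip()` call are illustrative assumptions, not the card's exact pipeline:

```python
import unicodedata


def preprocess(text: str) -> str:
    """NFKC-normalize a line before SentencePiece tokenization.

    NFKC folds compatibility characters (full-width digits, non-breaking
    spaces, some ligatures) into their canonical forms, so visually
    identical strings map to one token sequence.
    """
    return unicodedata.normalize("NFKC", text).strip()


print(preprocess("１２３"))        # full-width digits -> "123"
print(preprocess("a\u00a0b"))      # non-breaking space -> "a b"
```

Normalizing before training the tokenizer keeps the 64,000-entry vocabulary from wasting slots on compatibility variants of the same character.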
## Training Procedure
### Hyperparameters
- Epochs: 5
- Batch Size: 8
- Learning Rate: 5e-4
- Weight Decay: 0.01
- Optimizer: AdamW
- Scheduler: Linear decay
- Precision: Mixed Precision (FP16)
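The linear decay schedule multiplies the base learning rate (5e-4) by a factor that falls linearly to zero over training. A minimal sketch follows; the warmup phase is an assumption (the Hugging Face linear scheduler supports it, but this card does not state a warmup value):

```python
def linear_lr_factor(step: int, warmup_steps: int, total_steps: int) -> float:
    """Multiplier on the base LR: linear warmup, then linear decay to 0.

    Mirrors the shape of a linear-decay-with-warmup schedule; warmup_steps
    may be 0, in which case decay starts immediately.
    """
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


# e.g. with 100 warmup steps out of 1000 total:
print(linear_lr_factor(0, 100, 1000))     # 0.0  (start of warmup)
print(linear_lr_factor(100, 100, 1000))   # 1.0  (peak LR)
print(linear_lr_factor(1000, 100, 1000))  # 0.0  (end of training)
```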
### Model Configuration
```python
GPT2Config(
    vocab_size=64000,
    n_positions=256,
    n_ctx=256,
    n_embd=512,
    n_layer=6,
    n_head=8,
)
```
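A back-of-envelope count from this config recovers the ~52M figure, assuming the standard GPT-2 layout with the LM head tied to the token embedding:

```python
# Parameter count implied by the config above (GPT-2 layout, tied LM head).
V, P, D, L = 64000, 256, 512, 6  # vocab, positions, hidden dim, layers

tok_emb = V * D                  # token embedding (shared with LM head)
pos_emb = P * D                  # learned position embedding
per_layer = (
    2 * (2 * D)                  # two LayerNorms (weight + bias each)
    + (D * 3 * D + 3 * D)        # fused QKV projection
    + (D * D + D)                # attention output projection
    + (D * 4 * D + 4 * D)        # MLP up-projection (4x expansion)
    + (4 * D * D + D)            # MLP down-projection
)
final_ln = 2 * D                 # final LayerNorm

total = tok_emb + pos_emb + L * per_layer + final_ln
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

The total lands at roughly 51.8M, consistent with the "~52 Million Parameters" stated above; over 60% of it sits in the 64,000-entry embedding table.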
## Limitations
- Context Length: The model has a short context window of 256 tokens, limiting its ability to generate long, coherent articles.
- Dataset Bias: Trained primarily on headlines, so the style is terse and news-oriented.
- Size: As a "compact" model (6 layers, 512 hidden dim), it lacks the reasoning capabilities of larger LLMs.