Gujarati Language Model (GPT-2 Small)

This is a compact Gujarati language model based on the GPT-2 architecture. It has been trained from scratch on a dataset of Gujarati headlines to demonstrate language modeling capabilities in low-resource settings.

Model Description

  • Architecture: GPT-2 (Decoder-only Transformer)
  • Model Size: ~52 Million Parameters
  • Context Window: 256 tokens
  • Vocabulary Size: 64,000
  • Tokenizer: SentencePiece (Unigram)
  • Training Framework: Hugging Face Transformers & Tokenizers

Intended Use

This model is intended for:

  • Text generation in Gujarati.
  • Educational purposes to understand LLM training from scratch.
  • Fine-tuning on specific Gujarati downstream tasks.

Note: Due to the small model size and limited training data (headlines), the generated text may not always be coherent for long paragraphs.

How to Use

You can use this model directly with the Hugging Face pipeline or AutoModel classes.

Using Pipeline

from transformers import pipeline

# Load the model as a text-generation pipeline
generator = pipeline("text-generation", model="Naman0807/gujarati-lm")

prompt = "સરકાર"  # "Government"
# max_length counts the prompt tokens as well as the generated ones
output = generator(prompt, max_length=50, num_return_sequences=1)

print(output[0]['generated_text'])

Using Model and Tokenizer

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model weights from the Hub
tokenizer = AutoTokenizer.from_pretrained("Naman0807/gujarati-lm")
model = AutoModelForCausalLM.from_pretrained("Naman0807/gujarati-lm")

input_text = "ગુજરાત રાજ્ય"  # "Gujarat State"
input_ids = tokenizer.encode(input_text, return_tensors="pt")

# Greedy decoding up to 50 tokens (prompt included in the count)
output = model.generate(input_ids, max_length=50, num_return_sequences=1)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

print(generated_text)

Training Data

The model was trained on a dataset of Gujarati news headlines.

  • Source: Custom CSV dataset (train.csv, valid.csv)
  • Content: News headlines and short text snippets.
  • Preprocessing: Text was normalized using NFKC and tokenized using a custom SentencePiece tokenizer trained on the corpus.
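The NFKC step can be reproduced with Python's standard library. A minimal sketch (the exact preprocessing script is not published, so the `normalize_text` helper below is illustrative):

```python
import unicodedata

def normalize_text(text: str) -> str:
    """Apply NFKC normalization, as done before tokenization."""
    return unicodedata.normalize("NFKC", text.strip())

# NFKC folds compatibility characters into canonical forms,
# e.g. full-width digits become plain ASCII digits:
print(normalize_text("ગુજરાત １２３"))  # -> "ગુજરાત 123"
```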

Training Procedure

Hyperparameters

  • Epochs: 5
  • Batch Size: 8
  • Learning Rate: 5e-4
  • Weight Decay: 0.01
  • Optimizer: AdamW
  • Scheduler: Linear decay
  • Precision: Mixed Precision (FP16)
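Assuming the standard Hugging Face Trainer API was used (the training script itself is not published), these hyperparameters map onto TrainingArguments roughly as follows; the output directory name is hypothetical:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="gujarati-lm",        # hypothetical path
    num_train_epochs=5,
    per_device_train_batch_size=8,
    learning_rate=5e-4,
    weight_decay=0.01,
    lr_scheduler_type="linear",      # linear decay
    fp16=True,                       # mixed-precision training
)
# AdamW is the Trainer's default optimizer, so it needs no explicit setting.
```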

Model Configuration

GPT2Config(
    vocab_size=64000,
    n_positions=256,
    n_ctx=256,
    n_embd=512,
    n_layer=6,
    n_head=8
)
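The ~52M parameter figure follows directly from this configuration. A back-of-the-envelope check, assuming the standard GPT-2 weight layout with the output head tied to the input embeddings:

```python
vocab_size, n_positions, n_embd, n_layer = 64000, 256, 512, 6

# Token and position embeddings (wte + wpe)
embeddings = vocab_size * n_embd + n_positions * n_embd

# Per transformer block: attention (c_attn + c_proj), MLP (c_fc + c_proj),
# and two LayerNorms, each with a weight and a bias vector.
attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
layernorms = 2 * 2 * n_embd
block = attn + mlp + layernorms

final_ln = 2 * n_embd  # ln_f
total = embeddings + n_layer * block + final_ln  # lm_head tied to wte

print(f"{total:,} parameters")  # -> 51,814,400 (~52M)
```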

Limitations

  • Context Length: The model has a short context window of 256 tokens, limiting its ability to generate long, coherent articles.
  • Dataset Bias: Trained primarily on headlines, so the style is terse and news-oriented.
  • Size: As a "compact" model (6 layers, 512 hidden dim), it lacks the reasoning capabilities of larger LLMs.