---
language:
  - gu
tags:
  - text-generation
  - gpt2
  - gujarati
  - nlp
license: mit
---

# Gujarati Language Model (GPT-2 Small)

This is a compact Gujarati language model based on the GPT-2 architecture. It has been trained from scratch on a dataset of Gujarati headlines to demonstrate language modeling capabilities in low-resource settings.

## Model Description

- **Architecture:** GPT-2 (decoder-only Transformer)
- **Model size:** ~52 million parameters
- **Context window:** 256 tokens
- **Vocabulary size:** 64,000
- **Tokenizer:** SentencePiece (Unigram)
- **Training framework:** Hugging Face Transformers & Tokenizers

## Intended Use

This model is intended for:

- Text generation in Gujarati.
- Educational purposes, to understand LLM training from scratch.
- Fine-tuning on specific Gujarati downstream tasks.

**Note:** Due to the small model size and the limited training data (headlines), generated text may not remain coherent over long paragraphs.

## How to Use

You can use this model directly with the Hugging Face `pipeline` API or the `AutoModel` classes.

### Using the Pipeline

```python
from transformers import pipeline

# Load the model and tokenizer into a text-generation pipeline
generator = pipeline("text-generation", model="Naman0807/gujarati-lm")

prompt = "સરકાર"  # "government"
output = generator(prompt, max_length=50, num_return_sequences=1)

print(output[0]['generated_text'])
```

### Using the Model and Tokenizer

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Naman0807/gujarati-lm")
model = AutoModelForCausalLM.from_pretrained("Naman0807/gujarati-lm")

input_text = "ગુજરાત રાજ્ય"  # "Gujarat state"
input_ids = tokenizer.encode(input_text, return_tensors="pt")

# Greedy decoding; max_length counts the prompt tokens as well
output = model.generate(input_ids, max_length=50, num_return_sequences=1)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

print(generated_text)
```
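
Greedy decoding (the default above) tends to loop on a model this small; sampling usually produces more varied output. A minimal sketch, with decoding settings that are illustrative rather than tuned for this model:

```python
# Sampled generation; these decoding settings are illustrative, not tuned
output = model.generate(
    input_ids,
    max_new_tokens=40,   # continue 40 tokens beyond the prompt
    do_sample=True,      # sample from the distribution instead of greedy decoding
    top_k=50,            # restrict sampling to the 50 most likely tokens
    top_p=0.95,          # nucleus sampling threshold
    temperature=0.8,     # slightly sharpen the distribution
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```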

## Training Data

The model was trained on a dataset of Gujarati news headlines.

- **Source:** Custom CSV dataset (`train.csv`, `valid.csv`)
- **Content:** News headlines and short text snippets.
- **Preprocessing:** Text was normalized using NFKC and tokenized using a custom SentencePiece tokenizer trained on the corpus (reproduced in the sketch below).
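
For reference, the preprocessing described above can be reproduced roughly as follows. The CSV file name matches the dataset listed; the `text` column name is an assumption, as the card does not state the schema:

```python
import unicodedata

import pandas as pd
import sentencepiece as spm

# NFKC-normalize the raw headlines; the "text" column name is an assumption
df = pd.read_csv("train.csv")
lines = [unicodedata.normalize("NFKC", str(t)) for t in df["text"]]

with open("corpus.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines))

# Train a Unigram SentencePiece model with the 64,000-entry vocabulary
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="gujarati_sp",
    vocab_size=64000,
    model_type="unigram",
)
```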

## Training Procedure

### Hyperparameters

- **Epochs:** 5
- **Batch size:** 8
- **Learning rate:** 5e-4
- **Weight decay:** 0.01
- **Optimizer:** AdamW
- **Scheduler:** Linear decay
- **Precision:** Mixed precision (FP16)
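
Mapped onto Hugging Face `TrainingArguments`, these settings look roughly as follows; `output_dir` is illustrative, and `Trainer` uses AdamW by default:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="gujarati-lm",        # illustrative path
    num_train_epochs=5,
    per_device_train_batch_size=8,
    learning_rate=5e-4,
    weight_decay=0.01,
    lr_scheduler_type="linear",      # linear decay
    fp16=True,                       # mixed precision (requires a GPU)
)
```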

### Model Configuration

```python
from transformers import GPT2Config

config = GPT2Config(
    vocab_size=64000,
    n_positions=256,
    n_ctx=256,
    n_embd=512,
    n_layer=6,
    n_head=8,
)
```
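
As a quick sanity check, instantiating an untrained model from this configuration reproduces the parameter count quoted above:

```python
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(vocab_size=64000, n_positions=256, n_embd=512, n_layer=6, n_head=8)
model = GPT2LMHeadModel(config)
print(f"{model.num_parameters():,}")  # ~51.8M, matching the "~52M" above
```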

## Limitations

- **Context length:** The short 256-token context window limits the model's ability to generate long, coherent articles.
- **Dataset bias:** Trained primarily on headlines, so the output style is terse and news-oriented.
- **Size:** As a compact model (6 layers, 512-dim hidden states), it lacks the reasoning capabilities of larger LLMs.