---
language:
  - gu
tags:
  - text-generation
  - gpt2
  - gujarati
  - nlp
license: mit
---

# Gujarati Language Model (GPT-2 Small)

This is a compact Gujarati language model based on the GPT-2 architecture. It has been trained from scratch on a dataset of Gujarati headlines to demonstrate language modeling capabilities in low-resource settings.

## Model Description

- **Architecture:** GPT-2 (decoder-only Transformer)
- **Model size:** ~52 million parameters
- **Context window:** 256 tokens
- **Vocabulary size:** 64,000
- **Tokenizer:** SentencePiece (Unigram)
- **Training framework:** Hugging Face Transformers & Tokenizers

## Intended Use

This model is intended for:

- Text generation in Gujarati.
- Educational purposes, to understand LLM training from scratch.
- Fine-tuning on specific Gujarati downstream tasks.

**Note:** Due to the small model size and the limited training data (headlines), generated text may not remain coherent over long paragraphs.

## How to Use

You can use this model directly with the Hugging Face `pipeline` API or the `AutoModel` classes.

### Using the Pipeline

```python
from transformers import pipeline

# Load the model and tokenizer into a text-generation pipeline
generator = pipeline("text-generation", model="Naman0807/gujarati-lm")

prompt = "સરકાર"  # "government"
output = generator(prompt, max_length=50, num_return_sequences=1)

print(output[0]['generated_text'])
```

### Using the Model and Tokenizer

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Naman0807/gujarati-lm")
model = AutoModelForCausalLM.from_pretrained("Naman0807/gujarati-lm")

input_text = "ગુજરાત રાજ્ય"  # "Gujarat state"
input_ids = tokenizer.encode(input_text, return_tensors="pt")

# Greedy decoding; max_length counts the prompt tokens as well
output = model.generate(input_ids, max_length=50, num_return_sequences=1)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

print(generated_text)
```
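
Greedy decoding (the default above) tends to loop on a model this small; sampling usually produces more varied output. A minimal sketch, with decoding settings that are illustrative rather than tuned for this model:

```python
# Sampled generation; these decoding settings are illustrative, not tuned
output = model.generate(
    input_ids,
    max_new_tokens=40,   # continue 40 tokens beyond the prompt
    do_sample=True,      # sample from the distribution instead of greedy decoding
    top_k=50,            # restrict sampling to the 50 most likely tokens
    top_p=0.95,          # nucleus sampling threshold
    temperature=0.8,     # slightly sharpen the distribution
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```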

## Training Data

The model was trained on a dataset of Gujarati news headlines.

- **Source:** Custom CSV dataset (`train.csv`, `valid.csv`)
- **Content:** News headlines and short text snippets.
- **Preprocessing:** Text was normalized using NFKC and tokenized using a custom SentencePiece tokenizer trained on the corpus (reproduced in the sketch below).
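
For reference, the preprocessing described above can be reproduced roughly as follows. The CSV file name matches the dataset listed; the `text` column name is an assumption, as the card does not state the schema:

```python
import unicodedata

import pandas as pd
import sentencepiece as spm

# NFKC-normalize the raw headlines; the "text" column name is an assumption
df = pd.read_csv("train.csv")
lines = [unicodedata.normalize("NFKC", str(t)) for t in df["text"]]

with open("corpus.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines))

# Train a Unigram SentencePiece model with the 64,000-entry vocabulary
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="gujarati_sp",
    vocab_size=64000,
    model_type="unigram",
)
```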

## Training Procedure

### Hyperparameters

- **Epochs:** 5
- **Batch size:** 8
- **Learning rate:** 5e-4
- **Weight decay:** 0.01
- **Optimizer:** AdamW
- **Scheduler:** Linear decay
- **Precision:** Mixed precision (FP16)
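
Mapped onto Hugging Face `TrainingArguments`, these settings look roughly as follows; `output_dir` is illustrative, and `Trainer` uses AdamW by default:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="gujarati-lm",        # illustrative path
    num_train_epochs=5,
    per_device_train_batch_size=8,
    learning_rate=5e-4,
    weight_decay=0.01,
    lr_scheduler_type="linear",      # linear decay
    fp16=True,                       # mixed precision (requires a GPU)
)
```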

### Model Configuration

```python
from transformers import GPT2Config

config = GPT2Config(
    vocab_size=64000,
    n_positions=256,
    n_ctx=256,
    n_embd=512,
    n_layer=6,
    n_head=8,
)
```
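
As a quick sanity check, instantiating an untrained model from this configuration reproduces the parameter count quoted above:

```python
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(vocab_size=64000, n_positions=256, n_embd=512, n_layer=6, n_head=8)
model = GPT2LMHeadModel(config)
print(f"{model.num_parameters():,}")  # ~51.8M, matching the "~52M" above
```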

## Limitations

- **Context length:** The short 256-token context window limits the model's ability to generate long, coherent articles.
- **Dataset bias:** Trained primarily on headlines, so the output style is terse and news-oriented.
- **Size:** As a compact model (6 layers, 512-dim hidden states), it lacks the reasoning capabilities of larger LLMs.