Gujarati Llama2 Base

A compact, fine-tuned Llama2 model specialized for the Gujarati language.

Model Details

Model Description

This is a Llama2-based language model trained on Gujarati text data. It's designed to generate coherent and contextually relevant Gujarati language text. The model is a smaller variant optimized for efficiency while maintaining reasonable language understanding capabilities.

  • Model type: Language Model (Causal Language Model)
  • Library: transformers
  • License: Apache 2.0

Model Architecture

  • Architecture: Llama
  • Hidden Size: 512
  • Number of Attention Heads: 16
  • Number of Hidden Layers: 6
  • Intermediate Size: 2736
  • Max Position Embeddings: 1024
  • Vocabulary Size: 64,000
  • Head Dimension: 32
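The architecture numbers above imply a parameter count in the tens of millions. As a back-of-the-envelope sanity check (assuming standard Llama blocks and untied input/output embeddings, which this card does not confirm; norm and bias terms are ignored):

```python
# Rough parameter estimate from the architecture table above.
# Assumes standard Llama layers and untied embeddings (an assumption,
# not confirmed by the model card); norm weights are ignored.
hidden = 512
layers = 6
intermediate = 2736
vocab = 64_000

embed = vocab * hidden           # token embedding matrix
attn = 4 * hidden * hidden       # q, k, v, o projections per layer
mlp = 3 * hidden * intermediate  # gate, up, down projections per layer
lm_head = vocab * hidden         # output projection (if untied)

total = embed + layers * (attn + mlp) + lm_head
print(f"~{total / 1e6:.0f}M parameters")  # ~97M parameters
```

Most of that budget sits in the two vocabulary-sized matrices; the six transformer layers themselves account for only about 31M parameters.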

Model Performance

This model is optimized for Gujarati text generation tasks. It has been trained on a curated Gujarati text corpus and is capable of:

  • Text generation in Gujarati
  • Contextual understanding of Gujarati language patterns
  • Continuation of Gujarati sentences and paragraphs

Intended Use

Primary Use Cases:

  • Gujarati text generation
  • Language understanding and analysis tasks
  • Fine-tuning for downstream Gujarati NLP tasks

Out-of-scope Uses:

  • Production-level applications without further fine-tuning
  • Non-Gujarati language tasks (not optimized for other languages)

How to Use

Using the transformers library

from transformers import LlamaForCausalLM, PreTrainedTokenizerFast

# Load model and tokenizer
model = LlamaForCausalLM.from_pretrained("Naman0807/gujarati-llama2-base")
tokenizer = PreTrainedTokenizerFast.from_pretrained("Naman0807/gujarati-llama2-base")

# Generate text (do_sample=True is required for temperature to take effect)
prompt = "ગુજરાત એક અદભુત"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

Using pipelines

from transformers import pipeline

generator = pipeline("text-generation", model="Naman0807/gujarati-llama2-base")
output = generator("ગુજરાત", max_new_tokens=100, do_sample=True)
print(output[0]["generated_text"])

Training Data

  • Dataset: Custom Gujarati text corpus
  • Data Source: Preprocessed Gujarati language texts
  • Training Data Size: Approximately 2GB of clean Gujarati text

Training Procedure

Training Hyperparameters

  • Learning Rate: 2e-4
  • Batch Size: Optimized for GPU training
  • Max Sequence Length: 1024 tokens
  • Number of Epochs: Multiple epochs over the dataset
  • Optimizer: AdamW
  • Tokenizer: Unigram-based tokenizer (trained on Gujarati corpus)
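The listed hyperparameters could be wired into a Hugging Face Trainer roughly as follows. This is only a sketch: the batch size and epoch count below are illustrative placeholders, since the card does not publish the actual values.

```python
from transformers import Trainer, TrainingArguments

# Illustrative configuration sketch. learning_rate, max sequence length,
# and AdamW come from the hyperparameter list above; batch size and
# epoch count are placeholders, not the values actually used.
training_args = TrainingArguments(
    output_dir="gujarati-llama2-base",
    learning_rate=2e-4,             # from the hyperparameter list
    per_device_train_batch_size=8,  # placeholder
    num_train_epochs=3,             # placeholder ("multiple epochs")
    optim="adamw_torch",            # AdamW optimizer
    fp16=True,                      # typical for GPU-accelerated training
)

# trainer = Trainer(model=model, args=training_args, train_dataset=dataset)
# trainer.train()
```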

Training Details

  • Framework: Hugging Face Transformers with PyTorch
  • Hardware: GPU-accelerated training
  • Preprocessing: Text normalization and cleaning

Tokenizer

A custom Unigram tokenizer has been trained on the Gujarati corpus with:

  • Vocabulary Size: 64,000 tokens
  • Special Tokens: <pad>, <unk>, <bos>, <eos>
  • Pre-tokenizer: Metaspace-based splitting for Gujarati text
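To illustrate what Metaspace pre-tokenization does (spaces are replaced with the ▁ marker before the Unigram model segments the text, so word boundaries survive subword splitting), here is a minimal pure-Python imitation. The real tokenizer uses the `tokenizers` library's Metaspace pre-tokenizer; this sketch only mimics its word-splitting step.

```python
# Minimal imitation of Metaspace pre-tokenization: each space becomes a
# U+2581 (▁) marker attached to the start of the following word.
# (The actual tokenizer uses the `tokenizers` library; this is a sketch.)
def metaspace_pretokenize(text: str, replacement: str = "\u2581") -> list[str]:
    # Prepend a space so the first word also gets a boundary marker,
    # then split on the marker, keeping it attached to each piece.
    marked = (" " + text).replace(" ", replacement)
    return [replacement + piece for piece in marked.split(replacement) if piece]

print(metaspace_pretokenize("ગુજરાત એક અદભુત"))
# → ['▁ગુજરાત', '▁એક', '▁અદભુત']
```

Each marked piece is then further segmented into subword units by the Unigram model against the 64,000-token vocabulary.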

Limitations and Biases

  • Language-specific: Optimized for Gujarati; performance on other languages is not guaranteed
  • Model size: Compact model (6 layers) may have limitations in capturing complex language patterns compared to larger models
  • Training data: Limited to specific Gujarati text sources, which may introduce biases
  • Context length: Maximum context of 1024 tokens may be limiting for longer documents
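For documents longer than the 1024-token context, a common workaround is to split the token sequence into overlapping windows and process each window separately. A minimal sketch (the window and stride values are illustrative; the overlap preserves some context across window boundaries):

```python
# Split a long token-id sequence into overlapping windows that fit the
# model's 1024-token context. Window/stride values are illustrative.
def sliding_windows(token_ids: list[int],
                    window: int = 1024,
                    stride: int = 768) -> list[list[int]]:
    if len(token_ids) <= window:
        return [token_ids]
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # the final window reached the end of the sequence
    return chunks

# Example: 2000 token ids -> three windows of at most 1024 ids each
chunks = sliding_windows(list(range(2000)))
print([len(c) for c in chunks])  # [1024, 1024, 464]
```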

Ethical Considerations

  • This model should be used responsibly and ethically
  • Users should be aware of potential biases in the training data
  • Outputs should be reviewed for accuracy and appropriateness in production use

Environmental Impact

This is a relatively small model (512 hidden size, 6 layers), resulting in:

  • Lower computational requirements compared to large-scale models
  • Reduced environmental impact during inference
  • Feasibility for deployment on resource-limited devices

Community and Contributions

Feedback, bug reports, and contributions are welcome. Please open an issue or pull request if you have suggestions for improvement.

Citation

If you use this model, please cite:

@misc{gujarati-llama2-base,
  title={Gujarati Llama2 Base},
  author={Naman},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/Naman0807/gujarati-llama2-base}}
}

Disclaimer

This model is provided "as is" without warranties. Users are responsible for ensuring compliance with applicable laws and ethical guidelines when using this model.


Model Version: 1.0
Last Updated: January 2, 2026
