Gujarati Llama2 Base

A compact, fine-tuned Llama2 model specialized for the Gujarati language.

Model Details

Model Description

This is a Llama2-based language model trained on Gujarati text data. It's designed to generate coherent and contextually relevant Gujarati language text. The model is a smaller variant optimized for efficiency while maintaining reasonable language understanding capabilities.

  • Model type: Language Model (Causal Language Model)
  • Library: transformers
  • License: Apache 2.0

Model Architecture

  • Architecture: Llama
  • Hidden Size: 512
  • Number of Attention Heads: 16
  • Number of Hidden Layers: 6
  • Intermediate Size: 2736
  • Max Position Embeddings: 1024
  • Vocabulary Size: 64,000
  • Head Dimension: 32
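The architecture numbers above imply a parameter count in the tens of millions. As a back-of-the-envelope sanity check (assuming standard Llama blocks and untied input/output embeddings, which this card does not confirm; norm and bias terms are ignored):

```python
# Rough parameter estimate from the architecture table above.
# Assumes standard Llama layers and untied embeddings (an assumption,
# not confirmed by the model card); norm weights are ignored.
hidden = 512
layers = 6
intermediate = 2736
vocab = 64_000

embed = vocab * hidden           # token embedding matrix
attn = 4 * hidden * hidden       # q, k, v, o projections per layer
mlp = 3 * hidden * intermediate  # gate, up, down projections per layer
lm_head = vocab * hidden         # output projection (if untied)

total = embed + layers * (attn + mlp) + lm_head
print(f"~{total / 1e6:.0f}M parameters")  # ~97M parameters
```

Most of that budget sits in the two vocabulary-sized matrices; the six transformer layers themselves account for only about 31M parameters.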

Model Performance

This model is optimized for Gujarati text generation tasks. It has been trained on a curated Gujarati text corpus and is capable of:

  • Text generation in Gujarati
  • Contextual understanding of Gujarati language patterns
  • Continuation of Gujarati sentences and paragraphs

Intended Use

Primary Use Cases:

  • Gujarati text generation
  • Language understanding and analysis tasks
  • Fine-tuning for downstream Gujarati NLP tasks

Out-of-scope Uses:

  • Production-level applications without further fine-tuning
  • Non-Gujarati language tasks (not optimized for other languages)

How to Use

Using the transformers library

from transformers import LlamaForCausalLM, PreTrainedTokenizerFast

# Load model and tokenizer
model = LlamaForCausalLM.from_pretrained("Naman0807/gujarati-llama2-base")
tokenizer = PreTrainedTokenizerFast.from_pretrained("Naman0807/gujarati-llama2-base")

# Generate text (do_sample=True is required for temperature to take effect)
prompt = "ગુજરાત એક અદભુત"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

Using pipelines

from transformers import pipeline

generator = pipeline("text-generation", model="Naman0807/gujarati-llama2-base")
output = generator("ગુજરાત", max_new_tokens=100, do_sample=True)
print(output[0]["generated_text"])

Training Data

  • Dataset: Custom Gujarati text corpus
  • Data Source: Preprocessed Gujarati language texts
  • Training Data Size: Approximately 2GB of clean Gujarati text

Training Procedure

Training Hyperparameters

  • Learning Rate: 2e-4
  • Batch Size: Optimized for GPU training
  • Max Sequence Length: 1024 tokens
  • Number of Epochs: Multiple epochs over the dataset
  • Optimizer: AdamW
  • Tokenizer: Unigram-based tokenizer (trained on Gujarati corpus)
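The listed hyperparameters could be wired into a Hugging Face Trainer roughly as follows. This is only a sketch: the batch size and epoch count below are illustrative placeholders, since the card does not publish the actual values.

```python
from transformers import Trainer, TrainingArguments

# Illustrative configuration sketch. learning_rate, max sequence length,
# and AdamW come from the hyperparameter list above; batch size and
# epoch count are placeholders, not the values actually used.
training_args = TrainingArguments(
    output_dir="gujarati-llama2-base",
    learning_rate=2e-4,             # from the hyperparameter list
    per_device_train_batch_size=8,  # placeholder
    num_train_epochs=3,             # placeholder ("multiple epochs")
    optim="adamw_torch",            # AdamW optimizer
    fp16=True,                      # typical for GPU-accelerated training
)

# trainer = Trainer(model=model, args=training_args, train_dataset=dataset)
# trainer.train()
```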

Training Details

  • Framework: Hugging Face Transformers with PyTorch
  • Hardware: GPU-accelerated training
  • Preprocessing: Text normalization and cleaning

Tokenizer

A custom Unigram tokenizer has been trained on the Gujarati corpus with:

  • Vocabulary Size: 64,000 tokens
  • Special Tokens: <pad>, <unk>, <bos>, <eos>
  • Pre-tokenizer: Metaspace-based splitting for Gujarati text
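To illustrate what Metaspace pre-tokenization does (spaces are replaced with the ▁ marker before the Unigram model segments the text, so word boundaries survive subword splitting), here is a minimal pure-Python imitation. The real tokenizer uses the `tokenizers` library's Metaspace pre-tokenizer; this sketch only mimics its word-splitting step.

```python
# Minimal imitation of Metaspace pre-tokenization: each space becomes a
# U+2581 (▁) marker attached to the start of the following word.
# (The actual tokenizer uses the `tokenizers` library; this is a sketch.)
def metaspace_pretokenize(text: str, replacement: str = "\u2581") -> list[str]:
    # Prepend a space so the first word also gets a boundary marker,
    # then split on the marker, keeping it attached to each piece.
    marked = (" " + text).replace(" ", replacement)
    return [replacement + piece for piece in marked.split(replacement) if piece]

print(metaspace_pretokenize("ગુજરાત એક અદભુત"))
# → ['▁ગુજરાત', '▁એક', '▁અદભુત']
```

Each marked piece is then further segmented into subword units by the Unigram model against the 64,000-token vocabulary.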

Limitations and Biases

  • Language-specific: Optimized for Gujarati; performance on other languages is not guaranteed
  • Model size: Compact model (6 layers) may have limitations in capturing complex language patterns compared to larger models
  • Training data: Limited to specific Gujarati text sources, which may introduce biases
  • Context length: Maximum context of 1024 tokens may be limiting for longer documents
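For documents longer than the 1024-token context, a common workaround is to split the token sequence into overlapping windows and process each window separately. A minimal sketch (the window and stride values are illustrative; the overlap preserves some context across window boundaries):

```python
# Split a long token-id sequence into overlapping windows that fit the
# model's 1024-token context. Window/stride values are illustrative.
def sliding_windows(token_ids: list[int],
                    window: int = 1024,
                    stride: int = 768) -> list[list[int]]:
    if len(token_ids) <= window:
        return [token_ids]
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # the final window reached the end of the sequence
    return chunks

# Example: 2000 token ids -> three windows of at most 1024 ids each
chunks = sliding_windows(list(range(2000)))
print([len(c) for c in chunks])  # [1024, 1024, 464]
```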

Ethical Considerations

  • This model should be used responsibly and ethically
  • Users should be aware of potential biases in the training data
  • Outputs should be reviewed for accuracy and appropriateness in production use

Environmental Impact

This is a relatively small model (512 hidden size, 6 layers), resulting in:

  • Lower computational requirements compared to large-scale models
  • Reduced environmental impact during inference
  • Feasibility for deployment on resource-limited devices

Community and Contributions

Feedback, bug reports, and contributions are welcome. Please open an issue or pull request if you have suggestions for improvement.

Citation

If you use this model, please cite:

@misc{gujarati-llama2-base,
  title={Gujarati Llama2 Base},
  author={Naman},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/Naman0807/gujarati-llama2-base}}
}

Disclaimer

This model is provided "as is" without warranties. Users are responsible for ensuring compliance with applicable laws and ethical guidelines when using this model.


Model Version: 1.0
Last Updated: January 2, 2026
