Gujarati Llama2 Base
A compact, fine-tuned Llama2 model specialized for the Gujarati language.
Model Details
Model Description
This is a Llama2-based language model trained on Gujarati text data. It's designed to generate coherent and contextually relevant Gujarati language text. The model is a smaller variant optimized for efficiency while maintaining reasonable language understanding capabilities.
- Model type: Language Model (Causal Language Model)
- Library: transformers
- License: Apache 2.0
Model Architecture
- Architecture: Llama
- Hidden Size: 512
- Number of Attention Heads: 16
- Number of Hidden Layers: 6
- Intermediate Size: 2736
- Max Position Embeddings: 1024
- Vocabulary Size: 64,000
- Head Dimension: 32
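From the numbers above one can sketch a rough parameter count. This is an estimate only, assuming standard Llama components the card does not confirm: full multi-head attention with no grouped-query sharing, a SwiGLU MLP with gate/up/down projections, and tied input/output embeddings.

```python
# Rough parameter estimate from the architecture table above.
# Assumptions (not stated in the model card): full MHA (no GQA),
# SwiGLU MLP (gate/up/down), tied embeddings, RMSNorm.
hidden, layers, inter, vocab = 512, 6, 2736, 64_000

embeddings = vocab * hidden                       # token embedding matrix
attention = 4 * hidden * hidden                   # q, k, v, o projections
mlp = 3 * hidden * inter                          # gate, up, down projections
norms = 2 * hidden                                # two RMSNorms per layer
per_layer = attention + mlp + norms

total = embeddings + layers * per_layer + hidden  # + final norm
print(f"~{total / 1e6:.1f}M parameters")          # → ~64.3M parameters
```

Under these assumptions the model lands in the ~64M-parameter range, consistent with its description as a compact variant.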
Model Performance
This model is optimized for Gujarati text generation tasks. It has been trained on a curated Gujarati text corpus and is capable of:
- Text generation in Gujarati
- Contextual understanding of Gujarati language patterns
- Continuation of Gujarati sentences and paragraphs
Intended Use
Primary Use Cases:
- Gujarati text generation
- Language understanding and analysis tasks
- Fine-tuning for downstream Gujarati NLP tasks
Out-of-scope Uses:
- Production-level applications without further fine-tuning
- Non-Gujarati language tasks (not optimized for other languages)
How to Use
Using transformers library
from transformers import LlamaForCausalLM, PreTrainedTokenizerFast
# Load model and tokenizer
model = LlamaForCausalLM.from_pretrained("Naman0807/gujarati-llama2-base")
tokenizer = PreTrainedTokenizerFast.from_pretrained("Naman0807/gujarati-llama2-base")
# Generate text (do_sample=True is required for temperature to take effect)
prompt = "ગુજરાત એક અદભુત"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_length=100,
    do_sample=True,
    temperature=0.7,
)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
Using pipelines
from transformers import pipeline
generator = pipeline("text-generation", model="Naman0807/gujarati-llama2-base")
# The pipeline returns a list of dicts with a "generated_text" key
output = generator("ગુજરાત", max_length=100)
print(output[0]["generated_text"])
Training Data
- Dataset: Custom Gujarati text corpus
- Data Source: Preprocessed Gujarati language texts
- Training Data Size: Approximately 2GB of clean Gujarati text
Training Procedure
Training Hyperparameters
- Learning Rate: 2e-4
- Batch Size: Optimized for GPU training
- Max Sequence Length: 1024 tokens
- Number of Epochs: Multiple epochs over the dataset
- Optimizer: AdamW
- Tokenizer: Unigram-based tokenizer (trained on Gujarati corpus)
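For reference, a single AdamW update can be sketched in plain Python. This is a toy scalar version of the standard decoupled-weight-decay rule; apart from the 2e-4 learning rate listed above, the hyperparameter defaults below are illustrative assumptions, and actual training would use torch.optim.AdamW.

```python
import math

# Toy single-parameter AdamW step (illustrative; real training uses
# torch.optim.AdamW over full tensors). b1/b2/eps/wd are assumed defaults.
def adamw_step(p, grad, m, v, t, lr=2e-4, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    m = b1 * m + (1 - b1) * grad            # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad * grad     # second-moment estimate
    m_hat = m / (1 - b1 ** t)               # bias correction
    v_hat = v / (1 - b2 ** t)
    # Decoupled weight decay: wd * p is applied outside the adaptive term
    p = p - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * p)
    return p, m, v
```

The key property of AdamW over plain Adam is that the weight-decay term is decoupled from the gradient-based adaptive update.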
Training Details
- Framework: Hugging Face Transformers with PyTorch
- Hardware: GPU-accelerated training
- Preprocessing: Text normalization and cleaning
Tokenizer
A custom Unigram tokenizer has been trained on the Gujarati corpus with:
- Vocabulary Size: 64,000 tokens
- Special Tokens: <pad>, <unk>, <bos>, <eos>
- Pre-tokenizer: Metaspace-based splitting for Gujarati text
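As a rough illustration, Metaspace pre-tokenization replaces spaces with the marker character ▁ (U+2581) and splits on word boundaries. A minimal pure-Python sketch of the idea (the real behaviour comes from the tokenizers library's Metaspace pre-tokenizer, not this helper):

```python
# Illustrative sketch of Metaspace-style splitting; metaspace_split is a
# hypothetical helper, not part of the released tokenizer.
METASPACE = "\u2581"  # ▁, marks word boundaries

def metaspace_split(text):
    # Prepend the marker and replace every space with it
    marked = METASPACE + text.replace(" ", METASPACE)
    pieces, current = [], ""
    for ch in marked:
        if ch == METASPACE and current:
            pieces.append(current)  # close the previous word piece
            current = ""
        current += ch
    pieces.append(current)
    return pieces

print(metaspace_split("ગુજરાત એક અદભુત"))  # → ['▁ગુજરાત', '▁એક', '▁અદભુત']
```

Marking word starts this way lets the Unigram model reconstruct spacing losslessly when pieces are joined back together.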
Limitations and Biases
- Language-specific: Optimized for Gujarati; performance on other languages is not guaranteed
- Model size: Compact model (6 layers) may have limitations in capturing complex language patterns compared to larger models
- Training data: Limited to specific Gujarati text sources, which may introduce biases
- Context length: Maximum context of 1024 tokens may be limiting for longer documents
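For documents longer than the 1024-token window, one common workaround is to split the token ids into overlapping chunks and process each chunk separately. A minimal sketch (the chunk_ids helper and the 896-token stride are illustrative choices, not part of the model):

```python
# Hypothetical helper: split a long id sequence into overlapping windows
# that each fit the model's 1024-token context.
def chunk_ids(ids, max_len=1024, stride=896):
    # stride < max_len gives each chunk some overlap with the previous
    # one, so context is preserved across chunk boundaries.
    chunks = []
    for start in range(0, len(ids), stride):
        chunks.append(ids[start:start + max_len])
        if start + max_len >= len(ids):
            break  # this window already covers the end of the sequence
    return chunks
```

Each chunk can then be fed to the model independently (e.g. for perplexity scoring or continuation), at the cost of losing context beyond the overlap.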
Ethical Considerations
- This model should be used responsibly and ethically
- Users should be aware of potential biases in the training data
- Outputs should be reviewed for accuracy and appropriateness in production use
Environmental Impact
This is a relatively small model (512 hidden size, 6 layers), resulting in:
- Lower computational requirements compared to large-scale models
- Reduced environmental impact during inference
- Feasibility for deployment on resource-limited devices
Community and Contributions
Feedback, bug reports, and contributions are welcome. Please open an issue or pull request if you have suggestions for improvement.
Citation
If you use this model, please cite:
@misc{gujarati-llama2-base,
  title={Gujarati Llama2 Base},
  author={Naman},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/Naman0807/gujarati-llama2-base}}
}
Disclaimer
This model is provided "as is" without warranties. Users are responsible for ensuring compliance with applicable laws and ethical guidelines when using this model.
Model Version: 1.0
Last Updated: January 2, 2026