---
license: apache-2.0
language:
- ta
---

# **Debug Divas: Colloquial Tamil Translation Model**

🚀 **Fine-tuned Mistral-7B for English-to-Colloquial Tamil Translation**

[License: MIT](https://opensource.org/licenses/MIT)

## 🌟 **Overview**

This model is a **fine-tuned version of Mistral-7B**, trained with **Unsloth's FastLanguageModel** to translate **English text into colloquial (spoken) Tamil**. It targets **everyday Tamil conversation**, which makes it useful for chatbots, assistants, and translation tools.

---

## 📖 **Model Details**

- **Base Model**: [Mistral-7B-Instruct](https://huggingface.co/mistral-7b-instruct)
- **Fine-Tuning Dataset**: Custom dataset (`debug_divas_dataset.json`) of **English → Colloquial Tamil** translation pairs.
- **Training Library**: [Unsloth](https://github.com/unslothai/unsloth) (optimized training for large models)
- **Max Sequence Length**: 128 tokens
- **Batch Size**: 8
- **Epochs**: 3
- **Optimizer**: AdamW

---

## 🔧 **Installation & Setup**

To use this model, install the necessary dependencies:

```bash
pip install torch transformers datasets unsloth accelerate
```
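
After installing, one quick way to confirm that the packages are visible to the current interpreter is sketched below. It uses only the standard library and the package names from the command above; the helper name `missing_packages` is made up for this illustration.

```python
import importlib.util

# Package names taken from the pip command above.
REQUIRED = ["torch", "transformers", "datasets", "unsloth", "accelerate"]


def missing_packages(names):
    """Return the names that the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

The check only locates each package without importing it, so it runs quickly even on machines without a GPU.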

---

## 🚀 **Usage**

### **Load Model & Tokenizer**

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned model and tokenizer from Hugging Face
model_name = "your-huggingface-username/debug-divas-tamil-translation"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)


# Translation function
def translate_english_to_tamil(input_text):
    instruction = "Translate the following English sentence to colloquial Tamil"
    inputs = tokenizer(f"{instruction}: {input_text}", return_tensors="pt").to(device)

    # max_length counts the prompt tokens as well as the generated ones
    output_ids = model.generate(**inputs, max_length=128)

    # A causal LM returns prompt + continuation; decode only the new tokens
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()


# Example usage
input_text = "The pharmacy is near the bus stop."
translated_text = translate_english_to_tamil(input_text)
print("Colloquial Tamil:", translated_text)
```

---

## 📝 **Example Outputs**

| **English** | **Colloquial Tamil Translation** |
|------------|---------------------------------|
| "How are you?" | "நீங்க எப்படி இருக்கீங்க?" |
| "I am going to the market." | "நான் மார்க்கெட்டுக்கு பொறேன்." |
| "The pharmacy is near the bus stop." | "மருந்துக் கடை பஸ்ஸ்டாப் அருகே இருக்க." |

---

## 📚 **Dataset**

The dataset consists of **English sentences** paired with their **colloquial Tamil translations**.

Example format:

```json
[
  {
    "input": "How are you?",
    "output": "நீங்க எப்படி இருக்கீங்க?"
  },
  {
    "input": "I am going to the market.",
    "output": "நான் மார்க்கெட்டுக்கு பொறேன்."
  }
]
```
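
As a rough illustration of how records in this format could be read and checked before training, the sketch below parses the JSON, verifies that each record has non-empty `input` and `output` strings, and joins each pair with the instruction string from the Usage section. The prompt layout and the helper names (`load_pairs`, `build_example`) are assumptions for illustration; this card does not state how prompts were formatted during fine-tuning.

```python
import json

# Instruction string copied from the Usage section. Whether the same string
# was used during fine-tuning is an assumption.
INSTRUCTION = "Translate the following English sentence to colloquial Tamil"


def load_pairs(json_text):
    """Parse a JSON array of {"input", "output"} records and validate each one."""
    records = json.loads(json_text)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of records")
    pairs = []
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            raise ValueError(f"record {index}: expected a JSON object")
        source = record.get("input")
        target = record.get("output")
        if not isinstance(source, str) or not source.strip():
            raise ValueError(f"record {index}: missing or empty 'input'")
        if not isinstance(target, str) or not target.strip():
            raise ValueError(f"record {index}: missing or empty 'output'")
        pairs.append((source.strip(), target.strip()))
    return pairs


def build_example(source, target):
    """Join prompt and target into one training string (hypothetical layout)."""
    return f"{INSTRUCTION}: {source}\n{target}"


sample = '[{"input": "How are you?", "output": "நீங்க எப்படி இருக்கீங்க?"}]'
pairs = load_pairs(sample)
print(build_example(*pairs[0]))
```

To run the same check on the full file, pass the contents of `debug_divas_dataset.json`, read as UTF-8 text, to `load_pairs` instead of the inline sample.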

---

## 🏗 **Training Details**

The model was fine-tuned using **UnslothTrainer** with the following hyperparameters:

- **Batch Size**: 8
- **Epochs**: 3
- **Learning Rate**: 2e-5
- **FP16 Training**: Disabled
- **Optimizer**: AdamW
- **Dataset Split**: 80% Train, 20% Test
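
To make the split and the schedule concrete, here is a minimal sketch of an 80/20 shuffle-and-split and the resulting number of optimizer steps for a batch size of 8 over 3 epochs. The dataset size of 1,000 pairs, the seed, and the helper names are hypothetical, and the step count assumes no gradient accumulation and that a final partial batch is kept; none of these details are stated in this card.

```python
import math
import random


def train_test_split(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of `records` and split it into (train, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    test_size = int(round(len(shuffled) * test_fraction))
    return shuffled[test_size:], shuffled[:test_size]


def total_optimizer_steps(num_train, batch_size=8, epochs=3):
    """Optimizer steps over all epochs, assuming no gradient accumulation."""
    return math.ceil(num_train / batch_size) * epochs


records = list(range(1000))  # placeholder for 1,000 translation pairs
train, test = train_test_split(records)
print(len(train), len(test))              # 800 200
print(total_optimizer_steps(len(train)))  # 300
```

Fixing the seed keeps the test set stable across runs, so scores from different training runs are measured on the same held-out sentences.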

---

## ⚖ **License & Citation**

This model is released under the **MIT License**. If you use it in your work, please cite:

```bibtex
@misc{debugdivas2025,
  author = {Debug Divas},
  title = {Fine-tuned Mistral-7B for Colloquial Tamil Translation},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/your-huggingface-username/debug-divas-tamil-translation}
}
```

---

**Dataset link:** [anitha2520/debug_divas45](https://huggingface.co/datasets/anitha2520/debug_divas45/tree/main)

## ❤️ **Contributions & Feedback**

We welcome feedback and contributions! Feel free to open an issue or contribute to our dataset.

📧 **Contact:** [xidanitha@gmail.com]

---