---
title: TamilSense
emoji: 🌟
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: "6.14.0"
python_version: "3.10"
app_file: app/gradio_app.py
pinned: false
---
# TamilSense 🌟
> Lightweight, production-grade Tamil/Tanglish sentiment analysis, built for real-world deployment
[![HuggingFace Space](https://img.shields.io/badge/🤗%20HuggingFace-TamilSense-blue)](https://huggingface.co/spaces/vishnuexe/TamilSense)
[![Model](https://img.shields.io/badge/🤗%20Model-TamilSense--model-green)](https://huggingface.co/vishnuexe/TamilSense-model)
[![License](https://img.shields.io/badge/License-MIT-yellow)](LICENSE)
## The Problem
ChatGPT-scale models are enormous and expensive to run. Nobody is integrating one into a local government portal, a small business app, or a low-bandwidth mobile app in rural Tamil Nadu.
**TamilSense fills that gap**: a lightweight, fast, open-source Tamil sentiment API that any developer can plug in.
## Performance
| Metric | Score |
|--------|-------|
| Accuracy | **94.7%** |
| Weighted F1 | **94.7%** |
| Positive F1 | **97%** |
| Negative F1 | **85%** |
Trained on **55,064 balanced Tamil-English sentences** from two combined datasets.
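Weighted F1 is the support-weighted mean of the per-class F1 scores, so it sits closer to whichever class has more evaluation examples. The sketch below shows that arithmetic with the per-class scores from the table; the support counts are hypothetical, since the evaluation split sizes are not stated here.

```python
def weighted_f1(per_class_f1, support):
    """Return the support-weighted mean of per-class F1 scores."""
    total = sum(support.values())
    return sum(per_class_f1[label] * support[label] for label in per_class_f1) / total


# Per-class F1 scores taken from the table above.
per_class_f1 = {"positive": 0.97, "negative": 0.85}

# Hypothetical number of evaluation examples per class (illustration only).
support = {"positive": 800, "negative": 200}

print(round(weighted_f1(per_class_f1, support), 3))
```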
## 🚀 Live Demo
👉 [**Try it on Hugging Face Spaces**](https://huggingface.co/spaces/vishnuexe/TamilSense)
## ⚡ Quick Start
```python
from transformers import pipeline
classifier = pipeline("text-classification", model="vishnuexe/TamilSense-model")
result = classifier("Super da machan vera level!")
print(result) # [{'label': 'positive', 'score': 0.9965}]
```
## 🛠️ REST API
```bash
# Run locally
uvicorn app.main:app --reload
# Predict
curl -X POST "http://localhost:8000/predict" \
-H "Content-Type: application/json" \
-d '{"text": "Romba nalla iruku bro"}'
```
Response:
```json
{
  "text": "Romba nalla iruku bro",
  "sentiment": "positive",
  "confidence": 0.9966,
  "scores": {"positive": 0.9966, "negative": 0.0034},
  "response_time_ms": 48.23
}
```
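The same call can be made from Python. This is a minimal client sketch using only the standard library; it assumes the server is running locally as in the curl example and that the response has the shape shown above. The helper names (`build_request`, `parse_prediction`, `predict`) are illustrative and not part of the project.

```python
import json
import urllib.request

# Local address from the curl example above.
API_URL = "http://localhost:8000/predict"


def build_request(text, url=API_URL):
    """Build a POST request carrying the JSON body the endpoint expects."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_prediction(raw_bytes):
    """Extract (sentiment, confidence) from a response shaped like the sample."""
    payload = json.loads(raw_bytes)
    return payload["sentiment"], payload["confidence"]


def predict(text, timeout=10.0):
    """Send one text to the running API and return (sentiment, confidence)."""
    with urllib.request.urlopen(build_request(text), timeout=timeout) as resp:
        return parse_prediction(resp.read())


# With the server running:
# predict("Romba nalla iruku bro")
```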
## 🏗️ Architecture
Tamil/Tanglish Text
↓
MuRIL Tokenizer (WordPiece)
↓
12 Transformer Layers (Google MuRIL)
↓
[CLS] Vector (768-dim)
↓
Linear Classifier → Positive / Negative
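The last step of the diagram, a linear layer over the 768-dimensional [CLS] vector followed by a softmax, can be sketched in plain Python. The weights and the input vector here are random placeholders rather than the trained model's values, and the label order is an assumption.

```python
import math
import random

HIDDEN_SIZE = 768                  # size of the [CLS] vector in the diagram
LABELS = ["negative", "positive"]  # assumed label order


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(cls_vector, weights, bias):
    """Apply one linear layer (one weight row per label), then softmax."""
    logits = [
        sum(w * x for w, x in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]
    return dict(zip(LABELS, softmax(logits)))


# Random placeholders standing in for trained weights and a real [CLS] vector.
rng = random.Random(0)
weights = [[rng.gauss(0, 0.02) for _ in range(HIDDEN_SIZE)] for _ in LABELS]
bias = [0.0, 0.0]
cls_vector = [rng.gauss(0, 1) for _ in range(HIDDEN_SIZE)]

scores = classify(cls_vector, weights, bias)
print(scores)
```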
## 📦 Dataset
| Source | Size |
|--------|------|
| tamilmixsentiment (FIRE 2020) | 15,744 sentences |
| DravidianCodeMix Zenodo | ~44,000 sentences |
| **Final balanced train set** | **55,064 sentences** |
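How the two sources are balanced is not described here. One common approach is to downsample every class to the size of the smallest one, sketched below on toy data. The function name and the placeholder sentences are illustrative only and may not match what `src/prepare_data.py` does.

```python
import random
from collections import Counter, defaultdict


def balance_by_downsampling(examples, seed=42):
    """Downsample each label to the size of the rarest label.

    `examples` is a list of (text, label) pairs.
    """
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    # Every class is cut down to the size of the smallest class.
    target = min(len(items) for items in by_label.values())

    rng = random.Random(seed)
    balanced = []
    for items in by_label.values():
        balanced.extend(rng.sample(items, target))
    rng.shuffle(balanced)
    return balanced


# Toy data with placeholder sentences: 5 positive, 2 negative.
toy = [(f"positive example {i}", "positive") for i in range(5)]
toy += [(f"negative example {i}", "negative") for i in range(2)]

balanced = balance_by_downsampling(toy)
print(Counter(label for _, label in balanced))
```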
## 🔬 MLOps Pipeline
- **Experiment tracking**: MLflow (4 training runs logged with params and metrics)
- **GPU training**: NVIDIA RTX 4060 (fp16 mixed precision, ~45 mins)
- **Model hosting**: Hugging Face Model Hub
- **Deployment**: Hugging Face Spaces (Gradio) + FastAPI
- **Containerization**: Docker
## 📈 Training Progress
| Run | Data | F1 |
|-----|------|----|
| Run 1: 3-class | 10k sentences | 67.5% |
| Run 2: Binary | 15k sentences | 80.8% |
| Run 3: Tuned | 15k sentences | 81.5% |
| **Run 4: Full data** | **55k sentences** | **94.7%** |
## 🗂️ Project Structure
TamilSense/
├── app/
│   ├── main.py            # FastAPI REST API
│   └── gradio_app.py      # Gradio demo UI
├── src/
│   ├── prepare_data.py    # Data loading and balancing
│   ├── train.py           # Fine-tuning with MLflow tracking
│   ├── evaluate.py        # Evaluation + confusion matrix
│   └── predict.py         # Inference pipeline
├── Dockerfile
└── requirements.txt
## 🔮 Roadmap
- [ ] IndicBERT experiment on pure Tamil script dataset
- [ ] INT8 quantization: reduce model size 4x
- [ ] Batch prediction API endpoint
- [ ] FIRE 2021/2022 data integration
- [ ] Offensive language detection endpoint (v2)
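The 4x figure for INT8 quantization follows from storage width: a 32-bit float weight takes 4 bytes and an 8-bit integer weight takes 1. A rough sketch of that arithmetic, assuming the baseline is stored in fp32 and ignoring quantization scales and any layers left unquantized; the parameter count is a hypothetical round number, not the model's actual size.

```python
def model_size_mb(num_params, bits_per_weight):
    """Approximate weight storage in megabytes (weights only)."""
    return num_params * bits_per_weight / 8 / 1e6


NUM_PARAMS = 100_000_000  # hypothetical parameter count, for illustration only

fp32_mb = model_size_mb(NUM_PARAMS, 32)
int8_mb = model_size_mb(NUM_PARAMS, 8)

print(fp32_mb, int8_mb, fp32_mb / int8_mb)
```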
## 📄 License
MIT License: free to use, modify, and deploy.
---
**Built by Vishnu** | [HuggingFace](https://huggingface.co/vishnuexe) | [GitHub](https://github.com/vishnu3105)