TamilSense
Lightweight, production-grade Tamil/Tanglish sentiment analysis, built for real-world deployment
What Is This?
Large general-purpose models such as ChatGPT are expensive to run, which makes them impractical to integrate into a local government portal, a small business app, or a low-bandwidth mobile app in rural Tamil Nadu.
TamilSense fills that gap: a lightweight, fast, open-source Tamil sentiment API that any developer can plug in.
Performance
| Metric | Score |
|---|---|
| Accuracy | 94.7% |
| Weighted F1 | 94.7% |
| Positive F1 | 97% |
| Negative F1 | 85% |
Trained on 55,064 balanced Tamil-English sentences from two combined datasets.
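For readers unfamiliar with the metric, weighted F1 is the per-class F1 averaged with each class weighted by its share of the true labels. The sketch below shows that calculation in plain Python. The toy labels are invented for illustration and are not TamilSense evaluation data, and the helper names are hypothetical rather than taken from `src/evaluate.py`.

```python
from collections import Counter


def per_class_f1(y_true, y_pred, label):
    """F1 for one class, computed from true/false positives and false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true, y_pred):
    """Average of per-class F1, weighted by each class's support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(
        per_class_f1(y_true, y_pred, label) * count / total
        for label, count in support.items()
    )


# Toy labels, invented for illustration only.
y_true = ["positive"] * 8 + ["negative"] * 2
y_pred = ["positive"] * 7 + ["negative", "negative", "positive"]

print(per_class_f1(y_true, y_pred, "positive"))  # 0.875
print(per_class_f1(y_true, y_pred, "negative"))  # 0.5
print(weighted_f1(y_true, y_pred))               # approximately 0.8
```

Because the weights follow class support, a weaker score on a smaller class pulls the weighted figure down less than it would pull down an unweighted (macro) average.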
Live Demo
Try it on Hugging Face Spaces
Quick Start
from transformers import pipeline
classifier = pipeline("text-classification", model="vishnuexe/TamilSense-model")
result = classifier("Super da machan vera level!")
print(result) # [{'label': 'positive', 'score': 0.9965}]
REST API
# Run locally
uvicorn app.main:app --reload
# Predict
curl -X POST "http://localhost:8000/predict" \
-H "Content-Type: application/json" \
-d '{"text": "Romba nalla iruku bro"}'
Response:
{
"text": "Romba nalla iruku bro",
"sentiment": "positive",
"confidence": 0.9966,
"scores": {"positive": 0.9966, "negative": 0.0034},
"response_time_ms": 48.23
}
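The same request can be made from Python. The following is a minimal client sketch using only the standard library. It assumes the server is running locally on port 8000 as in the `curl` example, and that the response carries the `sentiment` and `confidence` fields shown above. The function names are hypothetical and not part of the TamilSense codebase.

```python
import json
import urllib.request


def build_predict_request(text, base_url="http://localhost:8000"):
    """Build the POST request for /predict without sending it."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_prediction(body):
    """Extract (sentiment, confidence) from a JSON response body."""
    data = json.loads(body)
    return data["sentiment"], float(data["confidence"])


def predict(text, base_url="http://localhost:8000", timeout=10.0):
    """Send the text to a running server and return (sentiment, confidence)."""
    request = build_predict_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_prediction(response.read().decode("utf-8"))
```

With the server started via `uvicorn`, calling `predict("Romba nalla iruku bro")` should return a tuple such as `("positive", 0.9966)`, matching the response shown above.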
Architecture
Tamil/Tanglish Text → MuRIL Tokenizer (WordPiece) → 12 Transformer Layers (Google MuRIL) → [CLS] Vector (768-dim) → Linear Classifier → Positive / Negative
Dataset
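To make the last two stages of that pipeline concrete, here is a simplified sketch of a linear classifier over a 768-dimensional [CLS] vector, followed by a softmax. It uses random stand-in values rather than real MuRIL outputs or trained weights. The label order is an assumption, and a real fine-tuned head may include extra steps (such as pooling and dropout) that are omitted here.

```python
import math
import random

LABELS = ["negative", "positive"]  # assumed label order, for illustration
HIDDEN_SIZE = 768                  # size of the [CLS] vector


def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(cls_vector, weights, biases):
    """Apply one linear layer (one weight row per label), then softmax."""
    logits = [
        sum(w * x for w, x in zip(row, cls_vector)) + b
        for row, b in zip(weights, biases)
    ]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], dict(zip(LABELS, probs))


# Random stand-ins for the encoder output and the trained classifier weights.
rng = random.Random(0)
cls_vector = [rng.gauss(0, 1) for _ in range(HIDDEN_SIZE)]
weights = [[rng.gauss(0, 0.02) for _ in range(HIDDEN_SIZE)] for _ in LABELS]
biases = [0.0, 0.0]

label, scores = classify(cls_vector, weights, biases)
print(label, scores)
```

The returned dictionary has the same shape as the `scores` field in the REST response: one probability per label, with the predicted label being the larger of the two.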
| Source | Size |
|---|---|
| tamilmixsentiment (FIRE 2020) | 15,744 sentences |
| DravidianCodeMix Zenodo | ~44,000 sentences |
| Final balanced train set | 55,064 sentences |
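The table does not say how the combined data was balanced. One common approach is random undersampling, where every class is cut down to the size of the smallest one. The sketch below illustrates that idea only. It is an assumption, not a description of what `src/prepare_data.py` does, and the example sentences are placeholders.

```python
import random
from collections import Counter, defaultdict


def balance_by_undersampling(examples, seed=42):
    """Randomly downsample each class to the size of the smallest class.

    `examples` is a list of (text, label) pairs.
    """
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    target = min(len(items) for items in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the split reproducible

    balanced = []
    for items in by_label.values():
        balanced.extend(rng.sample(items, target))
    rng.shuffle(balanced)
    return balanced


# Placeholder data: 6 positive and 2 negative examples.
examples = [(f"pos {i}", "positive") for i in range(6)]
examples += [(f"neg {i}", "negative") for i in range(2)]

balanced = balance_by_undersampling(examples)
counts = Counter(label for _, label in balanced)
print(counts["positive"], counts["negative"])  # 2 2
```

Undersampling discards data from the larger class. Oversampling or class-weighted loss are alternatives when the smaller class is too small to discard against.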
MLOps Pipeline
- Experiment tracking: MLflow (4 training runs logged with params and metrics)
- GPU training: NVIDIA RTX 4060 (fp16 mixed precision, ~45 mins)
- Model hosting: Hugging Face Model Hub
- Deployment: Hugging Face Spaces (Gradio) + FastAPI
- Containerization: Docker
Training Progress
| Run | Data | F1 |
|---|---|---|
| Run 1: 3-class | 10k sentences | 67.5% |
| Run 2: Binary | 15k sentences | 80.8% |
| Run 3: Tuned | 15k sentences | 81.5% |
| Run 4: Full data | 55k sentences | 94.7% |
Project Structure
TamilSense/
├── app/
│   ├── main.py            # FastAPI REST API
│   └── gradio_app.py      # Gradio demo UI
├── src/
│   ├── prepare_data.py    # Data loading and balancing
│   ├── train.py           # Fine-tuning with MLflow tracking
│   ├── evaluate.py        # Evaluation + confusion matrix
│   └── predict.py         # Inference pipeline
├── Dockerfile
└── requirements.txt
Roadmap
- IndicBERT experiment on pure Tamil script dataset
- INT8 quantization to reduce model size 4x
- Batch prediction API endpoint
- FIRE 2021/2022 data integration
- Offensive language detection endpoint (v2)
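The 4x figure in the INT8 quantization item corresponds to storing each weight in 1 byte instead of the 4 bytes used by fp32. The sketch below shows that arithmetic with a simple symmetric quantization scheme on a handful of made-up weights. It illustrates the idea only and is not the planned implementation. A real model would normally be quantized with a framework's own tooling, and the overall size reduction depends on which layers are quantized.

```python
from array import array


def quantize_int8(values):
    """Symmetric quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = array(
        "b",  # signed 8-bit integers, 1 byte each
        (max(-127, min(127, round(v / scale))) for v in values),
    )
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the int8 representation."""
    return [q * scale for q in quantized]


# Made-up weights, for illustration only.
weights = [0.5, -1.27, 0.031, 0.0, 1.0]

fp32 = array("f", weights)  # 32-bit floats, 4 bytes each
int8, scale = quantize_int8(weights)
restored = dequantize(int8, scale)

size_ratio = (fp32.itemsize * len(fp32)) / (int8.itemsize * len(int8))
max_error = max(abs(w, ) if False else abs(w - r) for w, r in zip(weights, restored))
print(size_ratio)  # 4.0
print(max_error <= scale / 2 + 1e-12)  # True
```

The storage saving comes at the cost of a rounding error of at most half the scale per weight, which is why accuracy is usually re-checked after quantizing.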
License
MIT License: free to use, modify, and deploy.
Built by Vishnu | HuggingFace | GitHub