TamilSense 🌟

Lightweight, production-grade Tamil/Tanglish sentiment analysis — built for real-world deployment

🎯 What Is This?

ChatGPT is a trillion-parameter monster that costs millions to run. Nobody is integrating it into a local government portal, a small business app, or a low-bandwidth mobile app in rural Tamil Nadu.

TamilSense fills that gap — a lightweight, fast, open-source Tamil sentiment API that any developer can plug in.

📊 Performance

Metric	Score
Accuracy	94.7%
Weighted F1	94.7%
Positive F1	97%
Negative F1	85%

Trained on 55,064 balanced Tamil-English sentences from two combined datasets.

🚀 Live Demo

👉 Try it on Hugging Face Spaces

⚡ Quick Start

from transformers import pipeline

classifier = pipeline("text-classification", model="vishnuexe/TamilSense-model")
result = classifier("Super da machan vera level!")
print(result)  # [{'label': 'positive', 'score': 0.9965}]

🛠️ REST API

# Run locally
uvicorn app.main:app --reload

# Predict
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"text": "Romba nalla iruku bro"}'

Response:

{
  "text": "Romba nalla iruku bro",
  "sentiment": "positive",
  "confidence": 0.9966,
  "scores": {"positive": 0.9966, "negative": 0.0034},
  "response_time_ms": 48.23
}

🏗️ Architecture

Tamil/Tanglish Text ↓ MuRIL Tokenizer (WordPiece) ↓ 12 Transformer Layers (Google MuRIL) ↓ [CLS] Vector (768-dim) ↓ Linear Classifier → Positive / Negative

📦 Dataset

Source	Size
tamilmixsentiment (FIRE 2020)	15,744 sentences
DravidianCodeMix Zenodo	~44,000 sentences
Final balanced train set	55,064 sentences

🔬 MLOps Pipeline

Experiment tracking: MLflow — 4 training runs logged with params and metrics
GPU training: NVIDIA RTX 4060 — fp16 mixed precision, ~45 mins
Model hosting: Hugging Face Model Hub
Deployment: Hugging Face Spaces (Gradio) + FastAPI
Containerization: Docker

📈 Training Progress

Run	Data	F1
Run 1 — 3-class	10k sentences	67.5%
Run 2 — Binary	15k sentences	80.8%
Run 3 — Tuned	15k sentences	81.5%
Run 4 — Full data	55k sentences	94.7%

🗂️ Project Structure

TamilSense/ ├── app/ │ ├── main.py # FastAPI REST API │ └── gradio_app.py # Gradio demo UI ├── src/ │ ├── prepare_data.py # Data loading and balancing │ ├── train.py # Fine-tuning with MLflow tracking │ ├── evaluate.py # Evaluation + confusion matrix │ └── predict.py # Inference pipeline ├── Dockerfile └── requirements.txt

🔮 Roadmap

IndicBERT experiment on pure Tamil script dataset
INT8 quantization — reduce model size 4x
Batch prediction API endpoint
FIRE 2021/2022 data integration
Offensive language detection endpoint (v2)

📄 License

MIT License — free to use, modify, and deploy.

Built by Vishnu | HuggingFace | GitHub

Downloads last month: 56

Safetensors

Model size

0.2B params

Tensor type

F32

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

vishnuexe
/

TamilSense-model