TamilSense 🌟

Lightweight, production-grade Tamil/Tanglish sentiment analysis, built for real-world deployment


🎯 What Is This?

ChatGPT-class models are huge and expensive to run. Few teams can integrate one into a local government portal, a small business app, or a low-bandwidth mobile app in rural Tamil Nadu.

TamilSense fills that gap: a lightweight, fast, open-source Tamil sentiment API that any developer can plug in.

📊 Performance

| Metric | Score |
|---|---|
| Accuracy | 94.7% |
| Weighted F1 | 94.7% |
| Positive F1 | 97% |
| Negative F1 | 85% |

Trained on 55,064 balanced Tamil-English sentences from two combined datasets.
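
For readers comparing the per-class and weighted scores, here is a minimal sketch of how a support-weighted F1 is usually computed. The per-class F1 values are taken from the table; the class counts are hypothetical, because the test-set class distribution is not stated in this README.

```python
def weighted_f1(f1_by_class, support_by_class):
    """Average per-class F1 scores, weighting each class by its example count."""
    total = sum(support_by_class.values())
    return sum(f1_by_class[c] * support_by_class[c] for c in f1_by_class) / total


# Per-class F1 from the Performance table.
f1 = {"positive": 0.97, "negative": 0.85}

# Hypothetical test-set counts, chosen only to illustrate the weighting.
support = {"positive": 800, "negative": 200}

print(round(weighted_f1(f1, support), 3))
```

The weighted score sits closer to the F1 of whichever class has more test examples, which is why a weighted F1 can be well above the weaker class's score.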

🚀 Live Demo

👉 Try it on Hugging Face Spaces

⚡ Quick Start

from transformers import pipeline

classifier = pipeline("text-classification", model="vishnuexe/TamilSense-model")
result = classifier("Super da machan vera level!")
print(result)  # [{'label': 'positive', 'score': 0.9965}]

🛠️ REST API

# Run locally
uvicorn app.main:app --reload

# Predict
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"text": "Romba nalla iruku bro"}'

Response:

{
  "text": "Romba nalla iruku bro",
  "sentiment": "positive",
  "confidence": 0.9966,
  "scores": {"positive": 0.9966, "negative": 0.0034},
  "response_time_ms": 48.23
}
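
The same call can be made from Python. This is a minimal client sketch using only the standard library; it assumes the server is running locally on port 8000 as in the `uvicorn` command above and that the response has the fields shown. The helper names (`build_request`, `parse_prediction`, `predict`) are illustrative and not part of the project.

```python
import json
import urllib.request
from typing import Tuple

# Assumed local address, matching the curl example above.
API_URL = "http://localhost:8000/predict"


def build_request(text: str, url: str = API_URL) -> urllib.request.Request:
    """Build a POST request carrying the JSON body {"text": ...}."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_prediction(body: bytes) -> Tuple[str, float]:
    """Extract (sentiment, confidence) from the JSON response body."""
    data = json.loads(body)
    return data["sentiment"], data["confidence"]


def predict(text: str, url: str = API_URL, timeout: float = 5.0) -> Tuple[str, float]:
    """Send the text to the API and return (sentiment, confidence)."""
    with urllib.request.urlopen(build_request(text, url), timeout=timeout) as resp:
        return parse_prediction(resp.read())
```

With the server running, `predict("Romba nalla iruku bro")` should return the sentiment label and confidence from a response like the one above.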

🏗️ Architecture

Tamil/Tanglish Text
        ↓
MuRIL Tokenizer (WordPiece)
        ↓
12 Transformer Layers (Google MuRIL)
        ↓
[CLS] Vector (768-dim)
        ↓
Linear Classifier → Positive / Negative
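
The last step of the diagram can be sketched in plain Python: a linear layer maps the 768-dimensional [CLS] vector to two logits, and a softmax turns them into class scores. The weights here are random placeholders and the label order is an assumption; the trained model's head may also include extra pooling or dropout layers that this sketch leaves out.

```python
import math
import random

HIDDEN_SIZE = 768                   # [CLS] vector width from the diagram
LABELS = ["negative", "positive"]   # label order is an assumption


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(cls_vector, weights, bias):
    """Apply one linear layer (one weight row per label), then softmax."""
    logits = [
        sum(w * x for w, x in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]
    return dict(zip(LABELS, softmax(logits)))


# Random stand-ins for the encoder output and the trained classifier weights.
rng = random.Random(0)
cls_vector = [rng.gauss(0.0, 1.0) for _ in range(HIDDEN_SIZE)]
weights = [[rng.gauss(0.0, 0.02) for _ in range(HIDDEN_SIZE)] for _ in LABELS]
bias = [0.0, 0.0]

scores = classify(cls_vector, weights, bias)
print(scores)
```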

📦 Dataset

| Source | Size |
|---|---|
| tamilmixsentiment (FIRE 2020) | 15,744 sentences |
| DravidianCodeMix Zenodo | ~44,000 sentences |
| Final balanced train set | 55,064 sentences |
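
This README does not say how the combined data was balanced. As one common option, the sketch below downsamples every class to the size of the smallest one; the function name and the placeholder sentences are hypothetical, and the project's `prepare_data.py` may work differently.

```python
import random
from collections import defaultdict


def balance_by_downsampling(examples, seed=42):
    """Return a shuffled list in which every label appears equally often.

    `examples` is a list of (text, label) pairs. Larger classes are randomly
    downsampled to the size of the smallest class.
    """
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    if not by_label:
        return []

    target = min(len(items) for items in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    balanced = []
    for items in by_label.values():
        balanced.extend(rng.sample(items, target))
    rng.shuffle(balanced)
    return balanced


# Toy input: three positive examples and one negative example.
toy = [
    ("Super da machan vera level!", "positive"),
    ("Romba nalla iruku bro", "positive"),
    ("placeholder positive sentence", "positive"),
    ("placeholder negative sentence", "negative"),
]
balanced = balance_by_downsampling(toy)
print(len(balanced))
```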

🔬 MLOps Pipeline

  • Experiment tracking: MLflow (4 training runs logged with params and metrics)
  • GPU training: NVIDIA RTX 4060 (fp16 mixed precision, ~45 mins)
  • Model hosting: Hugging Face Model Hub
  • Deployment: Hugging Face Spaces (Gradio) + FastAPI
  • Containerization: Docker

📈 Training Progress

| Run | Data | F1 |
|---|---|---|
| Run 1: 3-class | 10k sentences | 67.5% |
| Run 2: Binary | 15k sentences | 80.8% |
| Run 3: Tuned | 15k sentences | 81.5% |
| Run 4: Full data | 55k sentences | 94.7% |

🗂️ Project Structure

TamilSense/
├── app/
│   ├── main.py            # FastAPI REST API
│   └── gradio_app.py      # Gradio demo UI
├── src/
│   ├── prepare_data.py    # Data loading and balancing
│   ├── train.py           # Fine-tuning with MLflow tracking
│   ├── evaluate.py        # Evaluation + confusion matrix
│   └── predict.py         # Inference pipeline
├── Dockerfile
└── requirements.txt

🔮 Roadmap

  • IndicBERT experiment on pure Tamil script dataset
  • INT8 quantization to reduce model size 4x
  • Batch prediction API endpoint
  • FIRE 2021/2022 data integration
  • Offensive language detection endpoint (v2)
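
The 4x figure for INT8 quantization follows from bytes per parameter: 32-bit floats take 4 bytes each, 8-bit integers take 1. The sketch below does that arithmetic for an assumed round figure of 200 million parameters. In practice the saving depends on which layers are actually quantized, so it can come out smaller.

```python
# Storage cost per parameter for common weight formats.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weights_size_mb(num_params, dtype):
    """Approximate weight storage in megabytes (1 MB = 1e6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


# Assumed parameter count, used only for illustration.
NUM_PARAMS = 200_000_000

fp32_mb = weights_size_mb(NUM_PARAMS, "fp32")
int8_mb = weights_size_mb(NUM_PARAMS, "int8")
print(f"fp32: {fp32_mb:.0f} MB, int8: {int8_mb:.0f} MB, ratio: {fp32_mb / int8_mb:.1f}x")
```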

📄 License

MIT License: free to use, modify, and deploy.


Built by Vishnu | HuggingFace | GitHub
