	---
	title: TamilSense
	emoji: 🌟
	colorFrom: blue
	colorTo: indigo
	sdk: gradio
	sdk_version: "6.14.0"
	python_version: "3.10"
	app_file: app/gradio_app.py
	pinned: false
	---

	# TamilSense 🌟
	> Lightweight, production-grade Tamil/Tanglish sentiment analysis — built for real-world deployment

	[![HuggingFace Space](https://img.shields.io/badge/🤗%20HuggingFace-TamilSense-blue)](https://huggingface.co/spaces/vishnuexe/TamilSense)
	[![Model](https://img.shields.io/badge/🤗%20Model-TamilSense--model-green)](https://huggingface.co/vishnuexe/TamilSense-model)
	[![License](https://img.shields.io/badge/License-MIT-yellow)](LICENSE)

	## The Problem

	ChatGPT-scale models are huge and expensive to run. Few teams can afford to integrate one into a local government portal, a small business app, or a low-bandwidth mobile app in rural Tamil Nadu.

	TamilSense fills that gap — a lightweight, fast, open-source Tamil sentiment API that any developer can plug in.

	## Performance

	| Metric | Score |
	|--------|-------|
	| Accuracy | 94.7% |
	| Weighted F1 | 94.7% |
	| Positive F1 | 97% |
	| Negative F1 | 85% |

	Trained on 55,064 balanced Tamil-English sentences from two combined datasets.
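
The weighted F1 in the table sits much closer to the positive score than to the negative one, which is what happens when positives dominate the evaluation set. The sketch below shows how that average is computed. The class counts in `support` are hypothetical, chosen only to illustrate the arithmetic; the real test-set split is not stated here.

```python
def weighted_f1(per_class_f1, support):
    """Average per-class F1 scores, weighting each class by its share of examples."""
    total = sum(support.values())
    return sum(per_class_f1[label] * support[label] / total for label in per_class_f1)


# Per-class F1 scores from the Performance table.
f1 = {"positive": 0.97, "negative": 0.85}

# Hypothetical evaluation-set class counts (an assumption, not a reported figure).
support = {"positive": 800, "negative": 200}

print(round(weighted_f1(f1, support), 3))
```

With an 80/20 split the weighted score lands near the reported figure, while a 50/50 split would give 0.91.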

	## 🚀 Live Demo

	👉 [Try it on Hugging Face Spaces](https://huggingface.co/spaces/vishnuexe/TamilSense)

	## ⚡ Quick Start

	```python
	from transformers import pipeline

	# Downloads the fine-tuned model from the Hugging Face Hub on first use.
	classifier = pipeline("text-classification", model="vishnuexe/TamilSense-model")
	result = classifier("Super da machan vera level!")
	print(result)  # e.g. [{'label': 'positive', 'score': 0.9965}]
	```

	## 🛠️ REST API

	```bash
	# Run locally
	uvicorn app.main:app --reload

	# Predict
	curl -X POST "http://localhost:8000/predict" \
	  -H "Content-Type: application/json" \
	  -d '{"text": "Romba nalla iruku bro"}'
	```

	Response:
	```json
	{
	  "text": "Romba nalla iruku bro",
	  "sentiment": "positive",
	  "confidence": 0.9966,
	  "scores": {"positive": 0.9966, "negative": 0.0034},
	  "response_time_ms": 48.23
	}
	```
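
For callers who prefer Python to curl, here is a minimal client sketch that uses only the standard library. It assumes the server is running at the local address from the curl example and that responses have the shape shown above. The helper names `build_request`, `parse_prediction`, and `predict` are illustrative and not part of the project.

```python
import json
import urllib.request

# Local uvicorn address from the curl example; change it for a deployed server.
API_URL = "http://localhost:8000/predict"


def build_request(text, url=API_URL):
    """Build a POST request carrying the JSON body that /predict expects."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_prediction(raw):
    """Extract (sentiment, confidence) from a JSON response shaped like the sample."""
    payload = json.loads(raw)
    return payload["sentiment"], payload["confidence"]


def predict(text, timeout=10.0):
    """Send text to a running server and return (sentiment, confidence)."""
    with urllib.request.urlopen(build_request(text), timeout=timeout) as resp:
        return parse_prediction(resp.read().decode("utf-8"))
```

With the API running, `predict("Romba nalla iruku bro")` should return a tuple such as `("positive", 0.9966)`.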

	## 🏗️ Architecture
	Tamil/Tanglish Text
	↓
	MuRIL Tokenizer (WordPiece)
	↓
	12 Transformer Layers (Google MuRIL)
	↓
	[CLS] Vector (768-dim)
	↓
	Linear Classifier → Positive / Negative
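
The last step of the diagram, a linear layer over the [CLS] vector followed by a softmax, can be sketched in plain Python. This is an illustration of the idea, not the model's actual code: the weights are made up, the vector has 4 dimensions instead of 768, the label order is an assumption, and the real model class may add pooling or dropout before the linear layer.

```python
import math

LABELS = ["negative", "positive"]  # label order is an assumption


def linear(vector, weights, bias):
    """Compute one logit per class: dot(weights[k], vector) + bias[k]."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Turn logits into probabilities, subtracting the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(cls_vector, weights, bias):
    """Return the most likely label and the per-label probabilities."""
    probs = softmax(linear(cls_vector, weights, bias))
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], dict(zip(LABELS, probs))


# Toy stand-in for the 768-dimensional [CLS] vector; all values are made up.
cls_vector = [0.5, -1.0, 0.25, 2.0]
weights = [[-0.2, 0.4, 0.0, -0.5], [0.3, -0.1, 0.2, 0.6]]
bias = [0.0, 0.1]

label, scores = classify(cls_vector, weights, bias)
print(label, scores)
```

The two probabilities play the role of the `scores` field in the API response, and the larger one is the `confidence`.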

	## 📦 Dataset

	| Source | Size |
	|--------|------|
	| tamilmixsentiment (FIRE 2020) | 15,744 sentences |
	| DravidianCodeMix Zenodo | ~44,000 sentences |
	| Final balanced train set | 55,064 sentences |
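
The table does not say how the combined data was balanced. One common approach, shown here purely as an assumption about what a script like `prepare_data.py` might do, is to downsample every class to the size of the smallest one. The function name and the toy rows are hypothetical.

```python
import random
from collections import Counter, defaultdict


def balance_by_downsampling(rows, seed=42):
    """Downsample every class to the size of the smallest class.

    rows is a list of (text, label) pairs.
    """
    by_label = defaultdict(list)
    for text, label in rows:
        by_label[label].append((text, label))

    target = min(len(items) for items in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible

    balanced = []
    for items in by_label.values():
        balanced.extend(rng.sample(items, target))
    rng.shuffle(balanced)
    return balanced


# Placeholder rows standing in for real sentences.
rows = [(f"positive example {i}", "positive") for i in range(8)]
rows += [(f"negative example {i}", "negative") for i in range(3)]

balanced = balance_by_downsampling(rows)
print(Counter(label for _, label in balanced))
```

Downsampling discards data from the larger class. Oversampling or class-weighted loss are alternatives when the minority class is small.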

	## 🔬 MLOps Pipeline

	- Experiment tracking: MLflow — 4 training runs logged with params and metrics
	- GPU training: NVIDIA RTX 4060 — fp16 mixed precision, ~45 mins
	- Model hosting: Hugging Face Model Hub
	- Deployment: Hugging Face Spaces (Gradio) + FastAPI
	- Containerization: Docker

	## 📈 Training Progress

	| Run | Data | F1 |
	|-----|------|----|
	| Run 1: 3-class | 10k sentences | 67.5% |
	| Run 2: Binary | 15k sentences | 80.8% |
	| Run 3: Tuned | 15k sentences | 81.5% |
	| Run 4: Full data | 55k sentences | 94.7% |

	## 🗂️ Project Structure
	TamilSense/
	├── app/
	│   ├── main.py          # FastAPI REST API
	│   └── gradio_app.py    # Gradio demo UI
	├── src/
	│   ├── prepare_data.py  # Data loading and balancing
	│   ├── train.py         # Fine-tuning with MLflow tracking
	│   ├── evaluate.py      # Evaluation + confusion matrix
	│   └── predict.py       # Inference pipeline
	├── Dockerfile
	└── requirements.txt

	## 🔮 Roadmap

	- [ ] IndicBERT experiment on pure Tamil script dataset
	- [ ] INT8 quantization — reduce model size 4x
	- [ ] Batch prediction API endpoint
	- [ ] FIRE 2021/2022 data integration
	- [ ] Offensive language detection endpoint (v2)
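
The 4x figure in the INT8 item follows from storage width: a 32-bit float takes 4 bytes per weight and an 8-bit integer takes 1. Below is a rough sketch of that arithmetic. The parameter count is a hypothetical round number, and the estimate ignores quantization scales and any layers kept in higher precision.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weights_size_mb(num_params, dtype):
    """Approximate size of the weights alone, in megabytes (10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


num_params = 100_000_000  # hypothetical parameter count, not the model's real size

for dtype in ("fp32", "int8"):
    print(dtype, weights_size_mb(num_params, dtype), "MB")
```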

	## 📄 License

	MIT License — free to use, modify, and deploy.

	---

	Built by Vishnu | [HuggingFace](https://huggingface.co/vishnuexe) | [GitHub](https://github.com/vishnu3105)