---
title: TamilSense
emoji: 🌟
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: "6.14.0"
python_version: "3.10"
app_file: app/gradio_app.py
pinned: false
---

# TamilSense 🌟

> Lightweight, production-grade Tamil/Tanglish sentiment analysis, built for real-world deployment

[![HuggingFace Space](https://img.shields.io/badge/🤗%20HuggingFace-TamilSense-blue)](https://huggingface.co/spaces/vishnuexe/TamilSense)
[![Model](https://img.shields.io/badge/🤗%20Model-TamilSense--model-green)](https://huggingface.co/vishnuexe/TamilSense-model)
[![License](https://img.shields.io/badge/License-MIT-yellow)](LICENSE)

## The Problem

General-purpose models like ChatGPT are enormous and expensive to run. Nobody is integrating one into a local government portal, a small business app, or a low-bandwidth mobile app in rural Tamil Nadu.

**TamilSense fills that gap**: a lightweight, fast, open-source Tamil sentiment API that any developer can plug in.

## Performance

| Metric | Score |
|--------|-------|
| Accuracy | **94.7%** |
| Weighted F1 | **94.7%** |
| Positive F1 | **97%** |
| Negative F1 | **85%** |

Trained on **55,064 balanced Tamil-English sentences** from two combined datasets.

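For readers unfamiliar with the metrics in the table, here is a minimal sketch of how accuracy, per-class F1, and weighted F1 relate to each other. It is plain Python written for illustration, not the project's `src/evaluate.py`; the helper names and the toy labels are assumptions and are not TamilSense evaluation data.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_f1(y_true, y_pred, label):
    """F1 for one class: harmonic mean of its precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(
        per_class_f1(y_true, y_pred, label) * count / total
        for label, count in support.items()
    )


# Toy labels, made up for illustration only.
y_true = ["positive"] * 8 + ["negative"] * 2
y_pred = ["positive"] * 7 + ["negative"] + ["negative", "positive"]

print(f"accuracy    : {accuracy(y_true, y_pred):.3f}")                  # 0.800
print(f"positive F1 : {per_class_f1(y_true, y_pred, 'positive'):.3f}")  # 0.875
print(f"negative F1 : {per_class_f1(y_true, y_pred, 'negative'):.3f}")  # 0.500
print(f"weighted F1 : {weighted_f1(y_true, y_pred):.3f}")               # 0.800
```

Because weighted F1 weights each class by its share of the evaluation labels, it sits closer to the F1 of the more frequent class, which is why the per-class scores are worth reporting alongside it.
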
## 🚀 Live Demo

👉 [**Try it on Hugging Face Spaces**](https://huggingface.co/spaces/vishnuexe/TamilSense)

## ⚡ Quick Start

```python
from transformers import pipeline

# Load the fine-tuned model from the Hugging Face Hub
classifier = pipeline("text-classification", model="vishnuexe/TamilSense-model")

result = classifier("Super da machan vera level!")
print(result)
# [{'label': 'positive', 'score': 0.9965}]
```

## 🛠️ REST API

```bash
# Run locally
uvicorn app.main:app --reload

# Predict
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"text": "Romba nalla iruku bro"}'
```

Response:

```json
{
  "text": "Romba nalla iruku bro",
  "sentiment": "positive",
  "confidence": 0.9966,
  "scores": {"positive": 0.9966, "negative": 0.0034},
  "response_time_ms": 48.23
}
```

## 🏗️ Architecture

    Tamil/Tanglish Text
            ↓
    MuRIL Tokenizer (WordPiece)
            ↓
    12 Transformer Layers (Google MuRIL)
            ↓
    [CLS] Vector (768-dim)
            ↓
    Linear Classifier → Positive / Negative

## 📦 Dataset

| Source | Size |
|--------|------|
| tamilmixsentiment (FIRE 2020) | 15,744 sentences |
| DravidianCodeMix Zenodo | ~44,000 sentences |
| **Final balanced train set** | **55,064 sentences** |

## 🔬 MLOps Pipeline

- **Experiment tracking**: MLflow, with 4 training runs logged with params and metrics
- **GPU training**: NVIDIA RTX 4060, fp16 mixed precision, about 45 minutes
- **Model hosting**: Hugging Face Model Hub
- **Deployment**: Hugging Face Spaces (Gradio) + FastAPI
- **Containerization**: Docker

## 📈 Training Progress

| Run | Data | F1 |
|-----|------|----|
| Run 1: 3-class | 10k sentences | 67.5% |
| Run 2: Binary | 15k sentences | 80.8% |
| Run 3: Tuned | 15k sentences | 81.5% |
| **Run 4: Full data** | **55k sentences** | **94.7%** |

## 🗂️ Project Structure

    TamilSense/
    ├── app/
    │   ├── main.py            # FastAPI REST API
    │   └── gradio_app.py      # Gradio demo UI
    ├── src/
    │   ├── prepare_data.py    # Data loading and balancing
    │   ├── train.py           # Fine-tuning with MLflow tracking
    │   ├── evaluate.py        # Evaluation + confusion matrix
    │   └── predict.py         # Inference pipeline
    ├── Dockerfile
    └── requirements.txt

## 🔮 Roadmap

- [ ] IndicBERT experiment on pure Tamil script dataset
- [ ] INT8 quantization: reduce model size 4x
- [ ] Batch prediction API endpoint
- [ ] FIRE 2021/2022 data integration
- [ ] Offensive language detection endpoint (v2)

## 📄 License

MIT License: free to use, modify, and deploy.

---

**Built by Vishnu** | [HuggingFace](https://huggingface.co/vishnuexe) | [GitHub](https://github.com/vishnu3105)