| --- |
| title: TamilSense |
| emoji: ๐ |
| colorFrom: blue |
| colorTo: indigo |
| sdk: gradio |
| sdk_version: "6.14.0" |
| python_version: "3.10" |
| app_file: app/gradio_app.py |
| pinned: false |
| --- |
| |
| # TamilSense ๐ |
| > Lightweight, production-grade Tamil/Tanglish sentiment analysis โ built for real-world deployment |
|
|
| [](https://huggingface.co/spaces/vishnuexe/TamilSense) |
| [](https://huggingface.co/vishnuexe/TamilSense-model) |
| [](LICENSE) |
|
|
| ## The Problem |
|
|
| ChatGPT is a trillion-parameter monster that costs millions to run. Nobody is integrating it into a local government portal, a small business app, or a low-bandwidth mobile app in rural Tamil Nadu. |
|
|
| **TamilSense fills that gap** โ a lightweight, fast, open-source Tamil sentiment API that any developer can plug in. |
|
|
| ## Performance |
|
|
| | Metric | Score | |
| |--------|-------| |
| | Accuracy | **94.7%** | |
| | Weighted F1 | **94.7%** | |
| | Positive F1 | **97%** | |
| | Negative F1 | **85%** | |
|
|
| Trained on **55,064 balanced Tamil-English sentences** from two combined datasets. |
|
|
| ## ๐ Live Demo |
|
|
| ๐ [**Try it on Hugging Face Spaces**](https://huggingface.co/spaces/vishnuexe/TamilSense) |
|
|
| ## โก Quick Start |
|
|
| ```python |
| from transformers import pipeline |
| |
| classifier = pipeline("text-classification", model="vishnuexe/TamilSense-model") |
| result = classifier("Super da machan vera level!") |
| print(result) # [{'label': 'positive', 'score': 0.9965}] |
| ``` |
|
|
| ## ๐ ๏ธ REST API |
|
|
| ```bash |
| # Run locally |
| uvicorn app.main:app --reload |
| |
| # Predict |
| curl -X POST "http://localhost:8000/predict" \ |
| -H "Content-Type: application/json" \ |
| -d '{"text": "Romba nalla iruku bro"}' |
| ``` |
|
|
| Response: |
| ```json |
| { |
| "text": "Romba nalla iruku bro", |
| "sentiment": "positive", |
| "confidence": 0.9966, |
| "scores": {"positive": 0.9966, "negative": 0.0034}, |
| "response_time_ms": 48.23 |
| } |
| ``` |
|
|
| ## ๐๏ธ Architecture |
| Tamil/Tanglish Text |
| โ |
| MuRIL Tokenizer (WordPiece) |
| โ |
| 12 Transformer Layers (Google MuRIL) |
| โ |
| [CLS] Vector (768-dim) |
| โ |
| Linear Classifier โ Positive / Negative |
|
|
| ## ๐ฆ Dataset |
|
|
| | Source | Size | |
| |--------|------| |
| | tamilmixsentiment (FIRE 2020) | 15,744 sentences | |
| | DravidianCodeMix Zenodo | ~44,000 sentences | |
| | **Final balanced train set** | **55,064 sentences** | |
|
|
| ## ๐ฌ MLOps Pipeline |
|
|
| - **Experiment tracking**: MLflow โ 4 training runs logged with params and metrics |
| - **GPU training**: NVIDIA RTX 4060 โ fp16 mixed precision, ~45 mins |
| - **Model hosting**: Hugging Face Model Hub |
| - **Deployment**: Hugging Face Spaces (Gradio) + FastAPI |
| - **Containerization**: Docker |
|
|
| ## ๐ Training Progress |
|
|
| | Run | Data | F1 | |
| |-----|------|----| |
| | Run 1 โ 3-class | 10k sentences | 67.5% | |
| | Run 2 โ Binary | 15k sentences | 80.8% | |
| | Run 3 โ Tuned | 15k sentences | 81.5% | |
| | **Run 4 โ Full data** | **55k sentences** | **94.7%** | |
|
|
| ## ๐๏ธ Project Structure |
| TamilSense/ |
| โโโ app/ |
| โ โโโ main.py # FastAPI REST API |
| โ โโโ gradio_app.py # Gradio demo UI |
| โโโ src/ |
| โ โโโ prepare_data.py # Data loading and balancing |
| โ โโโ train.py # Fine-tuning with MLflow tracking |
| โ โโโ evaluate.py # Evaluation + confusion matrix |
| โ โโโ predict.py # Inference pipeline |
| โโโ Dockerfile |
| โโโ requirements.txt |
|
|
| ## ๐ฎ Roadmap |
|
|
| - [ ] IndicBERT experiment on pure Tamil script dataset |
| - [ ] INT8 quantization โ reduce model size 4x |
| - [ ] Batch prediction API endpoint |
| - [ ] FIRE 2021/2022 data integration |
| - [ ] Offensive language detection endpoint (v2) |
|
|
| ## ๐ License |
|
|
| MIT License โ free to use, modify, and deploy. |
|
|
| --- |
|
|
| **Built by Vishnu** | [HuggingFace](https://huggingface.co/vishnuexe) | [GitHub](https://github.com/vishnu3105) |