nice-bill committed on
Commit 4e87e6b · 1 Parent(s): 48cf750

updated README with performance benchmarks
README.md CHANGED
@@ -2,77 +2,104 @@
 
  **A production-ready audio emotion classification system built for content moderation.**
 
- VigilAudio is the first phase of a multimodal moderation suite designed to detect distress, aggression, and safety risks in user-generated content. Unlike traditional moderators that look for keywords, VigilAudio listens to the *tone* of the voice, detecting anger, fear, or distress even when the words themselves are neutral.
 
- ## Key Features
 
- * **State-of-the-Art Architecture:** Fine-tuned `facebook/wav2vec2-base-960h` Transformer model.
- * **High Accuracy:** Achieved **82% accuracy** on a 7-class emotion dataset (Angry, Happy, Sad, Fearful, Disgusted, Neutral, Surprised).
- * **Production Pipeline:** End-to-end data harmonization, stratified splitting, and efficient feature extraction.
- * **Cloud-Native Training:** Optimized training scripts for Google Colab (T4 GPU), reducing training time from 50+ hours to under 20 minutes.
 
- ## Technology Stack
 
- * **Language:** Python 3.10+
- * **Environment:** `uv` (for fast dependency management)
- * **ML Framework:** PyTorch, Hugging Face Transformers, Accelerate
- * **Audio Processing:** Librosa, Soundfile
- * **Data Ops:** Pandas, Scikit-learn
 
- ## Installation
 
- 1. **Clone the repository:**
-    ```bash
-    git clone https://github.com/yourusername/vigilaudio.git
-    cd vigilaudio
-    ```
 
- 2. **Initialize the environment:**
-    We use `uv` for lightning-fast setups.
-    ```bash
-    uv sync
-    ```
 
- ## Execution Guide
 
- ### 1. Data Pipeline (Harmonization)
- Turn raw, messy folders into a clean, stratified dataset.
 ```bash
- uv run src/data/harmonize.py
 ```
- * **Input:** Raw audio folders (`Emotions/Angry`, `Emotions/Happy`, ...)
- * **Output:** `data/processed/metadata.csv` (unified labels + 80/10/10 splits)
 
- ### 2. Feature Extraction (Local Test)
- Verify that your machine can process audio using the Wav2Vec2 processor.
 ```bash
- uv run src/features/extractor.py
 ```
- * **Output:** Prints the embedding shape `(768,)` for a sample file.
 
- ### 3. Model Training (The "Professional" Way)
- Training a Transformer on a CPU is too slow, so we use Google Colab.
 
- 1. Upload `train_colab.py` and your `Emotions` folder to Google Drive.
- 2. Open `VigilAudio_Fine_Tuning.ipynb` in Colab.
- 3. Set the runtime to **T4 GPU**.
- 4. Run the training script.
- * **Result:** A fine-tuned model saved to `wav2vec2-finetuned/`.
- * **Performance:** ~82% accuracy / 0.81 F1 score.
 
- ## Dataset
 
- The model was trained on a combined dataset of **12,798 audio recordings** across 7 emotions.
- * **Source:** [Kaggle - Audio Emotions Dataset](https://www.kaggle.com/datasets/uldisvalainis/audio-emotions)
- * **Composition:** An amalgam of the CREMA-D, TESS, RAVDESS, and SAVEE datasets.
 
- ## Results Summary
 
- | Model | Architecture | Training Time | Accuracy |
- |-------|--------------|---------------|----------|
- | Baseline | Simple MLP (CPU) | ~3 hours | 54% |
- | **VigilAudio** | **Fine-Tuned Wav2Vec2 (GPU)** | **17 mins** | **82%** |
 
 ## License
-
- MIT
 
 
  **A production-ready audio emotion classification system built for content moderation.**
 
+ VigilAudio is an advanced audio analysis engine designed to detect aggression, distress, and safety risks by analyzing the *tone* of voice. It is the audio foundation of a multimodal moderation suite, built on fine-tuned Transformers and optimized for high-speed CPU inference.
 
+ ![Dashboard](docs/screenshot_placeholder.png)
 
+ ## Dataset & Results
 
+ * **Source:** [Kaggle - Audio Emotions Dataset](https://www.kaggle.com/datasets/uldisvalainis/audio-emotions) (12,798 recordings).
+ * **Architecture:** Fine-tuned `Wav2Vec2` Transformer.
+ * **Accuracy:** **82%** (PyTorch) / **84%** (Optimized INT8 ONNX).
+ * **Optimization:** 1.85x speedup and 67% size reduction via INT8 quantization.
 
+ ---
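The INT8 numbers above come from storing weights as 8-bit integers plus a scale factor instead of 32-bit floats. A toy NumPy sketch of symmetric per-tensor quantization (an illustration of the idea only, not the ONNX Runtime pipeline this project actually uses):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor INT8 quantization: w is approximated by scale * q."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float32 tensor from the INT8 codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(768, 768)).astype(np.float32)  # one Wav2Vec2-sized weight matrix
q, scale = quantize_int8(w)

print(q.nbytes / w.nbytes)  # 0.25 -> weights take 4x less memory
```

The per-element rounding error is bounded by one quantization step (`scale`), which is why accuracy typically stays close to the float model.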
 
 
 
 
 
+ ## Prerequisites
 
+ * **Python 3.10+**
+ * **uv:** [Install uv](https://docs.astral.sh/uv/getting-started/installation/) (recommended for environment management).
+ * **FFmpeg:** Required for audio processing.
+   * *Windows:* `winget install ffmpeg`
+   * *Linux:* `sudo apt install ffmpeg`
 
+ ---
 
+ ## How to Run (Quick Start)
 
+ ### 1. Set Up the Environment
 ```bash
+ git clone https://github.com/yourusername/vigilaudio.git
+ cd vigilaudio
+ uv sync
 ```
 
 
 
+ ### 2. Download Model Weights
+ Because the model weights are large, they are not stored in Git.
+ 1. Download `wav2vec2_model.zip` from [Your Link/Releases].
+ 2. Extract it to `models/onnx_quantized/`.
+
+ ### 3. Launch the Application
+ Run the standalone demo (recommended for local testing):
 ```bash
+ uv run streamlit run src/ui/app_standalone.py
 ```
+ * **Access:** `http://localhost:8501`
+
+ ---
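Under the hood, the demo turns the model's seven output logits into a label and a confidence score with a softmax. A minimal sketch of that post-processing step (the label order here is an assumption; the real app reads it from the model's `id2label` config):

```python
import numpy as np

# Assumed label order; in the real app this comes from the model config's id2label.
EMOTIONS = ["Angry", "Disgusted", "Fearful", "Happy", "Neutral", "Sad", "Surprised"]

def postprocess(logits: np.ndarray) -> tuple[str, float]:
    """Convert raw logits into (label, confidence) via a numerically stable softmax."""
    z = logits - logits.max()          # shift for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    i = int(probs.argmax())
    return EMOTIONS[i], float(probs[i])

label, conf = postprocess(np.array([0.1, -1.2, 3.4, 0.0, 0.2, -0.5, 1.1]))
print(label)  # the largest logit wins -> "Fearful"
```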
 
+ ## Development Workflow
+
+ If you want to retrain or modify the system:
+
+ ### 1. Data Preparation
+ 1. Download the [Kaggle Dataset](https://www.kaggle.com/datasets/uldisvalainis/audio-emotions).
+ 2. Place the emotion folders (Angry, Happy, etc.) in `data/raw/Emotions/`.
+ 3. Run harmonization:
+ ```bash
+ uv run src/data/harmonize.py
+ ```
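The 80/10/10 stratified split that harmonization produces can be sketched with scikit-learn's two-stage `train_test_split` (a minimal sketch over toy rows; the real script also scans the audio folders and normalizes labels, which is omitted here):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the scanned audio files: (path, emotion) rows.
df = pd.DataFrame({
    "path": [f"clip_{i}.wav" for i in range(1000)],
    "emotion": (["angry", "happy", "sad", "fearful", "disgusted", "neutral", "surprised"] * 143)[:1000],
})

# 80% train, then split the remaining 20% in half -> 10% val / 10% test,
# stratifying on the label so every split keeps the class balance.
train, rest = train_test_split(df, test_size=0.2, stratify=df["emotion"], random_state=42)
val, test = train_test_split(rest, test_size=0.5, stratify=rest["emotion"], random_state=42)

df.loc[train.index, "split"] = "train"
df.loc[val.index, "split"] = "val"
df.loc[test.index, "split"] = "test"
print(len(train), len(val), len(test))  # 800 100 100
```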
 
+ ### 2. Model Training (Cloud Accelerated)
+ We use Google Colab (T4 GPU) for high-speed fine-tuning.
+ * The training notebook is in `docs/VigilAudio_Fine_Tuning.ipynb`.
 
+ ### 3. Optimization & Benchmarking
+ Convert the model to ONNX and verify performance:
+ ```bash
+ uv run src/models/optimize.py
+ uv run src/models/benchmark.py
+ ```
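The latency figures reported by the benchmark step boil down to a warmup-then-measure wall-clock loop. A generic sketch of such a harness (the function name and defaults are illustrative; the real script times the PyTorch and ONNX sessions on actual audio):

```python
import statistics
import time

def benchmark(fn, *, warmup: int = 3, runs: int = 20) -> float:
    """Return the mean latency of fn() in milliseconds, after warmup calls."""
    for _ in range(warmup):  # warm caches and lazy initialization
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples)

# Toy workload standing in for a model's forward pass.
ms = benchmark(lambda: sum(i * i for i in range(10_000)))
print(f"{ms:.2f} ms")
```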
 
+ ---
+
+ ## Project Structure
+
+ ```text
+ vigilaudio/
+ ├── data/                    # Dataset storage
+ │   ├── raw/                 # Original audio files (excluded from Git)
+ │   └── processed/           # Metadata and splits
+ ├── models/                  # Model registry
+ │   ├── wav2vec2-finetuned/  # PyTorch weights
+ │   └── onnx_quantized/      # Optimized INT8 engine
+ ├── src/
+ │   ├── api/                 # FastAPI backend service
+ │   ├── data/                # ETL and harmonization scripts
+ │   ├── features/            # Audio feature extraction
+ │   ├── models/              # Training, inference, and optimization logic
+ │   └── ui/                  # Streamlit frontend dashboards
+ ├── docs/                    # Benchmarks, logs, and Colab notebooks
+ └── notebooks/               # Experimental EDA
+ ```
 
+ ## Performance Optimization (ONNX)
 
+ | Model Version | Accuracy | Latency (ms) | Speedup | Size (MB) |
+ |---------------|----------|--------------|---------|-----------|
+ | PyTorch (Full) | 82.0% | 370 | 1.00x | 361 |
+ | ONNX (Standard) | 82.0% | 306 | 1.21x | 361 |
+ | **ONNX (INT8)** | **84.0%** | **199** | **1.85x** | **116** |
 
 ## License
+ MIT
 
docs/VigilAudio_Fine_Tuning.ipynb ADDED
@@ -0,0 +1,199 @@
+ {
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "1342e84e",
+ "metadata": {
+ "vscode": {
+ "languageId": "plaintext"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "import torch\n",
+ "import pandas as pd\n",
+ "import librosa\n",
+ "import numpy as np\n",
+ "from torch.utils.data import Dataset\n",
+ "from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification, Trainer, TrainingArguments\n",
+ "from sklearn.metrics import accuracy_score, f1_score\n",
+ "import shutil"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ef12ccf6",
+ "metadata": {
+ "vscode": {
+ "languageId": "plaintext"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "DRIVE_PROJECT_ROOT = \"/content/drive/MyDrive/Colab_VigilAudio\"\n",
+ "METADATA_PATH = os.path.join(DRIVE_PROJECT_ROOT, \"metadata.csv\")\n",
+ "MODEL_NAME = \"facebook/wav2vec2-base-960h\"\n",
+ "\n",
+ "LOCAL_DATA_PATH = \"/content/Emotions\"\n",
+ "OUTPUT_DIR = os.path.join(DRIVE_PROJECT_ROOT, \"wav2vec2-finetuned\")\n",
+ "LOCAL_OUTPUT = \"/content/wav2vec2-finetuned\"\n",
+ "\n",
+ "os.environ[\"WANDB_DISABLED\"] = \"true\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8459c22f",
+ "metadata": {
+ "vscode": {
+ "languageId": "plaintext"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "def setup_data():\n",
+ "    if not os.path.exists(LOCAL_DATA_PATH):\n",
+ "        print(\"Copying data to local disk (this takes ~3 mins)...\")\n",
+ "        drive_data = os.path.join(DRIVE_PROJECT_ROOT, \"Emotions\")\n",
+ "        if os.path.exists(drive_data):\n",
+ "            shutil.copytree(drive_data, LOCAL_DATA_PATH)\n",
+ "            print(\"Data copy complete.\")\n",
+ "        else:\n",
+ "            print(\"Drive data not found. Assuming data is already in /content/Emotions\")\n",
+ "    else:\n",
+ "        print(\"Data already exists on local disk.\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "14d6662d",
+ "metadata": {
+ "vscode": {
+ "languageId": "plaintext"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "class AudioDataset(Dataset):\n",
+ "    def __init__(self, dataframe, audio_root, feature_extractor, label_map):\n",
+ "        self.df = dataframe.reset_index(drop=True)\n",
+ "        self.audio_root = audio_root\n",
+ "        self.feature_extractor = feature_extractor\n",
+ "        self.label_map = label_map\n",
+ "\n",
+ "    def __len__(self):\n",
+ "        return len(self.df)\n",
+ "\n",
+ "    def __getitem__(self, idx):\n",
+ "        row = self.df.iloc[idx]\n",
+ "        filename = os.path.basename(row['path'].replace('\\\\', '/'))\n",
+ "        folder = row['emotion'].capitalize()\n",
+ "        if folder == 'Surprised': folder = 'Suprised'  # the source dataset's folder name is misspelled\n",
+ "        audio_path = os.path.join(self.audio_root, folder, filename)\n",
+ "\n",
+ "        try:\n",
+ "            speech, _ = librosa.load(audio_path, sr=16000)\n",
+ "            inputs = self.feature_extractor(speech, sampling_rate=16000, padding=True, return_tensors=\"pt\")\n",
+ "            return {\n",
+ "                \"input_values\": inputs.input_values.squeeze(0),\n",
+ "                \"labels\": torch.tensor(self.label_map[row['emotion']], dtype=torch.long),\n",
+ "            }\n",
+ "        except Exception:\n",
+ "            # Corrupt or missing file: fall back to the next sample.\n",
+ "            return self.__getitem__((idx + 1) % len(self))\n",
+ "\n",
+ "def compute_metrics(p):\n",
+ "    preds = np.argmax(p.predictions, axis=1)\n",
+ "    return {\n",
+ "        'accuracy': accuracy_score(p.label_ids, preds),\n",
+ "        'f1': f1_score(p.label_ids, preds, average='weighted'),\n",
+ "    }"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "86dabb49",
+ "metadata": {
+ "vscode": {
+ "languageId": "plaintext"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "setup_data()\n",
+ "\n",
+ "if not os.path.exists(METADATA_PATH):\n",
+ "    # Top-level code cannot `return`; abort the run instead.\n",
+ "    raise FileNotFoundError(f\"Metadata not found at {METADATA_PATH}\")\n",
+ "\n",
+ "df = pd.read_csv(METADATA_PATH)\n",
+ "emotions = sorted(df['emotion'].unique())\n",
+ "label_map = {name: i for i, name in enumerate(emotions)}\n",
+ "id2label = {i: name for name, i in label_map.items()}\n",
+ "\n",
+ "feature_extractor = AutoFeatureExtractor.from_pretrained(MODEL_NAME)\n",
+ "train_ds = AudioDataset(df[df['split']=='train'], LOCAL_DATA_PATH, feature_extractor, label_map)\n",
+ "val_ds = AudioDataset(df[df['split']=='val'], LOCAL_DATA_PATH, feature_extractor, label_map)\n",
+ "\n",
+ "model = Wav2Vec2ForSequenceClassification.from_pretrained(\n",
+ "    MODEL_NAME,\n",
+ "    num_labels=len(emotions),\n",
+ "    id2label=id2label,\n",
+ "    label2id=label_map,\n",
+ "    ignore_mismatched_sizes=True\n",
+ ")\n",
+ "model.freeze_feature_encoder()\n",
+ "\n",
+ "training_args = TrainingArguments(\n",
+ "    output_dir=\"/content/checkpoints\",\n",
+ "    eval_strategy=\"epoch\",\n",
+ "    save_strategy=\"epoch\",\n",
+ "    per_device_train_batch_size=8,\n",
+ "    gradient_accumulation_steps=2,\n",
+ "    num_train_epochs=5,\n",
+ "    learning_rate=3e-5,\n",
+ "    warmup_steps=500,\n",
+ "    load_best_model_at_end=True,\n",
+ "    metric_for_best_model=\"accuracy\",\n",
+ "    fp16=True,\n",
+ "    report_to=\"none\"\n",
+ ")\n",
+ "\n",
+ "trainer = Trainer(\n",
+ "    model=model,\n",
+ "    args=training_args,\n",
+ "    train_dataset=train_ds,\n",
+ "    eval_dataset=val_ds,\n",
+ "    tokenizer=feature_extractor,\n",
+ "    compute_metrics=compute_metrics\n",
+ ")\n",
+ "\n",
+ "print(\"Starting training...\")\n",
+ "trainer.train()\n",
+ "\n",
+ "print(\"Saving final model locally...\")\n",
+ "trainer.save_model(LOCAL_OUTPUT)\n",
+ "\n",
+ "print(\"Zipping for download...\")\n",
+ "shutil.make_archive(\"/content/wav2vec2_model\", 'zip', LOCAL_OUTPUT)\n",
+ "print(\"DONE! Please download /content/wav2vec2_model.zip\")"
+ ]
+ }
+ ],
+ "metadata": {
+ "language_info": {
+ "name": "python"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+ }
docs/benchmark_report.csv ADDED
@@ -0,0 +1,4 @@
+ Model,Accuracy,Latency (Avg ms),Speedup,Size (MB)
+ PyTorch (Full),82.00%,369.98ms,1.00x,360.8MB
+ ONNX (Standard),82.00%,306.52ms,1.21x,361.0MB
+ ONNX (INT8 Quantized),84.00%,199.46ms,1.85x,116.5MB