nice-bill committed on
Commit 4e87e6b · 1 Parent(s): 48cf750

updated README with performance benchmarks
README.md CHANGED
@@ -2,77 +2,104 @@
 
  **A production-ready audio emotion classification system built for content moderation.**
 
- VigilAudio is the first phase of a multimodal moderation suite designed to detect distress, aggression, and safety risks in user-generated content. Unlike traditional moderators that look for keywords, VigilAudio listens to the *tone* of the voice, detecting anger, fear, or distress even when the words themselves are neutral.
 
- ## Key Features
 
- * **State-of-the-Art Architecture:** Fine-tuned `facebook/wav2vec2-base-960h` Transformer model.
- * **High Accuracy:** Achieved **82% accuracy** on a 7-class emotion dataset (Angry, Happy, Sad, Fearful, Disgusted, Neutral, Surprised).
- * **Production Pipeline:** End-to-end data harmonization, stratified splitting, and efficient feature extraction.
- * **Cloud-Native Training:** Optimized training scripts for Google Colab (T4 GPU), reducing training time from 50+ hours to under 20 minutes.
 
- ## Technology Stack
 
- * **Language:** Python 3.10+
- * **Environment:** `uv` (for fast dependency management)
- * **ML Framework:** PyTorch, Hugging Face Transformers, Accelerate
- * **Audio Processing:** Librosa, Soundfile
- * **Data Ops:** Pandas, Scikit-learn
 
- ## Installation
 
- 1. **Clone the repository:**
-    ```bash
-    git clone https://github.com/yourusername/vigilaudio.git
-    cd vigilaudio
-    ```
 
- 2. **Initialize the environment:**
-    We use `uv` for lightning-fast setups.
-    ```bash
-    uv sync
-    ```
 
- ## Execution Guide
 
- ### 1. Data Pipeline (Harmonization)
- Turn raw, messy folders into a clean, stratified dataset.
 ```bash
- uv run src/data/harmonize.py
 ```
- * **Input:** Raw audio folders (`Emotions/Angry`, `Emotions/Happy`, ...)
- * **Output:** `data/processed/metadata.csv` (unified labels + 80/10/10 splits)
 
- ### 2. Feature Extraction (Local Test)
- Verify that your machine can process audio using the Wav2Vec2 processor.
 ```bash
- uv run src/features/extractor.py
 ```
- * **Output:** Prints the embedding shape `(768,)` for a sample file.
 
- ### 3. Model Training (The "Professional" Way)
- Training a Transformer on a CPU is too slow, so we use Google Colab.
 
- 1. Upload `train_colab.py` and your `Emotions` folder to Google Drive.
- 2. Open `VigilAudio_Fine_Tuning.ipynb` in Colab.
- 3. Set the runtime to **T4 GPU**.
- 4. Run the training script.
- * **Result:** A fine-tuned model saved to `wav2vec2-finetuned/`.
- * **Performance:** ~82% accuracy / 0.81 F1 score.
 
- ## Dataset
 
- The model was trained on a combined dataset of **12,798 audio recordings** across 7 emotions.
- * **Source:** [Kaggle - Audio Emotions Dataset](https://www.kaggle.com/datasets/uldisvalainis/audio-emotions)
- * **Composition:** An amalgam of the CREMA-D, TESS, RAVDESS, and SAVEE datasets.
 
- ## Results Summary
 
- | Model | Architecture | Training Time | Accuracy |
- |-------|--------------|---------------|----------|
- | Baseline | Simple MLP (CPU) | ~3 hours | 54% |
- | **VigilAudio** | **Fine-Tuned Wav2Vec2 (GPU)** | **17 mins** | **82%** |
 
 ## License
-
- MIT
 
 
  **A production-ready audio emotion classification system built for content moderation.**
 
+ VigilAudio is an advanced audio analysis engine designed to detect aggression, distress, and safety risks by analyzing the *tone* of voice. It is the audio foundation of a multimodal moderation suite, built on fine-tuned Transformers and optimized for high-speed CPU inference.
 
+ ![Dashboard](docs/screenshot_placeholder.png)
 
+ ## Dataset & Results
 
+ * **Source:** [Kaggle - Audio Emotions Dataset](https://www.kaggle.com/datasets/uldisvalainis/audio-emotions) (12,798 recordings).
+ * **Architecture:** Fine-tuned `Wav2Vec2` Transformer.
+ * **Accuracy:** **82%** (PyTorch) / **84%** (Optimized INT8 ONNX).
+ * **Optimization:** 1.85x speedup and 67% size reduction via INT8 quantization.
 
+ ---
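The INT8 numbers above come from storing weights as 8-bit integers plus a scale factor instead of 32-bit floats. A toy NumPy sketch of symmetric per-tensor quantization (an illustration of the idea only, not the ONNX Runtime pipeline this project actually uses):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor INT8 quantization: w is approximated by scale * q."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float32 tensor from the INT8 codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(768, 768)).astype(np.float32)  # one Wav2Vec2-sized weight matrix
q, scale = quantize_int8(w)

print(q.nbytes / w.nbytes)  # 0.25 -> weights take 4x less memory
```

The per-element rounding error is bounded by one quantization step (`scale`), which is why accuracy typically stays close to the float model.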
 
 
 
 
 
+ ## Prerequisites
 
+ * **Python 3.10+**
+ * **uv:** [Install uv](https://docs.astral.sh/uv/getting-started/installation/) (recommended for environment management).
+ * **FFmpeg:** Required for audio processing.
+   * *Windows:* `winget install ffmpeg`
+   * *Linux:* `sudo apt install ffmpeg`
 
+ ---
 
+ ## How to Run (Quick Start)
 
+ ### 1. Set Up the Environment
 ```bash
+ git clone https://github.com/yourusername/vigilaudio.git
+ cd vigilaudio
+ uv sync
 ```
 
 
 
+ ### 2. Download Model Weights
+ Because the model weights are large, they are not stored in Git.
+ 1. Download `wav2vec2_model.zip` from [Your Link/Releases].
+ 2. Extract it to `models/onnx_quantized/`.
+
+ ### 3. Launch the Application
+ Run the standalone demo (recommended for local testing):
 ```bash
+ uv run streamlit run src/ui/app_standalone.py
 ```
+ * **Access:** `http://localhost:8501`
+
+ ---
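Under the hood, the demo turns the model's seven output logits into a label and a confidence score with a softmax. A minimal sketch of that post-processing step (the label order here is an assumption; the real app reads it from the model's `id2label` config):

```python
import numpy as np

# Assumed label order; in the real app this comes from the model config's id2label.
EMOTIONS = ["Angry", "Disgusted", "Fearful", "Happy", "Neutral", "Sad", "Surprised"]

def postprocess(logits: np.ndarray) -> tuple[str, float]:
    """Convert raw logits into (label, confidence) via a numerically stable softmax."""
    z = logits - logits.max()          # shift for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    i = int(probs.argmax())
    return EMOTIONS[i], float(probs[i])

label, conf = postprocess(np.array([0.1, -1.2, 3.4, 0.0, 0.2, -0.5, 1.1]))
print(label)  # the largest logit wins -> "Fearful"
```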
 
+ ## Development Workflow
+
+ If you want to retrain or modify the system:
+
+ ### 1. Data Preparation
+ 1. Download the [Kaggle Dataset](https://www.kaggle.com/datasets/uldisvalainis/audio-emotions).
+ 2. Place the emotion folders (Angry, Happy, etc.) in `data/raw/Emotions/`.
+ 3. Run harmonization:
+ ```bash
+ uv run src/data/harmonize.py
+ ```
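The 80/10/10 stratified split that harmonization produces can be sketched with scikit-learn's two-stage `train_test_split` (a minimal sketch over toy rows; the real script also scans the audio folders and normalizes labels, which is omitted here):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the scanned audio files: (path, emotion) rows.
df = pd.DataFrame({
    "path": [f"clip_{i}.wav" for i in range(1000)],
    "emotion": (["angry", "happy", "sad", "fearful", "disgusted", "neutral", "surprised"] * 143)[:1000],
})

# 80% train, then split the remaining 20% in half -> 10% val / 10% test,
# stratifying on the label so every split keeps the class balance.
train, rest = train_test_split(df, test_size=0.2, stratify=df["emotion"], random_state=42)
val, test = train_test_split(rest, test_size=0.5, stratify=rest["emotion"], random_state=42)

df.loc[train.index, "split"] = "train"
df.loc[val.index, "split"] = "val"
df.loc[test.index, "split"] = "test"
print(len(train), len(val), len(test))  # 800 100 100
```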
 
+ ### 2. Model Training (Cloud Accelerated)
+ We use Google Colab (T4 GPU) for high-speed fine-tuning.
+ * The training notebook is in `docs/VigilAudio_Fine_Tuning.ipynb`.
 
+ ### 3. Optimization & Benchmarking
+ Convert the model to ONNX and verify performance:
+ ```bash
+ uv run src/models/optimize.py
+ uv run src/models/benchmark.py
+ ```
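The latency figures reported by the benchmark step boil down to a warmup-then-measure wall-clock loop. A generic sketch of such a harness (the function name and defaults are illustrative; the real script times the PyTorch and ONNX sessions on actual audio):

```python
import statistics
import time

def benchmark(fn, *, warmup: int = 3, runs: int = 20) -> float:
    """Return the mean latency of fn() in milliseconds, after warmup calls."""
    for _ in range(warmup):  # warm caches and lazy initialization
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples)

# Toy workload standing in for a model's forward pass.
ms = benchmark(lambda: sum(i * i for i in range(10_000)))
print(f"{ms:.2f} ms")
```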
 
+ ---
+
+ ## Project Structure
+
+ ```text
+ vigilaudio/
+ ├── data/                    # Dataset storage
+ │   ├── raw/                 # Original audio files (excluded from Git)
+ │   └── processed/           # Metadata and splits
+ ├── models/                  # Model registry
+ │   ├── wav2vec2-finetuned/  # PyTorch weights
+ │   └── onnx_quantized/      # Optimized INT8 engine
+ ├── src/
+ │   ├── api/                 # FastAPI backend service
+ │   ├── data/                # ETL and harmonization scripts
+ │   ├── features/            # Audio feature extraction
+ │   ├── models/              # Training, inference, and optimization logic
+ │   └── ui/                  # Streamlit frontend dashboards
+ ├── docs/                    # Benchmarks, logs, and Colab notebooks
+ └── notebooks/               # Experimental EDA
+ ```
 
+ ## Performance Optimization (ONNX)
 
+ | Model Version | Accuracy | Latency (ms) | Speedup | Size (MB) |
+ |---------------|----------|--------------|---------|-----------|
+ | PyTorch (Full) | 82.0% | 370 | 1.00x | 361 |
+ | ONNX (Standard) | 82.0% | 306 | 1.21x | 361 |
+ | **ONNX (INT8)** | **84.0%** | **199** | **1.85x** | **116** |
 
 ## License
+ MIT
 
docs/VigilAudio_Fine_Tuning.ipynb ADDED
@@ -0,0 +1,199 @@
+ {
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "1342e84e",
+ "metadata": {
+ "vscode": {
+ "languageId": "plaintext"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "import torch\n",
+ "import pandas as pd\n",
+ "import librosa\n",
+ "import numpy as np\n",
+ "from torch.utils.data import Dataset\n",
+ "from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification, Trainer, TrainingArguments\n",
+ "from sklearn.metrics import accuracy_score, f1_score\n",
+ "import shutil"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ef12ccf6",
+ "metadata": {
+ "vscode": {
+ "languageId": "plaintext"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "DRIVE_PROJECT_ROOT = \"/content/drive/MyDrive/Colab_VigilAudio\"\n",
+ "METADATA_PATH = os.path.join(DRIVE_PROJECT_ROOT, \"metadata.csv\")\n",
+ "MODEL_NAME = \"facebook/wav2vec2-base-960h\"\n",
+ "\n",
+ "LOCAL_DATA_PATH = \"/content/Emotions\"\n",
+ "OUTPUT_DIR = os.path.join(DRIVE_PROJECT_ROOT, \"wav2vec2-finetuned\")\n",
+ "LOCAL_OUTPUT = \"/content/wav2vec2-finetuned\"\n",
+ "\n",
+ "os.environ[\"WANDB_DISABLED\"] = \"true\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8459c22f",
+ "metadata": {
+ "vscode": {
+ "languageId": "plaintext"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "def setup_data():\n",
+ "    if not os.path.exists(LOCAL_DATA_PATH):\n",
+ "        print(\"Copying data to local disk (this takes ~3 mins)...\")\n",
+ "        drive_data = os.path.join(DRIVE_PROJECT_ROOT, \"Emotions\")\n",
+ "        if os.path.exists(drive_data):\n",
+ "            shutil.copytree(drive_data, LOCAL_DATA_PATH)\n",
+ "            print(\"Data copy complete.\")\n",
+ "        else:\n",
+ "            print(\"Drive data not found. Assuming data is already in /content/Emotions\")\n",
+ "    else:\n",
+ "        print(\"Data already exists on local disk.\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "14d6662d",
+ "metadata": {
+ "vscode": {
+ "languageId": "plaintext"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "class AudioDataset(Dataset):\n",
+ "    def __init__(self, dataframe, audio_root, feature_extractor, label_map):\n",
+ "        self.df = dataframe.reset_index(drop=True)\n",
+ "        self.audio_root = audio_root\n",
+ "        self.feature_extractor = feature_extractor\n",
+ "        self.label_map = label_map\n",
+ "\n",
+ "    def __len__(self):\n",
+ "        return len(self.df)\n",
+ "\n",
+ "    def __getitem__(self, idx):\n",
+ "        row = self.df.iloc[idx]\n",
+ "        filename = os.path.basename(row['path'].replace('\\\\', '/'))\n",
+ "        folder = row['emotion'].capitalize()\n",
+ "        if folder == 'Surprised': folder = 'Suprised'  # the source dataset's folder name is misspelled\n",
+ "        audio_path = os.path.join(self.audio_root, folder, filename)\n",
+ "\n",
+ "        try:\n",
+ "            speech, _ = librosa.load(audio_path, sr=16000)\n",
+ "            inputs = self.feature_extractor(speech, sampling_rate=16000, padding=True, return_tensors=\"pt\")\n",
+ "            return {\n",
+ "                \"input_values\": inputs.input_values.squeeze(0),\n",
+ "                \"labels\": torch.tensor(self.label_map[row['emotion']], dtype=torch.long),\n",
+ "            }\n",
+ "        except Exception:\n",
+ "            # Corrupt or missing file: fall back to the next sample.\n",
+ "            return self.__getitem__((idx + 1) % len(self))\n",
+ "\n",
+ "def compute_metrics(p):\n",
+ "    preds = np.argmax(p.predictions, axis=1)\n",
+ "    return {\n",
+ "        'accuracy': accuracy_score(p.label_ids, preds),\n",
+ "        'f1': f1_score(p.label_ids, preds, average='weighted'),\n",
+ "    }"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "86dabb49",
+ "metadata": {
+ "vscode": {
+ "languageId": "plaintext"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "setup_data()\n",
+ "\n",
+ "if not os.path.exists(METADATA_PATH):\n",
+ "    # Top-level code cannot `return`; abort the run instead.\n",
+ "    raise FileNotFoundError(f\"Metadata not found at {METADATA_PATH}\")\n",
+ "\n",
+ "df = pd.read_csv(METADATA_PATH)\n",
+ "emotions = sorted(df['emotion'].unique())\n",
+ "label_map = {name: i for i, name in enumerate(emotions)}\n",
+ "id2label = {i: name for name, i in label_map.items()}\n",
+ "\n",
+ "feature_extractor = AutoFeatureExtractor.from_pretrained(MODEL_NAME)\n",
+ "train_ds = AudioDataset(df[df['split']=='train'], LOCAL_DATA_PATH, feature_extractor, label_map)\n",
+ "val_ds = AudioDataset(df[df['split']=='val'], LOCAL_DATA_PATH, feature_extractor, label_map)\n",
+ "\n",
+ "model = Wav2Vec2ForSequenceClassification.from_pretrained(\n",
+ "    MODEL_NAME,\n",
+ "    num_labels=len(emotions),\n",
+ "    id2label=id2label,\n",
+ "    label2id=label_map,\n",
+ "    ignore_mismatched_sizes=True\n",
+ ")\n",
+ "model.freeze_feature_encoder()\n",
+ "\n",
+ "training_args = TrainingArguments(\n",
+ "    output_dir=\"/content/checkpoints\",\n",
+ "    eval_strategy=\"epoch\",\n",
+ "    save_strategy=\"epoch\",\n",
+ "    per_device_train_batch_size=8,\n",
+ "    gradient_accumulation_steps=2,\n",
+ "    num_train_epochs=5,\n",
+ "    learning_rate=3e-5,\n",
+ "    warmup_steps=500,\n",
+ "    load_best_model_at_end=True,\n",
+ "    metric_for_best_model=\"accuracy\",\n",
+ "    fp16=True,\n",
+ "    report_to=\"none\"\n",
+ ")\n",
+ "\n",
+ "trainer = Trainer(\n",
+ "    model=model,\n",
+ "    args=training_args,\n",
+ "    train_dataset=train_ds,\n",
+ "    eval_dataset=val_ds,\n",
+ "    tokenizer=feature_extractor,\n",
+ "    compute_metrics=compute_metrics\n",
+ ")\n",
+ "\n",
+ "print(\"Starting training...\")\n",
+ "trainer.train()\n",
+ "\n",
+ "print(\"Saving final model locally...\")\n",
+ "trainer.save_model(LOCAL_OUTPUT)\n",
+ "\n",
+ "print(\"Zipping for download...\")\n",
+ "shutil.make_archive(\"/content/wav2vec2_model\", 'zip', LOCAL_OUTPUT)\n",
+ "print(\"DONE! Please download /content/wav2vec2_model.zip\")"
+ ]
+ }
+ ],
+ "metadata": {
+ "language_info": {
+ "name": "python"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+ }
docs/benchmark_report.csv ADDED
@@ -0,0 +1,4 @@
+ Model,Accuracy,Latency (Avg ms),Speedup,Size (MB)
+ PyTorch (Full),82.00%,369.98ms,1.00x,360.8MB
+ ONNX (Standard),82.00%,306.52ms,1.21x,361.0MB
+ ONNX (INT8 Quantized),84.00%,199.46ms,1.85x,116.5MB