readme updated
Files changed:
- README.md +78 -0
- src/data/harmonize.py +8 -8
- src/features/build_features.py +4 -4
- src/features/extractor.py +6 -6
README.md
ADDED

# VigilAudio: AI-Powered Audio Moderation Engine

**A production-ready audio emotion classification system built for content moderation.**

VigilAudio is the first phase of a multimodal moderation suite designed to detect distress, aggression, and safety risks in user-generated content. Unlike traditional moderators that look for keywords, VigilAudio listens to the *tone* of the voice, detecting anger, fear, or distress even when the words themselves are neutral.

## Key Features

* **State-of-the-Art Architecture:** Fine-tuned `facebook/wav2vec2-base-960h` Transformer model.
* **High Accuracy:** Achieved **82% accuracy** on a 7-class emotion dataset (Angry, Happy, Sad, Fearful, Disgusted, Neutral, Surprised).
* **Production Pipeline:** End-to-end data harmonization, stratified splitting, and efficient feature extraction.
* **Cloud-Native Training:** Optimized training scripts for Google Colab (T4 GPU), reducing training time from 50+ hours to under 20 minutes.

## Technology Stack

* **Language:** Python 3.10+
* **Environment:** `uv` (for fast dependency management)
* **ML Framework:** PyTorch, Hugging Face Transformers, Accelerate
* **Audio Processing:** Librosa, SoundFile
* **Data Ops:** Pandas, scikit-learn

## Installation

1. **Clone the repository:**
   ```bash
   git clone https://github.com/yourusername/vigilaudio.git
   cd vigilaudio
   ```

2. **Initialize the environment:**
   We use `uv` for lightning-fast setups.
   ```bash
   uv sync
   ```

## Execution Guide

### 1. Data Pipeline (Harmonization)

Turn raw, messy folders into a clean, stratified dataset.

```bash
uv run src/data/harmonize.py
```

* **Input:** Raw audio folders (`Emotions/Angry`, `Emotions/Happy`, ...)
* **Output:** `data/processed/metadata.csv` (unified labels + 80/10/10 splits)
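The 80/10/10 stratified split above can be sketched in two stages with scikit-learn: first carve off 20% of the data, then halve that remainder into validation and test. This is a minimal sketch with toy data; the column names (`emotion`, `split`) follow this README, not necessarily the exact `harmonize.py` code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy metadata: 100 clips, 5 emotions, 20 clips each
df = pd.DataFrame({
    "path": [f"clip_{i}.wav" for i in range(100)],
    "emotion": ["Angry", "Happy", "Sad", "Fearful", "Disgusted"] * 20,
})

# Stage 1: 80% train vs 20% temp, stratified so class ratios are preserved
train_df, temp_df = train_test_split(
    df, test_size=0.2, stratify=df["emotion"], random_state=42
)
# Stage 2: split temp in half -> 10% validation, 10% test
val_df, test_df = train_test_split(
    temp_df, test_size=0.5, stratify=temp_df["emotion"], random_state=42
)

# Tag each row with its split and recombine into one metadata table
train_df = train_df.assign(split="train")
val_df = val_df.assign(split="val")
test_df = test_df.assign(split="test")
final_df = pd.concat([train_df, val_df, test_df])
print(final_df["split"].value_counts())
```

Stratifying both stages keeps every emotion represented in the small validation and test slices, which matters when some classes are rare.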
### 2. Feature Extraction (Local Test)

Verify that your machine can process audio using the Wav2Vec2 processor.

```bash
uv run src/features/extractor.py
```

* **Output:** Prints the embedding shape `(768,)` for a sample file.
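The `(768,)` shape is wav2vec2-base's hidden size: the model emits one 768-dim vector per audio frame, and mean-pooling over time collapses them into a single utterance embedding. A minimal sketch (dummy audio stands in for a real file; the first run downloads the checkpoint):

```python
import numpy as np
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2Model

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h").eval()

# 1 second of dummy audio at 16 kHz (wav2vec2's expected sample rate)
speech = np.random.randn(16000).astype(np.float32)
inputs = processor(speech, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    hidden = model(inputs.input_values).last_hidden_state  # (1, T, 768)

# Mean-pool across the time axis -> one fixed-size utterance embedding
embedding = hidden.mean(dim=1).squeeze(0).numpy()
print(embedding.shape)
```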
### 3. Model Training (The "Professional" Way)

Training a Transformer on a CPU is too slow. We use Google Colab.

1. Upload `train_colab.py` and your `Emotions` folder to Google Drive.
2. Open `VigilAudio_Fine_Tuning.ipynb` in Colab.
3. Set the runtime to **T4 GPU**.
4. Run the training script.

* **Result:** A fine-tuned model saved to `wav2vec2-finetuned/`.
* **Performance:** ~82% accuracy / 0.81 F1 score.
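Inference with the result would look roughly like the sketch below. To keep it self-contained, the 7-way classification head is freshly initialized on top of the base checkpoint; in practice you would point `from_pretrained` at your `wav2vec2-finetuned/` directory instead, and the label order here is an assumption.

```python
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification

LABELS = ["Angry", "Happy", "Sad", "Fearful", "Disgusted", "Neutral", "Surprised"]

# Swap "facebook/wav2vec2-base-960h" for "wav2vec2-finetuned/" to use real weights.
# (Loading the base checkpoint warns that the classifier head is newly initialized.)
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base-960h", num_labels=len(LABELS)
).eval()
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")

# Dummy 1-second clip at 16 kHz; replace with librosa.load(path, sr=16000)[0]
speech = np.random.randn(16000).astype(np.float32)
inputs = extractor(speech, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, 7): one score per emotion

print("Predicted:", LABELS[int(logits.argmax(-1))])
```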
## Dataset

The model was trained on a combined dataset of **12,798 audio recordings** across 7 emotions.

* **Source:** [Kaggle - Audio Emotions Dataset](https://www.kaggle.com/datasets/uldisvalainis/audio-emotions)
* **Composition:** An amalgam of the CREMA-D, TESS, RAVDESS, and SAVEE datasets.

## Results Summary

| Model | Architecture | Training Time | Accuracy |
|-------|--------------|---------------|----------|
| Baseline | Simple MLP (CPU) | ~3 hours | 54% |
| **VigilAudio** | **Fine-Tuned Wav2Vec2 (GPU)** | **17 mins** | **82%** |

## License

MIT
src/data/harmonize.py
CHANGED

```diff
@@ -6,7 +6,7 @@ from tqdm import tqdm
 import librosa
 
 def harmonize_data(raw_data_path, output_path):
-    print(f"
+    print(f"Scanning directory: {raw_data_path}")
 
     data = []
     # Folder names are our labels
@@ -20,7 +20,7 @@ def harmonize_data(raw_data_path, output_path):
         folder_path = Path(raw_data_path) / folder
         files = list(folder_path.glob("*.wav"))
 
-        print(f"
+        print(f"Processing {folder}: {len(files)} files")
 
         for file_path in tqdm(files, desc=f"Processing {folder}"):
             try:
@@ -33,16 +33,16 @@ def harmonize_data(raw_data_path, output_path):
                     "path": str(file_path.absolute())
                 })
             except Exception as e:
-                print(f"
+                print(f"Error processing {file_path}: {e}")
 
     df = pd.DataFrame(data)
 
     if df.empty:
-        print("
+        print("No data found! Please check the raw_data_path.")
         return
 
     # --- Stratified Splitting (80/10/10) ---
-    print("\
+    print("\nCreating stratified splits...")
 
     # First split: Train vs Temp (20%)
     train_df, temp_df = train_test_split(
@@ -66,9 +66,9 @@ def harmonize_data(raw_data_path, output_path):
     os.makedirs(os.path.dirname(output_path), exist_ok=True)
     final_df.to_csv(output_path, index=False)
 
-    print(f"\
-    print(f"
-    print(f"
+    print(f"\nHarmonization Complete!")
+    print(f"Total files: {len(final_df)}")
+    print(f"Metadata saved to: {output_path}")
     print("\nSplit Statistics:")
     print(final_df.groupby(['split', 'emotion']).size().unstack(fill_value=0))
```
src/features/build_features.py
CHANGED

```diff
@@ -13,7 +13,7 @@ def build_all_features(metadata_path, output_dir):
     df = pd.read_csv(metadata_path)
     extractor = AudioFeatureExtractor()
 
-    print(f"
+    print(f"Starting bulk extraction for {len(df)} files...")
 
     # 2. Loop with progress bar
     # We use a custom naming scheme: {split}_{original_filename}.npy
@@ -32,8 +32,8 @@ def build_all_features(metadata_path, output_dir):
         if embedding is not None:
             np.save(embedding_path, embedding)
 
-    print(f"\
-    print(f"
+    print(f"\nBulk Extraction Complete!")
+    print(f"Embeddings saved to: {output_dir.absolute()}")
 
 if __name__ == "__main__":
     METADATA = "data/processed/metadata.csv"
@@ -42,4 +42,4 @@ if __name__ == "__main__":
     if os.path.exists(METADATA):
         build_all_features(METADATA, OUTPUT)
     else:
-        print("
+        print("Metadata not found. Run harmonize.py first.")
```
src/features/extractor.py
CHANGED

```diff
@@ -12,8 +12,8 @@ class AudioFeatureExtractor:
         self.cache_dir = Path(cache_dir)
         self.cache_dir.mkdir(parents=True, exist_ok=True)
 
-        print(f"
-        print(f"
+        print(f"Loading model: {model_name}...")
+        print(f"Cache directory: {self.cache_dir.absolute()}")
 
         # Load processor and model with explicit cache_dir
         self.processor = Wav2Vec2Processor.from_pretrained(model_name, cache_dir=self.cache_dir)
@@ -24,7 +24,7 @@ class AudioFeatureExtractor:
         self.model.to(self.device)
         self.model.eval()
 
-        print(f"
+        print(f"Model loaded on {self.device}")
 
     def extract(self, audio_path):
         """
@@ -48,7 +48,7 @@ class AudioFeatureExtractor:
             return embeddings.cpu().numpy().flatten()
 
         except Exception as e:
-            print(f"
+            print(f"Error extracting features from {audio_path}: {e}")
             return None
 
 if __name__ == "__main__":
@@ -66,7 +66,7 @@ if __name__ == "__main__":
     if embedding is not None:
         print(f"\nSuccess!")
         print(f"File: {sample_path}")
-        print(f"Embedding shape: {embedding.shape}")
+        print(f"Embedding shape: {embedding.shape}")
         print(f"First 5 values: {embedding[:5]}")
     else:
-        print("
+        print("Metadata not found. Please run harmonization first.")
```