---
title: Representation Chizzler
sdk: gradio
sdk_version: 5.16.1
app_file: app.py
python_version: "3.10"
packages:
  - ffmpeg
hf_oauth: true
---

# 🎧 Representation Chizzler™

A powerful two-stage audio processing tool that combines Voice Activity Detection (VAD) and speech enhancement to clean and denoise audio files.

## 🌟 Features

1. **Two-Stage Processing Pipeline**:
   - Stage 1: Uses Silero VAD to detect and extract speech segments
   - Stage 2: Applies the MP-SENet deep learning model to remove noise
2. **Memory-Efficient Processing**:
   - Processes audio in chunks to prevent memory issues
   - Automatically converts audio to the required format (16 kHz mono WAV)
3. **User-Friendly Interface**:
   - Beautiful Gradio web interface
   - Real-time progress reporting
   - Compare original, VAD-processed, and denoised versions
4. **Dataset Cleaning to Hub**:
   - Load any Hugging Face audio dataset (wav/mp3/flac)
   - Process every audio file with Representation Chizzler
   - Upload a cleaned dataset with a Representation Chizzler suffix

## 🚀 Installation

1. Clone this repository:

   ```bash
   git clone https://github.com/Reza2kn/RepresentationChizzler.git
   cd RepresentationChizzler
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv .venv
   source .venv/bin/activate  # On Windows, use: .venv\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Set up environment variables:
   - Create a `.env` file in the project root
   - Add your Hugging Face token:

     ```
     HF_TOKEN=your_huggingface_token_here
     ```

5. MP-SENet files and weights:
   - This repo includes a minimal MP-SENet copy plus the pretrained `best_ckpt/g_best_dns` and `best_ckpt/config.json` downloaded from the official MP-SENet GitHub repository.
   - Optional alternatives (if you want to swap weights): set `MPSENET_REPO` to a Hugging Face repo that contains `g_best_dns` and `config.json` (use `MPSENET_CKPT_FILENAME` / `MPSENET_CONFIG_FILENAME` if the filenames differ).

## 🎮 Usage
1. Run the app:

   ```bash
   python app.py
   ```

2. Open your web browser and navigate to the provided URL
3. Upload an audio file and adjust the parameters:
   - VAD Threshold: Controls voice detection sensitivity (0.1-0.9)
   - Max Silence Gap: Controls merging of close speech segments (1-10s)
   - Normalize volume: Boosts quiet samples and gently attenuates loud ones
4. Compare the results:
   - Original Audio
   - VAD Processed (Speech Only)
   - Final Denoised

## 📦 Dataset Cleaning (Hugging Face Hub)

Use the "Dataset to Hub" tab to process any HF dataset that includes audio files (wav, mp3, flac).

Inputs:

- Dataset ID or URL (defaults to `kiarashQ/farsi-asr-unified-cleaned`)
- Optional config and split (use `all` to process every split)
- Optional audio column (auto-detected if left empty)
- Optional output dataset repo (defaults to `{username}/{dataset}-representation-chizzler`)
- Resume from cached shards to continue long runs without restarting
- Normalize volume to raise quiet clips and reduce overly loud clips

Requirements:

- `HF_TOKEN` must be set so the app can download private datasets and push the cleaned dataset to your account.

Output notes:

- The cleaned dataset adds `chizzler_ok` (bool) and `chizzler_error` (string) columns for per-row error tracking.
- Cached shards are stored under `chizzler_cache/` (configurable via `CHIZZLER_CACHE_DIR`).
- On ZeroGPU, enable "Cache shards on Hub" so resume works across GPU workers. ZeroGPU is preemptible; for uninterrupted runs, use a dedicated GPU Space.

## ☁️ Hugging Face Space (Zero GPU)

This repo is Space-ready. Create a Zero GPU Space and:

1. Set the Space secret `HF_TOKEN`.
2. MP-SENet files are already bundled in the repo:
   - `MP-SENet/best_ckpt/g_best_dns`
   - `MP-SENet/best_ckpt/config.json`
3. Launch uses `app.py` automatically.
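The Max Silence Gap option described above merges speech segments whose silent gap is short enough. A minimal sketch of that merging logic (a hypothetical helper for illustration, not the app's actual code; times in seconds):

```python
def merge_segments(segments, max_gap=4.0):
    """Merge (start, end) speech segments separated by <= max_gap seconds."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_gap:
            # Gap is small enough: extend the previous segment.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

segments = [(0.0, 1.0), (2.5, 4.0), (10.0, 12.0)]  # (start, end) in seconds
print(merge_segments(segments, max_gap=4.0))  # [(0.0, 4.0), (10.0, 12.0)]
```

With the default 4.0 s gap, the 1.5 s pause between the first two segments is bridged, while the 6.0 s pause before the last segment keeps it separate; raising `max_gap` yields fewer segments but may include more silence.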
## 🛠️ Parameters

- **VAD Threshold** (0.1-0.9):
  - Higher values = stricter voice detection
  - Lower values = more lenient detection
  - Default: 0.5
- **Max Silence Gap** (1-10s):
  - Maximum silence duration to consider segments as continuous
  - Higher values = fewer segments, but may include more silence
  - Default: 4.0s

## 🙏 Credits

This project combines two powerful models:

- [Silero VAD](https://github.com/snakers4/silero-vad) for Voice Activity Detection
- [MP-SENet](https://github.com/yxlu-0102/MP-SENet) for Speech Enhancement

## 📝 License

This project is licensed under the terms specified in the MP-SENet repository.
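As an illustration of how the VAD Threshold parameter behaves, here is a toy sketch of turning per-frame speech probabilities into segments. This is illustrative only; Silero VAD's real API and frame handling differ, and `frames_to_segments` is a hypothetical helper:

```python
def frames_to_segments(probs, threshold=0.5, frame_s=0.032):
    """Group consecutive frames with speech probability >= threshold
    into (start, end) segments, with times in seconds."""
    segments, start = [], None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i * frame_s          # speech run begins
        elif p < threshold and start is not None:
            segments.append((start, i * frame_s))  # speech run ends
            start = None
    if start is not None:                # audio ended mid-speech
        segments.append((start, len(probs) * frame_s))
    return segments

probs = [0.1, 0.8, 0.9, 0.2, 0.7, 0.6]  # one probability per audio frame
print(frames_to_segments(probs, threshold=0.5, frame_s=1.0))   # [(1.0, 3.0), (4.0, 6.0)]
print(frames_to_segments(probs, threshold=0.85, frame_s=1.0))  # [(2.0, 3.0)] (stricter)
```

Raising the threshold from 0.5 to 0.85 drops the borderline frames, which is why higher values give stricter voice detection and lower values more lenient detection.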