---
title: Representation Chizzler
sdk: gradio
sdk_version: 5.16.1
app_file: app.py
python_version: '3.10'
packages:
  - ffmpeg
hf_oauth: true
---

🎧 Representation Chizzler™

A powerful two-stage audio processing tool that combines Voice Activity Detection (VAD) and Speech Enhancement to clean and denoise audio files.

🌟 Features

  1. Two-Stage Processing Pipeline:

    • Stage 1: Uses Silero VAD to detect and extract speech segments
    • Stage 2: Applies MP-SENet deep learning model to remove noise
  2. Memory-Efficient Processing:

    • Processes audio in chunks to prevent memory issues
    • Automatically converts audio to the required format (16kHz mono WAV)
  3. User-Friendly Interface:

    • Beautiful Gradio web interface
    • Real-time progress reporting
    • Compare original, VAD-processed, and denoised versions
  4. Dataset Cleaning to Hub:

    • Load any HF audio dataset (wav/mp3/flac)
    • Process every audio file with Representation Chizzler
    • Upload the cleaned dataset to the Hub under a repo name with a -representation-chizzler suffix
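The memory-efficient chunked processing described above can be sketched roughly as follows; `process_in_chunks` and `process_chunk` are illustrative names for this sketch, not the app's actual API:

```python
import numpy as np

def process_in_chunks(audio, chunk_samples=16000 * 30, process_chunk=lambda c: c):
    """Process a long 16 kHz mono signal in fixed-size chunks to bound memory use."""
    out = []
    for start in range(0, len(audio), chunk_samples):
        chunk = audio[start:start + chunk_samples]   # slice view, never the whole file at once
        out.append(process_chunk(chunk))             # e.g. VAD or denoising on one chunk
    return np.concatenate(out) if out else audio

# Example: identity processing leaves the signal unchanged.
signal = np.random.randn(16000 * 65).astype(np.float32)  # 65 s of 16 kHz audio
result = process_in_chunks(signal)
```

Only one chunk (here 30 s of samples) plus its processed output needs to live in memory at a time, which is what keeps long recordings from exhausting RAM.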

🚀 Installation

  1. Clone this repository:

    git clone https://github.com/Reza2kn/RepresentationChizzler.git
    cd RepresentationChizzler
    
  2. Create and activate a virtual environment:

    python -m venv .venv
    source .venv/bin/activate  # On Windows, use: .venv\Scripts\activate
    
  3. Install dependencies:

    pip install -r requirements.txt
    
  4. Set up environment variables:

    • Create a .env file in the project root
    • Add your Hugging Face token:
      HF_TOKEN=your_huggingface_token_here
      
  5. MP-SENet files and weights:

    • This repo includes a minimal MP-SENet copy plus the pretrained best_ckpt/g_best_dns and best_ckpt/config.json downloaded from the official MP-SENet GitHub repository.

    Optional alternatives (if you want to swap weights):

    • Set MPSENET_REPO to a Hugging Face repo that contains g_best_dns and config.json (use MPSENET_CKPT_FILENAME / MPSENET_CONFIG_FILENAME if the filenames differ).
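The environment-variable fallback in step 5 might look like the sketch below; `resolve_mpsenet_files` is a hypothetical helper, and in the real app the repo branch would fetch the files (e.g. via huggingface_hub) rather than just return the (repo, filename) pairs:

```python
import os

def resolve_mpsenet_files():
    """Pick MP-SENet weights: bundled files by default, or a Hub repo via MPSENET_REPO."""
    repo = os.getenv("MPSENET_REPO")
    ckpt_name = os.getenv("MPSENET_CKPT_FILENAME", "g_best_dns")
    config_name = os.getenv("MPSENET_CONFIG_FILENAME", "config.json")
    if repo:
        # Assumption: the caller downloads these from the Hub repo.
        return (repo, ckpt_name), (repo, config_name)
    base = os.path.join("MP-SENet", "best_ckpt")
    return os.path.join(base, ckpt_name), os.path.join(base, config_name)
```

The filename overrides only matter when the chosen repo stores the checkpoint or config under different names than the defaults.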

🎮 Usage

  1. Run the app:

    python app.py
    
  2. Open your web browser and navigate to the provided URL

  3. Upload an audio file and adjust the parameters:

    • VAD Threshold: Controls voice detection sensitivity (0.1-0.9)
    • Max Silence Gap: Controls merging of close speech segments (1-10s)
    • Normalize volume: Boosts quiet samples and gently attenuates loud ones
  4. Compare the results:

    • Original Audio
    • VAD Processed (Speech Only)
    • Final Denoised

📦 Dataset Cleaning (Hugging Face Hub)

Use the "Dataset to Hub" tab to process any HF dataset that includes audio files (wav, mp3, flac).

Inputs:

  • Dataset ID or URL (defaults to kiarashQ/farsi-asr-unified-cleaned)
  • Optional config and split (use all to process every split)
  • Optional audio column (auto-detected if left empty)
  • Optional output dataset repo (defaults to {username}/{dataset}-representation-chizzler)
  • Resume from cached shards to continue long runs without restarting
  • Normalize volume to raise quiet clips and reduce overly loud clips

Requirements:

  • HF_TOKEN must be set so the app can download private datasets and push the cleaned dataset to your account.

Output notes:

  • The cleaned dataset adds chizzler_ok (bool) and chizzler_error (string) columns for per-row error tracking.
  • Cached shards are stored under chizzler_cache/ (configurable via CHIZZLER_CACHE_DIR).
  • On ZeroGPU, enable "Cache shards on Hub" so resume works across GPU workers. ZeroGPU is preemptible; for uninterrupted runs use a dedicated GPU Space.
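The per-row error tracking could be produced by a wrapper along these lines; `with_error_tracking` and `clean_fn` are illustrative names, not the app's actual code:

```python
def with_error_tracking(row, clean_fn):
    """Run clean_fn on one dataset row; record success/failure in chizzler_ok / chizzler_error."""
    try:
        row = clean_fn(row)
        row["chizzler_ok"] = True
        row["chizzler_error"] = ""
    except Exception as exc:  # never abort the whole run on one bad file
        row["chizzler_ok"] = False
        row["chizzler_error"] = str(exc)
    return row

def failing(row):
    raise ValueError("decode failed")

good = with_error_tracking({"audio": [0.0, 0.1]}, lambda r: r)
bad = with_error_tracking({"audio": None}, failing)
```

Keeping failures as data rather than crashes is what lets a long dataset run finish, with bad rows filterable afterwards via the chizzler_ok column.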

โ˜๏ธ Hugging Face Space (Zero GPU)

This repo is Space-ready. Create a Zero GPU Space and:

  1. Set the Space secret HF_TOKEN.
  2. MP-SENet files are already bundled in the repo:
    • MP-SENet/best_ckpt/g_best_dns
    • MP-SENet/best_ckpt/config.json
  3. Launch uses app.py automatically.

๐Ÿ› ๏ธ Parameters

  • VAD Threshold (0.1-0.9):

    • Higher values = stricter voice detection
    • Lower values = more lenient detection
    • Default: 0.5
  • Max Silence Gap (1-10s):

    • Maximum silence duration to consider segments as continuous
    • Higher values = fewer segments but may include more silence
    • Default: 4.0s
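The Max Silence Gap parameter effectively merges speech segments whose separation is below the threshold. A minimal sketch of that merging, under the assumption that segments are (start, end) times in seconds (`merge_segments` is illustrative, not the app's code):

```python
def merge_segments(segments, max_gap=4.0):
    """Merge (start, end) speech segments separated by at most max_gap seconds."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_gap:
            # Gap is small enough: extend the previous segment instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

merge_segments([(0.0, 1.5), (3.0, 4.0), (10.0, 12.0)], max_gap=4.0)
# -> [(0.0, 4.0), (10.0, 12.0)]
```

Raising max_gap yields fewer, longer segments (at the cost of keeping more silence), which matches the parameter description above.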

๐Ÿ™ Credits

This project combines two powerful models:

  • Silero VAD for voice activity detection
  • MP-SENet for speech enhancement

๐Ÿ“ License

This project is licensed under the terms specified in the MP-SENet repository.