Commit ·
fbcb7d1
0
Parent(s):
Initial commit: Speech-to-Text Model Arena app
Browse filesAdd Gradio-based web app for comparing multiple speech-to-text models side-by-side. Includes app source code, Dockerfiles for GPU and CPU, Docker Compose files, requirements, and documentation. Supports Whisper, StutteredSpeechASR, and Wav2Vec2 models with persistent HuggingFace cache and both local and containerized deployment.
- .dockerignore +18 -0
- Dockerfile +43 -0
- Dockerfile.cpu +41 -0
- README.md +156 -0
- app.py +323 -0
- docker-compose.cpu.yml +16 -0
- docker-compose.yml +23 -0
- requirements.txt +6 -0
.dockerignore
ADDED
|
@@ -0,0 +1,18 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
__pycache__
|
| 2 |
+
*.pyc
|
| 3 |
+
*.pyo
|
| 4 |
+
*.pyd
|
| 5 |
+
.Python
|
| 6 |
+
.git
|
| 7 |
+
.gitignore
|
| 8 |
+
.venv
|
| 9 |
+
venv
|
| 10 |
+
env
|
| 11 |
+
*.egg-info
|
| 12 |
+
dist
|
| 13 |
+
build
|
| 14 |
+
.pytest_cache
|
| 15 |
+
.mypy_cache
|
| 16 |
+
*.log
|
| 17 |
+
.DS_Store
|
| 18 |
+
Thumbs.db
|
Dockerfile
ADDED
|
@@ -0,0 +1,43 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
FROM python:3.11-slim

# Set working directory
WORKDIR /app

# Set environment variables
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
# Bind Gradio on all interfaces so the app is reachable from outside the container
ENV GRADIO_SERVER_NAME=0.0.0.0
ENV GRADIO_SERVER_PORT=7860
# Point the HuggingFace caches at a fixed path that docker-compose mounts as a
# volume, so downloaded models survive container restarts.
# NOTE(review): TRANSFORMERS_CACHE is deprecated in newer transformers releases
# in favor of HF_HOME — kept here for compatibility with older versions.
ENV HF_HOME=/app/.cache/huggingface
ENV TRANSFORMERS_CACHE=/app/.cache/huggingface

# Install system dependencies for audio processing
# (ffmpeg + libsndfile1 are required by librosa/soundfile to decode audio)
RUN apt-get update && apt-get install -y --no-install-recommends \
    ffmpeg \
    libsndfile1 \
    git \
    && rm -rf /var/lib/apt/lists/*

# Install PyTorch with CUDA support first
# (kept as its own layer so the large torch download is not repeated when
# requirements.txt changes)
RUN pip install --no-cache-dir \
    torch \
    torchaudio \
    --index-url https://download.pytorch.org/whl/cu126

# Copy requirements first for better caching
COPY requirements.txt .

# Install remaining Python dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY app.py .

# Create cache directory for HuggingFace models
RUN mkdir -p /app/.cache/huggingface

# Expose the Gradio port
EXPOSE 7860

# Run the application
CMD ["python", "app.py"]
|
Dockerfile.cpu
ADDED
|
@@ -0,0 +1,41 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
FROM python:3.11-slim

# Set working directory
WORKDIR /app

# Set environment variables
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
# Bind Gradio on all interfaces so the app is reachable from outside the container
ENV GRADIO_SERVER_NAME=0.0.0.0
ENV GRADIO_SERVER_PORT=7860
# Point the HuggingFace caches at a fixed path that docker-compose mounts as a
# volume, so downloaded models survive container restarts.
# NOTE(review): TRANSFORMERS_CACHE is deprecated in newer transformers releases
# in favor of HF_HOME — kept here for compatibility with older versions.
ENV HF_HOME=/app/.cache/huggingface
ENV TRANSFORMERS_CACHE=/app/.cache/huggingface

# Install system dependencies for audio processing
# (ffmpeg + libsndfile1 are required by librosa/soundfile to decode audio)
RUN apt-get update && apt-get install -y --no-install-recommends \
    ffmpeg \
    libsndfile1 \
    git \
    && rm -rf /var/lib/apt/lists/*

# Install PyTorch CPU-only version (smaller download, works on Mac/Linux/Windows)
# FIX: without the CPU wheel index, the default PyPI wheels bundle CUDA
# libraries on Linux (multi-GB download), defeating the point of this image.
RUN pip install --no-cache-dir \
    torch \
    torchaudio \
    --index-url https://download.pytorch.org/whl/cpu

# Copy requirements first for better caching
COPY requirements.txt .

# Install remaining Python dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY app.py .

# Create cache directory for HuggingFace models
RUN mkdir -p /app/.cache/huggingface

# Expose the Gradio port
EXPOSE 7860

# Run the application
CMD ["python", "app.py"]
|
README.md
ADDED
|
@@ -0,0 +1,156 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# 🏆 Speech-to-Text Model Arena
|
| 2 |
+
|
| 3 |
+
A Gradio-based web application for comparing multiple speech-to-text models side-by-side. Upload audio or record from your microphone and see how different ASR models transcribe your speech.
|
| 4 |
+
|
| 5 |
+

|
| 6 |
+

|
| 7 |
+

|
| 8 |
+
|
| 9 |
+
## 🎯 Features
|
| 10 |
+
|
| 11 |
+
- **Multi-model comparison**: Compare 3 different STT models simultaneously
|
| 12 |
+
- **Audio input flexibility**: Record via microphone or upload audio files
|
| 13 |
+
- **Real-time inference timing**: See how long each model takes to process
|
| 14 |
+
- **GPU acceleration**: Automatically uses CUDA when available
|
| 15 |
+
- **Model caching**: Models are loaded once and cached for faster subsequent runs
|
| 16 |
+
|
| 17 |
+
## 🤖 Models Included
|
| 18 |
+
|
| 19 |
+
| Model | HuggingFace ID | Description |
|
| 20 |
+
|-------|----------------|-------------|
|
| 21 |
+
| StutteredSpeechASR | `AImpower/StutteredSpeechASR` | Whisper fine-tuned for stuttered speech (Mandarin) |
|
| 22 |
+
| Whisper Base | `openai/whisper-base` | OpenAI's base Whisper model |
|
| 23 |
+
| Wav2Vec2 | `facebook/wav2vec2-base-960h` | Meta's Wav2Vec2 (English) |
|
| 24 |
+
|
| 25 |
+
## 📋 Requirements
|
| 26 |
+
|
| 27 |
+
- Python 3.9+
|
| 28 |
+
- NVIDIA GPU with CUDA support (recommended)
|
| 29 |
+
- Docker (optional, for containerized deployment)
|
| 30 |
+
|
| 31 |
+
## 🚀 Quick Start
|
| 32 |
+
|
| 33 |
+
### Option 1: Run Locally
|
| 34 |
+
|
| 35 |
+
1. **Clone the repository**
|
| 36 |
+
```bash
|
| 37 |
+
git clone <your-repo-url>
|
| 38 |
+
cd stt_battle_arena
|
| 39 |
+
```
|
| 40 |
+
|
| 41 |
+
2. **Create a virtual environment** (recommended)
|
| 42 |
+
```bash
|
| 43 |
+
python -m venv venv
|
| 44 |
+
|
| 45 |
+
# Windows
|
| 46 |
+
venv\Scripts\activate
|
| 47 |
+
|
| 48 |
+
# Linux/macOS
|
| 49 |
+
source venv/bin/activate
|
| 50 |
+
```
|
| 51 |
+
|
| 52 |
+
3. **Install dependencies**
|
| 53 |
+
```bash
|
| 54 |
+
pip install -r requirements.txt
|
| 55 |
+
```
|
| 56 |
+
|
| 57 |
+
4. **Run the application**
|
| 58 |
+
```bash
|
| 59 |
+
python app.py
|
| 60 |
+
```
|
| 61 |
+
|
| 62 |
+
5. **Open your browser** and navigate to `http://localhost:7860`
|
| 63 |
+
|
| 64 |
+
### Option 2: Run with Docker (GPU - Linux/Windows with NVIDIA)
|
| 65 |
+
|
| 66 |
+
For machines with NVIDIA GPUs:
|
| 67 |
+
|
| 68 |
+
1. **Build and run with Docker Compose**
|
| 69 |
+
```bash
|
| 70 |
+
docker compose up --build
|
| 71 |
+
```
|
| 72 |
+
|
| 73 |
+
2. **Open your browser** and navigate to `http://localhost:7860`
|
| 74 |
+
|
| 75 |
+
### Option 3: Run with Docker (CPU - Mac/Linux/Windows)
|
| 76 |
+
|
| 77 |
+
For Mac users or machines without NVIDIA GPUs:
|
| 78 |
+
|
| 79 |
+
1. **Build and run with Docker Compose**
|
| 80 |
+
```bash
|
| 81 |
+
docker compose -f docker-compose.cpu.yml up --build
|
| 82 |
+
```
|
| 83 |
+
|
| 84 |
+
2. **Or build and run manually**
|
| 85 |
+
```bash
|
| 86 |
+
# Build the CPU image
|
| 87 |
+
docker build -f Dockerfile.cpu -t stt-arena-cpu .
|
| 88 |
+
|
| 89 |
+
# Run the container
|
| 90 |
+
docker run -p 7860:7860 stt-arena-cpu
|
| 91 |
+
```
|
| 92 |
+
|
| 93 |
+
3. **Open your browser** and navigate to `http://localhost:7860`
|
| 94 |
+
|
| 95 |
+
> ⚠️ **Note**: CPU inference is significantly slower than GPU. Expect 10-30+ seconds per model depending on audio length.
|
| 96 |
+
|
| 97 |
+
## 🐳 Docker Configuration
|
| 98 |
+
|
| 99 |
+
### GPU Support (NVIDIA - Linux/Windows only)
|
| 100 |
+
|
| 101 |
+
The Docker setup requires the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) for GPU acceleration.
|
| 102 |
+
|
| 103 |
+
**Install NVIDIA Container Toolkit:**
|
| 104 |
+
```bash
|
| 105 |
+
# Ubuntu/Debian (current keyring-based install; the old nvidia-docker repo and `apt-key` flow are deprecated)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
|
| 112 |
+
```
|
| 113 |
+
|
| 114 |
+
### Persistent Model Cache
|
| 115 |
+
|
| 116 |
+
The Docker Compose configuration includes a volume (`hf-cache`) to persist downloaded HuggingFace models. This means models won't need to be re-downloaded when the container restarts.
|
| 117 |
+
|
| 118 |
+
## 📁 Project Structure
|
| 119 |
+
|
| 120 |
+
```
|
| 121 |
+
stt_battle_arena/
|
| 122 |
+
├── app.py # Main Gradio application
|
| 123 |
+
├── requirements.txt # Python dependencies
|
| 124 |
+
├── Dockerfile # Docker build (GPU/CUDA)
|
| 125 |
+
├── Dockerfile.cpu # Docker build (CPU-only, Mac compatible)
|
| 126 |
+
├── docker-compose.yml # Docker Compose (GPU)
|
| 127 |
+
├── docker-compose.cpu.yml # Docker Compose (CPU-only, Mac compatible)
|
| 128 |
+
├── .dockerignore # Docker build exclusions
|
| 129 |
+
└── README.md # This file
|
| 130 |
+
```
|
| 131 |
+
|
| 132 |
+
## ⚙️ Configuration
|
| 133 |
+
|
| 134 |
+
### Changing Models
|
| 135 |
+
|
| 136 |
+
To add or modify models, edit the `MODELS` list in `app.py`:
|
| 137 |
+
|
| 138 |
+
```python
|
| 139 |
+
MODELS = [
|
| 140 |
+
{
|
| 141 |
+
"name": "🎙️ Your Model Name",
|
| 142 |
+
"id": "unique_id",
|
| 143 |
+
"hf_id": "huggingface/model-id",
|
| 144 |
+
"description": "Model description",
|
| 145 |
+
},
|
| 146 |
+
# Add more models...
|
| 147 |
+
]
|
| 148 |
+
```
|
| 149 |
+
|
| 150 |
+
## 📚 References
|
| 151 |
+
|
| 152 |
+
- [Gradio Documentation](https://www.gradio.app/docs)
|
| 153 |
+
- [HuggingFace Transformers](https://huggingface.co/docs/transformers)
|
| 154 |
+
- [AImpower StutteredSpeechASR](https://huggingface.co/AImpower/StutteredSpeechASR)
|
| 155 |
+
- [OpenAI Whisper](https://github.com/openai/whisper)
|
| 156 |
+
- [Wav2Vec 2.0](https://huggingface.co/facebook/wav2vec2-base-960h)
|
app.py
ADDED
|
@@ -0,0 +1,323 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Speech-to-Text Model Arena
|
| 3 |
+
A Gradio demo for comparing multiple STT models side-by-side.
|
| 4 |
+
"""
|
| 5 |
+
|
| 6 |
+
import gradio as gr
|
| 7 |
+
import time
|
| 8 |
+
import torch
|
| 9 |
+
import librosa
|
| 10 |
+
import logging
|
| 11 |
+
from transformers import (
|
| 12 |
+
AutoModelForSpeechSeq2Seq,
|
| 13 |
+
AutoProcessor,
|
| 14 |
+
WhisperForConditionalGeneration,
|
| 15 |
+
WhisperProcessor,
|
| 16 |
+
Wav2Vec2ForCTC,
|
| 17 |
+
Wav2Vec2Processor,
|
| 18 |
+
)
|
| 19 |
+
|
| 20 |
+
# Configure logging
|
| 21 |
+
logging.basicConfig(
|
| 22 |
+
level=logging.INFO,
|
| 23 |
+
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
|
| 24 |
+
datefmt="%Y-%m-%d %H:%M:%S",
|
| 25 |
+
)
|
| 26 |
+
logger = logging.getLogger("stt_arena")
|
| 27 |
+
|
| 28 |
+
# Determine device
|
| 29 |
+
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
|
| 30 |
+
TORCH_DTYPE = torch.float16 if torch.cuda.is_available() else torch.float32
|
| 31 |
+
|
| 32 |
+
logger.info(f"Using device: {DEVICE}")
|
| 33 |
+
logger.info(f"Torch dtype: {TORCH_DTYPE}")
|
| 34 |
+
|
| 35 |
+
# Model configurations
|
| 36 |
+
MODELS = [
|
| 37 |
+
{
|
| 38 |
+
"name": "🗣️ StutteredSpeechASR",
|
| 39 |
+
"id": "stuttered",
|
| 40 |
+
"hf_id": "AImpower/StutteredSpeechASR",
|
| 41 |
+
"description": "Whisper fine-tuned for stuttered speech (Mandarin)",
|
| 42 |
+
},
|
| 43 |
+
{
|
| 44 |
+
"name": "🎙️ Whisper Base",
|
| 45 |
+
"id": "whisper",
|
| 46 |
+
"hf_id": "openai/whisper-base",
|
| 47 |
+
"description": "OpenAI Whisper base model",
|
| 48 |
+
},
|
| 49 |
+
{
|
| 50 |
+
"name": "🔊 Wav2Vec2",
|
| 51 |
+
"id": "wav2vec",
|
| 52 |
+
"hf_id": "facebook/wav2vec2-base-960h",
|
| 53 |
+
"description": "Meta's Wav2Vec2 (English)",
|
| 54 |
+
},
|
| 55 |
+
]
|
| 56 |
+
|
| 57 |
+
# Global model cache
|
| 58 |
+
_model_cache = {}
|
| 59 |
+
|
| 60 |
+
|
| 61 |
+
def load_model(model_config: dict):
    """
    Load and cache the model/processor pair described by *model_config*.

    Args:
        model_config: One entry from ``MODELS`` (must contain "id" and "hf_id").

    Returns:
        Tuple of ``(model, processor, model_type)`` where *model_type* is
        ``"whisper"`` or ``"wav2vec"`` and selects the inference path in
        ``run_inference``.

    Raises:
        ValueError: If the config's "id" is not a known model id.
    """
    model_id = model_config["id"]
    hf_id = model_config["hf_id"]

    if model_id in _model_cache:
        logger.debug(f"Model {model_id} found in cache")
        return _model_cache[model_id]

    logger.info(f"Loading model: {hf_id}...")

    # Dispatch table: model id -> (model class, processor class, inference type).
    # "stuttered" and "whisper" are both Whisper-style seq2seq models and share
    # the same generate-based inference path.
    loaders = {
        "stuttered": (AutoModelForSpeechSeq2Seq, AutoProcessor, "whisper"),
        "whisper": (WhisperForConditionalGeneration, WhisperProcessor, "whisper"),
        "wav2vec": (Wav2Vec2ForCTC, Wav2Vec2Processor, "wav2vec"),
    }
    if model_id not in loaders:
        # Previously an unknown id fell through every branch, logged
        # "loaded successfully!" and then raised a bare KeyError on the cache
        # lookup — fail loudly and clearly instead.
        raise ValueError(f"Unknown model id: {model_id!r}")

    model_cls, processor_cls, model_type = loaders[model_id]
    model = model_cls.from_pretrained(hf_id, torch_dtype=TORCH_DTYPE)
    processor = processor_cls.from_pretrained(hf_id)
    model.to(DEVICE)
    _model_cache[model_id] = (model, processor, model_type)

    logger.info(f"Model {hf_id} loaded successfully!")
    return _model_cache[model_id]
|
| 97 |
+
|
| 98 |
+
|
| 99 |
+
def run_inference(audio_path: str, model_config: dict) -> tuple[str, float]:
    """
    Run inference on a single model.

    Args:
        audio_path: Path to the audio file (any format librosa can decode).
        model_config: Model configuration dictionary from ``MODELS``.

    Returns:
        Tuple of (transcribed_text, inference_time_in_seconds).  On failure
        the text carries a user-facing error message and the time is 0.0.
    """
    if audio_path is None:
        logger.warning("No audio provided")
        return "⚠️ No audio provided. Please record or upload audio first.", 0.0

    try:
        logger.info(f"Running inference with model: {model_config['name']}")
        logger.debug(f"Audio path: {audio_path}")

        # Load audio, resampled to 16 kHz mono — the rate all three models expect.
        waveform, sampling_rate = librosa.load(audio_path, sr=16000)
        logger.debug(f"Audio loaded: {len(waveform)} samples at {sampling_rate}Hz")

        # Load model (cached after the first call)
        model, processor, model_type = load_model(model_config)

        # Start timing.  perf_counter() is monotonic; time.time() can jump if
        # the system clock is adjusted, corrupting the measured interval.
        start_time = time.perf_counter()

        if model_type == "whisper":
            # Whisper-style inference: log-mel features -> autoregressive decode
            input_features = processor(
                waveform,
                sampling_rate=16000,
                return_tensors="pt"
            ).input_features
            input_features = input_features.to(DEVICE, dtype=TORCH_DTYPE)

            with torch.no_grad():
                predicted_ids = model.generate(input_features)

            transcription = processor.batch_decode(
                predicted_ids,
                skip_special_tokens=True
            )[0]

        elif model_type == "wav2vec":
            # Wav2Vec2-style inference: raw waveform -> CTC logits -> greedy decode
            inputs = processor(
                waveform,
                sampling_rate=16000,
                return_tensors="pt",
                padding=True
            )
            input_values = inputs.input_values.to(DEVICE, dtype=TORCH_DTYPE)

            with torch.no_grad():
                logits = model(input_values).logits

            predicted_ids = torch.argmax(logits, dim=-1)
            transcription = processor.batch_decode(predicted_ids)[0]

        else:
            transcription = "Unknown model type"

        # Calculate inference time
        inference_time = time.perf_counter() - start_time

        logger.info(f"Inference complete for {model_config['name']}: {inference_time:.3f}s")
        # Truncate long transcriptions in the debug log.
        logger.debug(
            "Transcription: %s",
            transcription[:100] + "..." if len(transcription) > 100 else transcription,
        )

        return transcription.strip(), round(inference_time, 3)

    except Exception as e:
        # Boundary handler: surface the error in the UI for this one model
        # instead of crashing the whole comparison run.
        logger.error(f"Error during inference with {model_config['name']}: {str(e)}", exc_info=True)
        return f"❌ Error: {str(e)}", 0.0
|
| 175 |
+
|
| 176 |
+
|
| 177 |
+
def run_all_models(audio):
    """
    Transcribe *audio* with every configured model, one after another.

    Sequential execution keeps peak GPU memory low and lets each model be
    loaded lazily the first time it is needed.

    Args:
        audio: Filepath from the Gradio audio component (or None).

    Returns:
        Flat list [text1, time1, text2, time2, text3, time3] matching the
        interleaved (Textbox, Number) output components in the UI.
    """
    logger.info(f"Starting inference on {len(MODELS)} models")

    outputs = []
    for cfg in MODELS:
        transcription, elapsed = run_inference(audio, cfg)
        outputs += [transcription, elapsed]

    logger.info("All models completed")
    return outputs
|
| 199 |
+
|
| 200 |
+
|
| 201 |
+
# Build the Gradio interface.
# Everything created inside this `with` block is registered on the `demo`
# Blocks app in declaration order.
with gr.Blocks(
    theme=gr.themes.Soft(),
    title="Speech-to-Text Model Arena",
    css="""
    .model-card {
        border: 1px solid #e0e0e0;
        border-radius: 12px;
        padding: 16px;
        background: linear-gradient(135deg, #f5f7fa 0%, #c3cfe2 100%);
    }
    .run-button {
        background: linear-gradient(90deg, #667eea 0%, #764ba2 100%) !important;
        font-size: 1.2em !important;
        font-weight: bold !important;
    }
    .title-text {
        text-align: center;
        background: linear-gradient(90deg, #667eea, #764ba2);
        -webkit-background-clip: text;
        -webkit-text-fill-color: transparent;
        background-clip: text;
    }
    """
) as demo:

    # Title and Description
    gr.Markdown(
        """
        # 🏆 Speech-to-Text Model Arena

        **Compare multiple speech recognition models side-by-side!**

        Upload an audio file or record using your microphone, then click "Run Models"
        to see how different STT models transcribe your speech. Compare their outputs
        and inference times to find the best model for your use case.
        """,
        elem_classes=["title-text"]
    )

    gr.Markdown("---")

    # Audio Input Section — one shared input feeds all models.
    # type="filepath" hands run_all_models a path on disk (what librosa expects).
    with gr.Group():
        gr.Markdown("### 🎤 Audio Input")
        audio_input = gr.Audio(
            sources=["microphone", "upload"],
            type="filepath",
            label="Record or Upload Audio",
            show_label=True,
        )

    # Run Button
    run_button = gr.Button(
        "🚀 Run Models",
        variant="primary",
        size="lg",
        elem_classes=["run-button"]
    )

    gr.Markdown("---")
    gr.Markdown("### 📊 Model Results")

    # Model Output Cards — one column per model, each holding a transcription
    # Textbox and a timing Number.  They are collected in the SAME interleaved
    # (text, time) order that run_all_models() returns its results in; the two
    # orderings must stay in sync.
    with gr.Row(equal_height=True):
        # Create output components for each model
        output_components = []

        for model in MODELS:
            with gr.Column(elem_classes=["model-card"]):
                gr.Markdown(f"## {model['name']}")

                text_output = gr.Textbox(
                    label="Transcription",
                    placeholder="Transcribed text will appear here...",
                    lines=4,
                    interactive=False,
                )

                time_output = gr.Number(
                    label="⏱️ Inference Time (seconds)",
                    value=0.0,
                    interactive=False,
                    precision=3,
                )

                output_components.extend([text_output, time_output])

    # Connect the button to the inference function
    run_button.click(
        fn=run_all_models,
        inputs=[audio_input],
        outputs=output_components,
        show_progress=True,
    )

    # Footer
    gr.Markdown("---")
    gr.Markdown(
        """
        <center>

        **💡 Tip:**
        - For best results, use clear audio with minimal background noise
        *Built with ❤️ using Gradio*

        </center>
        """,
        elem_classes=["footer"]
    )
|
| 311 |
+
|
| 312 |
+
|
| 313 |
+
# Launch the app
if __name__ == "__main__":
    logger.info("Starting Speech-to-Text Model Arena")
    logger.info(f"Models configured: {[m['name'] for m in MODELS]}")
    # launch() blocks here serving HTTP until the server is stopped
    # (Ctrl-C or container shutdown).
    demo.launch(
        share=False,            # no public Gradio tunnel
        server_name="0.0.0.0",  # listen on all interfaces (needed inside Docker)
        server_port=7860,
        show_error=True,        # surface tracebacks in the browser UI
    )
    logger.info("Application shutdown")
|
docker-compose.cpu.yml
ADDED
|
@@ -0,0 +1,16 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# CPU-only deployment (Mac / machines without an NVIDIA GPU).
services:
  stt-arena:
    build:
      context: .
      dockerfile: Dockerfile.cpu
    image: stt-arena-cpu
    container_name: stt-arena
    ports:
      # Gradio UI
      - "7860:7860"
    volumes:
      # Persist HuggingFace model cache
      - hf-cache:/app/.cache/huggingface
    restart: unless-stopped

volumes:
  hf-cache:
|
docker-compose.yml
ADDED
|
@@ -0,0 +1,23 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# GPU deployment — requires the NVIDIA Container Toolkit on the host.
services:
  stt-arena:
    build:
      context: .
      dockerfile: Dockerfile
    image: stt-arena
    container_name: stt-arena
    ports:
      # Gradio UI
      - "7860:7860"
    volumes:
      # Persist HuggingFace model cache
      - hf-cache:/app/.cache/huggingface
    # Reserve one NVIDIA GPU for the container (compose "deploy" GPU syntax)
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped

volumes:
  hf-cache:
|
requirements.txt
ADDED
|
@@ -0,0 +1,6 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# torch is listed for local (non-Docker) installs; the Dockerfiles install the
# CUDA/CPU-specific build before this file is processed.
gradio>=4.0.0         # web UI
torch>=2.0.0          # model runtime
transformers>=4.36.0  # model + processor loading
librosa>=0.10.0       # audio decoding / resampling
soundfile>=0.12.0     # audio backend used by librosa
accelerate>=0.25.0    # device-placement helpers for transformers
|