Automatic Speech Recognition · Safetensors · Chinese · whisper

gpric024 committed · Commit df83b8b · 1 parent: bc46fb1

Changed to use API calls instead of locally running models

Files changed (10):
  1. .dockerignore +1 -1
  2. .gitignore +1 -0
  3. Dockerfile +2 -15
  4. Dockerfile.cpu +0 -41
  5. README.md +44 -93
  6. app.py +132 -169
  7. docker-compose.cpu.yml +0 -16
  8. docker-compose.yml +5 -16
  9. requirements.txt +2 -5
  10. style.css +149 -0
.dockerignore CHANGED
```diff
@@ -15,4 +15,4 @@ build
 .mypy_cache
 *.log
 .DS_Store
-Thumbs.db
+Thumbs.db
```
.gitignore ADDED
```diff
@@ -0,0 +1 @@
+*.env
```
Dockerfile CHANGED
```diff
@@ -8,34 +8,21 @@ ENV PYTHONDONTWRITEBYTECODE=1
 ENV PYTHONUNBUFFERED=1
 ENV GRADIO_SERVER_NAME=0.0.0.0
 ENV GRADIO_SERVER_PORT=7860
-ENV HF_HOME=/app/.cache/huggingface
-ENV TRANSFORMERS_CACHE=/app/.cache/huggingface
 
-# Install system dependencies for audio processing
+# Install system dependencies (ffmpeg is required for Gradio audio processing)
 RUN apt-get update && apt-get install -y --no-install-recommends \
     ffmpeg \
-    libsndfile1 \
-    git \
     && rm -rf /var/lib/apt/lists/*
 
-# Install PyTorch with CUDA support first
-RUN pip install --no-cache-dir \
-    torch \
-    torchaudio \
-    --index-url https://download.pytorch.org/whl/cu126
-
 # Copy requirements first for better caching
 COPY requirements.txt .
 
-# Install remaining Python dependencies
+# Install Python dependencies
 RUN pip install --no-cache-dir -r requirements.txt
 
 # Copy application code
 COPY app.py .
 
-# Create cache directory for HuggingFace models
-RUN mkdir -p /app/.cache/huggingface
-
 # Expose the Gradio port
 EXPOSE 7860
 
```
Dockerfile.cpu DELETED
```diff
@@ -1,41 +0,0 @@
-FROM python:3.11-slim
-
-# Set working directory
-WORKDIR /app
-
-# Set environment variables
-ENV PYTHONDONTWRITEBYTECODE=1
-ENV PYTHONUNBUFFERED=1
-ENV GRADIO_SERVER_NAME=0.0.0.0
-ENV GRADIO_SERVER_PORT=7860
-ENV HF_HOME=/app/.cache/huggingface
-ENV TRANSFORMERS_CACHE=/app/.cache/huggingface
-
-# Install system dependencies for audio processing
-RUN apt-get update && apt-get install -y --no-install-recommends \
-    ffmpeg \
-    libsndfile1 \
-    git \
-    && rm -rf /var/lib/apt/lists/*
-
-# Install PyTorch CPU-only version (smaller download, works on Mac/Linux/Windows)
-RUN pip install --no-cache-dir \
-    torch \
-    torchaudio
-# Copy requirements first for better caching
-COPY requirements.txt .
-
-# Install remaining Python dependencies
-RUN pip install --no-cache-dir -r requirements.txt
-
-# Copy application code
-COPY app.py .
-
-# Create cache directory for HuggingFace models
-RUN mkdir -p /app/.cache/huggingface
-
-# Expose the Gradio port
-EXPOSE 7860
-
-# Run the application
-CMD ["python", "app.py"]
```
README.md CHANGED
````diff
@@ -1,75 +1,68 @@
-# 🏆 Speech-to-Text Model Arena
+# 🗣️ StutteredSpeechASR Research Demo
 
-A Gradio-based web application for comparing multiple speech-to-text models side-by-side. Upload audio or record from your microphone and see how different ASR models transcribe your speech.
+A Gradio-based research demonstration showcasing **StutteredSpeechASR**, a Whisper model fine-tuned specifically for stuttered speech recognition (Mandarin). Compare its performance against baseline Whisper models to see the improvement on stuttered speech patterns.
 
 ![Python](https://img.shields.io/badge/Python-3.9+-blue.svg)
 ![Gradio](https://img.shields.io/badge/Gradio-4.0+-orange.svg)
-![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-red.svg)
+![Research](https://img.shields.io/badge/Research-Demo-green.svg)
 
 ## 🎯 Features
 
-- **Multi-model comparison**: Compare 3 different STT models simultaneously
-- **Audio input flexibility**: Record via microphone or upload audio files
-- **Real-time inference timing**: See how long each model takes to process
-- **GPU acceleration**: Automatically uses CUDA when available
-- **Model caching**: Models are loaded once and cached for faster subsequent runs
+- **StutteredSpeechASR Research**: Showcases fine-tuned Whisper model specifically designed for stuttered speech
+- **Comparative Analysis**: Side-by-side comparison with baseline Whisper models
+- **Audio Input Flexibility**: Record via microphone or upload audio files
+- **Specialized for Stuttered Speech**: Better handling of repetitions, prolongations, and blocks
+- **Clean Interface**: Organized model cards with clear transcription results
+- **Lightweight Deployment**: All inference via Hugging Face APIs - no GPU required
 
 ## 🤖 Models Included
 
-| Model | HuggingFace ID | Description |
-|-------|----------------|-------------|
-| StutteredSpeechASR | `AImpower/StutteredSpeechASR` | Whisper fine-tuned for stuttered speech (Mandarin) |
-| Whisper Base | `openai/whisper-base` | OpenAI's base Whisper model |
-| Wav2Vec2 | `facebook/wav2vec2-base-960h` | Meta's Wav2Vec2 (English) |
+| Model | Type | Description |
+|-------|------|-------------|
+| 🗣️ **StutteredSpeechASR** | Fine-tuned Research Model | Whisper fine-tuned specifically for stuttered speech (Mandarin) |
+| 🎙️ **Whisper Large V3** | Baseline Model | OpenAI's Whisper Large V3 model via HF Inference API |
+| 🔊 **Whisper Large V3 Turbo** | Baseline Model | OpenAI's Whisper Large V3 Turbo (faster) via HF Inference API |
 
 ## 📋 Requirements
 
 - Python 3.9+
-- NVIDIA GPU with CUDA support (recommended)
+- Hugging Face API key
 - Docker (optional, for containerized deployment)
 
-## 🚀 Quick Start
-
-### Option 1: Run with Docker (GPU - Linux/Windows with NVIDIA)
-
-For machines with NVIDIA GPUs:
-
-1. **Build and run with Docker Compose**
-   ```bash
-   docker compose up --build
-   ```
-
-2. **Open your browser** and navigate to `http://localhost:7860`
-
-### Option 2: Run with Docker (CPU - Mac/Linux/Windows)
-
-For Mac users or machines without NVIDIA GPUs:
-
-1. **Build and run with Docker Compose**
-   ```bash
-   docker compose -f docker-compose.cpu.yml up --build
-   ```
-
-2. **Or build and run manually**
+## 🔑 Environment Setup
+
+Create a `.env` file in the project root with your Hugging Face credentials:
+
+```env
+HF_ENDPOINT=https://your-endpoint-url.aws.endpoints.huggingface.cloud
+HF_API_KEY=hf_your_api_key_here
+```
+
+| Variable | Description |
+|----------|-------------|
+| `HF_ENDPOINT` | Your dedicated Hugging Face Inference Endpoint URL for StutteredSpeechASR |
+| `HF_API_KEY` | Your Hugging Face API token (get one at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)) |
+
+## 🚀 Quick Start
+
+### Option 1: Run with Docker (Recommended)
+
+1. **Create your `.env` file** with HuggingFace credentials (see above)
+
+2. **Build and run with Docker Compose**
    ```bash
-   # Build the CPU image
-   docker build -f Dockerfile.cpu -t stt-arena-cpu .
-
-   # Run the container
-   docker run -p 7860:7860 stt-arena-cpu
+   docker compose up --build
   ```
 
 3. **Open your browser** and navigate to `http://localhost:7860`
 
-> ⚠️ **Note**: CPU inference is significantly slower than GPU. Expect 10-30+ seconds per model depending on audio length.
-
-
-### Option 3: Run Locally
+### Option 2: Run Locally
 
 1. **Clone the repository**
   ```bash
   git clone <your-repo-url>
-   cd stt_battle_arena
+   cd asr_demo
   ```
 
 2. **Create a virtual environment** (recommended)
@@ -88,71 +81,29 @@ For Mac users or machines without NVIDIA GPUs:
   pip install -r requirements.txt
   ```
 
-4. **Run the application**
+4. **Create your `.env` file** with HuggingFace credentials (see Environment Setup above)
+
+5. **Run the application**
   ```bash
   python app.py
   ```
 
-5. **Open your browser** and navigate to `http://localhost:7860`
+6. **Open your browser** and navigate to `http://localhost:7860`
 
-## 🐳 Docker Configuration
-
-### GPU Support (NVIDIA - Linux/Windows only)
-
-The Docker setup requires the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) for GPU acceleration.
-
-**Install NVIDIA Container Toolkit:**
-```bash
-# Ubuntu/Debian
-distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
-curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
-curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
-sudo apt-get update
-sudo apt-get install -y nvidia-container-toolkit
-sudo systemctl restart docker
-```
-
-### Persistent Model Cache
-
-The Docker Compose configuration includes a volume (`hf-cache`) to persist downloaded HuggingFace models. This means models won't need to be re-downloaded when the container restarts.
-
-## 📁 Project Structure
-
-```
-stt_battle_arena/
-├── app.py                  # Main Gradio application
-├── requirements.txt        # Python dependencies
-├── Dockerfile              # Docker build (GPU/CUDA)
-├── Dockerfile.cpu          # Docker build (CPU-only, Mac compatible)
-├── docker-compose.yml      # Docker Compose (GPU)
-├── docker-compose.cpu.yml  # Docker Compose (CPU-only, Mac compatible)
-├── .dockerignore           # Docker build exclusions
-└── README.md               # This file
-```
-
-## ⚙️ Configuration
-
-### Changing Models
-
-To add or modify models, edit the `MODELS` list in `app.py`:
-
-```python
-MODELS = [
-    {
-        "name": "🎙️ Your Model Name",
-        "id": "unique_id",
-        "hf_id": "huggingface/model-id",
-        "description": "Model description",
-    },
-    # Add more models...
-]
-```
+## 🧪 Research Notes
+
+- **Target Language**: The StutteredSpeechASR model is specifically trained for Mandarin Chinese
+- **Use Cases**: Research demonstration, stuttered speech analysis, comparative ASR evaluation
+- **Best Results**: Use clear audio recordings for optimal model performance
+- **Baseline Comparison**: The Whisper models may struggle with stuttered speech patterns that StutteredSpeechASR handles well
 
 ## 📚 References
 
 - [Gradio Documentation](https://www.gradio.app/docs)
-- [HuggingFace Transformers](https://huggingface.co/docs/transformers)
+- [Hugging Face Inference API](https://huggingface.co/docs/api-inference)
+- [Hugging Face Inference Endpoints](https://huggingface.co/docs/inference-endpoints)
 - [AImpower StutteredSpeechASR](https://huggingface.co/AImpower/StutteredSpeechASR)
 - [OpenAI Whisper](https://github.com/openai/whisper)
-- [Wav2Vec 2.0](https://huggingface.co/facebook/wav2vec2-base-960h)
````
 
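The Environment Setup section above boils down to a load-then-read pattern at startup. A minimal sketch, assuming python-dotenv as the diff's requirements do (the fallback-to-shell behavior shown here is illustrative, not part of the commit):

```python
import os

# python-dotenv is optional at runtime here: if it is missing, the variables
# can still come from the shell or from Docker Compose's env_file.
try:
    from dotenv import load_dotenv
    load_dotenv()  # reads KEY=VALUE pairs from ./.env into os.environ
except ImportError:
    pass

HF_ENDPOINT = os.getenv("HF_ENDPOINT")  # dedicated Inference Endpoint URL
HF_API_KEY = os.getenv("HF_API_KEY")    # personal access token (hf_...)

if not HF_API_KEY:
    print("warning: HF_API_KEY is not set; API-backed models will fail")
```

Note that `load_dotenv()` does not override variables already present in the environment, so values injected by Docker's `env_file` win over a stray local `.env`.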
app.py CHANGED
```diff
@@ -4,20 +4,13 @@ A Gradio demo for comparing multiple STT models side-by-side.
 """
 
 import gradio as gr
-import time
-import torch
-import librosa
 import logging
-from transformers import (
-    AutoModelForSpeechSeq2Seq,
-    AutoProcessor,
-    WhisperForConditionalGeneration,
-    WhisperProcessor,
-    Wav2Vec2ForCTC,
-    Wav2Vec2Processor,
-)
+import os
+import requests
+from dotenv import load_dotenv
+
+load_dotenv()
 
-# Configure logging
 logging.basicConfig(
     level=logging.INFO,
     format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
@@ -25,14 +18,16 @@
 )
 logger = logging.getLogger("stt_arena")
 
-# Determine device
-DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
-TORCH_DTYPE = torch.float16 if torch.cuda.is_available() else torch.float32
+HF_ENDPOINT = os.getenv("HF_ENDPOINT")
+HF_API_KEY = os.getenv("HF_API_KEY")
+WHISPER_API_URL = "https://router.huggingface.co/hf-inference/models/openai/whisper-large-v3"
+WHISPER_TURBO_API_URL = "https://router.huggingface.co/hf-inference/models/openai/whisper-large-v3-turbo"
 
-logger.info(f"Using device: {DEVICE}")
-logger.info(f"Torch dtype: {TORCH_DTYPE}")
+if HF_ENDPOINT:
+    logger.info(f"Using Hugging Face Endpoint: {HF_ENDPOINT}")
+else:
+    logger.warning("HF_ENDPOINT not set, StutteredSpeechASR will use local model")
 
-# Model configurations
 MODELS = [
     {
         "name": "🗣️ StutteredSpeechASR",
@@ -41,62 +36,84 @@ MODELS = [
         "description": "Whisper fine-tuned for stuttered speech (Mandarin)",
     },
     {
-        "name": "🎙️ Whisper Base",
+        "name": "🎙️ Whisper Large V3",
         "id": "whisper",
-        "hf_id": "openai/whisper-base",
-        "description": "OpenAI Whisper base model",
+        "hf_id": "openai/whisper-large-v3",
+        "description": "OpenAI Whisper Large V3 model (via HF Inference API)",
     },
     {
-        "name": "🔊 Wav2Vec2",
-        "id": "wav2vec",
-        "hf_id": "facebook/wav2vec2-base-960h",
-        "description": "Meta's Wav2Vec2 (English)",
+        "name": "🔊 Whisper Large V3 Turbo",
+        "id": "whisper_turbo",
+        "hf_id": "openai/whisper-large-v3-turbo",
+        "description": "OpenAI Whisper Large V3 Turbo (via HF Inference API)",
     },
 ]
 
-# Global model cache
-_model_cache = {}
 
-
-def load_model(model_config: dict):
+def run_api_inference(audio_path: str, api_url: str, model_name: str) -> str:
     """
-    Load and cache a model based on its configuration.
+    Run inference using any Hugging Face API endpoint.
+
+    Args:
+        audio_path: Path to the audio file
+        api_url: The API endpoint URL
+        model_name: Name of the model for error messages
+
+    Returns:
+        Transcribed text
     """
-    model_id = model_config["id"]
-    hf_id = model_config["hf_id"]
-
-    if model_id in _model_cache:
-        logger.debug(f"Model {model_id} found in cache")
-        return _model_cache[model_id]
-
-    logger.info(f"Loading model: {hf_id}...")
-
-    if model_id == "stuttered":
-        # StutteredSpeechASR - Whisper-based model
-        model = AutoModelForSpeechSeq2Seq.from_pretrained(hf_id, torch_dtype=TORCH_DTYPE)
-        processor = AutoProcessor.from_pretrained(hf_id)
-        model.to(DEVICE)
-        _model_cache[model_id] = (model, processor, "whisper")
-
-    elif model_id == "whisper":
-        # Standard Whisper model
-        model = WhisperForConditionalGeneration.from_pretrained(hf_id, torch_dtype=TORCH_DTYPE)
-        processor = WhisperProcessor.from_pretrained(hf_id)
-        model.to(DEVICE)
-        _model_cache[model_id] = (model, processor, "whisper")
-
-    elif model_id == "wav2vec":
-        # Wav2Vec2 model
-        model = Wav2Vec2ForCTC.from_pretrained(hf_id, torch_dtype=TORCH_DTYPE)
-        processor = Wav2Vec2Processor.from_pretrained(hf_id)
-        model.to(DEVICE)
-        _model_cache[model_id] = (model, processor, "wav2vec")
-
-    logger.info(f"Model {hf_id} loaded successfully!")
-    return _model_cache[model_id]
+    if not HF_API_KEY:
+        raise ValueError("HF_API_KEY must be set in environment variables")
+
+    logger.info(f"Running inference via {model_name}")
+
+    with open(audio_path, "rb") as f:
+        audio_bytes = f.read()
+
+    headers = {
+        "Authorization": f"Bearer {HF_API_KEY}",
+        "Content-Type": "audio/wav",
+    }
+
+    response = requests.post(
+        api_url,
+        headers=headers,
+        data=audio_bytes,
+        timeout=120,
+    )
+
+    if response.status_code != 200:
+        logger.error(f"{model_name} error: {response.status_code} - {response.text}")
+
+        try:
+            error_data = response.json()
+            error_msg = error_data.get("error", "")
+
+            if "paused" in error_msg.lower():
+                return f"⏸️ The {model_name} endpoint is currently paused. Please contact the maintainer to restart it."
+            elif "loading" in error_msg.lower():
+                return f" {model_name} is loading. Please wait and try again."
+            elif response.status_code == 503:
+                return f"🔄 {model_name} service is temporarily unavailable. Please try again."
+            else:
+                return f"❌ {model_name} Error: {error_msg}"
+        except:
+            return f"❌ {model_name} Error: HTTP {response.status_code}"
+
+    result = response.json()
+    logger.debug(f"{model_name} response: {result}")
+
+    if isinstance(result, dict):
+        transcription = result.get("text", "") or result.get("transcription", "")
+    elif isinstance(result, list) and len(result) > 0:
+        transcription = result[0].get("text", "") if isinstance(result[0], dict) else str(result[0])
+    else:
+        transcription = str(result)
+
+    return transcription.strip()
 
 
-def run_inference(audio_path: str, model_config: dict) -> tuple[str, float]:
+def run_inference(audio_path: str, model_config: dict) -> str:
     """
     Run inference on a single model.
 
@@ -105,135 +122,87 @@ def run_inference(audio_path: str, model_config: dict) -> tuple[str, float]:
         model_config: Model configuration dictionary
 
     Returns:
-        Tuple of (transcribed_text, inference_time_in_seconds)
+        Transcribed text
     """
     if audio_path is None:
         logger.warning("No audio provided")
-        return "⚠️ No audio provided. Please record or upload audio first.", 0.0
+        return "⚠️ No audio provided. Please record or upload audio first."
 
     try:
         logger.info(f"Running inference with model: {model_config['name']}")
         logger.debug(f"Audio path: {audio_path}")
-
-        # Load audio file
-        waveform, sampling_rate = librosa.load(audio_path, sr=16000)
-        logger.debug(f"Audio loaded: {len(waveform)} samples at {sampling_rate}Hz")
-
-        # Load model
-        model, processor, model_type = load_model(model_config)
-
-        # Start timing
-        start_time = time.time()
-
-        if model_type == "whisper":
-            # Whisper-style inference
-            input_features = processor(
-                waveform,
-                sampling_rate=16000,
-                return_tensors="pt"
-            ).input_features
-            input_features = input_features.to(DEVICE, dtype=TORCH_DTYPE)
-
-            with torch.no_grad():
-                predicted_ids = model.generate(input_features)
-
-            transcription = processor.batch_decode(
-                predicted_ids,
-                skip_special_tokens=True
-            )[0]
-
-        elif model_type == "wav2vec":
-            # Wav2Vec2-style inference
-            inputs = processor(
-                waveform,
-                sampling_rate=16000,
-                return_tensors="pt",
-                padding=True
-            )
-            input_values = inputs.input_values.to(DEVICE, dtype=TORCH_DTYPE)
-
-            with torch.no_grad():
-                logits = model(input_values).logits
-
-            predicted_ids = torch.argmax(logits, dim=-1)
-            transcription = processor.batch_decode(predicted_ids)[0]
-
-        else:
-            transcription = "Unknown model type"
-
-        # Calculate inference time
-        inference_time = time.time() - start_time
-
-        logger.info(f"Inference complete for {model_config['name']}: {inference_time:.3f}s")
-        logger.debug(f"Transcription: {transcription[:100]}..." if len(transcription) > 100 else f"Transcription: {transcription}")
-
-        return transcription.strip(), round(inference_time, 3)
+
+        if model_config["id"] == "stuttered" and HF_ENDPOINT and HF_API_KEY:
+            return run_api_inference(audio_path, HF_ENDPOINT, "StutteredSpeechASR")
+
+        if model_config["id"] == "whisper" and HF_API_KEY:
+            return run_api_inference(audio_path, WHISPER_API_URL, "Whisper Large V3")
+
+        if model_config["id"] == "whisper_turbo" and HF_API_KEY:
+            return run_api_inference(audio_path, WHISPER_TURBO_API_URL, "Whisper Large V3 Turbo")
+
+        raise ValueError("HF_API_KEY must be set to use this model")
 
     except Exception as e:
         logger.error(f"Error during inference with {model_config['name']}: {str(e)}", exc_info=True)
-        return f"❌ Error: {str(e)}", 0.0
+        return f"❌ Error: {str(e)}"
 
 
 def run_all_models(audio):
     """
     Run inference on all models sequentially.
 
-    Note: Running sequentially to avoid GPU memory issues and ensure
-    models are loaded one at a time if needed.
-
     Args:
         audio: Audio input from Gradio component
 
     Returns:
-        List of results for each model (text1, time1, text2, time2, text3, time3)
+        List of transcription results for each model
     """
     logger.info(f"Starting inference on {len(MODELS)} models")
     results = []
 
     for model_config in MODELS:
-        text, inference_time = run_inference(audio, model_config)
-        results.extend([text, inference_time])
+        text = run_inference(audio, model_config)
+        results.append(text)
 
     logger.info("All models completed")
     return results
 
 
+def load_css():
+    """Load CSS from external file"""
+    css_path = os.path.join(os.path.dirname(__file__), "style.css")
+    try:
+        with open(css_path, "r", encoding="utf-8") as f:
+            return f.read()
+    except FileNotFoundError:
+        logger.warning(f"CSS file not found at {css_path}")
+        return ""
+
+
 # Build the Gradio interface
 with gr.Blocks(
     theme=gr.themes.Soft(),
-    title="Speech-to-Text Model Arena",
-    css="""
-    .model-card {
-        border: 1px solid #e0e0e0;
-        border-radius: 12px;
-        padding: 16px;
-        background: linear-gradient(135deg, #f5f7fa 0%, #c3cfe2 100%);
-    }
-    .run-button {
-        background: linear-gradient(90deg, #667eea 0%, #764ba2 100%) !important;
-        font-size: 1.2em !important;
-        font-weight: bold !important;
-    }
-    .title-text {
-        text-align: center;
-        background: linear-gradient(90deg, #667eea, #764ba2);
-        -webkit-background-clip: text;
-        -webkit-text-fill-color: transparent;
-        background-clip: text;
-    }
-    """
+    title="StutteredSpeechASR Research Demo",
+    css=load_css()
 ) as demo:
 
     # Title and Description
     gr.Markdown(
         """
-        # 🏆 Speech-to-Text Model Arena
+        <div style="text-align: center; max-width: 800px; margin: 0 auto;">
+
+        # 🗣️ StutteredSpeechASR Research Demo
 
-        **Compare multiple speech recognition models side-by-side!**
+        ### Fine-tuned Whisper model for stuttered speech recognition
 
-        Upload an audio file or record using your microphone, then click "Run Models"
-        to see how different STT models transcribe your speech. Compare their outputs
-        and inference times to find the best model for your use case.
+        This demo showcases our **StutteredSpeechASR** model, a Whisper model fine-tuned specifically
+        for stuttered speech (Mandarin). Compare its performance against baseline Whisper models
+        to see the improvement on stuttered speech patterns.
+
+        Upload an audio file or record using your microphone to test the models.
+
+        </div>
        """,
        elem_classes=["title-text"]
    )
@@ -247,23 +216,23 @@ with gr.Blocks(
         sources=["microphone", "upload"],
         type="filepath",
         label="Record or Upload Audio",
-        show_label=True,
+        streaming=False,
+        editable=True,
     )
 
     # Run Button
     run_button = gr.Button(
-        "🚀 Run Models",
+        "🚀 Compare Models",
         variant="primary",
         size="lg",
         elem_classes=["run-button"]
     )
 
     gr.Markdown("---")
-    gr.Markdown("### 📊 Model Results")
+    gr.Markdown("### 📊 Model Comparison Results")
 
     # Model Output Cards
     with gr.Row(equal_height=True):
-        # Create output components for each model
         output_components = []
 
         for model in MODELS:
@@ -277,16 +246,8 @@ with gr.Blocks(
                 interactive=False,
             )
 
-            time_output = gr.Number(
-                label="⏱️ Inference Time (seconds)",
-                value=0.0,
-                interactive=False,
-                precision=3,
-            )
-
-            output_components.extend([text_output, time_output])
+            output_components.append(text_output)
 
-    # Connect the button to the inference function
     run_button.click(
         fn=run_all_models,
         inputs=[audio_input],
@@ -300,9 +261,11 @@ with gr.Blocks(
         """
         <center>
 
-        **💡 Tip:**
-        - For best results, use clear audio with minimal background noise
-        *Built with ❤️ using Gradio*
+        **💡 Research Note:**
+        - The StutteredSpeechASR model is designed to better handle stuttered speech patterns
+        - For best results, use clear audio recordings
+
+        *Research Demo | AImpower StutteredSpeechASR*
 
         </center>
         """,
@@ -312,7 +275,7 @@ with gr.Blocks(
 
 # Launch the app
 if __name__ == "__main__":
-    logger.info("Starting Speech-to-Text Model Arena")
+    logger.info("Starting StutteredSpeechASR Research Demo")
    logger.info(f"Models configured: {[m['name'] for m in MODELS]}")
    demo.launch(
        share=False,
```
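One subtlety in the new `run_api_inference` is that the Inference API's JSON can arrive either as a dict with a `text` key or as a list of chunk dicts. The normalization at the end of that function can be factored into a small pure function and exercised directly (the function name here is mine, not part of the commit):

```python
def extract_transcription(result) -> str:
    """Normalize a JSON-decoded ASR response into plain text, mirroring
    the dict/list handling at the end of run_api_inference."""
    if isinstance(result, dict):
        # dedicated endpoints usually return {"text": "..."}
        text = result.get("text", "") or result.get("transcription", "")
    elif isinstance(result, list) and len(result) > 0:
        # some pipelines return a list of chunk dicts
        text = result[0].get("text", "") if isinstance(result[0], dict) else str(result[0])
    else:
        text = str(result)
    return text.strip()
```

Keeping this branching in one place makes it easy to extend if a future endpoint returns yet another shape.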
docker-compose.cpu.yml DELETED
```diff
@@ -1,16 +0,0 @@
-services:
-  stt-arena:
-    build:
-      context: .
-      dockerfile: Dockerfile.cpu
-    image: stt-arena-cpu
-    container_name: stt-arena
-    ports:
-      - "7860:7860"
-    volumes:
-      # Persist HuggingFace model cache
-      - hf-cache:/app/.cache/huggingface
-    restart: unless-stopped
-
-volumes:
-  hf-cache:
```
docker-compose.yml CHANGED
```diff
@@ -1,23 +1,12 @@
 services:
-  stt-arena:
+  stuttered-speech-asr-demo:
     build:
       context: .
       dockerfile: Dockerfile
-    image: stt-arena
-    container_name: stt-arena
+    image: stuttered-speech-asr-demo
+    container_name: stuttered-speech-asr-demo
     ports:
       - "7860:7860"
-    volumes:
-      # Persist HuggingFace model cache
-      - hf-cache:/app/.cache/huggingface
-    deploy:
-      resources:
-        reservations:
-          devices:
-            - driver: nvidia
-              count: 1
-              capabilities: [gpu]
+    env_file:
+      - .env
     restart: unless-stopped
-
-volumes:
-  hf-cache:
```
requirements.txt CHANGED
```diff
@@ -1,6 +1,3 @@
 gradio>=4.0.0
-torch>=2.0.0
-transformers>=4.36.0
-librosa>=0.10.0
-soundfile>=0.12.0
-accelerate>=0.25.0
+python-dotenv>=1.0.0
+requests>=2.31.0
```
 
 
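With all inference moved behind HTTP, the trimmed requirements leave `requests` as the only transport dependency. For reference, the request shape the app sends (bearer auth, raw WAV bytes as the body) can be sketched with just the standard library; `build_asr_request` and the URL below are illustrative, not part of the commit:

```python
from urllib.request import Request

def build_asr_request(api_url: str, api_key: str, audio_bytes: bytes) -> Request:
    """Build (but do not send) a POST equivalent to the one the app issues."""
    return Request(
        api_url,
        data=audio_bytes,  # raw audio file bytes as the request body
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "audio/wav",
        },
        method="POST",
    )

# example with dummy values; nothing is transmitted
req = build_asr_request("https://example.test/asr", "hf_dummy", b"RIFF....WAVE")
```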
style.css ADDED
```diff
@@ -0,0 +1,149 @@
+/* Force light mode implementation */
+:root, .dark, body, gradio-app {
+    --background-fill-primary: #ffffff !important;
+    --background-fill-secondary: #f3f4f6 !important;
+    --background-fill-tertiary: #e5e7eb !important;
+    --block-background-fill: #ffffff !important;
+    --block-border-color: #e5e7eb !important;
+    --block-label-text-color: #374151 !important;
+    --body-background-fill: #ffffff !important;
+    --body-text-color: #1f2937 !important;
+    --input-background-fill: #ffffff !important;
+    color-scheme: light !important;
+}
+
+/* Override dark mode specific styles */
+.dark .gradio-container {
+    background-color: #ffffff !important;
+    color: #1f2937 !important;
+}
+
+/* Ensure all text is dark and readable */
+p, h1, h2, h3, span, label, textarea, .prose {
+    color: #1f2937 !important;
+}
+
+/* Transcription textboxes */
+textarea {
+    background-color: #ffffff !important;
+    color: #1f2937 !important;
+    font-size: 16px !important;
+    line-height: 1.6 !important;
+}
+
+/* Audio component styling */
+.audio-container {
+    background-color: #ffffff !important;
+}
+
+/* Footer readability */
+.footer {
+    color: #1f2937 !important;
+}
+.footer p {
+    color: #1f2937 !important;
+}
+
+/* Model Card styling */
+.model-card {
+    border: 1px solid #e0e0e0;
+    border-radius: 12px;
+    padding: 16px;
+    background: #ffffff !important;
+}
+
+/* Force Textbox background to white explicitly */
+.block.textarea, .block.textbox {
+    background: #ffffff !important;
+}
+
+/* Ensure model card text is dark */
+.model-card h2, .model-card p, .model-card span {
+    color: #1f2937 !important;
+}
+
+.run-button {
+    background: linear-gradient(90deg, #667eea 0%, #764ba2 100%) !important;
+    font-size: 1.2em !important;
+    font-weight: bold !important;
+    color: white !important;
+}
+
+/* Fix the specific "Transcription" label element */
+span[data-testid="block-info"], .svelte-jdcl7l {
+    background: #ffffff !important;
+    background-color: #ffffff !important;
+    color: #1f2937 !important;
+    padding: 4px 8px !important;
+    border-radius: 4px !important;
+    border: 1px solid #e5e7eb !important;
+}
+
+/* Fix label headers for Audio and Transcription inputs - most aggressive approach */
+* [class*="label"], * [class*="Label"], .label, .Label,
+.block-label, span.label-wrap, .label-wrap span, label,
```
85
+ .textbox label, .textbox .label-wrap, .textbox .block-label,
86
+ .gr-textbox label, .gr-textbox .label-wrap, .gr-textbox .block-label,
87
+ [data-testid="textbox"] label, [data-testid="textbox"] .label-wrap,
88
+ .gradio-textbox label, .gradio-textbox .label-wrap {
89
+ background: #ffffff !important;
90
+ background-color: #ffffff !important;
91
+ color: #1f2937 !important;
92
+ border: none !important;
93
+ font-weight: bold !important;
94
+ font-size: 1.1em !important;
95
+ margin-bottom: 8px !important;
96
+ padding: 4px 8px !important;
97
+ border-radius: 4px !important;
98
+ }
99
+
100
+ /* Ensure specific component headers are readable */
101
+ .svelte-1b6s6s {
102
+ /* This targets Gradio specific label classes if needed */
103
+ color: #374151 !important;
104
+ }
105
+
106
+ /* Title section centering - universal approach */
107
+ [data-testid="markdown"] {
108
+ display: flex !important;
109
+ justify-content: center !important;
110
+ width: 100% !important;
111
+ }
112
+
113
+ [data-testid="markdown"] > * {
114
+ width: 100% !important;
115
+ max-width: 800px !important;
116
+ text-align: center !important;
117
+ }
118
+
119
+ /* Target any element with title-text class and all its children */
120
+ .title-text,
121
+ .title-text > *,
122
+ .title-text span,
123
+ .title-text div {
124
+ text-align: center !important;
125
+ margin-left: auto !important;
126
+ margin-right: auto !important;
127
+ }
128
+
129
+ /* Force center alignment on all heading and paragraph elements in title */
130
+ .title-text h1,
131
+ .title-text h2,
132
+ .title-text h3,
133
+ .title-text p {
134
+ text-align: center !important;
135
+ margin-left: auto !important;
136
+ margin-right: auto !important;
137
+ display: block !important;
138
+ width: 100% !important;
139
+ }
140
+
141
+ .title-text h1 {
142
+ color: #4f46e5 !important;
143
+ margin-bottom: 0.5em !important;
144
+ }
145
+
146
+ .title-text h3 {
147
+ margin-bottom: 1.5em !important;
148
+ color: #6b7280 !important;
149
+ }
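Gradio picks up custom styles through the `css` argument of `gr.Blocks`. The `read_css` helper below is a sketch of how the new stylesheet could be wired in; whether app.py loads the file exactly this way is an assumption:

```python
from pathlib import Path


def read_css(path: str = "style.css") -> str:
    """Return the stylesheet contents, or an empty string if the file is absent."""
    p = Path(path)
    return p.read_text(encoding="utf-8") if p.is_file() else ""


# In app.py this would typically be used as:
#   import gradio as gr
#   with gr.Blocks(css=read_css()) as demo:
#       ...
```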