Automatic Speech Recognition
Safetensors
Chinese
whisper
gpric024 committed on
Commit fbcb7d1 · 0 parent(s)

Initial commit: Speech-to-Text Model Arena app


Add Gradio-based web app for comparing multiple speech-to-text models side-by-side. Includes app source code, Dockerfiles for GPU and CPU, Docker Compose files, requirements, and documentation. Supports Whisper, StutteredSpeechASR, and Wav2Vec2 models with persistent HuggingFace cache and both local and containerized deployment.

Files changed (8)
  1. .dockerignore +18 -0
  2. Dockerfile +43 -0
  3. Dockerfile.cpu +41 -0
  4. README.md +156 -0
  5. app.py +323 -0
  6. docker-compose.cpu.yml +16 -0
  7. docker-compose.yml +23 -0
  8. requirements.txt +6 -0
.dockerignore ADDED
@@ -0,0 +1,18 @@
+__pycache__
+*.pyc
+*.pyo
+*.pyd
+.Python
+.git
+.gitignore
+.venv
+venv
+env
+*.egg-info
+dist
+build
+.pytest_cache
+.mypy_cache
+*.log
+.DS_Store
+Thumbs.db
Dockerfile ADDED
@@ -0,0 +1,43 @@
+FROM python:3.11-slim
+
+# Set working directory
+WORKDIR /app
+
+# Set environment variables
+ENV PYTHONDONTWRITEBYTECODE=1
+ENV PYTHONUNBUFFERED=1
+ENV GRADIO_SERVER_NAME=0.0.0.0
+ENV GRADIO_SERVER_PORT=7860
+ENV HF_HOME=/app/.cache/huggingface
+ENV TRANSFORMERS_CACHE=/app/.cache/huggingface
+
+# Install system dependencies for audio processing
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    ffmpeg \
+    libsndfile1 \
+    git \
+    && rm -rf /var/lib/apt/lists/*
+
+# Install PyTorch with CUDA support first
+RUN pip install --no-cache-dir \
+    torch \
+    torchaudio \
+    --index-url https://download.pytorch.org/whl/cu126
+
+# Copy requirements first for better caching
+COPY requirements.txt .
+
+# Install remaining Python dependencies
+RUN pip install --no-cache-dir -r requirements.txt
+
+# Copy application code
+COPY app.py .
+
+# Create cache directory for HuggingFace models
+RUN mkdir -p /app/.cache/huggingface
+
+# Expose the Gradio port
+EXPOSE 7860
+
+# Run the application
+CMD ["python", "app.py"]
Dockerfile.cpu ADDED
@@ -0,0 +1,41 @@
+FROM python:3.11-slim
+
+# Set working directory
+WORKDIR /app
+
+# Set environment variables
+ENV PYTHONDONTWRITEBYTECODE=1
+ENV PYTHONUNBUFFERED=1
+ENV GRADIO_SERVER_NAME=0.0.0.0
+ENV GRADIO_SERVER_PORT=7860
+ENV HF_HOME=/app/.cache/huggingface
+ENV TRANSFORMERS_CACHE=/app/.cache/huggingface
+
+# Install system dependencies for audio processing
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    ffmpeg \
+    libsndfile1 \
+    git \
+    && rm -rf /var/lib/apt/lists/*
+
+# Install PyTorch CPU-only version (smaller download, works on Mac/Linux/Windows)
+RUN pip install --no-cache-dir \
+    torch \
+    torchaudio --index-url https://download.pytorch.org/whl/cpu
+# Copy requirements first for better caching
+COPY requirements.txt .
+
+# Install remaining Python dependencies
+RUN pip install --no-cache-dir -r requirements.txt
+
+# Copy application code
+COPY app.py .
+
+# Create cache directory for HuggingFace models
+RUN mkdir -p /app/.cache/huggingface
+
+# Expose the Gradio port
+EXPOSE 7860
+
+# Run the application
+CMD ["python", "app.py"]
README.md ADDED
@@ -0,0 +1,156 @@
+# 🏆 Speech-to-Text Model Arena
+
+A Gradio-based web application for comparing multiple speech-to-text models side-by-side. Upload audio or record from your microphone and see how different ASR models transcribe your speech.
+
+![Python](https://img.shields.io/badge/Python-3.9+-blue.svg)
+![Gradio](https://img.shields.io/badge/Gradio-4.0+-orange.svg)
+![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-red.svg)
+
+## 🎯 Features
+
+- **Multi-model comparison**: Compare 3 different STT models simultaneously
+- **Audio input flexibility**: Record via microphone or upload audio files
+- **Real-time inference timing**: See how long each model takes to process
+- **GPU acceleration**: Automatically uses CUDA when available
+- **Model caching**: Models are loaded once and cached for faster subsequent runs
+
+## 🤖 Models Included
+
+| Model | HuggingFace ID | Description |
+|-------|----------------|-------------|
+| StutteredSpeechASR | `AImpower/StutteredSpeechASR` | Whisper fine-tuned for stuttered speech (Mandarin) |
+| Whisper Base | `openai/whisper-base` | OpenAI's base Whisper model |
+| Wav2Vec2 | `facebook/wav2vec2-base-960h` | Meta's Wav2Vec2 (English) |
+
+## 📋 Requirements
+
+- Python 3.9+
+- NVIDIA GPU with CUDA support (recommended)
+- Docker (optional, for containerized deployment)
+
+## 🚀 Quick Start
+
+### Option 1: Run Locally
+
+1. **Clone the repository**
+   ```bash
+   git clone <your-repo-url>
+   cd stt_battle_arena
+   ```
+
+2. **Create a virtual environment** (recommended)
+   ```bash
+   python -m venv venv
+
+   # Windows
+   venv\Scripts\activate
+
+   # Linux/macOS
+   source venv/bin/activate
+   ```
+
+3. **Install dependencies**
+   ```bash
+   pip install -r requirements.txt
+   ```
+
+4. **Run the application**
+   ```bash
+   python app.py
+   ```
+
+5. **Open your browser** and navigate to `http://localhost:7860`
+
+### Option 2: Run with Docker (GPU - Linux/Windows with NVIDIA)
+
+For machines with NVIDIA GPUs:
+
+1. **Build and run with Docker Compose**
+   ```bash
+   docker compose up --build
+   ```
+
+2. **Open your browser** and navigate to `http://localhost:7860`
+
+### Option 3: Run with Docker (CPU - Mac/Linux/Windows)
+
+For Mac users or machines without NVIDIA GPUs:
+
+1. **Build and run with Docker Compose**
+   ```bash
+   docker compose -f docker-compose.cpu.yml up --build
+   ```
+
+2. **Or build and run manually**
+   ```bash
+   # Build the CPU image
+   docker build -f Dockerfile.cpu -t stt-arena-cpu .
+
+   # Run the container
+   docker run -p 7860:7860 stt-arena-cpu
+   ```
+
+3. **Open your browser** and navigate to `http://localhost:7860`
+
+> ⚠️ **Note**: CPU inference is significantly slower than GPU. Expect 10-30+ seconds per model depending on audio length.
+
+## 🐳 Docker Configuration
+
+### GPU Support (NVIDIA - Linux/Windows only)
+
+The Docker setup requires the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) for GPU acceleration.
+
+**Install NVIDIA Container Toolkit:**
+```bash
+# Ubuntu/Debian
+distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
+curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
+curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
+sudo apt-get update
+sudo apt-get install -y nvidia-container-toolkit
+sudo systemctl restart docker
+```
+
+### Persistent Model Cache
+
+The Docker Compose configuration includes a volume (`hf-cache`) to persist downloaded HuggingFace models. This means models won't need to be re-downloaded when the container restarts.
+
+## 📁 Project Structure
+
+```
+stt_battle_arena/
+├── app.py                  # Main Gradio application
+├── requirements.txt        # Python dependencies
+├── Dockerfile              # Docker build (GPU/CUDA)
+├── Dockerfile.cpu          # Docker build (CPU-only, Mac compatible)
+├── docker-compose.yml      # Docker Compose (GPU)
+├── docker-compose.cpu.yml  # Docker Compose (CPU-only, Mac compatible)
+├── .dockerignore           # Docker build exclusions
+└── README.md               # This file
+```
+
+## ⚙️ Configuration
+
+### Changing Models
+
+To add or modify models, edit the `MODELS` list in `app.py`:
+
+```python
+MODELS = [
+    {
+        "name": "🎙️ Your Model Name",
+        "id": "unique_id",
+        "hf_id": "huggingface/model-id",
+        "description": "Model description",
+    },
+    # Add more models...
+]
+```
+
+## 📚 References
+
+- [Gradio Documentation](https://www.gradio.app/docs)
+- [HuggingFace Transformers](https://huggingface.co/docs/transformers)
+- [AImpower StutteredSpeechASR](https://huggingface.co/AImpower/StutteredSpeechASR)
+- [OpenAI Whisper](https://github.com/openai/whisper)
+- [Wav2Vec 2.0](https://huggingface.co/facebook/wav2vec2-base-960h)
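The "Model caching" feature listed above boils down to a small memoization pattern: the first request pays the `from_pretrained` download/load cost, and every later request reuses the cached object. A minimal framework-free sketch (hypothetical `get_model`/`fake_loader` names; the real app.py keys its cache by model id and stores `(model, processor, model_type)` tuples):

```python
# Minimal sketch of the lazy model cache used in app.py.
# `loader` stands in for the expensive from_pretrained() calls.
_model_cache = {}

def get_model(model_id, loader):
    """Load a model on first use; reuse the cached instance afterwards."""
    if model_id not in _model_cache:
        _model_cache[model_id] = loader(model_id)
    return _model_cache[model_id]

# Demonstrate that the loader runs only once per model id.
load_calls = []

def fake_loader(model_id):
    load_calls.append(model_id)
    return f"model:{model_id}"

first = get_model("whisper", fake_loader)
second = get_model("whisper", fake_loader)
assert first is second            # same cached object returned
assert load_calls == ["whisper"]  # expensive load happened exactly once
```

The trade-off is memory: all three models stay resident once loaded, which is why the README recommends a GPU with enough VRAM.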
app.py ADDED
@@ -0,0 +1,323 @@
+"""
+Speech-to-Text Model Arena
+A Gradio demo for comparing multiple STT models side-by-side.
+"""
+
+import gradio as gr
+import time
+import torch
+import librosa
+import logging
+from transformers import (
+    AutoModelForSpeechSeq2Seq,
+    AutoProcessor,
+    WhisperForConditionalGeneration,
+    WhisperProcessor,
+    Wav2Vec2ForCTC,
+    Wav2Vec2Processor,
+)
+
+# Configure logging
+logging.basicConfig(
+    level=logging.INFO,
+    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
+    datefmt="%Y-%m-%d %H:%M:%S",
+)
+logger = logging.getLogger("stt_arena")
+
+# Determine device
+DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+TORCH_DTYPE = torch.float16 if torch.cuda.is_available() else torch.float32
+
+logger.info(f"Using device: {DEVICE}")
+logger.info(f"Torch dtype: {TORCH_DTYPE}")
+
+# Model configurations
+MODELS = [
+    {
+        "name": "🗣️ StutteredSpeechASR",
+        "id": "stuttered",
+        "hf_id": "AImpower/StutteredSpeechASR",
+        "description": "Whisper fine-tuned for stuttered speech (Mandarin)",
+    },
+    {
+        "name": "🎙️ Whisper Base",
+        "id": "whisper",
+        "hf_id": "openai/whisper-base",
+        "description": "OpenAI Whisper base model",
+    },
+    {
+        "name": "🔊 Wav2Vec2",
+        "id": "wav2vec",
+        "hf_id": "facebook/wav2vec2-base-960h",
+        "description": "Meta's Wav2Vec2 (English)",
+    },
+]
+
+# Global model cache
+_model_cache = {}
+
+
+def load_model(model_config: dict):
+    """
+    Load and cache a model based on its configuration.
+    """
+    model_id = model_config["id"]
+    hf_id = model_config["hf_id"]
+
+    if model_id in _model_cache:
+        logger.debug(f"Model {model_id} found in cache")
+        return _model_cache[model_id]
+
+    logger.info(f"Loading model: {hf_id}...")
+
+    if model_id == "stuttered":
+        # StutteredSpeechASR - Whisper-based model
+        model = AutoModelForSpeechSeq2Seq.from_pretrained(hf_id, torch_dtype=TORCH_DTYPE)
+        processor = AutoProcessor.from_pretrained(hf_id)
+        model.to(DEVICE)
+        _model_cache[model_id] = (model, processor, "whisper")
+
+    elif model_id == "whisper":
+        # Standard Whisper model
+        model = WhisperForConditionalGeneration.from_pretrained(hf_id, torch_dtype=TORCH_DTYPE)
+        processor = WhisperProcessor.from_pretrained(hf_id)
+        model.to(DEVICE)
+        _model_cache[model_id] = (model, processor, "whisper")
+
+    elif model_id == "wav2vec":
+        # Wav2Vec2 model
+        model = Wav2Vec2ForCTC.from_pretrained(hf_id, torch_dtype=TORCH_DTYPE)
+        processor = Wav2Vec2Processor.from_pretrained(hf_id)
+        model.to(DEVICE)
+        _model_cache[model_id] = (model, processor, "wav2vec")
+
+    logger.info(f"Model {hf_id} loaded successfully!")
+    return _model_cache[model_id]
+
+
+def run_inference(audio_path: str, model_config: dict) -> tuple[str, float]:
+    """
+    Run inference on a single model.
+
+    Args:
+        audio_path: Path to the audio file
+        model_config: Model configuration dictionary
+
+    Returns:
+        Tuple of (transcribed_text, inference_time_in_seconds)
+    """
+    if audio_path is None:
+        logger.warning("No audio provided")
+        return "⚠️ No audio provided. Please record or upload audio first.", 0.0
+
+    try:
+        logger.info(f"Running inference with model: {model_config['name']}")
+        logger.debug(f"Audio path: {audio_path}")
+
+        # Load audio file
+        waveform, sampling_rate = librosa.load(audio_path, sr=16000)
+        logger.debug(f"Audio loaded: {len(waveform)} samples at {sampling_rate}Hz")
+
+        # Load model
+        model, processor, model_type = load_model(model_config)
+
+        # Start timing
+        start_time = time.time()
+
+        if model_type == "whisper":
+            # Whisper-style inference
+            input_features = processor(
+                waveform,
+                sampling_rate=16000,
+                return_tensors="pt"
+            ).input_features
+            input_features = input_features.to(DEVICE, dtype=TORCH_DTYPE)
+
+            with torch.no_grad():
+                predicted_ids = model.generate(input_features)
+
+            transcription = processor.batch_decode(
+                predicted_ids,
+                skip_special_tokens=True
+            )[0]
+
+        elif model_type == "wav2vec":
+            # Wav2Vec2-style inference
+            inputs = processor(
+                waveform,
+                sampling_rate=16000,
+                return_tensors="pt",
+                padding=True
+            )
+            input_values = inputs.input_values.to(DEVICE, dtype=TORCH_DTYPE)
+
+            with torch.no_grad():
+                logits = model(input_values).logits
+
+            predicted_ids = torch.argmax(logits, dim=-1)
+            transcription = processor.batch_decode(predicted_ids)[0]
+
+        else:
+            transcription = "Unknown model type"
+
+        # Calculate inference time
+        inference_time = time.time() - start_time
+
+        logger.info(f"Inference complete for {model_config['name']}: {inference_time:.3f}s")
+        logger.debug(f"Transcription: {transcription[:100]}..." if len(transcription) > 100 else f"Transcription: {transcription}")
+
+        return transcription.strip(), round(inference_time, 3)
+
+    except Exception as e:
+        logger.error(f"Error during inference with {model_config['name']}: {str(e)}", exc_info=True)
+        return f"❌ Error: {str(e)}", 0.0
+
+
+def run_all_models(audio):
+    """
+    Run inference on all models sequentially.
+
+    Note: Running sequentially to avoid GPU memory issues and ensure
+    models are loaded one at a time if needed.
+
+    Args:
+        audio: Audio input from Gradio component
+
+    Returns:
+        List of results for each model (text1, time1, text2, time2, text3, time3)
+    """
+    logger.info(f"Starting inference on {len(MODELS)} models")
+    results = []
+
+    for model_config in MODELS:
+        text, inference_time = run_inference(audio, model_config)
+        results.extend([text, inference_time])
+
+    logger.info("All models completed")
+    return results
+
+
+# Build the Gradio interface
+with gr.Blocks(
+    theme=gr.themes.Soft(),
+    title="Speech-to-Text Model Arena",
+    css="""
+    .model-card {
+        border: 1px solid #e0e0e0;
+        border-radius: 12px;
+        padding: 16px;
+        background: linear-gradient(135deg, #f5f7fa 0%, #c3cfe2 100%);
+    }
+    .run-button {
+        background: linear-gradient(90deg, #667eea 0%, #764ba2 100%) !important;
+        font-size: 1.2em !important;
+        font-weight: bold !important;
+    }
+    .title-text {
+        text-align: center;
+        background: linear-gradient(90deg, #667eea, #764ba2);
+        -webkit-background-clip: text;
+        -webkit-text-fill-color: transparent;
+        background-clip: text;
+    }
+    """
+) as demo:
+
+    # Title and Description
+    gr.Markdown(
+        """
+        # 🏆 Speech-to-Text Model Arena
+
+        **Compare multiple speech recognition models side-by-side!**
+
+        Upload an audio file or record using your microphone, then click "Run Models"
+        to see how different STT models transcribe your speech. Compare their outputs
+        and inference times to find the best model for your use case.
+        """,
+        elem_classes=["title-text"]
+    )
+
+    gr.Markdown("---")
+
+    # Audio Input Section
+    with gr.Group():
+        gr.Markdown("### 🎤 Audio Input")
+        audio_input = gr.Audio(
+            sources=["microphone", "upload"],
+            type="filepath",
+            label="Record or Upload Audio",
+            show_label=True,
+        )
+
+    # Run Button
+    run_button = gr.Button(
+        "🚀 Run Models",
+        variant="primary",
+        size="lg",
+        elem_classes=["run-button"]
+    )
+
+    gr.Markdown("---")
+    gr.Markdown("### 📊 Model Results")
+
+    # Model Output Cards
+    with gr.Row(equal_height=True):
+        # Create output components for each model
+        output_components = []
+
+        for model in MODELS:
+            with gr.Column(elem_classes=["model-card"]):
+                gr.Markdown(f"## {model['name']}")
+
+                text_output = gr.Textbox(
+                    label="Transcription",
+                    placeholder="Transcribed text will appear here...",
+                    lines=4,
+                    interactive=False,
+                )
+
+                time_output = gr.Number(
+                    label="⏱️ Inference Time (seconds)",
+                    value=0.0,
+                    interactive=False,
+                    precision=3,
+                )
+
+                output_components.extend([text_output, time_output])
+
+    # Connect the button to the inference function
+    run_button.click(
+        fn=run_all_models,
+        inputs=[audio_input],
+        outputs=output_components,
+        show_progress=True,
+    )
+
+    # Footer
+    gr.Markdown("---")
+    gr.Markdown(
+        """
+        <center>
+
+        **💡 Tip:**
+        - For best results, use clear audio with minimal background noise
+        *Built with ❤️ using Gradio*
+
+        </center>
+        """,
+        elem_classes=["footer"]
+    )
+
+
+# Launch the app
+if __name__ == "__main__":
+    logger.info("Starting Speech-to-Text Model Arena")
+    logger.info(f"Models configured: {[m['name'] for m in MODELS]}")
+    demo.launch(
+        share=False,
+        server_name="0.0.0.0",
+        server_port=7860,
+        show_error=True,
+    )
+    logger.info("Application shutdown")
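`run_all_models` returns a flat list of `(text, time)` pairs in `MODELS` order, and Gradio maps that list positionally onto `output_components`, which is built in the same loop order. A minimal sketch of that ordering contract (stand-in model ids and values, no real inference):

```python
# Stand-in for the MODELS list in app.py.
models = [{"id": "stuttered"}, {"id": "whisper"}, {"id": "wav2vec"}]

def run_all(audio):
    """Mimic run_all_models: flatten (text, time) pairs in MODELS order,
    matching the order output_components was populated in."""
    results = []
    for cfg in models:
        # Stand-in for run_inference(audio, cfg).
        text, secs = f"transcript-{cfg['id']}", 0.0
        results.extend([text, secs])
    return results

out = run_all("clip.wav")
assert len(out) == 2 * len(models)           # one (text, time) pair per model
assert out[0::2] == [f"transcript-{m['id']}" for m in models]  # texts at even slots
```

If a model is added to `MODELS`, both sides stay in sync automatically because the UI loop and the inference loop iterate the same list.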
docker-compose.cpu.yml ADDED
@@ -0,0 +1,16 @@
+services:
+  stt-arena:
+    build:
+      context: .
+      dockerfile: Dockerfile.cpu
+    image: stt-arena-cpu
+    container_name: stt-arena
+    ports:
+      - "7860:7860"
+    volumes:
+      # Persist HuggingFace model cache
+      - hf-cache:/app/.cache/huggingface
+    restart: unless-stopped
+
+volumes:
+  hf-cache:
docker-compose.yml ADDED
@@ -0,0 +1,23 @@
+services:
+  stt-arena:
+    build:
+      context: .
+      dockerfile: Dockerfile
+    image: stt-arena
+    container_name: stt-arena
+    ports:
+      - "7860:7860"
+    volumes:
+      # Persist HuggingFace model cache
+      - hf-cache:/app/.cache/huggingface
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - driver: nvidia
+              count: 1
+              capabilities: [gpu]
+    restart: unless-stopped
+
+volumes:
+  hf-cache:
requirements.txt ADDED
@@ -0,0 +1,6 @@
+gradio>=4.0.0
+torch>=2.0.0
+transformers>=4.36.0
+librosa>=0.10.0
+soundfile>=0.12.0
+accelerate>=0.25.0