OnyxMunk committed
Commit d0fa7b7 · 1 Parent(s): a611150

Implement Stable Audio model integration

- Add Stable Audio Open model (stabilityai/stable-audio-open-small)
- Implement model loading with caching mechanism
- Add PyTorch, diffusers, transformers dependencies
- Update Dockerfile for Hugging Face Spaces deployment
- Add comprehensive error handling and fallback mechanisms
- Update README with model information

Files changed (4)
  1. Dockerfile +29 -12
  2. README.md +10 -2
  3. app.py +270 -52
  4. requirements.txt +5 -0
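The "caching mechanism" called out in the commit message amounts to a module-level dictionary that is consulted before any expensive load. A minimal, self-contained sketch of that pattern, with a stand-in loader in place of the real diffusers pipeline:

```python
# Sketch of the load-once caching pattern used in app.py's load_model().
# `fake_load_pipeline` is a hypothetical stand-in for the real
# StableAudioPipeline.from_pretrained() call.

model_cache = {"pipeline": None, "loaded": False}
load_count = 0  # counts how many times the expensive load actually runs

def fake_load_pipeline():
    global load_count
    load_count += 1
    return object()  # stand-in for a pipeline instance

def load_model():
    # Return the cached pipeline if a previous call already loaded it
    if model_cache["loaded"] and model_cache["pipeline"] is not None:
        return model_cache["pipeline"]
    pipeline = fake_load_pipeline()
    model_cache["pipeline"] = pipeline
    model_cache["loaded"] = True
    return pipeline

first = load_model()
second = load_model()
print(load_count)       # 1 -- the expensive load ran only once
print(first is second)  # True -- both calls return the same object
```

The same object is handed back on every subsequent request, which is what makes repeated generations fast after the first one.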
Dockerfile CHANGED
@@ -1,34 +1,51 @@
-# Use lightweight Python 3.10 for audio synthesis
+# Use Python 3.10 for Stable Audio model
+# Optimized for Hugging Face Spaces deployment
 FROM python:3.10-slim
 
 # Set working directory
 WORKDIR /app
 
-# Install minimal system dependencies for audio processing
+# Set environment variables
+ENV PYTHONUNBUFFERED=1
+ENV GRADIO_SERVER_NAME=0.0.0.0
+ENV GRADIO_SERVER_PORT=7860
+ENV HF_HOME=/tmp/.cache/huggingface
+
+# Install system dependencies for audio processing and ML libraries
 RUN apt-get update && apt-get install -y \
     build-essential \
+    git \
+    curl \
     && rm -rf /var/lib/apt/lists/*
 
 # Copy requirements file first for better Docker layer caching
 COPY requirements.txt .
 
-# Install Python dependencies (numpy and scipy only)
-RUN pip install --no-cache-dir -r requirements.txt
+# Install Python dependencies
+# Note: PyTorch will be installed with CPU support by default
+# For GPU support on Spaces, use the GPU base image or install CUDA version
+RUN pip install --no-cache-dir --upgrade pip && \
+    pip install --no-cache-dir -r requirements.txt
+
+# Remove build tools after installation to reduce image size
+RUN apt-get purge -y build-essential && \
+    apt-get autoremove -y && \
+    rm -rf /var/lib/apt/lists/*
 
 # Copy application files
 COPY app.py .
 COPY README.md .
 
-# Expose Gradio default port
-EXPOSE 7860
+# Create cache directory for models
+RUN mkdir -p /tmp/.cache/huggingface
 
-# Set environment variables for Gradio
-ENV GRADIO_SERVER_NAME=0.0.0.0
-ENV GRADIO_SERVER_PORT=7860
+# Expose Gradio default port (required for Hugging Face Spaces)
+EXPOSE 7860
 
-# Health check (optional)
-HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
-    CMD python -c "import numpy as np; print('Health check passed')"
+# Health check - verify Gradio server is responding
+# Increased start-period to allow model download on first run
+HEALTHCHECK --interval=30s --timeout=10s --start-period=180s --retries=3 \
+    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:7860').read()" || exit 1
 
 # Run the application
 CMD ["python", "app.py"]
README.md CHANGED
@@ -38,9 +38,17 @@ An open-source web interface for generating high-quality audio from text prompts
 ## Technical Details
 
 This application uses:
+- **Stable Audio Open** (`stabilityai/stable-audio-open-small`) - Advanced AI model for text-to-audio generation
 - **Gradio** for the web interface
-- **NumPy** and **SciPy** for intelligent audio synthesis
-- **Keyword-based generation** that adapts audio characteristics based on prompt content
+- **PyTorch & Diffusers** for model inference
+- **NumPy** for audio processing and fallback synthesis
+- **Automatic fallback** to simple synthesis if model is unavailable
+
+### Model Information
+- **Model**: `stabilityai/stable-audio-open-small`
+- **First Run**: Model will be automatically downloaded (~1-2 GB)
+- **Device**: Automatically uses GPU if available, falls back to CPU
+- **Caching**: Model is cached in memory for faster subsequent generations
 ## Contributing
 
 This is an open-source project. Contributions are welcome! Feel free to:
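The "Automatic fallback" behavior described in the README can be illustrated with a small sketch. `model_generate` is a hypothetical stand-in that fails the way a missing model or GPU would; the wrapper then falls back to simple synthesis instead of surfacing an error:

```python
import numpy as np

def model_generate(prompt):
    # Stand-in for the Stable Audio pipeline call; here it always fails,
    # as it would when the model weights cannot be loaded.
    raise RuntimeError("model unavailable")

def synth_fallback(prompt, duration=1.0, sample_rate=44100):
    # Simple sine-tone synthesis used when the model cannot run
    t = np.linspace(0, duration, int(duration * sample_rate), endpoint=False)
    return sample_rate, (0.3 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)

def generate(prompt):
    # Try the model first; on any failure, fall back transparently
    try:
        return model_generate(prompt), "model"
    except Exception:
        return synth_fallback(prompt), "fallback"

(sr, audio), source = generate("gentle melody")
print(source)  # fallback
```

The caller always receives a `(sample_rate, waveform)` pair plus a tag saying which path produced it, mirroring the status message the app shows.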
app.py CHANGED
@@ -1,10 +1,214 @@
 import gradio as gr
 import numpy as np
-
-# Simple audio synthesis - avoiding heavy ML models for now
-def generate_audio_from_prompt(prompt, duration, seed):
+import torch
+import os
+import warnings
+
+# Try to import Stable Audio pipeline
+try:
+    from diffusers import StableAudioPipeline
+    STABLE_AUDIO_AVAILABLE = True
+except ImportError:
+    try:
+        # Alternative import path
+        from diffusers import DiffusionPipeline
+        STABLE_AUDIO_AVAILABLE = True
+        StableAudioPipeline = None  # Will use DiffusionPipeline instead
+    except ImportError:
+        STABLE_AUDIO_AVAILABLE = False
+        StableAudioPipeline = None
+
+# Suppress warnings for cleaner output
+warnings.filterwarnings("ignore", category=UserWarning)
+
+# Model configuration
+MODEL_ID = "stabilityai/stable-audio-open-small"
+DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+DTYPE = torch.float16 if DEVICE == "cuda" else torch.float32
+
+# Global model cache
+model_cache = {
+    "pipeline": None,
+    "loaded": False
+}
+
+def load_model():
+    """
+    Load the Stable Audio model with caching to avoid reloading on every request
+    """
+    if not STABLE_AUDIO_AVAILABLE:
+        raise ImportError("diffusers library not available. Please install: pip install diffusers transformers accelerate")
+
+    if model_cache["loaded"] and model_cache["pipeline"] is not None:
+        print("Using cached model")
+        return model_cache["pipeline"]
+
+    try:
+        print(f"Loading Stable Audio model: {MODEL_ID}")
+        print(f"Device: {DEVICE}, Dtype: {DTYPE}")
+
+        # Try StableAudioPipeline first, fallback to DiffusionPipeline
+        if StableAudioPipeline is not None:
+            pipeline = StableAudioPipeline.from_pretrained(
+                MODEL_ID,
+                torch_dtype=DTYPE,
+            )
+        else:
+            from diffusers import DiffusionPipeline
+            pipeline = DiffusionPipeline.from_pretrained(
+                MODEL_ID,
+                torch_dtype=DTYPE,
+            )
+
+        pipeline = pipeline.to(DEVICE)
+
+        # Enable memory efficient attention if available
+        if hasattr(pipeline, "enable_attention_slicing"):
+            pipeline.enable_attention_slicing()
+        if hasattr(pipeline, "enable_vae_slicing"):
+            pipeline.enable_vae_slicing()
+
+        # Cache the model
+        model_cache["pipeline"] = pipeline
+        model_cache["loaded"] = True
+
+        print("Model loaded successfully!")
+        return pipeline
+
+    except Exception as e:
+        print(f"Error loading model: {e}")
+        import traceback
+        traceback.print_exc()
+        model_cache["loaded"] = False
+        raise
+
+def generate_audio_with_model(prompt, duration, seed):
     """
-    Generate audio using simple synthesis based on prompt characteristics
+    Generate audio using the Stable Audio model
+    """
+    try:
+        # Load model (will use cache if already loaded)
+        pipeline = load_model()
+
+        # Prepare seed
+        generator = None
+        if seed is not None:
+            try:
+                seed_int = int(seed)
+                generator = torch.Generator(device=DEVICE).manual_seed(seed_int)
+            except (ValueError, TypeError, OverflowError):
+                generator = None
+
+        # Generate audio
+        print(f"Generating audio: prompt='{prompt}', duration={duration}s, seed={seed}")
+
+        # Stable Audio expects duration in seconds
+        # Note: The model may have limits on duration, so we clamp it
+        audio_duration = float(max(1.0, min(30.0, duration)))
+
+        # Generate audio using the pipeline
+        # Stable Audio Open API - try different parameter combinations
+        output = None
+        try:
+            # Try the standard Stable Audio API
+            output = pipeline(
+                prompt=prompt,
+                num_inference_steps=50,  # Balance between quality and speed
+                audio_length_in_s=audio_duration,
+                generator=generator,
+            )
+        except TypeError as e1:
+            try:
+                # Try alternative parameter name
+                output = pipeline(
+                    prompt=prompt,
+                    num_inference_steps=50,
+                    duration=audio_duration,
+                    generator=generator,
+                )
+            except TypeError as e2:
+                try:
+                    # Try without an explicit duration argument
+                    output = pipeline(
+                        prompt=prompt,
+                        num_inference_steps=50,
+                        generator=generator,
+                    )
+                    # If duration not supported, model will use default
+                    print("Warning: Duration parameter not supported, using model default")
+                except Exception as e3:
+                    raise RuntimeError(f"Failed to generate audio with any parameter combination: {e1}, {e2}, {e3}")
+
+        if output is None:
+            raise RuntimeError("Pipeline returned None")
+
+        # Extract audio array and sample rate
+        # Handle different output formats from diffusers
+        audio = None
+        sample_rate = 44100  # Default
+
+        # Try different output attribute names
+        if hasattr(output, 'audios'):
+            audio_data = output.audios
+            if isinstance(audio_data, (list, tuple)) and len(audio_data) > 0:
+                audio = audio_data[0]
+            else:
+                audio = audio_data
+        elif hasattr(output, 'audio'):
+            audio_data = output.audio
+            if isinstance(audio_data, (list, tuple)) and len(audio_data) > 0:
+                audio = audio_data[0]
+            else:
+                audio = audio_data
+        elif isinstance(output, dict):
+            audio = output.get('audios', output.get('audio', None))
+            if isinstance(audio, (list, tuple)) and len(audio) > 0:
+                audio = audio[0]
+        elif isinstance(output, (list, tuple)) and len(output) > 0:
+            audio = output[0]
+        elif isinstance(output, np.ndarray):
+            audio = output
+        elif isinstance(output, torch.Tensor):
+            audio = output
+
+        # Get sample rate
+        if hasattr(output, 'sample_rate'):
+            sample_rate = output.sample_rate
+        elif isinstance(output, dict):
+            sample_rate = output.get('sample_rate', 44100)
+
+        if audio is None:
+            raise ValueError("Could not extract audio from pipeline output")
+
+        # Handle different audio shapes
+        if len(audio.shape) > 1:
+            # If multi-channel, convert to mono by averaging
+            if audio.shape[0] > audio.shape[1]:
+                audio = audio.mean(axis=0)
+            else:
+                audio = audio.mean(axis=1)
+
+        # Ensure audio is numpy array and float32
+        if isinstance(audio, torch.Tensor):
+            audio = audio.cpu().numpy()
+        audio = audio.astype(np.float32)
+
+        # Normalize to prevent clipping
+        max_val = np.abs(audio).max()
+        if max_val > 0:
+            audio = audio / max_val * 0.95
+
+        print(f"Audio generated: shape={audio.shape}, dtype={audio.dtype}, sample_rate={sample_rate}")
+
+        return sample_rate, audio
+
+    except Exception as e:
+        print(f"Error in model generation: {e}")
+        raise
+
+def generate_audio_fallback(prompt, duration, seed):
+    """
+    Fallback audio generation using simple synthesis
     """
     # Input validation and sanitization
     if prompt is None:
@@ -12,34 +216,30 @@ def generate_audio_from_prompt(prompt, duration, seed):
     if not isinstance(prompt, str):
         prompt = str(prompt)
     if duration is None or not isinstance(duration, (int, float)) or duration <= 0:
-        duration = 10.0  # Default duration
-    duration = min(max(duration, 1.0), 30.0)  # Clamp to reasonable range
+        duration = 10.0
+    duration = min(max(duration, 1.0), 30.0)
 
     sample_rate = 44100
     duration_samples = int(duration * sample_rate)
 
-    # Set seed for reproducibility - handle None case explicitly
+    # Set seed for reproducibility
    if seed is not None:
         try:
             seed_int = int(seed)
             np.random.seed(seed_int)
         except (ValueError, TypeError, OverflowError):
-            # If seed can't be converted to int (including overflow cases like infinity), use system entropy
             pass
 
     # Extract features from prompt to influence audio
     prompt_lower = prompt.lower()
-
-    # Base frequency based on prompt content
     base_freq = 220  # A3 note
 
     if 'high' in prompt_lower or 'bright' in prompt_lower:
-        base_freq *= 2  # Higher octave
+        base_freq *= 2
     elif 'low' in prompt_lower or 'deep' in prompt_lower:
-        base_freq /= 2  # Lower octave
+        base_freq /= 2
 
     if 'fast' in prompt_lower or 'quick' in prompt_lower:
-        # Add vibrato for "fast" sounds
         vibrato_freq = 5
         vibrato_depth = 0.1
     else:
@@ -51,35 +251,31 @@ def generate_audio_from_prompt(prompt, duration, seed):
 
     # Create base waveform
     if 'noise' in prompt_lower or 'wind' in prompt_lower or 'rain' in prompt_lower:
-        # White noise for atmospheric sounds
         audio = np.random.normal(0, 0.3, duration_samples)
     elif 'pulse' in prompt_lower or 'beep' in prompt_lower:
-        # Square wave for electronic sounds
         audio = 0.3 * np.sign(np.sin(2 * np.pi * base_freq * t))
     else:
-        # Sine wave with optional vibrato
         if vibrato_freq > 0:
-            modulated_freq = base_freq * (1 + vibrato_depth * np.sin(2 * np.pi * vibrato_freq * t))
-            audio = 0.3 * np.sin(2 * np.pi * np.cumsum(modulated_freq) * (t[1] - t[0]))
+            phase_modulation = vibrato_depth * np.sin(2 * np.pi * vibrato_freq * t)
+            audio = 0.3 * np.sin(2 * np.pi * base_freq * t + phase_modulation)
         else:
             audio = 0.3 * np.sin(2 * np.pi * base_freq * t)
 
-    # Add harmonics for richer sound
+    # Add harmonics
     if 'rich' in prompt_lower or 'full' in prompt_lower or 'warm' in prompt_lower:
-        # Add octave higher harmonic
         harmonic = 0.2 * np.sin(2 * np.pi * (base_freq * 2) * t)
         audio += harmonic
 
-    # Add some natural variation
+    # Add natural variation
     if 'natural' in prompt_lower or 'organic' in prompt_lower:
-        # Add slight random variation
         variation = np.random.normal(0, 0.05, duration_samples)
         audio += variation
 
-    # Normalize to prevent clipping
+    # Normalize
     audio = np.clip(audio, -0.95, 0.95)
+    audio = audio.astype(np.float32)
 
-    return (sample_rate, audio)
+    return sample_rate, audio
 
 def create_audio_generation_interface():
     """
@@ -88,36 +284,54 @@ def create_audio_generation_interface():
 
     def generate_audio(prompt, duration, seed):
         """
-        Generate audio based on text prompt using intelligent synthesis
+        Generate audio based on text prompt using Stable Audio model
         """
         try:
-            # Input validation for main function
-            if prompt is None:
+            # Input validation
+            if prompt is None or prompt.strip() == "":
                 prompt = "gentle melody"
             if not isinstance(prompt, str):
                 prompt = str(prompt)
             if duration is None or not isinstance(duration, (int, float)):
                 duration = 10.0
-            duration = float(max(1.0, min(30.0, duration)))  # Ensure valid range
+            duration = float(max(1.0, min(30.0, duration)))
 
             print(f"Generating audio for prompt: '{prompt}', duration: {duration}s, seed: {seed}")
 
-            # Use our intelligent synthesis function
-            sample_rate, audio = generate_audio_from_prompt(prompt, duration, seed)
+            # Try to use the model first
+            try:
+                sample_rate, audio = generate_audio_with_model(prompt, duration, seed)
+                status_msg = f"✅ Audio generated successfully using Stable Audio! ({len(audio)/sample_rate:.1f}s)"
+            except Exception as model_error:
+                print(f"Model generation failed: {model_error}")
+                print("Falling back to simple synthesis...")
+                # Fallback to simple synthesis
+                sample_rate, audio = generate_audio_fallback(prompt, duration, seed)
+                status_msg = f"⚠️ Model unavailable, using fallback synthesis. Error: {str(model_error)[:100]}"
+
+            # Verify audio was generated correctly
+            if audio is None or len(audio) == 0:
+                raise ValueError("Generated audio is empty")
+
+            print(f"Audio generated: shape={audio.shape}, dtype={audio.dtype}, sample_rate={sample_rate}")
 
-            return (sample_rate, audio), "Audio generated successfully!"
+            return (sample_rate, audio), status_msg
 
         except Exception as e:
             print(f"Error generating audio: {e}")
-            # Ultimate fallback with safety checks
+            import traceback
+            traceback.print_exc()
+
+            # Ultimate fallback
             try:
                 safe_duration = float(max(1.0, min(30.0, duration if isinstance(duration, (int, float)) else 10.0)))
                 sample_rate = 44100
                 duration_samples = int(safe_duration * sample_rate)
                 t = np.linspace(0, safe_duration, duration_samples, endpoint=False)
-                audio = 0.3 * np.sin(2 * np.pi * 440 * t)  # Simple A4 tone
+                audio = 0.3 * np.sin(2 * np.pi * 440 * t)
+                audio = audio.astype(np.float32)
 
-                return (sample_rate, audio), f"Error: {str(e)}. Using simple fallback."
+                return (sample_rate, audio), f"Error: {str(e)[:100]}. Using emergency fallback."
             except Exception as fallback_error:
                 print(f"Fallback also failed: {fallback_error}")
                 # Absolute minimum fallback
@@ -125,16 +339,18 @@ def create_audio_generation_interface():
                 duration_samples = 441000  # 10 seconds
                 t = np.linspace(0, 10.0, duration_samples, endpoint=False)
                 audio = 0.3 * np.sin(2 * np.pi * 440 * t)
+                audio = audio.astype(np.float32)
 
                 return (sample_rate, audio), "Critical error occurred. Using emergency fallback."
 
     # Create the Gradio interface
+    device_info = "GPU" if DEVICE == "cuda" else "CPU"
    with gr.Blocks(title="Stable Audio Open", theme=gr.themes.Soft()) as interface:
-        gr.Markdown("""
+        gr.Markdown(f"""
        # 🎵 Stable Audio Open
        Generate high-quality audio from text prompts using Stable Audio technology.
-
-        **Note:** This is a demo interface. The actual Stable Audio model integration is coming soon.
+
+        **Device:** {device_info} | **Model:** {MODEL_ID}
        """)
 
        with gr.Row():
@@ -159,7 +375,7 @@ def create_audio_generation_interface():
                    value=None,
                    precision=0,
                    minimum=0,
-                    maximum=999999  # Reasonable upper limit
+                    maximum=999999
                )
 
                generate_btn = gr.Button("🎵 Generate Audio", variant="primary")
@@ -168,35 +384,37 @@ def create_audio_generation_interface():
                audio_output = gr.Audio(label="Generated Audio")
                status_output = gr.Textbox(label="Status", interactive=False)
 
        # Connect the generate button to the function
        generate_btn.click(
            fn=generate_audio,
            inputs=[prompt_input, duration_input, seed_input],
-            outputs=[audio_output, status_output]
-        )
-
-        # Add loading state
-        generate_btn.click(
-            fn=lambda: "🎵 Generating audio... Please wait.",
-            inputs=[],
-            outputs=[status_output],
-            queue=False
+            outputs=[audio_output, status_output],
+            show_progress=True
        )
 
        # Add some example prompts
-        gr.Examples(
+        examples = gr.Examples(
            examples=[
                ["A calming ocean wave sound with seagulls", 15, 42],
                ["Upbeat electronic dance music", 20, 123],
                ["Classical violin concerto", 25, 999],
                ["Rain falling on a tin roof", 10, 777]
            ],
-            inputs=[prompt_input, duration_input, seed_input]
+            inputs=[prompt_input, duration_input, seed_input],
+            outputs=[audio_output, status_output],
+            fn=generate_audio,
+            cache_examples=False
        )
 
    return interface
 
 # Launch the interface
 if __name__ == "__main__":
+    print("Starting Stable Audio Open application...")
+    print(f"PyTorch version: {torch.__version__}")
+    print(f"CUDA available: {torch.cuda.is_available()}")
+    if torch.cuda.is_available():
+        print(f"CUDA device: {torch.cuda.get_device_name(0)}")
+
    interface = create_audio_generation_interface()
-    interface.launch()
+    interface.launch(server_name="0.0.0.0", server_port=7860)
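The post-processing steps in `generate_audio_with_model` (averaging channels to mono, then peak-normalizing to 0.95) can be exercised in isolation. This sketch picks the smaller axis as the channel axis, a slightly more defensive variant of the shape comparison in the diff above:

```python
import numpy as np

def to_mono_normalized(audio: np.ndarray) -> np.ndarray:
    # Collapse multi-channel audio to mono by averaging over the channel axis
    if audio.ndim > 1:
        axis = 0 if audio.shape[0] < audio.shape[1] else 1
        audio = audio.mean(axis=axis)
    audio = audio.astype(np.float32)
    # Peak-normalize to 0.95 to leave headroom and prevent clipping
    peak = np.abs(audio).max()
    if peak > 0:
        audio = audio / peak * 0.95
    return audio

# Synthetic stereo signal: two channels at different amplitudes, shape (2, 1000)
t = np.linspace(0, 1, 1000, endpoint=False)
stereo = np.stack([np.sin(2 * np.pi * 5 * t), 0.5 * np.sin(2 * np.pi * 5 * t)])
mono = to_mono_normalized(stereo)
print(mono.shape)  # (1000,)
```

After averaging, the peak of the mono signal is 0.75; normalization rescales it so the maximum absolute sample sits at 0.95 regardless of the input level.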
requirements.txt CHANGED
@@ -1,2 +1,7 @@
 numpy>=1.21.0
+gradio>=4.0.0
+torch>=2.0.0
+diffusers>=0.25.0
+transformers>=4.35.0
+accelerate>=0.25.0
 scipy>=1.7.0