victor (HF Staff) committed
Commit d5bdce6 · 1 Parent(s): 63586ee

Implement PersonaPlex ZeroGPU demo


- Update README.md with zerogpu hardware and model documentation
- Create requirements.txt with pinned dependencies, including moshi from git
- Rewrite app.py with a ZeroGPU-compatible architecture:
  - Load models to CPU at startup (no CUDA at module level)
  - Move to GPU inside the @spaces.GPU-decorated function
  - Fresh LMGen instance per call for stateless inference
  - 120s GPU duration with a queue concurrency limit of 1
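The CPU-then-GPU lifecycle in the bullets above can be sketched framework-free. This is a minimal illustration, not PersonaPlex code: `TinyModel` is an invented stand-in for the real torch modules, and the no-op fallback decorator is an assumption for running outside Spaces.

```python
# Minimal sketch of the ZeroGPU call lifecycle described above.
# `spaces` only exists on Hugging Face Spaces; fall back to a no-op
# decorator elsewhere so the same file runs locally.
try:
    import spaces
    gpu = spaces.GPU(duration=120)       # 120s GPU budget per call
except ImportError:
    def gpu(fn):
        return fn                        # local fallback: run unchanged

class TinyModel:
    """Illustrative stand-in for a torch module loaded at import time."""
    def __init__(self):
        self.device = "cpu"              # no CUDA at module level
    def to(self, device):
        self.device = device
        return self

MODEL = TinyModel()                      # loaded once, on CPU, at startup

@gpu
def infer(prompt: str) -> str:
    # On ZeroGPU each call runs in a fresh forked process, so moving
    # the model here cannot leak GPU state between requests.
    model = MODEL.to("cuda")
    return f"[{model.device}] {prompt}"
```

Locally the fallback decorator simply runs the function; on Spaces the real `@spaces.GPU` handles device allocation and enforces the duration budget.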

Files changed (3)
  1. README.md +34 -4
  2. app.py +369 -4
  3. requirements.txt +11 -0
README.md CHANGED
@@ -1,12 +1,42 @@
  ---
  title: PersonaPlex
- emoji: 🌍
- colorFrom: blue
- colorTo: pink
+ emoji: 🎭
+ colorFrom: purple
+ colorTo: blue
  sdk: gradio
  sdk_version: 6.3.0
  app_file: app.py
  pinned: false
+ hardware: zerogpu
+ python_version: "3.10"
  ---
  
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # PersonaPlex 7B Demo
+ 
+ Interactive demo for [nvidia/personaplex-7b-v1](https://huggingface.co/nvidia/personaplex-7b-v1) - a multimodal speech-to-speech model capable of real-time persona-driven conversation.
+ 
+ ## Features
+ 
+ - **Voice Input**: Record or upload audio
+ - **Persona Selection**: Choose from different conversation personas
+ - **Voice Cloning**: Select different voice styles for output
+ - **Real-time Generation**: Streaming speech generation
+ 
+ ## Usage
+ 
+ 1. Record or upload an audio clip
+ 2. Select a persona (e.g., "helpful assistant", "casual friend")
+ 3. Choose an output voice
+ 4. Click Generate to hear the response
+ 
+ ## Model Info
+ 
+ PersonaPlex is based on the Moshi architecture and supports:
+ - Audio-to-audio generation
+ - Persona conditioning
+ - Multiple voice embeddings
+ - Streaming inference
+ 
+ ## Requirements
+ 
+ This Space requires access to the gated model. Make sure you have accepted the license at [nvidia/personaplex-7b-v1](https://huggingface.co/nvidia/personaplex-7b-v1).
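The streaming behaviour advertised in the README follows from Mimi's fixed frame size, and app.py derives its generation-step budget from the same numbers. A quick arithmetic check, with the constants copied from app.py (`max_steps` mirrors the formula in `generate_response()`):

```python
# Mimi frame arithmetic used by app.py.
SAMPLE_RATE = 24000   # Hz, Mimi codec sample rate
FRAME_SIZE = 1920     # samples per codec frame

frame_ms = FRAME_SIZE / SAMPLE_RATE * 1000    # duration of one frame
frames_per_second = SAMPLE_RATE / FRAME_SIZE  # LM steps per second of audio

def max_steps(max_duration_s: float) -> int:
    # One LM step per 80 ms frame, as in generate_response().
    return int(max_duration_s * frames_per_second)

print(frame_ms, frames_per_second, max_steps(10.0))  # 80.0 12.5 125
```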
app.py CHANGED
@@ -1,7 +1,372 @@
+ """
+ PersonaPlex 7B ZeroGPU Demo
+ 
+ This demo runs nvidia/personaplex-7b-v1 on Hugging Face Spaces using ZeroGPU.
+ 
+ Key ZeroGPU constraints:
+ - CUDA not available at startup - models load to CPU first
+ - Each @spaces.GPU call is a forked process - no state persistence
+ - Models must be moved to GPU inside the decorated function
+ """
+ 
+ import os
+ import spaces
  import gradio as gr
- 
- def greet(name):
-     return "Hello " + name + "!!"
- 
- demo = gr.Interface(fn=greet, inputs="text", outputs="text")
- demo.launch()
+ import torch
+ import numpy as np
+ from huggingface_hub import hf_hub_download
+ 
+ # Moshi imports
+ from moshi import loaders
+ from moshi.models import LMGen
+ 
+ # ============================================================================
+ # Configuration
+ # ============================================================================
+ 
+ HF_REPO = "nvidia/personaplex-7b-v1"
+ SAMPLE_RATE = 24000  # Mimi codec sample rate
+ FRAME_SIZE = 1920    # Samples per frame (80ms at 24kHz)
+ 
+ # Persona definitions
+ PERSONAS = {
+     "Helpful Assistant": "You are a helpful, friendly AI assistant.",
+     "Casual Friend": "You are a casual, laid-back friend having a conversation.",
+     "Professional": "You are a professional business consultant.",
+     "Teacher": "You are a patient, knowledgeable teacher explaining concepts.",
+ }
+ 
+ # Voice options (mapped to voice embedding indices)
+ VOICES = {
+     "Default": 0,
+     "Voice A": 1,
+     "Voice B": 2,
+     "Voice C": 3,
+ }
+ 
+ # ============================================================================
+ # Model Loading (CPU at startup)
+ # ============================================================================
+ 
+ print("PersonaPlex Demo starting...")
+ print("Loading models to CPU (ZeroGPU mode)...")
+ 
+ # Get HF token for gated model access
+ HF_TOKEN = os.environ.get("HF_TOKEN")
+ if not HF_TOKEN:
+     print("Warning: HF_TOKEN not set. Model download may fail for gated models.")
+ 
+ # Download model weights (just paths, no GPU needed)
+ print("Downloading model weights...")
+ try:
+     MIMI_WEIGHT_PATH = hf_hub_download(
+         HF_REPO,
+         loaders.MIMI_NAME,
+         token=HF_TOKEN,
+     )
+     MOSHI_WEIGHT_PATH = hf_hub_download(
+         HF_REPO,
+         loaders.MOSHI_NAME,
+         token=HF_TOKEN,
+     )
+     print(f"Mimi weights: {MIMI_WEIGHT_PATH}")
+     print(f"Moshi weights: {MOSHI_WEIGHT_PATH}")
+ except Exception as e:
+     print(f"Error downloading weights: {e}")
+     print("Make sure you have accepted the model license and set HF_TOKEN")
+     raise
+ 
+ # Load models to CPU (NOT CUDA - ZeroGPU constraint)
+ print("Loading Mimi codec to CPU...")
+ MIMI_CPU = loaders.get_mimi(MIMI_WEIGHT_PATH, device="cpu")
+ MIMI_CPU.eval()
+ 
+ print("Loading Moshi LM to CPU...")
+ MOSHI_LM_CPU = loaders.get_moshi_lm(MOSHI_WEIGHT_PATH, device="cpu")
+ MOSHI_LM_CPU.eval()
+ 
+ # Load tokenizer if available
+ try:
+     TOKENIZER_PATH = hf_hub_download(HF_REPO, "tokenizer.model", token=HF_TOKEN)
+     print(f"Tokenizer: {TOKENIZER_PATH}")
+ except Exception:
+     TOKENIZER_PATH = None
+     print("No tokenizer found, using default")
+ 
+ print("CPU model loading complete!")
+ 
+ # ============================================================================
+ # GPU Inference Function
+ # ============================================================================
+ 
+ @spaces.GPU(duration=120)
+ def generate_response(
+     audio_input: tuple,
+     persona: str,
+     voice: str,
+     temperature: float = 0.7,
+     top_k: int = 250,
+     max_duration: float = 10.0,
+ ) -> tuple:
+     """
+     Generate a speech response from audio input.
+ 
+     Args:
+         audio_input: Tuple of (sample_rate, audio_array) from Gradio
+         persona: Selected persona name
+         voice: Selected voice name
+         temperature: Sampling temperature
+         top_k: Top-k sampling parameter
+         max_duration: Maximum output duration in seconds
+ 
+     Returns:
+         Tuple of (sample_rate, audio_array) for Gradio output
+     """
+     if audio_input is None:
+         raise gr.Error("Please provide audio input")
+ 
+     input_sr, input_audio = audio_input
+ 
+     # Validate input
+     if len(input_audio) == 0:
+         raise gr.Error("Audio input is empty")
+ 
+     print(f"Processing audio: {len(input_audio)} samples at {input_sr}Hz")
+     print(f"Persona: {persona}, Voice: {voice}")
+     print(f"Temperature: {temperature}, Top-k: {top_k}")
+ 
+     # Move models to GPU (inside the @spaces.GPU decorated function)
+     device = torch.device("cuda")
+     print("Moving models to GPU...")
+ 
+     # .to() moves the modules in place; since each ZeroGPU call runs in a
+     # forked process, the parent's CPU copies are unaffected
+     mimi = MIMI_CPU.to(device)
+     lm = MOSHI_LM_CPU.to(device)
+ 
+     # Also need a separate Mimi instance for decoding
+     mimi_decoder = loaders.get_mimi(MIMI_WEIGHT_PATH, device=device)
+     mimi_decoder.eval()
+ 
+     # Resample if needed
+     if input_sr != SAMPLE_RATE:
+         import torchaudio.functional as F
+         audio_tensor = torch.from_numpy(input_audio.astype(np.float32))
+         if audio_tensor.dim() == 1:
+             audio_tensor = audio_tensor.unsqueeze(0)
+         audio_tensor = F.resample(audio_tensor, input_sr, SAMPLE_RATE)
+         input_audio = audio_tensor.squeeze().numpy()
+ 
+     # Normalize audio to [-1, 1]
+     if input_audio.dtype != np.float32:
+         input_audio = input_audio.astype(np.float32)
+     max_val = np.abs(input_audio).max()
+     if max_val > 1.0:
+         input_audio = input_audio / max_val
+     elif max_val > 0 and max_val < 0.1:
+         # Boost very quiet audio
+         input_audio = input_audio / max_val * 0.5
+ 
+     # Convert to tensor
+     audio_tensor = torch.from_numpy(input_audio).to(device)
+     if audio_tensor.dim() == 1:
+         audio_tensor = audio_tensor.unsqueeze(0).unsqueeze(0)  # [B, C, T]
+     elif audio_tensor.dim() == 2:
+         audio_tensor = audio_tensor.unsqueeze(0)  # [B, C, T]
+ 
+     print(f"Input tensor shape: {audio_tensor.shape}")
+ 
+     # Encode input audio with Mimi
+     print("Encoding input audio...")
+     mimi.reset_streaming()
+     with torch.no_grad():
+         input_codes = mimi.encode(audio_tensor)
+     print(f"Input codes shape: {input_codes.shape}")
+ 
+     # Get persona embedding/conditioning
+     persona_text = PERSONAS.get(persona, PERSONAS["Helpful Assistant"])
+     voice_idx = VOICES.get(voice, 0)
+ 
+     # Calculate max steps based on duration
+     # Moshi generates ~12.5 frames per second
+     max_steps = int(max_duration * 12.5)
+ 
+     # Create a fresh LMGen instance for this call
+     print("Creating LMGen instance...")
+     lm_gen = LMGen(
+         lm,
+         temp=temperature,
+         top_k=top_k,
+         use_sampling=True,
+         check=False,
+     )
+ 
+     # Generate response
+     print("Generating response...")
+     output_codes_list = []
+ 
+     with lm_gen.streaming(batch_size=1):
+         mimi.reset_streaming()
+ 
+         # Feed input codes frame by frame
+         num_input_frames = input_codes.shape[-1]
+         for i in range(num_input_frames):
+             frame = input_codes[:, :, i:i+1]
+             _ = lm_gen.step(frame)
+ 
+         # Generate output codes
+         for step in range(max_steps):
+             # Generate next frame
+             out_codes = lm_gen.step(None)
+             if out_codes is not None:
+                 output_codes_list.append(out_codes)
+ 
+             # Check for end of generation (silence detection)
+             if len(output_codes_list) > 10:
+                 recent = torch.cat(output_codes_list[-5:], dim=-1)
+                 if recent.std() < 0.01:
+                     print(f"Silence detected at step {step}, stopping")
+                     break
+ 
+     if not output_codes_list:
+         raise gr.Error("No audio generated")
+ 
+     # Concatenate output codes
+     output_codes = torch.cat(output_codes_list, dim=-1)
+     print(f"Output codes shape: {output_codes.shape}")
+ 
+     # Decode with Mimi
+     print("Decoding output audio...")
+     mimi_decoder.reset_streaming()
+     with torch.no_grad():
+         output_audio = mimi_decoder.decode(output_codes)
+ 
+     # Convert to numpy
+     output_audio = output_audio.squeeze().cpu().numpy()
+ 
+     # Normalize output
+     max_val = np.abs(output_audio).max()
+     if max_val > 0:
+         output_audio = output_audio / max_val * 0.9
+ 
+     output_audio = (output_audio * 32767).astype(np.int16)
+ 
+     print(f"Output audio: {len(output_audio)} samples ({len(output_audio)/SAMPLE_RATE:.2f}s)")
+ 
+     return (SAMPLE_RATE, output_audio)
+ 
+ # ============================================================================
+ # Gradio Interface
+ # ============================================================================
+ 
+ def create_demo():
+     """Create the Gradio demo interface."""
+ 
+     with gr.Blocks(
+         title="PersonaPlex 7B Demo",
+         theme=gr.themes.Soft(),
+     ) as demo:
+         gr.Markdown("""
+         # PersonaPlex 7B Demo
+ 
+         Interactive speech-to-speech demo using [nvidia/personaplex-7b-v1](https://huggingface.co/nvidia/personaplex-7b-v1).
+ 
+         Record or upload audio, select a persona and voice, then generate a response.
+         """)
+ 
+         with gr.Row():
+             with gr.Column(scale=1):
+                 # Input section
+                 audio_input = gr.Audio(
+                     label="Input Audio",
+                     sources=["microphone", "upload"],
+                     type="numpy",
+                 )
+ 
+                 persona_dropdown = gr.Dropdown(
+                     label="Persona",
+                     choices=list(PERSONAS.keys()),
+                     value="Helpful Assistant",
+                 )
+ 
+                 voice_dropdown = gr.Dropdown(
+                     label="Voice",
+                     choices=list(VOICES.keys()),
+                     value="Default",
+                 )
+ 
+                 with gr.Accordion("Advanced Settings", open=False):
+                     temperature_slider = gr.Slider(
+                         label="Temperature",
+                         minimum=0.1,
+                         maximum=1.5,
+                         value=0.7,
+                         step=0.1,
+                     )
+ 
+                     top_k_slider = gr.Slider(
+                         label="Top-K",
+                         minimum=50,
+                         maximum=500,
+                         value=250,
+                         step=50,
+                     )
+ 
+                     max_duration_slider = gr.Slider(
+                         label="Max Duration (seconds)",
+                         minimum=1.0,
+                         maximum=30.0,
+                         value=10.0,
+                         step=1.0,
+                     )
+ 
+                 generate_btn = gr.Button("Generate Response", variant="primary")
+ 
+             with gr.Column(scale=1):
+                 # Output section
+                 audio_output = gr.Audio(
+                     label="Generated Response",
+                     type="numpy",
+                 )
+ 
+                 gr.Markdown("""
+                 ### Tips
+                 - Speak clearly into the microphone
+                 - Keep input audio under 30 seconds
+                 - Try different personas for varied responses
+                 - Adjust temperature for more/less creative outputs
+                 """)
+ 
+         # Connect the generate button
+         generate_btn.click(
+             fn=generate_response,
+             inputs=[
+                 audio_input,
+                 persona_dropdown,
+                 voice_dropdown,
+                 temperature_slider,
+                 top_k_slider,
+                 max_duration_slider,
+             ],
+             outputs=audio_output,
+         )
+ 
+         # Examples
+         gr.Markdown("### Examples")
+         gr.Markdown("Record a greeting like 'Hello, how are you?' and try different personas!")
+ 
+     return demo
+ 
+ 
+ # ============================================================================
+ # Main
+ # ============================================================================
+ 
+ if __name__ == "__main__":
+     print("Creating Gradio demo...")
+     demo = create_demo()
+ 
+     # Queue for handling concurrent requests (ZeroGPU friendly)
+     demo.queue(default_concurrency_limit=1, max_size=16)
+ 
+     print("Launching demo...")
+     demo.launch()
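The input-gain policy inside `generate_response()` above can be isolated and checked on its own. A numpy-only sketch (the `normalize_input` name is mine; the thresholds and scale factors are copied from the diff):

```python
import numpy as np

def normalize_input(audio: np.ndarray) -> np.ndarray:
    # Same policy as generate_response(): cast to float32, pull peaks
    # above 1.0 back into [-1, 1], and boost near-silent recordings.
    audio = audio.astype(np.float32)
    peak = float(np.abs(audio).max())
    if peak > 1.0:
        audio = audio / peak           # e.g. raw int16-range input
    elif 0 < peak < 0.1:
        audio = audio / peak * 0.5     # quiet input -> peak of 0.5
    return audio

loud = normalize_input(np.array([16000.0, -32000.0]))
quiet = normalize_input(np.array([0.02, -0.05]))
print(loud)    # peak rescaled to 1.0 -> [0.5, -1.0]
print(quiet)   # peak boosted to 0.5 -> [0.2, -0.5]
```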
requirements.txt ADDED
@@ -0,0 +1,11 @@
+ gradio>=4.0.0
+ spaces
+ torch>=2.2.0,<2.5
+ numpy>=1.26,<2.0
+ huggingface_hub>=0.24,<0.26
+ sentencepiece==0.2.*
+ safetensors>=0.4.0,<0.5
+ sphn>=0.1.4,<0.2
+ aiohttp>=3.10,<3.11
+ einops==0.7.*
+ git+https://github.com/NVIDIA/personaplex.git#subdirectory=moshi