Spaces:
Sleeping
Sleeping
A newer version of the Gradio SDK is available:
6.9.0
Offline Audio Processing System - Usage Guide
Quick Start
1. Launch the Application
python app.py
The Gradio interface will open in your browser.
How to Use
Step 1: Prepare Audio
You have two options:
- Upload: Click "Upload or Record Audio File" and select a pre-recorded audio file (WAV, MP3, etc.)
- Record: Click the microphone icon to record audio directly
Step 2: Configure LLM Prompt
Edit the "LLM prompt" textbox to customize the system behavior:
- Default: General purpose conversation assistant
- Translation: "You are a translator. Translate user text into English."
- Summarization: "You are summarizer. Summarize user's utterance."
- Custom: Write your own prompt for specific use cases
Step 3: Select Models (Optional)
Choose the models for each component:
- ASR (Automatic Speech Recognition): Transcribes your audio
- Default:
pyf98/owsm_ctc_v3.1_1B
- Default:
- LLM (Language Model): Generates the response
- Default:
meta-llama/Llama-3.2-1B-Instruct
- Default:
- TTS (Text-to-Speech): Creates audio output
- Default:
espnet/kan-bayashi_ljspeech_vits
- Default:
Step 4: Process
Click the "Process Audio" button
Step 5: View Results
The system will display:
- ASR Transcription: What was transcribed from your audio
- LLM Response: The generated text response
- TTS Output: Audio playback of the response (auto-plays)
Example Use Cases
1. Voice Translation
Audio: "Bonjour, comment allez-vous?"
LLM Prompt: "You are a translator. Translate user text into English."
Output: "Hello, how are you?"
2. Voice Summarization
Audio: "Today I went to the store and bought apples, oranges, bananas, and some milk. Then I went to the park..."
LLM Prompt: "You are summarizer. Summarize user's utterance."
Output: "User went shopping and to the park."
3. Voice Assistant
Audio: "What's the weather like today?"
LLM Prompt: "You are a helpful and friendly AI assistant..."
Output: "I don't have access to real-time weather data..."
Technical Details
Audio Processing Pipeline
Audio File → ASR → Transcription → LLM → Response → TTS → Audio Output
Supported Audio Formats
- WAV
- MP3
- FLAC
- OGG
- Any format supported by Gradio's Audio component
Processing Time
- Depends on audio length and selected models
- Typically 2-10 seconds for 5-second audio clips
- GPU acceleration enabled via
@spaces.GPUdecorator
Troubleshooting
"Please upload an audio file" message
- Ensure you've either uploaded or recorded audio before clicking "Process Audio"
No audio output
- Check that TTS model loaded correctly
- Check browser audio settings
Long processing time
- Longer audio files take more time to process
- First run may be slower due to model loading
Model loading errors
- Check
HF_TOKENenvironment variable for Hugging Face authentication - Verify internet connection for model downloads
Differences from Streaming Mode
| Feature | Streaming Mode (Old) | Offline Mode (New) |
|---|---|---|
| Input | Real-time microphone | Recorded files |
| Processing | Chunk-by-chunk | Complete file |
| Time Limit | 5 minutes | None |
| Use Case | Live conversation | Batch processing |
| Complexity | High (state management) | Low (single pass) |
Tips for Best Results
- Audio Quality: Use clear audio with minimal background noise
- Prompt Engineering: Craft specific prompts for better LLM responses
- Model Selection: Experiment with different models for quality vs. speed tradeoffs
- Audio Length: Start with shorter clips (5-15 seconds) for faster results