Spaces:
Sleeping
Sleeping
| # Offline Audio Processing System - Usage Guide | |
| ## Quick Start | |
| ### 1. Launch the Application | |
| ```bash | |
| python app.py | |
| ``` | |
| The Gradio interface will open in your browser. | |
| ## How to Use | |
| ### Step 1: Prepare Audio | |
| You have two options: | |
| - **Upload**: Click "Upload or Record Audio File" and select a pre-recorded audio file (WAV, MP3, etc.) | |
| - **Record**: Click the microphone icon to record audio directly | |
| ### Step 2: Configure LLM Prompt | |
| Edit the "LLM prompt" textbox to customize the system behavior: | |
| - **Default**: General purpose conversation assistant | |
| - **Translation**: "You are a translator. Translate user text into English." | |
| - **Summarization**: "You are summarizer. Summarize user's utterance." | |
| - **Custom**: Write your own prompt for specific use cases | |
| ### Step 3: Select Models (Optional) | |
| Choose the models for each component: | |
| - **ASR (Automatic Speech Recognition)**: Transcribes your audio | |
| - Default: `pyf98/owsm_ctc_v3.1_1B` | |
| - **LLM (Language Model)**: Generates the response | |
| - Default: `meta-llama/Llama-3.2-1B-Instruct` | |
| - **TTS (Text-to-Speech)**: Creates audio output | |
| - Default: `espnet/kan-bayashi_ljspeech_vits` | |
| ### Step 4: Process | |
| Click the **"Process Audio"** button | |
| ### Step 5: View Results | |
| The system will display: | |
| 1. **ASR Transcription**: What was transcribed from your audio | |
| 2. **LLM Response**: The generated text response | |
| 3. **TTS Output**: Audio playback of the response (auto-plays) | |
| ## Example Use Cases | |
| ### 1. Voice Translation | |
| ``` | |
| Audio: "Bonjour, comment allez-vous?" | |
| LLM Prompt: "You are a translator. Translate user text into English." | |
| Output: "Hello, how are you?" | |
| ``` | |
| ### 2. Voice Summarization | |
| ``` | |
| Audio: "Today I went to the store and bought apples, oranges, bananas, and some milk. Then I went to the park..." | |
| LLM Prompt: "You are summarizer. Summarize user's utterance." | |
| Output: "User went shopping and to the park." | |
| ``` | |
| ### 3. Voice Assistant | |
| ``` | |
| Audio: "What's the weather like today?" | |
| LLM Prompt: "You are a helpful and friendly AI assistant..." | |
| Output: "I don't have access to real-time weather data..." | |
| ``` | |
| ## Technical Details | |
| ### Audio Processing Pipeline | |
| ``` | |
| Audio File β ASR β Transcription β LLM β Response β TTS β Audio Output | |
| ``` | |
| ### Supported Audio Formats | |
| - WAV | |
| - MP3 | |
| - FLAC | |
| - OGG | |
| - Any format supported by Gradio's Audio component | |
| ### Processing Time | |
| - Depends on audio length and selected models | |
| - Typically 2-10 seconds for 5-second audio clips | |
| - GPU acceleration enabled via `@spaces.GPU` decorator | |
| ## Troubleshooting | |
| ### "Please upload an audio file" message | |
| - Ensure you've either uploaded or recorded audio before clicking "Process Audio" | |
| ### No audio output | |
| - Check that TTS model loaded correctly | |
| - Check browser audio settings | |
| ### Long processing time | |
| - Longer audio files take more time to process | |
| - First run may be slower due to model loading | |
| ### Model loading errors | |
| - Check `HF_TOKEN` environment variable for Hugging Face authentication | |
| - Verify internet connection for model downloads | |
| ## Differences from Streaming Mode | |
| | Feature | Streaming Mode (Old) | Offline Mode (New) | | |
| |---------|---------------------|-------------------| | |
| | Input | Real-time microphone | Recorded files | | |
| | Processing | Chunk-by-chunk | Complete file | | |
| | Time Limit | 5 minutes | None | | |
| | Use Case | Live conversation | Batch processing | | |
| | Complexity | High (state management) | Low (single pass) | | |
| ## Tips for Best Results | |
| 1. **Audio Quality**: Use clear audio with minimal background noise | |
| 2. **Prompt Engineering**: Craft specific prompts for better LLM responses | |
| 3. **Model Selection**: Experiment with different models for quality vs. speed tradeoffs | |
| 4. **Audio Length**: Start with shorter clips (5-15 seconds) for faster results | |