# Offline Audio Processing System - Usage Guide

## Quick Start

### 1. Launch the Application

```bash
python app.py
```

The Gradio interface will open in your browser.

## How to Use

### Step 1: Prepare Audio

You have two options:

- **Upload**: Click "Upload or Record Audio File" and select a pre-recorded audio file (WAV, MP3, etc.)
- **Record**: Click the microphone icon to record audio directly

### Step 2: Configure LLM Prompt

Edit the "LLM prompt" textbox to customize the system behavior:

- **Default**: General purpose conversation assistant
- **Translation**: "You are a translator. Translate user text into English."
- **Summarization**: "You are summarizer. Summarize user's utterance."
- **Custom**: Write your own prompt for specific use cases

### Step 3: Select Models (Optional)

Choose the models for each component:

- **ASR (Automatic Speech Recognition)**: Transcribes your audio
  - Default: `pyf98/owsm_ctc_v3.1_1B`
- **LLM (Language Model)**: Generates the response
  - Default: `meta-llama/Llama-3.2-1B-Instruct`
- **TTS (Text-to-Speech)**: Creates audio output
  - Default: `espnet/kan-bayashi_ljspeech_vits`

### Step 4: Process

Click the **"Process Audio"** button

### Step 5: View Results

The system will display:

1. **ASR Transcription**: What was transcribed from your audio
2. **LLM Response**: The generated text response
3. **TTS Output**: Audio playback of the response (auto-plays)

## Example Use Cases

### 1. Voice Translation

```
Audio: "Bonjour, comment allez-vous?"
LLM Prompt: "You are a translator. Translate user text into English."
Output: "Hello, how are you?"
```

### 2. Voice Summarization

```
Audio: "Today I went to the store and bought apples, oranges, bananas, and some milk. Then I went to the park..."
LLM Prompt: "You are summarizer. Summarize user's utterance."
Output: "User went shopping and to the park."
```

### 3. Voice Assistant

```
Audio: "What's the weather like today?"
LLM Prompt: "You are a helpful and friendly AI assistant..."
Output: "I don't have access to real-time weather data..."
```

## Technical Details

### Audio Processing Pipeline

```
Audio File → ASR → Transcription → LLM → Response → TTS → Audio Output
```

### Supported Audio Formats

- WAV
- MP3
- FLAC
- OGG
- Any format supported by Gradio's Audio component

### Processing Time

- Depends on audio length and selected models
- Typically 2-10 seconds for 5-second audio clips
- GPU acceleration enabled via `@spaces.GPU` decorator

## Troubleshooting

### "Please upload an audio file" message

- Ensure you've either uploaded or recorded audio before clicking "Process Audio"

### No audio output

- Check that TTS model loaded correctly
- Check browser audio settings

### Long processing time

- Longer audio files take more time to process
- First run may be slower due to model loading

### Model loading errors

- Check `HF_TOKEN` environment variable for Hugging Face authentication
- Verify internet connection for model downloads

## Differences from Streaming Mode

| Feature | Streaming Mode (Old) | Offline Mode (New) |
|---------|---------------------|-------------------|
| Input | Real-time microphone | Recorded files |
| Processing | Chunk-by-chunk | Complete file |
| Time Limit | 5 minutes | None |
| Use Case | Live conversation | Batch processing |
| Complexity | High (state management) | Low (single pass) |

## Tips for Best Results

1. **Audio Quality**: Use clear audio with minimal background noise
2. **Prompt Engineering**: Craft specific prompts for better LLM responses
3. **Model Selection**: Experiment with different models for quality vs. speed tradeoffs
4. **Audio Length**: Start with shorter clips (5-15 seconds) for faster results
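
For readers who want to see how the single-pass flow from the Audio Processing Pipeline section (Audio File → ASR → Transcription → LLM → Response → TTS → Audio Output) fits together, here is a minimal Python sketch. It is not the code in `app.py`: the names `process_audio`, `PipelineResult`, `fake_asr`, `fake_llm`, and `fake_tts` are hypothetical, the three stages are passed in as plain callables so that no real model API is assumed, and returning the "Please upload an audio file" message in the transcription field is a guess at how the app reports missing input.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PipelineResult:
    """The three outputs listed under 'View Results'."""
    transcription: str
    response: str
    audio: Optional[bytes]


def process_audio(
    audio_path: Optional[str],
    system_prompt: str,
    asr: Callable[[str], str],
    llm: Callable[[str, str], str],
    tts: Callable[[str], bytes],
) -> PipelineResult:
    """Run ASR -> LLM -> TTS once over a complete audio file (hypothetical helper)."""
    # Guard for missing input; where the real app shows this message is an assumption.
    if not audio_path:
        return PipelineResult("Please upload an audio file", "", None)

    transcription = asr(audio_path)                # speech -> text
    response = llm(system_prompt, transcription)   # text -> text, steered by the prompt
    audio = tts(response)                          # text -> speech
    return PipelineResult(transcription, response, audio)


# Stand-in stages so the sketch runs without downloading any model.
def fake_asr(path: str) -> str:
    return "Bonjour, comment allez-vous?"


def fake_llm(prompt: str, text: str) -> str:
    return f"[{prompt}] {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


if __name__ == "__main__":
    result = process_audio(
        "clip.wav",
        "You are a translator. Translate user text into English.",
        fake_asr,
        fake_llm,
        fake_tts,
    )
    print(result.transcription)
    print(result.response)
```

Passing the stages as callables keeps the sketch independent of any particular ASR, LLM, or TTS library; in a real setup each stub would be replaced by a wrapper around the model selected in Step 3.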