# Offline Audio Processing System - Usage Guide
## Quick Start
### 1. Launch the Application
```bash
python app.py
```
The Gradio interface will open in your browser.
## How to Use
### Step 1: Prepare Audio
You have two options:
- **Upload**: Click "Upload or Record Audio File" and select a pre-recorded audio file (WAV, MP3, etc.)
- **Record**: Click the microphone icon to record audio directly
### Step 2: Configure LLM Prompt
Edit the "LLM prompt" textbox to customize the system behavior:
- **Default**: General purpose conversation assistant
- **Translation**: "You are a translator. Translate user text into English."
- **Summarization**: "You are a summarizer. Summarize the user's utterance."
- **Custom**: Write your own prompt for specific use cases
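The prompt options above can be sketched as follows. This is a minimal illustration of how a system prompt and a transcription might be combined into a chat-style message list; `build_messages` is a hypothetical helper, not necessarily what `app.py` does.

```python
# Sketch: pairing the configurable system prompt with the transcribed
# user text in the chat-message format common to instruct LLMs.
def build_messages(system_prompt: str, transcription: str) -> list[dict]:
    """Return a two-message conversation: system prompt, then user text."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": transcription},
    ]

# Example: the translation prompt from the list above.
messages = build_messages(
    "You are a translator. Translate user text into English.",
    "Bonjour, comment allez-vous?",
)
```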
### Step 3: Select Models (Optional)
Choose the models for each component:
- **ASR (Automatic Speech Recognition)**: Transcribes your audio
- Default: `pyf98/owsm_ctc_v3.1_1B`
- **LLM (Language Model)**: Generates the response
- Default: `meta-llama/Llama-3.2-1B-Instruct`
- **TTS (Text-to-Speech)**: Creates audio output
- Default: `espnet/kan-bayashi_ljspeech_vits`
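The defaults above can be collected into one place. The loader functions below are a sketch, not the app's actual code: the `espnet2` and `transformers` loading calls shown are real APIs, but the exact ASR inference class for the OWSM-CTC model is an assumption, and imports are deferred so nothing is downloaded until a loader is called.

```python
# Default model IDs from this guide.
DEFAULT_MODELS = {
    "asr": "pyf98/owsm_ctc_v3.1_1B",
    "llm": "meta-llama/Llama-3.2-1B-Instruct",
    "tts": "espnet/kan-bayashi_ljspeech_vits",
}

def load_tts(model_id: str = DEFAULT_MODELS["tts"]):
    # Deferred import: espnet2 is only needed when a model is loaded.
    from espnet2.bin.tts_inference import Text2Speech
    return Text2Speech.from_pretrained(model_id)

def load_llm(model_id: str = DEFAULT_MODELS["llm"]):
    # Gated models such as Llama require HF_TOKEN (see Troubleshooting).
    from transformers import pipeline
    return pipeline("text-generation", model=model_id)

def load_asr(model_id: str = DEFAULT_MODELS["asr"]):
    # Assumption: the CTC greedy-search inference class; check app.py
    # for the class the app actually uses.
    from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch
    return Speech2TextGreedySearch.from_pretrained(model_id)
```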
### Step 4: Process
Click the **"Process Audio"** button
### Step 5: View Results
The system will display:
1. **ASR Transcription**: What was transcribed from your audio
2. **LLM Response**: The generated text response
3. **TTS Output**: Audio playback of the response (auto-plays)
## Example Use Cases
### 1. Voice Translation
```
Audio: "Bonjour, comment allez-vous?"
LLM Prompt: "You are a translator. Translate user text into English."
Output: "Hello, how are you?"
```
### 2. Voice Summarization
```
Audio: "Today I went to the store and bought apples, oranges, bananas, and some milk. Then I went to the park..."
LLM Prompt: "You are a summarizer. Summarize the user's utterance."
Output: "User went shopping and to the park."
```
### 3. Voice Assistant
```
Audio: "What's the weather like today?"
LLM Prompt: "You are a helpful and friendly AI assistant..."
Output: "I don't have access to real-time weather data..."
```
## Technical Details
### Audio Processing Pipeline
```
Audio File → ASR → Transcription → LLM → Response → TTS → Audio Output
```
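The pipeline above can be sketched as a single function that chains the three stages. The stage functions are passed in as callables, so this skeleton stays model-agnostic; the toy stand-ins below only demonstrate the data flow and are not real models.

```python
# Sketch of the offline pipeline: audio -> text -> text -> audio.
def run_pipeline(audio, asr, llm, tts):
    """Chain ASR, LLM, and TTS over one complete audio input."""
    transcription = asr(audio)      # ASR: audio -> text
    response = llm(transcription)   # LLM: text -> text
    output_audio = tts(response)    # TTS: text -> audio
    return transcription, response, output_audio

# Toy stand-ins to show the flow (not real models):
text, reply, audio_out = run_pipeline(
    b"fake-audio-bytes",
    asr=lambda a: "hello",
    llm=lambda t: t.upper(),
    tts=lambda t: f"<audio:{t}>",
)
# text == "hello", reply == "HELLO", audio_out == "<audio:HELLO>"
```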
### Supported Audio Formats
- WAV
- MP3
- FLAC
- OGG
- Any format supported by Gradio's Audio component
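A quick extension check against the formats listed above might look like this. It is only illustrative: Gradio's Audio component accepts more formats than this fixed set, and `is_supported` is a hypothetical helper.

```python
from pathlib import Path

# Formats explicitly listed in this guide; Gradio accepts more.
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".flac", ".ogg"}

def is_supported(filename: str) -> bool:
    """Case-insensitive check of a filename's extension."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```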
### Processing Time
- Depends on audio length and selected models
- Typically 2-10 seconds for 5-second audio clips
- GPU acceleration enabled via `@spaces.GPU` decorator
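To see where time is spent on your own clips, a simple wall-clock wrapper like the one below can help. This is an illustrative utility, not part of the app; it measures any callable's elapsed time per call.

```python
import time
from functools import wraps

def timed(fn):
    """Return (result, elapsed_seconds) for each call to fn."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        return result, elapsed
    return wrapper

@timed
def fake_process(audio):
    # Stand-in for a pipeline stage; returns immediately.
    return "done"
```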
## Troubleshooting
### "Please upload an audio file" message
- Ensure you've either uploaded or recorded audio before clicking "Process Audio"
### No audio output
- Check that TTS model loaded correctly
- Check browser audio settings
### Long processing time
- Longer audio files take more time to process
- First run may be slower due to model loading
### Model loading errors
- Check `HF_TOKEN` environment variable for Hugging Face authentication
- Verify internet connection for model downloads
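Setting the token before launch might look like this; the token value is a placeholder you replace with your own. Running `huggingface-cli login` once is an alternative way to authenticate.

```shell
# Make the Hugging Face token available to the app (needed for gated
# models such as meta-llama/Llama-3.2-1B-Instruct).
# "<your-hf-token>" is a placeholder, not a real token.
export HF_TOKEN="<your-hf-token>"
```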
## Differences from Streaming Mode
| Feature | Streaming Mode (Old) | Offline Mode (New) |
|---------|---------------------|-------------------|
| Input | Real-time microphone | Recorded files |
| Processing | Chunk-by-chunk | Complete file |
| Time Limit | 5 minutes | None |
| Use Case | Live conversation | Batch processing |
| Complexity | High (state management) | Low (single pass) |
## Tips for Best Results
1. **Audio Quality**: Use clear audio with minimal background noise
2. **Prompt Engineering**: Craft specific prompts for better LLM responses
3. **Model Selection**: Experiment with different models for quality vs. speed tradeoffs
4. **Audio Length**: Start with shorter clips (5-15 seconds) for faster results
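Tip 4 can be applied programmatically before processing. The sketch below assumes audio arrives as a NumPy sample array with a known sample rate (as Gradio typically provides); `trim_audio` is a hypothetical helper, not part of the app.

```python
import numpy as np

def trim_audio(samples: np.ndarray, sample_rate: int,
               max_seconds: float = 15.0) -> np.ndarray:
    """Keep at most max_seconds of audio, per the audio-length tip."""
    max_samples = int(sample_rate * max_seconds)
    return samples[:max_samples]

# One minute of silence at 16 kHz, trimmed down to 15 seconds:
clip = trim_audio(np.zeros(16000 * 60), 16000)
```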