# Offline Audio Processing System - Usage Guide

## Quick Start

### 1. Launch the Application
```bash
python app.py
```

The Gradio interface will open in your browser.

## How to Use

### Step 1: Prepare Audio
You have two options:
- **Upload**: Click "Upload or Record Audio File" and select a pre-recorded audio file (WAV, MP3, etc.)
- **Record**: Click the microphone icon to record audio directly

### Step 2: Configure LLM Prompt
Edit the "LLM prompt" textbox to customize the system behavior:
- **Default**: General purpose conversation assistant
- **Translation**: "You are a translator. Translate user text into English."
- **Summarization**: "You are a summarizer. Summarize the user's utterance."
- **Custom**: Write your own prompt for specific use cases
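
The prompt from the textbox typically becomes the system message sent to an instruct-tuned LLM alongside the ASR transcript. A minimal sketch, assuming the common chat-message format used by models such as `meta-llama/Llama-3.2-1B-Instruct` (the helper name is illustrative, not the app's actual API):

```python
def build_messages(system_prompt: str, transcript: str) -> list[dict]:
    """Pair the configurable system prompt with the transcribed user text."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": transcript},
    ]

messages = build_messages(
    "You are a translator. Translate user text into English.",
    "Bonjour, comment allez-vous?",
)
print(messages[0]["role"])  # system
```

Swapping the system prompt is all it takes to turn the same pipeline into a translator, summarizer, or general assistant.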

### Step 3: Select Models (Optional)
Choose the models for each component:
- **ASR (Automatic Speech Recognition)**: Transcribes your audio
  - Default: `pyf98/owsm_ctc_v3.1_1B`
- **LLM (Language Model)**: Generates the response
  - Default: `meta-llama/Llama-3.2-1B-Instruct`
- **TTS (Text-to-Speech)**: Creates audio output
  - Default: `espnet/kan-bayashi_ljspeech_vits`
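
Conceptually, model selection is a set of per-component defaults that the UI lets you override. A small sketch using the default IDs listed above (the `select_models` helper is illustrative, not part of the app):

```python
# Default model IDs, as documented in this guide.
DEFAULT_MODELS = {
    "asr": "pyf98/owsm_ctc_v3.1_1B",
    "llm": "meta-llama/Llama-3.2-1B-Instruct",
    "tts": "espnet/kan-bayashi_ljspeech_vits",
}

def select_models(**overrides: str) -> dict:
    """Return the model ID per component, falling back to the defaults."""
    unknown = set(overrides) - set(DEFAULT_MODELS)
    if unknown:
        raise ValueError(f"unknown components: {unknown}")
    return {**DEFAULT_MODELS, **overrides}
```

For example, `select_models(llm="some-other/instruct-model")` keeps the default ASR and TTS models while swapping only the LLM.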

### Step 4: Process
Click the **"Process Audio"** button to run the full pipeline.

### Step 5: View Results
The system will display:
1. **ASR Transcription**: What was transcribed from your audio
2. **LLM Response**: The generated text response
3. **TTS Output**: Audio playback of the response (auto-plays)

## Example Use Cases

### 1. Voice Translation
```
Audio: "Bonjour, comment allez-vous?"
LLM Prompt: "You are a translator. Translate user text into English."
Output: "Hello, how are you?"
```

### 2. Voice Summarization
```
Audio: "Today I went to the store and bought apples, oranges, bananas, and some milk. Then I went to the park..."
LLM Prompt: "You are a summarizer. Summarize the user's utterance."
Output: "User went shopping and to the park."
```

### 3. Voice Assistant
```
Audio: "What's the weather like today?"
LLM Prompt: "You are a helpful and friendly AI assistant..."
Output: "I don't have access to real-time weather data..."
```

## Technical Details

### Audio Processing Pipeline
```
Audio File → ASR → Transcription → LLM → Response → TTS → Audio Output
```
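
The diagram above is a single-pass chain of three stages. The sketch below uses stand-in functions so the flow is visible; in the real app each stage wraps a loaded ASR, LLM, or TTS model:

```python
def asr(audio: bytes) -> str:
    """Stand-in for speech recognition (real app: OWSM-CTC model)."""
    return "transcribed text"

def llm(prompt: str, text: str) -> str:
    """Stand-in for response generation (real app: instruct LLM)."""
    return f"response to: {text}"

def tts(text: str) -> bytes:
    """Stand-in for speech synthesis (real app: VITS model)."""
    return text.encode("utf-8")

def process_audio(audio: bytes, prompt: str) -> tuple[str, str, bytes]:
    """Run the full offline pipeline in one pass and return all three outputs."""
    transcription = asr(audio)
    response = llm(prompt, transcription)
    audio_out = tts(response)
    return transcription, response, audio_out
```

Because the whole file is processed at once, there is no chunking or streaming state to manage: each stage simply consumes the previous stage's complete output.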

### Supported Audio Formats
- WAV
- MP3
- FLAC
- OGG
- Any format supported by Gradio's Audio component
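
If you want to sanity-check a file before uploading, a quick extension check against the formats listed above looks like this (note Gradio may accept more formats than this list; the helper is illustrative):

```python
from pathlib import Path

# Formats explicitly listed in this guide.
SUPPORTED_EXTS = {".wav", ".mp3", ".flac", ".ogg"}

def is_supported(path: str) -> bool:
    """Cheap client-side check on the file extension (case-insensitive)."""
    return Path(path).suffix.lower() in SUPPORTED_EXTS

print(is_supported("clip.WAV"))  # True
```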

### Processing Time
- Depends on audio length and selected models
- Typically 2-10 seconds for 5-second audio clips
- GPU acceleration enabled via `@spaces.GPU` decorator

## Troubleshooting

### "Please upload an audio file" message
- Ensure you've either uploaded or recorded audio before clicking "Process Audio"
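
This message comes from an early-exit guard in the handler: when no audio reaches the function, it returns the warning instead of running the pipeline. A sketch of that pattern (illustrative, not `app.py` verbatim):

```python
def handle(audio, prompt: str):
    """Gradio-style handler: bail out early when no audio was provided."""
    if audio is None:
        return "Please upload an audio file", "", None
    # ... otherwise run ASR -> LLM -> TTS here ...
    return "transcription", "response", b"audio"

print(handle(None, "hi")[0])  # Please upload an audio file
```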

### No audio output
- Check that the TTS model loaded correctly
- Check browser audio settings

### Long processing time
- Longer audio files take more time to process
- First run may be slower due to model loading

### Model loading errors
- Check `HF_TOKEN` environment variable for Hugging Face authentication
- Verify internet connection for model downloads
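
A quick diagnostic for the authentication case: gated models (such as the Llama default) require a Hugging Face token, so check the environment before launching (the helper below is a sketch, not part of the app):

```python
import os

def check_hf_token(env=os.environ) -> str:
    """Return a short diagnostic about Hugging Face authentication."""
    if env.get("HF_TOKEN"):
        return "HF_TOKEN is set"
    return "HF_TOKEN is missing; gated model downloads may fail"

print(check_hf_token())
```

Set the token in your shell (e.g. `export HF_TOKEN=...`) before running `python app.py` if the check reports it missing.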

## Differences from Streaming Mode

| Feature | Streaming Mode (Old) | Offline Mode (New) |
|---------|---------------------|-------------------|
| Input | Real-time microphone | Recorded files |
| Processing | Chunk-by-chunk | Complete file |
| Time Limit | 5 minutes | None |
| Use Case | Live conversation | Batch processing |
| Complexity | High (state management) | Low (single pass) |

## Tips for Best Results

1. **Audio Quality**: Use clear audio with minimal background noise
2. **Prompt Engineering**: Craft specific prompts for better LLM responses
3. **Model Selection**: Experiment with different models for quality vs. speed tradeoffs
4. **Audio Length**: Start with shorter clips (5-15 seconds) for faster results