Prompt_Edit_Demo

Sleeping

App Files Files Community

Prompt_Edit_Demo / USAGE_GUIDE.md

owaski

add usage guide

ac1d867 about 1 month ago

preview code

raw

history blame contribute delete

3.72 kB

	# Offline Audio Processing System - Usage Guide

	## Quick Start

	### 1. Launch the Application
	```bash
	python app.py
	```

	The Gradio interface will open in your browser.

	## How to Use

	### Step 1: Prepare Audio
	You have two options:
	- Upload: Click "Upload or Record Audio File" and select a pre-recorded audio file (WAV, MP3, etc.)
	- Record: Click the microphone icon to record audio directly

	### Step 2: Configure LLM Prompt
	Edit the "LLM prompt" textbox to customize the system behavior:
	- Default: General purpose conversation assistant
	- Translation: "You are a translator. Translate user text into English."
	- Summarization: "You are summarizer. Summarize user's utterance."
	- Custom: Write your own prompt for specific use cases

	### Step 3: Select Models (Optional)
	Choose the models for each component:
	- ASR (Automatic Speech Recognition): Transcribes your audio
	- Default: `pyf98/owsm_ctc_v3.1_1B`
	- LLM (Language Model): Generates the response
	- Default: `meta-llama/Llama-3.2-1B-Instruct`
	- TTS (Text-to-Speech): Creates audio output
	- Default: `espnet/kan-bayashi_ljspeech_vits`

	### Step 4: Process
	Click the "Process Audio" button

	### Step 5: View Results
	The system will display:
	1. ASR Transcription: What was transcribed from your audio
	2. LLM Response: The generated text response
	3. TTS Output: Audio playback of the response (auto-plays)

	## Example Use Cases

	### 1. Voice Translation
	```
	Audio: "Bonjour, comment allez-vous?"
	LLM Prompt: "You are a translator. Translate user text into English."
	Output: "Hello, how are you?"
	```

	### 2. Voice Summarization
	```
	Audio: "Today I went to the store and bought apples, oranges, bananas, and some milk. Then I went to the park..."
	LLM Prompt: "You are summarizer. Summarize user's utterance."
	Output: "User went shopping and to the park."
	```

	### 3. Voice Assistant
	```
	Audio: "What's the weather like today?"
	LLM Prompt: "You are a helpful and friendly AI assistant..."
	Output: "I don't have access to real-time weather data..."
	```

	## Technical Details

	### Audio Processing Pipeline
	```
	Audio File → ASR → Transcription → LLM → Response → TTS → Audio Output
	```

	### Supported Audio Formats
	- WAV
	- MP3
	- FLAC
	- OGG
	- Any format supported by Gradio's Audio component

	### Processing Time
	- Depends on audio length and selected models
	- Typically 2-10 seconds for 5-second audio clips
	- GPU acceleration enabled via `@spaces.GPU` decorator

	## Troubleshooting

	### "Please upload an audio file" message
	- Ensure you've either uploaded or recorded audio before clicking "Process Audio"

	### No audio output
	- Check that TTS model loaded correctly
	- Check browser audio settings

	### Long processing time
	- Longer audio files take more time to process
	- First run may be slower due to model loading

	### Model loading errors
	- Check `HF_TOKEN` environment variable for Hugging Face authentication
	- Verify internet connection for model downloads

	## Differences from Streaming Mode

	\| Feature \| Streaming Mode (Old) \| Offline Mode (New) \|
	\|---------\|---------------------\|-------------------\|
	\| Input \| Real-time microphone \| Recorded files \|
	\| Processing \| Chunk-by-chunk \| Complete file \|
	\| Time Limit \| 5 minutes \| None \|
	\| Use Case \| Live conversation \| Batch processing \|
	\| Complexity \| High (state management) \| Low (single pass) \|

	## Tips for Best Results

	1. Audio Quality: Use clear audio with minimal background noise
	2. Prompt Engineering: Craft specific prompts for better LLM responses
	3. Model Selection: Experiment with different models for quality vs. speed tradeoffs
	4. Audio Length: Start with shorter clips (5-15 seconds) for faster results