Prompt_Edit_Demo / USAGE_GUIDE.md
owaski's picture
add usage guide
ac1d867

A newer version of the Gradio SDK is available: 6.9.0

Upgrade

Offline Audio Processing System - Usage Guide

Quick Start

1. Launch the Application

python app.py

The Gradio interface will open in your browser.

How to Use

Step 1: Prepare Audio

You have two options:

  • Upload: Click "Upload or Record Audio File" and select a pre-recorded audio file (WAV, MP3, etc.)
  • Record: Click the microphone icon to record audio directly

Step 2: Configure LLM Prompt

Edit the "LLM prompt" textbox to customize the system behavior:

  • Default: General purpose conversation assistant
  • Translation: "You are a translator. Translate user text into English."
  • Summarization: "You are summarizer. Summarize user's utterance."
  • Custom: Write your own prompt for specific use cases

Step 3: Select Models (Optional)

Choose the models for each component:

  • ASR (Automatic Speech Recognition): Transcribes your audio
    • Default: pyf98/owsm_ctc_v3.1_1B
  • LLM (Language Model): Generates the response
    • Default: meta-llama/Llama-3.2-1B-Instruct
  • TTS (Text-to-Speech): Creates audio output
    • Default: espnet/kan-bayashi_ljspeech_vits

Step 4: Process

Click the "Process Audio" button

Step 5: View Results

The system will display:

  1. ASR Transcription: What was transcribed from your audio
  2. LLM Response: The generated text response
  3. TTS Output: Audio playback of the response (auto-plays)

Example Use Cases

1. Voice Translation

Audio: "Bonjour, comment allez-vous?"
LLM Prompt: "You are a translator. Translate user text into English."
Output: "Hello, how are you?"

2. Voice Summarization

Audio: "Today I went to the store and bought apples, oranges, bananas, and some milk. Then I went to the park..."
LLM Prompt: "You are summarizer. Summarize user's utterance."
Output: "User went shopping and to the park."

3. Voice Assistant

Audio: "What's the weather like today?"
LLM Prompt: "You are a helpful and friendly AI assistant..."
Output: "I don't have access to real-time weather data..."

Technical Details

Audio Processing Pipeline

Audio File → ASR → Transcription → LLM → Response → TTS → Audio Output

Supported Audio Formats

  • WAV
  • MP3
  • FLAC
  • OGG
  • Any format supported by Gradio's Audio component

Processing Time

  • Depends on audio length and selected models
  • Typically 2-10 seconds for 5-second audio clips
  • GPU acceleration enabled via @spaces.GPU decorator

Troubleshooting

"Please upload an audio file" message

  • Ensure you've either uploaded or recorded audio before clicking "Process Audio"

No audio output

  • Check that TTS model loaded correctly
  • Check browser audio settings

Long processing time

  • Longer audio files take more time to process
  • First run may be slower due to model loading

Model loading errors

  • Check HF_TOKEN environment variable for Hugging Face authentication
  • Verify internet connection for model downloads

Differences from Streaming Mode

Feature Streaming Mode (Old) Offline Mode (New)
Input Real-time microphone Recorded files
Processing Chunk-by-chunk Complete file
Time Limit 5 minutes None
Use Case Live conversation Batch processing
Complexity High (state management) Low (single pass)

Tips for Best Results

  1. Audio Quality: Use clear audio with minimal background noise
  2. Prompt Engineering: Craft specific prompts for better LLM responses
  3. Model Selection: Experiment with different models for quality vs. speed tradeoffs
  4. Audio Length: Start with shorter clips (5-15 seconds) for faster results