---
title: GenAI 2 Demo
emoji: π
colorFrom: red
colorTo: red
sdk: docker
app_port: 8501
tags:
- streamlit
pinned: false
short_description: Streamlit template space
---
# Voice to Image Agent
A Streamlit-based application that transforms voice messages into AI-generated images through a multi-step pipeline combining speech-to-text, natural language processing, and image generation.
## Overview
This application takes a voice recording and converts it into a visual representation by:
- Transcribing audio to text using OpenAI Whisper
- Enhancing the transcript into a detailed image prompt using an OpenAI chat model
- Generating an image from the prompt using an OpenAI image model (`gpt-image-1` or DALL-E)
The entire pipeline is transparent, showing intermediate results at each step.
## Features
- **Audio Upload**: Support for multiple formats (WAV, MP3, M4A, OGG, WebM)
- **Speech-to-Text**: Automatic transcription using Whisper
- **Prompt Enhancement**: LLM-powered conversion of speech to detailed image descriptions
- **Image Generation**: High-quality image synthesis using DALL-E
- **Full Transparency**: View transcripts, prompts, and metadata for each step
- **Configurable**: Adjust models and parameters via the UI
- **Detailed Logging**: Console logs for monitoring and debugging
## Prerequisites
- **Python**: 3.12 or higher
- **OpenAI API Key**: Get one from the OpenAI platform
- **FFmpeg** (optional): Required for certain audio formats (M4A, MP3)
- **Organization Verification** (for the `gpt-image-1` model): Your OpenAI organization must be verified to use the `gpt-image-1` image generation model. See the Organization Verification section below.
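The prerequisites above can be checked from Python before launching the app. A small illustrative helper (not part of the repository):

```python
import shutil
import sys

def environment_report():
    """Return a list of warnings for missing prerequisites (illustrative)."""
    warnings = []
    if sys.version_info < (3, 12):
        warnings.append("Python 3.12 or higher is required")
    if shutil.which("ffmpeg") is None:
        warnings.append("FFmpeg not found: M4A/MP3 uploads may fail (WAV still works)")
    return warnings

for warning in environment_report():
    print("WARNING:", warning)
```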
## Installation
### 1. Clone the Repository

```bash
git clone <your-repo-url>
cd GenerativeAI2
```
### 2. Create Virtual Environment

```bash
python3.12 -m venv .venv
source .venv/bin/activate    # On macOS/Linux
# .venv\Scripts\activate     # On Windows
```
### 3. Install Dependencies

```bash
pip install -r requirements.txt
```
### 4. Configure API Key

**Option A: Environment Variable**

```bash
export OPENAI_API_KEY="your-api-key-here"  # macOS/Linux
```

**Option B: .env File (Recommended)**

Create a `.env` file in the project root:

```
OPENAI_API_KEY=your-api-key-here
```
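A `.env` file in this simple KEY=VALUE form can be loaded with a few lines of standard-library Python (the app itself may use python-dotenv instead; this helper is illustrative):

```python
import os

def load_env_file(path=".env"):
    """Load KEY=VALUE lines from a .env file into os.environ (illustrative)."""
    if not os.path.exists(path):
        return
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            # Skip blanks, comments, and malformed lines
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

load_env_file()
if not os.getenv("OPENAI_API_KEY"):
    print("OPENAI_API_KEY is not set")
```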
## Usage

### Start the Application

```bash
streamlit run app.py
```

The app will open in your browser at `http://localhost:8501`.
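Since the Space metadata declares `sdk: docker` with `app_port: 8501`, the container must serve Streamlit on that port. A minimal Dockerfile along these lines would work (illustrative; the repository's actual Dockerfile may differ):

```dockerfile
FROM python:3.12-slim

# FFmpeg enables M4A/MP3 decoding (optional for WAV-only use)
RUN apt-get update && apt-get install -y --no-install-recommends ffmpeg \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .

EXPOSE 8501
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]
```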
### Workflow
1. **Upload Audio**: Click "Browse files" and select your voice recording
2. **Configure Models** (optional): Adjust settings in the left sidebar
   - Transcription model (default: `whisper-1`)
   - LLM model (default: `gpt-5-nano`)
   - Image model (default: `gpt-image-1`)
   - Image size (512x512, 768x768, or 1024x1024)
3. **Run Pipeline**: Click "Run Voice → Image Pipeline"
4. **View Results**: See the transcript, enhanced prompt, and generated image
### Example Voice Prompts
- "Create a futuristic city at sunset with flying cars and neon lights"
- "Show me a peaceful forest with a waterfall and wildlife"
- "Generate a portrait of a robot reading a book in a library"
## Project Structure

```
GenerativeAI2/
├── app.py              # Streamlit UI and orchestration
├── llm_pipeline.py     # Core pipeline functions
├── requirements.txt    # Python dependencies
├── README.md           # Documentation
├── screenshots/        # UI screenshots
│   ├── UI_1.png        # Main interface
│   ├── UI_2.png        # Settings configuration
│   ├── UI_3.png        # Results display
│   └── UI_4.png        # Detailed results
└── .env                # Environment variables (not in git)
```
## Configuration

### Model Selection
Configure models in the sidebar or modify defaults in the code:
```python
# Default models in app.py
{
    "transcription_model": "whisper-1",
    "llm_model": "gpt-5-nano",
    "image_model": "gpt-image-1",
}
```
### Image Sizes

Available sizes:

- `512x512` - Fast, lower cost
- `768x768` - Balanced
- `1024x1024` - High quality (default)
### Logging

Logs are printed to the console where you run `streamlit run app.py`. Adjust the log level:

```python
# In app.py
logging.basicConfig(level=logging.INFO)  # Change to DEBUG for verbose output
```
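For timestamped, per-module output, the call can be extended like this (an illustrative variant, not the repository's exact configuration):

```python
import logging

logging.basicConfig(
    level=logging.INFO,  # change to logging.DEBUG for verbose output
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("voice_to_image")
logger.info("pipeline started")
```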
## Organization Verification

**Important**: The `gpt-image-1` image generation model requires your OpenAI organization to be verified before use. This is a one-time setup process.
### Verification Steps
1. Go to the OpenAI Organization Settings page
2. Click "Verify Organization"
3. Complete the verification process (may require providing business/organization details)
4. **Wait for access propagation**: After verification, it can take up to 15 minutes for access to the `gpt-image-1` model to become available
### What to Expect
- If you attempt to use `gpt-image-1` before verification, you'll receive a `PermissionDeniedError` with error code 403
- The error message will indicate that organization verification is required
- After verification, wait up to 15 minutes before retrying image generation
- If you need immediate access, consider using an alternative image model like `dall-e-3` (if available)
### Alternative Models

If you need to use image generation immediately without waiting for verification, you can:

- Switch to the `dall-e-3` model in the sidebar settings (if available for your account)
- Or wait for the verification to complete and access to propagate
## API Costs
Approximate costs per generation (as of 2024):
- **Whisper**: ~$0.006 per minute of audio
- **GPT-4o-mini**: ~$0.15 per 1M input tokens
- **DALL-E 3**:
  - 1024x1024: $0.040 per image
  - 1024x1792 or 1792x1024: $0.080 per image
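As a back-of-the-envelope check, one run on a one-minute recording with a short prompt works out to roughly five cents at these rates (the token count here is an assumption):

```python
# Rough cost of a single pipeline run, using the rates listed above (USD)
audio_minutes = 1.0
whisper_cost = 0.006 * audio_minutes          # $0.006 per audio minute
prompt_tokens = 500                           # assumed transcript + system prompt
llm_cost = 0.15 / 1_000_000 * prompt_tokens   # $0.15 per 1M input tokens
image_cost = 0.040                            # one 1024x1024 image
total = whisper_cost + llm_cost + image_cost
print(f"~${total:.4f} per run")               # image generation dominates the cost
```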
## Architecture

### Core Components

**Frontend (`app.py`)**
- Streamlit UI for user interaction
- Session state management
- Error handling and display
**Backend (`llm_pipeline.py`)**

- `transcribe_audio()`: Whisper API integration
- `build_image_prompt()`: LLM prompt enhancement
- `generate_image()`: DALL-E image generation
### Pipeline Flow

```
Audio Upload → Transcription → Prompt Enhancement → Image Generation → Display
     ↓              ↓                  ↓                    ↓             ↓
 User File     Whisper API         GPT API            DALL-E API    Results UI
```
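With the `openai>=1.0` Python SDK, the three backend functions could look roughly like this (a sketch; the actual signatures in `llm_pipeline.py` may differ, and passing the client explicitly is a choice made here for testability):

```python
def transcribe_audio(client, audio_file, model="whisper-1"):
    """Step 1: send an audio file handle to the transcription endpoint."""
    result = client.audio.transcriptions.create(model=model, file=audio_file)
    return result.text

def build_image_prompt(client, transcript, model="gpt-5-nano"):
    """Step 2: expand the raw transcript into a detailed image prompt."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Rewrite the user's words as a detailed image prompt."},
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content

def generate_image(client, prompt, model="gpt-image-1", size="1024x1024"):
    """Step 3: generate one image; gpt-image-1 returns base64, dall-e-3 a URL."""
    result = client.images.generate(model=model, prompt=prompt, size=size, n=1)
    image = result.data[0]
    return getattr(image, "b64_json", None) or image.url
```

In `app.py` these would be chained in order, with each intermediate result rendered on the Streamlit page.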
## Troubleshooting

### Common Issues
"OPENAI_API_KEY environment variable is not set"
- Ensure your
.envfile exists and contains the API key - Or export the environment variable in your terminal
"PermissionDeniedError: Your organization must be verified to use the model gpt-image-1"
- This error occurs when trying to use
gpt-image-1without organization verification - Solution:
- Visit OpenAI Organization Settings
- Click "Verify Organization" and complete the verification process
- Wait up to 15 minutes for access to propagate after verification
- Retry the image generation after the waiting period
- Alternative: Switch to
dall-e-3model in the sidebar if you need immediate access (if available for your account)
**Audio format not supported**

- Install FFmpeg: `brew install ffmpeg` (macOS) or `apt-get install ffmpeg` (Linux)
**Rate limit errors**
- Wait a few moments and try again
- Check your OpenAI account usage limits
**Module not found errors**

- Ensure the virtual environment is activated
- Reinstall dependencies: `pip install -r requirements.txt`
## Development

### Running Tests

```bash
# Install test dependencies
pip install pytest pytest-mock

# Run tests
pytest
```
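A self-contained example of the mocking style such tests might use (the function here is a stand-in; the real code in `llm_pipeline.py` may be structured differently):

```python
from types import SimpleNamespace
from unittest.mock import MagicMock

# Stand-in for a pipeline step that accepts the OpenAI client as a parameter.
def build_image_prompt(client, transcript):
    response = client.chat.completions.create(
        model="gpt-5-nano",
        messages=[{"role": "user", "content": transcript}],
    )
    return response.choices[0].message.content

def test_build_image_prompt_returns_llm_text():
    # The mocked client never touches the network.
    client = MagicMock()
    client.chat.completions.create.return_value = SimpleNamespace(
        choices=[SimpleNamespace(message=SimpleNamespace(content="a detailed prompt"))]
    )
    assert build_image_prompt(client, "a cat on a roof") == "a detailed prompt"
```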
### Code Style

This project follows PEP 8 guidelines. Format code with:

```bash
pip install black
black app.py llm_pipeline.py
```
## Limitations
- Maximum audio file size: 25 MB (Whisper API upload limit)
- Very long or noisy audio may produce less accurate transcripts
- Generated images depend on prompt quality and model capabilities
- No offline mode - requires internet connection for all operations
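The 25 MB cap can be enforced before uploading to the API. A small illustrative check (not part of the repository):

```python
import os

MAX_AUDIO_BYTES = 25 * 1024 * 1024  # 25 MB upload limit

def check_audio_size(path):
    """Raise if the file at `path` exceeds the upload limit (illustrative)."""
    size = os.path.getsize(path)
    if size > MAX_AUDIO_BYTES:
        raise ValueError(
            f"{path} is {size} bytes; the maximum supported size is {MAX_AUDIO_BYTES}"
        )
    return size
```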
## Security Notes

- Never commit your `.env` file or API keys to version control
- Add `.env` to `.gitignore`
- Rotate API keys if accidentally exposed
- Monitor API usage to prevent unexpected costs



