---
title: GenAI 2 Demo
emoji: 🚀
colorFrom: red
colorTo: red
sdk: docker
app_port: 8501
tags:
- streamlit
pinned: false
short_description: Streamlit template space
---

# Voice to Image Agent

A Streamlit-based application that transforms voice messages into AI-generated images through a multi-step pipeline combining speech-to-text, natural language processing, and image generation.

## Overview

This application takes a voice recording and converts it into a visual representation by:

1. **Transcribing** audio to text using OpenAI Whisper
2. **Enhancing** the transcript into a detailed image prompt using an OpenAI LLM (default: `gpt-5-nano`)
3. **Generating** an image from the prompt using an OpenAI image model (default: `gpt-image-1`)

The entire pipeline is transparent, showing intermediate results at each step.

## Features

- 🎙️ **Audio Upload**: Support for multiple formats (WAV, MP3, M4A, OGG, WebM)
- 📝 **Speech-to-Text**: Automatic transcription using Whisper
- 🤖 **Prompt Enhancement**: LLM-powered conversion of speech to detailed image descriptions
- 🎨 **Image Generation**: High-quality image synthesis using OpenAI image models
- 🔍 **Full Transparency**: View transcripts, prompts, and metadata for each step
- ⚙️ **Configurable**: Adjust models and parameters via the UI
- 📊 **Detailed Logging**: Console logs for monitoring and debugging

## Prerequisites

- **Python**: 3.12 or higher
- **OpenAI API Key**: [Get one here](https://platform.openai.com/api-keys)
- **FFmpeg** (optional): Needed only for certain audio formats (M4A, MP3)
- **Organization Verification** (for the `gpt-image-1` model): Your OpenAI organization must be verified before it can use the `gpt-image-1` image generation model. See the [Organization Verification](#organization-verification) section below.

## Installation

### 1. Clone the Repository

```bash
git clone <repository-url>
cd GenerativeAI2
```

### 2. Create Virtual Environment

```bash
python3.12 -m venv .venv
source .venv/bin/activate  # On macOS/Linux
# .venv\Scripts\activate   # On Windows
```

### 3. Install Dependencies

```bash
pip install -r requirements.txt
```

### 4. Configure API Key

**Option A: Environment Variable**

```bash
export OPENAI_API_KEY="your-api-key-here"  # macOS/Linux
```

**Option B: .env File (Recommended)**

Create a `.env` file in the project root:

```env
OPENAI_API_KEY=your-api-key-here
```

## Usage

### Start the Application

```bash
streamlit run app.py
```

The app will open in your browser at `http://localhost:8501`.

![Main Interface](screenshots/UI_1.png)

### Workflow

1. **Upload Audio**: Click "Browse files" and select your voice recording
2. **Configure Models** (Optional): Adjust settings in the left sidebar
   - Transcription model (default: `whisper-1`)
   - LLM model (default: `gpt-5-nano`)
   - Image model (default: `gpt-image-1`)
   - Image size (512x512, 768x768, or 1024x1024)

   ![Settings Configuration](screenshots/UI_2.png)

3. **Run Pipeline**: Click "Run Voice → Image Pipeline"
4. **View Results**: See the transcript, enhanced prompt, and generated image

   ![Results Display](screenshots/UI_3.png)

   ![Detailed Results](screenshots/UI_4.png)

### Example Voice Prompts

- "Create a futuristic city at sunset with flying cars and neon lights"
- "Show me a peaceful forest with a waterfall and wildlife"
- "Generate a portrait of a robot reading a book in a library"

## Project Structure

```
GenerativeAI2/
├── app.py              # Streamlit UI and orchestration
├── llm_pipeline.py     # Core pipeline functions
├── requirements.txt    # Python dependencies
├── README.md           # Documentation
├── screenshots/        # UI screenshots
│   ├── UI_1.png        # Main interface
│   ├── UI_2.png        # Settings configuration
│   ├── UI_3.png        # Results display
│   └── UI_4.png        # Detailed results
└── .env                # Environment variables (not in git)
```

## Configuration

### Model Selection

Configure models in the sidebar or modify the defaults in the code:

```python
# Default models in app.py
{
    "transcription_model": "whisper-1",
    "llm_model": "gpt-5-nano",
    "image_model": "gpt-image-1"
}
```

### Image Sizes

Available sizes:

- `512x512` - Fast, lower cost
- `768x768` - Balanced
- `1024x1024` - High quality (default)

Supported sizes vary by image model. If a request is rejected because of its size, check the OpenAI image generation documentation for the sizes your selected model accepts.

### Logging

Logs are printed to the console where you run `streamlit run app.py`. To adjust the log level:

```python
# In app.py
import logging

logging.basicConfig(level=logging.INFO)  # Change to DEBUG for verbose output
```

### Organization Verification

**Important**: The `gpt-image-1` image generation model requires your OpenAI organization to be verified before use. This is a one-time setup process.

#### Verification Steps

1. Go to [OpenAI Organization Settings](https://platform.openai.com/settings/organization/general)
2. Click on **"Verify Organization"**
3. Complete the verification process (may require providing business/organization details)
4. **Wait for access propagation**: After verification, it can take **up to 15 minutes** for access to the `gpt-image-1` model to become available

#### What to Expect

- If you attempt to use `gpt-image-1` before verification, you'll receive a `PermissionDeniedError` with error code 403
- The error message will indicate that organization verification is required
- After verification, allow up to 15 minutes before retrying image generation

#### Alternative Models

If you need image generation immediately and cannot wait for verification, switch to the `dall-e-3` model in the sidebar settings (if it is available for your account). Otherwise, wait for verification to complete and for access to propagate.

## API Costs

Approximate API prices (as of 2024):

- **Whisper**: ~$0.006 per minute of audio
- **GPT-4o-mini**: ~$0.15 per 1M input tokens
- **DALL-E 3**:
  - 1024x1024: $0.040 per image
  - 1024x1792 or 1792x1024: $0.080 per image

These figures cover the models listed above, not the app defaults (`gpt-5-nano`, `gpt-image-1`), which are priced separately. Check the OpenAI pricing page for current rates.

## Architecture

### Core Components

**Frontend (`app.py`)**

- Streamlit UI for user interaction
- Session state management
- Error handling and display

**Backend (`llm_pipeline.py`)**

- `transcribe_audio()`: Whisper API integration
- `build_image_prompt()`: LLM prompt enhancement
- `generate_image()`: Image generation via the OpenAI Images API

### Pipeline Flow

```
Audio Upload → Transcription → Prompt Enhancement → Image Generation → Display
     ↓               ↓                 ↓                   ↓              ↓
 User File      Whisper API         GPT API           Images API     Results UI
```

## Troubleshooting

### Common Issues

**"OPENAI_API_KEY environment variable is not set"**

- Ensure your `.env` file exists and contains the API key
- Or export the environment variable in your terminal

**"PermissionDeniedError: Your organization must be verified to use the model `gpt-image-1`"**

- This error occurs when trying to use `gpt-image-1` without organization verification
- **Solution**:
  1. Visit [OpenAI Organization Settings](https://platform.openai.com/settings/organization/general)
  2. Click "Verify Organization" and complete the verification process
  3. **Wait up to 15 minutes** for access to propagate after verification
  4. Retry the image generation after the waiting period
- **Alternative**: Switch to the `dall-e-3` model in the sidebar if you need immediate access (if available for your account)

**Audio format not supported**

- Install FFmpeg: `brew install ffmpeg` (macOS) or `apt-get install ffmpeg` (Linux)

**Rate limit errors**

- Wait a few moments and try again
- Check your OpenAI account usage limits

**Module not found errors**

- Ensure the virtual environment is activated
- Reinstall dependencies: `pip install -r requirements.txt`

## Development

### Running Tests

```bash
# Install test dependencies
pip install pytest pytest-mock

# Run tests
pytest
```

### Code Style

This project follows PEP 8 guidelines.
Format code with:

```bash
pip install black
black app.py llm_pipeline.py
```

## Limitations

- Maximum audio file size: 25 MB
- Very long or noisy audio may produce less accurate transcripts
- Generated images depend on prompt quality and model capabilities
- No offline mode - requires internet connection for all operations

## Security Notes

- Never commit your `.env` file or API keys to version control
- Add `.env` to `.gitignore`
- Rotate API keys if accidentally exposed
- Monitor API usage to prevent unexpected costs

## Acknowledgments

- [OpenAI](https://openai.com/) for Whisper, GPT, and DALL-E APIs
- [Streamlit](https://streamlit.io/) for the web framework
- Community contributors and testers
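
## Appendix: Pipeline Sketch

The three-stage flow described in this README (transcribe, enhance, generate, with every intermediate result kept for display) can be sketched roughly as follows. This is a minimal illustration, not the code in `llm_pipeline.py`: `PipelineResult`, `check_upload`, and `run_pipeline` are hypothetical names, the stage functions are passed in as plain callables because the real signatures of `transcribe_audio()`, `build_image_prompt()`, and `generate_image()` are not documented here, and the 25 MB limit is assumed to mean 25 × 1024 × 1024 bytes.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable

# Limits taken from the Features and Limitations sections of this README.
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".ogg", ".webm"}
MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # assumption: "25 MB" read as 25 MiB


@dataclass
class PipelineResult:
    """Intermediate outputs kept for display, one field per stage (hypothetical)."""

    transcript: str
    image_prompt: str
    image_bytes: bytes


def check_upload(filename: str, size_bytes: int) -> None:
    """Raise ValueError if the upload should be rejected before any API call."""
    extension = Path(filename).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported audio format: {extension or 'none'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        raise ValueError(f"File is {size_bytes} bytes; limit is {MAX_UPLOAD_BYTES}")


def run_pipeline(
    filename: str,
    audio: bytes,
    transcribe: Callable[[bytes], str],
    build_prompt: Callable[[str], str],
    generate: Callable[[str], bytes],
) -> PipelineResult:
    """Run the three stages in order and keep every intermediate result."""
    check_upload(filename, len(audio))
    transcript = transcribe(audio)
    image_prompt = build_prompt(transcript)
    image_bytes = generate(image_prompt)
    return PipelineResult(transcript, image_prompt, image_bytes)


if __name__ == "__main__":
    # Stub stages stand in for the real API-backed functions.
    result = run_pipeline(
        "memo.wav",
        b"fake-audio",
        transcribe=lambda audio: "a robot reading a book",
        build_prompt=lambda text: f"Detailed illustration of {text}",
        generate=lambda prompt: prompt.encode("utf-8"),
    )
    print(result.transcript)
    print(result.image_prompt)
```

Passing the stages in as callables keeps the sketch runnable without an API key and makes each stage easy to replace with a mock in tests. In the real app, the callables would wrap the functions in `llm_pipeline.py` with the models selected in the sidebar.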