---
title: GenAI 2 Demo
emoji: πŸš€
colorFrom: red
colorTo: red
sdk: docker
app_port: 8501
tags:
  - streamlit
pinned: false
short_description: Streamlit template space
---

Voice to Image Agent

A Streamlit-based application that transforms voice messages into AI-generated images through a multi-step pipeline combining speech-to-text, natural language processing, and image generation.

Overview

This application takes a voice recording and converts it into a visual representation by:

  1. Transcribing audio to text using OpenAI Whisper
  2. Enhancing the transcript into a detailed image prompt using GPT-4
  3. Generating an image from the prompt using DALL-E

The entire pipeline is transparent, showing intermediate results at each step.
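The three steps can be sketched as one function. This is an illustrative sketch only: `transcribe_audio`, `build_image_prompt`, and `generate_image` are the `llm_pipeline` functions described under Architecture below, while `run_pipeline` and the exact keyword arguments are assumptions made for this example.

```python
# Illustrative sketch of the voice-to-image pipeline. The three imported
# functions are described in this README; run_pipeline and the keyword
# arguments are assumptions, not the exact app.py code.

def run_pipeline(audio_path: str, config: dict) -> dict:
    """Run audio -> transcript -> prompt -> image, keeping every intermediate."""
    from llm_pipeline import transcribe_audio, build_image_prompt, generate_image

    transcript = transcribe_audio(audio_path, model=config["transcription_model"])
    prompt = build_image_prompt(transcript, model=config["llm_model"])
    image = generate_image(prompt, model=config["image_model"], size=config["image_size"])
    # Returning all intermediates is what keeps the pipeline transparent in the UI.
    return {"transcript": transcript, "prompt": prompt, "image": image}
```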

Features

  • πŸŽ™οΈ Audio Upload: Support for multiple formats (WAV, MP3, M4A, OGG, WebM)
  • πŸ“ Speech-to-Text: Automatic transcription using Whisper
  • πŸ€– Prompt Enhancement: LLM-powered conversion of speech to detailed image descriptions
  • 🎨 Image Generation: High-quality image synthesis using DALL-E
  • πŸ” Full Transparency: View transcripts, prompts, and metadata for each step
  • βš™οΈ Configurable: Adjust models and parameters via the UI
  • πŸ“Š Detailed Logging: Console logs for monitoring and debugging

Prerequisites

  • Python: 3.12 or higher
  • OpenAI API Key: Create one in your OpenAI account dashboard
  • FFmpeg (optional): Required for certain audio formats (M4A, MP3)
  • Organization Verification (for gpt-image-1): Your OpenAI organization must be verified before it can use the gpt-image-1 image model; see the Organization Verification section below

Installation

1. Clone the Repository

git clone <your-repo-url>
cd GenerativeAI2

2. Create Virtual Environment

python3.12 -m venv .venv
source .venv/bin/activate  # On macOS/Linux
# .venv\Scripts\activate   # On Windows

3. Install Dependencies

pip install -r requirements.txt

4. Configure API Key

Option A: Environment Variable

export OPENAI_API_KEY="your-api-key-here"  # macOS/Linux
# set OPENAI_API_KEY=your-api-key-here     # Windows (cmd)

Option B: .env File (Recommended)

Create a .env file in the project root:

OPENAI_API_KEY=your-api-key-here
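A minimal sketch of how the key can be picked up at startup, assuming the python-dotenv package is available (check requirements.txt); the helper name `load_api_key` is illustrative:

```python
# Minimal sketch: read OPENAI_API_KEY from a .env file if python-dotenv is
# installed, otherwise fall back to a plain environment variable.
import os

def load_api_key() -> str:
    try:
        from dotenv import load_dotenv  # optional .env support
        load_dotenv()
    except ImportError:
        pass  # no python-dotenv: rely on the shell environment
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY environment variable is not set")
    return key
```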

Usage

Start the Application

streamlit run app.py

The app will open in your browser at http://localhost:8501

Main Interface

Workflow

  1. Upload Audio: Click "Browse files" and select your voice recording
  2. Configure Models (Optional): Adjust settings in the left sidebar
    • Transcription model (default: whisper-1)
    • LLM model (default: gpt-5-nano)
    • Image model (default: gpt-image-1)
    • Image size (512x512, 768x768, or 1024x1024)
  3. Run Pipeline: Click "Run Voice β†’ Image Pipeline"
  4. View Results: See the transcript, enhanced prompt, and generated image

Screenshots of the main interface, settings configuration, results display, and detailed results are in the screenshots/ directory (UI_1.png–UI_4.png).

Example Voice Prompts

  • "Create a futuristic city at sunset with flying cars and neon lights"
  • "Show me a peaceful forest with a waterfall and wildlife"
  • "Generate a portrait of a robot reading a book in a library"

Project Structure

GenerativeAI2/
β”œβ”€β”€ app.py                  # Streamlit UI and orchestration
β”œβ”€β”€ llm_pipeline.py         # Core pipeline functions
β”œβ”€β”€ requirements.txt        # Python dependencies
β”œβ”€β”€ README.md              # Documentation
β”œβ”€β”€ screenshots/            # UI screenshots
β”‚   β”œβ”€β”€ UI_1.png           # Main interface
β”‚   β”œβ”€β”€ UI_2.png           # Settings configuration
β”‚   β”œβ”€β”€ UI_3.png           # Results display
β”‚   └── UI_4.png           # Detailed results
└── .env                   # Environment variables (not in git)

Configuration

Model Selection

Configure models in the sidebar or modify defaults in the code:

# Default models in app.py
{
    "transcription_model": "whisper-1",
    "llm_model": "gpt-5-nano",
    "image_model": "gpt-image-1"
}
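A hedged sketch of how sidebar widgets could override these defaults, using the standard `st.sidebar.selectbox` API; the widget labels, option lists, and `DEFAULTS`/`sidebar_config` names are assumptions, and app.py may differ:

```python
# Sketch: defaults mirroring the README, overridable from the Streamlit
# sidebar. Names and labels are illustrative, not the exact app.py code.
DEFAULTS = {
    "transcription_model": "whisper-1",
    "llm_model": "gpt-5-nano",
    "image_model": "gpt-image-1",
    "image_size": "1024x1024",
}

def sidebar_config() -> dict:
    import streamlit as st  # imported lazily so the sketch stays self-contained
    return {
        "transcription_model": st.sidebar.selectbox("Transcription model", ["whisper-1"]),
        "llm_model": st.sidebar.selectbox("LLM model", ["gpt-5-nano"]),
        "image_model": st.sidebar.selectbox("Image model", ["gpt-image-1", "dall-e-3"]),
        "image_size": st.sidebar.selectbox("Image size", ["512x512", "768x768", "1024x1024"]),
    }
```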

Image Sizes

Available sizes:

  • 512x512 - Fast, lower cost
  • 768x768 - Balanced
  • 1024x1024 - High quality (default)

Logging

Logs are printed to the console where you run streamlit run app.py. Adjust log level:

# In app.py
logging.basicConfig(level=logging.INFO)  # Change to DEBUG for verbose output

Organization Verification

Important: The gpt-image-1 image generation model requires your OpenAI organization to be verified before use. This is a one-time setup process.

Verification Steps

  1. Go to OpenAI Organization Settings
  2. Click on "Verify Organization"
  3. Complete the verification process (may require providing business/organization details)
  4. Wait for access propagation: After verification, it can take up to 15 minutes for access to the gpt-image-1 model to become available

What to Expect

  • If you attempt to use gpt-image-1 before verification, you'll receive a PermissionDeniedError (HTTP 403)
  • The error message will indicate that organization verification is required
  • After verifying, wait up to 15 minutes for access to propagate before retrying image generation
  • If you need immediate access, consider an alternative image model such as dall-e-3 (if available for your account)

Alternative Models

If you need to use image generation immediately without waiting for verification, you can:

  • Switch to dall-e-3 model in the sidebar settings (if available for your account)
  • Or wait for the verification to complete and access to propagate
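The fallback described above could be automated with a try/except around the image call. This is a hypothetical pattern, not the app's actual code: in practice the exception would be `openai.PermissionDeniedError`, which carries `status_code == 403`.

```python
# Hypothetical fallback: try gpt-image-1 first; on a 403 (organization not
# verified), retry once with dall-e-3. `generate(prompt, model)` stands in
# for the real image-generation call.

def generate_with_fallback(generate, prompt: str):
    try:
        return generate(prompt, model="gpt-image-1")
    except Exception as err:  # openai.PermissionDeniedError in practice
        if getattr(err, "status_code", None) == 403:
            return generate(prompt, model="dall-e-3")
        raise  # anything other than a 403 is not a verification problem
```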

API Costs

Approximate costs per generation (as of 2024):

  • Whisper: ~$0.006 per minute of audio
  • GPT-4o-mini: ~$0.15 per 1M input tokens
  • DALL-E 3:
    • 1024x1024: $0.040 per image
    • 1024x1792 or 1792x1024: $0.080 per image
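For a back-of-envelope estimate, the prices above can be combined per run. This sketch uses only the approximate 2024 figures listed here (LLM token cost is omitted as negligible at these rates); real prices vary by model and date.

```python
# Rough per-run cost estimate from the approximate prices listed above.
WHISPER_PER_MIN = 0.006   # USD per minute of audio
DALLE3_1024 = 0.040       # USD per 1024x1024 image

def estimate_cost(audio_minutes: float, images: int = 1) -> float:
    """USD for transcription plus 1024x1024 image generation."""
    return round(audio_minutes * WHISPER_PER_MIN + images * DALLE3_1024, 4)
```

For example, a two-minute recording producing one image costs roughly $0.05.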

Architecture

Core Components

Frontend (app.py)

  • Streamlit UI for user interaction
  • Session state management
  • Error handling and display

Backend (llm_pipeline.py)

  • transcribe_audio(): Whisper API integration
  • build_image_prompt(): LLM prompt enhancement
  • generate_image(): DALL-E image generation

Pipeline Flow

Audio Upload β†’ Transcription β†’ Prompt Enhancement β†’ Image Generation β†’ Display
     ↓              ↓                  ↓                    ↓              ↓
  User File    Whisper API         GPT API            DALL-E API      Results UI

Troubleshooting

Common Issues

"OPENAI_API_KEY environment variable is not set"

  • Ensure your .env file exists and contains the API key
  • Or export the environment variable in your terminal

"PermissionDeniedError: Your organization must be verified to use the model gpt-image-1"

  • This error occurs when trying to use gpt-image-1 without organization verification
  • Solution:
    1. Visit OpenAI Organization Settings
    2. Click "Verify Organization" and complete the verification process
    3. Wait up to 15 minutes for access to propagate after verification
    4. Retry the image generation after the waiting period
  • Alternative: Switch to dall-e-3 model in the sidebar if you need immediate access (if available for your account)

Audio format not supported

  • Install FFmpeg: brew install ffmpeg (macOS) or sudo apt-get install ffmpeg (Debian/Ubuntu)

Rate limit errors

  • Wait a few moments and try again
  • Check your OpenAI account usage limits

Module not found errors

  • Ensure virtual environment is activated
  • Reinstall dependencies: pip install -r requirements.txt

Development

Running Tests

# Install test dependencies
pip install pytest pytest-mock

# Run tests
pytest
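A test skeleton in the style pytest collects (functions named `test_*`), using the stdlib `unittest.mock` so no API key is needed; `pytest-mock`'s `mocker` fixture is an equivalent alternative. The mocked return values are illustrative:

```python
# Illustrative unit test: mock the OpenAI-backed steps so the
# transcript-to-prompt chaining can be checked without network calls.
from unittest.mock import MagicMock

def test_pipeline_chains_transcript_into_prompt():
    transcribe = MagicMock(return_value="a red fox in the snow")
    build_prompt = MagicMock(return_value="a red fox in the snow, photorealistic")

    transcript = transcribe("voice.wav")
    prompt = build_prompt(transcript)

    assert "fox" in prompt
    build_prompt.assert_called_once_with("a red fox in the snow")
```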

Code Style

This project follows PEP 8 guidelines. Format code with:

pip install black
black app.py llm_pipeline.py

Limitations

  • Maximum audio file size: 25MB
  • Very long or noisy audio may produce less accurate transcripts
  • Generated images depend on prompt quality and model capabilities
  • No offline mode - requires internet connection for all operations
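The 25 MB cap comes from the Whisper API's upload limit, so it is worth checking before calling the API. A minimal sketch, with an illustrative helper name:

```python
# Guard against Whisper's 25 MB upload limit before spending an API call.
MAX_BYTES = 25 * 1024 * 1024

def check_audio_size(data: bytes) -> None:
    if len(data) > MAX_BYTES:
        raise ValueError(
            f"Audio is {len(data) / 1_048_576:.1f} MB; Whisper accepts at most 25 MB"
        )
```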

Security Notes

  • Never commit your .env file or API keys to version control
  • Add .env to .gitignore
  • Rotate API keys if accidentally exposed
  • Monitor API usage to prevent unexpected costs

Acknowledgments

  • OpenAI for Whisper, GPT, and DALL-E APIs
  • Streamlit for the web framework
  • Community contributors and testers