Spaces:

ntairov
/

generativeai2

Sleeping

App Files Files Community

generativeai2 / README.md

Nazim Tairov

update readme

5b064a4 4 months ago

preview code

raw

history blame contribute delete

8.71 kB

	---
	title: GenAI 2 Demo
	emoji: 🚀
	colorFrom: red
	colorTo: red
	sdk: docker
	app_port: 8501
	tags:
	- streamlit
	pinned: false
	short_description: Streamlit template space
	---

	# Voice to Image Agent

	A Streamlit-based application that transforms voice messages into AI-generated images through a multi-step pipeline combining speech-to-text, natural language processing, and image generation.

	## Overview

	This application takes a voice recording and converts it into a visual representation by:

	1. Transcribing audio to text using OpenAI Whisper
	2. Enhancing the transcript into a detailed image prompt using GPT-4
	3. Generating an image from the prompt using DALL-E

	The entire pipeline is transparent, showing intermediate results at each step.

	## Features

	- 🎙️ Audio Upload: Support for multiple formats (WAV, MP3, M4A, OGG, WebM)
	- 📝 Speech-to-Text: Automatic transcription using Whisper
	- 🤖 Prompt Enhancement: LLM-powered conversion of speech to detailed image descriptions
	- 🎨 Image Generation: High-quality image synthesis using DALL-E
	- 🔍 Full Transparency: View transcripts, prompts, and metadata for each step
	- ⚙️ Configurable: Adjust models and parameters via the UI
	- 📊 Detailed Logging: Console logs for monitoring and debugging

	## Prerequisites

	- Python: 3.12 or higher
	- OpenAI API Key: [Get one here](https://platform.openai.com/api-keys)
	- FFmpeg (optional): Required for certain audio formats (M4A, MP3)
	- Organization Verification (for `gpt-image-1` model): Your OpenAI organization must be verified to use the `gpt-image-1` image generation model. See [Organization Verification](#organization-verification) section below.

	## Installation

	### 1. Clone the Repository

	```bash
	git clone <your-repo-url>
	cd GenerativeAI2
	```

	### 2. Create Virtual Environment

	```bash
	python3.12 -m venv .venv
	source .venv/bin/activate # On macOS/Linux
	# .venv\Scripts\activate # On Windows
	```

	### 3. Install Dependencies

	```bash
	pip install -r requirements.txt
	```

	### 4. Configure API Key

	Option A: Environment Variable

	```bash
	export OPENAI_API_KEY="your-api-key-here" # macOS/Linux
	```

	Option B: .env File (Recommended)

	Create a `.env` file in the project root:

	```env
	OPENAI_API_KEY=your-api-key-here
	```

	## Usage

	### Start the Application

	```bash
	streamlit run app.py
	```

	The app will open in your browser at `http://localhost:8501`

	![Main Interface](screenshots/UI_1.png)

	### Workflow

	1. Upload Audio: Click "Browse files" and select your voice recording
	2. Configure Models (Optional): Adjust settings in the left sidebar
	- Transcription model (default: `whisper-1`)
	- LLM model (default: `gpt-5-nano`)
	- Image model (default: `gpt-image-1`)
	- Image size (512x512, 768x768, or 1024x1024)

	![Settings Configuration](screenshots/UI_2.png)

	3. Run Pipeline: Click "Run Voice → Image Pipeline"
	4. View Results: See the transcript, enhanced prompt, and generated image

	![Results Display](screenshots/UI_3.png)
	![Detailed Results](screenshots/UI_4.png)

	### Example Voice Prompts

	- "Create a futuristic city at sunset with flying cars and neon lights"
	- "Show me a peaceful forest with a waterfall and wildlife"
	- "Generate a portrait of a robot reading a book in a library"

	## Project Structure

	```
	GenerativeAI2/
	├── app.py # Streamlit UI and orchestration
	├── llm_pipeline.py # Core pipeline functions
	├── requirements.txt # Python dependencies
	├── README.md # Documentation
	├── screenshots/ # UI screenshots
	│ ├── UI_1.png # Main interface
	│ ├── UI_2.png # Settings configuration
	│ ├── UI_3.png # Results display
	│ └── UI_4.png # Detailed results
	└── .env # Environment variables (not in git)
	```

	## Configuration

	### Model Selection

	Configure models in the sidebar or modify defaults in the code:

	```python
	# Default models in app.py
	{
	"transcription_model": "whisper-1",
	"llm_model": "gpt-5-nano",
	"image_model": "gpt-image-1"
	}
	```

	### Image Sizes

	Available sizes:
	- `512x512` - Fast, lower cost
	- `768x768` - Balanced
	- `1024x1024` - High quality (default)

	### Logging

	Logs are printed to the console where you run `streamlit run app.py`. Adjust log level:

	```python
	# In app.py
	logging.basicConfig(level=logging.INFO) # Change to DEBUG for verbose output
	```

	### Organization Verification

	Important: The `gpt-image-1` image generation model requires your OpenAI organization to be verified before use. This is a one-time setup process.

	#### Verification Steps

	1. Go to [OpenAI Organization Settings](https://platform.openai.com/settings/organization/general)
	2. Click on "Verify Organization"
	3. Complete the verification process (may require providing business/organization details)
	4. Wait for access propagation: After verification, it can take up to 15 minutes for access to the `gpt-image-1` model to become available

	#### What to Expect

	- If you attempt to use `gpt-image-1` before verification, you'll receive a `PermissionDeniedError` with error code 403
	- The error message will indicate that organization verification is required
	- After verification, wait 15 minutes before retrying image generation
	- If you need immediate access, consider using alternative image models like `dall-e-3` (if available)

	#### Alternative Models

	If you need to use image generation immediately without waiting for verification, you can:
	- Switch to `dall-e-3` model in the sidebar settings (if available for your account)
	- Or wait for the verification to complete and access to propagate

	## API Costs

	Approximate costs per generation (as of 2024):

	- Whisper: ~$0.006 per minute of audio
	- GPT-4o-mini: ~$0.15 per 1M input tokens
	- DALL-E 3:
	- 1024x1024: $0.040 per image
	- 1024x1792 or 1792x1024: $0.080 per image

	## Architecture

	### Core Components

	Frontend (`app.py`)
	- Streamlit UI for user interaction
	- Session state management
	- Error handling and display

	Backend (`llm_pipeline.py`)
	- `transcribe_audio()`: Whisper API integration
	- `build_image_prompt()`: LLM prompt enhancement
	- `generate_image()`: DALL-E image generation

	### Pipeline Flow

	```
	Audio Upload → Transcription → Prompt Enhancement → Image Generation → Display
	↓ ↓ ↓ ↓ ↓
	User File Whisper API GPT API DALL-E API Results UI
	```

	## Troubleshooting

	### Common Issues

	"OPENAI_API_KEY environment variable is not set"
	- Ensure your `.env` file exists and contains the API key
	- Or export the environment variable in your terminal

	"PermissionDeniedError: Your organization must be verified to use the model `gpt-image-1`"
	- This error occurs when trying to use `gpt-image-1` without organization verification
	- Solution:
	1. Visit [OpenAI Organization Settings](https://platform.openai.com/settings/organization/general)
	2. Click "Verify Organization" and complete the verification process
	3. Wait up to 15 minutes for access to propagate after verification
	4. Retry the image generation after the waiting period
	- Alternative: Switch to `dall-e-3` model in the sidebar if you need immediate access (if available for your account)

	Audio format not supported
	- Install FFmpeg: `brew install ffmpeg` (macOS) or `apt-get install ffmpeg` (Linux)

	Rate limit errors
	- Wait a few moments and try again
	- Check your OpenAI account usage limits

	Module not found errors
	- Ensure virtual environment is activated
	- Reinstall dependencies: `pip install -r requirements.txt`

	## Development

	### Running Tests

	```bash
	# Install test dependencies
	pip install pytest pytest-mock

	# Run tests
	pytest
	```

	### Code Style

	This project follows PEP 8 guidelines. Format code with:

	```bash
	pip install black
	black app.py llm_pipeline.py
	```

	## Limitations

	- Maximum audio file size: 25MB
	- Very long or noisy audio may produce less accurate transcripts
	- Generated images depend on prompt quality and model capabilities
	- No offline mode - requires internet connection for all operations

	## Security Notes

	- Never commit your `.env` file or API keys to version control
	- Add `.env` to `.gitignore`
	- Rotate API keys if accidentally exposed
	- Monitor API usage to prevent unexpected costs

	## Acknowledgments

	- [OpenAI](https://openai.com/) for Whisper, GPT, and DALL-E APIs
	- [Streamlit](https://streamlit.io/) for the web framework
	- Community contributors and testers