
	---
	title: Image Description with Qwen-VL
	emoji: 🖼️
	colorFrom: indigo
	colorTo: purple
	sdk: docker
	sdk_version: 3.0.0
	app_file: app.py
	pinned: false
	---

	# Image Description Application with Qwen-VL

	This application uses the Qwen-VL-Chat vision-language model to generate descriptions of images. By default it describes the image in the `data_temp` folder; it can also analyze any image you upload.

	## Features

	- Loads an image from the data_temp folder or via upload
	- Generates multiple types of descriptions using state-of-the-art AI:
	  - Basic description (brief overview)
	  - Detailed analysis (comprehensive description)
	  - Technical analysis (assessment of technical aspects)
	- Displays the image (optional)
	- Uses 8-bit quantization for efficient model loading
	- Provides a user-friendly Gradio UI
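
As a rough illustration of the 8-bit loading listed above, the sketch below assumes the `Qwen/Qwen-VL-Chat` checkpoint on the Hugging Face Hub and the `BitsAndBytesConfig` route in Transformers. The names `MODEL_ID`, `quantization_kwargs`, and `load_model_8bit` are hypothetical, and the actual `app.py` may load the model differently.

```python
MODEL_ID = "Qwen/Qwen-VL-Chat"  # assumed Hub checkpoint name, not read from app.py


def quantization_kwargs() -> dict:
    """Settings passed to BitsAndBytesConfig for an 8-bit load (assumed values)."""
    return {"load_in_8bit": True}


def load_model_8bit(model_id: str = MODEL_ID):
    """Load the tokenizer and model with 8-bit weights via bitsandbytes.

    The heavy imports live inside the function so this module can still be
    imported on a machine without the GPU stack installed.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=BitsAndBytesConfig(**quantization_kwargs()),
        device_map="auto",       # let Accelerate decide where layers are placed
        trust_remote_code=True,  # the checkpoint ships its own modelling code
    )
    return tokenizer, model.eval()
```

Calling `load_model_8bit()` downloads the checkpoint on first use, so it is best done once at startup rather than per request.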

	## Requirements

	- Python 3.8 or higher
	- PyTorch
	- Transformers (version 4.35.2+)
	- Pillow
	- Matplotlib
	- Accelerate
	- Bitsandbytes
	- Safetensors
	- Gradio for the web interface

	## Hardware Requirements

	This application uses a vision-language model which requires:
	- A CUDA-capable GPU with at least 8GB VRAM
	- 8GB+ system RAM
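
A quick way to check the GPU requirement before starting the app is sketched below. The helper names `vram_ok` and `check_gpu` are hypothetical and not part of this repository; the 8 GB threshold is the one stated above.

```python
import importlib.util

MIN_VRAM_GB = 8  # threshold taken from the hardware requirements above


def vram_ok(total_bytes: int, min_gb: int = MIN_VRAM_GB) -> bool:
    """Return True if the reported device memory meets the minimum.

    Decimal gigabytes (10**9 bytes) are used so that a nominal 8 GB card,
    which typically reports a little under 8 GiB, still passes.
    """
    return total_bytes >= min_gb * 10**9


def check_gpu() -> str:
    """Describe whether the first CUDA device looks sufficient."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed"
    import torch

    if not torch.cuda.is_available():
        return "No CUDA-capable GPU is visible to PyTorch"
    props = torch.cuda.get_device_properties(0)
    verdict = "OK" if vram_ok(props.total_memory) else "below the 8 GB minimum"
    return f"{props.name}: {props.total_memory / 10**9:.1f} GB VRAM ({verdict})"
```

Running `print(check_gpu())` in the target environment gives a one-line summary.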

	## Deployment Options

	### 1. Hugging Face Spaces (Recommended)

	This repository is ready to be deployed on Hugging Face Spaces.

	Steps:
	1. Create a new Space on [Hugging Face Spaces](https://huggingface.co/spaces)
	2. Select "Docker" as the Space SDK
	3. Link this GitHub repository
	4. Select a GPU (T4 or better is recommended)
	5. Create the Space

	The application will automatically deploy with the Gradio UI frontend.

	### 2. AWS SageMaker

	For production deployment on AWS SageMaker:

	1. Package the application using the provided Dockerfile
	2. Upload the Docker image to Amazon ECR
	3. Create a SageMaker Model using the ECR image
	4. Deploy an endpoint with an instance type like ml.g4dn.xlarge
	5. Set up API Gateway for HTTP access (optional)

	Detailed AWS instructions can be found in the `docs/aws_deployment.md` file.
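
Steps 3 and 4 above can be sketched with boto3 as follows. This is only an outline under stated assumptions: the function names, the resource naming scheme (`-config`, `-endpoint` suffixes), and the `AllTraffic` variant name are hypothetical, and the image URI and role ARN must come from your own AWS account.

```python
def sagemaker_requests(name: str, image_uri: str, role_arn: str,
                       instance_type: str = "ml.g4dn.xlarge"):
    """Build the three request payloads for a single-variant endpoint."""
    model = {
        "ModelName": name,
        "PrimaryContainer": {"Image": image_uri},  # ECR image from step 2
        "ExecutionRoleArn": role_arn,
    }
    config = {
        "EndpointConfigName": f"{name}-config",
        "ProductionVariants": [{
            "VariantName": "AllTraffic",
            "ModelName": name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
        }],
    }
    endpoint = {
        "EndpointName": f"{name}-endpoint",
        "EndpointConfigName": config["EndpointConfigName"],
    }
    return model, config, endpoint


def deploy(name: str, image_uri: str, role_arn: str) -> None:
    """Create the model, endpoint configuration, and endpoint in order."""
    import boto3  # imported lazily; requires configured AWS credentials

    model, config, endpoint = sagemaker_requests(name, image_uri, role_arn)
    client = boto3.client("sagemaker")
    client.create_model(**model)
    client.create_endpoint_config(**config)
    client.create_endpoint(**endpoint)
```

Keeping the payloads in a separate pure function makes them easy to inspect or log before anything is created in AWS.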

	### 3. Azure Machine Learning

	For Azure deployment:

	1. Create an Azure ML workspace
	2. Register the model on Azure ML
	3. Create an inference configuration
	4. Deploy to AKS or ACI with a GPU-enabled instance

	Detailed Azure instructions can be found in the `docs/azure_deployment.md` file.

	## How It Works

	The application uses Qwen-VL-Chat, a multimodal vision-language model that takes an image together with a text prompt and returns a text response.

	The script:
	1. Processes the image with three different prompts:
	   - "Describe this image briefly in a single paragraph."
	   - "Analyze this image in detail. Describe the main elements, any text visible, the colors, and the overall composition."
	   - "What can you tell me about the technical aspects of this image?"
	2. Uses 8-bit quantization to reduce memory requirements
	3. Formats and displays the results
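
The three steps above can be sketched as a small prompt loop. The prompts and the results header are copied from this README; the function names are hypothetical. `make_qwen_ask` assumes the `from_list_format` and `chat` helpers that the Qwen-VL-Chat checkpoint provides through its remote code, so treat it as an assumption about the model interface rather than a description of `app.py`.

```python
from typing import Callable, Dict

# Prompt text taken from the list above, keyed by the label used in the output.
PROMPTS: Dict[str, str] = {
    "Basic Description": "Describe this image briefly in a single paragraph.",
    "Detailed Description": (
        "Analyze this image in detail. Describe the main elements, any text "
        "visible, the colors, and the overall composition."
    ),
    "Technical Analysis": (
        "What can you tell me about the technical aspects of this image?"
    ),
}


def make_qwen_ask(tokenizer, model) -> Callable[[str, str], str]:
    """Wrap a loaded tokenizer/model pair as ask(image_path, prompt) -> text."""
    def ask(image_path: str, prompt: str) -> str:
        query = tokenizer.from_list_format(
            [{"image": image_path}, {"text": prompt}]
        )
        response, _history = model.chat(tokenizer, query=query, history=None)
        return response
    return ask


def describe_image(image_path: str,
                   ask: Callable[[str, str], str]) -> Dict[str, str]:
    """Run every prompt against one image and collect the answers by label."""
    return {label: ask(image_path, prompt) for label, prompt in PROMPTS.items()}


def format_results(results: Dict[str, str]) -> str:
    """Render the answers as a plain-text report."""
    lines = ["==== Image Description Results (Qwen-VL) ====", ""]
    for label, text in results.items():
        lines += [f"{label}:", text, ""]
    return "\n".join(lines).rstrip()
```

Passing the model call in as `ask` keeps the loop testable without a GPU: any function that accepts an image path and a prompt can stand in for the model.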

	## Repository Structure

	- `app.py` - Gradio UI for web interface
	- `Dockerfile` - For containerized deployment
	- `requirements.txt` - Python dependencies
	- `data_temp/` - Sample images for testing

	## Local Development

	1. Install the required packages:
	```
	pip install -r requirements.txt
	```

	2. Run the Gradio UI:
	```
	python app.py
	```

	3. Visit `http://localhost:7860` in your browser
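
For orientation, a Gradio front end serving on port 7860 might be wired roughly as follows. This is a minimal sketch and not the contents of `app.py`: `placeholder_describe`, `build_demo`, and `main` are hypothetical names, and the placeholder returns fixed strings where the real app would call the model.

```python
from typing import Tuple


def placeholder_describe(image_path: str) -> Tuple[str, str, str]:
    """Stand-in for the model call: returns one string per description type."""
    return (
        f"(basic description of {image_path})",
        f"(detailed description of {image_path})",
        f"(technical analysis of {image_path})",
    )


def build_demo(describe=placeholder_describe):
    """Create a Gradio interface with one image input and three text outputs."""
    import gradio as gr  # imported lazily so the module loads without Gradio

    return gr.Interface(
        fn=describe,
        inputs=gr.Image(type="filepath", label="Image"),
        outputs=[
            gr.Textbox(label="Basic Description"),
            gr.Textbox(label="Detailed Description"),
            gr.Textbox(label="Technical Analysis"),
        ],
        title="Image Description with Qwen-VL",
    )


def main() -> None:
    """Serve on all interfaces so the app is reachable inside a container."""
    build_demo().launch(server_name="0.0.0.0", server_port=7860)
```

Calling `main()` starts the server; binding to `0.0.0.0` rather than the loopback address is what makes the port reachable from outside a Docker container.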

	## Example Output

	```
	Processing image: data_temp/page_2.png
	Loading model...
	Generating descriptions...

	==== Image Description Results (Qwen-VL) ====

	Basic Description:
	The image shows a webpage or document with text content organized in multiple columns.

	Detailed Description:
	The image displays a structured document or webpage with multiple sections of text organized in a grid layout. The content appears to be technical or educational in nature, with what looks like headings and paragraphs of text. The color scheme is primarily black text on a white background, creating a clean, professional appearance. There appear to be multiple columns of information, possibly representing different topics or categories. The layout suggests this might be documentation, a reference guide, or an educational resource related to technical content.

	Technical Analysis:
	This appears to be a screenshot of a digital document or webpage. The image quality is good with clear text rendering, suggesting it was captured at an appropriate resolution. The image uses a standard document layout with what appears to be a grid or multi-column structure. The screenshot has been taken of what seems to be a text-heavy interface with minimal graphics, consistent with technical documentation or reference materials.
	```

	Note: Actual descriptions will vary based on the specific image content and may be more detailed than this example.