---
title: Image Description with Qwen-VL
emoji: 🖼️
colorFrom: indigo
colorTo: purple
sdk: docker
sdk_version: 3.0.0
app_file: app.py
pinned: false
---

# Image Description Application with Qwen-VL

This application uses the Qwen-VL-Chat vision-language model to generate detailed descriptions of images. By default it describes the image in the `data_temp` folder, but it can also analyze any uploaded image.

## Features

- Loads an image from the `data_temp` folder or via upload
- Generates three types of descriptions:
  - Basic description (brief overview)
  - Detailed analysis (comprehensive description)
  - Technical analysis (assessment of technical aspects)
- Optionally displays the image
- Uses 8-bit quantization to reduce the model's memory footprint
- Provides a Gradio web UI

## Requirements

- Python 3.8 or higher
- PyTorch
- Transformers (version 4.35.2+)
- Pillow
- Matplotlib
- Accelerate
- Bitsandbytes
- Safetensors
- Gradio for the web interface

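
To confirm that the packages listed above are present before launching the app, a small check along these lines may help. It is only a sketch: the PyPI distribution names in `REQUIRED` are assumptions derived from the list, and `installed_versions` is a hypothetical helper, not part of this repository.

```python
from importlib import metadata

# Assumed PyPI distribution names for the packages in the Requirements list.
REQUIRED = ["torch", "transformers", "Pillow", "matplotlib",
            "accelerate", "bitsandbytes", "safetensors", "gradio"]


def installed_versions(packages=REQUIRED):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Print one line per package so missing dependencies stand out.
    for name, version in installed_versions().items():
        print(f"{name:<14} {version or 'NOT INSTALLED'}")
```

The script only reports what is installed; it does not compare version numbers, so the Transformers 4.35.2+ minimum still has to be checked by eye.
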
## Hardware Requirements

This application uses a vision-language model, which requires:

- A CUDA-capable GPU with at least 8GB VRAM
- 8GB+ system RAM

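
One way to verify the GPU requirement up front is to query PyTorch for the device's total memory. The following is a minimal sketch, not code from this repository: `has_enough_vram` and `check_gpu` are hypothetical names, and the threshold is interpreted as decimal gigabytes, since cards sold as "8GB" typically report slightly less than 8 GiB.

```python
MIN_VRAM_GB = 8  # minimum stated in the hardware requirements


def has_enough_vram(total_bytes, min_gb=MIN_VRAM_GB):
    """Return True if total_bytes is at least min_gb decimal gigabytes."""
    return total_bytes >= min_gb * 10 ** 9


def check_gpu():
    """Return (ok, message) for the first CUDA device, if any."""
    import torch  # imported lazily so has_enough_vram works without PyTorch

    if not torch.cuda.is_available():
        return False, "No CUDA-capable GPU detected"
    props = torch.cuda.get_device_properties(0)
    ok = has_enough_vram(props.total_memory)
    return ok, f"{props.name}: {props.total_memory / 10 ** 9:.1f} GB VRAM"
```

Calling `check_gpu()` before loading the model gives an early, readable failure instead of an out-of-memory error partway through startup.
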
## Deployment Options

### 1. Hugging Face Spaces (Recommended)

This repository is ready to be deployed on Hugging Face Spaces.

**Steps:**

1. Create a new Space on [Hugging Face Spaces](https://huggingface.co/spaces)
2. Select "Docker" as the Space SDK
3. Link this GitHub repository
4. Select a GPU (T4 or better is recommended)
5. Create the Space

The application will automatically deploy with the Gradio UI frontend.

### 2. AWS SageMaker

For production deployment on AWS SageMaker:

1. Package the application using the provided Dockerfile
2. Upload the Docker image to Amazon ECR
3. Create a SageMaker Model using the ECR image
4. Deploy an endpoint with an instance type like ml.g4dn.xlarge
5. Set up API Gateway for HTTP access (optional)

Detailed AWS instructions can be found in the `docs/aws_deployment.md` file.

### 3. Azure Machine Learning

For Azure deployment:

1. Create an Azure ML workspace
2. Register the model on Azure ML
3. Create an inference configuration
4. Deploy to AKS or ACI with a GPU-enabled instance

Detailed Azure instructions can be found in the `docs/azure_deployment.md` file.

## How It Works

The application uses Qwen-VL-Chat, a multimodal model that takes an image and a text prompt as input and generates a text response.

The script:

1. Processes the image with three different prompts:
   - "Describe this image briefly in a single paragraph."
   - "Analyze this image in detail. Describe the main elements, any text visible, the colors, and the overall composition."
   - "What can you tell me about the technical aspects of this image?"
2. Uses 8-bit quantization to reduce memory requirements
3. Formats and displays the results

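
The three steps above could be sketched roughly as follows. This is an illustration under stated assumptions rather than the repository's actual code: the function names are hypothetical, the model ID `Qwen/Qwen-VL-Chat` and the `from_list_format` / `chat` calls follow the usage published with the Qwen-VL-Chat model, and 8-bit loading is assumed to go through `BitsAndBytesConfig(load_in_8bit=True)` from Transformers.

```python
# The three prompts quoted above, keyed by the label used in the output.
PROMPTS = {
    "Basic Description": "Describe this image briefly in a single paragraph.",
    "Detailed Description": (
        "Analyze this image in detail. Describe the main elements, any text "
        "visible, the colors, and the overall composition."
    ),
    "Technical Analysis": "What can you tell me about the technical aspects of this image?",
}


def load_qwen_vl(model_id="Qwen/Qwen-VL-Chat"):
    """Load tokenizer and 8-bit model (assumes a CUDA GPU and bitsandbytes)."""
    # Imported lazily so the helpers below run without the heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        trust_remote_code=True,
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    ).eval()
    return tokenizer, model


def make_qwen_ask(tokenizer, model):
    """Wrap the model in an ask(image_path, prompt) -> str callable."""
    def ask(image_path, prompt):
        query = tokenizer.from_list_format([{"image": image_path}, {"text": prompt}])
        response, _history = model.chat(tokenizer, query=query, history=None)
        return response
    return ask


def describe_image(image_path, ask):
    """Run every prompt against one image and return {label: description}."""
    return {label: ask(image_path, prompt) for label, prompt in PROMPTS.items()}


def format_results(results):
    """Render the results under the same banner as the example output."""
    lines = ["==== Image Description Results (Qwen-VL) ====", ""]
    for label, text in results.items():
        lines += [f"{label}:", text, ""]
    return "\n".join(lines).rstrip()
```

With a GPU available, usage would look like `ask = make_qwen_ask(*load_qwen_vl())` followed by `print(format_results(describe_image("data_temp/page_2.png", ask)))`. Passing `ask` in as a parameter keeps the prompt loop and formatting testable without loading the model.
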
## Repository Structure

- `app.py` - Gradio UI for web interface
- `Dockerfile` - For containerized deployment
- `requirements.txt` - Python dependencies
- `data_temp/` - Sample images for testing

## Local Development

1. Install the required packages:
   ```
   pip install -r requirements.txt
   ```

2. Run the Gradio UI:
   ```
   python app.py
   ```

3. Visit `http://localhost:7860` in your browser

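
For orientation, a Gradio entry point serving on port 7860 might be structured like the sketch below. It is not the repository's `app.py`: `describe` is a placeholder that only echoes the file path where the real app would call the model, and `build_demo` and `main` are hypothetical names.

```python
def describe(image_path):
    """Placeholder callback returning one string per description type.

    A real app would run the model here; this stub only echoes the path.
    """
    if image_path is None:
        message = "Please upload an image."
        return message, message, message
    return (
        f"[basic description of {image_path}]",
        f"[detailed description of {image_path}]",
        f"[technical analysis of {image_path}]",
    )


def build_demo():
    """Build a Gradio interface with one image input and three text outputs."""
    import gradio as gr  # lazy import so describe() is testable without Gradio

    return gr.Interface(
        fn=describe,
        inputs=gr.Image(type="filepath", label="Image"),
        outputs=[
            gr.Textbox(label="Basic Description"),
            gr.Textbox(label="Detailed Description"),
            gr.Textbox(label="Technical Analysis"),
        ],
        title="Image Description with Qwen-VL",
    )


def main():
    # Port 7860 matches the URL in step 3; binding to 0.0.0.0 also makes the
    # server reachable from outside a container.
    build_demo().launch(server_name="0.0.0.0", server_port=7860)
```

Calling `main()` starts the server. Keeping the callback separate from the UI wiring means the description logic can be swapped in or tested without starting Gradio.
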
## Example Output

```
Processing image: data_temp/page_2.png
Loading model...
Generating descriptions...

==== Image Description Results (Qwen-VL) ====

Basic Description:
The image shows a webpage or document with text content organized in multiple columns.

Detailed Description:
The image displays a structured document or webpage with multiple sections of text organized in a grid layout. The content appears to be technical or educational in nature, with what looks like headings and paragraphs of text. The color scheme is primarily black text on a white background, creating a clean, professional appearance. There appear to be multiple columns of information, possibly representing different topics or categories. The layout suggests this might be documentation, a reference guide, or an educational resource related to technical content.

Technical Analysis:
This appears to be a screenshot of a digital document or webpage. The image quality is good with clear text rendering, suggesting it was captured at an appropriate resolution. The image uses a standard document layout with what appears to be a grid or multi-column structure. The screenshot has been taken of what seems to be a text-heavy interface with minimal graphics, consistent with technical documentation or reference materials.
```

Note: Actual descriptions will vary based on the specific image content and may be more detailed than this example.