Upload 9 files
- .gitattributes +5 -0
- OD_test.png +3 -0
- Object_detection.ipynb +0 -0
- Object_detection.png +3 -0
- README.md +104 -14
- class_object.PNG +3 -0
- helper.py +103 -0
- kid_bike.jpeg +3 -0
- object_detection.py +68 -0
- pipeline.PNG +3 -0
.gitattributes
CHANGED
@@ -33,3 +33,8 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+class_object.PNG filter=lfs diff=lfs merge=lfs -text
+kid_bike.jpeg filter=lfs diff=lfs merge=lfs -text
+Object_detection.png filter=lfs diff=lfs merge=lfs -text
+OD_test.png filter=lfs diff=lfs merge=lfs -text
+pipeline.PNG filter=lfs diff=lfs merge=lfs -text
OD_test.png
ADDED
Git LFS Details
Object_detection.ipynb
ADDED
The diff for this file is too large to render.
Object_detection.png
ADDED
Git LFS Details
README.md
CHANGED
@@ -1,14 +1,104 @@
-
-
-
-
-
-
-
-
-
-
-
----
-
-
+# An Object Identification and Text-to-Speech Model Using Hugging Face Transformers
+
+Learn how to build the pipeline below using Gradio (for the user interface and deployment), Facebook's detr-resnet-50 model for object identification, and kakao-enterprise's vits-ljs model for text to speech.
+
+<img src="pipeline.PNG" alt="project pipeline" width="400">
+
+
+## Description
+
+The Image Description and Audio Transcript App is a web-based application that uses artificial intelligence to generate descriptions for uploaded images. The tool also provides an audio transcript of each description, making it more accessible to users with visual impairments.
+
+This app uses BLIP (Bootstrapping Language-Image Pre-training) for image captioning and gTTS (Google Text-to-Speech) for converting the description into an audio file.
+
+## Features
+
+* Upload an image and receive an AI-generated description.
+* Convert the description into an audio file for accessibility.
+* Responsive web interface built using Gradio.
+* Simple, user-friendly design for a seamless experience.
+
+## Technologies Used
+
+* Programming Language: Python 3.7+
+* AI Model: BLIP for image captioning
+* Text-to-Speech: gTTS (Google Text-to-Speech)
+* Web Interface: Gradio
+* Libraries: PyTorch, Transformers, Gradio, gTTS
+
+## Libraries and Dependencies
+
+* torch: Deep learning framework for the BLIP model
+* transformers: Hugging Face library for pre-trained models like BLIP
+* gtts: Library for text-to-speech conversion
+* gradio: For building the web interface
+
+To install the necessary packages, run:
+
+```bash
+pip install torch transformers gtts gradio
+```
+
+## Installation and Setup
+
+* Clone the repository:
+```bash
+git clone https://github.com/your-username/image-description-audio-transcript.git
+cd image-description-audio-transcript
+```
+
+* Create a virtual environment (optional but recommended):
+```bash
+python -m venv venv
+source venv/bin/activate  # On Windows, use venv\Scripts\activate
+```
+
+* Install the required packages:
+```bash
+pip install torch transformers gtts gradio
+```
+
+* Ensure that the necessary models are downloaded: The BLIP model is downloaded automatically the first time the script runs, and gTTS uses an online service to convert text to speech.
+
+## Usage
+
+1. Run the application:
+```bash
+python object_detection.py
+```
+2. Open a web browser and navigate to http://127.0.0.1:7860 to access the app.
+3. Upload an image through the provided input.
+4. Click the `Generate Description` button to get a text description of the image.
+5. Click the `Click here for an audio transcript` button to hear the description.
+
+## Configuration
+
+You can set the following options where `demo.launch()` is called in the object_detection.py file:
+
+* host: The IP address on which the server runs (Gradio's `server_name`; default: '127.0.0.1')
+* port: The port number (Gradio's `server_port`; default: 7860)
+* debug: Debug mode for development (default: False)
+
+## Contributing
+
+Contributions to improve the Image Description and Audio Transcript App are welcome. Please follow these steps:
+
+* Fork the repository.
+* Create a new branch (git checkout -b feature/AmazingFeature).
+* Commit your changes (git commit -m 'Add some AmazingFeature').
+* Push to the branch (git push origin feature/AmazingFeature).
+* Open a Pull Request.
+
+## License
+
+This project is licensed under the MIT License - see the LICENSE.md file for details.
+
+## Acknowledgments
+
+* Salesforce for the BLIP image captioning model.
+* Google for the gTTS service.
+* Gradio for the easy-to-use interface framework.
+
+## Disclaimer
+
+This tool is designed to assist with generating descriptions and audio transcripts from images, but always review the generated content for accuracy and appropriateness before use.
|
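Review note on the README's Configuration section: the committed object_detection.py calls `demo.launch()` with no arguments, so the host, port, and debug options are not wired up anywhere yet. The following is a minimal sketch of one way to expose them, not the author's implementation. The environment variable names (`APP_HOST`, `APP_PORT`, `APP_DEBUG`) and the helper names are hypothetical; `server_name`, `server_port`, and `debug` are keyword arguments of Gradio's `launch()`, and Gradio's own default for `debug` is `False`.

```python
import os


def launch_kwargs(env=None):
    """Map host/port/debug settings onto Gradio launch() keyword arguments.

    The APP_* variable names are assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    return {
        "server_name": env.get("APP_HOST", "127.0.0.1"),
        "server_port": int(env.get("APP_PORT", "7860")),
        "debug": env.get("APP_DEBUG", "false").lower() in ("1", "true", "yes"),
    }


def launch(demo, env=None):
    """Launch a Gradio Blocks app with the resolved settings."""
    demo.launch(**launch_kwargs(env))
```

With this in place, the last line of the script could become `launch(demo)`, and a different port could be chosen by setting the (hypothetical) `APP_PORT` variable before starting the app.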
class_object.PNG
ADDED
Git LFS Details
helper.py
ADDED
@@ -0,0 +1,103 @@
+# -*- coding: utf-8 -*-
+"""helper.ipynb
+
+Automatically generated by Colaboratory.
+
+Original file is located at
+    https://colab.research.google.com/drive/1IDhEhDLbnCTaBfIbuMtlNFW3ntQiZBwA
+"""
+
+import io
+import matplotlib.pyplot as plt
+import requests
+import inflect
+from PIL import Image
+
+def load_image_from_url(url):
+    return Image.open(requests.get(url, stream=True).raw)
+
+def render_results_in_image(in_pil_img, in_results):
+    plt.figure(figsize=(16, 10))
+    plt.imshow(in_pil_img)
+
+    ax = plt.gca()
+
+    for prediction in in_results:
+
+        x, y = prediction['box']['xmin'], prediction['box']['ymin']
+        w = prediction['box']['xmax'] - prediction['box']['xmin']
+        h = prediction['box']['ymax'] - prediction['box']['ymin']
+
+        ax.add_patch(plt.Rectangle((x, y),
+                                   w,
+                                   h,
+                                   fill=False,
+                                   color="green",
+                                   linewidth=2))
+        ax.text(
+            x,
+            y,
+            f"{prediction['label']}: {round(prediction['score']*100, 1)}%",
+            color='red'
+        )
+
+    plt.axis("off")
+
+    # Save the modified image to a BytesIO object
+    img_buf = io.BytesIO()
+    plt.savefig(img_buf, format='png',
+                bbox_inches='tight',
+                pad_inches=0)
+    img_buf.seek(0)
+    modified_image = Image.open(img_buf)
+
+    # Close the plot to prevent it from being displayed
+    plt.close()
+
+    return modified_image
+
+def summarize_predictions_natural_language(predictions):
+    summary = {}
+    p = inflect.engine()
+
+    for prediction in predictions:
+        label = prediction['label']
+        if label in summary:
+            summary[label] += 1
+        else:
+            summary[label] = 1
+
+    result_string = "In this image, there are "
+    for i, (label, count) in enumerate(summary.items()):
+        count_string = p.number_to_words(count)
+        result_string += f"{count_string} {label}"
+        if count > 1:
+            result_string += "s"
+
+        result_string += " "
+
+        if i == len(summary) - 2:
+            result_string += "and "
+
+    # Strip the trailing space (and any comma) before adding the period
+    result_string = result_string.rstrip(', ') + "."
+
+    return result_string
+
+
+##### To ignore warnings #####
+import warnings
+import logging
+from transformers import logging as hf_logging
+
+def ignore_warnings():
+    # Ignore specific Python warnings
+    warnings.filterwarnings("ignore", message="Some weights of the model checkpoint")
+    warnings.filterwarnings("ignore", message="Could not find image processor class")
+    warnings.filterwarnings("ignore", message="The `max_size` parameter is deprecated")
+
+    # Adjust logging for libraries using the logging module
+    logging.basicConfig(level=logging.ERROR)
+    hf_logging.set_verbosity_error()
+
+########
|
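Review note on `summarize_predictions_natural_language` in helper.py: as committed, an empty prediction list yields "In this image, there are.", three or more labels are joined without commas, and plurals are formed by appending "s" (so two detections of "bus" read as "two buss"). The sketch below shows one dependency-free way to handle those cases; it is an illustration under stated assumptions, not a drop-in replacement. The function name `summarize_counts`, the small number-word table (standing in for `inflect`), and the deliberately incomplete pluralization rules are all hypothetical.

```python
from collections import Counter

# Stand-in for inflect.number_to_words; larger counts fall back to digits.
_NUMBER_WORDS = ["zero", "one", "two", "three", "four", "five",
                 "six", "seven", "eight", "nine", "ten"]

# A few irregular plurals; this table is illustrative, not exhaustive.
_IRREGULAR = {"person": "people", "knife": "knives", "mouse": "mice",
              "sheep": "sheep", "skis": "skis", "scissors": "scissors"}


def _count_word(n):
    return _NUMBER_WORDS[n] if n < len(_NUMBER_WORDS) else str(n)


def _pluralize(label, count):
    if count == 1:
        return label
    if label in _IRREGULAR:
        return _IRREGULAR[label]
    if label.endswith(("s", "x", "ch", "sh")):
        return label + "es"
    return label + "s"


def summarize_counts(predictions):
    """Summarize detection labels as one English sentence."""
    counts = Counter(p["label"] for p in predictions)
    if not counts:
        return "No objects were detected in this image."

    parts = [f"{_count_word(n)} {_pluralize(label, n)}"
             for label, n in counts.items()]
    if len(parts) <= 2:
        body = " and ".join(parts)
    else:
        body = ", ".join(parts[:-1]) + ", and " + parts[-1]

    # Use "is" only when exactly one object was detected overall.
    verb = "is" if sum(counts.values()) == 1 else "are"
    return f"In this image, there {verb} {body}."
```

The input is assumed to have the same shape that `render_results_in_image` already consumes: a list of dictionaries, each with a `'label'` key.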
kid_bike.jpeg
ADDED
Git LFS Details
object_detection.py
ADDED
@@ -0,0 +1,68 @@
+import torch
+from transformers import BlipProcessor, BlipForConditionalGeneration
+from gtts import gTTS
+import tempfile
+import subprocess
+import sys
+import gradio
+
+
+def ensure_package_installed(package_name):
+    try:
+        __import__(package_name)
+    except ImportError:
+        print(f"{package_name} package not found. Installing...")
+        subprocess.check_call([sys.executable, "-m", "pip", "install", package_name])
+        __import__(package_name)
+
+# Check and install the required packages
+ensure_package_installed("gradio")
+ensure_package_installed("transformers")
+ensure_package_installed("gtts")
+
+
+# Load the image captioning model
+processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
+model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
+
+def generate_description(image):
+    """Generates a textual description of the given image using a pre-trained BLIP model."""
+    inputs = processor(image, return_tensors="pt").to(model.device)
+    output = model.generate(**inputs)
+    description = processor.decode(output[0], skip_special_tokens=True)
+    return description
+
+def text_to_speech(text):
+    """Converts text to speech using gTTS and returns the audio file path."""
+    tts = gTTS(text=text, lang='en')
+    temp_audio = tempfile.NamedTemporaryFile(delete=False, suffix=".mp3")
+    tts.save(temp_audio.name)
+    return temp_audio.name
+
+def process_image(image):
+    """Processes the uploaded image and returns the generated description."""
+    description = generate_description(image)
+    return description
+
+def get_audio(description):
+    """Generates the audio file for the given description."""
+    return text_to_speech(description)
+
+# Build Gradio Interface
+with gradio.Blocks() as demo:
+    gradio.Markdown("# Image Description and Audio Transcript App")
+    gradio.Markdown("Upload an image to get an AI-generated description. Click the button to hear the description.")
+
+    with gradio.Row():
+        image_input = gradio.Image(type="pil")
+        text_output = gradio.Textbox(label="Generated Description")
+
+    generate_btn = gradio.Button("Generate Description")
+    audio_btn = gradio.Button("Click here for an audio transcript")
+    audio_output = gradio.Audio()
+
+    generate_btn.click(process_image, inputs=[image_input], outputs=[text_output])
+    audio_btn.click(get_audio, inputs=[text_output], outputs=[audio_output])
+
+# Launch the Gradio app
+demo.launch()
|
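Review note on `ensure_package_installed` in object_detection.py: the `import` statements at the top of the file run before the check is reached, so a missing package would raise `ImportError` before the installer gets a chance to run, and `torch` is never checked at all. A sketch of checking first is shown below. It assumes each pip package name matches its import name, which holds for the four packages used here but not in general. `REQUIRED`, `missing_packages`, and `install_missing` are hypothetical names introduced for illustration.

```python
import importlib.util
import subprocess
import sys

# Import names of the third-party packages the script relies on.
REQUIRED = ("torch", "transformers", "gtts", "gradio")


def missing_packages(names):
    """Return the top-level names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def install_missing(names):
    """Install any missing packages with pip and return what was missing."""
    missing = missing_packages(names)
    if missing:
        subprocess.check_call([sys.executable, "-m", "pip", "install", *missing])
    return missing
```

In the script, `install_missing(REQUIRED)` would need to run before the `torch`, `transformers`, `gtts`, and `gradio` imports. Listing the dependencies in a requirements file is the more conventional route; this sketch only keeps the self-installing behaviour the commit appears to intend.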
pipeline.PNG
ADDED
Git LFS Details