Upload 5 files
- 09_image_captioning/README.md +28 -0
- 09_image_captioning/main.py +27 -0
- 09_image_captioning/remiai.png +0 -0
- 09_image_captioning/requirements.txt +12 -0
- README.md +43 -0
09_image_captioning/README.md
ADDED
@@ -0,0 +1,28 @@
# Image Captioning (CPU/GPU)

- **Model:** `nlpconnect/vit-gpt2-image-captioning` (MIT)
- **Task:** Generate a caption for a given image.
- **Note:** This project only packages the resources needed to run the model on a laptop; we did not develop the model ourselves. It is an open-source model built by nlpconnect, used here for experimentation.

## Quick start (any project)

```bash
# 1) Create env
python -m venv .venv && source .venv/bin/activate  # Windows: .venv\Scripts\activate

# 2) Install deps
pip install -r requirements.txt

# 3) Run
python main.py --help
```

> Tip: If you have a GPU + CUDA, PyTorch will auto-use it. If not, everything runs on CPU (slower but works).

---

The script only produces a caption when you pass an image on the command line.

**Use:** `python main.py --image remiai.png` or `python main.py --image sample.jpg`

Otherwise you get output like this:

```
usage: main.py [-h] --image IMAGE
error: the following arguments are required: --image
```
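The required-argument behavior described above can be illustrated with a standalone sketch (hypothetical snippet, separate from `main.py`, using only the standard library):

```python
import argparse

def build_parser():
    # Mirrors the CLI described above: --image is required, --max_length is optional.
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--image", type=str, required=True)
    parser.add_argument("--max_length", type=int, default=20)
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args(["--image", "remiai.png"])
    print(args.image, args.max_length)  # remiai.png 20
```

Calling `parse_args` with no `--image` raises a `SystemExit` after printing the usage/error lines shown above.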
09_image_captioning/main.py
ADDED
@@ -0,0 +1,27 @@
import argparse

import torch
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

def main():
    parser = argparse.ArgumentParser(description="Caption an image with vit-gpt2.")
    parser.add_argument("--image", type=str, required=True, help="Path to the input image")
    parser.add_argument("--max_length", type=int, default=20, help="Max caption length in tokens")
    args = parser.parse_args()

    model_id = "nlpconnect/vit-gpt2-image-captioning"
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Load model, image processor, and tokenizer (downloaded on first run).
    model = VisionEncoderDecoderModel.from_pretrained(model_id).to(device)
    feature_extractor = ViTImageProcessor.from_pretrained(model_id)
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # The ViT encoder expects 3-channel RGB input.
    img = Image.open(args.image).convert("RGB")
    pixel_values = feature_extractor(images=[img], return_tensors="pt").pixel_values.to(device)

    with torch.no_grad():
        output_ids = model.generate(pixel_values, max_length=args.max_length)[0]
    caption = tokenizer.decode(output_ids, skip_special_tokens=True)
    print(caption)

if __name__ == "__main__":
    main()
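The `.convert("RGB")` step in `main.py` matters because the ViT encoder expects exactly three channels, while input files may be grayscale or RGBA. A minimal sketch with an in-memory image (hypothetical, no file needed; assumes Pillow from `requirements.txt` is installed):

```python
from PIL import Image

# Build a small RGBA test image in memory; any grayscale/RGBA input
# must be converted to 3-channel RGB first, exactly as main.py does.
img = Image.new("RGBA", (32, 32), (255, 0, 0, 128))
rgb = img.convert("RGB")
print(rgb.mode, rgb.size)  # RGB (32, 32)
```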
09_image_captioning/remiai.png
ADDED
09_image_captioning/requirements.txt
ADDED
@@ -0,0 +1,12 @@
torch==2.1.0
torchvision==0.16.0
torchaudio==2.1.0
transformers==4.38.2
datasets==2.18.0
Pillow==10.2.0
numpy==1.26.4
tqdm==4.66.2
sentencepiece==0.1.99
sentence-transformers==2.6.1
easyocr==1.7.1
openai-whisper
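Since the pins above must match the installed environment to be meaningful, a stdlib-only check (hypothetical helper, not part of the repo) can confirm a few of them:

```python
from importlib.metadata import version, PackageNotFoundError

def check_pin(pkg, pinned):
    # Report whether an installed distribution matches the pinned version.
    try:
        installed = version(pkg)
    except PackageNotFoundError:
        return f"{pkg}: not installed (expected {pinned})"
    status = "ok" if installed == pinned else "MISMATCH"
    return f"{pkg}: {installed} vs pinned {pinned} ({status})"

if __name__ == "__main__":
    for pkg, pinned in [("numpy", "1.26.4"), ("tqdm", "4.66.2"), ("Pillow", "10.2.0")]:
        print(check_pin(pkg, pinned))
```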
README.md
ADDED
@@ -0,0 +1,43 @@
# remiai3 – Universal AI Project Pack (9 CPU/GPU-Compatible Demos)

Each project runs **on CPU-only or GPU** with the same dependencies. All use **Apache 2.0 / MIT** licensed models.

## Quick start (any project)

```bash
# 1) Create env
python -m venv .venv && source .venv/bin/activate  # Windows: .venv\Scripts\activate

# 2) Install deps
pip install -r requirements.txt

# 3) Run
python main.py --help
```

> Tip: If you have a GPU + CUDA, PyTorch will auto-use it. If not, everything runs on CPU (slower but works).

---

## Projects

1. **Sentiment Analysis** – `distilbert-base-uncased-finetuned-sst-2-english` (Apache-2.0)
2. **Named Entity Recognition (NER)** – `dslim/bert-base-NER` (Apache-2.0)
3. **Text Summarization** – `sshleifer/distilbart-cnn-12-6` (Apache-2.0)
4. **Keyword Extraction (Embeddings)** – `sentence-transformers/all-MiniLM-L6-v2` (Apache-2.0)
5. **Simple Chatbot** – `microsoft/DialoGPT-small` (MIT)
6. **Image Classification** – `torchvision.models.mobilenet_v2` (Apache-2.0)
7. **OCR** – `easyocr` (Apache-2.0)
8. **Speech-to-Text** – `openai/whisper-tiny` (MIT)
9. **Image Captioning** – `nlpconnect/vit-gpt2-image-captioning` (MIT)

## Universal requirements

Each folder has its own `requirements.txt`, identical across all projects.
If you need CPU-only PyTorch wheels, install the default first (`pip install torch`) – it will fetch the right build automatically.
If you already have a CUDA wheel, the scripts will use it.

## System notes

- **FFmpeg needed** for Whisper STT (`brew install ffmpeg`, `choco install ffmpeg`, or use your package manager).
- **TTS downloads models on first run** to `~/.local/share/tts` (Linux/macOS) or `%APPDATA%\tts` (Windows).
- All code defaults to **English**; you can change languages in the code comments.
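The FFmpeg prerequisite can be checked before running the Speech-to-Text demo; a minimal stdlib sketch (hypothetical helper, not part of the repo):

```python
import shutil

def has_ffmpeg():
    # Whisper shells out to ffmpeg for audio decoding, so check PATH up front
    # and fail early with a clear message instead of a cryptic decode error.
    return shutil.which("ffmpeg") is not None

if __name__ == "__main__":
    if not has_ffmpeg():
        print("ffmpeg not found; install it before running the Speech-to-Text demo")
```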