---
pipeline_tag: image-to-text
datasets:
- Mouwiya/image-in-Words400
---

# BLIP Image Captioning

## Model Description

BLIP_image_captioning is an image-captioning model based on the BLIP (Bootstrapping Language-Image Pre-training) architecture. It was fine-tuned on the Mouwiya/image-in-Words400 dataset, which pairs images with descriptive captions, and it combines visual and textual features to generate accurate, contextually relevant captions.

## Model Details

- **Model Architecture**: BLIP (Bootstrapping Language-Image Pre-training)
- **Base Model**: Salesforce/blip-image-captioning-base
- **Fine-tuning Dataset**: Mouwiya/image-in-Words400
- **Number of Parameters**: 109 million

## Training Data

The model was fine-tuned on a shuffled subset of the **Mouwiya/image-in-Words400** dataset; a total of 400 examples were used to keep iteration and development fast.

## Training Procedure

- **Optimizer**: AdamW
- **Learning Rate**: 2e-5
- **Batch Size**: 16
- **Epochs**: 3
- **Evaluation Metric**: BLEU score

## Usage

To caption an image with this model, load it with the Hugging Face transformers library and run inference:

```python
from io import BytesIO

import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the processor and model
model_name = "Mouwiya/BLIP_image_captioning"
processor = BlipProcessor.from_pretrained(model_name)
model = BlipForConditionalGeneration.from_pretrained(model_name)

# Fetch an example image
image_url = "URL_OF_THE_IMAGE"
response = requests.get(image_url)
image = Image.open(BytesIO(response.content)).convert("RGB")

# Generate and decode the caption
inputs = processor(images=image, return_tensors="pt")
outputs = model.generate(**inputs)
caption = processor.decode(outputs[0], skip_special_tokens=True)
print(caption)
```

## Evaluation

The model was evaluated on a subset of the Mouwiya/image-in-Words400 dataset using the BLEU score:

- **Average BLEU Score**: 0.35

This score reflects how closely the generated captions match the reference descriptions in terms of overlapping n-grams.
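For readers who want a feel for the metric, the sketch below computes a sentence-level BLEU from scratch (uniform weights over 1- to 4-grams, brevity penalty, add-one smoothing). It is an illustration of the metric only, not the evaluation script used for this model, and the smoothing choice is an assumption.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams of the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(reference, candidate, max_n=4):
    """Sentence-level BLEU with uniform weights, brevity penalty,
    and add-one smoothing (so one empty n-gram order doesn't zero the score)."""
    ref, cand = reference.split(), candidate.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped overlap: each candidate n-gram counts at most as often as in the reference
        overlap = sum((cand_counts & ref_counts).values())
        total = max(sum(cand_counts.values()), 1)
        precisions.append((overlap + 1) / (total + 1))
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: penalize candidates shorter than the reference
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * geo_mean

score = sentence_bleu("a dog runs in the park", "a dog runs in the park")
print(round(score, 2))  # prints 1.0 for an exact match
```

Partial overlaps score between 0 and 1; production evaluations typically use an established implementation (e.g. NLTK or sacrebleu) rather than a hand-rolled one.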

## Limitations

- **Dataset Size**: Fine-tuning used a relatively small subset of the dataset (400 examples), which may limit the model's generalization.
- **Domain Specificity**: The model was trained on a single dataset and may perform worse on images from other domains.

## Contact

**Mouwiya S. A. Al-Qaisieh**
mo3awiya@gmail.com