library_name: peft
---
# Model Card for aryamanpathak/blip-vqa-abo

<!-- Provide a quick summary of what the model is/does. -->
This is a fine-tuned BLIP (Bootstrapping Language-Image Pre-training) model specifically adapted for **object-centric Visual Question Answering (VQA) with a focus on generating single-word answers**. Using Parameter-Efficient Fine-tuning (PEFT) with Low-Rank Adaptation (LoRA), the model is trained to answer natural language questions about the content of provided images, primarily aiming for concise, single-word responses about objects based on the Amazon Berkeley Objects (ABO) dataset.

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->
This model is a fine-tuned version of the `Salesforce/blip-vqa-base` model, specialized for an object-centric VQA task using the **Amazon Berkeley Objects (ABO) dataset**. A key characteristic of this fine-tuning is its focus on enabling the model to provide **single-word answers** to questions about objects and their properties. The fine-tuning was performed using Low-Rank Adaptation (LoRA), a parameter-efficient technique that injects small, trainable matrices into the pre-trained model's layers. This approach allows for efficient adaptation of the large BLIP model without the need to fine-tune all of its parameters, significantly reducing computational requirements while aiming to maintain strong performance on the specific target VQA domain and answer format. The model takes an image and a natural language question as input and generates a textual answer, primarily optimized for brevity.

- **Developed by:** Aryaman, Rutul, Shreyas
- **Model type:** Vision-Language Model (VLM), specifically designed for Visual Question Answering (VQA) with single-word answer generation.
- **Language(s) (NLP):** English
- **Finetuned from model:** `Salesforce/blip-vqa-base`
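
As described above, the adaptation uses LoRA through the PEFT library. The snippet below is a minimal sketch of how a LoRA adapter can be attached to the base BLIP model; the rank, scaling factor, dropout, and `target_modules` values are illustrative assumptions, not the exact configuration used for this checkpoint.

```python
# Minimal sketch of attaching a LoRA adapter to BLIP-VQA with PEFT.
# All hyperparameters below are illustrative assumptions, not the exact
# values used to train this checkpoint.
from transformers import BlipForQuestionAnswering
from peft import LoraConfig, get_peft_model

base_model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

lora_config = LoraConfig(
    r=8,                                # rank of the low-rank update matrices
    lora_alpha=16,                      # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["query", "value"],  # assumed attention projection layers
    bias="none",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()      # only the small LoRA matrices are trainable
```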
## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

### Direct Use

The model can be directly used for Visual Question Answering on images, particularly for questions related to objects similar to those in the **Amazon Berkeley Objects (ABO) dataset**. It is specifically fine-tuned to generate **single-word answers** to these questions. Given its training objective, it is expected to perform best on questions about identifying objects, their fundamental attributes, or simple quantities where a single term suffices as the answer. It can be used for inference via the Hugging Face Transformers library.
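
For example, inference with the adapter might look like the following sketch; the adapter repository id is taken from this model card's title, and the image URL and generation settings are placeholders rather than project-specified values.

```python
# Minimal inference sketch using Transformers + PEFT.
# The adapter repository id comes from this model card's title; the image URL
# and generation length are placeholders.
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering
from peft import PeftModel

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
base_model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
model = PeftModel.from_pretrained(base_model, "aryamanpathak/blip-vqa-abo")
model.eval()

image = Image.open(requests.get("https://example.com/product.jpg", stream=True).raw)
question = "What color is the chair?"

inputs = processor(images=image, text=question, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=5)  # answers are short
print(processor.decode(output_ids[0], skip_special_tokens=True))
```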

### Downstream Use

This model is suitable for integration into applications that require VQA capabilities focused on objects and where a **concise, single-word answer** is acceptable or preferred. Potential downstream uses include:
- Rapid object identification in visual search systems.

The project utilizes Python 3.11 and key libraries from the PyTorch and Hugging Face ecosystems, including:

- `bert-score` (for evaluation metric calculation; see the example below)
- `scikit-learn` (for general utilities, potentially used in evaluation)
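
A small, hypothetical sketch of how `bert-score` can be used to compare predicted single-word answers with reference answers is shown below; it is not necessarily the exact evaluation script used for this project.

```python
# Hypothetical sketch: scoring predicted single-word answers against references
# with BERTScore. Not necessarily the exact evaluation procedure used here.
from bert_score import score

predictions = ["chair", "red", "two"]     # model outputs (illustrative)
references = ["chair", "maroon", "two"]   # ground-truth answers (illustrative)

# P, R, F1 are tensors with one entry per prediction/reference pair.
P, R, F1 = score(predictions, references, lang="en", rescale_with_baseline=True)
print(f"Mean BERTScore F1: {F1.mean().item():.3f}")
```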
## Citation
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
**APA:**
Li, J., Li, D., Xiong, C., & Hoi, S. (2022). BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. *ArXiv, abs/2201.12086*.
## Glossary
- **VQA:** Visual Question Answering - The task of answering a natural language question about the content of an image.
- **BLIP:** Bootstrapping Language-Image Pre-training - The base vision-language model architecture.
## Model Card Authors

- Aryaman
- Rutul
- Shreyas

## Model Card Contact