library_name: peft
---
# Model Card for aryamanpathak/blip-vqa-abo

<!-- Provide a quick summary of what the model is/does. -->
This is a fine-tuned BLIP (Bootstrapping Language-Image Pre-training) model specifically adapted for **object-centric Visual Question Answering (VQA) with a focus on generating single-word answers**. Using Parameter-Efficient Fine-tuning (PEFT) with Low-Rank Adaptation (LoRA), the model is trained to answer natural language questions about the content of provided images, primarily aiming for concise, single-word responses about objects based on the Amazon Berkeley Objects (ABO) dataset.

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->
This model is a fine-tuned version of the `Salesforce/blip-vqa-base` model, specialized for an object-centric VQA task using the **Amazon Berkeley Objects (ABO) dataset**. A key characteristic of this fine-tuning is its focus on enabling the model to provide **single-word answers** to questions about objects and their properties. The fine-tuning was performed using Low-Rank Adaptation (LoRA), a parameter-efficient technique that injects small, trainable matrices into the pre-trained model's layers. This approach allows for efficient adaptation of the large BLIP model without the need to fine-tune all of its parameters, significantly reducing computational requirements while aiming to maintain strong performance on the specific target VQA domain and answer format. The model takes an image and a natural language question as input and generates a textual answer, primarily optimized for brevity.

- **Developed by:** Aryaman, Rutul, Shreyas
- **Model type:** Vision-Language Model (VLM), specifically designed for Visual Question Answering (VQA) with single-word answer generation.
- **Language(s) (NLP):** English
- **Finetuned from model:** `Salesforce/blip-vqa-base`
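
As described above, the adaptation uses LoRA through the PEFT library. The snippet below is a minimal sketch of how a LoRA adapter can be attached to the base BLIP model; the rank, scaling factor, dropout, and `target_modules` values are illustrative assumptions, not the exact configuration used for this checkpoint.

```python
# Minimal sketch of attaching a LoRA adapter to BLIP-VQA with PEFT.
# All hyperparameters below are illustrative assumptions, not the exact
# values used to train this checkpoint.
from transformers import BlipForQuestionAnswering
from peft import LoraConfig, get_peft_model

base_model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

lora_config = LoraConfig(
    r=8,                                # rank of the low-rank update matrices
    lora_alpha=16,                      # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["query", "value"],  # assumed attention projection layers
    bias="none",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()      # only the small LoRA matrices are trainable
```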
## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

### Direct Use

The model can be directly used for Visual Question Answering on images, particularly for questions related to objects similar to those in the **Amazon Berkeley Objects (ABO) dataset**. It is specifically fine-tuned to generate **single-word answers** to these questions. Given its training objective, it is expected to perform best on questions about identifying objects, their fundamental attributes, or simple quantities where a single term suffices as the answer. It can be used for inference via the Hugging Face Transformers library.
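
For example, inference with the adapter might look like the following sketch; the adapter repository id is taken from this model card's title, and the image URL and generation settings are placeholders rather than project-specified values.

```python
# Minimal inference sketch using Transformers + PEFT.
# The adapter repository id comes from this model card's title; the image URL
# and generation length are placeholders.
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering
from peft import PeftModel

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
base_model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
model = PeftModel.from_pretrained(base_model, "aryamanpathak/blip-vqa-abo")
model.eval()

image = Image.open(requests.get("https://example.com/product.jpg", stream=True).raw)
question = "What color is the chair?"

inputs = processor(images=image, text=question, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=5)  # answers are short
print(processor.decode(output_ids[0], skip_special_tokens=True))
```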

### Downstream Use

This model is suitable for integration into applications that require VQA capabilities focused on objects and where a **concise, single-word answer** is acceptable or preferred. Potential downstream uses include:
- Rapid object identification in visual search systems.

The project utilizes Python 3.11 and key libraries from the PyTorch and Hugging Face ecosystems, including:

- `bert-score` (for evaluation metric calculation; see the example below)
- `scikit-learn` (for general utilities, potentially used in evaluation)
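
A small, hypothetical sketch of how `bert-score` can be used to compare predicted single-word answers with reference answers is shown below; it is not necessarily the exact evaluation script used for this project.

```python
# Hypothetical sketch: scoring predicted single-word answers against references
# with BERTScore. Not necessarily the exact evaluation procedure used here.
from bert_score import score

predictions = ["chair", "red", "two"]     # model outputs (illustrative)
references = ["chair", "maroon", "two"]   # ground-truth answers (illustrative)

# P, R, F1 are tensors with one entry per prediction/reference pair.
P, R, F1 = score(predictions, references, lang="en", rescale_with_baseline=True)
print(f"Mean BERTScore F1: {F1.mean().item():.3f}")
```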
## Citation
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
**APA:**
Li, J., Li, D., Xiong, C., & Hoi, S. (2022). BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. *ArXiv, abs/2201.12086*.
## Glossary
- **VQA:** Visual Question Answering - The task of answering a natural language question about the content of an image.
- **BLIP:** Bootstrapping Language-Image Pre-training - The base vision-language model architecture.
## Model Card Authors

- Aryaman
- Rutul
- Shreyas

## Model Card Contact