aryamanpathak committed on
Commit b271686 · verified · 1 Parent(s): 811f591

Update README.md

Files changed (1):
  1. README.md +10 -27

README.md CHANGED
@@ -15,7 +15,7 @@ base_model: Salesforce/blip-vqa-base
  library_name: peft
  ---
 
- # Model Card for [Your-HuggingFace-Username]/[Your-Model-Name]
+ # Model Card for aryamanpathak/blip-vqa-abo
 
  <!-- Provide a quick summary of what the model is/does. -->
  This is a fine-tuned BLIP (Bootstrapping Language-Image Pre-training) model specifically adapted for **object-centric Visual Question Answering (VQA) with a focus on generating single-word answers**. Using Parameter-Efficient Fine-tuning (PEFT) with Low-Rank Adaptation (LoRA), the model is trained to answer natural language questions about the content of provided images, primarily aiming for concise, single-word responses about objects based on the Amazon Berkeley Objects (ABO) dataset.
@@ -27,31 +27,14 @@ This is a fine-tuned BLIP (Bootstrapping Language-Image Pre-training) model spec
  <!-- Provide a longer summary of what this model is. -->
  This model is a fine-tuned version of the `Salesforce/blip-vqa-base` model, specialized for an object-centric VQA task using the **Amazon Berkeley Objects (ABO) dataset**. A key characteristic of this fine-tuning is its focus on enabling the model to provide **single-word answers** to questions about objects and their properties. The fine-tuning was performed using Low-Rank Adaptation (LoRA), a parameter-efficient technique that injects small, trainable matrices into the pre-trained model's layers. This approach allows efficient adaptation of the large BLIP model without fine-tuning all of its parameters, significantly reducing computational requirements while aiming to maintain strong performance on the target VQA domain and answer format. The model takes an image and a natural language question as input and generates a textual answer, primarily optimized for brevity. (Hedged configuration and inference sketches follow this hunk.)
 
- - **Developed by:** [Your Name/Team Names]
- - **Funded by [optional]:** [Mention if funded by a specific grant or program, otherwise state N/A or omit]
- - **Shared by [optional]:** [Your Name/Your Hugging Face Account Name]
+ - **Developed by:** Aryaman, Rutul, Shreyas
  - **Model type:** Vision-Language Model (VLM), specifically designed for Visual Question Answering (VQA) with single-word answer generation.
  - **Language(s) (NLP):** English
- - **License:** [Specify the license you are using for your fine-tuned model and code (e.g., MIT, Apache 2.0). If unsure, select a suitable open-source license.]
- - **Finetuned from model [optional]:** `Salesforce/blip-vqa-base`
+ - **Finetuned from model:** `Salesforce/blip-vqa-base`
 
- ### Model Sources [optional]
-
- <!-- Provide the basic links for the model. -->
-
- - **Repository:** [Link to your project repository on GitHub, GitLab, etc., or link to the Kaggle notebook where the training was performed]
- - **Paper [optional]:** [Link to the original BLIP paper: https://arxiv.org/abs/2201.12086. Optional: If there's a paper about the ABO dataset, link it here.]
- - **Demo [optional]:** [If you create a Gradio or other simple demo, link it here. Otherwise, state N/A or omit.]
-
- ## Uses
-
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-
- ### Direct Use
-
- The model can be directly used for Visual Question Answering on images, particularly for questions related to objects similar to those in the **Amazon Berkeley Objects (ABO) dataset**. It is specifically fine-tuned to generate **single-word answers** to these questions. Given its training objective, it is expected to perform best on questions about identifying objects, their fundamental attributes, or simple quantities where a single term suffices as the answer. It can be used for inference via the Hugging Face Transformers library.
-
- ### Downstream Use [optional]
+ ### Downstream Use
 
  This model is suitable for integration into applications that require VQA capabilities focused on objects and where a **concise, single-word answer** is acceptable or preferred. Potential downstream uses include:
  - Rapid object identification in visual search systems.
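The commit does not include the training code, so the following is a minimal sketch of how a LoRA adapter of this kind is typically attached to `Salesforce/blip-vqa-base` with the PEFT library. The rank, scaling factor, dropout, and target modules are illustrative assumptions, not values taken from this model card.

```python
# Hypothetical LoRA setup; r, lora_alpha, lora_dropout, and target_modules
# are assumed values for illustration, not the settings of this checkpoint.
from transformers import BlipForQuestionAnswering
from peft import LoraConfig, get_peft_model

base = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

config = LoraConfig(
    r=16,                               # assumed rank of the low-rank update matrices
    lora_alpha=32,                      # assumed scaling factor
    lora_dropout=0.05,                  # assumed dropout on the LoRA path
    bias="none",
    target_modules=["query", "value"],  # assumed: text-side attention projections
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the injected LoRA matrices are trainable
```

Because only the low-rank matrices receive gradients, the trainable-parameter count stays at a small fraction of the base model, which is the computational saving the description above refers to.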
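For the Direct Use case (inference via the Hugging Face Transformers library), here is a minimal sketch, assuming the adapter weights live in this repository (`aryamanpathak/blip-vqa-abo`) and using a hypothetical local image `chair.jpg`:

```python
# Inference sketch: load the base BLIP VQA model, attach the LoRA adapter,
# and generate a short answer. Repo id and image path are assumptions.
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering
from peft import PeftModel

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
base = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
model = PeftModel.from_pretrained(base, "aryamanpathak/blip-vqa-abo")
model.eval()

image = Image.open("chair.jpg")        # hypothetical ABO-style product photo
question = "What color is the chair?"

inputs = processor(images=image, text=question, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=5)  # answers are short
print(processor.decode(output_ids[0], skip_special_tokens=True))  # e.g. "red"
```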
@@ -265,7 +248,7 @@ The project utilizes Python 3.11 and key libraries from the PyTorch and Hugging
  - `bert-score` (for evaluation metric calculation; a usage sketch follows this hunk)
  - `scikit-learn` (for general utilities, potentially used in evaluation)
 
- ## Citation [optional]
+ ## Citation
 
  <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
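Since `bert-score` is listed as the evaluation dependency, here is a hedged sketch of how the generated single-word answers might be scored against references. The exact evaluation protocol is not specified in this commit, and the prediction/reference pairs below are invented examples.

```python
# Hypothetical evaluation sketch with bert-score; the answer lists are
# made-up examples, not data from the ABO evaluation split.
from bert_score import score

predictions = ["red", "chair", "two"]   # model's single-word answers
references = ["red", "sofa", "two"]     # ground-truth single-word answers

# Returns per-example precision, recall, and F1 tensors.
P, R, F1 = score(predictions, references, lang="en")
print(f"mean BERTScore F1: {F1.mean().item():.3f}")
```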
@@ -284,10 +267,8 @@ The project utilizes Python 3.11 and key libraries from the PyTorch and Hugging
  **APA:**
  Li, J., Li, D., Xiong, C., & Hoi, S. (2022). BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. *ArXiv, abs/2201.12086*.
 
- [Optional: Include citation for the Amazon Berkeley Objects dataset if one is available.]
- [Optional: Include a citation for the PEFT library if space allows or if specifically requested by your project guidelines]
 
- ## Glossary [optional]
+ ## Glossary
 
  - **VQA:** Visual Question Answering - The task of answering a natural language question about the content of an image.
  - **BLIP:** Bootstrapping Language-Image Pre-training - The base vision-language model architecture.

@@ -301,9 +282,11 @@ Li, J., Li, D., Xiong, C., & Hoi, S. (2022). BLIP: Bootstrapping Language-Image
 
  [Optional: Add links to related work, dataset details, or any further information about the project or model.]
 
- ## Model Card Authors [optional]
+ ## Model Card Authors
 
- [Your Name/Team Members' Names]
+ - Aryaman
+ - Rutul
+ - Shreyas
 
  ## Model Card Contact
 