Update my_model/tabs/model_arch.py
my_model/tabs/model_arch.py
CHANGED
@@ -33,6 +33,7 @@ def run_model_arch() -> None:
 of Pre-Trained Large Language Models (PT-LLMs) and Pre-Trained Multimodal Models (PT-LMMs), which have
 transformed the machine learning landscape by utilizing expansive, pre-trained knowledge repositories to tackle
 complex tasks, thereby enhancing KB-VQA systems.
+
 An examination of existing Knowledge-Based Visual Question Answering (KB-VQA) methodologies led to a refined
 approach that converts visual content into the linguistic domain, creating detailed captions and object
 enumerations. This process leverages the implicit knowledge and inferential capabilities of PT-LLMs. The
@@ -40,11 +41,13 @@ def run_model_arch() -> None:
 to interpret visual contexts. The research also reviews current image representation techniques and knowledge
 sources, advocating for the utilization of implicit knowledge in PT-LLMs, especially for tasks that do not
 require specialized expertise.
+
 Rigorous ablation experiments were conducted to assess the impact of various visual context elements on model
 performance, with a particular focus on the importance of image descriptions generated during the captioning
 phase. The study includes a comprehensive analysis of major KB-VQA datasets, specifically the OK-VQA corpus,
 and critically evaluates the metrics used, incorporating semantic evaluation with GPT-4 to align the assessment
 with practical application needs.
+
 The evaluation results underscore the developed model’s competent and competitive performance. It achieves a
 VQA score of 63.57% under syntactic evaluation and excels with an Exact Match (EM) score of 68.36%. Further,
 semantic evaluations yield even more impressive outcomes, with VQA and EM scores of 71.09% and 72.55%,
@@ -63,6 +66,7 @@ def run_model_arch() -> None:
 selected for their initial effectiveness, are designed to be pluggable, allowing for easy replacement with more
 advanced models as new technologies develop, thus ensuring the module remains at the forefront of technological
 advancement.
+
 Following this, the Prompt Engineering Module processes the generated captions and the list of detected objects,
 along with their bounding boxes and confidence levels, merging these elements with the question at hand utilizing
 a meticulously crafted prompting template. The pipeline ends with a Fine-tuned Pre-Trained Large Language Model
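The VQA and Exact Match scores quoted in the diff can be illustrated with a small sketch. This assumes "EM" means exact string match against any reference answer and "VQA score" means the standard soft VQA accuracy (min(matches/3, 1) over human reference answers); the actual evaluation code in the repository may differ.

```python
# Illustrative sketch of the two syntactic metrics named in the text,
# assuming standard definitions; not the repository's actual code.

def exact_match(prediction, references):
    # EM: 1 if the normalized prediction equals any reference answer.
    norm = prediction.strip().lower()
    return int(any(norm == r.strip().lower() for r in references))

def vqa_accuracy(prediction, references):
    # Soft VQA accuracy: min(matches / 3, 1), so an answer given by at
    # least three human annotators receives full credit.
    norm = prediction.strip().lower()
    matches = sum(norm == r.strip().lower() for r in references)
    return min(matches / 3.0, 1.0)

refs = ["autumn", "autumn", "fall", "autumn", "fall"]
print(exact_match("Autumn", refs))            # 1
print(round(vqa_accuracy("fall", refs), 2))   # 0.67
```

Under these definitions EM is strictly harsher per answer, which is consistent with the semantic (GPT-4-based) evaluation scoring higher than the syntactic one.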
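The Prompt Engineering Module described above, which merges the caption, the detected objects with their bounding boxes and confidence levels, and the question into a single template, can be sketched roughly as follows. The function name, template wording, and example inputs are illustrative assumptions, not the actual code from `my_model/tabs/model_arch.py`.

```python
# Hypothetical sketch of the prompt-assembly step described in the text.

def build_prompt(question, caption, detections):
    """Merge caption, detected objects, and the question into one prompt.

    detections: list of (label, (x1, y1, x2, y2), confidence) tuples,
    as an object-detection module might produce.
    """
    object_lines = "\n".join(
        f"- {label} at {box}, confidence {conf:.2f}"
        for label, box, conf in detections
    )
    return (
        "Image caption:\n"
        f"{caption}\n\n"
        "Detected objects (label, bounding box, confidence):\n"
        f"{object_lines}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_prompt(
    question="What season is it?",
    caption="A person walks a dog through a park covered in fallen leaves.",
    detections=[("person", (12, 30, 140, 300), 0.97),
                ("dog", (150, 200, 260, 310), 0.91)],
)
print(prompt)
```

The resulting string would then be passed to the fine-tuned PT-LLM at the end of the pipeline; keeping the template in one place like this is also what makes the upstream captioning and detection modules pluggable, since only the prompt builder depends on their output format.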