Update my_model/tabs/model_arch.py
my_model/tabs/model_arch.py
CHANGED
@@ -33,6 +33,7 @@ def run_model_arch() -> None:
 of Pre-Trained Large Language Models (PT-LLMs) and Pre-Trained Multimodal Models (PT-LMMs), which have
 transformed the machine learning landscape by utilizing expansive, pre-trained knowledge repositories to tackle
 complex tasks, thereby enhancing KB-VQA systems.
+
 An examination of existing Knowledge-Based Visual Question Answering (KB-VQA) methodologies led to a refined
 approach that converts visual content into the linguistic domain, creating detailed captions and object
 enumerations. This process leverages the implicit knowledge and inferential capabilities of PT-LLMs. The
@@ -40,11 +41,13 @@ def run_model_arch() -> None:
 to interpret visual contexts. The research also reviews current image representation techniques and knowledge
 sources, advocating for the utilization of implicit knowledge in PT-LLMs, especially for tasks that do not
 require specialized expertise.
+
 Rigorous ablation experiments were conducted to assess the impact of various visual context elements on model
 performance, with a particular focus on the importance of image descriptions generated during the captioning
 phase. The study includes a comprehensive analysis of major KB-VQA datasets, specifically the OK-VQA corpus,
 and critically evaluates the metrics used, incorporating semantic evaluation with GPT-4 to align the assessment
 with practical application needs.
+
 The evaluation results underscore the developed model’s competent and competitive performance. It achieves a
 VQA score of 63.57% under syntactic evaluation and excels with an Exact Match (EM) score of 68.36%. Further,
 semantic evaluations yield even more impressive outcomes, with VQA and EM scores of 71.09% and 72.55%,
@@ -63,6 +66,7 @@ def run_model_arch() -> None:
 selected for their initial effectiveness, are designed to be pluggable, allowing for easy replacement with more
 advanced models as new technologies develop, thus ensuring the module remains at the forefront of technological
 advancement.
+
 Following this, the Prompt Engineering Module processes the generated captions and the list of detected objects,
 along with their bounding boxes and confidence levels, merging these elements with the question at hand utilizing
 a meticulously crafted prompting template. The pipeline ends with a Fine-tuned Pre-Trained Large Language Model
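The VQA and Exact Match scores quoted in the diff can be illustrated with a small sketch. This assumes "EM" means exact string match against any reference answer and "VQA score" means the standard soft VQA accuracy (min(matches/3, 1) over human reference answers); the actual evaluation code in the repository may differ.

```python
# Illustrative sketch of the two syntactic metrics named in the text,
# assuming standard definitions; not the repository's actual code.

def exact_match(prediction, references):
    # EM: 1 if the normalized prediction equals any reference answer.
    norm = prediction.strip().lower()
    return int(any(norm == r.strip().lower() for r in references))

def vqa_accuracy(prediction, references):
    # Soft VQA accuracy: min(matches / 3, 1), so an answer given by at
    # least three human annotators receives full credit.
    norm = prediction.strip().lower()
    matches = sum(norm == r.strip().lower() for r in references)
    return min(matches / 3.0, 1.0)

refs = ["autumn", "autumn", "fall", "autumn", "fall"]
print(exact_match("Autumn", refs))            # 1
print(round(vqa_accuracy("fall", refs), 2))   # 0.67
```

Under these definitions EM is strictly harsher per answer, which is consistent with the semantic (GPT-4-based) evaluation scoring higher than the syntactic one.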
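The Prompt Engineering Module described above, which merges the caption, the detected objects with their bounding boxes and confidence levels, and the question into a single template, can be sketched roughly as follows. The function name, template wording, and example inputs are illustrative assumptions, not the actual code from `my_model/tabs/model_arch.py`.

```python
# Hypothetical sketch of the prompt-assembly step described in the text.

def build_prompt(question, caption, detections):
    """Merge caption, detected objects, and the question into one prompt.

    detections: list of (label, (x1, y1, x2, y2), confidence) tuples,
    as an object-detection module might produce.
    """
    object_lines = "\n".join(
        f"- {label} at {box}, confidence {conf:.2f}"
        for label, box, conf in detections
    )
    return (
        "Image caption:\n"
        f"{caption}\n\n"
        "Detected objects (label, bounding box, confidence):\n"
        f"{object_lines}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_prompt(
    question="What season is it?",
    caption="A person walks a dog through a park covered in fallen leaves.",
    detections=[("person", (12, 30, 140, 300), 0.97),
                ("dog", (150, 200, 260, 310), 0.91)],
)
print(prompt)
```

The resulting string would then be passed to the fine-tuned PT-LLM at the end of the pipeline; keeping the template in one place like this is also what makes the upstream captioning and detection modules pluggable, since only the prompt builder depends on their output format.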