
	<div align="center">

	<br><b>RoViQA: a Visual Question Answering model that combines RoBERTa and ViT</b><br><br>
	This repository contains the code for RoViQA, a Visual Question Answering (VQA) model that combines image features extracted using Vision Transformer (ViT) and text features extracted using RoBERTa. The project includes training, inference, and various utility scripts.
	<br>GitHub: https://github.com/Tro-fish/RoViQA-Visual_Question_Answering
	</div>

	## Model Architecture
	<p align="center">
	<img src="https://github.com/Tro-fish/Visual-Question-Answering/assets/79634774/f7d0eb20-f3b4-4f69-880d-d412ed32ab68" alt="RoViQA model architecture diagram" width="100%" />
	</p>

	## RoViQA Overview
	RoViQA is a Visual Question Answering (VQA) model that answers natural-language questions about images. It encodes the image with a Vision Transformer (ViT) and the question with RoBERTa, then combines the two sets of features to predict an answer.
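
As a rough illustration of how image and text features can be combined, here is a minimal sketch of a late-fusion classification head in PyTorch. It is not taken from the RoViQA code: the concatenate-then-classify design, the `LateFusionHead` name, the hidden size, and the answer vocabulary size are all assumptions made for the example. The only fixed values are the 768-dimensional pooled outputs that ViT-base and RoBERTa-base produce. Random tensors stand in for the encoder outputs so the snippet runs without downloading any weights.

```python
import torch
import torch.nn as nn


class LateFusionHead(nn.Module):
    """Hypothetical fusion head: concatenate pooled features, then classify.

    This is an assumed design for illustration, not RoViQA's actual head.
    """

    def __init__(self, image_dim=768, text_dim=768, hidden_dim=1024, num_answers=1000):
        super().__init__()
        # hidden_dim and num_answers are placeholder values, not from the model card.
        self.classifier = nn.Sequential(
            nn.Linear(image_dim + text_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, num_answers),
        )

    def forward(self, image_features, text_features):
        # Join the two modalities along the feature axis: (batch, 768 + 768).
        fused = torch.cat([image_features, text_features], dim=-1)
        # One logit per candidate answer: (batch, num_answers).
        return self.classifier(fused)


# Stand-ins for the pooled ViT and RoBERTa outputs of a batch of two examples.
batch_size = 2
image_features = torch.randn(batch_size, 768)
text_features = torch.randn(batch_size, 768)

head = LateFusionHead()
head.eval()  # disable dropout for inference
with torch.no_grad():
    logits = head(image_features, text_features)

# Index of the highest-scoring answer for each example.
predicted_answer_ids = logits.argmax(dim=-1)
```

Treating VQA as classification over a fixed answer set keeps the head small, but other fusion schemes (for example, cross-attention between image patches and question tokens) are equally plausible; see the architecture figure and the GitHub repository for the actual design.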

	## Model Parameters
	- Base models
	  - RoBERTa-base: 110M parameters
	  - ViT-base: 86M parameters
	- RoViQA: 215M parameters
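
The listed counts imply that the two encoders do not account for the whole model. The short calculation below uses only the figures above to show the difference; attributing that remainder to the fusion and answer-prediction layers is an assumption, since the card does not break it down.

```python
# Parameter counts in millions, copied from the list above.
roberta_base_params_m = 110
vit_base_params_m = 86
roviqa_total_params_m = 215

# Parameters contributed by the two pretrained encoders.
encoder_params_m = roberta_base_params_m + vit_base_params_m

# Whatever is left over (assumed here to be fusion and classifier layers).
remaining_params_m = roviqa_total_params_m - encoder_params_m

# Fraction of the full model taken up by the encoders.
encoder_share = encoder_params_m / roviqa_total_params_m

print(f"Encoders:      {encoder_params_m}M")      # 196M
print(f"Remaining:     {remaining_params_m}M")    # 19M
print(f"Encoder share: {encoder_share:.1%}")      # 91.2%
```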