---
language:
- en
tags:
- vision-language
- vqa
- text-to-image-evaluation
license: mit
---
# Tiny Random VQAScore Model

This is a tiny, randomly initialized version of the VQAScore architecture for educational and testing purposes.
## Model Architecture

- **Vision Encoder**: Tiny CNN + Transformer (hidden size 64)
- **Language Model**: Tiny Transformer (hidden size 256)
- **Multimodal Projector**: MLP with layer sizes 256 → 128 → 64 → 1
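As a sanity check on the sizes above, the projector's parameter count can be computed by hand. This sketch assumes each stage of the MLP is a standard linear layer with a bias term (the actual implementation may differ):

```python
# Parameter count of the 256 -> 128 -> 64 -> 1 projector MLP,
# assuming standard linear layers with biases at each stage.
layer_sizes = [256, 128, 64, 1]

total = 0
for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
    total += fan_in * fan_out + fan_out  # weights + biases

print(total)  # 41217 parameters for the projector alone
```

Under that assumption the projector alone accounts for roughly 41K parameters, which is consistent with the ~50K total quoted below once the tiny encoders are included.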
## Usage

```python
from PIL import Image

from create_tiny_vqa_model import TinyVQAScore

# Load the randomly initialized model on CPU
model = TinyVQAScore(device="cpu")

# Score an image against a question
image = Image.open("your_image.jpg")
score = model.score(image, "What is shown in this image?")
print(f"VQA Score: {score}")
```
## Model Size

- **Parameters**: ~50K (vs. ~11B for the original XXL model)
- **Memory**: ~200KB (vs. ~22GB for the original XXL model)
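The ~200KB figure follows from storing the ~50K parameters as 32-bit floats, as this back-of-the-envelope check shows (the 50K count is the approximate figure quoted above):

```python
# Rough memory estimate: ~50K float32 parameters at 4 bytes each.
num_params = 50_000      # approximate parameter count from the list above
bytes_per_param = 4      # float32
memory_kb = num_params * bytes_per_param / 1024

print(f"~{memory_kb:.0f} KB")  # ~195 KB, in line with the ~200KB figure
```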
## Disclaimer

This model is randomly initialized for testing and educational purposes. It has not been trained and will not produce meaningful VQA results.