# MiniGPTv2 Project

### Overview
MiniGPTv2 is a multimodal large language model that combines vision and language capabilities. This repository contains the implementation of MiniGPTv2 fine-tuned on facial emotion recognition and detailed image understanding tasks.

### Model Architecture
- Base Architecture: MiniGPTv2
- LLM Backbone: Llama-2-7b-chat
- Image Size: 448×448
- Max Text Length: 3072 tokens
- LoRA Configuration: r=64, alpha=16
- Gradient Checkpointing: Enabled for the vision encoder, disabled for the LLM
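
As a rough illustration of what the LoRA settings above imply, the sketch below computes the standard LoRA scaling factor (`alpha / r`) and the number of parameters one adapter adds to a single square projection. The hidden size of 4096 is that of Llama-2-7b. This README does not say which projections are adapted, so the single-layer count is only an example, and the helper names are hypothetical.

```python
def lora_scaling(r: int, alpha: float) -> float:
    """Standard LoRA scaling applied to the low-rank update B @ A."""
    return alpha / r


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


R, ALPHA = 64, 16   # values from the configuration above
HIDDEN = 4096       # Llama-2-7b hidden size; adapted layers are an assumption

scaling = lora_scaling(R, ALPHA)
added = lora_param_count(HIDDEN, HIDDEN, R)
fraction = added / (HIDDEN * HIDDEN)

print(f"scaling = {scaling}")
print(f"adapter params per {HIDDEN}x{HIDDEN} projection = {added} ({fraction:.2%} of the frozen weight)")
```

With r=64 and alpha=16 the update is scaled by 0.25, so the adapter output is damped relative to a configuration where alpha equals r.
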

### Training Configuration
- Training Checkpoint: Epoch 88 (56,320 steps)
- Steps per Epoch: 640
- Batch Size: 1 (with gradient accumulation)
- Gradient Accumulation Steps: 16
- Learning Rate:
  - Initial: 3e-5
  - Minimum: 1e-6
  - Warmup start: 1e-6
- LR Schedule: Linear warmup with cosine decay
- Warmup Steps: 1000
- Weight Decay: 0.05
- Mixed Precision Training: Enabled
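
The schedule described above can be sketched per step as follows. This is a minimal illustration, not the scheduler used by the training script: it assumes the warmup value of 1e-6 is the starting learning rate, that decay is applied per step rather than per epoch, and that `max_steps` (hypothetical here, since the total training length is not stated) marks the end of the cosine decay.

```python
import math

INIT_LR = 3e-5
MIN_LR = 1e-6
WARMUP_START_LR = 1e-6
WARMUP_STEPS = 1000
STEPS_PER_EPOCH = 640


def lr_at_step(step: int, max_steps: int) -> float:
    """Linear warmup to INIT_LR, then cosine decay down to MIN_LR."""
    if step < WARMUP_STEPS:
        frac = step / WARMUP_STEPS
        return WARMUP_START_LR + (INIT_LR - WARMUP_START_LR) * frac
    # Fraction of the decay phase completed, clamped to [0, 1]
    progress = (step - WARMUP_STEPS) / max(1, max_steps - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return MIN_LR + 0.5 * (INIT_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


# Step count of the epoch 88 checkpoint, matching the figure listed above
checkpoint_step = 88 * STEPS_PER_EPOCH

# Hypothetical total length (100 epochs), used only to evaluate the schedule
max_steps = 100 * STEPS_PER_EPOCH

for step in (0, 500, WARMUP_STEPS, checkpoint_step, max_steps):
    print(step, f"{lr_at_step(step, max_steps):.3e}")
```

With a batch size of 1 and 16 accumulation steps, each optimizer update sees 16 samples per device. Whether the 640 steps per epoch count iterations or optimizer updates is not specified here.
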

### Dataset Composition
The model was trained on a mixture of datasets with the following sampling ratios:

- ShareGPT Detail: 30%
  - General visual conversation data
- GPT4Vision Face Detail: 10%
  - Facial analysis and description data
- Realistic Emotions Detail: 20%
  - Emotion recognition and interpretation data
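
The listed ratios sum to 60%, so one plausible reading is that they are relative weights rather than absolute shares. The sketch below assumes that reading, normalizes the weights, and draws a dataset for each training step. The dataset keys and helper names are hypothetical and are not taken from the training code.

```python
import random

# Relative sampling weights from the list above (keys are illustrative)
SAMPLING_RATIOS = {
    "sharegpt_detail": 30,
    "gpt4vision_face_detail": 10,
    "realistic_emotions_detail": 20,
}


def normalize(ratios: dict) -> dict:
    """Convert relative weights into probabilities that sum to 1."""
    total = sum(ratios.values())
    return {name: weight / total for name, weight in ratios.items()}


def sample_dataset(ratios: dict, rng: random.Random) -> str:
    """Pick the dataset that supplies the next training sample."""
    names = list(ratios)
    weights = [ratios[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


probs = normalize(SAMPLING_RATIOS)
rng = random.Random(0)  # fixed seed so the draw is reproducible
counts = {name: 0 for name in SAMPLING_RATIOS}
for _ in range(6000):
    counts[sample_dataset(SAMPLING_RATIOS, rng)] += 1

print(probs)
print(counts)
```

Under this assumption, ShareGPT Detail supplies about half of the samples, Realistic Emotions Detail about a third, and GPT4Vision Face Detail about a sixth.
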

### Usage
#### Requirements

```text
torch>=2.0.0
transformers>=4.28.0
timm
fairscale
accelerate
```

#### Loading the Model

```python
from minigptv2.model import MiniGPTv2

# Initialize the model
model = MiniGPTv2.from_pretrained(
    llama_model_path="/path/to/Llama-2-7b-chat-hf",
    checkpoint_path="/path/to/minigptv2_checkpoint.pth",
    image_size=448,
    max_txt_len=3072,
)

# Set to evaluation mode
model.eval()
```

#### Inference

```python
from PIL import Image
import torch

# Load the image and convert it to RGB
image = Image.open("example.jpg").convert("RGB")

# Generate a response; gradients are not needed at inference time
with torch.no_grad():
    response = model.generate(
        image=image,
        prompt="What emotions is this person expressing?",
        max_new_tokens=512,
    )

print(response)
```

#### Training
To continue training from the epoch 88 checkpoint:

```bash
python train.py --config /path/to/config.yaml --resume_ckpt_path /path/to/epoch88_checkpoint.pth
```

#### Evaluation

```bash
python evaluate.py --config /path/to/eval_config.yaml --checkpoint /path/to/epoch88_checkpoint.pth
```

### License
Apache-2.0, as declared in the metadata block at the end of this README (`license: apache-2.0`).

### Citation
[Citation information for MiniGPTv2 and any relevant papers]

### Acknowledgements
This project builds upon the MiniGPTv2 architecture and utilizes the Llama-2-7b-chat model. We thank the original authors for their contributions to the field.

---
license: apache-2.0
---