# MiniGPTv2 Project

### Overview
MiniGPTv2 is a multimodal large language model that combines vision and language capabilities. This repository contains the implementation of MiniGPTv2 fine-tuned on facial emotion recognition and detailed image understanding tasks.

### Model Architecture
- Base Architecture: MiniGPTv2
- LLM Backbone: Llama-2-7b-chat
- Image Size: 448×448
- Max Text Length: 3072 tokens
- LoRA Configuration: r=64, alpha=16
- Gradient Checkpointing: enabled for the vision encoder, disabled for the LLM
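To make the LoRA numbers above concrete, here is a minimal sketch of a LoRA-adapted linear layer in plain NumPy (not the repository's code — the actual adapters wrap the LLM's attention projections). With r=64 and alpha=16, the low-rank update is scaled by alpha / r = 0.25:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 128, 64, 16            # toy dimension; r=64, alpha=16 as configured
W = rng.normal(size=(d, d))          # frozen base weight
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-initialized
x = rng.normal(size=d)

scaling = alpha / r                  # 16 / 64 = 0.25
y = W @ x + scaling * (B @ (A @ x))  # LoRA forward pass

# With B initialized to zero, the adapter contributes nothing at step 0,
# so the adapted layer starts out identical to the frozen base layer.
assert np.allclose(y, W @ x)
```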
### Training Configuration
- Training Checkpoint: epoch 88 (56,320 steps)
- Steps per Epoch: 640
- Batch Size: 1 (with gradient accumulation)
- Gradient Accumulation Steps: 16
- Learning Rate:
  - Initial: 3e-5
  - Minimum: 1e-6
  - Warmup: 1e-6
- LR Schedule: linear warmup with cosine decay
- Warmup Steps: 1000
- Weight Decay: 0.05
- Mixed Precision Training: enabled
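The schedule above (linear warmup from the warmup LR to the initial LR over 1,000 steps, then cosine decay down to the minimum LR) can be sketched as a small function; the exact formula in the training code may differ slightly:

```python
import math

# Constants taken from the configuration listed above
INIT_LR, MIN_LR, WARMUP_LR = 3e-5, 1e-6, 1e-6
WARMUP_STEPS = 1000
TOTAL_STEPS = 88 * 640  # 56,320 steps at 640 steps per epoch

def lr_at(step: int) -> float:
    if step < WARMUP_STEPS:
        # Linear warmup from WARMUP_LR up to INIT_LR
        return WARMUP_LR + (INIT_LR - WARMUP_LR) * step / WARMUP_STEPS
    # Cosine decay from INIT_LR down to MIN_LR over the remaining steps
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (INIT_LR - MIN_LR) * (1 + math.cos(math.pi * progress))
```

Note that the checkpoint step count is consistent with the per-epoch figure: 88 epochs × 640 steps/epoch = 56,320 steps.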
### Dataset Composition
The model was trained on a mixture of datasets with the following sampling ratios:

- ShareGPT Detail (30%): general visual conversation data
- GPT4Vision Face Detail (10%): facial analysis and description data
- Realistic Emotions Detail (20%): emotion recognition and interpretation data
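An illustrative sketch of ratio-based dataset mixing (not the repository's sampler — the dataset keys are hypothetical). `random.choices` accepts unnormalized weights, so the listed percentages can be used directly even though the three shown here do not sum to 100:

```python
import random

# Sampling weights from the ratios listed above (hypothetical dataset names)
datasets = {
    "sharegpt_detail": 30,
    "gpt4vision_face_detail": 10,
    "realistic_emotions_detail": 20,
}

def sample_dataset(rng: random.Random) -> str:
    """Pick a dataset for the next training example, weighted by ratio."""
    names, weights = zip(*datasets.items())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in datasets}
for _ in range(6000):
    counts[sample_dataset(rng)] += 1
# Expected proportions among these three sources: roughly 3:1:2
```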
### Usage

#### Requirements

```text
torch>=2.0.0
transformers>=4.28.0
timm
fairscale
accelerate
```
#### Loading the Model

```python
from minigptv2.model import MiniGPTv2

# Initialize the model
model = MiniGPTv2.from_pretrained(
    llama_model_path="/path/to/Llama-2-7b-chat-hf",
    checkpoint_path="/path/to/minigptv2_checkpoint.pth",
    image_size=448,
    max_txt_len=3072
)

# Set to evaluation mode
model.eval()
```
#### Inference

```python
from PIL import Image
import torch

# Load image
image = Image.open("example.jpg").convert("RGB")

# Generate a response for the image and prompt
response = model.generate(
    image=image,
    prompt="What emotions is this person expressing?",
    max_new_tokens=512
)

print(response)
```
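If you need to inspect or replicate the preprocessing for the 448×448 input size, a hypothetical sketch follows; the normalization constants shown are the CLIP-style values commonly used by ViT vision encoders and are an assumption, not confirmed by this repository:

```python
from PIL import Image
import numpy as np

# Assumed CLIP-style normalization constants (verify against the actual config)
MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)

def preprocess(image: Image.Image, size: int = 448) -> np.ndarray:
    """Resize to size x size and normalize to a CHW float32 array."""
    image = image.convert("RGB").resize((size, size), Image.BICUBIC)
    arr = np.asarray(image, dtype=np.float32) / 255.0   # HWC in [0, 1]
    arr = (arr - MEAN) / STD                            # per-channel normalize
    return arr.transpose(2, 0, 1)                       # CHW layout for the model
```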
#### Training
To continue training from the epoch-88 checkpoint:

```bash
python train.py --config /path/to/config.yaml --resume_ckpt_path /path/to/epoch88_checkpoint.pth
```
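With batch size 1 and 16 accumulation steps, each optimizer update averages gradients over 16 micro-batches (an effective batch size of 16). A toy sketch of that accumulation pattern, using a stand-in gradient rather than the repository's training loop:

```python
import numpy as np

ACCUM_STEPS = 16                     # gradient accumulation steps, as configured
rng = np.random.default_rng(0)

w = np.zeros(4)                      # toy parameter vector
lr = 3e-5
grad_buffer = np.zeros_like(w)
updates = 0

# One "epoch": 640 optimizer steps x 16 micro-batches each
for micro_step in range(640 * ACCUM_STEPS):
    grad = rng.normal(size=w.shape)        # stand-in for a per-sample gradient
    grad_buffer += grad / ACCUM_STEPS      # accumulate the running mean gradient
    if (micro_step + 1) % ACCUM_STEPS == 0:
        w -= lr * grad_buffer              # one optimizer update per 16 samples
        grad_buffer[:] = 0.0
        updates += 1

# updates == 640, matching the "Steps per Epoch" figure above
```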
#### Evaluation

```bash
python evaluate.py --config /path/to/eval_config.yaml --checkpoint /path/to/epoch88_checkpoint.pth
```
### License
This project is released under the Apache-2.0 license (per the model card metadata in this file).

### Citation

[Citation information for MiniGPTv2 and any relevant papers]
### Acknowledgements
This project builds on the MiniGPTv2 architecture and uses the Llama-2-7b-chat model. We thank the original authors for their contributions to the field.

---
license: apache-2.0
---