---
license: apache-2.0
---

# Model Summary

**25EMBAI-VLM-FM** is a Vision-Language Foundation Model built by combining:

- **Vision Encoder:** ViT-H/14 (OpenCLIP)
- **Language Model:** Qwen-based LLM
- **Bridging Modules:** Resampler + Projector (image → LLM embedding space)

It takes an image, encodes it into patch tokens, compresses them into a fixed-length set of visual tokens, projects them into the language model’s hidden space, and then performs multimodal reasoning conditioned on a text prompt.

## Architecture Flow

Image → ViT-H/14 → Resampler → Projector → Qwen LLM → Text Output

LLM input format: `[Batch, K_image_tokens + T_text_tokens, D_hidden]`
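
For intuition, here is a minimal sketch of the bridging step, assuming a Perceiver-style resampler whose K learned queries cross-attend to the patch tokens, followed by a linear projector. The module names and dimensions are illustrative assumptions, not the model's actual implementation.

```
import torch
import torch.nn as nn

class Resampler(nn.Module):
    """Compress a variable number of ViT patch tokens into K fixed visual tokens."""
    def __init__(self, vit_dim=1280, k_tokens=64, n_heads=8):
        super().__init__()
        # K learned query vectors that attend over the patch tokens
        self.queries = nn.Parameter(torch.randn(k_tokens, vit_dim))
        self.attn = nn.MultiheadAttention(vit_dim, n_heads, batch_first=True)

    def forward(self, patch_tokens):                      # [B, N_patches, vit_dim]
        q = self.queries.expand(patch_tokens.size(0), -1, -1)
        out, _ = self.attn(q, patch_tokens, patch_tokens)
        return out                                        # [B, K, vit_dim]

class Projector(nn.Module):
    """Map visual tokens into the LLM's hidden space."""
    def __init__(self, vit_dim=1280, llm_dim=3584):
        super().__init__()
        self.proj = nn.Linear(vit_dim, llm_dim)

    def forward(self, visual_tokens):                     # [B, K, vit_dim]
        return self.proj(visual_tokens)                   # [B, K, llm_dim]

# The projected visual tokens are concatenated with the text embeddings:
# llm_input = torch.cat([projected_visual, text_embeds], dim=1)
# -> [Batch, K_image_tokens + T_text_tokens, D_hidden]
```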

## Training Summary

### Pre-training (Stage 1 & 2)

- Hardware: 8 × H100 80GB
- Stage 1 (3.6h): freeze ViT + LLM → train Resampler + Projector (staged freezing sketched below)
- Stage 2 (5.4h): unfreeze all → train end-to-end
- Data: ~2M image–caption pairs (BLIP3 style)
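
A minimal sketch of the two-stage freezing schedule, assuming the model exposes `vit`, `llm`, `resampler`, and `projector` submodules (illustrative attribute names, not the actual training code):

```
import torch.nn as nn

def set_stage(model: nn.Module, stage: int) -> None:
    """Toggle trainable parameters for the two pre-training stages."""
    if stage == 1:
        # Stage 1: freeze ViT + LLM; train only Resampler + Projector.
        for p in model.vit.parameters():
            p.requires_grad = False
        for p in model.llm.parameters():
            p.requires_grad = False
        for module in (model.resampler, model.projector):
            for p in module.parameters():
                p.requires_grad = True
    else:
        # Stage 2: unfreeze everything and train end-to-end.
        for p in model.parameters():
            p.requires_grad = True
```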

### Instruction Fine-tuning

- Data: ~2M images + ~200M text tokens
- Tasks: ~20 multimodal tasks (VQA, OCR, captioning, commands)
- max_length: 1024
- Effective batch size: ~64
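
Here "effective" folds in data parallelism and gradient accumulation; as an illustrative decomposition (the actual split is not specified), a per-device batch of 4 on 8 GPUs with 2 accumulation steps gives 4 × 8 × 2 = 64.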

# Usage

## Install

```
pip install torch transformers pillow
```

## Inference Example

```
from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
import torch
from PIL import Image

# Paths and settings (adjust to your environment)
model_path = '/home/raid/models/25EMBAI_save_test'
vision_model = 'ViT-H-14-378-quickgelu'   # OpenCLIP vision tower name (not used directly below)
vision_pretrained = 'dfn5b'               # OpenCLIP pretrained-weights tag (not used directly below)
dtype = torch.bfloat16
image_path = '/home/jason/git/UNIVA/25EMBAI_VLM_FM/qwen/train/sample.png'

# Load model, tokenizer, and image processor
# (trust_remote_code=True loads the custom modeling code shipped with the checkpoint)
model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
).to(device='cuda', dtype=dtype)
tokenizer = AutoTokenizer.from_pretrained(model_path)
image_processor = AutoImageProcessor.from_pretrained(
    model_path,
    trust_remote_code=True,
)
model.eval()

# Preprocess the input image
img = Image.open(image_path).convert("RGB")
pixel = image_processor(img, return_tensors="pt")["pixel_values"].to(
    dtype=dtype,
    device='cuda',
)

# Generate a description conditioned on the text prompt
prompt = 'Please describe this image.'
output = model.generate_text(
    images=pixel,
    prompt=prompt,
    max_new_tokens=512,
    do_sample=True,
    top_p=0.9,
    temperature=0.7,
)
print(output)
```
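
The example above samples stochastically (`do_sample=True` with nucleus sampling via `top_p`); assuming `generate_text` forwards the standard Hugging Face generation arguments, setting `do_sample=False` should produce deterministic greedy decoding instead.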

# Limitations & Biases

This model is an early-stage prototype. It will be updated and reorganized in future releases.

Because it was trained on web-scale multimodal data:

- It may reflect social biases and stereotypes.
- It may hallucinate, invent facts, or produce unverifiable content.
- It may perform suboptimally on:
  - Non-English languages
  - Specialized and domain-specific tasks
  - Safety-critical contexts

This model is not recommended for medical, legal, or safety-critical use without additional validation, guardrails, or fine-tuning. Users should apply external filtering, grounding, and safety alignment before deployment.
|