---
license: apache-2.0
---

# Model Summary

**25EMBAI-VLM-FM** is a Vision-Language Foundation Model built by combining:

- **Vision Encoder:** ViT-H/14 (OpenCLIP)
- **Language Model:** Qwen-based LLM
- **Bridging Modules:** Resampler + Projector (image → LLM embedding space)

It takes an image, encodes it into patch tokens, compresses them into a fixed-length set of visual tokens, projects them into the language model’s hidden space, and then performs multimodal reasoning conditioned on a text prompt.

## Architecture Flow

Image → ViT-H/14 → Resampler → Projector → Qwen LLM → Text Output

LLM input format: `[Batch, K_image_tokens + T_text_tokens, D_hidden]`
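
For intuition, here is a minimal sketch of the bridging step, assuming a Perceiver-style resampler whose K learned queries cross-attend to the patch tokens, followed by a linear projector. The module names and dimensions are illustrative assumptions, not the model's actual implementation.

```
import torch
import torch.nn as nn

class Resampler(nn.Module):
    """Compress a variable number of ViT patch tokens into K fixed visual tokens."""
    def __init__(self, vit_dim=1280, k_tokens=64, n_heads=8):
        super().__init__()
        # K learned query vectors that attend over the patch tokens
        self.queries = nn.Parameter(torch.randn(k_tokens, vit_dim))
        self.attn = nn.MultiheadAttention(vit_dim, n_heads, batch_first=True)

    def forward(self, patch_tokens):                      # [B, N_patches, vit_dim]
        q = self.queries.expand(patch_tokens.size(0), -1, -1)
        out, _ = self.attn(q, patch_tokens, patch_tokens)
        return out                                        # [B, K, vit_dim]

class Projector(nn.Module):
    """Map visual tokens into the LLM's hidden space."""
    def __init__(self, vit_dim=1280, llm_dim=3584):
        super().__init__()
        self.proj = nn.Linear(vit_dim, llm_dim)

    def forward(self, visual_tokens):                     # [B, K, vit_dim]
        return self.proj(visual_tokens)                   # [B, K, llm_dim]

# The projected visual tokens are concatenated with the text embeddings:
# llm_input = torch.cat([projected_visual, text_embeds], dim=1)
# -> [Batch, K_image_tokens + T_text_tokens, D_hidden]
```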

## Training Summary

### Pre-training (Stage 1 & 2)

- Hardware: 8 × H100 80GB
- Stage 1 (3.6h): freeze ViT + LLM → train Resampler + Projector (staged freezing sketched below)
- Stage 2 (5.4h): unfreeze all → train end-to-end
- Data: ~2M image–caption pairs (BLIP3 style)
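
A minimal sketch of the two-stage freezing schedule, assuming the model exposes `vit`, `llm`, `resampler`, and `projector` submodules (illustrative attribute names, not the actual training code):

```
import torch.nn as nn

def set_stage(model: nn.Module, stage: int) -> None:
    """Toggle trainable parameters for the two pre-training stages."""
    if stage == 1:
        # Stage 1: freeze ViT + LLM; train only Resampler + Projector.
        for p in model.vit.parameters():
            p.requires_grad = False
        for p in model.llm.parameters():
            p.requires_grad = False
        for module in (model.resampler, model.projector):
            for p in module.parameters():
                p.requires_grad = True
    else:
        # Stage 2: unfreeze everything and train end-to-end.
        for p in model.parameters():
            p.requires_grad = True
```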

### Instruction Fine-tuning

- Data: ~2M images + ~200M text tokens
- Tasks: ~20 multimodal tasks (VQA, OCR, captioning, commands)
- max_length: 1024
- Effective batch size: ~64
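
Here "effective" folds in data parallelism and gradient accumulation; as an illustrative decomposition (the actual split is not specified), a per-device batch of 4 on 8 GPUs with 2 accumulation steps gives 4 × 8 × 2 = 64.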

# Usage

## Install

```
pip install torch transformers pillow
```

## Inference Example

```
from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
import torch
from PIL import Image

# Paths and settings (adjust to your environment)
model_path = '/home/raid/models/25EMBAI_save_test'
vision_model = 'ViT-H-14-378-quickgelu'   # OpenCLIP vision tower name (not used directly below)
vision_pretrained = 'dfn5b'               # OpenCLIP pretrained-weights tag (not used directly below)
dtype = torch.bfloat16
image_path = '/home/jason/git/UNIVA/25EMBAI_VLM_FM/qwen/train/sample.png'

# Load model, tokenizer, and image processor
# (trust_remote_code=True loads the custom modeling code shipped with the checkpoint)
model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
).to(device='cuda', dtype=dtype)
tokenizer = AutoTokenizer.from_pretrained(model_path)
image_processor = AutoImageProcessor.from_pretrained(
    model_path,
    trust_remote_code=True,
)
model.eval()

# Preprocess the input image
img = Image.open(image_path).convert("RGB")
pixel = image_processor(img, return_tensors="pt")["pixel_values"].to(
    dtype=dtype,
    device='cuda',
)

# Generate a description conditioned on the text prompt
prompt = 'Please describe this image.'
output = model.generate_text(
    images=pixel,
    prompt=prompt,
    max_new_tokens=512,
    do_sample=True,
    top_p=0.9,
    temperature=0.7,
)
print(output)
```
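
The example above samples stochastically (`do_sample=True` with nucleus sampling via `top_p`); assuming `generate_text` forwards the standard Hugging Face generation arguments, setting `do_sample=False` should produce deterministic greedy decoding instead.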

# Limitations & Biases

This model is an early-stage prototype. It will be updated and reorganized in future releases.

Because it was trained on web-scale multimodal data:

- It may reflect social biases and stereotypes.
- It may hallucinate, invent facts, or produce unverifiable content.
- It may perform suboptimally on:
  - Non-English languages
  - Specialized and domain-specific tasks
  - Safety-critical contexts

This model is not recommended for medical, legal, or safety-critical use without additional validation, guardrails, or fine-tuning. Users should apply external filtering, grounding, and safety alignment before deployment.
|