---
license: apache-2.0
language:
- en
tags:
- vision-language
- video-understanding
- foveated-attention
- multimodal
- smollm2
- dinov2
library_name: pytorch
pipeline_tag: image-text-to-text
model-index:
- name: fVLM-1.7B
results:
- task:
type: video-question-answering
name: Video Question Answering
dataset:
type: MVBench
name: MVBench
metrics:
- type: accuracy
value: 30.8
name: Accuracy (coarse_only)
- task:
type: video-question-answering
name: Video Question Answering
dataset:
type: Video-MME
name: Video-MME
metrics:
- type: accuracy
value: 30.5
name: Accuracy (coarse_only)
- task:
type: question-answering
name: Science Question Answering
dataset:
type: ScienceQA
name: ScienceQA
metrics:
- type: accuracy
value: 49.0
name: Accuracy (coarse_only)
---
# fVLM-1.7B (Foveated Vision-Language Model)
A vision-language model that uses **foveated attention** to compress each video frame into a **single visual token**, enabling efficient processing of long videos on a single GPU.
## Model Description
**fVLM-1.7B** is built on **SmolLM2-1.7B-Instruct** (language backbone) + **DINOv2-small** (vision encoder), connected via a foveated cross-attention mechanism that compresses each video frame into **1 visual token**. This extreme compression enables processing 64+ frames within the same context window budget that traditional VLMs use for a single image.
### Architecture
| Component | Details |
|-----------|---------|
| **Language Model** | SmolLM2-1.7B-Instruct |
| **Vision Encoder** | DINOv2-small |
| **Attention** | Deep query-guided foveated cross-attention |
| **Visual Tokens** | 1 token per frame (query-compressed) |
| **Total Parameters** | ~1.84B |
| **Query Dimension** | 384 |
| **LLM Dimension** | 2048 |
| **Visual Scale** | 0.14 |
### How Foveated Attention Works
Unlike standard VLMs that use many visual tokens per image (e.g., 576 for LLaVA), fVLM compresses each frame to a **single visual token** using a learned query mechanism:
1. **DINOv2** encodes each frame into patch features and caches K/V at every layer
2. A **query vector** is propagated through all 12 DINO layers, attending to patch K/V at each layer (deep query attention)
3. The single output token is projected to LLM dimension and prepended to the text sequence
4. The **LLM generates the next query** from its hidden state, creating a feedback loop where the model learns *where to look*
This enables processing **64+ frames** with the same memory as a few frames in traditional VLMs.
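The four steps above can be sketched in PyTorch. This is a minimal illustration, not the released implementation: the module name `DeepQueryPooler`, the single attention head count, and the residual query update are assumptions; only the dimensions (query 384, LLM 2048, 12 DINO layers, 1 output token) come from the model card.

```python
import torch
import torch.nn as nn

class DeepQueryPooler(nn.Module):
    """Illustrative sketch: one query attends to cached patch K/V at
    every DINO layer, producing a single visual token per frame."""
    def __init__(self, query_dim=384, llm_dim=2048, num_layers=12):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, query_dim))  # static coarse query
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(query_dim, num_heads=6, batch_first=True)
            for _ in range(num_layers)
        )
        self.proj = nn.Linear(query_dim, llm_dim)  # project to LLM dimension

    def forward(self, patch_feats):
        # patch_feats: list of [B, N, query_dim] patch features, one per layer
        q = self.query.expand(patch_feats[0].size(0), -1, -1)
        for attn, kv in zip(self.attn, patch_feats):
            out, _ = attn(q, kv, kv)  # attend to this layer's patches
            q = q + out               # propagate the refined query deeper
        return self.proj(q)           # [B, 1, llm_dim] -> one visual token

pooler = DeepQueryPooler()
feats = [torch.randn(2, 256, 384) for _ in range(12)]  # 16x16 patches per frame
token = pooler(feats)
print(token.shape)  # torch.Size([2, 1, 2048])
```

In the full model, step 4 replaces the static query: the LLM's hidden state generates the next query, closing the feedback loop.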
### Inference Modes
fVLM supports three forward modes with different speed/quality tradeoffs:
| Mode | Description | Use Case |
|------|-------------|----------|
| `coarse_only` | Single static-query pass | Fastest; good for images and quick inference |
| `coarse_fine` | Two-pass parallel forward (soft attention) | Training mode; uses foveated attention |
| `autoregressive` | Sequential with KV cache (hard attention) | Iterative foveation for video |
## Benchmark Results
### fVLM-1.7B (Stage 3 DPO)
| Benchmark | Coarse-Only | Coarse→Fine | Autoregressive |
|-----------|-------------|-------------|----------------|
| MVBench (3800) | 30.8% | 29.9% | 29.9% |
| Video-MME (2700) | 30.5% | 28.2% | 30.4% |
| ScienceQA (2017) | 49.0% | 43.8% | 46.6% |
### fVLM-135M (Stage 3 DPO) — for comparison
| Benchmark | Coarse-Only | Coarse→Fine | Autoregressive |
|-----------|-------------|-------------|----------------|
| MVBench | 27.4% | 28.0% | 27.9% |
| Video-MME | 26.2% | 29.5% | 28.7% |
| ScienceQA | 36.4% | 35.6% | 35.4% |
**Scaling gain (1.7B vs 135M):** +3.4pp MVBench, +4.3pp Video-MME, +12.6pp ScienceQA (coarse-only).
## Training
Trained with a **3-stage pipeline** (alignment, SFT, DPO) on a **single A100-80GB GPU**. Total training time: ~16 hours.
### Stage 1: Visual Alignment (4.3h, 31,250 steps)
- **Objective**: Align DINOv2 visual features with the SmolLM2 text embedding space
- **Data**: OpenVid-1M (905K) + WebVid (19K) + 14% SmolTalk text retention
- **Loss**: Full-text cross-entropy (predict all tokens)
- **LR**: Converging schedule -- connector 1e-3 to 3e-5, backbone 1e-5 to 3e-5
- **Batch size**: 32
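The converging schedule can be sketched as two rates interpolating toward a common endpoint. The interpolation shape (linear here) is an assumption for illustration; the card only gives the start and end values.

```python
def converging_lr(step, total_steps, start, end):
    """Interpolate from `start` to `end` over training.
    Linear shape is an assumption; only the endpoints are documented."""
    t = min(step / total_steps, 1.0)
    return start + (end - start) * t

total = 31_250
# Connector decays from 1e-3; backbone warms from 1e-5; both meet at 3e-5
connector_lr = converging_lr(total, total, 1e-3, 3e-5)
backbone_lr = converging_lr(total, total, 1e-5, 3e-5)
```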
### Stage 2: Vision-Language SFT (9.5h, 31,250 steps)
- **Objective**: Supervised fine-tuning on vision-language tasks
- **Data**: Cauldron (2M images) + video datasets (~1.6M) + 14% SmolTalk text retention
- **Loss**: Answer-only cross-entropy (mask user/system tokens)
- **LR**: Flat 3e-5 all components with cosine decay
- **Batch size**: 32, gradient checkpointing enabled
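Answer-only cross-entropy can be sketched by masking all non-answer positions with an ignore index. The helper below is illustrative (the real pipeline derives the answer span from the chat template); `-100` is the conventional value ignored by `F.cross_entropy` and, unlike a token ID, cannot collide with EOS.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # ignored by F.cross_entropy; cannot collide with a token id

def answer_only_labels(input_ids, answer_start):
    """Mask user/system tokens so only answer tokens contribute to the loss.
    `answer_start` is a hypothetical precomputed offset for illustration."""
    labels = input_ids.clone()
    labels[:, :answer_start] = IGNORE_INDEX
    return labels

input_ids = torch.tensor([[5, 8, 2, 9, 4, 7]])
labels = answer_only_labels(input_ids, answer_start=3)
logits = torch.randn(1, 6, 32)
# Shift so position t predicts token t+1 (standard causal LM loss)
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, 32),
    labels[:, 1:].reshape(-1),
    ignore_index=IGNORE_INDEX,
)
```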
### Stage 3: DPO Preference Optimization (1.9h, 2,593 steps)
- **Objective**: Align outputs with human preferences
- **Data**: RLAIF-V (83K preference pairs)
- **Loss**: DPO with beta=0.1
- **LR**: 5e-7 all components
- **Batch size**: 8, grad accumulation 4 (effective batch 32), gradient checkpointing enabled
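The Stage 3 objective is the standard DPO loss: the log-sigmoid of the beta-scaled difference in policy-vs-reference log-ratios between chosen and rejected responses. A minimal sketch with the card's beta=0.1 (the inputs are summed response log-probabilities; the toy values are made up):

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO objective: reward margin is the beta-scaled gap
    between policy and reference log-ratios of chosen vs rejected."""
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -F.logsigmoid(logits).mean()

# Toy log-probs: policy favors the chosen response more than the reference does
loss = dpo_loss(
    torch.tensor([-10.0]), torch.tensor([-14.0]),
    torch.tensor([-11.0]), torch.tensor([-13.0]),
)
```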
## Bug Fixes in This Version
This release includes several important bug fixes over earlier checkpoints:
1. **`eos_token` / `ignore_index` collision**: The EOS token ID was colliding with the `ignore_index` value used in cross-entropy loss, causing the model to never learn to produce EOS tokens properly. Fixed by using a non-colliding ignore index.
2. **Stage 2 OOM skip rate fix**: During Stage 2 SFT training, out-of-memory errors on large batches were being silently skipped at a high rate, effectively reducing the training data seen. Fixed to properly handle memory management and reduce skip rate.
3. **Benchmark letter-bias fix**: The benchmark evaluation code had a bias toward certain answer letters in multiple-choice questions, inflating scores for some options. Fixed to ensure fair evaluation across all answer choices.
## Files
| File | Description |
|------|-------------|
| `checkpoint.pt` | Stage 3 (DPO) final checkpoint (step 2593) -- PyTorch format |
| `model.safetensors` | Model weights in safetensors format (previous version) |
| `model.py` | Full model architecture code |
| `train.py` | Training script (all 3 stages) |
| `data.py` | Data loading and preprocessing |
| `benchmark.py` | Benchmark evaluation code |
| `logger.py` | Logging utilities |
| `benchmark_results.json` | Full benchmark results with per-category breakdowns |
## Usage
### Setup
```python
import torch
from torchvision import transforms
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download
# Download checkpoint
ckpt_path = hf_hub_download("sanps/fVLM-1.7B", "checkpoint.pt")
# Build model
from model import FoveatedVLM
model = FoveatedVLM(
llm_name="HuggingFaceTB/SmolLM2-1.7B-Instruct",
dino_name="facebook/dinov2-small",
query_dim=384,
visual_scale=0.14,
deep_query=True,
)
# Load weights (weights_only=False is needed for this pickled checkpoint;
# only load checkpoints from sources you trust)
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model"] if "model" in ckpt else ckpt)
model = model.to("cuda").to(torch.bfloat16).eval()
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B-Instruct")
# Standard DINO preprocessing
frame_transform = transforms.Compose([
transforms.Resize((224, 224)),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```
### Image Input
**Important**: fVLM treats all inputs as video. Static images must be **replicated to 8 frames** to match training distribution.
```python
from PIL import Image
img = Image.open("photo.jpg").convert("RGB")
frame_tensor = frame_transform(img) # [3, 224, 224]
frames = frame_tensor.unsqueeze(0).repeat(8, 1, 1, 1) # [8, 3, 224, 224]
frames = frames.unsqueeze(0).to("cuda", dtype=torch.bfloat16) # [1, 8, 3, 224, 224]
```
### Video Input
For video, sample up to 64 frames uniformly. No replication needed.
```python
tensors = [frame_transform(f) for f in video_frames]
frames = torch.stack(tensors).unsqueeze(0).to("cuda", dtype=torch.bfloat16)
# frames shape: [1, T, 3, 224, 224] where T = number of frames (1-64)
```
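A simple way to pick the frames is uniform index sampling over the video. This helper is illustrative; the card only states "sample up to 64 frames uniformly", so the rounding strategy is an assumption.

```python
import torch

def sample_frame_indices(num_frames, max_frames=64):
    """Return up to `max_frames` indices spread uniformly over the video."""
    if num_frames <= max_frames:
        return list(range(num_frames))
    idx = torch.linspace(0, num_frames - 1, max_frames)
    return idx.round().long().tolist()

print(sample_frame_indices(10))        # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(len(sample_frame_indices(900)))  # 64
```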
### Inference
```python
messages = [
{"role": "user", "content": "Describe what is happening in this image."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input_ids = tokenizer.encode(text, return_tensors="pt").to("cuda")
attention_mask = torch.ones_like(input_ids)
loss_mask = torch.ones_like(input_ids, dtype=torch.float32)
with torch.no_grad(), torch.amp.autocast("cuda", dtype=torch.bfloat16):
result = model(
frames=frames,
input_ids=input_ids,
attention_mask=attention_mask,
loss_mask=loss_mask,
mode="coarse_fine", # or "coarse_only" or "autoregressive"
)
# result["logits"]: [B, S, V] text logits
# result["loss"]: scalar cross-entropy loss
```
## Citation
If you use this model, please cite:
```bibtex
@misc{fvlm2025,
title={fVLM: Foveated Vision-Language Model},
author={Sandeep Sampath Kumar},
year={2025},
url={https://huggingface.co/sanps/fVLM-1.7B}
}
```
## License
Apache 2.0