---
license: apache-2.0
datasets:
- liuhaotian/LLaVA-Instruct-150K
- liuhaotian/LLaVA-Pretrain
base_model:
- microsoft/Phi-4-mini-reasoning
- kevin510/fast-vit-hd
library_name: transformers
tags:
- vision-language
- multimodal
- friday
- custom_code
- bf16
---

# Friday-VLM

Friday-VLM is a multimodal (image + text) LLM fine-tuned on image and text instruction data. The architecture and configuration live in this repository, so callers must load the model with `trust_remote_code=True`.

---

# Model variants

| Repo ID | Precision | File format | Typical VRAM* | Size on disk |
|---------|-----------|-------------|---------------|--------------|
| `kevin510/friday` | **bf16** (full) | `safetensors` | 100 % | 100 % |
| `kevin510/friday-fp4` | **fp4** (bitsandbytes 4-bit) | `safetensors` | ≈ 30 % | ≈ 25 % |

\* Approximate, relative to the bf16 variant (`kevin510/friday`). The fp4 repo loads exactly like the bf16 one; an on-the-fly 4-bit loading sketch is included at the end of this card.

---

# Dependencies

```bash
conda create --name friday python=3.12 -y
conda activate friday
pip install transformers torch torchvision deepspeed accelerate pillow einops timm
```

# Quick start

```python
import torch
from PIL import Image
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("kevin510/friday", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "kevin510/friday",
    trust_remote_code=True,
    device_map="auto"
)
model.eval()

# Build the chat-style prompt expected by the model.
prompt = "Describe this image."
user_prompt = f"<|user|>\n{prompt}\n<|assistant|>"
inputs = tok(user_prompt, return_tensors="pt").to(model.device)

image = Image.open("my_image.jpg").convert("RGB")

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,
        images=[image]
    )

print(tok.decode(out[0], skip_special_tokens=False))
```

# Architecture at a glance

```
FastViT-HD ─▶ 3072-d patch embeddings
           ─▶ S2 6144-d patch embeddings
           ─▶ 2-layer MLP vision-adapter (6144 → 3072)

vision tokens (3072 d) ─┐
                        ├─► Φ-4-mini-reasoning (2.7 B params, hidden = 3072)
text tokens ────────────┘      │
                               │ (standard self-attention only;
                               │  language tower is frozen at finetune)
```

A schematic PyTorch sketch of the vision adapter appears at the end of this card.

# Limitations & Responsible AI

Friday-VLM may hallucinate objects, invent facts, or reproduce societal biases. All variants share the same behaviour profile; quantisation does not filter or sanitise model outputs. Users must apply their own content-safety layer before deployment.

# Citation

```bibtex
@misc{friday2025,
  title  = {Friday VLM: Efficient Instruction-Tuned Vision–Language Modelling},
  author = {Kevin Rohling},
  year   = {2025},
  url    = {https://huggingface.co/kevin510/friday}
}
```
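
# Loading in 4-bit on the fly (sketch)

The pre-quantised `kevin510/friday-fp4` repo loads exactly like the quick-start example above (just swap the repo ID). If you would rather quantise the full-precision checkpoint yourself, the sketch below uses the standard `BitsAndBytesConfig` path in `transformers`. Whether the custom Friday modules are fully compatible with bitsandbytes 4-bit linear layers is an assumption here, not a documented guarantee, and `bitsandbytes` must be installed separately.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Hypothetical on-the-fly fp4 quantisation of the bf16 checkpoint.
# Requires `pip install bitsandbytes`; compatibility with the custom
# Friday modules is assumed, not guaranteed by this card.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tok = AutoTokenizer.from_pretrained("kevin510/friday", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "kevin510/friday",
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map="auto",
)
model.eval()
```

Generation then works exactly as in the quick start, including the `images=[image]` argument.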
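
# Vision-adapter sketch

To make the architecture diagram concrete, here is a minimal, schematic PyTorch version of the 2-layer MLP vision adapter. Only the 6144 → 3072 projection comes from the diagram above; the GELU activation, the second layer's width, and the 576-patch example are assumptions, and this is not the code shipped in the repository.

```python
import torch
import torch.nn as nn

class VisionAdapter(nn.Module):
    """Schematic 2-layer MLP projecting S2 patch embeddings (6144-d)
    into the language model's hidden size (3072-d).

    Hypothetical sketch: everything beyond the 6144 -> 3072 projection
    (activation, layer widths) is assumed, not the checkpoint's actual modules.
    """

    def __init__(self, in_dim: int = 6144, out_dim: int = 3072):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, 6144) -> (batch, num_patches, 3072)
        return self.proj(patch_embeddings)


# Example: one image with 576 patches of 6144-d S2 features -> 3072-d vision tokens.
tokens = VisionAdapter()(torch.randn(1, 576, 6144))
print(tokens.shape)  # torch.Size([1, 576, 3072])
```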