---
license: apache-2.0
library_name: pytorch
tags:
- multimodal
- vision-language
- tiny-model
- pytorch
- experimental
---

# Tiny Multimodal Pretrained 68k

This repository contains a clean pretrained checkpoint for Tiny Multimodal, a small from-scratch native multimodal Transformer experiment.

This is a **pretrained base model**, not an instruction-tuned chat assistant.

## Files

```text
ckpt_last.pt
args.json
tokenizer_8k_clean/
  vocab.json
  merges.txt
```

## Model

- ~14.7M parameters
- decoder-only Transformer
- 8 layers
- hidden size 304
- 8 attention heads
- 8192-token ByteLevel BPE tokenizer
- 128x128 image input
- 16x16 image patches, 64 image patch tokens

Images are converted into patch vectors and inserted into the causal token sequence:

```text
<|bos|> <|image|> [64 image patch vectors] <|/image|> text...
```

## Training

This checkpoint is from the clean pretraining line, stopped around the ~68k step range.

The data mix was:

- TinyStories for compact story structure
- filtered Cosmopedia-style story/explanation text
- COCO captions for visual grounding
- Flickr captions for additional image-caption variety

## Current Behavior

The model is starting to learn some visual grounding. In simple manual tests, color words and broad scene cues started to affect generations, for example red images increasing red-related text.

The vision side is still pretty bad. It is weak at object identity, counting, spatial details, and robust visual question answering. Use this as a toy research checkpoint, not a reliable assistant.
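
For reference, the patch arithmetic from the Model section (128x128 input, 16x16 patches, 64 patch tokens) can be checked with a short NumPy sketch. This is only an illustration, not the repository's actual preprocessing: the `patchify` helper, the 3-channel RGB layout, and the row-major patch order are assumptions, and the learned projection from each flattened patch to the hidden size of 304 is not shown.

```python
import numpy as np

IMAGE_SIZE = 128  # from the model card: 128x128 image input
PATCH_SIZE = 16   # from the model card: 16x16 image patches


def patchify(image: np.ndarray, patch_size: int = PATCH_SIZE) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Hypothetical helper: patches are returned in row-major order
    (left to right, then top to bottom), which is an assumption.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    rows, cols = h // patch_size, w // patch_size
    # (rows, P, cols, P, C) -> (rows, cols, P, P, C) -> (rows * cols, P * P * C)
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch_size * patch_size * c)


# Assumed RGB input; a blank image is enough to check the shapes.
image = np.zeros((IMAGE_SIZE, IMAGE_SIZE, 3), dtype=np.float32)
patches = patchify(image)
print(patches.shape)  # (64, 768)
```

With these assumptions, each image yields 8 x 8 = 64 patch vectors of 16 x 16 x 3 = 768 values, matching the 64 image patch tokens placed between `<|image|>` and `<|/image|>`.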

## Usage

Clone the code repository:

```bash
git clone https://github.com/vidvudsc/MultiModal
cd MultiModal
```

Download this model folder into:

```text
models/pretrained_68k/
```

Run local inference:

```bash
python3 infer.py \
  --checkpoint models/pretrained_68k/ckpt_last.pt \
  --tokenizer_dir models/pretrained_68k/tokenizer_8k_clean \
  --index index.html \
  --host 127.0.0.1 \
  --port 7860 \
  --device mps \
  --dtype float32 \
  --temperature 0.5 \
  --top_k 30 \
  --prompt_format plain
```

Use `--prompt_format plain` for this pretrained base checkpoint. Use chat format only after supervised fine tuning.
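
To give a feel for what `--temperature 0.5` and `--top_k 30` typically control, here is a minimal sketch of standard temperature-scaled top-k sampling over a vector of logits. It is a generic illustration, not the code in `infer.py`: the `sample_top_k` function is hypothetical, and the repository may implement sampling differently.

```python
import numpy as np


def sample_top_k(logits, temperature=0.5, top_k=30, rng=None):
    """Sample one token id using temperature scaling and top-k filtering.

    Hypothetical helper showing the common technique; it is an assumption
    that infer.py behaves similarly.
    """
    if rng is None:
        rng = np.random.default_rng()
    # Lower temperature sharpens the distribution toward the highest logits.
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    k = min(top_k, scaled.size)
    # Indices of the k largest logits (order within the set does not matter).
    top_idx = np.argpartition(scaled, -k)[-k:]
    top_logits = scaled[top_idx]
    # Numerically stable softmax restricted to the top-k candidates.
    probs = np.exp(top_logits - top_logits.max())
    probs /= probs.sum()
    return int(rng.choice(top_idx, p=probs))


# Random logits over an 8192-token vocabulary, matching the tokenizer size.
rng = np.random.default_rng(0)
logits = rng.normal(size=8192)
token_id = sample_top_k(logits, temperature=0.5, top_k=30, rng=rng)
print(token_id)
```

In this sketch, only the 30 highest-scoring tokens can be chosen, and dividing the logits by 0.5 makes the choice among them more conservative than sampling at temperature 1.0.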