# Tiny Multimodal Pretrained 68k
This repository contains a clean pretrained checkpoint for Tiny Multimodal, a small, from-scratch, natively multimodal Transformer experiment.
This is a pretrained base model, not an instruction-tuned chat assistant.
## Files

- `ckpt_last.pt`: model checkpoint (weights)
- `args.json`: training arguments
- `tokenizer_8k_clean/`: ByteLevel BPE tokenizer
  - `vocab.json`
  - `merges.txt`
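The tokenizer files follow the standard vocab/merges layout, so a minimal loading sketch with the Hugging Face `tokenizers` library might look like the following (the repo's own loader may differ, and special tokens such as `<|image|>` may need to be registered separately):

```python
from tokenizers import ByteLevelBPETokenizer

# Load the ByteLevel BPE vocab/merges shipped in tokenizer_8k_clean/.
tok = ByteLevelBPETokenizer(
    "tokenizer_8k_clean/vocab.json",
    "tokenizer_8k_clean/merges.txt",
)

enc = tok.encode("a red ball on the grass")
print(enc.ids)            # token ids
print(tok.decode(enc.ids))  # round-trip back to text
```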
## Model
- ~14.7M parameters
- decoder-only Transformer
- 8 layers
- hidden size 304
- 8 attention heads
- ByteLevel BPE tokenizer with an 8192-token vocabulary
- 128×128 image input
- 16×16 image patches, giving (128/16)² = 64 image patch tokens per image
Images are converted into patch vectors and inserted into the causal token sequence:

```
<|bos|> <|image|> [64 image patch vectors] <|/image|> text...
```
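A minimal sketch of that patchification step in PyTorch, assuming flattened 16×16 RGB patches are linearly projected to the hidden size (the repo's actual image embedding may differ):

```python
import torch
import torch.nn as nn

img = torch.randn(1, 3, 128, 128)  # one 128x128 RGB image

# Cut the image into non-overlapping 16x16 patches: (1, 3, 8, 8, 16, 16).
patches = img.unfold(2, 16, 16).unfold(3, 16, 16)
# Flatten each patch into a vector: (1, 64, 3*16*16) = (1, 64, 768).
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 64, 3 * 16 * 16)

# Assumed: a learned linear projection into the model's hidden size (304).
proj = nn.Linear(3 * 16 * 16, 304)
patch_tokens = proj(patches)  # (1, 64, 304)

# These 64 vectors are spliced between the <|image|> and <|/image|>
# embeddings in the causal sequence shown above.
print(patch_tokens.shape)  # torch.Size([1, 64, 304])
```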
## Training
This checkpoint is from the clean pretraining line, stopped at around 68k steps.
The data mix was:
- TinyStories for compact story structure
- filtered Cosmopedia-style story/explanation text
- COCO captions for visual grounding
- Flickr captions for additional image-caption variety
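For reference, a hedged sketch of assembling a similar text mix with the Hugging Face `datasets` library; the dataset IDs, config names, and sampling weights below are illustrative assumptions, not this checkpoint's actual pipeline:

```python
from datasets import load_dataset, interleave_datasets

# Illustrative sources (assumptions; the actual filtered sources differ).
tinystories = load_dataset("roneneldan/TinyStories", split="train")
cosmopedia = load_dataset("HuggingFaceTB/cosmopedia", "stories", split="train")

# Keep only the shared "text" column so the schemas match.
tinystories = tinystories.select_columns(["text"])
cosmopedia = cosmopedia.select_columns(["text"])

# Made-up sampling probabilities for the text portion of the mix.
text_mix = interleave_datasets(
    [tinystories, cosmopedia], probabilities=[0.6, 0.4], seed=0
)
print(text_mix[0]["text"][:200])
```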
## Current Behavior
The model is starting to learn some visual grounding. In simple manual tests, color words and broad scene cues began to influence generations; for example, predominantly red images increased red-related text.

The vision side is still weak: the model struggles with object identity, counting, spatial detail, and robust visual question answering.

Treat this as a toy research checkpoint, not a reliable assistant.
## Usage
Clone the code repository:

```bash
git clone https://github.com/vidvudsc/MultiModal
cd MultiModal
```
Download this model folder into `models/pretrained_68k/`.
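One way to do that is with `huggingface_hub` (the repo ID below is a placeholder; substitute this model's actual Hugging Face ID):

```python
from huggingface_hub import snapshot_download

# Placeholder repo ID, not the real one; replace before running.
snapshot_download(
    repo_id="<user>/tiny-multimodal-pretrained-68k",
    local_dir="models/pretrained_68k",
)
```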
Run local inference:

```bash
python3 infer.py \
  --checkpoint models/pretrained_68k/ckpt_last.pt \
  --tokenizer_dir models/pretrained_68k/tokenizer_8k_clean \
  --index index.html \
  --host 127.0.0.1 \
  --port 7860 \
  --device mps \
  --dtype float32 \
  --temperature 0.5 \
  --top_k 30 \
  --prompt_format plain
```
Use `--prompt_format plain` for this pretrained base checkpoint; the chat format is only appropriate after supervised fine-tuning. The `--device mps` flag targets Apple Silicon; on other machines, pass a device your PyTorch build supports, such as `cpu` or `cuda`.
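For programmatic access outside `infer.py`, a minimal sketch for inspecting the config and checkpoint; the key layout inside `ckpt_last.pt` is an assumption to verify against the repo's training code:

```python
import json
import torch

# Saved training/model configuration.
with open("models/pretrained_68k/args.json") as f:
    args = json.load(f)
print(args)

# Load the raw checkpoint on CPU and inspect its top-level structure;
# the exact keys are defined by the training code in the repo.
ckpt = torch.load("models/pretrained_68k/ckpt_last.pt", map_location="cpu")
print(list(ckpt.keys()) if isinstance(ckpt, dict) else type(ckpt))
```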