# Tiny Multimodal Pretrained 68k
This repository contains a clean pretrained checkpoint for Tiny Multimodal, a small, from-scratch, natively multimodal Transformer experiment.
This is a pretrained base model, not an instruction-tuned chat assistant.
## Files

- `ckpt_last.pt`: model checkpoint (weights)
- `args.json`: training arguments
- `tokenizer_8k_clean/`: ByteLevel BPE tokenizer
  - `vocab.json`
  - `merges.txt`
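The tokenizer files follow the standard vocab/merges layout, so a minimal loading sketch with the Hugging Face `tokenizers` library might look like the following (the repo's own loader may differ, and special tokens such as `<|image|>` may need to be registered separately):

```python
from tokenizers import ByteLevelBPETokenizer

# Load the ByteLevel BPE vocab/merges shipped in tokenizer_8k_clean/.
tok = ByteLevelBPETokenizer(
    "tokenizer_8k_clean/vocab.json",
    "tokenizer_8k_clean/merges.txt",
)

enc = tok.encode("a red ball on the grass")
print(enc.ids)            # token ids
print(tok.decode(enc.ids))  # round-trip back to text
```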
## Model
- ~14.7M parameters
- decoder-only Transformer
- 8 layers
- hidden size 304
- 8 attention heads
- ByteLevel BPE tokenizer with an 8192-token vocabulary
- 128×128 image input
- 16×16 image patches, giving (128/16)² = 64 image patch tokens per image
Images are converted into patch vectors and inserted into the causal token sequence:

```
<|bos|> <|image|> [64 image patch vectors] <|/image|> text...
```
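A minimal sketch of that patchification step in PyTorch, assuming flattened 16×16 RGB patches are linearly projected to the hidden size (the repo's actual image embedding may differ):

```python
import torch
import torch.nn as nn

img = torch.randn(1, 3, 128, 128)  # one 128x128 RGB image

# Cut the image into non-overlapping 16x16 patches: (1, 3, 8, 8, 16, 16).
patches = img.unfold(2, 16, 16).unfold(3, 16, 16)
# Flatten each patch into a vector: (1, 64, 3*16*16) = (1, 64, 768).
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 64, 3 * 16 * 16)

# Assumed: a learned linear projection into the model's hidden size (304).
proj = nn.Linear(3 * 16 * 16, 304)
patch_tokens = proj(patches)  # (1, 64, 304)

# These 64 vectors are spliced between the <|image|> and <|/image|>
# embeddings in the causal sequence shown above.
print(patch_tokens.shape)  # torch.Size([1, 64, 304])
```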
## Training
This checkpoint is from the clean pretraining line, stopped at around 68k steps.
The data mix was:
- TinyStories for compact story structure
- filtered Cosmopedia-style story/explanation text
- COCO captions for visual grounding
- Flickr captions for additional image-caption variety
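For reference, a hedged sketch of assembling a similar text mix with the Hugging Face `datasets` library; the dataset IDs, config names, and sampling weights below are illustrative assumptions, not this checkpoint's actual pipeline:

```python
from datasets import load_dataset, interleave_datasets

# Illustrative sources (assumptions; the actual filtered sources differ).
tinystories = load_dataset("roneneldan/TinyStories", split="train")
cosmopedia = load_dataset("HuggingFaceTB/cosmopedia", "stories", split="train")

# Keep only the shared "text" column so the schemas match.
tinystories = tinystories.select_columns(["text"])
cosmopedia = cosmopedia.select_columns(["text"])

# Made-up sampling probabilities for the text portion of the mix.
text_mix = interleave_datasets(
    [tinystories, cosmopedia], probabilities=[0.6, 0.4], seed=0
)
print(text_mix[0]["text"][:200])
```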
## Current Behavior
The model is starting to learn some visual grounding. In simple manual tests, color words and broad scene cues began to influence generations; for example, predominantly red images increased red-related text.

The vision side is still weak: the model struggles with object identity, counting, spatial detail, and robust visual question answering.

Treat this as a toy research checkpoint, not a reliable assistant.
## Usage
Clone the code repository:

```bash
git clone https://github.com/vidvudsc/MultiModal
cd MultiModal
```
Download this model folder into `models/pretrained_68k/`.
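One way to do that is with `huggingface_hub` (the repo ID below is a placeholder; substitute this model's actual Hugging Face ID):

```python
from huggingface_hub import snapshot_download

# Placeholder repo ID, not the real one; replace before running.
snapshot_download(
    repo_id="<user>/tiny-multimodal-pretrained-68k",
    local_dir="models/pretrained_68k",
)
```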
Run local inference:

```bash
python3 infer.py \
  --checkpoint models/pretrained_68k/ckpt_last.pt \
  --tokenizer_dir models/pretrained_68k/tokenizer_8k_clean \
  --index index.html \
  --host 127.0.0.1 \
  --port 7860 \
  --device mps \
  --dtype float32 \
  --temperature 0.5 \
  --top_k 30 \
  --prompt_format plain
```
Use `--prompt_format plain` for this pretrained base checkpoint; the chat format is only appropriate after supervised fine-tuning. The `--device mps` flag targets Apple Silicon; on other machines, pass a device your PyTorch build supports, such as `cpu` or `cuda`.
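For programmatic access outside `infer.py`, a minimal sketch for inspecting the config and checkpoint; the key layout inside `ckpt_last.pt` is an assumption to verify against the repo's training code:

```python
import json
import torch

# Saved training/model configuration.
with open("models/pretrained_68k/args.json") as f:
    args = json.load(f)
print(args)

# Load the raw checkpoint on CPU and inspect its top-level structure;
# the exact keys are defined by the training code in the repo.
ckpt = torch.load("models/pretrained_68k/ckpt_last.pt", map_location="cpu")
print(list(ckpt.keys()) if isinstance(ckpt, dict) else type(ckpt))
```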