---
license: apache-2.0
library_name: pytorch
tags:
  - multimodal
  - vision-language
  - tiny-model
  - pytorch
  - experimental
---

# Tiny Multimodal Pretrained 68k

This repository contains a clean pretrained checkpoint for Tiny Multimodal, a small from-scratch native multimodal Transformer experiment.

This is a **pretrained base model**, not an instruction-tuned chat assistant.

## Files

```text
ckpt_last.pt
args.json
tokenizer_8k_clean/
  vocab.json
  merges.txt
```

## Model

- ~14.7M parameters
- decoder-only Transformer
- 8 layers
- hidden size 304
- 8 attention heads
- 8192-token ByteLevel BPE tokenizer
- 128x128 image input
- 16x16 image patches, 64 image patch tokens

Images are converted into patch vectors and inserted into the causal token sequence:

```text
<|bos|> <|image|> [64 image patch vectors] <|/image|> text...
```

## Training

This checkpoint is from the clean pretraining line, stopped around the ~68k step range.

The data mix was:

- TinyStories for compact story structure
- filtered Cosmopedia-style story/explanation text
- COCO captions for visual grounding
- Flickr captions for additional image-caption variety

## Current Behavior

The model is starting to learn some visual grounding. In simple manual tests, color words and broad scene cues started to affect generations, for example red images increasing red-related text.

The vision side is still pretty bad. It is weak at object identity, counting, spatial details, and robust visual question answering.

Use this as a toy research checkpoint, not a reliable assistant.

## Usage

Clone the code repository:

```bash
git clone https://github.com/vidvudsc/MultiModal
cd MultiModal
```

Download this model folder into:

```text
models/pretrained_68k/
```

Run local inference:

```bash
python3 infer.py \
  --checkpoint models/pretrained_68k/ckpt_last.pt \
  --tokenizer_dir models/pretrained_68k/tokenizer_8k_clean \
  --index index.html \
  --host 127.0.0.1 \
  --port 7860 \
  --device mps \
  --dtype float32 \
  --temperature 0.5 \
  --top_k 30 \
  --prompt_format plain
```

Use `--prompt_format plain` for this pretrained base checkpoint. Use chat format only after supervised fine tuning.