---
license: apple-amlr
datasets:
- riddhimanrana/coco-fastvlm-2k-val2017
language:
- en
base_model:
- apple/FastVLM-0.5B
base_model_relation: finetune
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- mlx
- finetuned
- 4bit
- llava_qwen2
- multimodal
---

# fastvlm-0.5b-captions

## Model Details

`fastvlm-0.5b-captions` is a finetuned version of **FastVLM-0.5B Stage 3** from the [official FastVLM repository](https://github.com/apple/ml-fastvlm), built for **efficient structured image captioning on mobile devices**. It combines **LoRA fine-tuning**, **4-bit quantization**, and the **MobileCLIP-S0** vision tower to substantially reduce RAM usage for on-device inference. This model is part of a larger research project that I'm conducting. Find out more at [orionlive.ai/research](https://orionlive.ai/research) or in my GitHub repository [riddhimanrana/orion](https://github.com/riddhimanrana/orion).

### Model Description

- **Developed by:** Riddhiman Rana (fine-tuning and optimizations)
- **Model type:** VLM (Vision-Language Model)
- **Original model authors:** Pavan Kumar Anasosalu Vasu, Fartash Faghri, Chun-Liang Li, Cem Koc, Nate True, Albert Antony, Gokul Santhanam, James Gabriel, Peter Grasch, Oncel Tuzel, Hadi Pouransari
- **Language(s) (NLP):** English
- **License (base model):** apple-amlr
- **Finetuned from model:** [`apple/ml-fastvlm`](https://github.com/apple/ml-fastvlm), specifically `FastVLM-0.5B Stage 3`

### Model Sources

- **Base Model Repository:** https://github.com/apple/ml-fastvlm
- **Fine-tuning Training Dataset:** https://huggingface.co/datasets/riddhimanrana/coco-fastvlm-2k-val2017
- **FastVLM Paper (CVPR 2025):** https://www.arxiv.org/abs/2412.13303

## Uses

<table>
<tr>
    <td><img src="https://huggingface.co/riddhimanrana/fastvlm-0.5b-captions/resolve/main/demo/demo.gif" alt="FastVLM - iOS App Demo"></td>
</tr>
</table>

*Demo on iPhone 13 Pro Max*

### Direct Use

- Generating **highly detailed, structured captions** for images on mobile and embedded devices.
- Ideal for **low-resource environments** such as iPhones, MacBooks, and potentially other Apple Silicon devices via MLX and CoreML.
- Tested on iPhone 12/13 Pro Max/14 – reaching RAM usage **below 1 GB** and TTFT as low as **600ms** on higher-end iPhones.

### Out-of-Scope Use

- This is not designed for general-purpose multimodal reasoning beyond descriptive image captioning.
- Not suitable for text-only language tasks.

## Bias, Risks, and Limitations

- Dataset was limited to **2,000 images from COCO 2017 Validation** – captions may reflect biases in that dataset.
- The model’s structured captions might occasionally be verbose or repetitive depending on input complexity.
- Accuracy for extremely abstract or unfamiliar visual scenes may degrade.


## How to Get Started with the Model

To run inference with the PyTorch checkpoint, use the command below. I recommend going through [apple/ml-fastvlm](https://github.com/apple/ml-fastvlm) for further instructions on inference on Apple Silicon and other devices.

```bash
python predict.py --model-path /path/to/checkpoint-dir \
                  --image-file /path/to/image.png \
                  --prompt "Describe the image."
```
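
To caption several images with the same prompt, the command above can be assembled from Python. This is a minimal sketch: it only uses the three flags shown above, and the helper names `build_predict_command` and `commands_for_folder` are hypothetical, not part of the FastVLM repository.

```python
import shlex
import sys
from pathlib import Path


def build_predict_command(model_path, image_file, prompt, script="predict.py"):
    """Return the argument list for one predict.py call (flags as documented above)."""
    return [
        sys.executable, script,
        "--model-path", str(model_path),
        "--image-file", str(image_file),
        "--prompt", prompt,
    ]


def commands_for_folder(model_path, image_dir, prompt, suffixes=(".png", ".jpg", ".jpeg")):
    """Build one command per image found directly inside image_dir."""
    images = sorted(p for p in Path(image_dir).iterdir() if p.suffix.lower() in suffixes)
    return [build_predict_command(model_path, p, prompt) for p in images]


if __name__ == "__main__":
    cmd = build_predict_command(
        "/path/to/checkpoint-dir", "/path/to/image.png", "Describe the image."
    )
    # Print the shell-quoted command; to execute it from the ml-fastvlm checkout,
    # pass the list to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell-quoting problems when the prompt contains spaces or quotes.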

The prompt I used for the dataset, in training, and in practice is:
```
You are a vision-language model that analyzes images for context-aware reasoning.
Given a visual scene, generate a rich, structured, and detailed description that includes:\n\n
  1. Main Focus – What is the primary object, person, or action in the scene?\n
  2. Surrounding Objects & Context – List and describe notable secondary objects, people, or environment details.\n
  3. Spatial Relationships – Describe where the objects are relative to one another.\n
  4. Activities & Interactions – What are people or objects doing? Are there interactions or implied motions?\n
  5. Scene Type & Time – Describe the overall type of scene (e.g. urban street, kitchen, park) and visible time of day.\n
  6. Inferences & Intent – Based on visual cues, infer what might have just happened or what might happen next.\n
  7. Style & Aesthetic – Describe the scene’s mood, lighting, or style (e.g. bright, moody, colorful).\n\n
  Your goal: make your description so complete and detailed that an image generator could reconstruct the scene with full visual accuracy from your output alone.
```

## Training Details

### Training Data

* **Training data:** [`riddhimanrana/coco-fastvlm-2k-val2017`](https://huggingface.co/datasets/riddhimanrana/coco-fastvlm-2k-val2017)
* **Device:** MacBook Pro 16" (M2 Pro, 16GB RAM, Apple Silicon)
* **Vision tower:** [`MobileCLIP-S0`](https://github.com/apple/ml-mobileclip)
* **LoRA parameters:**
  * `r=128`
  * `alpha=256`
  * `dropout=0.1`
  * Applied to the language model using PEFT
* **Epochs:** `1`
* **Model max tokens:** `512`
* **Quantization:** 4-bit (post-training, MLX conversion)
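
As a rough illustration of what the LoRA settings above imply, the sketch below computes the standard LoRA scaling factor (`alpha / r`) and the number of trainable parameters added to a single linear layer. The key names mirror the `r`, `lora_alpha`, and `lora_dropout` keyword arguments of `peft.LoraConfig`; the 896-wide layer is an assumption chosen only for the example, and the actual target modules are not listed in this card.

```python
# Values copied from the parameter list above; key names follow peft.LoraConfig.
lora_kwargs = {"r": 128, "lora_alpha": 256, "lora_dropout": 0.1}


def lora_scaling(r, lora_alpha):
    """Standard LoRA scaling: the low-rank update B @ A is multiplied by alpha / r."""
    return lora_alpha / r


def lora_param_count(in_features, out_features, r):
    """Trainable parameters added to one linear layer: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


scaling = lora_scaling(lora_kwargs["r"], lora_kwargs["lora_alpha"])

# Hypothetical square projection layer, used only to make the arithmetic concrete.
extra_params = lora_param_count(896, 896, lora_kwargs["r"])

print(f"scaling factor: {scaling}")
print(f"extra trainable parameters for one 896x896 layer: {extra_params}")
```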

### Training Procedure

#### Preprocessing

- Images were padded to a square aspect ratio at 256×256.
- Object detection tags from YOLOv11n were added at the start of each prompt.
- All prompts followed a structured, 7-point captioning rubric.
- Inputs were clipped at 512 tokens.
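
The tag-prepending step above can be sketched as follows. The exact tag format used in the dataset is not documented here, so the `Detected objects:` prefix and the `build_prompt` helper are assumptions for illustration; the detector output is represented as a plain list of label strings.

```python
def build_prompt(tags, rubric):
    """Prepend de-duplicated detection tags to the captioning rubric.

    The "Detected objects: ..." prefix is a hypothetical format, not the
    one confirmed to be used in the training data.
    """
    # dict.fromkeys keeps first-seen order while dropping duplicates
    unique = list(dict.fromkeys(t.strip() for t in tags if t.strip()))
    if not unique:
        return rubric
    return f"Detected objects: {', '.join(unique)}.\n\n{rubric}"


if __name__ == "__main__":
    # Placeholder rubric; in practice this would be the full 7-point prompt.
    rubric = "Generate a rich, structured, and detailed description of the scene."
    print(build_prompt(["person", "dog", "person", "bench"], rubric))
```

Clipping to 512 tokens is left out because it depends on the model's tokenizer.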

#### Training Hyperparameters

| Hyperparameter         | Value                                |
| ---------------------- | ------------------------------------ |
| Precision              | `fp32` (Apple Silicon, no bf16/fp16) |
| Learning rate          | `2e-4`                               |
| Weight decay           | `0.0`                                |
| Warmup ratio           | `0.03`                               |
| Scheduler              | `cosine`                             |
| Batch size (train)     | `8`                                  |
| Batch size (eval)      | `4`                                  |
| Gradient accumulation  | `1`                                  |
| Max token length       | `512`                                |
| Logging steps          | `1`                                  |
| Evaluation strategy    | `no`                                 |
| Save strategy          | `steps` (default step interval)      |
| Gradient checkpointing | `True`                               |
| Lazy preprocessing     | `True`                               |
| DataLoader workers     | `4`                                  |

#### Speeds, Sizes, Times

Training duration: ~1.2 hours on M2 Pro (1 epoch over 2k samples)
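
For reference, the step count and learning-rate curve implied by the hyperparameters can be worked out with a short sketch. It assumes all 2,000 samples are used for training and uses the common linear-warmup-then-cosine formula; the exact trainer implementation may differ in details such as how warmup steps are rounded.

```python
import math


def cosine_lr(step, total_steps, warmup_steps, base_lr):
    """Linear warmup to base_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


num_samples = 2000   # assumption: the full 2k dataset is used for training
batch_size = 8
grad_accum = 1
epochs = 1
base_lr = 2e-4
warmup_ratio = 0.03

steps_per_epoch = math.ceil(num_samples / (batch_size * grad_accum))
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(f"optimizer steps: {total_steps}, warmup steps: {warmup_steps}")
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, cosine_lr(step, total_steps, warmup_steps, base_lr))
```

With ~1.2 hours for the epoch, this works out to roughly 17 seconds per optimizer step under the same assumption.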

Peak RAM usage: ~11.5 GB

Merged model size: 3.0 GB (pre-quantization)

Post-quantization size: ~864 MB (MLX-quantized, 4-bit)

Inference memory on iPhone (MLX): ~980 MB to 1.2 GB RAM with 256-token generation

All devices were given the same image. The model is only compatible with iPhone 12 and newer: it was also tried on an iPhone 11, where it does not run because of incompatibilities with Apple MLX support and the smaller Neural Engine.

| Device            | Chip   | RAM  | TTFT   | Generation |
|-------------------|--------|------|--------|------------|
| iPhone 12         | A14    | 4GB  | 2392ms | 73.5 tok/s |
| iPhone 13 Pro Max | A15    | 6GB  | 1138ms | 74.1 tok/s |
| iPhone 14         | A15    | 6GB  | 1069ms | 71.3 tok/s |
| MacBook Air 2020  | M1     | 8GB  | 673ms  | 131 tok/s  |
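
The two timing columns can be combined into a rough end-to-end estimate for one caption. This sketch assumes a constant decode rate and a 256-token caption (the generation length mentioned for the memory figure); real latency will vary with the image and the output length.

```python
# (TTFT in ms, generation speed in tokens/s), copied from the table above.
benchmarks = {
    "iPhone 12": (2392, 73.5),
    "iPhone 13 Pro Max": (1138, 74.1),
    "iPhone 14": (1069, 71.3),
    "MacBook Air 2020": (673, 131.0),
}


def caption_latency_s(ttft_ms, tok_per_s, new_tokens=256):
    """Estimated seconds for one caption: time to first token plus decode time."""
    return ttft_ms / 1000 + new_tokens / tok_per_s


for device, (ttft_ms, tok_per_s) in benchmarks.items():
    print(f"{device}: ~{caption_latency_s(ttft_ms, tok_per_s):.1f} s for 256 tokens")
```
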

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

- A subset of COCO val2017 images was manually evaluated.
- Dataset includes both common and edge cases: animals, street scenes, closeups, occlusion, and indoor scenes.

#### Factors

- Image complexity (single vs multi-object)
- Scene type (indoor vs outdoor)
- Visual density
- Prompt diversity (7-point rubric compliance)

#### Metrics

*Given the direction of my current project, evaluation metrics weren't a priority, so I didn't spend much time on them. However, I am open to community contributions for model evaluation.*

- **Human Evaluation** (1–5 scale):
  - Completeness: How well the description matches the visible scene
  - Structure: Coherence of the response relative to the 7-part prompt
  - Detail & Accuracy: Visual correctness of relationships and entities
- **Quantitative** (for future release):
  - CIDEr / METEOR / BLEU-4 (planned via COCO eval pipeline)
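
For anyone contributing human evaluations, per-criterion averages on the 1–5 scale could be aggregated along these lines. The ratings in the sketch are placeholders invented for illustration and are not the data behind the scores reported in this card; the criterion keys and the `average_scores` helper are likewise hypothetical.

```python
from statistics import mean

CRITERIA = ("completeness", "structure", "detail_accuracy")

# Placeholder ratings for illustration only; NOT the model's actual evaluation data.
ratings = [
    {"completeness": 5, "structure": 5, "detail_accuracy": 4},
    {"completeness": 4, "structure": 5, "detail_accuracy": 5},
    {"completeness": 4, "structure": 4, "detail_accuracy": 4},
]


def average_scores(rows):
    """Return the mean 1-5 rating per criterion, rounded to two decimals."""
    for row in rows:
        for name in CRITERIA:
            if not 1 <= row[name] <= 5:
                raise ValueError(f"{name} rating out of range: {row[name]}")
    return {name: round(mean(row[name] for row in rows), 2) for name in CRITERIA}


if __name__ == "__main__":
    for name, score in average_scores(ratings).items():
        print(f"{name}: {score} / 5")
```
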

### Results

| Metric          | Avg Score |
| --------------- | --------- |
| Completeness    | `4.6 / 5` |
| Structure       | `4.8 / 5` |
| Visual Accuracy | `4.5 / 5` |

#### Summary

The model produces rich, well-structured, and highly relevant captions and is optimized for real-time mobile inference. With a ~930 MB size and <1 GB RAM usage, it is deployable on older iPhones without Apple Intelligence support (iPhone 12 or newer). Despite fine-tuning on just 2,000 examples, its reasoning capability generalizes well, which I attribute to the high-quality distilled prompts.

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** MacBook Air M1 (dataset generation), MacBook Pro M2 Pro (training, quantization)
- **Hours used:** ~3 hours for dataset, ~1h for training
- **Compute Region:** Local / personal hardware
- **Carbon Emitted:** Minimal, due to small dataset size and single-device compute.

## Citation

**BibTeX:**

```bibtex
@InProceedings{fastvlm2025,
  author = {Pavan Kumar Anasosalu Vasu and Fartash Faghri and Chun-Liang Li and Cem Koc and Nate True and Albert Antony and Gokul Santhanam and James Gabriel and Peter Grasch and Oncel Tuzel and Hadi Pouransari},
  title = {FastVLM: Efficient Vision Encoding for Vision Language Models},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2025}
}
```

## Model Card Contact

Contact: @riddhimanrana on Hugging Face or GitHub