---
license: apache-2.0
pipeline_tag: image-text-to-text
---

Moondream is a small vision language model designed to run efficiently everywhere. 

[Website](https://moondream.ai/) / [Demo](https://moondream.ai/playground) / [GitHub](https://github.com/vikhyat/moondream)

This repository contains the latest (**2025-04-14**) release of Moondream, as well as [historical releases](https://huggingface.co/vikhyatk/moondream2/blob/main/versions.txt). The model is updated frequently, so we recommend specifying a revision as shown below if you're using it in a production application.

To use **quantized int4**, make sure to install the requirements: 
```
pip install -r https://depot.moondream.ai/transformers/requirements.txt
```

### Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

# To run in float16 instead of quantized int4, set revision="2025-04-14".
model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    revision="int4_2025-04-14",
    trust_remote_code=True,
    # Uncomment to run on GPU.
    # device_map={"": "cuda"}
)

# Load the image to run inference on (placeholder path).
image = Image.open("path/to/image.jpg")

# Captioning
print("Short caption:")
print(model.caption(image, length="short")["caption"])

print("\nNormal caption:")
for t in model.caption(image, length="normal", stream=True)["caption"]:
    # Streaming generation example, supported for caption() and detect()
    print(t, end="", flush=True)
print()  # end the streamed caption with a newline

# Visual Querying
print("\nVisual query: 'How many people are in the image?'")
print(model.query(image, "How many people are in the image?")["answer"])

# Object Detection
print("\nObject detection: 'face'")
objects = model.detect(image, "face")["objects"]
print(f"Found {len(objects)} face(s)")

# Pointing
print("\nPointing: 'person'")
points = model.point(image, "person")["points"]
print(f"Found {len(points)} person(s)")
```
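
Detection boxes and points are returned as normalized coordinates. The sketch below, which continues from the variables in the snippet above, converts them to pixel values; it assumes each object carries `x_min`/`y_min`/`x_max`/`y_max` and each point carries `x`/`y` in the 0 to 1 range, so double-check the output shape on your pinned revision:

```python
# Continues from the Usage snippet: `image`, `objects`, and `points` are defined above.
# Assumes normalized (0 to 1) coordinates; verify against your revision's output.
width, height = image.size

for obj in objects:
    box = (
        int(obj["x_min"] * width),
        int(obj["y_min"] * height),
        int(obj["x_max"] * width),
        int(obj["y_max"] * height),
    )
    print(f"face at pixel box {box}")

for pt in points:
    print(f"person at pixel ({int(pt['x'] * width)}, {int(pt['y'] * height)})")
```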

### Changelog
**int4_2025-04-14** ([full release notes](https://moondream.ai/blog/moondream-2025-04-14-release))
1. Much lower memory usage (4.12 GB down to 2.47 GB)
2. Big speedup on small devices (44.54 to 67.84 tok/sec on an RTX 4050 Mobile)
3. Improved spatial understanding (RealWorldQA up from 58.3 to 60.13)


**2025-04-14** ([full release notes](https://moondream.ai/blog/moondream-2025-04-14-release))

1. Improved chart understanding (ChartQA up from 74.8 to 77.5, 82.2 with PoT)
2. Added temperature and nucleus sampling to reduce repetitive outputs
3. Better OCR for documents and tables (prompt with “Transcribe the text” or “Transcribe the text in natural reading order”; see the sketch after this list)
4. Object detection supports document layout detection (figure, formula, text, etc.)
5. UI understanding (ScreenSpot F1@0.5 up from 53.3 to 60.3)
6. Improved text understanding (DocVQA up from 76.5 to 79.3, TextVQA up from 74.6 to 76.3)
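
The document OCR in item 3 goes through the ordinary `query` interface; a minimal sketch, assuming a model loaded as in the Usage section and a hypothetical document scan:

```python
from PIL import Image

# `model` loaded as in the Usage section; the path below is a placeholder.
doc = Image.open("path/to/document.png")
print(model.query(doc, "Transcribe the text in natural reading order")["answer"])
```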

**2025-03-27** ([full release notes](https://moondream.ai/blog/moondream-2025-03-27-release))

1. Added support for long-form captioning (see the sketch after this list)
2. Open vocabulary image tagging
3. Improved counting accuracy (e.g. CountBenchQA increased from 80 to 86.4)
4. Improved text understanding (e.g. OCRBench increased from 58.3 to 61.2)
5. Improved object detection, especially for small objects (e.g. COCO up from 30.5 to 51.2)
6. Fixed token streaming bug affecting multi-byte unicode characters
7. gpt-fast style `compile()` now supported in HF Transformers implementation
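
A minimal sketch combining items 1 and 7, assuming long-form captions are requested with `length="long"` (extending the documented `"short"`/`"normal"` options):

```python
from transformers import AutoModelForCausalLM
from PIL import Image

model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    revision="2025-03-27",  # pin the release this changelog entry describes
    trust_remote_code=True,
)
model.compile()  # gpt-fast style compilation; the first call pays a warm-up cost

image = Image.open("path/to/image.jpg")  # placeholder path
# length="long" is an assumption based on item 1; fall back to "normal" if unsupported.
print(model.caption(image, length="long")["caption"])
```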