The `query` skill can be used to ask open-ended questions about images.

```python
from PIL import Image

# Simple VQA
image = Image.open("photo.jpg")
result = moondream.query(image=image, question="What's in this image?")
print(result["answer"])
```
By default, `query` runs in reasoning mode, allowing the model to "think" about the question before generating an answer. This is helpful for more complicated tasks, but sometimes the task you're running is simple and doesn't benefit from reasoning. To save on inference cost when this is the case, you can disable reasoning:

```python
# Without reasoning for simple questions
result = moondream.query(
    image=image,
    question="What color is the sky?",
    reasoning=False
)
print(result["answer"])
```
If you want to stream outputs, pass in `stream=True`. You can control the temperature, top-p, and maximum number of tokens generated by passing in optional settings.

```python
# Streaming with custom settings
settings = {
    "temperature": 0.7,
    "top_p": 0.95,
    "max_tokens": 512
}

result = moondream.query(
    image=image,
    question="Describe what's happening in detail",
    stream=True,
    settings=settings
)

# Stream the answer
for chunk in result["answer"]:
    print(chunk, end="", flush=True)
```
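Streamed chunks are only yielded once, so if you also need the complete answer after streaming finishes, accumulate the chunks as they arrive. A minimal sketch, using a stand-in generator in place of a real streamed `result["answer"]`:

```python
# Stand-in generator in place of a real streamed result["answer"]
def fake_stream():
    yield from ["The ", "sky ", "is ", "blue."]

chunks = []
for chunk in fake_stream():
    print(chunk, end="", flush=True)  # display incrementally
    chunks.append(chunk)              # keep for later use

full_answer = "".join(chunks)
```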
Note that this isn't just for images; Moondream is also a strong general-purpose text model.

```python
# Text-only example (no image)
result = moondream.query(
    question="Explain the concept of machine learning in simple terms"
)
print(result["answer"])
```
### Caption

Whether you want short, normal-sized, or long descriptions of images, the `caption` skill has you covered.

```python
# Different caption lengths
image = Image.open("landscape.jpg")

# Short caption
short = moondream.caption(image, length="short")
print(f"Short: {short['caption']}")

# Normal caption (default)
normal = moondream.caption(image, length="normal")
print(f"Normal: {normal['caption']}")

# Long caption
long = moondream.caption(image, length="long")
print(f"Long: {long['caption']}")
```
It accepts the same streaming and sampling settings (temperature, top-p, maximum tokens) as the `query` skill.

```python
# Streaming caption with custom settings
result = moondream.caption(
    image,
    length="long",
    stream=True,
    settings={"temperature": 0.3}
)

for chunk in result["caption"]:
    print(chunk, end="", flush=True)
```
### Point

The `point` skill identifies specific points (x, y coordinates) for objects in an image.

```python
# Find points for specific objects
image = Image.open("crowd.jpg")
result = moondream.point(image, "person wearing a red shirt")

# Points are normalized coordinates (0-1)
for i, point in enumerate(result["points"]):
    print(f"Point {i+1}: x={point['x']:.3f}, y={point['y']:.3f}")
```
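Since the coordinates are normalized, you'll usually want to scale them by the image dimensions before drawing on or cropping the image. A minimal sketch, using a hypothetical result dict and image size standing in for an actual `point` response:

```python
# Hypothetical `point` response; real results come from moondream.point(...)
result = {"points": [{"x": 0.42, "y": 0.67}]}

# Assumed image size, for illustration only
width, height = 1920, 1080

# Scale normalized coordinates up to pixel coordinates
pixel_points = [
    (round(p["x"] * width), round(p["y"] * height))
    for p in result["points"]
]
print(pixel_points)  # [(806, 724)]
```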
### Detect

The `detect` skill provides bounding boxes for objects in an image.

```python
# Detect objects with bounding boxes
image = Image.open("street_scene.jpg")
result = moondream.detect(image, "car")

# Bounding boxes are normalized coordinates (0-1)
for i, obj in enumerate(result["objects"]):
    print(f"Object {i+1}: "
          f"x_min={obj['x_min']:.3f}, y_min={obj['y_min']:.3f}, "
          f"x_max={obj['x_max']:.3f}, y_max={obj['y_max']:.3f}")

# Control maximum number of objects
settings = {"max_objects": 10}
result = moondream.detect(image, "person", settings=settings)
```
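As with points, box coordinates are normalized, so scale them by the image size before using them, e.g. with Pillow's `Image.crop` or `ImageDraw.rectangle`. A minimal sketch, using a hypothetical result dict and image size standing in for an actual `detect` response:

```python
# Hypothetical `detect` response; real results come from moondream.detect(...)
result = {"objects": [
    {"x_min": 0.10, "y_min": 0.25, "x_max": 0.55, "y_max": 0.90},
]}

# Assumed image size, for illustration only
width, height = 1280, 720

# Scale each normalized box up to integer pixel coordinates
pixel_boxes = [
    (
        round(obj["x_min"] * width),
        round(obj["y_min"] * height),
        round(obj["x_max"] * width),
        round(obj["y_max"] * height),
    )
    for obj in result["objects"]
]
print(pixel_boxes)  # [(128, 180, 704, 648)]
```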
### Caching image encodings (advanced)

If you're planning to run multiple inferences on the same image, you can pre-encode it once and reuse the encoding for better performance.

```python
# Encode image once
image = Image.open("complex_scene.jpg")
encoded = moondream.encode_image(image)

# Reuse the encoding for multiple queries
questions = [
    "How many people are in this image?",
    "What time of day was this taken?",
    "What's the weather like?"
]

for q in questions:
    result = moondream.query(image=encoded, question=q, reasoning=False)
    print(f"Q: {q}")
    print(f"A: {result['answer']}\n")

# Also works with other skills
caption = moondream.caption(encoded, length="normal")
objects = moondream.detect(encoded, "vehicle")
```