We offer a toolkit to help you handle various types of visual input more conveniently, as if you were using an API: base64-encoded images, URLs, and interleaved images and videos. You can install it with the following command:

```bash
# The `[decord]` extra is highly recommended for faster video loading.
pip install "keye-vl-utils[decord]==1.0.0"
```

If you are not on Linux, you may not be able to install `decord` from PyPI. In that case, use `pip install keye-vl-utils`, which falls back to torchvision for video processing. You can still [install decord from source](https://github.com/dmlc/decord?tab=readme-ov-file#install-from-source) to have decord used when loading videos.
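As a quick sanity check of the fallback behavior described above (assuming keye-vl-utils prefers decord whenever it is importable), you can see which video backend your environment will use by attempting the import directly:

```python
# Probe the video-decoding backend: decord if the [decord] extra installed
# successfully, otherwise torchvision is assumed to be used as the fallback.
try:
    import decord  # may be unavailable from PyPI on non-Linux platforms
    backend = "decord"
except ImportError:
    backend = "torchvision"

print(f"video backend: {backend}")
```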
Here is a code snippet showing how to use the chat model with `transformers` and `keye_vl_utils`:

> [!NOTE]
> Following Qwen3, we also offer a soft-switch mechanism that lets users dynamically control the model's behavior.
> You can append `/think`, `/no_think`, or nothing to a user prompt to switch between the model's thinking modes.

```python
from transformers import AutoModel, AutoTokenizer, AutoProcessor
from keye_vl_utils import process_vision_info

model_path = "Kwai-Keye/Keye-VL-8B-Preview"

model = AutoModel.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = KeyeForConditionalGeneration.from_pretrained(
#     ...

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# ...
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained(model_path, min_pixels=min_pixels, max_pixels=max_pixels, trust_remote_code=True)

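# Note (assumption): in Qwen-VL-style processors each 28x28 pixel patch maps
# to one visual token, so the pixel bounds above are effectively visual-token
# budgets; e.g. max_pixels = 1280*28*28 allows roughly 1280 visual tokens.
max_visual_tokens = (1280 * 28 * 28) // (28 * 28)  # 1280
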
# Non-Thinking Mode
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://s1-11508.kwimgs.com/kos/nlav11508/mllm_all/ziran_jiafeimao_11.jpg",
            },
            {"type": "text", "text": "Describe this image./no_think"},
        ],
    }
]

# Auto-Thinking Mode
messages = [
    {
        "role": "user",
        # ...
    }
]

# Thinking Mode
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://s1-11508.kwimgs.com/kos/nlav11508/mllm_all/ziran_jiafeimao_11.jpg",
            },
            {"type": "text", "text": "Describe this image./think"},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
```
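For intuition about what `apply_chat_template` produces, here is a toy stand-in that flattens the messages list into a tagged prompt string. The tag names (`<|im_start|>` etc.) and the placeholder tokens for visual inputs are illustrative assumptions, not Keye's actual template, which ships with the model repo and is applied by the processor:

```python
def toy_chat_template(messages, add_generation_prompt=True):
    """Toy stand-in for processor.apply_chat_template (illustrative only).

    Text content is kept verbatim (including any /think or /no_think suffix);
    image/video entries are replaced by placeholder tokens.
    """
    parts = []
    for m in messages:
        chunks = []
        for c in m["content"]:
            if c.get("type") == "text":
                chunks.append(c["text"])
            elif c.get("type") in ("image", "video"):
                chunks.append(f"<|{c['type']}_pad|>")  # hypothetical placeholder
        parts.append(f"<|im_start|>{m['role']}\n{''.join(chunks)}<|im_end|>")
    if add_generation_prompt:
        # Leave the prompt open for the assistant turn, as the real template does.
        parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)
```

The real template additionally handles system prompts and the model's special vision tokens; this sketch only shows the overall shape of the serialized prompt.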