Add pipeline tag and library name to metadata
This PR updates the model card metadata by adding `pipeline_tag: image-text-to-text` and `library_name: transformers`.
Adding these tags will:
1. Improve the model's discoverability on the Hugging Face Hub under the [Image-Text-to-Text](https://huggingface.co/models?pipeline_tag=image-text-to-text) category.
2. Enable the "Use in Transformers" button on the model page, providing users with a ready-to-use code snippet.
The sample usage provided in the documentation confirms that the model is designed to work with the `transformers` library.
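For reference, the card's YAML front matter after this change reads:

```yaml
---
base_model:
- stepfun-ai/Step3-VL-10B-Base
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
---
```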
README.md
CHANGED
````diff
@@ -1,7 +1,9 @@
 ---
-license: apache-2.0
 base_model:
-- stepfun-ai/Step3-VL-10B-Base
+- stepfun-ai/Step3-VL-10B-Base
+license: apache-2.0
+library_name: transformers
+pipeline_tag: image-text-to-text
 ---
 
 <div align="center">
@@ -57,8 +59,6 @@ STEP3-VL-10B delivers best-in-class performance across major multimodal benchmar
 | **HMMT 2025** | 78.18 | **92.14** | 57.29 | 67.71 | 65.68 | 51.30 |
 | **LiveCodeBench** | 75.77 | **76.43** | 48.71 | 69.45 | 72.01 | 57.10 |
 
-<!-- > **Note:** **SeRe** (Sequential Reasoning) uses a max length of 64K tokens; **PaCoRe** (Parallel Coordinated Reasoning) synthesizes 16 SeRe rollouts with a max length of 128K tokens. -->
-
 > **Note on Inference Modes:**
 >
 > **SeRe (Sequential Reasoning):** The standard inference mode using sequential generation (Chain-of-Thought) with a max length of 64K tokens.
@@ -120,7 +120,7 @@ STEP3-VL-10B delivers best-in-class performance across major multimodal benchmar
 
 ### Inference with Hugging Face Transformers
 
-We introduce how to use our model at inference stage using transformers library. It is recommended to use python=3.10, torch>=2.1.0, and transformers=4.57.0 as the development environment.We currently only support bf16 inference, and multi-patch for image preprocessing is supported by default. This behavior is aligned with vllm and sglang.
+We introduce how to use our model at inference stage using transformers library. It is recommended to use python=3.10, torch>=2.1.0, and transformers=4.57.0 as the development environment. We currently only support bf16 inference, and multi-patch for image preprocessing is supported by default. This behavior is aligned with vllm and sglang.
 
 ```python
 from transformers import AutoProcessor, AutoModelForCausalLM
@@ -185,4 +185,4 @@ If you find this project useful in your research, please cite our technical repo
 
 ## 📄 License
 
-This project is open-sourced under the [Apache 2.0 License](https://www.google.com/search?q=LICENSE).
+This project is open-sourced under the [Apache 2.0 License](https://www.google.com/search?q=LICENSE).
````
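The tags added in this PR are plain `key: value` entries in the card's YAML front matter. As a rough sanity check before (or after) merging, the front matter can be inspected with a few lines of standard-library Python; the `front_matter` helper below is purely illustrative, not part of this PR or of any Hub API:

```python
def front_matter(readme: str) -> dict:
    """Parse the simple top-level `key: value` pairs between the leading '---' fences."""
    lines = readme.splitlines()
    assert lines[0].strip() == "---", "model card must start with YAML front matter"
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break  # closing fence of the front matter
        # Skip list items and indented continuations; keep scalar keys only.
        if ":" in line and not line.startswith(("-", " ")):
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta

# Front matter as it reads after this PR.
card = """---
base_model:
- stepfun-ai/Step3-VL-10B-Base
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
---

# Model card body
"""

meta = front_matter(card)
print(meta["pipeline_tag"])   # image-text-to-text
print(meta["library_name"])   # transformers
```

For anything beyond this quick check (nested keys, multi-line values), a real YAML parser should be used instead of string splitting.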