Add model card metadata and resource links
#1 by nielsr (HF Staff) - opened

README.md CHANGED
@@ -1,18 +1,33 @@
 ---
 license: apache-2.0
+library_name: transformers
+pipeline_tag: image-text-to-text
 ---
+
+# VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents
+
+VisGym is a gymnasium of 17 visually interactive, long-horizon environments for evaluating, diagnosing, and training vision–language models (VLMs) in multi-step visual decision-making across symbolic puzzles, real-image understanding, navigation, and manipulation.
+
+This repository contains model checkpoints described in the paper [VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents](https://huggingface.co/papers/2601.16973).
+
+- **Project Page:** [https://visgym.github.io/](https://visgym.github.io/)
+- **Code:** [https://github.com/visgym/VIsGym](https://github.com/visgym/VIsGym)
+- **Paper:** [https://arxiv.org/abs/2601.16973](https://arxiv.org/abs/2601.16973)
+
+## Description
+
+Modern Vision-Language Models (VLMs) remain poorly characterized in multi-step visual interactions, particularly in how they integrate perception, memory, and action over long horizons. VisGym provides 17 environments for evaluating and training VLMs, offering flexible controls over difficulty, input representation, planning horizon, and feedback. The suite spans symbolic puzzles, real-image understanding, navigation, and manipulation.
+
 ## Citation
 
 If you use this model, please cite:
 
 ```bibtex
-@
-
-
-
-
-
-  primaryClass={cs.CV},
-  url={https://arxiv.org/abs/2601.16973},
+@article{wang2026visgym,
+  title = {VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents},
+  author = {Wang, Zirui and Zhang, Junyi and Ge, Jiaxin and Lian, Long and Fu, Letian and Dunlap, Lisa and Goldberg, Ken and Wang, Xudong and Stoica, Ion and Chan, David M. and Min, Sewon and Gonzalez, Joseph E.},
+  journal = {arXiv preprint arXiv:2601.16973},
+  year = {2026},
+  url = {https://arxiv.org/abs/2601.16973}
 }
 ```