| Field | Response |
|---|---|
| Intended Task/Domain: | Vision-to-action model designed to play video games directly from raw frames |
| Model Type: | Transformer |
| Intended Users: | Researchers, game developers, open source community, gamers. Potential applications include next-generation game AI, automating testing for video games, and generally advancing research in embodied AI. |
| Output: | Gamepad actions |
| Describe how the model works: | Image inputs are encoded with a vision transformer. A separate diffusion transformer, conditioned on the resulting image embeddings, then iteratively denoises an action tensor into gamepad actions. |
| Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: | Not Applicable |
| Technical Limitations & Mitigation: | The model performs well on games played with a gamepad. It may not perform well on games played with a keyboard or mouse. |
| Verified to have met prescribed NVIDIA quality standards: | Yes |
| Performance Metrics: | Task success rate |
| Potential Known Risks: | The model may occasionally lose at certain games. |
| Licensing: | Governing Terms: NVIDIA License. Additional Information: Apache License for https://huggingface.co/google/siglip2-base-patch16-224. |
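The flow described under "Describe how the model works" can be sketched in miniature: encode a frame into an embedding, then iteratively denoise a noisy action tensor conditioned on that embedding. This is an illustrative toy, not the actual implementation; all dimensions, weights, and function names below are hypothetical stand-ins (simple projections replace the real vision transformer and diffusion transformer).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, not taken from the model card
EMBED_DIM = 64    # vision-embedding size
ACTION_DIM = 16   # gamepad action tensor size (buttons + stick axes)
STEPS = 8         # number of diffusion denoising steps

def encode_frame(frame, w_enc):
    """Stand-in for the vision transformer: flatten the raw frame and
    project it to an embedding. A real ViT would patchify the image and
    run attention layers instead."""
    return np.tanh(frame.reshape(-1) @ w_enc)

def denoise_step(action, embedding, t, w_cond):
    """Stand-in for one diffusion-transformer step: predict a refined
    action estimate conditioned on the image embedding and timestep,
    then move the noisy action tensor toward it."""
    cond_input = np.concatenate([action, embedding, [t]])
    estimate = np.tanh(cond_input @ w_cond)
    return action + 0.5 * (estimate - action)

# Toy random weights; a trained model would learn these
frame = rng.normal(size=(8, 8, 3))  # raw game frame (toy resolution)
w_enc = rng.normal(size=(frame.size, EMBED_DIM)) * 0.1
w_cond = rng.normal(size=(ACTION_DIM + EMBED_DIM + 1, ACTION_DIM)) * 0.1

embedding = encode_frame(frame, w_enc)

# Start from pure noise and iteratively denoise into a gamepad action
action = rng.normal(size=ACTION_DIM)
for t in range(STEPS, 0, -1):
    action = denoise_step(action, embedding, t / STEPS, w_cond)

print(action.shape)
```

The key design point mirrored here is that conditioning enters only through the denoiser: the vision encoder runs once per frame, while the diffusion loop repeatedly refines the action tensor against that fixed embedding.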