---
license: apache-2.0
pipeline_tag: video-audio-to-text
library_name: transformers
---
This repository contains the pretrained and fine-tuned weights for the model introduced in the paper "Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation".

[Paper](https://huggingface.co/papers/2503.13068)

GitHub: https://github.com/GeWu-Lab/Crab
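
Since this repository hosts weight files, one way to fetch them locally is with the `huggingface_hub` client. The snippet below is a minimal sketch, not the authors' documented workflow: the helper name `download_weights`, the default folder `crab_weights`, and the repository id placeholder are assumptions made for illustration. For loading and running the model, follow the instructions in the GitHub repository.

```python
from pathlib import Path


def download_weights(repo_id: str, local_dir: str = "crab_weights") -> Path:
    """Download all files from a Hub repository and return the local folder.

    `download_weights` is a hypothetical helper, not part of the Crab codebase.
    """
    # Imported lazily so this module can be imported without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    # snapshot_download fetches the full repository snapshot into local_dir.
    path = snapshot_download(repo_id=repo_id, local_dir=local_dir)
    return Path(path)


# Usage (requires `pip install huggingface_hub` and network access).
# The repository id is a placeholder: replace it with this repository's
# actual "<namespace>/<name>" id as shown on its Hub page.
#
#   weights_dir = download_weights("<namespace>/<repo-name>")
#   print(sorted(p.name for p in weights_dir.iterdir()))
```

The download is kept separate from model loading because the card does not state how the checkpoints are meant to be loaded.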