---
library_name: transformers
tags:
- multimodal
- video-understanding
- sports
- commentary-generation
- llama3
- soccer
language:
- en
datasets:
- MatchTime
pipeline_tag: text-generation
---

# Matchcommentary: Automatic Soccer Game Commentary Generation

## Model Description

Matchcommentary is a multimodal model designed for automatic soccer game commentary generation. It combines video feature understanding with large language models to generate fluent and contextually appropriate soccer commentary.

## Architecture

The model consists of:

- **Vision Encoder**: Q-Former architecture for processing video features
- **Language Model**: LLaMA-3-8B-Instruct for text generation
- **Feature Fusion**: Cross-attention mechanism between visual and textual information
- **Domain Adaptation**: Soccer-specific vocabulary constraints

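
As a rough illustration of how the vision encoder and feature fusion listed above could fit together, the following PyTorch sketch lets a set of learnable query tokens cross-attend to pre-extracted frame features and projects the result to the language model's embedding width. It is a single-layer simplification, not the model's actual implementation: the class name `VideoQueryPooler`, the head count, and the 30-frame input are assumptions, while the 32 query tokens and 512-dimensional features mirror the `num_video_query_token` and `num_features` values in the Usage section, and 4096 is the hidden size of LLaMA-3-8B.

```python
import torch
import torch.nn as nn


class VideoQueryPooler(nn.Module):
    """Hypothetical Q-Former-style pooling (single cross-attention layer)."""

    def __init__(self, num_features=512, num_query_tokens=32, llm_dim=4096, num_heads=8):
        super().__init__()
        # Learnable query tokens, shared across all clips.
        self.queries = nn.Parameter(torch.randn(1, num_query_tokens, num_features) * 0.02)
        self.cross_attn = nn.MultiheadAttention(num_features, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(num_features)
        # Project the pooled tokens to the language model's embedding width.
        self.to_llm = nn.Linear(num_features, llm_dim)

    def forward(self, frame_features):
        # frame_features: (batch, num_frames, num_features)
        batch = frame_features.size(0)
        q = self.queries.expand(batch, -1, -1)
        # Queries attend to the frame features (keys and values).
        attended, _ = self.cross_attn(q, frame_features, frame_features)
        return self.to_llm(self.norm(q + attended))


pooler = VideoQueryPooler()
clip = torch.randn(2, 30, 512)  # 2 clips, 30 frames each, 512-dim features
video_tokens = pooler(clip)
print(video_tokens.shape)  # torch.Size([2, 32, 4096])
```

Whatever the clip length, the output always has 32 tokens per clip, which is what makes this kind of pooling convenient for prepending visual context to a language model prompt.
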
## Intended Use

### Primary Use Cases

- Automatic soccer game commentary generation
- Sports video understanding and description
- Multimodal video-to-text generation

### Limitations

- Trained specifically on soccer/football content
- Requires pre-extracted video features
- Performance may vary on different video qualities or angles

## Training Data

The model was trained on the MatchTime dataset, which contains:

- Soccer game videos with corresponding commentary
- Multiple leagues and seasons
- Temporal alignment between visual events and commentary

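
To make the idea of temporal alignment concrete, the sketch below pairs a timestamped commentary line with the window of frames around it. Everything here is hypothetical and only illustrative: the `Commentary` record, the `frame_window` helper, the 2 fps sampling rate, the 30-second window, and the sample sentence are assumptions, not properties of the MatchTime dataset.

```python
from dataclasses import dataclass


@dataclass
class Commentary:
    timestamp_s: float  # time of the comment, in seconds from the start of the half
    text: str


def frame_window(timestamp_s, fps=2.0, window_s=30.0, num_frames=None):
    """Return the frame indices of a window centred on a commentary timestamp."""
    start_s = max(0.0, timestamp_s - window_s / 2)
    end_s = timestamp_s + window_s / 2
    first = int(start_s * fps)
    last = int(end_s * fps)
    if num_frames is not None:
        # Do not run past the end of the video.
        last = min(last, num_frames)
    return list(range(first, last))


comment = Commentary(timestamp_s=100.0, text="A dangerous cross into the box.")
indices = frame_window(comment.timestamp_s)
print(indices[0], indices[-1], len(indices))  # 170 229 60
```

If the commentary timestamps are off by even a few seconds, the window no longer covers the event being described, which is why alignment quality matters for training.
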
## Performance

The model achieves state-of-the-art performance on the MatchTime benchmark, with the best validation CIDEr score among tested configurations.

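
For readers unfamiliar with CIDEr, the sketch below shows the core idea: a TF-IDF-weighted n-gram cosine similarity between generated and reference commentary, averaged over n-gram orders 1 to 4 and scaled by 10. It is a deliberately simplified stand-in, not the official scorer (it omits stemming and the clipping and length penalty of CIDEr-D), and the function name `cider_like` and the sample sentences are made up for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cider_like(candidates, references, max_n=4):
    """candidates: list[str]; references: list[list[str]], one list per candidate."""
    num_clips = len(candidates)
    total = 0.0
    for n in range(1, max_n + 1):
        # Document frequency: in how many clips' references does each n-gram occur?
        doc_freq = Counter()
        for refs in references:
            seen = set()
            for ref in refs:
                seen.update(ngrams(ref.lower().split(), n))
            doc_freq.update(seen)

        def tfidf(text):
            counts = ngrams(text.lower().split(), n)
            length = sum(counts.values()) or 1
            return {g: (c / length) * math.log(num_clips / max(doc_freq[g], 1))
                    for g, c in counts.items()}

        def cosine(a, b):
            dot = sum(w * b.get(g, 0.0) for g, w in a.items())
            norm = (math.sqrt(sum(w * w for w in a.values()))
                    * math.sqrt(sum(w * w for w in b.values())))
            return dot / norm if norm else 0.0

        for cand, refs in zip(candidates, references):
            cand_vec = tfidf(cand)
            total += sum(cosine(cand_vec, tfidf(r)) for r in refs) / len(refs)
    return 10.0 * total / (max_n * num_clips)


refs = [["the striker scores a brilliant goal"], ["the keeper makes a fine save"]]
perfect = cider_like(["the striker scores a brilliant goal", "the keeper makes a fine save"], refs)
partial = cider_like(["the striker scores a goal", "the keeper makes a fine save"], refs)
print(round(perfect, 3), perfect > partial)  # 10.0 True
```

Because n-grams that occur in every clip's references receive zero weight, generic phrases contribute little, and the score rewards commentary that matches the event-specific wording of the references.
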
## Usage

```python
import torch

from models.matchvoice_model import matchvoice_model

device = "cuda:0"

# Load model
model = matchvoice_model(
    llm_ckpt="meta-llama/Meta-Llama-3-8B-Instruct",
    tokenizer_ckpt="meta-llama/Meta-Llama-3-8B-Instruct",
    num_video_query_token=32,
    num_features=512,
    device=device,
    inference=True,
)

# Load checkpoint onto the same device as the model
checkpoint = torch.load("model_save_best_val_CIDEr.pth", map_location=device)
model.load_state_dict(checkpoint)
model.eval()

# Generate commentary.
# `video_samples` is not defined in this snippet: it must be a batch of
# pre-extracted video features prepared by your own data pipeline.
with torch.no_grad():
    commentary = model(video_samples)
```