---
library_name: transformers
tags:
- multimodal
- video-understanding
- football
language:
- en
datasets:
- MatchTime
pipeline_tag: text-generation
model_type: matchcommentary
base_model: meta-llama/Meta-Llama-3-8B-Instruct
---

# Matchcommentary: Automatic Soccer Game Commentary Generation Model

## Model Overview

Matchcommentary is a multimodal model that generates fluent soccer commentary text from video features. It combines visual feature extraction, a Q-Former, and a large language model (LLaMA-3-8B-Instruct).

## Model Architecture

- **Base Model**: LLaMA-3-8B-Instruct
- **Vision Encoder**: Q-Former architecture
- **Feature Dimension**: 512-dimensional video features
- **Window Size**: 15-second video clips
- **Query Tokens**: 32 video query tokens

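
The role of the 32 query tokens can be illustrated with a simplified sketch: a fixed set of query vectors cross-attends over a variable number of 512-dimensional frame features and returns a fixed-size set of video tokens for the language model. This is a single-head NumPy illustration of the general Q-Former idea, not the model's actual implementation; the function names, the random weights, and the 30-frame clip length are assumptions made for the example.

```python
import numpy as np

NUM_QUERY_TOKENS = 32   # from the architecture list
FEATURE_DIM = 512       # from the architecture list


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def query_cross_attention(queries, features):
    """Single-head cross-attention: queries attend over video features.

    queries:  [num_queries, dim]  learnable tokens (random stand-ins here)
    features: [time_length, dim]  per-frame video features
    returns:  [num_queries, dim]  fixed-size summary of the clip
    """
    scale = np.sqrt(queries.shape[-1])
    scores = queries @ features.T / scale   # [num_queries, time_length]
    weights = softmax(scores, axis=-1)      # each row sums to 1
    return weights @ features               # [num_queries, dim]


rng = np.random.default_rng(0)
queries = rng.standard_normal((NUM_QUERY_TOKENS, FEATURE_DIM))
clip = rng.standard_normal((30, FEATURE_DIM))  # hypothetical 30-frame clip
video_tokens = query_cross_attention(queries, clip)
print(video_tokens.shape)  # (32, 512)
```

The output size depends only on the number of query tokens, so clips of different lengths all map to the same number of video tokens.
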
## Usage

### Install Dependencies

```bash
pip install torch transformers einops pycocoevalcap opencv-python numpy
```

### Quick Start

```python
from models.matchvoice_model import matchvoice_model
from matchvoice_dataset import MatchVoice_Dataset
import torch

# Load model
model = matchvoice_model(
    llm_ckpt="meta-llama/Meta-Llama-3-8B-Instruct",
    tokenizer_ckpt="meta-llama/Meta-Llama-3-8B-Instruct",
    num_video_query_token=32,
    num_features=512,
    device="cuda:0",
    inference=True,
)

# Load checkpoint
checkpoint = torch.load("model_save_best_val_CIDEr.pth", map_location="cpu")
model.load_state_dict(checkpoint)
model.eval()

# Perform inference (requires prepared video features).
# `samples` is not defined in this snippet: it must be a prepared input batch,
# for example one built with MatchVoice_Dataset.
with torch.no_grad():
    predictions = model(samples)
```

### Complete Inference Pipeline

Using the provided `inference1.py` script:

```bash
python inference1.py \
    --feature_root ./features \
    --ann_root ./dataset/MatchTime/train \
    --model_ckpt model_save_best_val_CIDEr.pth \
    --window 15 \
    --batch_size 4 \
    --num_video_query_token 32 \
    --num_features 512 \
    --csv_output_path ./inference_result/predictions.csv
```

## Input Data Format

The model expects the following input format:

1. **Video Features**: ResNet_PCA512 features with shape `[batch_size, time_length, feature_dim]`
2. **Timestamp Information**: Metadata including game time, event type, etc.
3. **Attention Mask**: For handling variable-length sequences

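
As a rough illustration of the feature tensor and attention mask, the sketch below zero-pads clips of different lengths to a common `time_length` and marks the real frames. The helper name `pad_feature_batch` and the mask convention (1 for real frames, 0 for padding) are assumptions for this example; the repository's dataset class may build its batches differently.

```python
import numpy as np


def pad_feature_batch(clips, feature_dim=512):
    """Pad variable-length clips to a common length and build a mask.

    clips: list of arrays, each [time_length_i, feature_dim]
    returns:
      features: [batch_size, max_time_length, feature_dim], zero-padded
      mask:     [batch_size, max_time_length], 1 = real frame, 0 = padding
    """
    max_len = max(clip.shape[0] for clip in clips)
    features = np.zeros((len(clips), max_len, feature_dim), dtype=np.float32)
    mask = np.zeros((len(clips), max_len), dtype=np.int64)
    for i, clip in enumerate(clips):
        if clip.shape[1] != feature_dim:
            raise ValueError(
                f"clip {i} has feature_dim {clip.shape[1]}, expected {feature_dim}"
            )
        length = clip.shape[0]
        features[i, :length] = clip   # copy real frames, leave the rest zero
        mask[i, :length] = 1          # mark real frames
    return features, mask


# Two hypothetical clips of different lengths (random stand-ins for real features).
rng = np.random.default_rng(0)
clips = [rng.standard_normal((30, 512)), rng.standard_normal((22, 512))]
features, mask = pad_feature_batch(clips)
print(features.shape, mask.shape)  # (2, 30, 512) (2, 30)
print(mask.sum(axis=1))            # [30 22]
```
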
## Output Format

The inference script writes a CSV file with the following columns:

- `league`: League and season information
- `game`: Game name
- `half`: First/second half
- `timestamp`: Event timestamp
- `type`: Soccer event type
- `anonymized`: Ground truth annotation
- `predicted_res_{i}`: Model prediction results, one column per index `i`

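
A file with these columns can be read using only the standard library. The sketch below is a minimal example, assuming the header names match the list above exactly and that every prediction column starts with `predicted_res_`; the helper name `load_predictions` and the sample row are placeholders for illustration, not real model output.

```python
import csv
import io

# Metadata columns from the list above; prediction columns share a prefix.
META_COLUMNS = ["league", "game", "half", "timestamp", "type"]
PREDICTION_PREFIX = "predicted_res_"


def load_predictions(fileobj):
    """Return one record per row: metadata, reference text, predictions."""
    reader = csv.DictReader(fileobj)
    pred_cols = [
        c for c in (reader.fieldnames or []) if c.startswith(PREDICTION_PREFIX)
    ]
    records = []
    for row in reader:
        records.append({
            "meta": {k: row[k] for k in META_COLUMNS},
            "reference": row["anonymized"],
            "predictions": [row[c] for c in pred_cols],
        })
    return records


# Placeholder row for illustration only; not real model output.
sample_csv = io.StringIO(
    "league,game,half,timestamp,type,anonymized,predicted_res_0\n"
    "example-league,example-game,1,00:10,corner,reference text,predicted text\n"
)
records = load_predictions(sample_csv)
print(records[0]["reference"], "|", records[0]["predictions"])
```

For a real run, pass a file opened with `open(path, newline="", encoding="utf-8")`, where `path` is the value given to `--csv_output_path`.
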
## Model Features

- Supports multiple video feature formats (ResNet, C3D, CLIP, etc.)
- Generation constrained to a soccer-specific vocabulary
- Supports both batch inference and single-video inference
- Q-Former-based multimodal fusion architecture

## Performance Metrics

Evaluation on the MatchTime dataset:

- The checkpoint `model_save_best_val_CIDEr.pth` is the one that achieved the best validation CIDEr score
- Supports real-time soccer commentary generation