---
library_name: transformers
tags:
- multimodal
- video-understanding
- football
language:
- en
datasets:
- MatchTime
pipeline_tag: text-generation
model_type: matchcommentary
base_model: meta-llama/Meta-Llama-3-8B-Instruct
---
# Matchcommentary: Automatic Soccer Game Commentary Generation Model
## Model Overview
Matchcommentary is a multimodal model that automatically generates fluent soccer commentary from video features. It combines visual feature extraction, a Q-Former bridge, and a large language model to produce high-quality commentary text.
## Model Architecture
- **Base Model**: LLaMA-3-8B-Instruct
- **Vision Encoder**: Q-Former architecture
- **Feature Dimension**: 512-dimensional video features
- **Window Size**: 15-second video clips
- **Query Tokens**: 32 video query tokens
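The numbers above can be illustrated with a minimal, hypothetical sketch of the Q-Former bridge: 32 learnable query tokens cross-attend to the 512-dimensional video features and are projected into the LLM embedding space (the class name and layer layout are illustrative, not the actual implementation; 4096 is the LLaMA-3-8B hidden size):

```python
import torch
import torch.nn as nn

class QFormerSketch(nn.Module):
    """Toy Q-Former-style bridge: learnable queries cross-attend to video features."""
    def __init__(self, num_query_tokens=32, feature_dim=512, llm_dim=4096):
        super().__init__()
        # 32 learnable video query tokens
        self.query_tokens = nn.Parameter(torch.randn(num_query_tokens, feature_dim))
        self.cross_attn = nn.MultiheadAttention(feature_dim, num_heads=8, batch_first=True)
        # Project attended queries into the LLM embedding space
        self.proj = nn.Linear(feature_dim, llm_dim)

    def forward(self, video_feats):  # video_feats: [B, T, 512]
        b = video_feats.size(0)
        q = self.query_tokens.unsqueeze(0).expand(b, -1, -1)   # [B, 32, 512]
        out, _ = self.cross_attn(q, video_feats, video_feats)  # [B, 32, 512]
        return self.proj(out)                                  # [B, 32, 4096]

feats = torch.randn(2, 15, 512)  # 2 clips, 15 time steps, 512-dim features
tokens = QFormerSketch()(feats)
print(tokens.shape)  # torch.Size([2, 32, 4096])
```

The 32 resulting tokens stand in for the video when they are prepended to the LLM's text embeddings.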
## Usage
### Install Dependencies
```bash
pip install torch transformers einops pycocoevalcap opencv-python numpy
```
### Quick Start
```python
from models.matchvoice_model import matchvoice_model
from matchvoice_dataset import MatchVoice_Dataset
import torch
# Load model
model = matchvoice_model(
    llm_ckpt="meta-llama/Meta-Llama-3-8B-Instruct",
    tokenizer_ckpt="meta-llama/Meta-Llama-3-8B-Instruct",
    num_video_query_token=32,
    num_features=512,
    device="cuda:0",
    inference=True
)
# Load checkpoint
checkpoint = torch.load("model_save_best_val_CIDEr.pth", map_location="cpu")
model.load_state_dict(checkpoint)
model.eval()
# Perform inference (requires prepared video features)
with torch.no_grad():
    predictions = model(samples)
```
### Complete Inference Pipeline
Using the provided `inference1.py` script:
```bash
python inference1.py \
--feature_root ./features \
--ann_root ./dataset/MatchTime/train \
--model_ckpt model_save_best_val_CIDEr.pth \
--window 15 \
--batch_size 4 \
--num_video_query_token 32 \
--num_features 512 \
--csv_output_path ./inference_result/predictions.csv
```
## Input Data Format
The model expects the following input format:
1. **Video Features**: ResNet_PCA512 features with shape `[batch_size, time_length, feature_dim]`
2. **Timestamp Information**: Metadata including game time, event type, etc.
3. **Attention Mask**: For handling variable-length sequences
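A hypothetical batch in this format could be assembled as follows; the key names are illustrative only, and the real batch schema comes from `MatchVoice_Dataset`:

```python
import torch

batch_size, time_length, feature_dim = 4, 15, 512  # 15 s window, ResNet_PCA512

# Illustrative key names; consult MatchVoice_Dataset for the real batch schema.
samples = {
    "features": torch.randn(batch_size, time_length, feature_dim),
    "attention_mask": torch.ones(batch_size, time_length, dtype=torch.long),
    "timestamps": ["1 - 12:34"] * batch_size,  # half and game clock (made up)
    "types": ["goal"] * batch_size,            # soccer event type (made up)
}

# A clip shorter than the window zeroes its mask past the last real frame.
samples["attention_mask"][0, 10:] = 0
```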
## Output Format
The model outputs a CSV file with the following columns:
- `league`: League and season information
- `game`: Game name
- `half`: First/second half
- `timestamp`: Event timestamp
- `type`: Soccer event type
- `anonymized`: Ground truth annotation
- `predicted_res_{i}`: The model's *i*-th generated commentary for the event
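For downstream processing, each row can be read with the standard `csv` module. The example below parses a made-up row with a single prediction column (`predicted_res_0`):

```python
import csv
import io

# One made-up row matching the documented columns, with a single prediction.
csv_text = (
    "league,game,half,timestamp,type,anonymized,predicted_res_0\n"
    "england_epl_2016-2017,TeamA vs TeamB,1,12:34,goal,"
    "\"[PLAYER] ([TEAM]) scores!\",\"A superb finish into the bottom corner!\"\n"
)

rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["type"], "->", rows[0]["predicted_res_0"])
```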
## Model Features
- Supports multiple video feature formats (ResNet, C3D, CLIP, etc.)
- Soccer-specific vocabulary constraint generation
- Supports both batch inference and single video inference
- Q-Former-based multimodal fusion architecture
## Performance Metrics
The model is evaluated on the MatchTime dataset with standard captioning metrics. The released checkpoint, `model_save_best_val_CIDEr.pth`, is the one that achieved the best validation CIDEr score, and inference is fast enough to support real-time soccer commentary generation.