---
library_name: transformers
tags:
- multimodal
- video-understanding
- football
language:
- en
datasets:
- MatchTime
pipeline_tag: text-generation
model_type: matchcommentary
base_model: meta-llama/Meta-Llama-3-8B-Instruct
---

# Matchcommentary: Automatic Soccer Game Commentary Generation Model

## Model Overview

Matchcommentary is a multimodal model that generates fluent soccer commentary text from pre-extracted video features. It combines visual feature extraction, a Q-Former, and a large language model (LLaMA-3-8B-Instruct).

## Model Architecture

- **Base Model**: LLaMA-3-8B-Instruct
- **Vision Encoder**: Q-Former architecture
- **Feature Dimension**: 512-dimensional video features
- **Window Size**: 15-second video clips
- **Query Tokens**: 32 video query tokens
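To illustrate how a fixed set of 32 query tokens can summarize a clip of any length, here is a minimal NumPy sketch of single-head cross-attention over 512-dimensional frame features. It is a simplified stand-in, not the model's actual Q-Former: the function name `compress_with_queries` is hypothetical, the queries are random rather than learned, and a real Q-Former adds learned projections, self-attention, and multiple layers.

```python
import numpy as np

def compress_with_queries(features, queries):
    """Single-head cross-attention: query tokens attend over frame features.

    features: (time_length, dim) frame features for one clip
    queries:  (num_queries, dim) query tokens (learned in a real model)
    returns:  (num_queries, dim) fixed-size summary of the clip
    """
    dim = features.shape[-1]
    scores = queries @ features.T / np.sqrt(dim)      # (num_queries, time_length)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over time steps
    return weights @ features

rng = np.random.default_rng(0)
queries = rng.standard_normal((32, 512))      # random stand-in for learned queries
short_clip = rng.standard_normal((20, 512))   # 20 time steps (illustrative)
long_clip = rng.standard_normal((60, 512))    # 60 time steps (illustrative)

short_tokens = compress_with_queries(short_clip, queries)
long_tokens = compress_with_queries(long_clip, queries)
print(short_tokens.shape, long_tokens.shape)  # (32, 512) (32, 512)
```

Whatever the clip length, the output has one row per query token, which is what lets the language model receive a fixed number of video tokens.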

## Usage

### Install Dependencies

```bash
pip install torch transformers einops pycocoevalcap opencv-python numpy
```

### Quick Start

```python
from models.matchvoice_model import matchvoice_model
from matchvoice_dataset import MatchVoice_Dataset
import torch

# Load model
model = matchvoice_model(
    llm_ckpt="meta-llama/Meta-Llama-3-8B-Instruct",
    tokenizer_ckpt="meta-llama/Meta-Llama-3-8B-Instruct",
    num_video_query_token=32,
    num_features=512,
    device="cuda:0",
    inference=True
)

# Load checkpoint
checkpoint = torch.load("model_save_best_val_CIDEr.pth", map_location="cpu")
model.load_state_dict(checkpoint)
model.eval()

# Perform inference. `samples` is a batch of prepared video features
# (see "Input Data Format"), for example a batch built from MatchVoice_Dataset.
with torch.no_grad():
    predictions = model(samples)
```

### Complete Inference Pipeline

Using the provided `inference1.py` script:

```bash
python inference1.py \
    --feature_root ./features \
    --ann_root ./dataset/MatchTime/train \
    --model_ckpt model_save_best_val_CIDEr.pth \
    --window 15 \
    --batch_size 4 \
    --num_video_query_token 32 \
    --num_features 512 \
    --csv_output_path ./inference_result/predictions.csv
```

## Input Data Format

The model expects the following inputs:

1. **Video Features**: ResNet_PCA512 features with shape `[batch_size, time_length, feature_dim]`
2. **Timestamp Information**: Metadata including game time, event type, etc.
3. **Attention Mask**: For handling variable-length sequences
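The feature tensor and attention mask listed above can be sketched as follows. This is an illustrative example only: `pad_clips` is a hypothetical helper, and the mask convention (`True` marks real time steps, `False` marks padding) is an assumption, so check the dataset code for the convention the model actually uses.

```python
import numpy as np

def pad_clips(clips, feature_dim=512):
    """Zero-pad variable-length clips into one batch and build a mask.

    clips: list of arrays, each shaped (time_length_i, feature_dim)
    returns: (batch, mask) with shapes
             (batch_size, max_time_length, feature_dim) and
             (batch_size, max_time_length)
    """
    max_len = max(clip.shape[0] for clip in clips)
    batch = np.zeros((len(clips), max_len, feature_dim), dtype=np.float32)
    mask = np.zeros((len(clips), max_len), dtype=bool)
    for i, clip in enumerate(clips):
        length = clip.shape[0]
        batch[i, :length] = clip   # copy real features
        mask[i, :length] = True    # mark real (non-padded) time steps
    return batch, mask

# Two clips of different lengths (placeholder values, not real features)
clips = [
    np.ones((30, 512), dtype=np.float32),
    np.ones((18, 512), dtype=np.float32),
]
features, attention_mask = pad_clips(clips)
print(features.shape, attention_mask.shape)  # (2, 30, 512) (2, 30)
print(attention_mask.sum(axis=1))            # [30 18]
```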

## Output Format

The inference script writes a CSV file with the following columns:
- `league`: League and season information
- `game`: Game name
- `half`: First/second half
- `timestamp`: Event timestamp
- `type`: Soccer event type
- `anonymized`: Ground truth annotation
- `predicted_res_{i}`: Model prediction results, one column per index `i`
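A CSV with the columns above can be read with the standard library, as in this small sketch that pairs each ground-truth annotation with its predictions. The helper `read_predictions` is hypothetical, and the sample row is made up for illustration: the value formats for `league`, `half`, and `timestamp` are assumptions, not actual model output.

```python
import csv
import io

COLUMNS = ["league", "game", "half", "timestamp", "type",
           "anonymized", "predicted_res_0"]

def read_predictions(handle):
    """Return (reference, [predictions]) pairs from a predictions CSV."""
    pairs = []
    for row in csv.DictReader(handle):
        predictions = [value for name, value in row.items()
                       if name.startswith("predicted_res_")]
        pairs.append((row["anonymized"], predictions))
    return pairs

# Build a one-row in-memory CSV; every value below is a made-up placeholder.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerow({
    "league": "example-league_2020-2021",
    "game": "example-game",
    "half": "1",
    "timestamp": "12:34",
    "type": "corner",
    "anonymized": "[PLAYER] ([TEAM]) wins a corner.",
    "predicted_res_0": "[PLAYER] ([TEAM]) earns a corner kick.",
})
buffer.seek(0)

pairs = read_predictions(buffer)
print(pairs[0])
```

To read a real results file, pass an open file handle instead of the in-memory buffer, for example `open("./inference_result/predictions.csv", newline="")`.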

## Model Features

- Supports multiple video feature formats (ResNet, C3D, CLIP, etc.)
- Generation constrained to a soccer-specific vocabulary
- Supports both batch inference and single video inference
- Q-Former-based multimodal fusion architecture

## Performance Metrics

Evaluation results on the MatchTime dataset:
- The checkpoint `model_save_best_val_CIDEr.pth` corresponds to the best validation CIDEr score
- Supports real-time soccer commentary generation