	---
	library_name: transformers
	tags:
	- multimodal
	- video-understanding
	- football
	language:
	- en
	datasets:
	- MatchTime
	pipeline_tag: text-generation
	model_type: matchcommentary
	base_model: meta-llama/Meta-Llama-3-8B-Instruct
	---

	# Matchcommentary: Automatic Soccer Game Commentary Generation Model

	## Model Overview

	Matchcommentary is a multimodal model that generates fluent English soccer commentary from pre-extracted video features. It combines visual feature extraction, a Q-Former, and a large language model (LLaMA-3-8B-Instruct).

	## Model Architecture

	- Base Model: LLaMA-3-8B-Instruct
	- Vision Encoder: Q-Former that aggregates the video features into query tokens
	- Feature Dimension: 512-dimensional video features
	- Window Size: 15-second video clips
	- Query Tokens: 32 video query tokens
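
To make the shapes in the list above concrete, here is a minimal NumPy sketch of the core Q-Former idea: a fixed set of learned query tokens cross-attends to per-frame features and returns one vector per query. It is an illustration under assumptions, not the model's actual layers: the weights are random, there is a single attention step with no projections, and the 30-frame clip length is hypothetical (the frame rate is not stated in this card).

```python
import numpy as np

NUM_QUERY_TOKENS = 32  # "Query Tokens" in the list above
FEATURE_DIM = 512      # "Feature Dimension" in the list above
NUM_FRAMES = 30        # hypothetical: frames in one 15-second clip


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries, frames):
    """Single-head cross-attention without learned projections.

    queries: [num_queries, dim], frames: [num_frames, dim].
    Returns one weighted average of the frames per query: [num_queries, dim].
    """
    scores = queries @ frames.T / np.sqrt(queries.shape[-1])  # [Q, T]
    weights = softmax(scores, axis=-1)                        # rows sum to 1
    return weights @ frames                                   # [Q, D]


rng = np.random.default_rng(0)
queries = rng.normal(size=(NUM_QUERY_TOKENS, FEATURE_DIM))  # stand-in for learned queries
frames = rng.normal(size=(NUM_FRAMES, FEATURE_DIM))         # stand-in for video features

video_tokens = cross_attend(queries, frames)
print(video_tokens.shape)  # (32, 512)
```

Whatever the clip length, the output always has 32 rows, which is what lets a variable-length clip be handed to the language model as a fixed number of video tokens.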

	## Usage

	### Install Dependencies

	```bash
	pip install torch transformers einops pycocoevalcap opencv-python numpy
	```

	### Quick Start

	```python
	from models.matchvoice_model import matchvoice_model
	from matchvoice_dataset import MatchVoice_Dataset
	import torch

	# Load model
	model = matchvoice_model(
	    llm_ckpt="meta-llama/Meta-Llama-3-8B-Instruct",
	    tokenizer_ckpt="meta-llama/Meta-Llama-3-8B-Instruct",
	    num_video_query_token=32,
	    num_features=512,
	    device="cuda:0",
	    inference=True,
	)

	# Load checkpoint
	checkpoint = torch.load("model_save_best_val_CIDEr.pth", map_location="cpu")
	model.load_state_dict(checkpoint)
	model.eval()

	# Perform inference. `samples` is one batch of prepared video features,
	# e.g. a batch built from MatchVoice_Dataset (construction not shown here).
	with torch.no_grad():
	    predictions = model(samples)
	```

	### Complete Inference Pipeline

	Using the provided `inference1.py` script:

	```bash
	python inference1.py \
	    --feature_root ./features \
	    --ann_root ./dataset/MatchTime/train \
	    --model_ckpt model_save_best_val_CIDEr.pth \
	    --window 15 \
	    --batch_size 4 \
	    --num_video_query_token 32 \
	    --num_features 512 \
	    --csv_output_path ./inference_result/predictions.csv
	```

	## Input Data Format

	The model expects the following input format:

	1. Video Features: ResNet_PCA512 features with shape `[batch_size, time_length, feature_dim]`, where `feature_dim` is 512
	2. Timestamp Information: Metadata such as game time and event type
	3. Attention Mask: Marks the valid time steps when variable-length sequences are padded into one batch
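
As a rough illustration of items 1 and 3, the sketch below pads clips of different lengths into a single `[batch_size, time_length, feature_dim]` array and builds the matching attention mask. The function name `collate_features` and the clip lengths are hypothetical; the repository's own dataset code may batch differently.

```python
import numpy as np


def collate_features(clips, feature_dim=512):
    """Pad [time_length_i, feature_dim] arrays to a common length.

    Returns (features, attention_mask), where attention_mask is True
    for real time steps and False for zero padding.
    """
    max_len = max(clip.shape[0] for clip in clips)
    features = np.zeros((len(clips), max_len, feature_dim), dtype=np.float32)
    attention_mask = np.zeros((len(clips), max_len), dtype=bool)
    for i, clip in enumerate(clips):
        if clip.shape[1] != feature_dim:
            raise ValueError(f"clip {i} has feature_dim {clip.shape[1]}, expected {feature_dim}")
        features[i, : clip.shape[0]] = clip
        attention_mask[i, : clip.shape[0]] = True
    return features, attention_mask


# Two made-up clips with 30 and 22 time steps of 512-dimensional features.
clips = [
    np.ones((30, 512), dtype=np.float32),
    np.ones((22, 512), dtype=np.float32),
]
features, attention_mask = collate_features(clips)
print(features.shape, attention_mask.shape)  # (2, 30, 512) (2, 30)
```

Downstream attention layers can then ignore the padded positions by consulting `attention_mask`.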

	## Output Format

	The inference script writes a CSV file (set with `--csv_output_path`) with the following columns:
	- `league`: League and season information
	- `game`: Game name
	- `half`: First/second half
	- `timestamp`: Event timestamp
	- `type`: Soccer event type
	- `anonymized`: Ground truth annotation
	- `predicted_res_{i}`: Model prediction results
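
The columns above can be read back with Python's standard `csv` module. The sketch below pairs each ground-truth annotation with its predictions; it assumes the `{i}` suffix is an integer, and the sample row is made up purely to show the layout.

```python
import csv
import io

PREFIX = "predicted_res_"

# Made-up sample row; real files come from --csv_output_path.
SAMPLE = (
    "league,game,half,timestamp,type,anonymized,predicted_res_0,predicted_res_1\n"
    "example-league,example-game,1,12:34,corner,"
    "[PLAYER] takes the corner.,[PLAYER] swings in a corner.,A corner for [TEAM].\n"
)


def load_predictions(csv_file):
    """Return one dict per row with the reference text and all candidates."""
    reader = csv.DictReader(csv_file)
    # Sort prediction columns by their numeric suffix (assumed to be an integer).
    pred_cols = sorted(
        (c for c in reader.fieldnames if c.startswith(PREFIX)),
        key=lambda c: int(c[len(PREFIX):]),
    )
    return [
        {"reference": row["anonymized"], "candidates": [row[c] for c in pred_cols]}
        for row in reader
    ]


records = load_predictions(io.StringIO(SAMPLE))
print(records[0]["candidates"])
```

For a file on disk, pass `open(path, newline="", encoding="utf-8")` instead of the `io.StringIO` object.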

	## Model Features

	- Supports multiple video feature formats (ResNet, C3D, CLIP, etc.)
	- Generation constrained to a soccer-specific vocabulary
	- Supports both batch inference and single video inference
	- Q-Former-based multimodal fusion architecture

	## Performance Metrics

	Evaluation results on the MatchTime dataset:
	- The checkpoint `model_save_best_val_CIDEr.pth` is the one with the best validation CIDEr score
	- Supports real-time soccer commentary generation