Upload folder using huggingface_hub

Files changed (7)

README.md +99 -0
ckpt/model_save_best_val_CIDEr.pth +3 -0
config.json +17 -0
inference.py +207 -0
model_card.md +78 -0
requirements.txt +14 -0
soccer_words_llama3.pkl +3 -0

README.md ADDED

	@@ -0,0 +1,99 @@

+# Matchcommentary: Automatic Soccer Game Commentary Generation Model
+## Model Overview
+Matchcommentary is a multimodal model that generates fluent soccer commentary text from pre-extracted video features. It combines a Q-Former over the visual features with a large language model (LLaMA-3-8B-Instruct).
+## Model Architecture
+- **Base Model**: LLaMA-3-8B-Instruct
+- **Vision Encoder**: Q-Former architecture
+- **Feature Dimension**: 512-dimensional video features
+- **Window Size**: 15-second video clips
+- **Query Tokens**: 32 video query tokens
+## Usage
+### Install Dependencies
+```bash
+pip install -r requirements.txt
+```
+### Quick Start
+```python
+from models.matchvoice_model import matchvoice_model
+from matchvoice_dataset import MatchVoice_Dataset
+import torch
+# Load model
+model = matchvoice_model(
+    llm_ckpt="meta-llama/Meta-Llama-3-8B-Instruct",
+    tokenizer_ckpt="meta-llama/Meta-Llama-3-8B-Instruct",
+    num_video_query_token=32,
+    num_features=512,
+    device="cuda:0",
+    inference=True
+)
+# Load checkpoint
+checkpoint = torch.load("ckpt/model_save_best_val_CIDEr.pth", map_location="cpu")
+model.load_state_dict(checkpoint)
+model.eval()
+# Perform inference (requires prepared video features)
+with torch.no_grad():
+    predictions = model(samples)
+```
+### Complete Inference Pipeline
+Using the provided `inference.py` script (`--model_path` is the directory that contains the checkpoint):
+```bash
+python inference.py \
+    --model_path ./ckpt \
+    --feature_root ./features \
+    --ann_root ./dataset/MatchTime/train \
+    --window 15 \
+    --fps 0.5 \
+    --batch_size 4 \
+    --generate_num 1 \
+    --output_csv ./inference_result/predictions.csv
+```
+## Input Data Format
+The model expects the following input format:
+1. **Video Features**: ResNet_PCA512 features with shape `[batch_size, time_length, feature_dim]`
+2. **Timestamp Information**: Metadata including game time, event type, etc.
+3. **Attention Mask**: For handling variable-length sequences
+## Output Format
+The model outputs a CSV file with the following columns:
+- `league`: League and season information
+- `game`: Game name
+- `half`: First/second half
+- `timestamp`: Event timestamp
+- `type`: Soccer event type
+- `anonymized`: Ground truth annotation
+- `predicted_res_{i}`: Model prediction results
+## Model Features
+- Supports multiple video feature formats (ResNet, C3D, CLIP, etc.)
+- Generation constrained to a soccer-specific vocabulary
+- Supports both batch inference and single video inference
+- Q-Former-based multimodal fusion architecture
+## Performance Metrics
+Evaluation results on the MatchTime dataset:
+- The released checkpoint (`ckpt/model_save_best_val_CIDEr.pth`) is the one that scored the best validation CIDEr
+- Supports real-time soccer commentary generation
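
Note on the README's "Input Data Format": it asks for features of shape `[batch_size, time_length, feature_dim]` plus an attention mask for variable-length sequences. A minimal sketch of how such a batch could be assembled, assuming NumPy and a hypothetical helper `pad_feature_batch`. The repository's own `MatchVoice_Dataset.collater` is not part of this diff and may work differently.

```python
import numpy as np


def pad_feature_batch(clips, feature_dim=512):
    """Pad clips of shape [time_i, feature_dim] into one batch.

    Hypothetical helper for illustration only.
    Returns (features, attention_mask) with shapes
    [batch, max_time, feature_dim] and [batch, max_time].
    """
    for clip in clips:
        if clip.ndim != 2 or clip.shape[1] != feature_dim:
            raise ValueError(f"expected [time, {feature_dim}], got {clip.shape}")

    max_time = max(clip.shape[0] for clip in clips)
    features = np.zeros((len(clips), max_time, feature_dim), dtype=np.float32)
    attention_mask = np.zeros((len(clips), max_time), dtype=bool)

    for i, clip in enumerate(clips):
        length = clip.shape[0]
        features[i, :length] = clip      # real frames first
        attention_mask[i, :length] = True  # True marks real frames, False marks padding
    return features, attention_mask


# Two clips with different numbers of frames (lengths chosen arbitrarily).
clips = [
    np.ones((15, 512), dtype=np.float32),
    np.ones((9, 512), dtype=np.float32),
]
features, attention_mask = pad_feature_batch(clips)
```

The mask lets downstream attention layers ignore the zero-padded frames of the shorter clip.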

ckpt/model_save_best_val_CIDEr.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:6aea3d4f7776b9b1c40b518fe1ce0b5ed6a7d3c8c60f55113e9ed08d281439ba
+size 2186901790
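
Note on the checkpoint entry: the three added lines are a Git LFS pointer, not the weights themselves. After downloading the real file, its size and SHA-256 can be checked against the pointer. A minimal sketch using only the standard library; `parse_lfs_pointer` and `file_matches_pointer` are hypothetical helper names.

```python
import hashlib


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def file_matches_pointer(path, pointer, chunk_size=1 << 20):
    """Stream the file and compare its size and SHA-256 with the pointer."""
    if pointer["algo"] != "sha256":
        raise ValueError(f"unsupported hash algorithm: {pointer['algo']}")
    hasher = hashlib.sha256()
    size = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:6aea3d4f7776b9b1c40b518fe1ce0b5ed6a7d3c8c60f55113e9ed08d281439ba\n"
    "size 2186901790\n"
)
```

Calling `file_matches_pointer("ckpt/model_save_best_val_CIDEr.pth", pointer)` on the downloaded file would then report whether the download is intact.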

config.json ADDED

	@@ -0,0 +1,17 @@

+{
+  "architectures": ["MatchcommentaryModel"],
+  "model_type": "Matchcommentary",
+  "llm_ckpt": "meta-llama/Meta-Llama-3-8B-Instruct",
+  "tokenizer_ckpt": "meta-llama/Meta-Llama-3-8B-Instruct",
+  "max_frame_pos": 128,
+  "window": 15,
+  "num_query_tokens": 32,
+  "num_video_query_token": 32,
+  "num_features": 512,
+  "fps": 0.5,
+  "max_token_length": 128,
+  "feature_subdir": "ResNET_PCA512",
+  "torch_dtype": "float16",
+  "transformers_version": "4.42.3",
+  "description": "MatchcommentaryModel model for automatic soccer game commentary generation, trained on MatchTime dataset"
+}
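
Note on `config.json`: `inference.py` hard-codes the same values that this config records (`llm_ckpt`, `tokenizer_ckpt`, `num_video_query_token`, `num_features`). A minimal sketch of deriving the constructor arguments from the config instead, assuming a hypothetical helper `model_kwargs_from_config`. Nothing in this diff shows that `matchvoice_model` reads `config.json` itself.

```python
import json

# Keys that the constructor call in inference.py passes explicitly.
CONSTRUCTOR_KEYS = (
    "llm_ckpt",
    "tokenizer_ckpt",
    "num_video_query_token",
    "num_features",
)


def model_kwargs_from_config(config_text, device="cuda:0"):
    """Build keyword arguments for the model constructor from config JSON text.

    Hypothetical helper for illustration only.
    """
    config = json.loads(config_text)
    missing = [key for key in CONSTRUCTOR_KEYS if key not in config]
    if missing:
        raise KeyError(f"config is missing keys: {missing}")
    kwargs = {key: config[key] for key in CONSTRUCTOR_KEYS}
    kwargs.update(device=device, inference=True)
    return kwargs


# A subset of the config.json shown above.
config_text = """
{
  "llm_ckpt": "meta-llama/Meta-Llama-3-8B-Instruct",
  "tokenizer_ckpt": "meta-llama/Meta-Llama-3-8B-Instruct",
  "num_video_query_token": 32,
  "num_features": 512,
  "window": 15,
  "fps": 0.5
}
"""
kwargs = model_kwargs_from_config(config_text)
```

The result could be passed as `matchvoice_model(**kwargs)`, which keeps the script and the config from drifting apart.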

inference.py ADDED

	@@ -0,0 +1,207 @@

+#!/usr/bin/env python3
+"""
+Matchcommentary Model Inference Script - HuggingFace Version
+For automatic soccer commentary generation
+"""
+import torch
+import argparse
+import os
+import csv
+from tqdm import tqdm
+from typing import List, Dict, Any
+import json
+# Assuming model files are included in the HuggingFace repository
+from models.matchvoice_model import matchvoice_model
+from matchvoice_dataset import MatchVoice_Dataset
+from torch.utils.data import DataLoader
+class MatchcommentaryPredictor:
+    """Matchcommentary model inference class"""
+    def __init__(self, model_path: str = "./", device: str = "cuda:0"):
+        """
+        Initialize Matchcommentary predictor
+        Args:
+            model_path: Path to model files
+            device: Device to run on
+        """
+        self.device = device
+        self.model = None
+        self.load_model(model_path)
+    def load_model(self, model_path: str):
+        """Load the model"""
+        print("Loading Matchcommentary model...")
+        # Initialize model
+        self.model = matchvoice_model(
+            llm_ckpt="meta-llama/Meta-Llama-3-8B-Instruct",
+            tokenizer_ckpt="meta-llama/Meta-Llama-3-8B-Instruct",
+            num_video_query_token=32,
+            num_features=512,
+            device=self.device,
+            inference=True
+        )
+        # Load checkpoint
+        checkpoint_path = os.path.join(model_path, "model_save_best_val_CIDEr.pth")
+        if os.path.exists(checkpoint_path):
+            checkpoint = torch.load(checkpoint_path, map_location="cpu")
+            # Load state dict
+            model_state_dict = self.model.state_dict()
+            for key, value in checkpoint.items():
+                if key in model_state_dict:
+                    model_state_dict[key] = value
+            self.model.load_state_dict(model_state_dict)
+            print("Model checkpoint loaded successfully!")
+        else:
+            print(f"Warning: Model checkpoint file not found at {checkpoint_path}")
+        self.model.eval()
+    def predict_single(self, video_features: torch.Tensor) -> List[str]:
+        """
+        Predict commentary for a single video clip
+        Args:
+            video_features: Video feature tensor
+        Returns:
+            List of predicted commentary texts
+        """
+        with torch.no_grad():
+            # Build input sample format
+            samples = {
+                'features': video_features.to(self.device),
+                'caption_info': [["", "", "", "", "", ""]]  # Placeholder
+            }
+            predictions = self.model(samples)
+            return predictions
+    def predict_batch(self,
+                     feature_root: str,
+                     ann_root: str,
+                     output_csv: str,
+                     batch_size: int = 4,
+                     num_workers: int = 2,
+                     generate_num: int = 1,
+                     fps: float = 0.5,
+                     window: float = 15):
+        """
+        Batch prediction and save results to CSV file
+        Args:
+            feature_root: Root directory for video features
+            ann_root: Root directory for annotation files
+            output_csv: Output CSV file path
+            batch_size: Batch size for processing
+            num_workers: Number of data loading workers
+            generate_num: Number of commentary generations per video clip
+            fps: Feature extraction frame rate
+            window: Video window size in seconds
+        """
+        print("Preparing dataset...")
+        # Create dataset
+        test_dataset = MatchVoice_Dataset(
+            feature_root=feature_root,
+            ann_root=ann_root,
+            fps=fps,
+            timestamp_key="gameTime",
+            tokenizer_name="meta-llama/Meta-Llama-3-8B-Instruct",
+            window=window,
+            split_ratio=0.01,  # Use small subset for quick testing
+            is_train=False
+        )
+        test_data_loader = DataLoader(
+            test_dataset,
+            batch_size=batch_size,
+            num_workers=num_workers,
+            drop_last=False,
+            shuffle=False,
+            pin_memory=True,
+            collate_fn=test_dataset.collater
+        )
+        print("Dataset preparation completed, starting prediction...")
+        # Create output directory (dirname is "" for a bare filename, so fall back to ".")
+        os.makedirs(os.path.dirname(output_csv) or ".", exist_ok=True)
+        # Write CSV header
+        headers = ['league', 'game', 'half', 'timestamp', 'type', 'anonymized']
+        headers += [f'predicted_res_{i}' for i in range(generate_num)]
+        with open(output_csv, 'w', newline='', encoding='utf-8') as file:
+            writer = csv.writer(file)
+            writer.writerow(headers)
+        # Start prediction
+        with torch.no_grad():
+            for samples in tqdm(test_data_loader, desc="Prediction Progress"):
+                all_predictions = []
+                # Generate multiple predictions
+                for _ in range(generate_num):
+                    predicted_res = self.model(samples)
+                    all_predictions.append(predicted_res)
+                # Write results
+                caption_info = samples["caption_info"]
+                with open(output_csv, 'a', newline='', encoding='utf-8') as file:
+                    writer = csv.writer(file)
+                    for info in zip(*all_predictions, caption_info):
+                        row = [info[-1][4], info[-1][5], info[-1][0],
+                               info[-1][1], info[-1][2], info[-1][3]] + list(info[:-1])
+                        writer.writerow(row)
+        print(f"Prediction completed! Results saved to: {output_csv}")
+def main():
+    """Main function"""
+    parser = argparse.ArgumentParser(description="Matchcommentary Model Inference Script")
+    parser.add_argument("--model_path", type=str, default="./",
+                       help="Path to model files")
+    parser.add_argument("--feature_root", type=str, default="./features",
+                       help="Root directory for video features")
+    parser.add_argument("--ann_root", type=str, default="./dataset/MatchTime/train",
+                       help="Root directory for annotation files")
+    parser.add_argument("--output_csv", type=str, default="./predictions.csv",
+                       help="Output CSV file path")
+    parser.add_argument("--batch_size", type=int, default=4,
+                       help="Batch size for processing")
+    parser.add_argument("--num_workers", type=int, default=2,
+                       help="Number of data loading workers")
+    parser.add_argument("--generate_num", type=int, default=1,
+                       help="Number of commentary generations per video clip")
+    parser.add_argument("--device", type=str, default="cuda:0",
+                       help="Device to run on")
+    parser.add_argument("--fps", type=float, default=0.5,
+                       help="Feature extraction frame rate")
+    parser.add_argument("--window", type=float, default=15,
+                       help="Video window size in seconds")
+    args = parser.parse_args()
+    # Create predictor and run prediction
+    predictor = MatchcommentaryPredictor(args.model_path, args.device)
+    predictor.predict_batch(
+        feature_root=args.feature_root,
+        ann_root=args.ann_root,
+        output_csv=args.output_csv,
+        batch_size=args.batch_size,
+        num_workers=args.num_workers,
+        generate_num=args.generate_num,
+        fps=args.fps,
+        window=args.window
+    )
+if __name__ == "__main__":
+    main()
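
Note on the CSV writing loop in `predict_batch`: the index shuffling `info[-1][4], info[-1][5], info[-1][0], ...` implies that each `caption_info` entry is ordered `(half, timestamp, type, anonymized, league, game)`, while the CSV columns start with `league, game`. A minimal sketch of the same row assembly with named fields, using made-up sample values and a hypothetical helper `rows_for_batch`.

```python
import csv
import io

HEADERS = ["league", "game", "half", "timestamp", "type", "anonymized"]


def rows_for_batch(all_predictions, caption_info):
    """Build CSV rows for one batch.

    all_predictions: one list per generation run, each with one string per sample.
    caption_info: per sample, (half, timestamp, type, anonymized, league, game),
    the order inferred from the indices used in inference.py.
    """
    rows = []
    for *predictions, info in zip(*all_predictions, caption_info):
        half, timestamp, event_type, anonymized, league, game = info
        rows.append([league, game, half, timestamp, event_type, anonymized, *predictions])
    return rows


# Made-up values: two samples, two generation runs each.
all_predictions = [
    ["first sample, run 0", "second sample, run 0"],
    ["first sample, run 1", "second sample, run 1"],
]
caption_info = [
    ("1", "12:30", "corner", "reference text A", "league A", "game A"),
    ("2", "50:02", "soccer-ball", "reference text B", "league A", "game A"),
]

rows = rows_for_batch(all_predictions, caption_info)

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(HEADERS + [f"predicted_res_{i}" for i in range(len(all_predictions))])
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Unpacking into names makes a mismatch between the dataset's field order and the header easier to spot than positional indices.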

model_card.md ADDED

	@@ -0,0 +1,78 @@

+---
+library_name: transformers
+tags:
+- multimodal
+- video-understanding
+- sports
+- commentary-generation
+- llama3
+- soccer
+language:
+- en
+datasets:
+- MatchTime
+pipeline_tag: text-generation
+---
+# Matchcommentary: Automatic Soccer Game Commentary Generation
+## Model Description
+Matchcommentary is a multimodal model designed for automatic soccer game commentary generation. It combines video feature understanding with large language models to generate fluent and contextually appropriate soccer commentary.
+## Architecture
+The model consists of:
+- **Vision Encoder**: Q-Former architecture for processing video features
+- **Language Model**: LLaMA-3-8B-Instruct for text generation
+- **Feature Fusion**: Cross-attention mechanism between visual and textual information
+- **Domain Adaptation**: Soccer-specific vocabulary constraints
+## Intended Use
+### Primary Use Cases
+- Automatic soccer game commentary generation
+- Sports video understanding and description
+- Multimodal video-to-text generation
+### Limitations
+- Trained specifically on soccer/football content
+- Requires pre-extracted video features
+- Performance may vary on different video qualities or angles
+## Training Data
+The model was trained on the MatchTime dataset, which contains:
+- Soccer game videos with corresponding commentary
+- Multiple leagues and seasons
+- Temporal alignment between visual events and commentary
+## Performance
+The model achieves state-of-the-art performance on the MatchTime benchmark, with the best validation CIDEr score among tested configurations.
+## Usage
+```python
+from models.matchvoice_model import matchvoice_model
+import torch
+# Load model
+model = matchvoice_model(
+    llm_ckpt="meta-llama/Meta-Llama-3-8B-Instruct",
+    tokenizer_ckpt="meta-llama/Meta-Llama-3-8B-Instruct",
+    num_video_query_token=32,
+    num_features=512,
+    device="cuda:0",
+    inference=True
+)
+# Load checkpoint
+checkpoint = torch.load("ckpt/model_save_best_val_CIDEr.pth", map_location="cpu")
+model.load_state_dict(checkpoint)
+model.eval()
+# Generate commentary
+with torch.no_grad():
+    commentary = model(video_samples)
+```
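
Note on checkpoint loading: the model card snippet calls `model.load_state_dict(checkpoint)`, which is strict by default in PyTorch, while `inference.py` copies only the checkpoint keys that already exist in the model and silently skips the rest. A minimal sketch of reporting the difference between the two key sets before loading, written on plain key collections so it runs without PyTorch. The key names below are hypothetical.

```python
def compare_state_dict_keys(model_keys, checkpoint_keys):
    """Report which parameter names overlap between a model and a checkpoint."""
    model_keys = set(model_keys)
    checkpoint_keys = set(checkpoint_keys)
    return {
        "loaded": sorted(model_keys & checkpoint_keys),
        "missing_in_checkpoint": sorted(model_keys - checkpoint_keys),
        "unexpected_in_checkpoint": sorted(checkpoint_keys - model_keys),
    }


# Hypothetical parameter names, for illustration only.
model_keys = ["qformer.weight", "proj.weight", "proj.bias", "llm.embed.weight"]
checkpoint_keys = ["qformer.weight", "proj.weight", "proj.bias", "old_head.weight"]

report = compare_state_dict_keys(model_keys, checkpoint_keys)
for name, keys in report.items():
    print(f"{name}: {keys}")
```

With a real model, `model.state_dict().keys()` and `checkpoint.keys()` would be the inputs. PyTorch's `load_state_dict(..., strict=False)` also returns the missing and unexpected keys, which is another way to surface the same information.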

requirements.txt ADDED

	@@ -0,0 +1,14 @@

+torch>=2.0.0
+transformers>=4.42.3
+einops>=0.8.0
+numpy>=1.26.3
+opencv-python>=4.10.0
+pycocoevalcap>=1.2
+pycocotools>=2.0.8
+pillow>=10.4.0
+pyyaml>=6.0.2
+requests>=2.32.3
+safetensors>=0.4.4
+huggingface-hub>=0.24.6
+tqdm
+argparse
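
Note on `requirements.txt`: `argparse` has been part of the Python standard library since 3.2, so listing it pulls in an unneeded PyPI backport. A minimal sketch of flagging such entries, assuming Python 3.10+ for `sys.stdlib_module_names`. The helper names are hypothetical, and the check compares distribution names with module names, which do not always match (for example `opencv-python` installs `cv2`).

```python
import re
import sys


def requirement_names(requirements_text):
    """Extract bare package names from simple 'name>=version' lines."""
    names = []
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        # Cut at the first version specifier, extras bracket, or marker.
        names.append(re.split(r"[<>=!~\[; ]", line, maxsplit=1)[0])
    return names


def stdlib_entries(names):
    """Return the names that collide with standard-library modules."""
    stdlib = getattr(sys, "stdlib_module_names", frozenset())  # empty before 3.10
    return [name for name in names if name.replace("-", "_") in stdlib]


requirements_text = """
torch>=2.0.0
opencv-python>=4.10.0
tqdm
argparse
"""
names = requirement_names(requirements_text)
flagged = stdlib_entries(names)
```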

soccer_words_llama3.pkl ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:654f03e1d4678cd0c3e8ca587af027e4bc14489e94e90bd30ad856242dab2d94
+size 9092