Spaces: guard2PFE/DeepFakeDetector-demo

guard2PFE committed on Feb 17

Commit 1ac3d64 (verified) · 1 parent: e5e4b3d

Upload 22 files

Files changed (22)

.gitignore +7 -0
LICENSE +21 -0
README.md +265 -12
check_data.py +89 -0
engine.py +171 -0
find_tall_model.py +102 -0
infer_videos_txt.py +433 -0
keep_only_numbered.py +50 -0
main.py +462 -0
make_tall_txt.py +146 -0
make_tall_txt_count.py +65 -0
models.py +245 -0
renumber_frames_for_tall.py +53 -0
requirements-torch.txt +4 -0
requirements.txt +0 -0
test.py +333 -0
test_new.py +319 -0
utils.py +265 -0
video_dataset.py +1228 -0
video_dataset_aug.py +73 -0
video_dataset_config.py +31 -0
video_transforms.py +420 -0

.gitignore ADDED

	@@ -0,0 +1,7 @@

+lists/
+output/
+venv/
+*.csv
+*.json
+__pycache__/
+data/

LICENSE ADDED

	@@ -0,0 +1,21 @@

+MIT License
+Copyright (c) [year] [fullname]
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

README.md CHANGED

@@ -1,12 +1,265 @@
----
-title: DeepFakeDetector Demo
-emoji: 🐠
-colorFrom: green
-colorTo: purple
-sdk: gradio
-sdk_version: 6.5.1
-app_file: app.py
-pinned: false
----
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+# 🎯 TALL-SWIN Deepfake Detection – Video-Level Pipeline
+Transformer-based deepfake detection system with custom dataset preparation, evaluation metrics, and batch inference utilities.
+---
+# 📖 1. Project Overview
+This project implements a **video-level deepfake detection pipeline** based on the **TALL-SWIN Vision Transformer architecture**.
+It extends the original TALL/DeiT repository with:
+- Custom dataset preparation scripts
+- Frame cleaning and renaming utilities
+- Automatic train/test split generation
+- Advanced evaluation metrics (ROC, PR, confusion matrix)
+- Threshold-controlled classification
+- Batch inference from video lists
+- Reproducible environment setup
+Designed primarily for FaceForensics++ (FFPP)-style datasets.
+---
+# 🏗 2. Pipeline Architecture
+```
+Raw Videos
+    ↓
+Frame Extraction
+    ↓
+Frame Cleaning / Renaming
+    ↓
+Train/Test Split Generation (.txt)
+    ↓
+Training (TALL-SWIN)
+    ↓
+Video-Level Aggregation
+    ↓
+Evaluation (ROC / PR / Metrics)
+    ↓
+Batch Inference
+```
+---
+# 📂 3. Project Structure
+```
+.
+├── main.py                      # Training script (base repository)
+├── engine.py                    # Training loop
+├── models.py                    # DeiT models
+├── utils.py                     # Distributed + checkpoint utilities
+├── video_dataset.py             # Dataset loader
+├── video_dataset_aug.py         # Augmentations
+├── video_dataset_config.py      # Dataset config
+├── video_transforms.py          # Group transforms
+│
+├── eval_direct.py               # Custom evaluation with plots
+├── test_new.py                  # Threshold-controlled evaluation
+├── infer_videos_txt.py          # Batch video inference
+├── make_tall_txt.py             # Dataset split generator
+├── make_tall_txt_count.py       # Alternative split generator
+├── renumber_frames_for_tall.py  # Frame renaming tool
+├── keep_only_numbered.py        # Frame cleaning tool
+├── find_tall_model.py           # Debug utility
+│
+├── requirements.txt
+├── requirements-torch.txt
+└── README.md
+```
+---
+# ⚙ 4. Installation
+## 4.1 Clone Repository
+```bash
+git clone <your_repository_url>
+cd TALL4Deepfake
+```
+## 4.2 Create Virtual Environment
+**Windows (PowerShell)**
+```powershell
+python -m venv venv
+venv\Scripts\Activate.ps1
+```
+**Linux / macOS**
+```bash
+python -m venv venv
+source venv/bin/activate
+```
+## 4.3 Install PyTorch (CUDA Build)
+On Windows, the PyTorch wheels on the default PyPI index are CPU-only, so the CUDA build is installed from a separate requirements file:
+```bash
+pip install -r requirements-torch.txt
+```
+## 4.4 Install Remaining Dependencies
+```bash
+pip install -r requirements.txt
+```
+## 4.5 Sanity Check
+```bash
+python -c "import torch, cv2, numpy as np; print(torch.__version__); print(cv2.__version__); print(np.__version__)"
+```
+---
+# 📁 5. Dataset Preparation
+Expected structure:
+```
+data/
+ ├── real/
+ │    ├── video_001/
+ │    │     ├── 0001.jpg
+ │    │     ├── 0002.jpg
+ │    │     └── ...
+ └── fake/
+      ├── video_002/
+```
+## 5.1 Renumber Frames
+Ensures frame names follow `0001.jpg` format.
+```bash
+python renumber_frames_for_tall.py --root data --digits 4 --copy
+```
+## 5.2 Remove Non-numbered Frames
+```bash
+python keep_only_numbered.py --root data
+```
+Dry run (no deletion):
+```bash
+python keep_only_numbered.py --root data --dry_run
+```
+## 5.3 Generate Train/Test Split
+**Full Split Generator**
+```bash
+python make_tall_txt.py --root data --out lists --train_ratio 0.8
+```
+Outputs:
+- `lists/cdf_train_fold.txt`
+- `lists/cdf_test_fold.txt`
+**Alternative Split (Count-Based)**
+```bash
+python make_tall_txt_count.py --root data --out lists
+```
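The internals of `make_tall_txt.py` are not shown in this commit view, so the following is only a minimal sketch of a ratio-based, video-level split. The helper names `split_videos` and `to_list_line` are hypothetical. The line layout `<video_id> <start_idx> <num_frames> <label>` is taken from a comment in `infer_videos_txt.py`; the space separator is an assumption, since the real separator comes from the dataset config.

```python
import random


def split_videos(video_ids, train_ratio=0.8, seed=0):
    """Shuffle video ids reproducibly and split them into (train, test)."""
    ids = sorted(video_ids)            # sort first so the result does not depend on input order
    random.Random(seed).shuffle(ids)   # local RNG: does not touch the global random state
    n_train = int(len(ids) * train_ratio)
    return ids[:n_train], ids[n_train:]


def to_list_line(video_id, num_frames, label, sep=" "):
    """Format one list-file line: <video_id> <start_idx> <num_frames> <label>."""
    # start_idx is 1 because frames are numbered from 0001.jpg (assumption)
    return sep.join([video_id, "1", str(num_frames), str(label)])


videos = [f"real/video_{i:03d}" for i in range(10)]
train, test = split_videos(videos, train_ratio=0.8, seed=0)
print(len(train), len(test))
print(to_list_line(train[0], 120, 0))
```

Splitting at the video level, rather than the frame level, keeps frames from one video out of both sets at once.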
+---
+# 🚀 6. Training
+```bash
+python main.py --dataset ffpp --data_dir [data_dir] --data_txt_dir [data_txt_dir] --input-size 112 --num_clips 8 --output_dir [output_dir] --opt adamw --lr 1.5e-5 --warmup-lr 1.5e-8 --min-lr 1.5e-7 --epochs 10 --sched cosine --duration 4 --batch-size 2 --thumbnail_rows 2 --disable_scaleup --cutout True --pretrained --warmup-epochs 1 --no-amp --model TALL_SWIN --hpe_to_token --num_workers 0
+```
+---
+# 📊 7. Evaluation
+## 7.1 Custom Evaluation with Plots
+```bash
+python test_new.py --dataset ffpp --data_dir [your_data_dir] --data_txt_dir [your_data_dir_txt] --num_clips 8 --duration 4 --thumbnail_rows 2 --batch-size 1 --num_workers 0 --initial_checkpoint [your_.pth_dir] --output_dir [your_out_dir] --save_plots
+```
+---
+# 🎥 8. Batch Video Inference
+Create a file `videos.txt`:
+```
+C:/.../video1.mp4
+C:/.../video2.mp4
+```
+Run:
+```bash
+python infer_videos_txt.py --video_list videos.txt --initial_checkpoint [your_.pth_dir] --dataset ffpp --duration 4 --num_clips 8
+```
+Outputs:
+- `results.json`
+- `results.csv`
+---
+# 📈 9. Implemented Metrics
+- Accuracy
+- Balanced Accuracy
+- Precision
+- Recall
+- F1-score
+- ROC-AUC
+- PR-AUC
+- Confusion Matrix
+- Classification Report
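As a rough illustration of how the threshold-free list above relates to a confusion matrix, here is a small pure-Python sketch. It assumes class 1 means FAKE (as in `infer_videos_txt.py`); the function name `binary_metrics` is hypothetical and is not part of the repository, which may well rely on scikit-learn instead.

```python
def binary_metrics(y_true, y_pred):
    """Compute basic binary metrics, treating label 1 (FAKE) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def safe_div(a, b):
        # Return 0.0 instead of raising when a class is absent
        return a / b if b else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)        # true positive rate
    specificity = safe_div(tn, tn + fp)   # true negative rate
    return {
        "accuracy": safe_div(tp + tn, len(pairs)),
        "balanced_accuracy": (recall + specificity) / 2,
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        # Rows are true labels, columns are predictions: [[TN, FP], [FN, TP]]
        "confusion_matrix": [[tn, fp], [fn, tp]],
    }


print(binary_metrics([0, 0, 1, 1, 1], [0, 1, 1, 1, 0]))
```

Balanced accuracy is the mean of recall and specificity, which makes it less sensitive than plain accuracy to an uneven real/fake ratio.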
+---
+# 🧠 10. Video-Level Aggregation Strategy
+When multiple clips are extracted per video:
+```
+logits → softmax → mean over clips → threshold decision
+```
+If logits shape is `[B*K, 2]`, they are reshaped into:
+```
+[B, K, 2] → mean(dim=1)
+```
+Ensures video-level classification rather than frame-level.
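The aggregation can be sketched in plain Python as below. Note that the README describes softmax followed by a mean over clips, whereas `infer_videos_txt.py` in this commit averages the logits first and applies softmax afterwards; the two orders generally give different probabilities, so the sketch exposes both. The function names and the `softmax_first` flag are illustrative assumptions, not repository code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_video(clip_logits, threshold=0.5, softmax_first=True):
    """Return (p_fake, pred) for one video given K clip logits of shape [K, 2]."""
    k = len(clip_logits)
    if softmax_first:
        # Order described in the README: softmax per clip, then mean over clips
        p_fake = sum(softmax(row)[1] for row in clip_logits) / k
    else:
        # Order used by infer_videos_txt.py: mean of logits, then softmax
        mean_logits = [sum(row[c] for row in clip_logits) / k for c in range(2)]
        p_fake = softmax(mean_logits)[1]
    # Class 1 is FAKE; predict FAKE when its probability reaches the threshold
    return p_fake, int(p_fake >= threshold)


clips = [[2.0, -1.0], [0.5, 0.5], [-1.0, 3.0]]
print(aggregate_video(clips, softmax_first=True))
print(aggregate_video(clips, softmax_first=False))
```

Because the chosen order changes `p_fake`, a decision threshold tuned under one order should not be assumed to transfer to the other.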
+---
+# 🖥 11. Recommended Environment
+- Python 3.10
+- CUDA-compatible GPU
+- PyTorch 1.13.1 (cu117 build)
+- NumPy 2.x (if using OpenCV ≥4.13)
+---
+# 👨‍💻 Author
+**Vinícius Passos Castilho Pinto**
+Double-degree Engineering Student
+Industrial & Automation Systems
+France / Brazil

check_data.py ADDED

	@@ -0,0 +1,89 @@

+"""
+Diagnose data directory structure and list file format
+"""
+import os
+def check_data_structure():
+    data_dir = r"C:\Users\vinip\Desktop\TALL4Deepfake\data"
+    list_file = r"C:\Users\vinip\Desktop\TALL4Deepfake\lists\cdf_test_fold.txt"
+    print("=" * 60)
+    print("Data Structure Diagnostic")
+    print("=" * 60)
+    # Check data directory
+    print(f"\n1. Checking data directory: {data_dir}")
+    if os.path.exists(data_dir):
+        print("   ✓ Directory exists")
+        # List subdirectories
+        subdirs = [d for d in os.listdir(data_dir) if os.path.isdir(os.path.join(data_dir, d))]
+        print(f"\n   Subdirectories found: {subdirs}")
+        # Check a few subdirs
+        for subdir in subdirs[:3]:
+            subdir_path = os.path.join(data_dir, subdir)
+            video_folders = os.listdir(subdir_path)[:3]
+            print(f"\n   {subdir}/ contains: {video_folders}")
+            # Check first video folder
+            if video_folders:
+                first_video = os.path.join(subdir_path, video_folders[0])
+                if os.path.isdir(first_video):
+                    frames = os.listdir(first_video)[:5]
+                    print(f"     {video_folders[0]}/ contains: {frames}")
+    else:
+        print("   ✗ Directory not found!")
+        return
+    # Check list file
+    print(f"\n2. Checking list file: {list_file}")
+    if os.path.exists(list_file):
+        print("   ✓ File exists")
+        with open(list_file, 'r') as f:
+            lines = f.readlines()
+        print(f"   Total lines: {len(lines)}")
+        print("\n   First 5 lines:")
+        for i, line in enumerate(lines[:5], 1):
+            print(f"   {i}: {repr(line.strip())}")
+        # Analyze path format
+        print("\n3. Path format analysis:")
+        first_line = lines[0].strip() if lines else ""  # an empty list file would raise IndexError
+        parts = first_line.split()
+        if parts:
+            video_path = parts[0]
+            print(f"   Video path: {repr(video_path)}")
+            print(f"   Uses forward slashes: {'/' in video_path}")
+            print(f"   Uses backslashes: {chr(92) in video_path}")
+            # Try to construct full path
+            full_path = os.path.join(data_dir, video_path)
+            print(f"\n   Constructed path: {full_path}")
+            print(f"   Path exists: {os.path.exists(full_path)}")
+            # Try with normalization
+            normalized = os.path.normpath(os.path.join(data_dir, video_path))
+            print(f"\n   Normalized path: {normalized}")
+            print(f"   Path exists: {os.path.exists(normalized)}")
+            # Check if video folder exists
+            if not os.path.exists(normalized):
+                # Try to find similar paths
+                print("\n   ⚠ Path doesn't exist. Looking for similar paths...")
+                video_name = os.path.basename(video_path)
+                category = os.path.dirname(video_path).replace('/', os.sep).replace('\\', os.sep)
+                category_path = os.path.join(data_dir, category)
+                if os.path.exists(category_path):
+                    contents = os.listdir(category_path)
+                    print(f"\n   Directory '{category}' contains: {contents[:10]}")
+    else:
+        print("   ✗ File not found!")
+    print("\n" + "=" * 60)
+if __name__ == "__main__":
+    check_data_structure()
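The slash checks above matter because `os.path.normpath` only converts `/` to `\` on Windows; on POSIX systems a backslash is an ordinary filename character, so a list file written on Windows will not resolve on Linux. A minimal sketch of a separator-agnostic join follows. The helper `normalize_entry` is hypothetical, and the `pathmod` parameter exists only so the behaviour can be shown for both platforms.

```python
import ntpath
import os
import posixpath


def normalize_entry(root, rel, pathmod=os.path):
    """Join a list-file entry onto root, accepting either '/' or '\\' in rel."""
    # Treat both separators as equivalent, then rebuild with the target path flavour
    parts = [p for p in rel.replace("\\", "/").split("/") if p]
    return pathmod.normpath(pathmod.join(root, *parts))


# A Windows-style entry resolved with POSIX rules
print(normalize_entry("data", "real\\video_001", posixpath))
# A POSIX-style entry resolved with Windows rules
print(normalize_entry("C:\\data", "real/video_001", ntpath))
```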

engine.py ADDED

	@@ -0,0 +1,171 @@

+# Copyright (c) 2015-present, Facebook, Inc.
+# All rights reserved.
+#
+# This source code is licensed under the CC-by-NC license found in the
+# LICENSE file in the root directory of this source tree.
+#
+"""
+Train and eval functions used in main.py
+"""
+from typing import Iterable, Optional
+from einops import rearrange
+import torch
+import numpy
+from timm.data import Mixup
+from timm.utils import accuracy, ModelEma
+import utils
+from sklearn.metrics import roc_auc_score
+def train_one_epoch(model: torch.nn.Module, criterion: torch.nn.Module,
+                    data_loader: Iterable, num_cilps:int, optimizer: torch.optim.Optimizer,
+                    device: torch.device, epoch: int, loss_scaler, max_norm: float = 0,
+                    model_ema: Optional[ModelEma] = None, mixup_fn: Optional[Mixup] = None,
+                    world_size: int = 1, distributed: bool = True, amp=True,
+                    contrastive_nomixup=False, hard_contrastive=False,
+                    finetune=False
+                    ):
+    # TODO fix this for finetuning
+    if finetune:
+        model.train(not finetune)
+    else:
+        model.train()
+    #criterion.train()
+    metric_logger = utils.MetricLogger(delimiter="  ")
+    metric_logger.add_meter('lr', utils.SmoothedValue(window_size=1, fmt='{value:.8f}'))
+    header = 'Epoch: [{}]'.format(epoch)
+    print_freq = 50
+    for samples, targets in metric_logger.log_every(data_loader, print_freq, header):
+        batch_size = targets.size(0)
+        samples = samples.to(device, non_blocking=True)
+        targets = targets.to(device, non_blocking=True)
+        if mixup_fn is not None:
+            # batch size has to be an even number
+            if batch_size == 1:
+                continue
+            if batch_size % 2 != 0:
+                samples, targets = samples[:-1], targets[:-1]
+                batch_size -= 1  # keep in sync with the truncated batch for the reshape below
+            samples, targets = mixup_fn(samples, targets)
+        with torch.cuda.amp.autocast(enabled=amp):
+            outputs = model(samples)
+            outputs = outputs.reshape(batch_size, num_cilps, -1).mean(dim=1)
+            loss = criterion(outputs, targets)
+        loss_value = loss.item()
+        optimizer.zero_grad()
+        # this attribute is added by timm on one optimizer (adahessian)
+        is_second_order = hasattr(optimizer, 'is_second_order') and optimizer.is_second_order
+        if amp:
+            loss_scaler(loss, optimizer, clip_grad=max_norm,
+                        parameters=model.parameters(), create_graph=is_second_order)
+        else:
+            loss.backward(create_graph=is_second_order)
+            if max_norm is not None and max_norm != 0.0:
+                torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
+            optimizer.step()
+        torch.cuda.synchronize()
+        if model_ema is not None:
+            model_ema.update(model)
+        metric_logger.update(loss=loss_value)
+        metric_logger.update(lr=optimizer.param_groups[0]["lr"])
+    # gather the stats from all processes
+    metric_logger.synchronize_between_processes()
+    print("Averaged stats:", metric_logger)
+    return {k: meter.global_avg for k, meter in metric_logger.meters.items()}
+@torch.no_grad()
+def evaluate(data_loader, model, device, world_size, distributed=True, amp=False, num_crops=1, num_clips=1):
+    criterion = torch.nn.CrossEntropyLoss()
+    to_np = lambda x: x.data.cpu().numpy()
+    metric_logger = utils.MetricLogger(delimiter="  ")
+    header = 'Test:'
+    # switch to evaluation mode
+    model.eval()
+    outputs = []
+    targets = []
+    logits = []
+    binary_label = []
+    for images, target in metric_logger.log_every(data_loader, 10, header):
+        images = images.to(device, non_blocking=True)
+        target = target.to(device, non_blocking=True)
+        # compute output
+        batch_size = images.shape[0]
+        with torch.cuda.amp.autocast(enabled=amp):
+            output = model(images)
+        output = output.reshape(batch_size, num_crops * num_clips, -1).mean(dim=1)
+        output_np = to_np(output[:,1])
+        if distributed:
+            # Gather once per tensor and reuse the result for both the loss and the AUC buffers
+            output_ = concat_all_gather(output)
+            target_ = concat_all_gather(target)
+            outputs.append(output_)
+            targets.append(target_)
+            logits.append(to_np(output_[:,1]))
+            binary_label.append(target_.detach().cpu())
+        else:
+            outputs.append(output)
+            targets.append(target)
+            logits.append(output_np)
+            binary_label.append(target.detach().cpu())
+        batch_size = images.shape[0]
+        acc1 = accuracy(output, target, topk=(1,))[0]
+        metric_logger.meters['acc1'].update(acc1.item(), images.size(0))
+    # import pdb;pdb.set_trace()
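+    # concatenate (not stack) so a smaller final batch does not raise a shape error
+    acc_outputs = numpy.concatenate(logits, 0).reshape(-1,1)
+    acc_label = numpy.concatenate([t.numpy() for t in binary_label], 0).reshape(-1,1)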
+    outputs = torch.cat(outputs, dim=0)
+    targets = torch.cat(targets, dim=0)
+    auc_score = roc_auc_score(acc_label, acc_outputs)
+    real_loss = criterion(outputs, targets)
+    metric_logger.update(loss=real_loss.item())
+    print('* Acc@1 {top1.global_avg:.3f} AUC {auc} loss {losses.global_avg:.3f}'
+          .format(top1=metric_logger.acc1,auc=auc_score,losses=metric_logger.loss))
+    return {k: meter.global_avg for k, meter in metric_logger.meters.items()}
+@torch.no_grad()
+def concat_all_gather(tensor):
+    """
+    Performs all_gather operation on the provided tensors.
+    *** Warning ***: torch.distributed.all_gather has no gradient.
+    """
+    tensors_gather = [torch.ones_like(tensor)
+        for _ in range(torch.distributed.get_world_size())]
+    torch.distributed.all_gather(tensors_gather, tensor.contiguous(), async_op=False)
+    #output = torch.cat(tensors_gather, dim=0)
+    if tensor.dim() == 1:
+        output = rearrange(tensors_gather, 'n b -> (b n)')
+    else:
+        output = rearrange(tensors_gather, 'n b c -> (b n) c')
+    return output
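The `'n b -> (b n)'` pattern in `concat_all_gather` interleaves the per-rank batches instead of concatenating them rank by rank. A pure-Python analogue, assuming a strided and unshuffled sharding of the kind `DistributedSampler` uses (rank `r` receives indices `r, r + world_size, ...`), shows that this restores the original sample order. The helper name `interleave_gathered` is hypothetical.

```python
def interleave_gathered(per_rank):
    """List analogue of rearrange(tensors, 'n b -> (b n)') for n ranks of b items."""
    n = len(per_rank)
    b = len(per_rank[0])
    # Output position i * n + r holds item i of rank r
    return [per_rank[r][i] for i in range(b) for r in range(n)]


dataset = list(range(8))
world_size = 2
# Strided sharding: rank 0 sees [0, 2, 4, 6], rank 1 sees [1, 3, 5, 7]
shards = [dataset[r::world_size] for r in range(world_size)]
# First batch (size 2) on each rank
batch = [shard[:2] for shard in shards]
print(interleave_gathered(batch))
```

A plain `torch.cat(tensors_gather, dim=0)`, left commented out in the function, would instead yield the rank-major order.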

find_tall_model.py ADDED

	@@ -0,0 +1,102 @@

+"""
+Find the TALL_SWIN model definition in the codebase
+"""
+import os
+import re
+def search_for_tall_model(root_dir):
+    """Search for TALL_SWIN class definition"""
+    print("=" * 60)
+    print("Searching for TALL_SWIN model definition...")
+    print("=" * 60)
+    patterns = [
+        r'class\s+TALL_SWIN',
+        r'class\s+TallSwin',
+        r'class\s+TALLSwin',
+        r'def\s+TALL_SWIN',
+        r'@register_model.*TALL',
+    ]
+    found_files = []
+    # Search in models directory and subdirectories
+    for dirpath, dirnames, filenames in os.walk(root_dir):
+        # Skip venv and common ignore dirs
+        dirnames[:] = [d for d in dirnames if d not in ['venv', '.git', '__pycache__', 'node_modules']]
+        for filename in filenames:
+            if filename.endswith('.py'):
+                filepath = os.path.join(dirpath, filename)
+                try:
+                    with open(filepath, 'r', encoding='utf-8', errors='ignore') as f:
+                        content = f.read()
+                    for pattern in patterns:
+                        matches = re.findall(pattern, content, re.IGNORECASE)
+                        if matches:
+                            found_files.append(filepath)
+                            print(f"\n✓ Found in: {filepath}")
+                            print(f"  Pattern matched: {pattern}")
+                            # Show context
+                            lines = content.split('\n')
+                            for i, line in enumerate(lines):
+                                if re.search(pattern, line, re.IGNORECASE):
+                                    start = max(0, i-2)
+                                    end = min(len(lines), i+10)
+                                    print(f"\n  Context (lines {start+1}-{end}):")
+                                    print("  " + "-"*50)
+                                    for j in range(start, end):
+                                        marker = ">>>" if j == i else "   "
+                                        print(f"  {marker} {j+1:4d}: {lines[j]}")
+                                    print("  " + "-"*50)
+                            break
+                except Exception as e:
+                    print(f"  (skipped {filepath}: {e})")
+    if not found_files:
+        print("\n✗ No TALL_SWIN model definition found!")
+        print("\nSearching for any Swin-related files...")
+        # Broader search
+        for dirpath, dirnames, filenames in os.walk(root_dir):
+            dirnames[:] = [d for d in dirnames if d not in ['venv', '.git', '__pycache__']]
+            for filename in filenames:
+                if 'swin' in filename.lower() or 'tall' in filename.lower():
+                    print(f"  Found file: {os.path.join(dirpath, filename)}")
+    return found_files
+if __name__ == "__main__":
+    import sys
+    # Get the repo directory
+    if len(sys.argv) > 1:
+        repo_dir = sys.argv[1]
+    else:
+        repo_dir = os.getcwd()
+    print(f"Searching in: {repo_dir}\n")
+    found = search_for_tall_model(repo_dir)
+    print("\n" + "=" * 60)
+    if found:
+        print(f"Found {len(found)} file(s) with TALL_SWIN definition")
+        print("\nNext steps:")
+        print("1. Check the file(s) above for the TALL_SWIN class")
+        print("2. Ensure this file is imported in models/__init__.py")
+        print("3. Example fix for models/__init__.py:")
+        print("\n   from .tall_swin import TALL_SWIN")
+        print("   # or")
+        print("   from .tall_swin import *")
+    else:
+        print("No TALL_SWIN definition found!")
+        print("\nPossible issues:")
+        print("1. The model file is in a different location")
+        print("2. The model has a different class name")
+        print("3. The model code is missing from the repository")
+    print("=" * 60)

infer_videos_txt.py ADDED

	@@ -0,0 +1,433 @@

+# infer_videos_txt.py
+# -------------------------------------------
+# Inference on a list of videos (one path per line in a .txt).
+# For each video, frames are extracted to a temporary folder and then fed to the model
+# using the same VideoDataSet pipeline as training/evaluation.
+#
+# PowerShell usage:
+#   python infer_videos_txt.py `
+#     --video_list "C:\path\videos.txt" `
+#     --initial_checkpoint "C:\path\model_best.pth" `
+#     --dataset ffpp `
+#     --threshold 0.7 `
+#     --num_clips 8 --duration 4 --thumbnail_rows 2 `
+#     --num_workers 0 `
+#     --output_json "C:\path\results.json" `
+#     --output_csv "C:\path\results.csv"
+# -------------------------------------------
+import os
+import csv
+import json
+import argparse
+import shutil
+import tempfile
+from typing import List, Dict, Any
+import cv2
+import numpy as np
+import torch
+import torch.backends.cudnn as cudnn
+from timm.models import create_model
+import my_models  # registers TALL_SWIN
+import utils
+from video_dataset import VideoDataSet
+from video_dataset_aug import get_augmentor, build_dataflow
+from video_dataset_config import get_dataset_config
+# -----------------------------
+# Frame extraction (OpenCV)
+# -----------------------------
+def extract_frames_opencv(video_path: str, out_dir: str, digits: int = 5, max_frames: int = 0) -> int:
+    """Extract frames from a video to out_dir as 00001.jpg, 00002.jpg, ... (zero-padded to `digits`).
+    Stops after `max_frames` frames when max_frames > 0; returns the number of frames extracted.
+    """
+    cap = cv2.VideoCapture(video_path)
+    if not cap.isOpened():
+        raise RuntimeError(f"Could not open video: {video_path}")
+    idx = 1
+    fmt = f"{{:0{digits}d}}.jpg"
+    while True:
+        ok, frame = cap.read()
+        if not ok:
+            break
+        out_file = os.path.join(out_dir, fmt.format(idx))
+        cv2.imwrite(out_file, frame)
+        idx += 1
+        if max_frames > 0 and (idx - 1) >= max_frames:
+            break
+    cap.release()
+    return idx - 1
+def detect_digits_for_tmpl(image_tmpl: str) -> int:
+    """Infer digits from an image template like '{:05d}.jpg' -> 5."""
+    import re
+    m = re.search(r"\{:\s*0?(\d+)d\}", image_tmpl)
+    if m:
+        return int(m.group(1))
+    return 5
+# -----------------------------
+# IO helpers
+# -----------------------------
+def read_video_list_txt(path: str) -> List[str]:
+    """Read a text file with one video path per line; returns unique absolute paths."""
+    videos: List[str] = []
+    with open(path, "r", encoding="utf-8") as f:
+        for line in f:
+            p = line.strip().strip('"').strip("'")
+            if not p:
+                continue
+            videos.append(os.path.abspath(p))
+    # De-duplicate while keeping order
+    seen = set()
+    uniq: List[str] = []
+    for p in videos:
+        if p not in seen:
+            uniq.append(p)
+            seen.add(p)
+    return uniq
+def write_csv(path: str, rows: List[Dict[str, Any]]) -> None:
+    """Write per-video results to CSV."""
+    if os.path.dirname(path):
+        os.makedirs(os.path.dirname(path), exist_ok=True)
+    fieldnames = [
+        "video",
+        "pred_name",
+        "pred",
+        "threshold",
+        "p_real",
+        "p_fake",
+        "n_frames",
+        "n_preds",
+        "status",
+        "error",
+    ]
+    with open(path, "w", newline="", encoding="utf-8") as f:
+        w = csv.DictWriter(f, fieldnames=fieldnames)
+        w.writeheader()
+        for r in rows:
+            w.writerow({k: r.get(k, "") for k in fieldnames})
+def write_json(path: str, payload: Dict[str, Any]) -> None:
+    """Write results summary to JSON."""
+    if os.path.dirname(path):
+        os.makedirs(os.path.dirname(path), exist_ok=True)
+    with open(path, "w", encoding="utf-8") as f:
+        json.dump(payload, f, indent=2, ensure_ascii=False)
+# -----------------------------
+# Temporary dataset builder
+# -----------------------------
+def build_tmp_dataset_from_video(video_path: str, tmp_dir: str, image_tmpl: str) -> Dict[str, Any]:
+    """Create a temporary folder with extracted frames and a one-line list file compatible with VideoDataSet."""
+    video_id = "video"
+    video_folder = os.path.join(tmp_dir, video_id)
+    os.makedirs(video_folder, exist_ok=True)
+    digits = detect_digits_for_tmpl(image_tmpl)
+    n = extract_frames_opencv(video_path, video_folder, digits=digits, max_frames=0)
+    if n < 4:
+        raise RuntimeError(f"Video has too few frames ({n}). Need >= 4.")
+    # VideoDataSet expects a list file with: <video_id> <start_idx> <num_frames> <label>
+    list_file_abs = os.path.join(tmp_dir, "one.txt")
+    with open(list_file_abs, "w", encoding="utf-8") as f:
+        f.write(f"{video_id} 1 {n} 0\n")  # dummy label (0), unused during inference
+    return {"list_rel": "one.txt", "nframes": n, "video_id": video_id}
+# -----------------------------
+# Model + augmentor builder
+# -----------------------------
+def build_model_and_augmentor(args):
+    """Build model, load checkpoint, and create the same evaluation augmentor."""
+    utils.init_distributed_mode(args)  # will set args.distributed, etc.
+    device = torch.device(args.device)
+    cudnn.benchmark = True
+    # Get dataset config for: num_classes, separator, image_tmpl, filter_video, etc.
+    num_classes, _, _, _, filename_seperator, image_tmpl, filter_video, _ = get_dataset_config(
+        args.dataset, args.use_lmdb
+    )
+    args.num_classes = num_classes
+    print(f"Creating model: {args.model}")
+    model = create_model(
+        args.model,
+        pretrained=False,
+        duration=args.duration,
+        hpe_to_token=args.hpe_to_token,
+        rel_pos=args.rel_pos,
+        window_size=args.window_size,
+        thumbnail_rows=args.thumbnail_rows,
+        token_mask=not args.no_token_mask,
+        online_learning=False,
+        num_classes=args.num_classes,
+        drop_rate=args.drop,
+        drop_path_rate=args.drop_path,
+        drop_block_rate=args.drop_block,
+        use_checkpoint=args.use_checkpoint,
+    ).to(device)
+    model.eval()
+    ckpt = torch.load(args.initial_checkpoint, map_location="cpu")
+    if isinstance(ckpt, dict) and "model" in ckpt:
+        utils.load_checkpoint(model, ckpt["model"])
+    else:
+        # If the checkpoint is a raw state_dict
+        model.load_state_dict(ckpt, strict=False)
+    mean = (0.5, 0.5, 0.5) if "mean" not in model.default_cfg else model.default_cfg["mean"]
+    std = (0.5, 0.5, 0.5) if "std" not in model.default_cfg else model.default_cfg["std"]
+    augmentor = get_augmentor(
+        False,                   # is_train
+        args.input_size,         # input_size
+        mean,
+        std,
+        args.disable_scaleup,
+        threed_data=args.threed_data,
+        version=args.augmentor_ver,
+        scale_range=args.scale_range,
+        num_clips=args.num_clips,
+        num_crops=args.num_crops,
+        dataset=args.dataset
+    )
+    meta = {
+        "device": device,
+        "model": model,
+        "augmentor": augmentor,
+        "num_classes": num_classes,
+        "filename_seperator": filename_seperator,
+        "image_tmpl": image_tmpl,
+        "filter_video": filter_video,
+    }
+    return meta
+# -----------------------------
+# Inference
+# -----------------------------
+@torch.no_grad()
+def infer_one_video_from_tmp(args, meta: Dict[str, Any], tmp_root: str, list_rel_path: str, image_tmpl: str) -> Dict[str, Any]:
+    """Run inference on a single temporary dataset with 1 video."""
+    device = meta["device"]
+    model = meta["model"]
+    augmentor = meta["augmentor"]
+    dataset = VideoDataSet(
+        root_path=tmp_root,
+        list_file=list_rel_path,          # relative to root_path for VideoDataSet
+        num_groups=args.duration,
+        frames_per_group=args.frames_per_group,
+        sample_offset=0,
+        num_clips=args.num_clips,
+        modality=args.modality,
+        dense_sampling=args.dense_sampling,
+        fixed_offset=True,
+        image_tmpl=image_tmpl,            # enforce correct template (e.g., {:05d}.jpg)
+        transform=augmentor,
+        is_train=False,
+        test_mode=False,
+        seperator=meta["filename_seperator"],
+        filter_video=meta["filter_video"],
+        num_classes=meta["num_classes"],
+        whole_video=False,
+    )
+    loader = build_dataflow(
+        dataset, is_train=False, batch_size=1,
+        workers=args.num_workers, is_distributed=False
+    )
+    logits_all = []
+    for samples, _targets in loader:
+        samples = samples.to(device, non_blocking=True)
+        logits = model(samples)  # shape [1,2] typically (or [K,2] depending on pipeline)
+        logits_all.append(logits.detach().cpu())
+    logits_all = torch.cat(logits_all, dim=0)           # [n_preds, 2]
+    logits_mean = logits_all.mean(dim=0, keepdim=True)  # [1,2]
+    probs = torch.softmax(logits_mean, dim=1).numpy()[0]
+    # Threshold-based decision (class 1 = FAKE)
+    thr = float(args.threshold)
+    pred = int(probs[1] >= thr)
+    return {
+        "threshold": thr,
+        "p_real": float(probs[0]),
+        "p_fake": float(probs[1]),
+        "pred": pred,
+        "pred_name": "FAKE" if pred == 1 else "REAL",
+        "n_preds": int(logits_all.shape[0]),
+    }
+# -----------------------------
+# CLI
+# -----------------------------
+def get_args():
+    ap = argparse.ArgumentParser("Infer TALL_SWIN from a list of videos (txt)")
+    ap.add_argument("--video_list", required=True, help="Text file with one video path per line")
+    ap.add_argument("--initial_checkpoint", required=True)
+    ap.add_argument("--dataset", default="ffpp")
+    ap.add_argument("--model", default="TALL_SWIN")
+    ap.add_argument("--device", default="cuda")
+    ap.add_argument("--num_workers", type=int, default=0)
+    ap.add_argument("--duration", type=int, default=4)
+    ap.add_argument("--frames_per_group", type=int, default=1)
+    ap.add_argument("--num_clips", type=int, default=8)
+    ap.add_argument("--num_crops", type=int, default=1)
+    ap.add_argument("--thumbnail_rows", type=int, default=2)
+    ap.add_argument("--input_size", type=int, default=224)
+    ap.add_argument("--threshold", type=float, default=0.5,
+                    help="Decision threshold for FAKE (pred=1 if p_fake >= threshold)")
+    ap.add_argument("--disable_scaleup", action="store_true")
+    ap.add_argument("--threed_data", default=False)
+    ap.add_argument("--dense_sampling", default=True)
+    ap.add_argument("--augmentor_ver", default="v1")
+    ap.add_argument("--scale_range", default=[256, 320], type=int, nargs="+")
+    ap.add_argument("--modality", default="rgb")
+    ap.add_argument("--use_lmdb", default=False)
+    ap.add_argument("--hpe_to_token", action="store_true")
+    ap.add_argument("--rel_pos", action="store_true")
+    ap.add_argument("--window_size", type=int, default=7)
+    ap.add_argument("--no_token_mask", action="store_true")
+    ap.add_argument("--drop", type=float, default=0.0)
+    ap.add_argument("--drop_path", type=float, default=0.1)
+    ap.add_argument("--drop_block", default=None)
+    ap.add_argument("--use_checkpoint", default=False)
+    ap.add_argument("--dist_url", default="env://")
+    ap.add_argument("--world_size", default=1, type=int)
+    ap.add_argument("--local_rank", default=None, type=int)
+    ap.add_argument("--output_json", default="", help="Optional path to save results JSON")
+    ap.add_argument("--output_csv", default="", help="Optional path to save results CSV")
+    return ap.parse_args()
+# -----------------------------
+# Main
+# -----------------------------
+def main():
+    args = get_args()
+    if not os.path.isfile(args.video_list):
+        raise FileNotFoundError(args.video_list)
+    if not os.path.isfile(args.initial_checkpoint):
+        raise FileNotFoundError(args.initial_checkpoint)
+    videos = read_video_list_txt(args.video_list)
+    if len(videos) == 0:
+        raise RuntimeError("video_list is empty.")
+    # Build model + augmentor once
+    meta = build_model_and_augmentor(args)
+    results_rows: List[Dict[str, Any]] = []
+    ok_count = 0
+    print(f"\nVideos to process: {len(videos)}")
+    print(f"Checkpoint: {args.initial_checkpoint}")
+    print(f"Threshold: {args.threshold}\n")
+    for i, video_path in enumerate(videos, 1):
+        row: Dict[str, Any] = {
+            "video": video_path,
+            "status": "ok",
+            "error": "",
+            "pred": "",
+            "pred_name": "",
+            "threshold": float(args.threshold),
+            "p_real": "",
+            "p_fake": "",
+            "n_frames": "",
+            "n_preds": "",
+        }
+        if not os.path.isfile(video_path):
+            row["status"] = "skip"
+            row["error"] = "file_not_found"
+            results_rows.append(row)
+            print(f"[{i}/{len(videos)}] [SKIP] Not found: {video_path}")
+            continue
+        tmp_dir = tempfile.mkdtemp(prefix="tall_infer_")
+        try:
+            tmp_info = build_tmp_dataset_from_video(video_path, tmp_dir, image_tmpl=meta["image_tmpl"])
+            row["n_frames"] = int(tmp_info["nframes"])
+            out = infer_one_video_from_tmp(
+                args, meta, tmp_dir, tmp_info["list_rel"], image_tmpl=meta["image_tmpl"]
+            )
+            row.update(out)
+            ok_count += 1
+            print(f"[{i}/{len(videos)}] [{row['pred_name']}] {video_path} | p_fake={float(row['p_fake']):.4f}")
+        except Exception as e:
+            row["status"] = "error"
+            row["error"] = str(e)
+            print(f"[{i}/{len(videos)}] [ERROR] {video_path}\n  -> {e}")
+        finally:
+            shutil.rmtree(tmp_dir, ignore_errors=True)
+        results_rows.append(row)
+    summary = {
+        "video_list": os.path.abspath(args.video_list),
+        "checkpoint": os.path.abspath(args.initial_checkpoint),
+        "dataset": args.dataset,
+        "model": args.model,
+        "threshold": float(args.threshold),
+        "num_videos": len(videos),
+        "num_ok": ok_count,
+        "results": results_rows,
+    }
+    print("\n=== SUMMARY ===")
+    print(f"ok: {ok_count}/{len(videos)}")
+    if ok_count > 0:
+        pf = [float(r["p_fake"]) for r in results_rows if r.get("status") == "ok"]
+        print(f"avg p_fake (ok only): {sum(pf)/len(pf):.4f}")
+    if args.output_json:
+        write_json(args.output_json, summary)
+        print(f"Saved JSON: {os.path.abspath(args.output_json)}")
+    if args.output_csv:
+        write_csv(args.output_csv, results_rows)
+        print(f"Saved CSV:  {os.path.abspath(args.output_csv)}")
+if __name__ == "__main__":
+    main()
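
Review note on the decision step at the end of `infer_one_video_from_tmp` above: the script averages the per-clip logits first, applies softmax to the mean, and then compares `p_fake` with `--threshold`. The sketch below restates that order of operations in plain Python so it can be checked without PyTorch. The function name `decide_from_clip_logits` is hypothetical and does not exist in the repository; class index 1 is assumed to be FAKE, as the script's own comment states.

```python
import math
from typing import Dict, Sequence, Union


def decide_from_clip_logits(
    clip_logits: Sequence[Sequence[float]], threshold: float = 0.5
) -> Dict[str, Union[int, float, str]]:
    """Average two-class logits over clips, softmax the mean, then threshold p_fake."""
    if not clip_logits:
        raise ValueError("need at least one clip prediction")
    n = len(clip_logits)
    # Mean logit per class over all clip/crop predictions (index 0 = REAL, 1 = FAKE).
    mean_real = sum(row[0] for row in clip_logits) / n
    mean_fake = sum(row[1] for row in clip_logits) / n
    # Numerically stable two-class softmax of the mean logits.
    m = max(mean_real, mean_fake)
    e_real = math.exp(mean_real - m)
    e_fake = math.exp(mean_fake - m)
    p_fake = e_fake / (e_real + e_fake)
    pred = int(p_fake >= threshold)
    return {
        "threshold": float(threshold),
        "p_real": 1.0 - p_fake,
        "p_fake": p_fake,
        "pred": pred,
        "pred_name": "FAKE" if pred == 1 else "REAL",
        "n_preds": n,
    }


if __name__ == "__main__":
    print(decide_from_clip_logits([[2.0, 0.0], [1.0, 1.0]], threshold=0.5))
```

Averaging logits before the softmax is not the same as averaging per-clip probabilities; the two can disagree near the threshold, so the choice is worth keeping in mind when tuning `--threshold`.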

keep_only_numbered.py ADDED

	@@ -0,0 +1,50 @@

+import argparse, re
+from pathlib import Path
+IMG_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}
+NUM_RE = re.compile(r"^\d+$")  # stem must be only digits
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--root", required=True)
+    ap.add_argument("--dry_run", action="store_true", help="Only print what would be deleted")
+    args = ap.parse_args()
+    root = Path(args.root)
+    to_delete = []
+    for cls in ["real", "fake"]:
+        cls_dir = root / cls
+        if not cls_dir.exists():
+            continue
+        for vid_dir in cls_dir.iterdir():
+            if not vid_dir.is_dir():
+                continue
+            for p in vid_dir.iterdir():
+                if not p.is_file():
+                    continue
+                if p.suffix.lower() not in IMG_EXTS:
+                    continue
+                if not NUM_RE.match(p.stem):
+                    to_delete.append(p)
+    print(f"Found {len(to_delete)} non-numbered frames to delete.")
+    for p in to_delete[:30]:
+        print("DEL", p)
+    if len(to_delete) > 30:
+        print("...")
+    if args.dry_run:
+        print("Dry-run only. No files deleted.")
+        return
+    for p in to_delete:
+        try:
+            p.unlink()
+        except Exception as e:
+            print("FAILED", p, e)
+    print("Done.")
+if __name__ == "__main__":
+    main()
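
Review note on the filter in `keep_only_numbered.py` above: a file is marked for deletion only if it has an image extension and its stem is not made up purely of digits. The sketch below isolates that rule for a single video folder and tries it on a throwaway directory. `find_non_numbered` is a hypothetical helper name; the constants are copied from the script.

```python
import re
import tempfile
from pathlib import Path
from typing import List

IMG_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}
NUM_RE = re.compile(r"^\d+$")  # stem must be only digits


def find_non_numbered(video_dir: Path) -> List[Path]:
    """Return image files in video_dir whose stem is not purely numeric."""
    hits = []
    for p in sorted(video_dir.iterdir()):
        # Non-image files are ignored, exactly as in the script's loop.
        if p.is_file() and p.suffix.lower() in IMG_EXTS and not NUM_RE.match(p.stem):
            hits.append(p)
    return hits


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        d = Path(tmp)
        for name in ["0001.jpg", "frame_0002.jpg", "12.PNG", "notes.txt"]:
            (d / name).write_bytes(b"")
        for p in find_non_numbered(d):
            print("DEL", p.name)
```

Running the real script with `--dry_run` first gives the same kind of preview on actual data before anything is unlinked.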

main.py ADDED

	@@ -0,0 +1,462 @@

+import argparse
+import datetime
+import numpy as np
+import time
+import torch
+import torch.backends.cudnn as cudnn
+import json
+import os
+import warnings
+from pathlib import Path
+from timm.data import Mixup
+from timm.models import create_model
+from timm.loss import LabelSmoothingCrossEntropy, SoftTargetCrossEntropy
+from timm.scheduler import create_scheduler
+from timm.optim import create_optimizer
+from timm.utils import NativeScaler, get_state_dict, ModelEma
+#from datasets import build_dataset
+from engine import train_one_epoch, evaluate
+import models
+import my_models
+import torch.nn as nn
+import utils
+from video_dataset import VideoDataSet
+from video_dataset_aug import get_augmentor, build_dataflow
+from video_dataset_config import get_dataset_config, DATASET_CONFIG
+warnings.filterwarnings("ignore", category=UserWarning)
+def get_args_parser():
+    parser = argparse.ArgumentParser('DeiT training and evaluation script', add_help=False)
+    parser.add_argument('--model_name',default="TALL_SWIN")
+    parser.add_argument('--batch-size', default=2, type=int)
+    parser.add_argument('--epochs', default=30, type=int)
+    # Dataset parameters
+    parser.add_argument('--data_txt_dir', type=str,default='##path_for_dataset_txt##', help='path to text of dataset')
+    parser.add_argument('--data_dir', type=str,default="##path_for_dataset##", help='path to dataset')
+    parser.add_argument('--dataset', default='ffpp_train',
+                        choices=list(DATASET_CONFIG.keys()), help='path to dataset file list')
+    parser.add_argument('--duration', default=8, type=int, help='number of frames')
+    parser.add_argument('--frames_per_group', default=1, type=int,
+                        help='[uniform sampling] number of frames per group; '
+                             '[dense sampling]: sampling frequency')
+    parser.add_argument('--threed_data', default=False, help='load data in the layout for 3D conv')
+    parser.add_argument('--input_size', default=224, type=int, metavar='N', help='input image size')
+    parser.add_argument('--disable_scaleup', action='store_true',
+                        help='do not scale up and then crop a small region, directly crop the input_size')
+    parser.add_argument('--random_sampling', action='store_true',
+                        help='perform deterministic sampling for data loader')
+    parser.add_argument('--dense_sampling', default=True,
+                        help='perform dense sampling for data loader')
+    parser.add_argument('--augmentor_ver', default='v1', type=str, choices=['v1', 'v2'],
+                        help='[v1] TSN data augmentation, [v2] resize the shorter side to `scale_range`')
+    parser.add_argument('--scale_range', default=[256, 320], type=int, nargs="+",
+                        metavar='scale_range', help='scale range for augmentor v2')
+    parser.add_argument('--modality', default='rgb', type=str, help='rgb or flow')
+    parser.add_argument('--use_lmdb', default=False, help='use lmdb instead of jpeg.')
+    parser.add_argument('--use_pyav', default=False, help='use video directly.')
+    # temporal module
+    parser.add_argument('--pretrained', action='store_true', default=False,
+                    help='Start with pretrained version of specified network (if avail)')
+    parser.add_argument('--temporal_module_name', default=None, type=str, metavar='TEM', choices=['ResNet3d', 'TAM', 'TTAM', 'TSM', 'TTSM', 'MSA'],
+                        help='temporal module applied. [TAM]')
+    parser.add_argument('--temporal_attention_only', action='store_true', default=False,
+                        help='use attention only in temporal module')
+    parser.add_argument('--no_token_mask', action='store_true', default=False, help='do not apply token mask')
+    parser.add_argument('--temporal_heads_scale', default=1.0, type=float, help='scale of the number of spatial heads')
+    parser.add_argument('--temporal_mlp_scale', default=1.0, type=float, help='scale of spatial mlp')
+    parser.add_argument('--rel_pos', action='store_true', default=False,
+                        help='use relative positioning in temporal module')
+    parser.add_argument('--temporal_pooling', type=str, default=None, choices=['avg', 'max', 'conv', 'depthconv'],
+                        help='perform temporal pooling')
+    parser.add_argument('--bottleneck', default=None, choices=['regular', 'dw'],
+                        help='use depth-wise bottleneck in temporal attention')
+    parser.add_argument('--window_size', default=14, type=int, help='attention window size passed to the model')
+    parser.add_argument('--thumbnail_rows', default=4, type=int, help='number of frames per row')
+    parser.add_argument('--hpe_to_token', default=False, action='store_true',
+                        help='add hub position embedding to image tokens')
+    # Model parameters
+    parser.add_argument('--model', default='TALL_SWIN', type=str, metavar='MODEL',
+                        help='Name of model to train')
+    parser.add_argument('--input-size', default=224, type=int, help='images input size')
+    parser.add_argument('--drop', type=float, default=0.0, metavar='PCT',
+                        help='Dropout rate (default: 0.)')
+    parser.add_argument('--drop-path', type=float, default=0.1, metavar='PCT',
+                        help='Drop path rate (default: 0.1)')
+    parser.add_argument('--drop-block', type=float, default=None, metavar='PCT',
+                        help='Drop block rate (default: None)')
+    parser.add_argument('--model-ema', action='store_true')
+    parser.add_argument('--no-model-ema', action='store_false', dest='model_ema')
+    parser.set_defaults(model_ema=True)
+    parser.add_argument('--model-ema-decay', type=float, default=0.99996, help='')
+    parser.add_argument('--model-ema-force-cpu', action='store_true', default=False, help='')
+    # Optimizer parameters
+    parser.add_argument('--opt', default='adamw', type=str, metavar='OPTIMIZER',
+                        help='Optimizer (default: "adamw")')
+    parser.add_argument('--opt-eps', default=1e-8, type=float, metavar='EPSILON',
+                        help='Optimizer Epsilon (default: 1e-8)')
+    parser.add_argument('--opt-betas', default=None, type=float, nargs='+', metavar='BETA',
+                        help='Optimizer Betas (default: None, use opt default)')
+    parser.add_argument('--clip-grad', type=float, default=None, metavar='NORM',
+                        help='Clip gradient norm (default: None, no clipping)')
+    parser.add_argument('--momentum', type=float, default=0.9, metavar='M',
+                        help='SGD momentum (default: 0.9)')
+    parser.add_argument('--weight-decay', type=float, default=1e-5,
+                        help='weight decay (default: 1e-5)')
+    # Learning rate schedule parameters
+    parser.add_argument('--sched', default='cosine', type=str, metavar='SCHEDULER',
+                        help='LR scheduler (default: "cosine")')
+    parser.add_argument('--lr', type=float, default=5e-5, metavar='LR',
+                        help='learning rate (default: 5e-5)')
+    parser.add_argument('--lr-noise', type=float, nargs='+', default=None, metavar='pct, pct',
+                        help='learning rate noise on/off epoch percentages')
+    parser.add_argument('--lr-noise-pct', type=float, default=0.67, metavar='PERCENT',
+                        help='learning rate noise limit percent (default: 0.67)')
+    parser.add_argument('--lr-noise-std', type=float, default=1.0, metavar='STDDEV',
+                        help='learning rate noise std-dev (default: 1.0)')
+    parser.add_argument('--warmup-lr', type=float, default=1e-8, metavar='LR',
+                        help='warmup learning rate (default: 1e-8)')
+    parser.add_argument('--min-lr', type=float, default=1e-7, metavar='LR',
+                        help='lower lr bound for cyclic schedulers that hit 0 (default: 1e-7)')
+    parser.add_argument('--decay-epochs', type=float, default=10, metavar='N',
+                        help='epoch interval to decay LR')
+    parser.add_argument('--warmup-epochs', type=int, default=10, metavar='N',
+                        help='epochs to warmup LR, if scheduler supports')
+    parser.add_argument('--cooldown-epochs', type=int, default=10, metavar='N',
+                        help='epochs to cooldown LR at min_lr, after cyclic schedule ends')
+    parser.add_argument('--patience-epochs', type=int, default=10, metavar='N',
+                        help='patience epochs for Plateau LR scheduler (default: 10)')
+    parser.add_argument('--decay-rate', '--dr', type=float, default=0.1, metavar='RATE',
+                        help='LR decay rate (default: 0.1)')
+    # Augmentation parameters
+    parser.add_argument('--color-jitter', type=float, default=0.4, metavar='PCT',
+                        help='Color jitter factor (default: 0.4)')
+    parser.add_argument('--aa', type=str, default='rand-m9-mstd0.5', metavar='NAME',
+                        help='Use AutoAugment policy. "v0" or "original". '
+                             '(default: rand-m9-mstd0.5)')
+    parser.add_argument('--smoothing', type=float, default=0.1, help='Label smoothing (default: 0.1)')
+    parser.add_argument('--train-interpolation', type=str, default='bicubic',
+                        help='Training interpolation (random, bilinear, bicubic default: "bicubic")')
+    parser.add_argument('--repeated-aug', action='store_true')
+    parser.add_argument('--no-repeated-aug', action='store_false', dest='repeated_aug')
+    parser.set_defaults(repeated_aug=False)
+    # * Random Erase params
+    parser.add_argument('--reprob', type=float, default=0.0, metavar='PCT',
+                        help='Random erase prob (default: 0.0)')
+    parser.add_argument('--remode', type=str, default='pixel',
+                        help='Random erase mode (default: "pixel")')
+    parser.add_argument('--recount', type=int, default=1,
+                        help='Random erase count (default: 1)')
+    parser.add_argument('--resplit', action='store_true', default=False,
+                        help='Do not random erase first (clean) augmentation split')
+    # * Mixup params
+    parser.add_argument('--cutout',default=True)
+    parser.add_argument('--mixup', type=float, default=0,
+                        help='mixup alpha, mixup enabled if > 0. (default: 0)')
+    parser.add_argument('--cutmix', type=float, default=0,
+                        help='cutmix alpha, cutmix enabled if > 0. (default: 0)')
+    parser.add_argument('--cutmix-minmax', type=float, nargs='+', default=None,
+                        help='cutmix min/max ratio, overrides alpha and enables cutmix if set (default: None)')
+    parser.add_argument('--mixup-prob', type=float, default=1.0,
+                        help='Probability of performing mixup or cutmix when either/both is enabled')
+    parser.add_argument('--mixup-switch-prob', type=float, default=0.5,
+                        help='Probability of switching to cutmix when both mixup and cutmix enabled')
+    parser.add_argument('--mixup-mode', type=str, default='batch',
+                        help='How to apply mixup/cutmix params. Per "batch", "pair", or "elem"')
+    # Output and runtime parameters
+    parser.add_argument('--output_dir', default="",
+                        help='path where to save, empty for no saving')
+    parser.add_argument('--device', default='cuda',
+                        help='device to use for training / testing')
+    parser.add_argument('--seed', default=42, type=int)
+    parser.add_argument('--resume', default="", help='resume from checkpoint')
+    parser.add_argument('--no-resume-loss-scaler', action='store_false', dest='resume_loss_scaler')
+    parser.add_argument('--no-amp', action='store_false', dest='amp', help='disable amp')
+    parser.add_argument('--use_checkpoint', default=False, help='use checkpoint to save memory')
+    parser.add_argument('--start_epoch', default=0, type=int, metavar='N',
+                        help='start epoch')
+    parser.add_argument('--eval', action='store_true', help='Perform evaluation only')
+    parser.add_argument('--num_workers', default=8, type=int)
+    parser.add_argument('--pin-mem', action='store_true',
+                        help='Pin CPU memory in DataLoader for more efficient (sometimes) transfer to GPU.')
+    parser.add_argument('--no-pin-mem', action='store_false', dest='pin_mem',
+                        help='')
+    parser.set_defaults(pin_mem=True)
+    # for testing and validation
+    parser.add_argument('--num_crops', default=1, type=int, choices=[1, 3, 5, 10])
+    parser.add_argument('--num_clips', default=1, type=int)
+    # distributed training parameters
+    parser.add_argument('--world_size', default=1, type=int,
+                        help='number of distributed processes')
+    parser.add_argument("--local_rank", type=int)
+    parser.add_argument('--dist_url', default='env://', help='url used to set up distributed training')
+    parser.add_argument('--auto-resume', default=True, help='auto resume')
+    # exp
+    # parser.add_argument('--simclr_w', type=float, default=0., help='weights for simclr loss')
+    parser.add_argument('--contrastive_nomixup', action='store_true', help='do not involve mixup in contrastive learning')
+    parser.add_argument('--finetune', default=False, help='finetune model')
+    parser.add_argument('--initial_checkpoint', type=str, default='', help='path to the pretrained model')
+    parser.add_argument('--hard_contrastive', action='store_true', help='use HEXA')
+    # parser.add_argument('--selfdis_w', type=float, default=0., help='enable self distillation')
+    return parser
+def main(args):
+    utils.init_distributed_mode(args)
+    print(args)
+    # Patch
+    if not hasattr(args, 'hard_contrastive'):
+        args.hard_contrastive = False
+    if not hasattr(args, 'selfdis_w'):
+        args.selfdis_w = 0.0
+    #is_imnet21k = args.data_set == 'IMNET21K'
+    device = torch.device(args.device)
+    # fix the seed for reproducibility
+    seed = args.seed + utils.get_rank()
+    torch.manual_seed(seed)
+    np.random.seed(seed)
+    # random.seed(seed)
+    cudnn.benchmark = True
+    num_classes, train_list_name, val_list_name, test_list_name, filename_seperator, image_tmpl, filter_video, label_file = get_dataset_config(
+        args.dataset, args.use_lmdb)
+    args.num_classes = num_classes
+    if args.modality == 'rgb':
+        args.input_channels = 3
+    elif args.modality == 'flow':
+        args.input_channels = 2 * 5
+    mixup_fn = None
+    mixup_active = args.mixup > 0 or args.cutmix > 0. or args.cutmix_minmax is not None
+    if mixup_active:
+        mixup_fn = Mixup(
+            mixup_alpha=args.mixup, cutmix_alpha=args.cutmix, cutmix_minmax=args.cutmix_minmax,
+            prob=args.mixup_prob, switch_prob=args.mixup_switch_prob, mode=args.mixup_mode,
+            label_smoothing=args.smoothing, num_classes=args.num_classes)
+    print(f"Creating model: {args.model}")
+    model =create_model(
+        args.model,
+        pretrained=args.pretrained,
+        duration=args.duration,
+        hpe_to_token = args.hpe_to_token,
+        rel_pos = args.rel_pos,
+        window_size=args.window_size,
+        thumbnail_rows = args.thumbnail_rows,
+        token_mask=not args.no_token_mask,
+        online_learning = False,
+        num_classes=args.num_classes,
+        drop_rate=args.drop,
+        drop_path_rate=args.drop_path,
+        drop_block_rate=args.drop_block,
+        use_checkpoint=args.use_checkpoint
+    )
+    # TODO: finetuning
+    model.to(device)
+    model_ema = None
+    if args.model_ema:
+        # Important to create EMA model after cuda(), DP wrapper, and AMP but before SyncBN and DDP wrapper
+        model_ema = ModelEma(
+            model,
+            decay=args.model_ema_decay,
+            device='cpu' if args.model_ema_force_cpu else '',
+            resume=args.resume)
+    model_without_ddp = model
+    if args.distributed:
+        #model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu], find_unused_parameters=True)
+        model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
+        model_without_ddp = model.module
+    n_parameters = sum(p.numel() for p in model.parameters() if p.requires_grad)
+    print('number of params:', n_parameters)
+    optimizer = create_optimizer(args, model)
+    loss_scaler = NativeScaler()
+    #print(f"Scaled learning rate (batch size: {args.batch_size * utils.get_world_size()}): {linear_scaled_lr}")
+    lr_scheduler, _ = create_scheduler(args, optimizer)
+    # Mixup and CutMix both produce soft targets, so either one needs
+    # SoftTargetCrossEntropy; smoothing is then handled by the mixup label transform.
+    if mixup_active:
+        criterion = SoftTargetCrossEntropy()
+    elif args.smoothing:
+        criterion = LabelSmoothingCrossEntropy(smoothing=args.smoothing)
+    else:
+        criterion = torch.nn.CrossEntropyLoss()
+    if args.distributed:
+        mean = (0.5, 0.5, 0.5) if 'mean' not in model.module.default_cfg else model.module.default_cfg['mean']
+        std = (0.5, 0.5, 0.5) if 'std' not in model.module.default_cfg else model.module.default_cfg['std']
+    else:
+        mean = (0.5, 0.5, 0.5) if 'mean' not in model.default_cfg else model.default_cfg['mean']
+        std = (0.5, 0.5, 0.5) if 'std' not in model.default_cfg else model.default_cfg['std']
+    # dataset_train, args.nb_classes = build_dataset(is_train=True, args=args)
+    # create data loaders w/ augmentation pipeline
+    video_data_cls = VideoDataSet
+    train_list = os.path.join(args.data_txt_dir, train_list_name)
+    train_augmentor = get_augmentor(True, args.input_size, mean, std, threed_data=False,
+                                    version=args.augmentor_ver, scale_range=args.scale_range, cut_out = args.cutout,dataset=args.dataset)
+    dataset_train = video_data_cls(args.data_dir, train_list, args.duration, args.frames_per_group,
+                                num_clips=args.num_clips,
+                                modality=args.modality, image_tmpl=image_tmpl,
+                                dense_sampling=args.dense_sampling,
+                                transform=train_augmentor, is_train=True, test_mode=False,
+                                seperator=filename_seperator, filter_video=filter_video)
+    num_tasks = utils.get_world_size()
+    data_loader_train = build_dataflow(dataset_train, is_train=True, batch_size=args.batch_size,
+                                    workers=args.num_workers, is_distributed=args.distributed)
+    val_list = os.path.join(args.data_txt_dir, val_list_name)
+    val_augmentor = get_augmentor(False, args.input_size, mean, std, args.disable_scaleup,
+                                threed_data=args.threed_data, version=args.augmentor_ver,
+                                scale_range=args.scale_range, num_clips=args.num_clips, num_crops=args.num_crops,cut_out = False, dataset=args.dataset)
+    dataset_val = video_data_cls(args.data_dir, val_list, args.duration, args.frames_per_group,
+                                num_clips=args.num_clips,
+                                modality=args.modality, image_tmpl=image_tmpl,
+                                dense_sampling=args.dense_sampling,
+                                transform=val_augmentor, is_train=False, test_mode=False,
+                                seperator=filename_seperator, filter_video=filter_video)
+    data_loader_val = build_dataflow(dataset_val, is_train=False, batch_size=args.batch_size,
+                                    workers=args.num_workers, is_distributed=args.distributed)
+    max_accuracy = 0.0
+    output_dir = Path(args.output_dir)
+    if args.initial_checkpoint:
+        checkpoint = torch.load(args.initial_checkpoint, map_location='cpu')
+        utils.load_checkpoint(model, checkpoint['model'])
+    if args.auto_resume:
+        if args.resume == '':
+            args.resume = str(output_dir / "checkpoint.pth")
+            if not os.path.exists(args.resume):
+                args.resume = ''
+    if args.resume:
+        if args.resume.startswith('https'):
+            checkpoint = torch.hub.load_state_dict_from_url(
+                args.resume, map_location='cpu', check_hash=True)
+        else:
+            checkpoint = torch.load(args.resume, map_location='cpu')
+        utils.load_checkpoint(model, checkpoint['model'])
+        if not args.eval and 'optimizer' in checkpoint and 'lr_scheduler' in checkpoint and 'epoch' in checkpoint:
+            optimizer.load_state_dict(checkpoint['optimizer'])
+            lr_scheduler.load_state_dict(checkpoint['lr_scheduler'])
+            args.start_epoch = checkpoint['epoch'] + 1
+            if 'scaler' in checkpoint and args.resume_loss_scaler:
+                print("Resume with previous loss scaler state")
+                loss_scaler.load_state_dict(checkpoint['scaler'])
+            if args.model_ema:
+                utils._load_checkpoint_for_ema(model_ema, checkpoint['model_ema'])
+            max_accuracy = checkpoint['max_accuracy']
+    if args.eval:
+        test_stats = evaluate(data_loader_val, model, device, num_tasks, distributed=args.distributed, amp=args.amp, num_crops=args.num_crops, num_clips=args.num_clips)
+        print(f"Accuracy of the network on the {len(dataset_val)} test images: {test_stats['acc1']:.1f}%")
+        return
+    print(f"Start training, current max acc is {max_accuracy:.2f}")
+    start_time = time.time()
+    for epoch in range(args.start_epoch, args.epochs):
+        if args.distributed:
+            data_loader_train.sampler.set_epoch(epoch)
+        train_stats = train_one_epoch(
+            model, criterion, data_loader_train,args.num_clips,
+            optimizer, device, epoch, loss_scaler,
+            args.clip_grad, model_ema, mixup_fn, num_tasks, True,
+            amp=args.amp,
+            contrastive_nomixup=args.contrastive_nomixup,
+            hard_contrastive=args.hard_contrastive,
+            finetune=args.finetune
+        )
+        lr_scheduler.step(epoch)
+        test_stats = evaluate(data_loader_val, model, device, num_tasks, distributed=args.distributed, amp=args.amp, num_crops=args.num_crops, num_clips=args.num_clips)
+        print(f"Accuracy of the network on the {len(dataset_val)} test images: {test_stats['acc1']:.1f}%")
+        max_accuracy = max(max_accuracy, test_stats["acc1"])
+        print(f'Max accuracy: {max_accuracy:.2f}%')
+        if args.output_dir:
+            checkpoint_paths = [output_dir / 'checkpoint{}.pth'.format(epoch)]
+            if test_stats["acc1"] == max_accuracy:
+                checkpoint_paths.append(output_dir / 'model_best.pth')
+            for checkpoint_path in checkpoint_paths:
+                state_dict = {
+                    'model': model_without_ddp.state_dict(),
+                    'optimizer': optimizer.state_dict(),
+                    'lr_scheduler': lr_scheduler.state_dict(),
+                    'epoch': epoch,
+                    'args': args,
+                    'scaler': loss_scaler.state_dict(),
+                    'max_accuracy': max_accuracy
+                }
+                if args.model_ema:
+                    state_dict['model_ema'] = get_state_dict(model_ema)
+                utils.save_on_master(state_dict, checkpoint_path)
+        log_stats = {**{f'train_{k}': v for k, v in train_stats.items()},
+                     **{f'test_{k}': v for k, v in test_stats.items()},
+                     'epoch': epoch,
+                     'n_parameters': n_parameters}
+        if args.output_dir and utils.is_main_process():
+            with (output_dir / "log.txt").open("a") as f:
+                f.write(json.dumps(log_stats) + "\n")
+    total_time = time.time() - start_time
+    total_time_str = str(datetime.timedelta(seconds=int(total_time)))
+    print('Training time {}'.format(total_time_str))
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser('DeiT training and evaluation script', parents=[get_args_parser()])
+    args = parser.parse_args()
+    if args.output_dir:
+        Path(args.output_dir).mkdir(parents=True, exist_ok=True)
+    main(args)
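
Review note on the boolean-looking options declared with only `default=True` or `default=False` (for example `--dense_sampling`, `--threed_data`, `--use_lmdb`, `--use_checkpoint`, `--auto-resume`, `--finetune`): argparse stores whatever is typed on the command line as a string when no `type` is given, so `--dense_sampling False` yields the non-empty string `"False"`, which is truthy. The sketch below shows one possible fix, assuming a small converter named `str2bool`; that helper and the `--dense_sampling_raw` option are hypothetical and exist only for this comparison.

```python
import argparse


def str2bool(value: str) -> bool:
    """Parse common true/false spellings; reject anything else."""
    v = value.strip().lower()
    if v in {"1", "true", "yes", "y", "on"}:
        return True
    if v in {"0", "false", "no", "n", "off"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser("bool flag sketch", add_help=False)
    # Typed variant: the command-line value is converted to a real bool.
    parser.add_argument("--dense_sampling", type=str2bool, default=True)
    # Untyped variant, declared the same way as in the scripts above.
    parser.add_argument("--dense_sampling_raw", default=True)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--dense_sampling", "False", "--dense_sampling_raw", "False"]
    )
    print(repr(args.dense_sampling), bool(args.dense_sampling))
    print(repr(args.dense_sampling_raw), bool(args.dense_sampling_raw))
```

With the untyped declaration the only reliable way to get the non-default value is to change the default in code, which is easy to overlook when reproducing a run from a command line.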

make_tall_txt.py ADDED

	@@ -0,0 +1,146 @@

+# make_tall_txt.py
+# Usage:
+#   python make_tall_txt.py --root frames_root --out lists --train_ratio 0.8 --seed 42
+#
+# Expected structure:
+#   root/
+#     real/<video_id>/*.jpg|png...
+#     fake/<video_id>/*.jpg|png...
+#
+# Output:
+#   lists/train.txt
+#   lists/test.txt
+#
+# Each line:
+#   relative_path start_frame end_frame label
+import os
+import re
+import argparse
+import random
+from pathlib import Path
+IMG_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff", ".webp"}
+# Try to parse frame index from filenames like:
+#   frame_000012_f000036.jpg  -> uses the "f000036" part
+#   000036.png                -> uses the numeric stem
+RE_FPART = re.compile(r"_f(\d+)", re.IGNORECASE)
+RE_NUM   = re.compile(r"(\d+)$")
+def list_frames(video_dir: Path):
+    files = [p for p in video_dir.iterdir() if p.is_file() and p.suffix.lower() in IMG_EXTS]
+    files.sort()
+    return files
+def parse_frame_idx(p: Path):
+    s = p.stem
+    m = RE_FPART.search(s)
+    if m:
+        return int(m.group(1))
+    m = RE_NUM.search(s)
+    if m:
+        return int(m.group(1))
+    return None  # unknown
+def video_start_end(video_dir: Path):
+    frames = list_frames(video_dir)
+    if not frames:
+        return None, None, 0
+    idxs = [parse_frame_idx(p) for p in frames]
+    idxs = [i for i in idxs if i is not None]
+    # Fallback: if can't parse indices, use 1..N
+    if not idxs:
+        return 1, len(frames), len(frames)
+    return min(idxs), max(idxs), len(frames)
+def collect_videos(root: Path, class_name: str):
+    class_dir = root / class_name
+    if not class_dir.exists():
+        return []
+    videos = []
+    for vd in class_dir.iterdir():
+        if vd.is_dir():
+            start, end, n = video_start_end(vd)
+            if n > 0:
+                videos.append((vd, start, end, n))
+    videos.sort(key=lambda x: x[0].name)
+    return videos
+def write_list(items, out_path: Path, root: Path, label_map):
+    with open(out_path, "w", encoding="utf-8") as f:
+        for vd, start, end, _n, label in items:
+            rel = vd.relative_to(root).as_posix()
+            f.write(f"{rel} {start} {end} {label}\n")
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--root", required=True, help="Root with real/ and fake/ video folders")
+    ap.add_argument("--out", default="lists", help="Output folder for txt files")
+    ap.add_argument("--train_ratio", type=float, default=0.8)
+    ap.add_argument("--seed", type=int, default=42)
+    ap.add_argument("--label_real", type=int, default=0, help="Label for real")
+    ap.add_argument("--label_fake", type=int, default=1, help="Label for fake")
+    args = ap.parse_args()
+    if not (0.0 < args.train_ratio < 1.0):
+        raise SystemExit("--train_ratio must be between 0 and 1.")
+    root = Path(args.root)
+    out = Path(args.out)
+    out.mkdir(parents=True, exist_ok=True)
+    # Collect
+    real_videos = collect_videos(root, "real")
+    fake_videos = collect_videos(root, "fake")
+    if not real_videos:
+        raise SystemExit(f"No videos found under: {root/'real'}")
+    if not fake_videos:
+        raise SystemExit(f"No videos found under: {root/'fake'}")
+    # Build items list (per video)
+    label_map = {"real": args.label_real, "fake": args.label_fake}
+    items = []
+    for vd, start, end, n in real_videos:
+        items.append((vd, start, end, n, label_map["real"]))
+    for vd, start, end, n in fake_videos:
+        items.append((vd, start, end, n, label_map["fake"]))
+    # Split by class (keeps balance more stable)
+    rng = random.Random(args.seed)
+    def split_class(videos, label):
+        vids = [(vd, s, e, n, label) for vd, s, e, n in videos]
+        rng.shuffle(vids)
+        k = int(round(len(vids) * args.train_ratio))
+        return vids[:k], vids[k:]
+    real_train, real_test = split_class(real_videos, label_map["real"])
+    fake_train, fake_test = split_class(fake_videos, label_map["fake"])
+    train_items = real_train + fake_train
+    test_items  = real_test + fake_test
+    rng.shuffle(train_items)
+    rng.shuffle(test_items)
+    # Write
+    train_path = out / "cdf_train_fold.txt"
+    test_path = out / "cdf_test_fold.txt"
+    write_list(train_items, train_path, root, label_map)
+    write_list(test_items, test_path, root, label_map)
+    print("DONE")
+    print(f"Train videos: {len(train_items)} (real {len(real_train)}, fake {len(fake_train)})")
+    print(f"Test  videos: {len(test_items)} (real {len(real_test)},  fake {len(fake_test)})")
+    print("Saved:")
+    print(" ", train_path.resolve())
+    print(" ", test_path.resolve())
+if __name__ == "__main__":
+    main()
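The frame-index parsing and the list-line format used by `make_tall_txt.py` can be sketched in isolation. This is a minimal illustration that reuses the two regular expressions from the script; the `format_line` helper and the sample file names are hypothetical and only serve the example.

```python
import re
from pathlib import Path

# Same patterns as in make_tall_txt.py.
RE_FPART = re.compile(r"_f(\d+)", re.IGNORECASE)
RE_NUM = re.compile(r"(\d+)$")


def parse_frame_idx(p: Path):
    """Return the frame index encoded in a file name, or None if there is none."""
    stem = p.stem
    # Prefer the "_f<digits>" part; otherwise fall back to trailing digits.
    m = RE_FPART.search(stem) or RE_NUM.search(stem)
    return int(m.group(1)) if m else None


def format_line(rel: str, start: int, end: int, label: int) -> str:
    """Hypothetical helper: one list line, 'relative_path start end label'."""
    return f"{rel} {start} {end} {label}"


# Illustrative file names only.
names = ["frame_000012_f000036.jpg", "000036.png", "cover.png"]
parsed = [parse_frame_idx(Path(n)) for n in names]
line = format_line("real/vid01", 1, 120, 0)
```

Names without any parsable number yield `None`; the script drops those and falls back to `1..N` only when no frame in the folder can be parsed.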

make_tall_txt_count.py ADDED

	@@ -0,0 +1,65 @@

+import argparse, random
+from pathlib import Path
+IMG_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}
+def count_frames(video_dir: Path) -> int:
+    return sum(1 for p in video_dir.iterdir() if p.is_file() and p.suffix.lower() in IMG_EXTS)
+def collect(root: Path, cls: str, label: int):
+    base = root / cls
+    out = []
+    for vd in sorted([p for p in base.iterdir() if p.is_dir()]):
+        n = count_frames(vd)
+        if n >= 4:  # needs more than 3 frames
+            rel = vd.relative_to(root).as_posix()
+            out.append((rel, 1, n, label))
+    return out
+def split(items, train_ratio, seed):
+    rng = random.Random(seed)
+    rng.shuffle(items)
+    k = int(round(len(items)*train_ratio))
+    return items[:k], items[k:]
+def write_list(path: Path, items):
+    with open(path, "w", encoding="utf-8") as f:
+        for rel, s, e, lab in items:
+            f.write(f"{rel} {s} {e} {lab}\n")
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--root", required=True)
+    ap.add_argument("--out", default="lists")
+    ap.add_argument("--train_ratio", type=float, default=0.8)
+    ap.add_argument("--seed", type=int, default=42)
+    ap.add_argument("--label_real", type=int, default=0)
+    ap.add_argument("--label_fake", type=int, default=1)
+    args = ap.parse_args()
+    root = Path(args.root)
+    out = Path(args.out); out.mkdir(parents=True, exist_ok=True)
+    real = collect(root, "real", args.label_real)
+    fake = collect(root, "fake", args.label_fake)
+    r_tr, r_te = split(real, args.train_ratio, args.seed)
+    f_tr, f_te = split(fake, args.train_ratio, args.seed)
+    train = r_tr + f_tr
+    test  = r_te + f_te
+    rng = random.Random(args.seed)
+    rng.shuffle(train); rng.shuffle(test)
+    write_list(out/"train.txt", train)
+    write_list(out/"test.txt", test)
+    write_list(out/"cdf_train_fold.txt", train)
+    write_list(out/"cdf_test_fold.txt", test)
+    print("OK")
+    print(f"train videos: {len(train)} | test videos: {len(test)}")
+if __name__ == "__main__":
+    main()
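The per-class split in `make_tall_txt_count.py` can be checked on synthetic data. The sketch below assumes made-up video entries and, unlike the script, shuffles a copy so the caller's list is left untouched; the rounding and seeding logic is the same.

```python
import random


def split(items, train_ratio, seed):
    """Seeded shuffle, then cut at round(len * train_ratio)."""
    rng = random.Random(seed)
    shuffled = list(items)  # copy: the original script shuffles in place
    rng.shuffle(shuffled)
    k = int(round(len(shuffled) * train_ratio))
    return shuffled[:k], shuffled[k:]


# Synthetic entries in the (relative_path, start, end, label) layout.
real = [(f"real/v{i:03d}", 1, 32, 0) for i in range(10)]
fake = [(f"fake/v{i:03d}", 1, 32, 1) for i in range(5)]

r_tr, r_te = split(real, 0.8, 42)
f_tr, f_te = split(fake, 0.8, 42)
```

Splitting each class separately keeps the real/fake ratio roughly equal in both lists, and a fixed seed makes the split reproducible across runs.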

models.py ADDED

	@@ -0,0 +1,245 @@

+# Copyright (c) 2015-present, Facebook, Inc.
+# All rights reserved.
+#
+# This source code is licensed under the CC-by-NC license found in the
+# LICENSE file in the root directory of this source tree.
+#
+import torch
+import torch.nn as nn
+from functools import partial
+from timm.models.vision_transformer import VisionTransformer, _cfg
+from timm.models.registry import register_model
+@register_model
+def deit_tiny_patch8_224(pretrained=False, **kwargs):
+    model = VisionTransformer(
+        patch_size=8, embed_dim=192, depth=12, num_heads=3, mlp_ratio=4, qkv_bias=True,
+        norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
+    model.default_cfg = _cfg()
+    if pretrained:
+        checkpoint = torch.hub.load_state_dict_from_url(
+            url="https://dl.fbaipublicfiles.com/deit/deit_tiny_patch16_224-a1311bcf.pth",
+            map_location="cpu", check_hash=True
+        )
+        model.load_state_dict(checkpoint["model"])
+    return model
+@register_model
+def deit_tiny_patch16_224(pretrained=False, **kwargs):
+    model = VisionTransformer(
+        patch_size=16, embed_dim=192, depth=12, num_heads=3, mlp_ratio=4, qkv_bias=True,
+        norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
+    model.default_cfg = _cfg()
+    if pretrained:
+        checkpoint = torch.hub.load_state_dict_from_url(
+            url="https://dl.fbaipublicfiles.com/deit/deit_tiny_patch16_224-a1311bcf.pth",
+            map_location="cpu", check_hash=True
+        )
+        model.load_state_dict(checkpoint["model"])
+    return model
+@register_model
+def deit_tiny_patch16_d_6_224(pretrained=False, **kwargs):
+    model = VisionTransformer(
+        patch_size=16, embed_dim=192, depth=6, num_heads=3, mlp_ratio=4, qkv_bias=True,
+        norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
+    model.default_cfg = _cfg()
+    if pretrained:
+        checkpoint = torch.hub.load_state_dict_from_url(
+            url="https://dl.fbaipublicfiles.com/deit/deit_tiny_patch16_224-a1311bcf.pth",
+            map_location="cpu", check_hash=True
+        )
+        model.load_state_dict(checkpoint["model"])
+    return model
+@register_model
+def deit_tiny_patch32_224(pretrained=False, **kwargs):
+    model = VisionTransformer(
+        patch_size=32, embed_dim=192, depth=12, num_heads=3, mlp_ratio=4, qkv_bias=True,
+        norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
+    model.default_cfg = _cfg()
+    if pretrained:
+        checkpoint = torch.hub.load_state_dict_from_url(
+            url="https://dl.fbaipublicfiles.com/deit/deit_tiny_patch16_224-a1311bcf.pth",
+            map_location="cpu", check_hash=True
+        )
+        model.load_state_dict(checkpoint["model"])
+    return model
+@register_model
+def deit_small_patch8_224(pretrained=False, **kwargs):
+    model = VisionTransformer(
+        patch_size=8, embed_dim=384, depth=12, num_heads=6, mlp_ratio=4, qkv_bias=True,
+        norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
+    model.default_cfg = _cfg()
+    if pretrained:
+        checkpoint = torch.hub.load_state_dict_from_url(
+            url="https://dl.fbaipublicfiles.com/deit/deit_small_patch16_224-cd65a155.pth",
+            map_location="cpu", check_hash=True
+        )
+        model.load_state_dict(checkpoint["model"])
+    return model
+@register_model
+def deit_small_patch16_224(pretrained=False, **kwargs):
+    model = VisionTransformer(
+        patch_size=16, embed_dim=384, depth=12, num_heads=6, mlp_ratio=4, qkv_bias=True,
+        norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
+    model.default_cfg = _cfg()
+    if pretrained:
+        checkpoint = torch.hub.load_state_dict_from_url(
+            url="https://dl.fbaipublicfiles.com/deit/deit_small_patch16_224-cd65a155.pth",
+            map_location="cpu", check_hash=True
+        )
+        model.load_state_dict(checkpoint["model"])
+    return model
+@register_model
+def deit_small_patch16_d_6_224(pretrained=False, **kwargs):
+    model = VisionTransformer(
+        patch_size=16, embed_dim=384, depth=6, num_heads=6, mlp_ratio=4, qkv_bias=True,
+        norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
+    model.default_cfg = _cfg()
+    if pretrained:
+        checkpoint = torch.hub.load_state_dict_from_url(
+            url="https://dl.fbaipublicfiles.com/deit/deit_small_patch16_224-cd65a155.pth",
+            map_location="cpu", check_hash=True
+        )
+        model.load_state_dict(checkpoint["model"])
+    return model
+@register_model
+def deit_small_patch32_224(pretrained=False, **kwargs):
+    model = VisionTransformer(
+        patch_size=32, embed_dim=384, depth=12, num_heads=6, mlp_ratio=4, qkv_bias=True,
+        norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
+    model.default_cfg = _cfg()
+    if pretrained:
+        checkpoint = torch.hub.load_state_dict_from_url(
+            url="https://dl.fbaipublicfiles.com/deit/deit_small_patch16_224-cd65a155.pth",
+            map_location="cpu", check_hash=True
+        )
+        model.load_state_dict(checkpoint["model"])
+    return model
+@register_model
+def deit_base_patch8_224(pretrained=False, **kwargs):
+    model = VisionTransformer(
+        patch_size=8, embed_dim=768, depth=12, num_heads=12, mlp_ratio=4, qkv_bias=True,
+        norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
+    model.default_cfg = _cfg()
+    if pretrained:
+        checkpoint = torch.hub.load_state_dict_from_url(
+            url="https://dl.fbaipublicfiles.com/deit/deit_base_patch16_224-b5f2ef4d.pth",
+            map_location="cpu", check_hash=True
+        )
+        model.load_state_dict(checkpoint["model"])
+    return model
+@register_model
+def deit_base_patch16_224(pretrained=False, **kwargs):
+    model = VisionTransformer(
+        patch_size=16, embed_dim=768, depth=12, num_heads=12, mlp_ratio=4, qkv_bias=True,
+        norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
+    model.default_cfg = _cfg()
+    if pretrained:
+        checkpoint = torch.hub.load_state_dict_from_url(
+            url="https://dl.fbaipublicfiles.com/deit/deit_base_patch16_224-b5f2ef4d.pth",
+            map_location="cpu", check_hash=True
+        )
+        model.load_state_dict(checkpoint["model"])
+    return model
+@register_model
+def deit_base_patch16_ft_224(pretrained=False, **kwargs):
+    model = VisionTransformer(
+        patch_size=16, embed_dim=768, depth=12, num_heads=12, mlp_ratio=4, qkv_bias=True,
+        norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
+    model.default_cfg = _cfg()
+    if pretrained:
+        checkpoint = torch.hub.load_state_dict_from_url(
+            url="https://dl.fbaipublicfiles.com/deit/deit_base_patch16_224-b5f2ef4d.pth",
+            map_location="cpu", check_hash=True
+        )
+        model.load_state_dict(checkpoint["model"])
+    for m in model.parameters():
+        m.requires_grad = False
+    for m in model.head.parameters():
+        m.requires_grad = True
+    return model
+@register_model
+def deit_base24_patch16_224(pretrained=False, **kwargs):
+    model = VisionTransformer(
+        patch_size=16, embed_dim=768, depth=24, num_heads=12, mlp_ratio=4, qkv_bias=True,
+        norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
+    model.default_cfg = _cfg()
+    if pretrained:
+        checkpoint = torch.hub.load_state_dict_from_url(
+            url="https://dl.fbaipublicfiles.com/deit/deit_base_patch16_224-b5f2ef4d.pth",
+            map_location="cpu", check_hash=True
+        )
+        model.load_state_dict(checkpoint["model"])
+    return model
+@register_model
+def deit_base16_patch16_224(pretrained=False, **kwargs):
+    model = VisionTransformer(
+        patch_size=16, embed_dim=768, depth=16, num_heads=12, mlp_ratio=4, qkv_bias=True,
+        norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
+    model.default_cfg = _cfg()
+    if pretrained:
+        checkpoint = torch.hub.load_state_dict_from_url(
+            url="https://dl.fbaipublicfiles.com/deit/deit_base_patch16_224-b5f2ef4d.pth",
+            map_location="cpu", check_hash=True
+        )
+        model.load_state_dict(checkpoint["model"])
+    return model
+@register_model
+def deit_base_patch16_384(pretrained=False, **kwargs):
+    model = VisionTransformer(img_size=384,
+        patch_size=16, embed_dim=768, depth=12, num_heads=12, mlp_ratio=4, qkv_bias=True,
+        norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
+    model.default_cfg = _cfg()
+    if pretrained:
+        checkpoint = torch.hub.load_state_dict_from_url(
+            url="https://dl.fbaipublicfiles.com/deit/deit_base_patch16_224-b5f2ef4d.pth",
+            map_location="cpu", check_hash=True
+        )
+        model.load_state_dict(checkpoint["model"])
+    return model
+@register_model
+def deit_base_patch32_224(pretrained=False, **kwargs):
+    model = VisionTransformer(
+        patch_size=32, embed_dim=768, depth=12, num_heads=12, mlp_ratio=4, qkv_bias=True,
+        norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
+    model.default_cfg = _cfg()
+    if pretrained:
+        checkpoint = torch.hub.load_state_dict_from_url(
+            url="https://dl.fbaipublicfiles.com/deit/deit_base_patch16_224-b5f2ef4d.pth",
+            map_location="cpu", check_hash=True
+        )
+        model.load_state_dict(checkpoint["model"])
+    return model
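Several factories above build a model whose patch size, depth, or image size differs from the patch16/224 checkpoint they download. Because `load_state_dict` is strict by default, those `pretrained=True` paths would presumably fail on a shape or key mismatch. The sketch below is pure Python and does not touch timm; it only computes the parameter shapes a standard ViT with one class token would have, under the assumption of 3 input channels. The helper names are hypothetical.

```python
# (embed_dim, num_heads) per variant, as used by the factories in models.py.
VARIANTS = {"tiny": (192, 3), "small": (384, 6), "base": (768, 12)}


def num_tokens(img_size: int, patch_size: int) -> int:
    """Patches on a square grid plus one class token."""
    grid = img_size // patch_size
    return grid * grid + 1


def pos_embed_shape(variant: str, img_size: int, patch_size: int):
    """Expected shape of the positional embedding tensor."""
    embed_dim, _ = VARIANTS[variant]
    return (1, num_tokens(img_size, patch_size), embed_dim)


def patch_proj_shape(variant: str, patch_size: int, in_chans: int = 3):
    """Expected weight shape of the patch-embedding convolution."""
    embed_dim, _ = VARIANTS[variant]
    return (embed_dim, in_chans, patch_size, patch_size)


checkpoint = pos_embed_shape("tiny", 224, 16)  # shape stored in the patch16 weights
patch8 = pos_embed_shape("tiny", 224, 8)       # shape the patch8 model expects
compatible = checkpoint == patch8
```

Under these assumptions only the patch16, depth-12, 224-pixel factories match their checkpoint exactly; the others would need filtering or interpolation of the mismatched tensors before loading.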

renumber_frames_for_tall.py ADDED

	@@ -0,0 +1,53 @@

+# renumber_frames_for_tall.py
+# Usage:
+#   python renumber_frames_for_tall.py --root "C:\...\TALL4Deepfake\data" --ext .jpg --digits 4 --copy
+import argparse
+from pathlib import Path
+import shutil
+IMG_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--root", required=True, help="Root that contains real/ and fake/")
+    ap.add_argument("--digits", type=int, default=4, help="Digits for numbering (4 -> 0001.jpg)")
+    ap.add_argument("--ext", default=".jpg", help="Target extension for output filenames, e.g. .jpg")
+    ap.add_argument("--copy", action="store_true", help="Copy instead of rename (safer)")
+    args = ap.parse_args()
+    root = Path(args.root)
+    for cls in ["real", "fake"]:
+        cls_dir = root / cls
+        if not cls_dir.exists():
+            print(f"[skip] {cls_dir} not found")
+            continue
+        for vid_dir in sorted([p for p in cls_dir.iterdir() if p.is_dir()]):
+            frames = [p for p in vid_dir.iterdir() if p.is_file() and p.suffix.lower() in IMG_EXTS]
+            frames.sort(key=lambda p: p.name)
+            if not frames:
+                print(f"[empty] {vid_dir}")
+                continue
+            tmp_dir = vid_dir / "_tall_tmp"
+            tmp_dir.mkdir(exist_ok=True)
+            for i, src in enumerate(frames, start=1):
+                dst_name = f"{i:0{args.digits}d}{args.ext}"
+                dst = tmp_dir / dst_name
+                if args.copy:
+                    shutil.copy2(src, dst)
+                else:
+                    shutil.move(src, dst)
+            # move tmp content back
+            for f in tmp_dir.iterdir():
+                shutil.move(str(f), str(vid_dir / f.name))
+            tmp_dir.rmdir()
+            print(f"[ok] {vid_dir.name}: {len(frames)} frames -> renamed to 1..{len(frames)}")
+if __name__ == "__main__":
+    main()
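`renumber_frames_for_tall.py` orders frames with a plain name sort before assigning `0001`, `0002`, and so on. That is fine for zero-padded names, but unpadded names such as `frame_10.jpg` would sort before `frame_2.jpg`. The sketch below shows the difference with a natural-sort key; `natural_key` and `target_name` are hypothetical helpers and are not part of the script.

```python
import re


def natural_key(name: str):
    """Split digit runs out of a name so they compare as integers."""
    return [int(tok) if tok.isdigit() else tok.lower()
            for tok in re.split(r"(\d+)", name)]


def target_name(i: int, digits: int = 4, ext: str = ".jpg") -> str:
    """Zero-padded output name, built the same way as in the script."""
    return f"{i:0{digits}d}{ext}"


# Illustrative, unpadded frame names.
names = ["frame_10.jpg", "frame_2.jpg", "frame_1.jpg"]
lexicographic = sorted(names)
natural = sorted(names, key=natural_key)

# Mapping from source name to renumbered name under natural ordering.
renamed = {src: target_name(i) for i, src in enumerate(natural, start=1)}
```

If the extracted frames are already zero-padded, both orderings agree and the script's plain sort is sufficient.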

requirements-torch.txt ADDED

	@@ -0,0 +1,4 @@

+--index-url https://download.pytorch.org/whl/cu117
+torch==1.13.1+cu117
+torchvision==0.14.1+cu117
+opencv-python==4.13.0.92

requirements.txt ADDED

Binary file (2.37 kB).

test.py ADDED

	@@ -0,0 +1,333 @@

+import argparse
+import numpy as np
+import torch
+import torch.backends.cudnn as cudnn
+import os
+import warnings
+from pathlib import Path
+from timm.models import create_model
+from timm.utils import ModelEma
+#from datasets import build_dataset
+import my_models
+from engine import evaluate
+#import simclr
+import utils
+from video_dataset import VideoDataSet
+from video_dataset_aug import get_augmentor, build_dataflow
+from video_dataset_config import get_dataset_config, DATASET_CONFIG
+warnings.filterwarnings("ignore", category=UserWarning)
+#torch.multiprocessing.set_start_method('spawn', force=True)
+def get_args_parser():
+    parser = argparse.ArgumentParser('DeiT training and evaluation script', add_help=False)
+    parser.add_argument('--model_name',default="TALL_SWIN")
+    parser.add_argument('--batch-size', default=2, type=int)
+    parser.add_argument('--epochs', default=30, type=int)
+    # Dataset parameters
+    parser.add_argument('--data_txt_dir', type=str,default='##path_for_dataset_txt##', help='path to text of dataset')
+    parser.add_argument('--data_dir', type=str,default="##path_for_dataset##", help='path to dataset')
+    parser.add_argument('--dataset', default='ffpp',
+                        choices=list(DATASET_CONFIG.keys()), help='path to dataset file list')
+    parser.add_argument('--duration', default=1, type=int, help='number of frames')
+    parser.add_argument('--frames_per_group', default=1, type=int,
+                        help='[uniform sampling] number of frames per group; '
+                             '[dense sampling]: sampling frequency')
+    parser.add_argument('--threed_data', default=False, help='load data in the layout for 3D conv')
+    parser.add_argument('--input_size', default=224, type=int, metavar='N', help='input image size')
+    parser.add_argument('--disable_scaleup', action='store_true',
+                        help='do not scale up and then crop a small region, directly crop the input_size')
+    parser.add_argument('--random_sampling', action='store_true',
+                        help='perform deterministic sampling for data loader')
+    parser.add_argument('--dense_sampling', default=True,
+                        help='perform dense sampling for data loader')
+    parser.add_argument('--augmentor_ver', default='v1', type=str, choices=['v1', 'v2'],
+                        help='[v1] TSN data augmentation, [v2] resize the shorter side to `scale_range`')
+    parser.add_argument('--scale_range', default=[256, 320], type=int, nargs="+",
+                        metavar='scale_range', help='scale range for augmentor v2')
+    parser.add_argument('--modality', default='rgb', type=str, help='rgb or flow')
+    parser.add_argument('--use_lmdb', default=False, help='use lmdb instead of jpeg.')
+    parser.add_argument('--use_pyav', default=False, help='use video directly.')
+    # temporal module
+    parser.add_argument('--pretrained', action='store_true', default=False,
+                    help='Start with pretrained version of specified network (if avail)')
+    parser.add_argument('--temporal_module_name', default=None, type=str, metavar='TEM', choices=['ResNet3d', 'TAM', 'TTAM', 'TSM', 'TTSM', 'MSA'],
+                        help='temporal module applied. [TAM]')
+    parser.add_argument('--temporal_attention_only', action='store_true', default=False,
+                        help='use attention only in temporal module')
+    parser.add_argument('--no_token_mask', action='store_true', default=False, help='do not apply token mask')
+    parser.add_argument('--temporal_heads_scale', default=1.0, type=float, help='scale of the number of spatial heads')
+    parser.add_argument('--temporal_mlp_scale', default=1.0, type=float, help='scale of spatial mlp')
+    parser.add_argument('--rel_pos', action='store_true', default=False,
+                        help='use relative positioning in temporal module')
+    parser.add_argument('--temporal_pooling', type=str, default=None, choices=['avg', 'max', 'conv', 'depthconv'],
+                        help='perform temporal pooling')
+    parser.add_argument('--bottleneck', default=None, choices=['regular', 'dw'],
+                        help='use depth-wise bottleneck in temporal attention')
+    parser.add_argument('--window_size', default=7, type=int, help='number of frames')
+    parser.add_argument('--thumbnail_rows', default=3, type=int, help='number of frames per row')
+    parser.add_argument('--hpe_to_token', default=False, action='store_true',
+                        help='add hub position embedding to image tokens')
+    # Model parameters
+    parser.add_argument('--model', default='TALL_SWIN', type=str, metavar='MODEL',
+                        help='Name of model to train')
+#    parser.add_argument('--input-size', default=224, type=int, help='images input size')
+    parser.add_argument('--drop', type=float, default=0.0, metavar='PCT',
+                        help='Dropout rate (default: 0.)')
+    parser.add_argument('--drop-path', type=float, default=0.1, metavar='PCT',
+                        help='Drop path rate (default: 0.1)')
+    parser.add_argument('--drop-block', type=float, default=None, metavar='PCT',
+                        help='Drop block rate (default: None)')
+    parser.add_argument('--model-ema', action='store_true')
+    parser.add_argument('--no-model-ema', action='store_false', dest='model_ema')
+    parser.set_defaults(model_ema=True)
+    parser.add_argument('--model-ema-decay', type=float, default=0.99996, help='')
+    parser.add_argument('--model-ema-force-cpu', action='store_true', default=False, help='')
+    # Optimizer parameters
+    parser.add_argument('--opt', default='adamw', type=str, metavar='OPTIMIZER',
+                        help='Optimizer (default: "adamw")')
+    parser.add_argument('--opt-eps', default=1e-8, type=float, metavar='EPSILON',
+                        help='Optimizer Epsilon (default: 1e-8)')
+    parser.add_argument('--opt-betas', default=None, type=float, nargs='+', metavar='BETA',
+                        help='Optimizer Betas (default: None, use opt default)')
+    parser.add_argument('--clip-grad', type=float, default=None, metavar='NORM',
+                        help='Clip gradient norm (default: None, no clipping)')
+    parser.add_argument('--momentum', type=float, default=0.9, metavar='M',
+                        help='SGD momentum (default: 0.9)')
+    parser.add_argument('--weight-decay', type=float, default=1e-5,
+                        help='weight decay (default: 1e-5)')
+    # Learning rate schedule parameters
+    parser.add_argument('--sched', default='cosine', type=str, metavar='SCHEDULER',
+                        help='LR scheduler (default: "cosine")')
+    parser.add_argument('--lr', type=float, default=5e-5, metavar='LR',
+                        help='learning rate (default: 5e-5)')
+    parser.add_argument('--lr-noise', type=float, nargs='+', default=None, metavar='pct, pct',
+                        help='learning rate noise on/off epoch percentages')
+    parser.add_argument('--lr-noise-pct', type=float, default=0.67, metavar='PERCENT',
+                        help='learning rate noise limit percent (default: 0.67)')
+    parser.add_argument('--lr-noise-std', type=float, default=1.0, metavar='STDDEV',
+                        help='learning rate noise std-dev (default: 1.0)')
+    parser.add_argument('--warmup-lr', type=float, default=1e-7, metavar='LR',
+                        help='warmup learning rate (default: 1e-7)')
+    parser.add_argument('--min-lr', type=float, default=2e-6, metavar='LR',
+                        help='lower lr bound for cyclic schedulers that hit 0 (default: 2e-6)')
+    parser.add_argument('--decay-epochs', type=float, default=10, metavar='N',
+                        help='epoch interval to decay LR')
+    parser.add_argument('--warmup-epochs', type=int, default=10, metavar='N',
+                        help='epochs to warmup LR, if scheduler supports')
+    parser.add_argument('--cooldown-epochs', type=int, default=10, metavar='N',
+                        help='epochs to cooldown LR at min_lr, after cyclic schedule ends')
+    parser.add_argument('--patience-epochs', type=int, default=10, metavar='N',
+                        help='patience epochs for Plateau LR scheduler (default: 10)')
+    parser.add_argument('--decay-rate', '--dr', type=float, default=0.1, metavar='RATE',
+                        help='LR decay rate (default: 0.1)')
+    # Augmentation parameters
+    parser.add_argument('--color-jitter', type=float, default=0.4, metavar='PCT',
+                        help='Color jitter factor (default: 0.4)')
+    parser.add_argument('--aa', type=str, default='rand-m9-mstd0.5-inc1', metavar='NAME',
+                        help='Use AutoAugment policy. "v0" or "original". '
+                             '(default: rand-m9-mstd0.5-inc1)')
+    parser.add_argument('--smoothing', type=float, default=0.1, help='Label smoothing (default: 0.1)')
+    parser.add_argument('--train-interpolation', type=str, default='bicubic',
+                        help='Training interpolation (random, bilinear, bicubic default: "bicubic")')
+    parser.add_argument('--repeated-aug', action='store_true')
+    parser.add_argument('--no-repeated-aug', action='store_false', dest='repeated_aug')
+    parser.set_defaults(repeated_aug=False)
+    # * Random Erase params
+    parser.add_argument('--reprob', type=float, default=0.0, metavar='PCT',
+                        help='Random erase prob (default: 0.0)')
+    parser.add_argument('--remode', type=str, default='pixel',
+                        help='Random erase mode (default: "pixel")')
+    parser.add_argument('--recount', type=int, default=1,
+                        help='Random erase count (default: 1)')
+    parser.add_argument('--resplit', action='store_true', default=False,
+                        help='Do not random erase first (clean) augmentation split')
+    # * Mixup params
+    parser.add_argument('--mixup', type=float, default=0,
+                        help='mixup alpha, mixup enabled if > 0. (default: 0)')
+    parser.add_argument('--cutmix', type=float, default=0,
+                        help='cutmix alpha, cutmix enabled if > 0. (default: 0)')
+    parser.add_argument('--cutmix-minmax', type=float, nargs='+', default=None,
+                        help='cutmix min/max ratio, overrides alpha and enables cutmix if set (default: None)')
+    parser.add_argument('--mixup-prob', type=float, default=1.0,
+                        help='Probability of performing mixup or cutmix when either/both is enabled')
+    parser.add_argument('--mixup-switch-prob', type=float, default=0.5,
+                        help='Probability of switching to cutmix when both mixup and cutmix enabled')
+    parser.add_argument('--mixup-mode', type=str, default='batch',
+                        help='How to apply mixup/cutmix params. Per "batch", "pair", or "elem"')
+    # Dataset parameters
+    parser.add_argument('--output_dir', default="./output",
+                        help='path where to save, empty for no saving')
+    parser.add_argument('--device', default='cuda',
+                        help='device to use for training / testing')
+    parser.add_argument('--seed', default=42, type=int)
+    parser.add_argument('--resume', default='', help='resume from checkpoint')
+    parser.add_argument('--no-resume-loss-scaler', action='store_false', dest='resume_loss_scaler')
+    parser.add_argument('--no-amp', action='store_false', dest='amp', help='disable amp')
+    parser.add_argument('--use_checkpoint', default=False, help='use checkpoint to save memory')
+    parser.add_argument('--start_epoch', default=0, type=int, metavar='N',
+                        help='start epoch')
+    parser.add_argument('--eval', action='store_true', help='Perform evaluation only')
+    parser.add_argument('--num_workers', default=8, type=int)
+    parser.add_argument('--pin-mem', action='store_true',
+                        help='Pin CPU memory in DataLoader for more efficient (sometimes) transfer to GPU.')
+    parser.add_argument('--no-pin-mem', action='store_false', dest='pin_mem',
+                        help='')
+    parser.set_defaults(pin_mem=True)
+    # for testing and validation
+    parser.add_argument('--num_crops', default=1, type=int, choices=[1, 3, 5, 10])
+    parser.add_argument('--num_clips', default=3, type=int)
+    # distributed training parameters
+    parser.add_argument('--world_size', default=1, type=int,
+                        help='number of distributed processes')
+    parser.add_argument("--local_rank", type=int)
+    parser.add_argument('--dist_url', default='env://', help='url used to set up distributed training')
+    parser.add_argument('--auto-resume', default=True, help='auto resume')
+    # exp
+    # parser.add_argument('--simclr_w', type=float, default=0., help='weights for simclr loss')
+    parser.add_argument('--contrastive_nomixup', action='store_true', help='do not involve mixup in contrastive learning')
+    parser.add_argument('--finetune', default=False, help='finetune model')
+    parser.add_argument('--initial_checkpoint', type=str, default='', help='path to the pretrained model')
+    parser.add_argument('--hard_contrastive', action='store_true', help='use HEXA')
+    # parser.add_argument('--selfdis_w', type=float, default=0., help='enable self distillation')
+    return parser
+def main(args):
+    utils.init_distributed_mode(args)
+    print(args)
+    # Patch
+    if not hasattr(args, 'hard_contrastive'):
+        args.hard_contrastive = False
+    if not hasattr(args, 'selfdis_w'):
+        args.selfdis_w = 0.0
+    #is_imnet21k = args.data_set == 'IMNET21K'
+    device = torch.device(args.device)
+    # fix the seed for reproducibility
+    seed = args.seed + utils.get_rank()
+    torch.manual_seed(seed)
+    np.random.seed(seed)
+    # random.seed(seed)
+    cudnn.benchmark = True
+    num_classes, train_list_name, val_list_name, test_list_name, filename_seperator, image_tmpl, filter_video, label_file = get_dataset_config(
+        args.dataset, args.use_lmdb)
+    args.num_classes = num_classes
+    if args.modality == 'rgb':
+        args.input_channels = 3
+    elif args.modality == 'flow':
+        args.input_channels = 2 * 5
+    print(f"Creating model: {args.model}")
+    model = create_model(
+        args.model,
+        pretrained=args.pretrained,
+        duration=args.duration,
+        hpe_to_token = args.hpe_to_token,
+        rel_pos = args.rel_pos,
+        window_size=args.window_size,
+        thumbnail_rows = args.thumbnail_rows,
+        token_mask=not args.no_token_mask,
+        online_learning = False,
+        num_classes=args.num_classes,
+        drop_rate=args.drop,
+        drop_path_rate=args.drop_path,
+        drop_block_rate=args.drop_block,
+        use_checkpoint=args.use_checkpoint
+    )
+    # TODO: finetuning
+    model.to(device)
+    model_ema = None
+    if args.model_ema:
+        # Important to create EMA model after cuda(), DP wrapper, and AMP but before SyncBN and DDP wrapper
+        model_ema = ModelEma(
+            model,
+            decay=args.model_ema_decay,
+            device='cpu' if args.model_ema_force_cpu else '',
+            resume=args.resume)
+    model_without_ddp = model
+    if args.distributed:
+        #model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu], find_unused_parameters=True)
+        model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
+        model_without_ddp = model.module
+    n_parameters = sum(p.numel() for p in model.parameters() if p.requires_grad)
+    print('number of params:', n_parameters)
+    if args.distributed:
+        mean = (0.5, 0.5, 0.5) if 'mean' not in model.module.default_cfg else model.module.default_cfg['mean']
+        std = (0.5, 0.5, 0.5) if 'std' not in model.module.default_cfg else model.module.default_cfg['std']
+    else:
+        mean = (0.5, 0.5, 0.5) if 'mean' not in model.default_cfg else model.default_cfg['mean']
+        std = (0.5, 0.5, 0.5) if 'std' not in model.default_cfg else model.default_cfg['std']
+    # dataset_train, args.nb_classes = build_dataset(is_train=True, args=args)
+    # create data loaders w/ augmentation pipeline
+    video_data_cls = VideoDataSet
+    num_tasks = utils.get_world_size()
+    val_list = os.path.join(args.data_txt_dir, val_list_name)
+    val_augmentor = get_augmentor(False, args.input_size, mean, std, args.disable_scaleup,
+                                threed_data=args.threed_data, version=args.augmentor_ver,
+                                scale_range=args.scale_range, num_clips=args.num_clips, num_crops=args.num_crops, dataset=args.dataset)
+    dataset_val = video_data_cls(args.data_dir, val_list, args.duration, args.frames_per_group,
+                                num_clips=args.num_clips,
+                                modality=args.modality,
+                                dense_sampling=args.dense_sampling,
+                                image_tmpl=image_tmpl,
+                                transform=val_augmentor,
+                                is_train=False, test_mode=False,
+                                seperator=filename_seperator, filter_video=filter_video)
+    data_loader_val = build_dataflow(dataset_val, is_train=False, batch_size=args.batch_size,
+                                    workers=args.num_workers, is_distributed=args.distributed)
+    if args.initial_checkpoint:
+        checkpoint = torch.load(args.initial_checkpoint, map_location='cpu')
+        utils.load_checkpoint(model, checkpoint['model'])
+        state = evaluate(data_loader_val, model, device, num_tasks, distributed=args.distributed, amp=args.amp, num_crops=args.num_crops, num_clips=args.num_clips)
+        print(f"Accuracy of the network on the {len(dataset_val)} test images: {state['acc1']:.1f}%")
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser('DeiT evaluation script', parents=[get_args_parser()])
+    args = parser.parse_args()
+    if args.output_dir:
+        Path(args.output_dir).mkdir(parents=True, exist_ok=True)
+    main(args)
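
Reviewer sketch (not part of the commit): `test.py` indexes `checkpoint['model']` directly, so a checkpoint saved as a bare state dict would raise a `KeyError`. The snippet below is a minimal sketch of a more tolerant unwrapping step, assuming the weights are nested under `"model"` and/or `"state_dict"`. The helper name `unwrap_state_dict` is hypothetical, and plain dictionaries stand in for real tensors so the example runs without PyTorch.

```python
def unwrap_state_dict(checkpoint, wrapper_keys=("model", "state_dict")):
    """Return the innermost weight mapping from a loaded checkpoint object.

    Hypothetical helper: peels off the wrapper keys in order when present,
    and returns the input unchanged when it is already a bare state dict.
    """
    state = checkpoint
    for key in wrapper_keys:
        # Only descend when the key maps to another dict (a nested container).
        if isinstance(state, dict) and isinstance(state.get(key), dict):
            state = state[key]
    return state


# Plain dicts stand in for tensors so the example needs no dependencies.
weights = {"head.weight": [0.1, 0.2], "head.bias": [0.0]}
print(unwrap_state_dict({"model": weights, "epoch": 3}) is weights)
print(unwrap_state_dict({"state_dict": weights}) is weights)
print(unwrap_state_dict(weights) is weights)
```

The result of such a helper could then be passed to `model.load_state_dict(..., strict=False)`, which is what `utils.load_checkpoint` in this commit already does after its own `'state_dict'` check.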

test_new.py ADDED Viewed

	@@ -0,0 +1,319 @@

+import argparse
+import numpy as np
+import torch
+import torch.backends.cudnn as cudnn
+import os
+import warnings
+import json
+from pathlib import Path
+from timm.models import create_model
+import my_models  # registers TALL_SWIN
+import utils
+from video_dataset import VideoDataSet
+from video_dataset_aug import get_augmentor, build_dataflow
+from video_dataset_config import get_dataset_config, DATASET_CONFIG
+from sklearn.metrics import (
+    accuracy_score, balanced_accuracy_score,
+    precision_recall_fscore_support,
+    confusion_matrix, classification_report,
+    roc_auc_score, roc_curve,
+    average_precision_score, precision_recall_curve
+)
+import matplotlib.pyplot as plt
+warnings.filterwarnings("ignore", category=UserWarning)
+def get_args_parser():
+    parser = argparse.ArgumentParser('DeiT evaluation script', add_help=False)
+    parser.add_argument('--model', default='TALL_SWIN', type=str)
+    parser.add_argument('--model_name', default="TALL_SWIN")
+    parser.add_argument('--batch-size', default=2, type=int)
+    # Dataset parameters
+    parser.add_argument('--data_txt_dir', type=str, default='##path_for_dataset_txt##')
+    parser.add_argument('--data_dir', type=str, default="##path_for_dataset##")
+    parser.add_argument('--dataset', default='ffpp', choices=list(DATASET_CONFIG.keys()))
+    parser.add_argument('--duration', default=1, type=int)
+    parser.add_argument('--frames_per_group', default=1, type=int)
+    parser.add_argument('--threed_data', default=False)
+    parser.add_argument('--input_size', default=224, type=int)
+    parser.add_argument('--disable_scaleup', action='store_true')
+    parser.add_argument('--random_sampling', action='store_true')
+    parser.add_argument('--dense_sampling', default=True)
+    parser.add_argument('--augmentor_ver', default='v1', type=str, choices=['v1', 'v2'])
+    parser.add_argument('--scale_range', default=[256, 320], type=int, nargs="+")
+    parser.add_argument('--modality', default='rgb', type=str)
+    parser.add_argument('--use_lmdb', default=False)
+    parser.add_argument('--use_pyav', default=False)
+    # temporal module / model params
+    parser.add_argument('--pretrained', action='store_true', default=False)
+    parser.add_argument('--temporal_module_name', default=None, type=str,
+                        choices=['ResNet3d', 'TAM', 'TTAM', 'TSM', 'TTSM', 'MSA'])
+    parser.add_argument('--temporal_attention_only', action='store_true', default=False)
+    parser.add_argument('--no_token_mask', action='store_true', default=False)
+    parser.add_argument('--temporal_heads_scale', default=1.0, type=float)
+    parser.add_argument('--temporal_mlp_scale', default=1.0, type=float)
+    parser.add_argument('--rel_pos', action='store_true', default=False)
+    parser.add_argument('--temporal_pooling', type=str, default=None,
+                        choices=['avg', 'max', 'conv', 'depthconv'])
+    parser.add_argument('--bottleneck', default=None, choices=['regular', 'dw'])
+    parser.add_argument('--window_size', default=7, type=int)
+    parser.add_argument('--thumbnail_rows', default=3, type=int)
+    parser.add_argument('--hpe_to_token', default=False, action='store_true')
+    parser.add_argument('--drop', type=float, default=0.0)
+    parser.add_argument('--drop-path', type=float, default=0.1)
+    parser.add_argument('--drop-block', type=float, default=None)
+    # runtime
+    parser.add_argument('--output_dir', default="./output")
+    parser.add_argument('--device', default='cuda')
+    parser.add_argument('--seed', default=42, type=int)
+    parser.add_argument('--num_workers', default=8, type=int)
+    parser.add_argument('--num_crops', default=1, type=int, choices=[1, 3, 5, 10])
+    parser.add_argument('--num_clips', default=3, type=int)
+    parser.add_argument('--world_size', default=1, type=int)
+    parser.add_argument("--local_rank", type=int)
+    parser.add_argument('--dist_url', default='env://')
+    # checkpoint
+    parser.add_argument('--initial_checkpoint', type=str, default='',
+                        help='path to the .pth/.pth.tar checkpoint (expects a "model" key)')
+    parser.add_argument('--threshold', type=float, default=0.5,
+                        help='threshold for predicting class 1 (fake) from prob[:, 1]')
+    parser.add_argument('--metrics_out', default='', type=str,
+                        help='directory in which to save metrics.json and plots (default: output_dir)')
+    parser.add_argument('--save_plots', action='store_true',
+                        help='save cm.png / roc.png / pr.png')
+    return parser
+@torch.no_grad()
+def eval_with_outputs(data_loader, model, device, threshold: float = 0.5):
+    model.eval()
+    y_true, y_score, y_pred = [], [], []
+    thr = float(threshold)
+    for samples, targets in data_loader:
+        samples = samples.to(device, non_blocking=True)
+        targets = targets.to(device, non_blocking=True)
+        logits = model(samples)  # [B,2] or [B*K,2]
+        # if the logits are per clip, aggregate them per video
+        B = targets.shape[0]
+        if logits.shape[0] != B:
+            if logits.shape[0] % B != 0:
+                raise RuntimeError(
+                    f"logits batch ({logits.shape[0]}) is not a multiple of the target batch ({B})."
+                )
+            K = logits.shape[0] // B
+            logits = logits.view(B, K, -1).mean(dim=1)  # [B,2]
+        probs = torch.softmax(logits, dim=1)  # [B,2]
+        p1 = probs[:, 1]                      # score for class 1 (fake)
+        # the decision threshold is applied here
+        hat = (p1 >= thr).long()
+        y_true.append(targets.detach().cpu().numpy())
+        y_score.append(p1.detach().cpu().numpy())
+        y_pred.append(hat.detach().cpu().numpy())
+    y_true = np.concatenate(y_true).astype(int)
+    y_score = np.concatenate(y_score).astype(float)
+    y_pred = np.concatenate(y_pred).astype(int)
+    return y_true, y_score, y_pred
+def plot_confusion(cm, out_path):
+    plt.figure(figsize=(6, 5))
+    plt.imshow(cm)
+    plt.title("Confusion Matrix")
+    plt.xlabel("Predicted")
+    plt.ylabel("True")
+    for (i, j), v in np.ndenumerate(cm):
+        plt.text(j, i, str(v), ha="center", va="center")
+    plt.tight_layout()
+    plt.savefig(out_path, dpi=200)
+    plt.close()
+def plot_roc(y, scores, out_path):
+    fpr, tpr, _ = roc_curve(y, scores)
+    auc = roc_auc_score(y, scores)
+    plt.figure(figsize=(7, 6))
+    plt.plot(fpr, tpr, label=f"AUC={auc:.4f}")
+    plt.plot([0, 1], [0, 1], "--", label="Chance")
+    plt.xlabel("FPR")
+    plt.ylabel("TPR")
+    plt.legend(loc="best")
+    plt.tight_layout()
+    plt.savefig(out_path, dpi=200)
+    plt.close()
+def plot_pr(y, scores, out_path):
+    p, r, _ = precision_recall_curve(y, scores)
+    ap = average_precision_score(y, scores)
+    plt.figure(figsize=(7, 6))
+    plt.plot(r, p, label=f"AP={ap:.4f}")
+    plt.xlabel("Recall")
+    plt.ylabel("Precision")
+    plt.legend(loc="best")
+    plt.tight_layout()
+    plt.savefig(out_path, dpi=200)
+    plt.close()
+def main(args):
+    utils.init_distributed_mode(args)
+    print(args)
+    device = torch.device(args.device)
+    seed = args.seed + utils.get_rank()
+    torch.manual_seed(seed)
+    np.random.seed(seed)
+    cudnn.benchmark = True
+    num_classes, train_list_name, val_list_name, test_list_name, filename_seperator, image_tmpl, filter_video, label_file = \
+        get_dataset_config(args.dataset, args.use_lmdb)
+    args.num_classes = num_classes
+    args.input_channels = 3 if args.modality == 'rgb' else 2 * 5
+    print(f"Creating model: {args.model}")
+    model = create_model(
+        args.model,
+        pretrained=args.pretrained,
+        duration=args.duration,
+        hpe_to_token=args.hpe_to_token,
+        rel_pos=args.rel_pos,
+        window_size=args.window_size,
+        thumbnail_rows=args.thumbnail_rows,
+        token_mask=not args.no_token_mask,
+        online_learning=False,
+        num_classes=args.num_classes,
+        drop_rate=args.drop,
+        drop_path_rate=args.drop_path,
+        drop_block_rate=args.drop_block,
+        use_checkpoint=False
+    )
+    model.to(device)
+    # mean/std
+    # the model is never wrapped in DDP in this script, so unwrap `.module` only if it exists
+    default_cfg = getattr(model, 'module', model).default_cfg
+    mean = default_cfg['mean'] if 'mean' in default_cfg else (0.5, 0.5, 0.5)
+    std = default_cfg['std'] if 'std' in default_cfg else (0.5, 0.5, 0.5)
+    # dataset (val list)
+    video_data_cls = VideoDataSet
+    val_list = os.path.join(args.data_txt_dir, val_list_name)
+    val_augmentor = get_augmentor(
+        False, args.input_size, mean, std, args.disable_scaleup,
+        threed_data=args.threed_data, version=args.augmentor_ver,
+        scale_range=args.scale_range, num_clips=args.num_clips,
+        num_crops=args.num_crops, dataset=args.dataset
+    )
+    dataset_val = video_data_cls(
+        args.data_dir, val_list,
+        args.duration, args.frames_per_group,
+        num_clips=args.num_clips,
+        modality=args.modality,
+        dense_sampling=args.dense_sampling,
+        image_tmpl=image_tmpl,
+        transform=val_augmentor,
+        is_train=False, test_mode=False,
+        seperator=filename_seperator, filter_video=filter_video
+    )
+    data_loader_val = build_dataflow(
+        dataset_val, is_train=False, batch_size=args.batch_size,
+        workers=args.num_workers, is_distributed=args.distributed
+    )
+    if not args.initial_checkpoint:
+        raise RuntimeError("Pass --initial_checkpoint pointing to the model checkpoint.")
+    checkpoint = torch.load(args.initial_checkpoint, map_location='cpu')
+    # many checkpoints are saved as {"model": state_dict, ...}
+    if isinstance(checkpoint, dict) and "model" in checkpoint:
+        utils.load_checkpoint(model, checkpoint["model"])
+    else:
+        # the checkpoint is a bare state_dict
+        model.load_state_dict(checkpoint, strict=False)
+    # eval
+    y_true, y_score, y_pred = eval_with_outputs(
+        data_loader_val, model, device, threshold=args.threshold
+    )
+    acc = accuracy_score(y_true, y_pred)
+    bacc = balanced_accuracy_score(y_true, y_pred)
+    prec, rec, f1, _ = precision_recall_fscore_support(
+        y_true, y_pred, average="binary", zero_division=0
+    )
+    cm = confusion_matrix(y_true, y_pred)
+    roc_auc = roc_auc_score(y_true, y_score)
+    pr_auc = average_precision_score(y_true, y_score)
+    print(f"\nN={len(y_true)} | thr={args.threshold:.3f}")
+    print(f"acc={acc:.4f} | bacc={bacc:.4f} | prec={prec:.4f} | rec={rec:.4f} | f1={f1:.4f} | roc_auc={roc_auc:.4f} | pr_auc={pr_auc:.4f}")
+    print(classification_report(y_true, y_pred, digits=4, zero_division=0))
+    outdir = args.metrics_out.strip() if args.metrics_out else args.output_dir
+    os.makedirs(outdir, exist_ok=True)
+    out_json = {
+        "threshold": float(args.threshold),
+        "acc": float(acc),
+        "balanced_acc": float(bacc),
+        "precision": float(prec),
+        "recall": float(rec),
+        "f1": float(f1),
+        "roc_auc": float(roc_auc),
+        "pr_auc": float(pr_auc),
+        "confusion_matrix": cm.tolist(),
+        "n": int(len(y_true)),
+    }
+    with open(os.path.join(outdir, "metrics.json"), "w", encoding="utf-8") as f:
+        json.dump(out_json, f, indent=2)
+    np.savez(os.path.join(outdir, "eval_outputs.npz"),
+             y_true=y_true, y_score=y_score, y_pred=y_pred)
+    if args.save_plots:
+        plot_confusion(cm, os.path.join(outdir, "cm.png"))
+        plot_roc(y_true, y_score, os.path.join(outdir, "roc.png"))
+        plot_pr(y_true, y_score, os.path.join(outdir, "pr.png"))
+        print(f"\n✔ Plots + metrics saved in: {os.path.abspath(outdir)}")
+    else:
+        print(f"\n✔ Metrics saved in: {os.path.abspath(os.path.join(outdir, 'metrics.json'))}")
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser('DeiT evaluation script', parents=[get_args_parser()])
+    args = parser.parse_args()
+    if args.output_dir:
+        Path(args.output_dir).mkdir(parents=True, exist_ok=True)
+    main(args)
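
Reviewer sketch (not part of the commit): `eval_with_outputs` averages per-clip logits into one row per video, applies a softmax, and thresholds the class-1 (fake) probability. The snippet below restates that logic in dependency-free Python so the arithmetic can be checked by hand. It assumes, as the `view(B, K, -1)` call implies, that the `K` clips of each video are contiguous in the batch. The function names are hypothetical, and lists of floats stand in for tensors.

```python
import math


def aggregate_clip_logits(logits, num_videos):
    """Average B*K per-clip logit rows into B per-video rows (video-major order)."""
    if num_videos <= 0 or len(logits) % num_videos != 0:
        raise ValueError(
            f"logits batch ({len(logits)}) is not a multiple of the target batch ({num_videos})."
        )
    clips_per_video = len(logits) // num_videos
    per_video = []
    for v in range(num_videos):
        rows = logits[v * clips_per_video:(v + 1) * clips_per_video]
        # Column-wise mean over the clips that belong to video v.
        per_video.append([sum(col) / clips_per_video for col in zip(*rows)])
    return per_video


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    shift = max(row)
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def predict_fake(logits, num_videos, threshold=0.5):
    """Return (scores, predictions), where scores are P(class 1) per video."""
    scores = [softmax(row)[1] for row in aggregate_clip_logits(logits, num_videos)]
    preds = [int(score >= threshold) for score in scores]
    return scores, preds


# Two videos with two clips each.
clip_logits = [[2.0, 0.0], [0.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
scores, preds = predict_fake(clip_logits, num_videos=2, threshold=0.5)
print([round(s, 4) for s in scores], preds)
```

Note that this averages logits before the softmax, as the script does; averaging per-clip probabilities instead is a different design choice and can give different scores near the threshold.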

utils.py ADDED Viewed

	@@ -0,0 +1,265 @@

+# Copyright (c) 2015-present, Facebook, Inc.
+# All rights reserved.
+#
+# This source code is licensed under the CC-by-NC license found in the
+# LICENSE file in the root directory of this source tree.
+#
+"""
+Misc functions, including distributed helpers.
+Mostly copy-paste from torchvision references.
+"""
+import io
+import os
+import time
+from collections import defaultdict, deque
+import datetime
+import tempfile
+import torch
+import torch.distributed as dist
+from fvcore.common.checkpoint import Checkpointer
+class SmoothedValue(object):
+    """Track a series of values and provide access to smoothed values over a
+    window or the global series average.
+    """
+    def __init__(self, window_size=20, fmt=None):
+        if fmt is None:
+            fmt = "{median:.4f} ({global_avg:.4f})"
+        self.deque = deque(maxlen=window_size)
+        self.total = 0.0
+        self.count = 0
+        self.fmt = fmt
+    def update(self, value, n=1):
+        self.deque.append(value)
+        self.count += n
+        self.total += value * n
+    def synchronize_between_processes(self):
+        """
+        Warning: does not synchronize the deque!
+        """
+        if not is_dist_avail_and_initialized():
+            return
+        t = torch.tensor([self.count, self.total], dtype=torch.float64, device='cuda')
+        dist.barrier()
+        dist.all_reduce(t)
+        t = t.tolist()
+        self.count = int(t[0])
+        self.total = t[1]
+    @property
+    def median(self):
+        d = torch.tensor(list(self.deque))
+        return d.median().item()
+    @property
+    def avg(self):
+        d = torch.tensor(list(self.deque), dtype=torch.float32)
+        return d.mean().item()
+    @property
+    def global_avg(self):
+        return self.total / self.count
+    @property
+    def max(self):
+        return max(self.deque)
+    @property
+    def value(self):
+        return self.deque[-1]
+    def __str__(self):
+        return self.fmt.format(
+            median=self.median,
+            avg=self.avg,
+            global_avg=self.global_avg,
+            max=self.max,
+            value=self.value)
+class MetricLogger(object):
+    def __init__(self, delimiter="\t"):
+        self.meters = defaultdict(SmoothedValue)
+        self.delimiter = delimiter
+    def update(self, **kwargs):
+        for k, v in kwargs.items():
+            if isinstance(v, torch.Tensor):
+                v = v.item()
+            assert isinstance(v, (float, int))
+            self.meters[k].update(v)
+    def __getattr__(self, attr):
+        if attr in self.meters:
+            return self.meters[attr]
+        if attr in self.__dict__:
+            return self.__dict__[attr]
+        raise AttributeError("'{}' object has no attribute '{}'".format(
+            type(self).__name__, attr))
+    def __str__(self):
+        loss_str = []
+        for name, meter in self.meters.items():
+            loss_str.append(
+                "{}: {}".format(name, str(meter))
+            )
+        return self.delimiter.join(loss_str)
+    def synchronize_between_processes(self):
+        for meter in self.meters.values():
+            meter.synchronize_between_processes()
+    def add_meter(self, name, meter):
+        self.meters[name] = meter
+    def log_every(self, iterable, print_freq, header=None):
+        i = 0
+        if not header:
+            header = ''
+        start_time = time.time()
+        end = time.time()
+        iter_time = SmoothedValue(fmt='{avg:.4f}')
+        data_time = SmoothedValue(fmt='{avg:.4f}')
+        space_fmt = ':' + str(len(str(len(iterable)))) + 'd'
+        log_msg = [
+            header,
+            '[{0' + space_fmt + '}/{1}]',
+            'eta: {eta}',
+            '{meters}',
+            'time: {time}',
+            'data: {data}'
+        ]
+        if torch.cuda.is_available():
+            log_msg.append('max mem: {memory:.0f}')
+        log_msg = self.delimiter.join(log_msg)
+        MB = 1024.0 * 1024.0
+        for obj in iterable:
+            data_time.update(time.time() - end)
+            yield obj
+            iter_time.update(time.time() - end)
+            if i % print_freq == 0 or i == len(iterable) - 1:
+                eta_seconds = iter_time.global_avg * (len(iterable) - i)
+                eta_string = str(datetime.timedelta(seconds=int(eta_seconds)))
+                if torch.cuda.is_available():
+                    print(log_msg.format(
+                        i, len(iterable), eta=eta_string,
+                        meters=str(self),
+                        time=str(iter_time), data=str(data_time),
+                        memory=torch.cuda.max_memory_allocated() / MB))
+                else:
+                    print(log_msg.format(
+                        i, len(iterable), eta=eta_string,
+                        meters=str(self),
+                        time=str(iter_time), data=str(data_time)))
+            i += 1
+            end = time.time()
+        total_time = time.time() - start_time
+        total_time_str = str(datetime.timedelta(seconds=int(total_time)))
+        print('{} Total time: {} ({:.4f} s / it)'.format(
+            header, total_time_str, total_time / len(iterable)))
+def _load_checkpoint_for_ema(model_ema, checkpoint):
+    """
+    Workaround for ModelEma._load_checkpoint to accept an already-loaded object
+    """
+    mem_file = io.BytesIO()
+    torch.save(checkpoint, mem_file)
+    mem_file.seek(0)
+    model_ema._load_checkpoint(mem_file)
+"""
+def load_checkpoint(model, state_dict, mode=None):
+    # reuse Checkpointer in fvcore to support flexible loading
+    ckpt = Checkpointer(model, save_to_disk=False)
+    # since Checkpointer requires the weight to be put under `model` field, we need to save it to disk
+    tmp_path = tempfile.NamedTemporaryFile('w+b')
+    torch.save({'model': state_dict}, tmp_path.name)
+    ckpt.load(tmp_path.name)
+    """
+def load_checkpoint(model, state_dict):
+    # Load checkpoint directly (avoid writing temp files on Windows)
+    if isinstance(state_dict, dict) and 'state_dict' in state_dict:
+        state_dict = state_dict['state_dict']
+    missing, unexpected = model.load_state_dict(state_dict, strict=False)
+    if len(missing) > 0:
+        print(f"[load_checkpoint] Missing keys: {len(missing)}")
+    if len(unexpected) > 0:
+        print(f"[load_checkpoint] Unexpected keys: {len(unexpected)}")
+def setup_for_distributed(is_master):
+    """
+    This function disables printing when not in master process
+    """
+    import builtins as __builtin__
+    builtin_print = __builtin__.print
+    def print(*args, **kwargs):
+        force = kwargs.pop('force', False)
+        if is_master or force:
+            builtin_print(*args, **kwargs)
+    __builtin__.print = print
+def is_dist_avail_and_initialized():
+    if not dist.is_available():
+        return False
+    if not dist.is_initialized():
+        return False
+    return True
+def get_world_size():
+    if not is_dist_avail_and_initialized():
+        return 1
+    return dist.get_world_size()
+def get_rank():
+    if not is_dist_avail_and_initialized():
+        return 0
+    return dist.get_rank()
+def is_main_process():
+    return get_rank() == 0
+def save_on_master(*args, **kwargs):
+    if is_main_process():
+        torch.save(*args, **kwargs)
+def init_distributed_mode(args):
+    if 'RANK' in os.environ and 'WORLD_SIZE' in os.environ:
+        args.rank = int(os.environ["RANK"])
+        args.world_size = int(os.environ['WORLD_SIZE'])
+        args.gpu = int(os.environ['LOCAL_RANK'])
+    elif 'SLURM_PROCID' in os.environ:
+        args.rank = int(os.environ['SLURM_PROCID'])
+        args.gpu = args.rank % torch.cuda.device_count()
+    else:
+        print('Not using distributed mode')
+        args.distributed = False
+        return
+    args.distributed = True
+    torch.cuda.set_device(args.gpu)
+    args.dist_backend = 'nccl'
+    print('| distributed init (rank {}): {}'.format(
+        args.rank, args.dist_url), flush=True)
+    torch.distributed.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
+                                         world_size=args.world_size, rank=args.rank)
+    torch.distributed.barrier()
+    setup_for_distributed(args.rank == 0)
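
Reviewer sketch (not part of the commit): `SmoothedValue` keeps two views of the same series, a bounded `deque` for recent values and running `total`/`count` fields for the global average. The snippet below is a minimal, dependency-free illustration of that split. The class name `WindowedMeter` and its property names are hypothetical, and the median and the distributed synchronization of the original class are deliberately left out.

```python
from collections import deque


class WindowedMeter:
    """Track a windowed average and a global average of a value series."""

    def __init__(self, window_size=20):
        # deque(maxlen=...) silently drops the oldest value once it is full.
        self.window = deque(maxlen=window_size)
        self.total = 0.0
        self.count = 0

    def update(self, value, n=1):
        self.window.append(value)
        self.count += n
        self.total += value * n

    @property
    def window_avg(self):
        return sum(self.window) / len(self.window)

    @property
    def global_avg(self):
        return self.total / self.count

    @property
    def latest(self):
        return self.window[-1]


meter = WindowedMeter(window_size=3)
for loss in [1.0, 2.0, 3.0, 4.0]:
    meter.update(loss)
# The window now holds only the last three values; the global average sees all four.
print(meter.window_avg, meter.global_avg, meter.latest)
```

This also shows why the original docstring warns that the deque is not synchronized across processes: only `count` and `total` are reduced, so windowed statistics stay local to each rank.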

video_dataset.py ADDED Viewed

	@@ -0,0 +1,1228 @@

+import os
+import six
+from typing import Union
+import random
+import numpy as np
+import torch
+from PIL import Image
+import torch.utils.data as data
+try:
+    import lmdb
+    import pyarrow as pa
+    _HAS_LMDB = True
+except ImportError as e:
+    _HAS_LMDB = False
+    _LMDB_ERROR_MSG = e
+try:
+    import av
+    _HAS_PYAV = True
+except ImportError as e:
+    _HAS_PYAV = False
+    _PYAV_ERROR_MSG = e
+def random_clip(video_frames, sampling_rate, frames_per_clip, fixed_offset=False, start_frame_idx=0, end_frame_idx=None):
+    """
+    Args:
+        video_frames (int): total frame number of a video
+        sampling_rate (int): sampling rate for clip, pick one every k frames
+        frames_per_clip (int): number of frames of a clip
+        fixed_offset (bool): used with sample offset to decide the offset value deterministically.
+    Returns:
+        list[int]: frame indices (started from zero)
+    """
+    new_sampling_rate = sampling_rate
+    highest_idx = video_frames - new_sampling_rate * frames_per_clip if end_frame_idx is None else end_frame_idx
+    if highest_idx <= 0:
+        random_offset = 0
+    else:
+        if fixed_offset:
+            random_offset = (video_frames - new_sampling_rate * frames_per_clip) // 2
+        else:
+            random_offset = int(np.random.randint(start_frame_idx, highest_idx))
+    # print(start_frame_idx, highest_idx, random_offset)
+    frame_idx = [int(random_offset + i * sampling_rate) % video_frames for i in range(frames_per_clip)]
+    return frame_idx
+def compute_img_diff(image_1, image_2, bound=255.0):
+    image_diff = np.asarray(image_1, dtype=np.float64) - np.asarray(image_2, dtype=np.float64)
+    image_diff += bound
+    image_diff *= (255.0 / float(2 * bound))
+    image_diff = image_diff.astype(np.uint8)
+    image_diff = Image.fromarray(image_diff)
+    return image_diff
+def load_image(root_path, directory, image_tmpl, idx, modality):
+    """
+    :param root_path:
+    :param directory:
+    :param image_tmpl:
+    :param idx: if it is a list, load a batch of images
+    :param modality:
+    :return:
+    """
+    def _safe_load_image(img_path):
+        img = None
+        num_try = 0
+        while num_try < 10:
+            try:
+                img_tmp = Image.open(img_path)
+                img = img_tmp.copy()
+                img_tmp.close()
+                break
+            except Exception as e:
+                print('[Will try load again] error loading image: {}, '
+                      'error: {}'.format(img_path, str(e)))
+                num_try += 1
+        if img is None:
+            raise ValueError('[Fail 10 times] error loading image: {}'.format(img_path))
+        return img
+    if not isinstance(idx, list):
+        idx = [idx]
+    out = []
+    if modality == 'rgb':
+        for i in idx:
+            image_path_file = os.path.join(root_path, directory, image_tmpl.format(i))
+            out.append(_safe_load_image(image_path_file))
+    elif modality == 'rgbdiff':
+        tmp = {}
+        new_idx = np.unique(np.concatenate((np.asarray(idx), np.asarray(idx) + 1)))
+        for i in new_idx:
+            image_path_file = os.path.join(root_path, directory, image_tmpl.format(i))
+            tmp[i] = _safe_load_image(image_path_file)
+        for k in idx:
+            img_ = compute_img_diff(tmp[k + 1], tmp[k])
+            out.append(img_)
+        del tmp
+    elif modality == 'flow':
+        for i in idx:
+            flow_x_name = os.path.join(root_path, directory, "x_" + image_tmpl.format(i))
+            flow_y_name = os.path.join(root_path, directory, "y_" + image_tmpl.format(i))
+            out.extend([_safe_load_image(flow_x_name), _safe_load_image(flow_y_name)])
+    return out
+def load_sound(data_dir, record, idx, fps, audio_length, resampling_rate,
+               window_size=10, step_size=5, eps=1e-6):
+    """idx must be the center frame of a clip"""
+    import librosa
+    centre_sec = (record.start_frame + idx) / fps
+    left_sec = centre_sec - (audio_length / 2.0)
+    right_sec = centre_sec + (audio_length / 2.0)
+    audio_fname = os.path.join(data_dir, record.path)
+    # TODO: generate 0s if the audio file does not exist.
+    if not os.path.exists(audio_fname):
+        return [Image.fromarray(np.zeros((256, 256 * int(audio_length / 1.28))))]
+    samples, sr = librosa.core.load(audio_fname, sr=None, mono=True)
+    duration = samples.shape[0] / float(resampling_rate)
+    left_sample = int(round(left_sec * resampling_rate))
+    right_sample = int(round(right_sec * resampling_rate))
+    required_samples = int(round(resampling_rate * audio_length))
+    if left_sec < 0:
+        samples = samples[:required_samples]
+    elif right_sec > duration:
+        samples = samples[-required_samples:]
+    else:
+        samples = samples[left_sample:right_sample]
+    # TODO: is the size of the spectrogram fixed when the number of samples differs?
+    # if there are not enough samples, repeat the waveform
+    if len(samples) < required_samples:
+        multiplies = required_samples / len(samples)
+        samples = np.tile(samples, int(multiplies + 0.5) + 1)
+        samples = samples[:required_samples]
+    # log spectrogram
+    nperseg = int(round(window_size * resampling_rate / 1e3))
+    noverlap = int(round(step_size * resampling_rate / 1e3))
+    spec = librosa.stft(samples, n_fft=511, window='hann', hop_length=noverlap,
+                        win_length=nperseg, pad_mode='constant')
+    spec = np.log(np.real(spec * np.conj(spec)) + eps)
+    img = Image.fromarray(spec)
+    return [img]
+def load_data_lmdb(videos, idx, modality):
+    def _convert_buffer_to_PIL(tmp_buf, is_flow=False):
+        data = six.BytesIO()
+        data.write(tmp_buf)
+        data.seek(0)
+        img_tmp = Image.open(data).convert('RGB' if not is_flow else 'L')
+        img_ = img_tmp.copy()
+        img_tmp.close()
+        return img_
+    img = []
+    if modality == 'rgb':
+        buf = [videos[i] for i in idx]
+        for x in buf:
+            img_ = _convert_buffer_to_PIL(x)
+            img.append(img_)
+    elif modality == 'flow':
+        new_idx = np.asarray(idx) * 2 - 1
+        buf = [[videos[i], videos[i + 1]] for i in new_idx]
+        for x in buf:
+            flow_x = _convert_buffer_to_PIL(x[0], True)
+            flow_y = _convert_buffer_to_PIL(x[1], True)
+            img.extend([flow_x, flow_y])
+    elif modality == 'rgbdiff':
+        tmp = {}
+        new_idx = np.unique(np.concatenate((np.asarray(idx), np.asarray(idx) + 1)))
+        for i in new_idx:
+            tmp[i] = _convert_buffer_to_PIL(videos[i])
+        for k in idx:
+            img_ = compute_img_diff(tmp[k + 1], tmp[k])
+            img.append(img_)
+        del tmp
+    return img
+def sample_train_clip(video_length, num_consecutive_frames, num_frames, sample_freq, dense_sampling, num_clips=1):
+    max_frame_idx = max(1, video_length - num_consecutive_frames + 1)
+    if dense_sampling:
+        frame_idx = np.zeros((num_clips, num_frames), dtype=int)
+        if num_clips == 1:  # backward compatibility
+            frame_idx[0] = np.asarray(random_clip(max_frame_idx, sample_freq, num_frames, False))
+        else:
+            max_start_frame_idx = max_frame_idx - sample_freq * num_frames
+            frames_per_segment = max_start_frame_idx // num_clips
+            for i in range(num_clips):
+                if frames_per_segment <= 0:
+                    frame_idx[i] = np.asarray(random_clip(max_frame_idx, sample_freq, num_frames, False))
+                    #frame_idx[i] = [frame_idx[i][2],frame_idx[i][0],frame_idx[i][1],frame_idx[i][3]]
+                    #frame_idx[i] = [frame_idx[i][3],frame_idx[i][2],frame_idx[i][1],frame_idx[i][0]]
+                else:
+                    frame_idx[i] = np.asarray(random_clip(max_frame_idx, sample_freq, num_frames, False, i * frames_per_segment, (i + 1) * frames_per_segment))
+                    #1423
+                    #frame_idx[i] = [frame_idx[i][2],frame_idx[i][0],frame_idx[i][1],frame_idx[i][3]]
+                    #frame_idx[i] = [frame_idx[i][3],frame_idx[i][2],frame_idx[i][1],frame_idx[i][0]]
+        frame_idx = frame_idx.flatten()
+        """
+        def _check_interval_overlapped(int_1, int_2):
+            if int_1[0] < int_2[0]:
+                int_l, int_r = int_1, int_2
+            else:
+                int_l, int_r = int_2, int_1
+            return True if int_l[-1] > int_r[0] else False
+        clips = 0
+        num_tries = 0
+        #all_frame_idx = np.arange(max_frame_idx - sample_freq * num_frames)
+        while clips < num_clips and num_tries < 1000:
+            curr_clips = np.asarray(random_clip(max_frame_idx, sample_freq, num_frames))
+            overlap = False
+            for i in range(clips):
+                overlap = _check_interval_overlapped((frame_idx[i][0], frame_idx[i][-1]), (curr_clips[0], curr_clips[-1]) )
+                if overlap:
+                    break
+            if overlap:
+                num_tries += 1
+                continue
+            else:
+                frame_idx[clips] = curr_clips
+                clips += 1
+        for i in range(clips, num_clips):
+            frame_idx[i] = np.asarray(random_clip(max_frame_idx, sample_freq, num_frames))
+        # sort the intervals
+        frame_idx = frame_idx[np.argsort(frame_idx[:, 0]), ...]
+        frame_idx = frame_idx.flatten()
+        """
+    else:  # uniform sampling
+        # import pdb;pdb.set_trace()
+        total_frames = num_frames * sample_freq
+        ave_frames_per_group = max_frame_idx // num_frames
+        if ave_frames_per_group >= sample_freq:
+            # randomly sample f images per segment
+            frame_idx = np.arange(0, num_frames) * ave_frames_per_group
+            frame_idx = np.repeat(frame_idx, repeats=sample_freq)
+            offsets = np.random.choice(ave_frames_per_group, sample_freq, replace=False)
+            offsets = np.tile(offsets, num_frames)
+            frame_idx = frame_idx + offsets
+        elif max_frame_idx < total_frames:
+            # too few frames: sample with replacement, so frames may repeat
+            frame_idx = np.random.choice(max_frame_idx, total_frames)
+        else:
+            # sample across all images without replacement
+            frame_idx = np.random.choice(max_frame_idx, total_frames, replace=False)
+        frame_idx = np.sort(frame_idx)
+    # print(frame_idx)
+    frame_idx = frame_idx + 1
+    # random.shuffle(frame_idx)
+    return frame_idx
+def sample_val_test_clip(video_length, num_consecutive_frames, num_frames, sample_freq, dense_sampling,
+                         fixed_offset, num_clips, whole_video):
+    max_frame_idx = max(1, video_length - num_consecutive_frames + 1)
+    # import pdb;pdb.set_trace()
+    if whole_video:
+        return np.arange(1, max_frame_idx, step=sample_freq, dtype=int)
+    if dense_sampling:
+        if fixed_offset:
+            sample_pos = max(1, 1 + max_frame_idx - sample_freq * num_frames)
+            t_stride = sample_freq
+            start_list = np.linspace(0, sample_pos - 1, num=num_clips, dtype=int)
+            frame_idx = []
+            for start_idx in start_list.tolist():
+                frame_idx += [(idx * t_stride + start_idx) % max_frame_idx for idx in
+                              range(num_frames)]
+        else:
+            frame_idx = []
+            for i in range(num_clips):
+                frame_idx.extend(random_clip(max_frame_idx, sample_freq, num_frames))
+        frame_idx = np.asarray(frame_idx) + 1
+    else:  # uniform sampling
+        if fixed_offset:
+            frame_idices = []
+            sample_offsets = list(range(-num_clips // 2 + 1, num_clips // 2 + 1))
+            for sample_offset in sample_offsets:
+                if max_frame_idx > num_frames:
+                    tick = max_frame_idx / float(num_frames)
+                    curr_sample_offset = sample_offset
+                    if curr_sample_offset >= tick / 2.0:
+                        curr_sample_offset = tick / 2.0 - 1e-4
+                    elif curr_sample_offset < -tick / 2.0:
+                        curr_sample_offset = -tick / 2.0
+                    frame_idx = np.array([int(tick / 2.0 + curr_sample_offset + tick * x) for x in
+                                          range(num_frames)])
+                else:
+                    np.random.seed(sample_offset - (-num_clips // 2 + 1))
+                    frame_idx = np.random.choice(max_frame_idx, num_frames)
+                frame_idx = np.sort(frame_idx)
+                frame_idices.extend(frame_idx.tolist())
+        else:
+            frame_idices = []
+            for i in range(num_clips):
+                total_frames = num_frames * sample_freq
+                ave_frames_per_group = max_frame_idx // num_frames
+                if ave_frames_per_group >= sample_freq:
+                    # randomly sample f images per segment
+                    frame_idx = np.arange(0, num_frames) * ave_frames_per_group
+                    frame_idx = np.repeat(frame_idx, repeats=sample_freq)
+                    offsets = np.random.choice(ave_frames_per_group, sample_freq,
+                                               replace=False)
+                    offsets = np.tile(offsets, num_frames)
+                    frame_idx = frame_idx + offsets
+                elif max_frame_idx < total_frames:
+                    # too few frames: sample with replacement, so frames may repeat
+                    np.random.seed(i)
+                    frame_idx = np.random.choice(max_frame_idx, total_frames)
+                else:
+                    # sample across all images without replacement
+                    np.random.seed(i)
+                    frame_idx = np.random.choice(max_frame_idx, total_frames, replace=False)
+                frame_idx = np.sort(frame_idx)
+                frame_idices.extend(frame_idx.tolist())
+        frame_idx = np.asarray(frame_idices) + 1
+    return frame_idx
+class VideoRecord(object):
+    def __init__(self, path, start_frame, end_frame, label, reverse=False):
+        self.path = path
+        self.video_id = os.path.basename(path)
+        self.start_frame = start_frame
+        self.end_frame = end_frame
+        self.label = label
+        self.reverse = reverse
+    @property
+    def num_frames(self):
+        return self.end_frame - self.start_frame + 1
+    def __str__(self):
+        return self.path
+class VideoDataSet(data.Dataset):
+    def __init__(self, root_path, list_file, num_groups=64, frames_per_group=1, sample_offset=0, num_clips=1,
+                 modality='rgb', dense_sampling=True, fixed_offset=True,
+                 image_tmpl='{:05d}.jpg', transform=None, is_train=True, test_mode=False, seperator=' ',
+                 filter_video=0, num_classes=None, whole_video=False,
+                 fps=29.97, audio_length=1.28, resampling_rate=24000):
+        """
+        Arguments have different meaning when dense_sampling is True:
+            - num_groups ==> number of frames
+            - frames_per_group ==> sample every K-th frame
+            - sample_offset ==> number of clips used in validation or test mode
+        Args:
+            root_path (str): the file path to the root of video folder
+            list_file (str): the file list, each line with folder_path, start_frame, end_frame, label_id
+            num_groups (int): number of frames per data sample
+            frames_per_group (int): number of frames within one group
+            sample_offset (int): used in validation/test, the offset when sampling frames from a group
+            modality (str): one of 'rgb', 'flow', 'rgbdiff' or 'sound'
+            dense_sampling (bool): dense sampling in I3D
+            fixed_offset (bool): used for generating the same videos used in TSM
+            image_tmpl (str): template of image ids
+            transform: the transformer for preprocessing
+            is_train (bool): shuffle the video but keep the causality
+            test_mode (bool): testing mode, no label
+            whole_video (bool): take whole video
+            fps (float): frames per second of the video, used to locate the audio for a selected frame index.
+            audio_length (float): the time window used to extract audio features.
+            resampling_rate (int): sampling rate used to resample the audio extracted from the wav file.
+        """
+        if modality not in ['flow', 'rgb', 'rgbdiff', 'sound']:
+            raise ValueError("modality should be 'flow' or 'rgb' or 'rgbdiff' or 'sound'.")
+        self.root_path = root_path
+        self.list_file = os.path.join(root_path, list_file)
+        self.num_groups = num_groups
+        self.num_frames = num_groups
+        self.frames_per_group = frames_per_group
+        self.sample_freq = frames_per_group
+        self.num_clips = num_clips
+        self.sample_offset = sample_offset
+        self.fixed_offset = fixed_offset
+        self.dense_sampling = dense_sampling
+        self.modality = modality.lower()
+        self.image_tmpl = image_tmpl
+        self.transform = transform
+        self.is_train = is_train
+        self.test_mode = test_mode
+        self.separator = seperator
+        self.filter_video = filter_video
+        self.whole_video = whole_video
+        self.fps = fps
+        self.audio_length = audio_length
+        self.resampling_rate = resampling_rate
+        self.video_length = (self.num_frames * self.sample_freq) / self.fps
+        if self.modality in ['flow', 'rgbdiff']:
+            self.num_consecutive_frames = 5
+        else:
+            self.num_consecutive_frames = 1
+        self.video_list, self.multi_label = self._parse_list()
+        self.num_classes = num_classes
+    def _parse_list(self):
+        # each line is [video_id, start_frame, end_frame, class_idx]
+        # or [video_id, start_frame, end_frame, list of class_idx] for multi-label data
+        tmp = []
+        original_video_numbers = 0
+        # use a context manager so the list file is closed after parsing
+        with open(self.list_file) as f:
+            for x in f:
+                elements = x.strip().split(self.separator)
+                start_frame = int(elements[1])
+                end_frame = int(elements[2])
+                total_frame = end_frame - start_frame + 1
+                original_video_numbers += 1
+                # in test mode keep every video; otherwise drop videos that are too short
+                if self.test_mode or total_frame >= self.filter_video:
+                    tmp.append(elements)
+        num = len(tmp)
+        print("The number of videos is {} (with at least {} frames) "
+              "(original: {})".format(num, self.filter_video, original_video_numbers), flush=True)
+        assert (num > 0)
+        # TODO: a better way to check if multi-label or not
+        multi_label = np.mean(np.asarray([len(x) for x in tmp])) > 4.0
+        file_list = []
+        for item in tmp:
+            if self.test_mode:
+                file_list.append([item[0], int(item[1]), int(item[2]), -1])
+            else:
+                labels = []
+                for i in range(3, len(item)):
+                    labels.append(float(item[i]))
+                if not multi_label:
+                    labels = labels[0] if len(labels) == 1 else labels
+                file_list.append([item[0], int(item[1]), int(item[2]), labels])
+        video_list = [VideoRecord(item[0], item[1], item[2], item[3]) for item in file_list]
+        # rgbdiff has one frame less, since each difference needs the following frame
+        if self.modality in ['rgbdiff']:
+            for i in range(len(video_list)):
+                video_list[i].end_frame -= 1
+        #if self.is_train:
+        #    video_list = video_list[:50000]
+        return video_list, multi_label
+    def remove_data(self, idx):
+        original_video_num = len(self.video_list)
+        self.video_list = [v for i, v in enumerate(self.video_list) if i not in idx]
+        print("Original videos: {}\t remove {} videos, remaining {} videos".format(original_video_num, len(idx), len(self.video_list)))
+    def _sample_indices(self, record):
+        return sample_train_clip(record.num_frames, self.num_consecutive_frames, self.num_frames,
+                                 self.sample_freq, self.dense_sampling, self.num_clips)
+    def _get_val_indices(self, record):
+        return sample_val_test_clip(record.num_frames, self.num_consecutive_frames, self.num_frames,
+                                    self.sample_freq, self.dense_sampling, self.fixed_offset,
+                                    self.num_clips, self.whole_video)
+    def __getitem__(self, index):
+        """
+        Returns:
+            torch.FloatTensor: (3xgxf)xHxW dimension, g is number of groups and f is the frames per group.
+            torch.FloatTensor: the label
+        """
+        record = self.video_list[index]
+        # check this is a legit video folder
+        indices = self._sample_indices(record) if self.is_train else self._get_val_indices(record)
+        images = self.get_data(record, indices)
+        images = self.transform(images)
+        label = self.get_label(record)
+        # re-order data to targeted format.
+        return images, label
+    def get_data(self, record, indices):
+        images = []
+        if self.whole_video:
+            tmp = len(indices) % self.num_frames
+            if tmp != 0:
+                indices = indices[:-tmp]
+            num_clips = len(indices) // self.num_frames
+            # print(tmp, indices, self.num_frames, num_clips)
+        else:
+            num_clips = self.num_clips
+        if self.modality == 'sound':
+            new_indices = [indices[i * self.num_frames: (i + 1) * self.num_frames]
+                           for i in range(num_clips)]
+            for curr_indices in new_indices:
+                center_idx = (curr_indices[self.num_frames // 2 - 1] + curr_indices[self.num_frames // 2]) // 2 \
+                    if self.num_frames % 2 == 0 else curr_indices[self.num_frames // 2]
+                center_idx = min(record.num_frames, center_idx)
+                seg_imgs = load_sound(self.root_path, record, center_idx,
+                                      self.fps, self.audio_length, self.resampling_rate)
+                images.extend(seg_imgs)
+        else:
+            images = []
+            for seg_ind in indices:
+                new_seg_ind = [min(seg_ind + record.start_frame - 1 + i, record.num_frames)
+                               for i in range(self.num_consecutive_frames)]
+                seg_imgs = load_image(self.root_path, record.path, self.image_tmpl,
+                                      new_seg_ind, self.modality)
+                images.extend(seg_imgs)
+        return images
+    def get_label(self, record):
+        if self.test_mode:
+            # in test mode, return the video id as label
+            label = record.video_id
+        else:
+            if not self.multi_label:
+                label = int(record.label)
+            else:
+                # create a binary vector.
+                label = torch.zeros(self.num_classes, dtype=torch.float)
+                for x in record.label:
+                    label[int(x)] = 1.0
+        return label
+    def __len__(self):
+        return len(self.video_list)
+class VideoDataSetLMDB(data.Dataset):
+    # does not support sound
+    def __init__(self, datadir, db_name, num_groups=16, frames_per_group=1, sample_offset=0, num_clips=1,
+                 modality='rgb', dense_sampling=False, fixed_offset=True,
+                 image_tmpl='{:05d}.jpg', transform=None, is_train=True, test_mode=False,
+                 seperator=' ', filter_video=0, num_classes=None, whole_video=False,
+                 fps=29.97, audio_length=1.28, resampling_rate=24000):
+        """
+        Arguments have different meaning when dense_sampling is True:
+            - num_groups ==> number of frames
+            - frames_per_group ==> sample every K-th frame
+            - sample_offset ==> number of clips used in validation or test mode
+        Args:
+            datadir (str): the directory that contains the LMDB database
+            db_name (str): the file name of the LMDB database inside datadir
+            num_groups (int): number of frames per data sample
+            frames_per_group (int): number of frames within one group
+            sample_offset (int): used in validation/test, the offset when sampling frames from a group
+            modality (str): one of 'rgb', 'flow' or 'rgbdiff'
+            dense_sampling (bool): dense sampling in I3D
+            fixed_offset (bool): used for generating the same videos used in TSM
+            image_tmpl (str): template of image ids
+            transform: the transformer for preprocessing
+            is_train (bool): shuffle the video but keep the causality
+            test_mode (bool): testing mode, no label
+        """
+        # TODO: handle multi-label?
+        # TODO: flow data?
+        if not _HAS_LMDB:
+            raise ValueError(_LMDB_ERROR_MSG)
+        if modality not in ['flow', 'rgb', 'rgbdiff']:
+            raise ValueError("modality should be 'flow' or 'rgb' or 'rgbdiff'.")
+        self.db_path = os.path.join(datadir, db_name)
+        self.num_groups = num_groups
+        self.num_frames = num_groups
+        self.frames_per_group = frames_per_group
+        self.sample_freq = frames_per_group
+        self.num_clips = num_clips
+        self.sample_offset = sample_offset
+        self.fixed_offset = fixed_offset
+        self.dense_sampling = dense_sampling
+        self.modality = modality.lower()
+        self.image_tmpl = image_tmpl
+        self.transform = transform
+        self.is_train = is_train
+        self.test_mode = test_mode
+        self.seperator = seperator
+        self.filter_video = filter_video
+        self.whole_video = whole_video
+        self.fps = fps
+        self.audio_length = audio_length
+        self.resampling_rate = resampling_rate
+        self.video_length = (self.num_frames * self.sample_freq) / self.fps
+        if self.modality in ['flow', 'rgbdiff']:
+            self.num_consecutive_frames = 5
+        else:
+            self.num_consecutive_frames = 1
+        self.multi_label = None
+        self.db = None
+        db = lmdb.open(self.db_path, max_readers=1, subdir=os.path.isdir(self.db_path),
+                       readonly=True, lock=False, readahead=False, meminit=False)
+        with db.begin(write=False) as txn:
+            self.length = pa.deserialize(txn.get(b'__len__'))
+            self.keys = pa.deserialize(txn.get(b'__keys__'))
+        db.close()
+        # TODO: a hack way to filter video
+        self.list_file = self.db_path.replace(".lmdb", ".txt")
+        valid_video_numbers = self.length
+        invalid_video_ids = []
+        if self.filter_video > 0:
+            valid_video_numbers = 0
+            invalid_video_ids = []
+            # use a context manager so the list file is closed after parsing
+            with open(self.list_file) as f:
+                for x in f:
+                    elements = x.strip().split(self.seperator)
+                    start_frame = int(elements[1])
+                    end_frame = int(elements[2])
+                    total_frame = end_frame - start_frame + 1
+                    if self.test_mode or total_frame >= self.filter_video:
+                        valid_video_numbers += 1
+                    else:
+                        # keys are matched by the ASCII-encoded base name of the video path
+                        name = elements[0].split("/")[-1].encode('ascii')
+                        invalid_video_ids.append(name)
+        print("The number of videos is {} (with at least {} frames) "
+              "(original: {})".format(valid_video_numbers, self.filter_video, self.length),
+              flush=True)
+        # remove keys and update length
+        self.length = valid_video_numbers
+        self.keys = [k for k in self.keys if k not in invalid_video_ids]
+        if self.length != len(self.keys):
+            raise ValueError("Videos were not filtered correctly.")
+        self.num_classes = num_classes
+        self.unpacked_video = None
+    def remove_data(self, idx):
+        original_video_num = self.length
+        self.keys = [v for i, v in enumerate(self.keys) if i not in idx]
+        self.length -= len(idx)
+        print("Original videos: {}\t remove {} videos, remaining {} videos".format(original_video_num, len(idx), self.length))
+    def _sample_indices(self, record):
+        return sample_train_clip(record.num_frames, self.num_consecutive_frames, self.num_frames,
+                                 self.sample_freq, self.dense_sampling, self.num_clips)
+    def _get_val_indices(self, record):
+        return sample_val_test_clip(record.num_frames, self.num_consecutive_frames, self.num_frames,
+                                    self.sample_freq, self.dense_sampling, self.fixed_offset,
+                                    self.num_clips, self.whole_video)
+    def __getitem__(self, index):
+        unpacked_video = self.maybe_open_and_get_buffer(index)
+        num_frames = unpacked_video[0] - 1 if self.modality == 'rgbdiff' else unpacked_video[0]
+        record = VideoRecord(self.keys[index].decode("utf-8"), 1, num_frames, unpacked_video[-1])
+        indices = self._sample_indices(record) if self.is_train else self._get_val_indices(record)
+        images = self.get_data(record, indices, unpacked_video)
+        images = self.transform(images)
+        label = self.get_label(record)
+        self.unpacked_video = None
+        # re-order data to targeted format.
+        return images, label
+    def maybe_open_and_get_buffer(self, index):
+        if self.db is None:
+            self.db = lmdb.open(self.db_path, max_readers=1, subdir=os.path.isdir(self.db_path),
+                                readonly=True, lock=False, readahead=False, meminit=False)
+        with self.db.begin(write=False) as txn:
+            byteflow = txn.get(self.keys[index])
+        try:
+            unpacked_video = pa.deserialize(byteflow)
+        except Exception as e:
+            # fall back to the first video when this entry cannot be deserialized
+            with self.db.begin(write=False) as txn:
+                byteflow = txn.get(self.keys[0])
+            unpacked_video = pa.deserialize(byteflow)
+            print(self.keys[index], e, flush=True)
+        self.unpacked_video = unpacked_video
+        return unpacked_video
+    def get_data(self, record, indices, unpacked_video):
+        images = []
+        for seg_ind in indices:
+            new_seg_ind = [min(seg_ind + record.start_frame - 1 + i, record.num_frames)
+                           for i in range(self.num_consecutive_frames)]
+            img = load_data_lmdb(unpacked_video, new_seg_ind, self.modality)
+            images.extend(img)
+        return images
+    def get_label(self, record):
+        if self.test_mode:
+            # in test mode, return the video id as label
+            label = record.video_id
+        else:
+            if not self.multi_label:
+                label = int(record.label)
+            else:
+                # create a binary vector.
+                label = torch.zeros(self.num_classes, dtype=torch.float)
+                for x in record.label:
+                    label[int(x)] = 1.0
+        return label
+    def __len__(self):
+        return self.length
+class MultiVideoDataSet(data.Dataset):
+    def __init__(self, root_path, list_file, num_groups=64, frames_per_group=1, sample_offset=0, num_clips=1,
+                 modality='rgb', dense_sampling=False, fixed_offset=True,
+                 image_tmpl='{:05d}.jpg', transform=None, is_train=True, test_mode=False, seperator=' ',
+                 filter_video=0, num_classes=None, whole_video=False,
+                 fps=29.97, audio_length=1.28, resampling_rate=24000):
+        """
+        root_path, modality and transform are lists, one entry per modality.
+        Arguments have different meaning when dense_sampling is True:
+            - num_groups ==> number of frames
+            - frames_per_group ==> sample every K-th frame
+            - sample_offset ==> number of clips used in validation or test mode
+        Args:
+            root_path (str): the file path to the root of video folder
+            list_file (str): the file list, each line with folder_path, start_frame, end_frame, label_id
+            num_groups (int): number of frames per data sample
+            frames_per_group (int): number of frames within one group
+            sample_offset (int): used in validation/test, the offset when sampling frames from a group
+            modality (list of str): one entry per dataset, each 'rgb', 'flow', 'rgbdiff' or 'sound'
+            dense_sampling (bool): dense sampling in I3D
+            fixed_offset (bool): used for generating the same videos used in TSM
+            image_tmpl (str): template of image ids
+            transform: the transformer for preprocessing
+            is_train (bool): shuffle the video but keep the causality
+            test_mode (bool): testing mode, no label
+        """
+        video_datasets = []
+        for i in range(len(modality)):
+            tmp = VideoDataSet(root_path[i], os.path.join(root_path[i], list_file),
+                               num_groups, frames_per_group, sample_offset,
+                               num_clips, modality[i], dense_sampling, fixed_offset,
+                               image_tmpl, transform[i], is_train, test_mode, seperator,
+                               filter_video, num_classes, whole_video, fps, audio_length, resampling_rate)
+            video_datasets.append(tmp)
+        self.video_datasets = video_datasets
+        self.is_train = is_train
+        self.test_mode = test_mode
+        self.num_frames = num_groups
+        self.sample_freq = frames_per_group
+        self.dense_sampling = dense_sampling
+        self.num_clips = num_clips
+        self.fixed_offset = fixed_offset
+        self.modality = modality
+        self.num_classes = num_classes
+        self.whole_video = whole_video
+        self.video_list = video_datasets[0].video_list
+        self.num_consecutive_frames = max([x.num_consecutive_frames for x in self.video_datasets])
+    def _sample_indices(self, record):
+        return sample_train_clip(record.num_frames, self.num_consecutive_frames, self.num_frames,
+                                 self.sample_freq, self.dense_sampling, self.num_clips)
+    def _get_val_indices(self, record):
+        return sample_val_test_clip(record.num_frames, self.num_consecutive_frames, self.num_frames,
+                                    self.sample_freq, self.dense_sampling, self.fixed_offset,
+                                    self.num_clips, self.whole_video)
+    def remove_data(self, idx):
+        for i in range(len(self.video_datasets)):
+            self.video_datasets[i].remove_data(idx)
+        self.video_list = self.video_datasets[0].video_list
+    def __getitem__(self, index):
+        """
+        Returns:
+            torch.FloatTensor: (3xgxf)xHxW dimension, g is number of groups and f is the frames per group.
+            torch.FloatTensor: the label
+        """
+        record = self.video_list[index]
+        if self.is_train:
+            indices = self._sample_indices(record)
+        else:
+            indices = self._get_val_indices(record)
+        multi_modalities = []
+        for modality, video_dataset in zip(self.modality, self.video_datasets):
+            record = video_dataset.video_list[index]
+            images = video_dataset.get_data(record, indices)
+            images = video_dataset.transform(images)
+            label = video_dataset.get_label(record)
+            multi_modalities.append((images, label))
+        return [x for x, y in multi_modalities], multi_modalities[0][1]
+    def __len__(self):
+        return len(self.video_list)
+class MultiVideoDataSetLMDB(data.Dataset):
+    def __init__(self, root_path, list_file, num_groups=64, frames_per_group=1, sample_offset=0, num_clips=1,
+                 modality='rgb', dense_sampling=False, fixed_offset=True,
+                 image_tmpl='{:05d}.jpg', transform=None, is_train=True, test_mode=False, seperator=' ',
+                 filter_video=0, num_classes=None, whole_video=False,
+                 fps=29.97, audio_length=1.28, resampling_rate=24000):
+        """
+        root_path, modality and transform are lists, one entry per modality.
+        Arguments have different meaning when dense_sampling is True:
+            - num_groups ==> number of frames
+            - frames_per_group ==> sample every K-th frame
+            - sample_offset ==> number of clips used in validation or test mode
+        Args:
+            root_path (str): the file path to the root of video folder
+            list_file (str): the file list, each line with folder_path, start_frame, end_frame, label_id
+            num_groups (int): number of frames per data sample
+            frames_per_group (int): number of frames within one group
+            sample_offset (int): used in validation/test, the offset when sampling frames from a group
+            modality (list of str): one entry per dataset, each 'rgb', 'flow', 'rgbdiff' or 'sound'
+            dense_sampling (bool): dense sampling in I3D
+            fixed_offset (bool): used for generating the same videos used in TSM
+            image_tmpl (str): template of image ids
+            transform: the transformer for preprocessing
+            is_train (bool): shuffle the video but keep the causality
+            test_mode (bool): testing mode, no label
+        """
+        video_datasets = []
+        for i in range(len(modality)):
+            if modality[i] == 'sound':
+                list_file_ = list_file.replace(".lmdb", ".txt")
+                tmp = VideoDataSet(root_path[i], os.path.join(root_path[i], list_file_),
+                                   num_groups, frames_per_group, sample_offset,
+                                   num_clips, modality[i], dense_sampling, fixed_offset,
+                                   image_tmpl, transform[i], is_train, test_mode, seperator,
+                                   filter_video, num_classes, whole_video, fps, audio_length, resampling_rate)
+            else:
+                tmp = VideoDataSetLMDB(root_path[i], list_file, num_groups, frames_per_group,
+                                       sample_offset, num_clips, modality[i], dense_sampling,
+                                       fixed_offset, image_tmpl, transform[i], is_train, test_mode,
+                                       seperator, filter_video, num_classes, whole_video, fps, audio_length,
+                                       resampling_rate)
+            video_datasets.append(tmp)
+        self.video_datasets = video_datasets
+        self.is_train = is_train
+        self.test_mode = test_mode
+        self.num_frames = num_groups
+        self.sample_freq = frames_per_group
+        self.dense_sampling = dense_sampling
+        self.num_clips = num_clips
+        self.fixed_offset = fixed_offset
+        self.modality = modality
+        self.num_classes = num_classes
+        self.whole_video = whole_video
+        self.num_consecutive_frames = max([x.num_consecutive_frames for x in self.video_datasets])
+    def _sample_indices(self, record):
+        return sample_train_clip(record.num_frames, self.num_consecutive_frames, self.num_frames,
+                                 self.sample_freq, self.dense_sampling, self.num_clips)
+    def _get_val_indices(self, record):
+        return sample_val_test_clip(record.num_frames, self.num_consecutive_frames, self.num_frames,
+                                    self.sample_freq, self.dense_sampling, self.fixed_offset,
+                                    self.num_clips, self.whole_video)
+    def remove_data(self, idx):
+        for i in range(len(self.video_datasets)):
+            self.video_datasets[i].remove_data(idx)
+    def __getitem__(self, index):
+        """
+        Returns:
+            torch.FloatTensor: (3xgxf)xHxW dimension, g is number of groups and f is the frames per group.
+            torch.FloatTensor: the label
+        """
+        multi_modalities = []
+        indices = None
+        for modality, video_dataset in zip(self.modality, self.video_datasets):
+            if indices is None:
+                if modality == 'sound':
+                    record = video_dataset.video_list[index]
+                else:
+                    unpacked_video = video_dataset.maybe_open_and_get_buffer(index)
+                    num_frames = unpacked_video[0] - 1 if modality == 'rgbdiff' else unpacked_video[0]
+                    record = VideoRecord(video_dataset.keys[index].decode("utf-8"), 1, num_frames, unpacked_video[-1])
+                indices = video_dataset._sample_indices(record) if video_dataset.is_train else video_dataset._get_val_indices(record)
+            if modality == 'sound':
+                record = video_dataset.video_list[index]
+                images = video_dataset.get_data(record, indices)
+            else:
+                if video_dataset.unpacked_video is None:
+                    video_dataset.maybe_open_and_get_buffer(index)
+                unpacked_video = video_dataset.unpacked_video
+                num_frames = unpacked_video[0] - 1 if modality == 'rgbdiff' else unpacked_video[0]
+                record = VideoRecord(video_dataset.keys[index].decode("utf-8"), 1, num_frames, unpacked_video[-1])
+                images = video_dataset.get_data(record, indices, video_dataset.unpacked_video)
+                video_dataset.unpacked_video = None
+            images = video_dataset.transform(images)
+            label = video_dataset.get_label(record)
+            multi_modalities.append((images, label))
+        return [x for x, y in multi_modalities], multi_modalities[0][1]
+    def __len__(self):
+        return len(self.video_datasets[0])
+class VideoDataSetOnline(VideoDataSet):
+    def __init__(self, root_path, list_file, num_groups=8, frames_per_group=1, sample_offset=0,
+                 num_clips=1, modality='rgb', dense_sampling=False, fixed_offset=True,
+                 image_tmpl='{:05d}.jpg', transform=None, is_train=True, test_mode=False, seperator=' ',
+                 filter_video=0, num_classes=None, whole_video=False,
+                 fps=29.97, audio_length=1.28, resampling_rate=24000):
+        """
+        Arguments have different meaning when dense_sampling is True:
+            - num_groups ==> number of frames
+            - frames_per_group ==> sample every K frame
+            - sample_offset ==> number of clips used in validation or test mode
+        Args:
+            root_path (str): the file path to the root of video folder
+            list_file (str): the file list, each line with folder_path, start_frame, end_frame, label_id
+            num_groups (int): number of frames per data sample
+            frames_per_group (int): number of frames within one group
+            sample_offset (int): used in validation/test, the offset when sampling frames from a group
+            modality (str): 'rgb' or 'rgbdiff' (any other value raises ValueError)
+            dense_sampling (bool): dense sampling in I3D
+            fixed_offset (bool): used for generating the same videos used in TSM
+            image_tmpl (str): template of image ids
+            transform: the transformer for preprocessing
+            is_train (bool): shuffle the video but keep the causality
+            test_mode (bool): testing mode, no label
+            fps (float): frames per second, used to localize sound when a frame index is selected.
+            audio_length (float): the time window used to extract audio features.
+            resampling_rate (int): sampling rate used to resample audio extracted from wav
+        """
+        if not _HAS_PYAV:
+            raise ValueError(_PYAV_ERROR_MSG)
+        if modality not in ['rgb', 'rgbdiff']:
+            raise ValueError("modality should be 'rgb' or 'rgbdiff'.")
+        super().__init__(root_path, list_file, num_groups, frames_per_group, sample_offset,
+                         num_clips, modality, dense_sampling, fixed_offset,
+                         image_tmpl, transform, is_train, test_mode, seperator,
+                         filter_video, num_classes, whole_video, fps, audio_length, resampling_rate)
+    def remove_data(self, idx):
+        original_video_num = len(self.video_list)
+        self.video_list = [v for i, v in enumerate(self.video_list) if i not in idx]
+        print("Original videos: {}\t remove {} videos, remaining {} videos".format(original_video_num, len(idx), len(self.video_list)))
+    def get_data(self, record, indices):
+        indices = indices - 1
+        container = av.open(os.path.join(self.root_path, record.path))
+        container.streams.video[0].thread_type = "AUTO"
+        frames_length = container.streams.video[0].frames
+        duration = container.streams.video[0].duration
+        if duration is None or frames_length == 0:
+            # If failed to fetch the decoding information, decode the entire video.
+            # video_start_pts, video_end_pts = 0, math.inf
+            decode_all = True
+        else:
+            # Perform selective decoding.
+            if frames_length != record.num_frames:
+                # remap the index
+                length_ratio = frames_length / record.num_frames
+                indices = np.around(indices * length_ratio).astype(int)
+            start_idx, end_idx = min(indices), max(indices)
+            # if self.modality == 'rgbdiff':
+            #    end_idx += (self.num_consecutive_frames + 1)
+            timebase = duration / frames_length
+            video_start_pts = int(start_idx * timebase)
+            video_end_pts = int(end_idx * timebase)
+            decode_all = False
+        def _selective_decoding(container, index, timebase):
+            margin = 1024
+            start_idx, end_idx = min(index), max(index)
+            video_start_pts = int(start_idx * timebase)
+            video_end_pts = int(end_idx * timebase)
+            seek_offset = max(video_start_pts - margin, 0)
+            container.seek(seek_offset, any_frame=False, backward=True,
+                           stream=container.streams.video[0])
+            success = True
+            video_frames = None
+            try:
+                frames = {}
+                for frame in container.decode({'video': 0}):
+                    if frame.pts < video_start_pts:
+                        continue
+                    if frame.pts <= video_end_pts:
+                        frames[frame.pts] = frame
+                    else:
+                        break
+                # the decoded frames cover a whole contiguous region, but we may subsample it
+                video_frames = np.asarray([frames[pts].to_rgb().to_ndarray() for pts in sorted(frames)])
+                index = np.linspace(0, max(0, len(video_frames) - 1), num=self.num_frames, dtype=int)
+                if len(video_frames) == 0:  # somehow decoding is wrong
+                    success = False
+                else:
+                    video_frames = video_frames[index, ...]
+            except Exception:
+                success = False
+            return video_frames, success
+        # If video stream was found, fetch video frames from the video.
+        # Seeking in the stream is imprecise, so seek to an earlier PTS by a
+        # margin (in PTS units).
+        if not decode_all:
+            timebase = duration / frames_length
+            video_frames = None
+            for i in range(self.num_clips):
+                curr_index = indices[(i) * self.num_frames: (i + 1) * self.num_frames]
+                curr_video_frames, success = _selective_decoding(container, curr_index, timebase)
+                if not success:
+                    decode_all = True
+                    break
+                if video_frames is not None:
+                    video_frames = np.concatenate((video_frames, curr_video_frames), axis=0)
+                else:
+                    video_frames = curr_video_frames
+        if decode_all:
+            container.seek(0, any_frame=False, backward=True, stream=container.streams.video[0])
+            frames = {}
+            for frame in container.decode({'video': 0}):
+                frames[frame.pts] = frame
+            video_frames = np.asarray([frames[pts].to_rgb().to_ndarray() for pts in sorted(frames)])
+            total_frames = len(video_frames)
+            if total_frames != record.num_frames:
+                # remap the index
+                length_ratio = total_frames / record.num_frames
+                indices = np.around(indices * length_ratio).astype(int)
+            video_frames = video_frames[indices, ...]
+            """
+            if self.modality == 'rgbdiff':
+                video_diff = np.asarray(video_frames[1:, ...].copy(), dtype=np.float) - np.asarray(video_frames[:-1, ...].copy(), dtype=np.float)
+                video_diff += 255.0
+                video_diff *= (255.0 / float(2 * 255.0))
+                video_diff = video_diff.astype(np.uint8)
+                for seg_ind in indices:
+                    new_seg_ind = [min(seg_ind + i, total_frames - 1)
+                                   for i in range(self.num_consecutive_frames) ]
+                video_frames = video_diff
+            else:
+            """
+        images = [Image.fromarray(frame) for frame in video_frames]
+        # TODO: support rgb diff, calculate end_pts differently.
+        container.close()
+        return images
+class MultiVideoDataSetOnline(data.Dataset):
+    def __init__(self, root_path, list_file, num_groups=64, frames_per_group=1, sample_offset=0, num_clips=1,
+                 modality='rgb', dense_sampling=False, fixed_offset=True,
+                 image_tmpl='{:05d}.jpg', transform=None, is_train=True, test_mode=False, seperator=' ',
+                 filter_video=0, num_classes=None, whole_video=False,
+                 fps=29.97, audio_length=1.28, resampling_rate=24000):
+        """
+        root_path, modality and transform are lists, one entry per modality.
+        Arguments have different meaning when dense_sampling is True:
+            - num_groups ==> number of frames
+            - frames_per_group ==> sample every K frame
+            - sample_offset ==> number of clips used in validation or test mode
+        Args:
+            root_path (list of str): the file paths to the root of the video folder, one per modality
+            list_file (str): the file list, each line with folder_path, start_frame, end_frame, label_id
+            num_groups (int): number of frames per data sample
+            frames_per_group (int): number of frames within one group
+            sample_offset (int): used in validation/test, the offset when sampling frames from a group
+            modality (list of str): each entry is one of rgb, rgbdiff, sound or flow
+            dense_sampling (bool): dense sampling in I3D
+            fixed_offset (bool): used for generating the same videos used in TSM
+            image_tmpl (str): template of image ids
+            transform (list): the transformers for preprocessing, one per modality
+            is_train (bool): shuffle the video but keep the causality
+            test_mode (bool): testing mode, no label
+        """
+        # TODO: support mixed LMDB, pyAV, etc.
+        video_datasets = []
+        for i in range(len(modality)):
+            if modality[i] == 'rgb' or modality[i] == 'rgbdiff':
+                video_dataset_cls = VideoDataSetOnline
+                list_file_ = list_file
+            elif modality[i] == 'sound':
+                video_dataset_cls = VideoDataSet
+                list_file_ = list_file
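+            elif modality[i] == 'flow':
+                video_dataset_cls = VideoDataSetLMDB
+                list_file_ = list_file.replace(".txt", ".lmdb")
+            else:
+                # Fail early instead of hitting a NameError (or silently reusing the previous class).
+                raise ValueError(f"Unsupported modality: {modality[i]}")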
+            tmp = video_dataset_cls(root_path[i], list_file_,
+                                    num_groups, frames_per_group, sample_offset,
+                                    num_clips, modality[i], dense_sampling, fixed_offset,
+                                    image_tmpl, transform[i], is_train, test_mode, seperator,
+                                    filter_video, num_classes, whole_video, fps, audio_length, resampling_rate)
+            video_datasets.append(tmp)
+        self.video_datasets = video_datasets
+        self.is_train = is_train
+        self.test_mode = test_mode
+        self.num_frames = num_groups
+        self.sample_freq = frames_per_group
+        self.dense_sampling = dense_sampling
+        self.num_clips = num_clips
+        self.fixed_offset = fixed_offset
+        self.modality = modality
+        self.num_classes = num_classes
+        self.whole_video = whole_video
+        self.video_list = video_datasets[0].video_list
+        self.num_consecutive_frames = max([x.num_consecutive_frames for x in self.video_datasets])
+    def _sample_indices(self, record):
+        return sample_train_clip(record.num_frames, self.num_consecutive_frames, self.num_frames,
+                                 self.sample_freq, self.dense_sampling, self.num_clips)
+    def _get_val_indices(self, record):
+        return sample_val_test_clip(record.num_frames, self.num_consecutive_frames, self.num_frames,
+                                    self.sample_freq, self.dense_sampling, self.fixed_offset,
+                                    self.num_clips, self.whole_video)
+    def remove_data(self, idx):
+        for i in range(len(self.video_datasets)):
+            self.video_datasets[i].remove_data(idx)
+        self.video_list = self.video_datasets[0].video_list
+    def __getitem__(self, index):
+        """
+        Returns:
+            torch.FloatTensor: (3xgxf)xHxW dimension, g is number of groups and f is the frames per group.
+            torch.FloatTensor: the label
+        """
+        multi_modalities = []
+        indices = None
+        for modality, video_dataset in zip(self.modality, self.video_datasets):
+            if indices is None:
+                if modality != 'flow':
+                    record = video_dataset.video_list[index]
+                else:
+                    unpacked_video = video_dataset.maybe_open_and_get_buffer(index)
+                    num_frames = unpacked_video[0]
+                    record = VideoRecord(video_dataset.keys[index].decode("utf-8"), 1, num_frames, unpacked_video[-1])
+                indices = video_dataset._sample_indices(record) if video_dataset.is_train else video_dataset._get_val_indices(record)
+            if modality != 'flow':
+                record = video_dataset.video_list[index]
+                images = video_dataset.get_data(record, indices)
+            else:
+                if video_dataset.unpacked_video is None:
+                    video_dataset.maybe_open_and_get_buffer(index)
+                unpacked_video = video_dataset.unpacked_video
+                num_frames = unpacked_video[0]
+                record = VideoRecord(video_dataset.keys[index].decode("utf-8"), 1, num_frames, unpacked_video[-1])
+                images = video_dataset.get_data(record, indices, video_dataset.unpacked_video)
+                video_dataset.unpacked_video = None
+            images = video_dataset.transform(images)
+            label = video_dataset.get_label(record)
+            multi_modalities.append((images, label))
+        return [x for x, y in multi_modalities], multi_modalities[0][1]
+    def __len__(self):
+        return len(self.video_datasets[0])
+def get_dataloader(loader_type, *args, **kwargs) -> \
+        Union[VideoDataSetLMDB, VideoDataSetOnline, VideoDataSet]:
+    if loader_type == 'lmdb':
+        return VideoDataSetLMDB(*args, **kwargs)
+    elif loader_type == 'pyav':
+        return VideoDataSetOnline(*args, **kwargs)
+    elif loader_type == 'jpeg':
+        return VideoDataSet(*args, **kwargs)
+    else:
+        raise ValueError(f'Unknown dataloader type: {loader_type}')
+def get_multimodality_dataloader(loader_type, *args, **kwargs) -> \
+        Union[MultiVideoDataSetLMDB, MultiVideoDataSetOnline, MultiVideoDataSet]:
+    if loader_type == 'lmdb':
+        return MultiVideoDataSetLMDB(*args, **kwargs)
+    elif loader_type == 'pyav':
+        return MultiVideoDataSetOnline(*args, **kwargs)
+    elif loader_type == 'jpeg':
+        return MultiVideoDataSet(*args, **kwargs)
+    else:
+        raise ValueError(f'Unknown dataloader type: {loader_type}')
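
The index handling in `VideoDataSetOnline.get_data` above can be sketched in isolation. List-file indices are 1-based, so they are shifted to 0-based, and they are rescaled when the decoded stream holds a different number of frames than the list file claims. The helper name `remap_indices` and the final clamp are assumptions made for illustration: the source rounds with `np.around` and does not clamp, so a short stream can produce an out-of-range index there.

```python
def remap_indices(indices, listed_frames, decoded_frames):
    """Map 1-based list-file frame indices to 0-based positions in the decoded stream."""
    # Shift from the 1-based numbering of the list file to 0-based array positions.
    zero_based = [i - 1 for i in indices]
    if decoded_frames != listed_frames:
        # Rescale when the container reports a different frame count.
        # round() uses round-half-to-even, like np.around in the source.
        ratio = decoded_frames / listed_frames
        zero_based = [int(round(i * ratio)) for i in zero_based]
    # Assumption (not in the source): clamp so rounding can never run past the last frame.
    return [min(max(i, 0), decoded_frames - 1) for i in zero_based]


print(remap_indices([1, 5, 10], listed_frames=10, decoded_frames=20))  # [0, 8, 18]
```

Without the clamp, a list entry that claims 10 frames for a stream that decodes to 3 would map index 10 to position 3, one past the end of the decoded array.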

video_dataset_aug.py ADDED Viewed

	@@ -0,0 +1,73 @@

+import multiprocessing
+from typing import List
+import torch
+import torch.nn.parallel
+import torch.optim
+import torch.utils.data
+import torch.utils.data.distributed
+import torchvision.transforms as transforms
+from video_transforms import (GroupRandomHorizontalFlip, GroupOverSample,
+                               GroupMultiScaleCrop, GroupScale, GroupCenterCrop, GroupRandomCrop,
+                               GroupNormalize, Stack, ToTorchFormatTensor, GroupRandomScale, GroupCutout)
+def get_augmentor(is_train: bool, image_size: int, mean: List[float] = None,
+                  std: List[float] = None, disable_scaleup: bool = False,
+                  threed_data: bool = False, version: str = 'v1', scale_range: List[int] = None,
+                  modality: str = 'rgb', num_clips: int = 1, num_crops: int = 1, cut_out=True, dataset: str = ''):
+    mean = [0.485, 0.456, 0.406] if mean is None else mean
+    std = [0.229, 0.224, 0.225] if std is None else std
+    scale_range = [256, 320] if scale_range is None else scale_range
+    augments = []
+    if is_train:
+        if version == 'v1':
+            augments += [
+                GroupMultiScaleCrop(image_size, [1, .875, .75, .66])
+            ]
+        elif version == 'v2':
+            augments += [
+                GroupRandomScale(scale_range),
+                GroupRandomCrop(image_size),
+            ]
+        if not (dataset.startswith('ststv') or 'jester' in dataset or 'mini_ststv' in dataset):
+            augments += [GroupRandomHorizontalFlip(is_flow=(modality == 'flow'))]
+    else:
+        scaled_size = image_size if disable_scaleup else int(image_size / 0.875 + 0.5)
+        if num_crops == 1:
+            augments += [
+                GroupScale(scaled_size),
+                GroupCenterCrop(image_size)
+            ]
+        else:
+            flip = num_crops == 10
+            augments += [
+                GroupOverSample(image_size, scaled_size, num_crops=num_crops, flip=flip),
+            ]
+    augments += [
+        Stack(threed_data=threed_data),
+        ToTorchFormatTensor(num_clips_crops=num_clips * num_crops),
+        GroupNormalize(mean=mean, std=std, threed_data=threed_data)
+    ]
+    if cut_out:
+        augments += [GroupCutout(n_holes=1, length=16)]
+    augmentor = transforms.Compose(augments)
+    return augmentor
+def build_dataflow(dataset, is_train, batch_size, workers=36, is_distributed=False):
+    workers = min(workers, multiprocessing.cpu_count())
+    shuffle = False
+    sampler = torch.utils.data.distributed.DistributedSampler(dataset) if is_distributed else None
+    if is_train:
+        shuffle = sampler is None
+    data_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=shuffle,
+                                              num_workers=workers, drop_last=True, pin_memory=True, sampler=sampler)
+    return data_loader
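
Two small decisions in this file are easy to miss: the size frames are rescaled to before the evaluation-time center crop in `get_augmentor`, and when `build_dataflow` lets the `DataLoader` shuffle. The sketch below restates both as standalone functions. The names `eval_scaled_size` and `should_shuffle` are hypothetical; only the arithmetic and the boolean logic are taken from the code above.

```python
def eval_scaled_size(image_size, disable_scaleup=False):
    """Shorter-edge size used before the center crop at evaluation time."""
    # The crop keeps 87.5% of the scaled edge, so the scaled edge is image_size / 0.875,
    # rounded to the nearest integer by adding 0.5 before truncation.
    return image_size if disable_scaleup else int(image_size / 0.875 + 0.5)


def should_shuffle(is_train, is_distributed):
    """Shuffle only for training without a DistributedSampler (the sampler shuffles itself)."""
    return is_train and not is_distributed


print(eval_scaled_size(224))        # 256
print(should_shuffle(True, True))   # False
```

Passing both `shuffle=True` and a sampler to `torch.utils.data.DataLoader` raises an error, which is why the source only enables shuffling when `sampler is None`.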

video_dataset_config.py ADDED Viewed

	@@ -0,0 +1,31 @@

+DATASET_CONFIG = {
+    'ffpp': {
+        'num_classes': 2,
+        'train_list_name': 'cdf_test_fold.txt',
+        'val_list_name': 'cdf_test_fold.txt',
+        'test_list_name': 'cdf_test_fold.txt',
+        'filename_seperator': " ",
+        'image_tmpl': '{:04d}.jpg',
+        'filter_video': 3,
+    }
+}
+def get_dataset_config(dataset, use_lmdb=False):
+    ret = DATASET_CONFIG[dataset]
+    num_classes = ret['num_classes']
+    train_list_name = ret['train_list_name'].replace("txt", "lmdb") if use_lmdb \
+        else ret['train_list_name']
+    val_list_name = ret['val_list_name'].replace("txt", "lmdb") if use_lmdb \
+        else ret['val_list_name']
+    test_list_name = ret['test_list_name'].replace("txt", "lmdb") if use_lmdb \
+        else ret['test_list_name']
+    filename_seperator = ret['filename_seperator']
+    image_tmpl = ret['image_tmpl']
+    filter_video = ret.get('filter_video', 0)
+    label_file = ret.get('label_file', None)
+    return num_classes, train_list_name, val_list_name, test_list_name, filename_seperator, \
+           image_tmpl, filter_video, label_file
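
`get_dataset_config` derives the LMDB list name with `str.replace("txt", "lmdb")`, which rewrites every occurrence of `txt` in the name, not only the extension. For the single `cdf_test_fold.txt` entry configured above this makes no difference. The following is a minimal sketch of an extension-only alternative; `to_lmdb_name` is a hypothetical helper and is not part of the repository.

```python
import os


def to_lmdb_name(list_name):
    """Swap a trailing .txt extension for .lmdb and leave any other name unchanged."""
    stem, ext = os.path.splitext(list_name)
    return stem + ".lmdb" if ext == ".txt" else list_name


print(to_lmdb_name("cdf_test_fold.txt"))              # cdf_test_fold.lmdb
print(to_lmdb_name("txt_lists/train.txt"))            # txt_lists/train.lmdb
print("txt_lists/train.txt".replace("txt", "lmdb"))   # lmdb_lists/train.lmdb
```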

video_transforms.py ADDED Viewed

	@@ -0,0 +1,420 @@

+import torchvision
+import random
+from PIL import Image, ImageOps
+import numbers
+import torch
+import numpy as np
+import math
+class GroupRandomCrop(object):
+    def __init__(self, size):
+        if isinstance(size, numbers.Number):
+            self.size = (int(size), int(size))
+        else:
+            self.size = size
+    def __call__(self, img_group):
+        w, h = img_group[0].size
+        th, tw = self.size
+        out_images = list()
+        x1 = random.randint(0, w - tw)
+        y1 = random.randint(0, h - th)
+        for img in img_group:
+            assert(img.size[0] == w and img.size[1] == h)
+            if w == tw and h == th:
+                out_images.append(img)
+            else:
+                out_images.append(img.crop((x1, y1, x1 + tw, y1 + th)))
+        return out_images
+class GroupCenterCrop(object):
+    def __init__(self, size):
+        self.worker = torchvision.transforms.CenterCrop(size)
+    def __call__(self, img_group):
+        return [self.worker(img) for img in img_group]
+class GroupRandomHorizontalFlip(object):
+    """Randomly horizontally flips the given PIL.Image with a probability of 0.5
+    """
+    def __init__(self, is_flow=False):
+        self.is_flow = is_flow
+    def __call__(self, img_group, is_flow=False):
+        v = random.random()
+        if v < 0.5:
+            ret = [img.transpose(Image.FLIP_LEFT_RIGHT) for img in img_group]
+            if self.is_flow:
+                for i in range(0, len(ret), 2):
+                    ret[i] = ImageOps.invert(ret[i])  # invert flow pixel values when flipping
+            return ret
+        else:
+            return img_group
+class GroupNormalize(object):
+    def __init__(self, mean, std, threed_data=False):
+        self.threed_data = threed_data
+        if self.threed_data:
+            # convert to the proper format
+            self.mean = torch.FloatTensor(mean).view(len(mean), 1, 1, 1)
+            self.std = torch.FloatTensor(std).view(len(std), 1, 1, 1)
+        else:
+            self.mean = mean
+            self.std = std
+    def __call__(self, tensor):
+        if self.threed_data:
+            tensor.sub_(self.mean).div_(self.std)
+        else:
+            rep_mean = self.mean * (tensor.size()[0] // len(self.mean))
+            rep_std = self.std * (tensor.size()[0] // len(self.std))
+            # TODO: make efficient
+            for t, m, s in zip(tensor, rep_mean, rep_std):
+                t.sub_(m).div_(s)
+        return tensor
+class GroupCutout(object):
+    """Randomly mask out one or more patches from an image.
+    Args:
+        n_holes (int): Number of patches to cut out of each image.
+        length (int): The length (in pixels) of each square patch.
+    """
+    def __init__(self, n_holes, length):
+        self.n_holes = n_holes
+        self.length = length
+    def __call__(self, imgs):
+        """
+        Args:
+            imgs (Tensor): stacked frames of size (C, H, W), where C is a multiple of 3.
+        Returns:
+            Tensor: Image with n_holes of dimension length x length cut out of it.
+        """
+        new_imgs = []
+        # import pdb;pdb.set_trace()
+        C, H, W = imgs.shape  # e.g. (72, 224, 224): channels, height, width
+        # print(C, H, W)
+        # imgs = imgs.reshape(-1,3,H,W)
+        y = np.random.randint(H)
+        x = np.random.randint(W)
+        for i in range(0, imgs.shape[0], 3):
+            h = H
+            w = W
+            mask = np.ones((h, w), np.float32)
+            for n in range(self.n_holes):
+                y1 = np.clip(y - self.length // 2, 0, h)
+                y2 = np.clip(y + self.length // 2, 0, h)
+                x1 = np.clip(x - self.length // 2, 0, w)
+                x2 = np.clip(x + self.length // 2, 0, w)
+                mask[y1: y2, x1: x2] = 0.
+            mask = torch.from_numpy(mask)
+            mask = mask.expand_as(imgs[i:i+3])
+            img = imgs[i:i+3] * mask
+            new_imgs.append(img)
+        # import pdb;pdb.set_trace()
+        new_imgs = torch.stack(new_imgs,0).reshape(C,H,W)
+        # print(new_imgs.shape)
+        return new_imgs
+class GroupScale(object):
+    """ Rescales the input PIL.Image to the given 'size'.
+    'size' will be the size of the smaller edge.
+    For example, if height > width, then image will be
+    rescaled to (size * height / width, size)
+    size: size of the smaller edge
+    interpolation: Default: PIL.Image.BILINEAR
+    """
+    def __init__(self, size, interpolation=Image.BILINEAR):
+        self.worker = torchvision.transforms.Resize(size, interpolation)
+    def __call__(self, img_group):
+        return [self.worker(img) for img in img_group]
+class GroupRandomScale(object):
+    """ Rescales the input PIL.Image to the given 'size'.
+    'size' will be the size of the smaller edge.
+    For example, if height > width, then image will be
+    rescaled to (size * height / width, size)
+    size: size of the smaller edge
+    interpolation: Default: PIL.Image.BILINEAR
+    Randomly select the smaller edge from the range of 'size'.
+    """
+    def __init__(self, size, interpolation=Image.BILINEAR):
+        self.size = size
+        self.interpolation = interpolation
+    def __call__(self, img_group):
+        selected_size = np.random.randint(low=self.size[0], high=self.size[1] + 1, dtype=int)
+        scale = GroupScale(selected_size, interpolation=self.interpolation)
+        return scale(img_group)
+class GroupOverSample(object):
+    def __init__(self, crop_size, scale_size=None, num_crops=5, flip=False):
+        self.crop_size = crop_size if not isinstance(crop_size, int) else (crop_size, crop_size)
+        if scale_size is not None:
+            self.scale_worker = GroupScale(scale_size)
+        else:
+            self.scale_worker = None
+        if num_crops not in [1, 3, 5, 10]:
+            raise ValueError("num_crops should be in [1, 3, 5, 10] but got {}".format(num_crops))
+        self.num_crops = num_crops
+        self.flip = flip
+    def __call__(self, img_group):
+        if self.scale_worker is not None:
+            img_group = self.scale_worker(img_group)
+        image_w, image_h = img_group[0].size
+        crop_w, crop_h = self.crop_size
+        if self.num_crops == 3:
+            w_step = (image_w - crop_w) // 4
+            h_step = (image_h - crop_h) // 4
+            offsets = list()
+            if image_w != crop_w and image_h != crop_h:
+                offsets.append((0 * w_step, 0 * h_step))  # top
+                offsets.append((4 * w_step, 4 * h_step))  # bottom
+                offsets.append((2 * w_step, 2 * h_step))  # center
+            else:
+                if image_w < image_h:
+                    offsets.append((2 * w_step, 0 * h_step))  # top
+                    offsets.append((2 * w_step, 4 * h_step))  # bottom
+                    offsets.append((2 * w_step, 2 * h_step))  # center
+                else:
+                    offsets.append((0 * w_step, 2 * h_step))  # left
+                    offsets.append((4 * w_step, 2 * h_step))  # right
+                    offsets.append((2 * w_step, 2 * h_step))  # center
+        else:
+            offsets = GroupMultiScaleCrop.fill_fix_offset(False, image_w, image_h, crop_w, crop_h)
+        oversample_group = list()
+        for o_w, o_h in offsets:
+            normal_group = list()
+            flip_group = list()
+            for i, img in enumerate(img_group):
+                crop = img.crop((o_w, o_h, o_w + crop_w, o_h + crop_h))
+                normal_group.append(crop)
+                if self.flip:
+                    flip_crop = crop.copy().transpose(Image.FLIP_LEFT_RIGHT)
+                    if img.mode == 'L' and i % 2 == 0:
+                        flip_group.append(ImageOps.invert(flip_crop))
+                    else:
+                        flip_group.append(flip_crop)
+            oversample_group.extend(normal_group)
+            if self.flip:
+                oversample_group.extend(flip_group)
+        return oversample_group
+class GroupMultiScaleCrop(object):
+    def __init__(self, input_size, scales=None, max_distort=1, fix_crop=True, more_fix_crop=True):
+        self.scales = scales if scales is not None else [1, .875, .75, .66]
+        self.max_distort = max_distort
+        self.fix_crop = fix_crop
+        self.more_fix_crop = more_fix_crop
+        self.input_size = input_size if not isinstance(input_size, int) else [input_size, input_size]
+        self.interpolation = Image.BILINEAR
+    def __call__(self, img_group):
+        im_size = img_group[0].size
+        crop_w, crop_h, offset_w, offset_h = self._sample_crop_size(im_size)
+        crop_img_group = [img.crop((offset_w, offset_h, offset_w + crop_w, offset_h + crop_h)) for img in img_group]
+        ret_img_group = [img.resize((self.input_size[0], self.input_size[1]), self.interpolation)
+                         for img in crop_img_group]
+        return ret_img_group
+    def _sample_crop_size(self, im_size):
+        image_w, image_h = im_size[0], im_size[1]
+        # find a crop size
+        base_size = min(image_w, image_h)
+        crop_sizes = [int(base_size * x) for x in self.scales]
+        crop_h = [self.input_size[1] if abs(x - self.input_size[1]) < 3 else x for x in crop_sizes]
+        crop_w = [self.input_size[0] if abs(x - self.input_size[0]) < 3 else x for x in crop_sizes]
+        pairs = []
+        for i, h in enumerate(crop_h):
+            for j, w in enumerate(crop_w):
+                if abs(i - j) <= self.max_distort:
+                    pairs.append((w, h))
+        crop_pair = random.choice(pairs)
+        if not self.fix_crop:
+            w_offset = random.randint(0, image_w - crop_pair[0])
+            h_offset = random.randint(0, image_h - crop_pair[1])
+        else:
+            w_offset, h_offset = self._sample_fix_offset(image_w, image_h, crop_pair[0], crop_pair[1])
+        return crop_pair[0], crop_pair[1], w_offset, h_offset
+    def _sample_fix_offset(self, image_w, image_h, crop_w, crop_h):
+        offsets = self.fill_fix_offset(self.more_fix_crop, image_w, image_h, crop_w, crop_h)
+        return random.choice(offsets)
+    @staticmethod
+    def fill_fix_offset(more_fix_crop, image_w, image_h, crop_w, crop_h):
+        w_step = (image_w - crop_w) // 4
+        h_step = (image_h - crop_h) // 4
+        ret = list()
+        ret.append((0, 0))  # upper left
+        ret.append((4 * w_step, 0))  # upper right
+        ret.append((0, 4 * h_step))  # lower left
+        ret.append((4 * w_step, 4 * h_step))  # lower right
+        ret.append((2 * w_step, 2 * h_step))  # center
+        if more_fix_crop:
+            ret.append((0, 2 * h_step))  # center left
+            ret.append((4 * w_step, 2 * h_step))  # center right
+            ret.append((2 * w_step, 4 * h_step))  # lower center
+            ret.append((2 * w_step, 0 * h_step))  # upper center
+            ret.append((1 * w_step, 1 * h_step))  # upper left quarter
+            ret.append((3 * w_step, 1 * h_step))  # upper right quarter
+            ret.append((1 * w_step, 3 * h_step))  # lower left quarter
+            ret.append((3 * w_step, 3 * h_step))  # lower right quarter
+        return ret
+class GroupRandomSizedCrop(object):
+    """Randomly crop the given PIL.Image to a random size of (0.08 to 1.0) of the original size
+    and a random aspect ratio of 3/4 to 4/3 of the original aspect ratio
+    This is popularly used to train the Inception networks
+    size: size of the smaller edge
+    interpolation: Default: PIL.Image.BILINEAR
+    """
+    def __init__(self, size, interpolation=Image.BILINEAR):
+        self.size = size
+        self.interpolation = interpolation
+    def __call__(self, img_group):
+        for attempt in range(10):
+            area = img_group[0].size[0] * img_group[0].size[1]
+            target_area = random.uniform(0.08, 1.0) * area
+            aspect_ratio = random.uniform(3. / 4, 4. / 3)
+            w = int(round(math.sqrt(target_area * aspect_ratio)))
+            h = int(round(math.sqrt(target_area / aspect_ratio)))
+            if random.random() < 0.5:
+                w, h = h, w
+            if w <= img_group[0].size[0] and h <= img_group[0].size[1]:
+                x1 = random.randint(0, img_group[0].size[0] - w)
+                y1 = random.randint(0, img_group[0].size[1] - h)
+                found = True
+                break
+        else:
+            found = False
+            x1 = 0
+            y1 = 0
+        if found:
+            out_group = list()
+            for img in img_group:
+                img = img.crop((x1, y1, x1 + w, y1 + h))
+                assert(img.size == (w, h))
+                out_group.append(img.resize((self.size, self.size), self.interpolation))
+            return out_group
+        else:
+            # Fallback
+            scale = GroupScale(self.size, interpolation=self.interpolation)
+            crop = GroupRandomCrop(self.size)
+            return crop(scale(img_group))
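+class Stack(object):
+    """Stack a group of PIL images into a single numpy array.
+
+    'L' images are concatenated along a new channel axis (H x W x F).
+    'RGB' images are stacked to F x H x W x C when threed_data is True,
+    otherwise concatenated along the channel axis (H x W x (F*C)).
+    roll=True reverses the channel order (RGB -> BGR) of each image first.
+    """
+    def __init__(self, roll=False, threed_data=False):
+        self.roll = roll
+        self.threed_data = threed_data
+    def __call__(self, img_group):
+        if img_group[0].mode == 'L':
+            return np.concatenate([np.expand_dims(x, 2) for x in img_group], axis=2)
+        elif img_group[0].mode == 'RGB':
+            if self.threed_data:
+                return np.stack(img_group, axis=0)
+            else:
+                if self.roll:
+                    return np.concatenate([np.array(x)[:, :, ::-1] for x in img_group], axis=2)
+                else:
+                    return np.concatenate(img_group, axis=2)
+        else:
+            # Fail loudly instead of silently returning None for other modes
+            raise ValueError("Unsupported image mode: {}".format(img_group[0].mode))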
+class ToTorchFormatTensor(object):
+    """ Converts a PIL.Image (RGB) or numpy.ndarray (H x W x C, or F x H x W x C for 3D data)
+    in the range [0, 255] to a torch.FloatTensor of shape (C x H x W), or (C x F x H x W)
+    for 3D data. Values are scaled to [0.0, 1.0] only when div is True. """
+    def __init__(self, div=True, num_clips_crops=1):
+        self.div = div
+        self.num_clips_crops = num_clips_crops
+    def __call__(self, pic):
+        if isinstance(pic, np.ndarray):
+            # handle numpy array
+            if len(pic.shape) == 4:
+                # ((NF)xHxWxC) --> (Cx(NF)xHxW)
+                img = torch.from_numpy(pic).permute(3, 0, 1, 2).contiguous()
+            else:  # data is HW(FC)
+                img = torch.from_numpy(pic).permute(2, 0, 1).contiguous()
+        else:
+            # handle PIL Image
+            img = torch.ByteTensor(torch.ByteStorage.from_buffer(pic.tobytes()))
+            img = img.view(pic.size[1], pic.size[0], len(pic.mode))
+            # put it from HWC to CHW format
+            # yikes, this transpose takes 80% of the loading time/CPU
+            img = img.transpose(0, 1).transpose(0, 2).contiguous()
+        return img.float().div(255) if self.div else img.float()
+class IdentityTransform(object):
+    def __call__(self, data):
+        return data
+if __name__ == "__main__":
+    trans = torchvision.transforms.Compose([
+        GroupScale(256),
+        GroupRandomCrop(224),
+        GroupOverSample(224, 224, num_crops=3, flip=False),
+        Stack(),
+        ToTorchFormatTensor(num_clips_crops=9),
+        GroupNormalize(
+            mean=[.485, .456, .406],
+            std=[.229, .224, .225]
+        )]
+    )