TikTok Content Safety - Multimodal Fusion Model

This model combines text and video features using Late Fusion with Cross-Attention to classify TikTok content as safe or harmful.

Architecture

Text Backbone (XLM-RoBERTa) → Text Features (768-dim)
Video Backbone (VideoMAE)    → Video Features (768-dim)
                                  ↓
                         Cross-Attention Fusion
                                  ↓
                         Gating Mechanism
                                  ↓
                        Classifier (2 classes)
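
The diagram above can be sketched in PyTorch roughly as follows. This is only an illustration of cross-attention followed by a gate and a classifier, not the actual LateFusionModel code, which is not published in this card. The class name CrossAttentionGatedFusion, the number of attention heads, the mean pooling, and the exact form of the gate are all assumptions; only the 768-dim features and the 2 output classes come from the card.

```python
import torch
import torch.nn as nn


class CrossAttentionGatedFusion(nn.Module):
    """Hypothetical fusion head: cross-attention, then a gate, then a classifier."""

    def __init__(self, dim: int = 768, num_heads: int = 8, num_classes: int = 2):
        super().__init__()
        # Text tokens attend over video tokens, and video tokens over text tokens.
        self.text_to_video = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.video_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # The gate outputs a per-dimension weight in (0, 1) for mixing modalities.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, text_tokens: torch.Tensor, video_tokens: torch.Tensor) -> torch.Tensor:
        # text_tokens: (batch, text_len, dim); video_tokens: (batch, video_len, dim)
        t_att, _ = self.text_to_video(query=text_tokens, key=video_tokens, value=video_tokens)
        v_att, _ = self.video_to_text(query=video_tokens, key=text_tokens, value=text_tokens)
        # Mean-pool each attended sequence to one vector per example (an assumption).
        t_vec = t_att.mean(dim=1)
        v_vec = v_att.mean(dim=1)
        # Gated mix: g weights the text side, (1 - g) weights the video side.
        g = self.gate(torch.cat([t_vec, v_vec], dim=-1))
        fused = g * t_vec + (1 - g) * v_vec
        return self.classifier(fused)  # (batch, num_classes) logits


# Random tensors stand in for XLM-RoBERTa and VideoMAE backbone outputs.
fusion = CrossAttentionGatedFusion()
logits = fusion(torch.randn(2, 5, 768), torch.randn(2, 7, 768))
```

The two backbones may produce sequences of different lengths; cross-attention handles that because only the feature dimension has to match.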

Usage

This is a custom model. Download the weights and config from the Hub, then load them into the LateFusionModel class from your own codebase:

import json

from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

REPO_ID = "KhoiBui/tiktok-multimodal-fusion-classifier"

# Download the model weights and the fusion config from the Hub
weights_path = hf_hub_download(repo_id=REPO_ID, filename="model.safetensors")
config_path = hf_hub_download(repo_id=REPO_ID, filename="fusion_config.json")

# Load the config
with open(config_path) as f:
    config = json.load(f)

# Load the weights as a dict mapping parameter names to tensors
state_dict = load_file(weights_path)

# Initialize the model and load the weights.
# LateFusionModel is not part of any library; import it from your own codebase.
# model = LateFusionModel(config)
# model.load_state_dict(state_dict)
# model.eval()  # switch to inference mode
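
Because the LateFusionModel class comes from your own code, its parameter names may not match the checkpoint exactly. A minimal sketch of a check, using the real `strict=False` option of `load_state_dict`, is shown below. The helper name report_key_mismatch is hypothetical, and a small nn.Linear stands in for the actual model.

```python
import torch
import torch.nn as nn


def report_key_mismatch(model: nn.Module, state_dict: dict):
    """Load weights non-strictly and return (missing, unexpected) parameter names."""
    # strict=False tolerates missing or extra keys and reports them instead of raising.
    # Tensors whose shapes do not match still raise a RuntimeError.
    result = model.load_state_dict(state_dict, strict=False)
    return list(result.missing_keys), list(result.unexpected_keys)


# Stand-in example: the checkpoint lacks "bias" and carries an unknown "extra" tensor.
stand_in_model = nn.Linear(4, 2)
stand_in_state = {"weight": torch.zeros(2, 4), "extra": torch.zeros(1)}
missing, unexpected = report_key_mismatch(stand_in_model, stand_in_state)
print("missing:", missing)
print("unexpected:", unexpected)
```

If both lists are empty for the real model and checkpoint, the class definition matches the saved weights.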

Model Details

Text Backbone: XLM-RoBERTa-base
Video Backbone: VideoMAE-base
Fusion: Cross-Attention with Gating
Task: Binary classification (safe/harmful)
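
For the binary task, the two classifier logits can be turned into a label and a probability along these lines. The label order (index 0 = safe, index 1 = harmful), the 0.5 threshold, and the helper name predict_labels are assumptions; this card does not state them, so check fusion_config.json or the training code before relying on them.

```python
import torch

# Assumed label order; the model card does not say which index means "harmful".
ID2LABEL = {0: "safe", 1: "harmful"}


def predict_labels(logits: torch.Tensor, threshold: float = 0.5):
    """Map (batch, 2) logits to a list of (label, harmful_probability) pairs."""
    probs = torch.softmax(logits, dim=-1)
    harmful_prob = probs[:, 1]
    return [(ID2LABEL[int(p >= threshold)], float(p)) for p in harmful_prob]


# Hand-written logits stand in for real model outputs.
example_logits = torch.tensor([[2.0, -1.0], [0.0, 3.0]])
predictions = predict_labels(example_logits)
```

Raising the threshold lowers the number of items flagged as harmful, at the cost of missing more harmful content.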

Model size: 0.6B params
Tensor type: F32 (Safetensors)