TikTok Content Safety - Multimodal Fusion Model

This model combines text and video features using Late Fusion with Cross-Attention to classify TikTok content as safe or harmful.

Architecture

Text Backbone (XLM-RoBERTa) → Text Features (768-dim)
Video Backbone (VideoMAE)    → Video Features (768-dim)
                                  ↓
                         Cross-Attention Fusion
                                  ↓
                         Gating Mechanism
                                  ↓
                        Classifier (2 classes)
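
The pipeline above can be sketched in PyTorch roughly as follows. This is a minimal illustration, not the released LateFusionModel: the class name CrossAttentionGatedFusion, the number of attention heads, the mean pooling, and the exact form of the gate are all assumptions made for the example. Only the 768-dim features and the 2-class output come from this card.

```python
import torch
import torch.nn as nn


class CrossAttentionGatedFusion(nn.Module):
    """Illustrative fusion head (hypothetical; not the released LateFusionModel)."""

    def __init__(self, dim: int = 768, num_heads: int = 8, num_classes: int = 2):
        super().__init__()
        # Text tokens attend over video tokens, and video tokens over text tokens.
        self.text_to_video = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.video_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # The gate outputs per-dimension weights in (0, 1) used to mix the two streams.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, text_tokens: torch.Tensor, video_tokens: torch.Tensor) -> torch.Tensor:
        # text_tokens: (batch, T, dim); video_tokens: (batch, V, dim)
        t_attn, _ = self.text_to_video(query=text_tokens, key=video_tokens, value=video_tokens)
        v_attn, _ = self.video_to_text(query=video_tokens, key=text_tokens, value=text_tokens)
        # Mean-pool each attended sequence to a single vector (an assumed pooling choice).
        t_vec = t_attn.mean(dim=1)
        v_vec = v_attn.mean(dim=1)
        # Gated mix: g close to 1 favours the text stream, close to 0 the video stream.
        g = self.gate(torch.cat([t_vec, v_vec], dim=-1))
        fused = g * t_vec + (1 - g) * v_vec
        return self.classifier(fused)  # (batch, num_classes) logits


model = CrossAttentionGatedFusion()
model.eval()
with torch.no_grad():
    # Random stand-ins for backbone outputs: 16 text tokens and 32 video tokens per sample.
    logits = model(torch.randn(4, 16, 768), torch.randn(4, 32, 768))
```

Here the random tensors stand in for the XLM-RoBERTa and VideoMAE outputs; in the real model those features would come from the two backbones.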

Usage

This is a custom model. Download the weights and config from the Hub, then load them into the LateFusionModel class from your own codebase:

import json

from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

REPO_ID = "KhoiBui/tiktok-multimodal-fusion-classifier"

# Download model weights and fusion config from the Hub
weights_path = hf_hub_download(repo_id=REPO_ID, filename="model.safetensors")
config_path = hf_hub_download(repo_id=REPO_ID, filename="fusion_config.json")

# Load config
with open(config_path) as f:
    config = json.load(f)

# Load the weights as a dict mapping parameter names to tensors
state_dict = load_file(weights_path)

# Initialize and load model (using the LateFusionModel class from your codebase)
# model = LateFusionModel(config)
# model.load_state_dict(state_dict)
# model.eval()

Model Details

  • Text Backbone: XLM-RoBERTa-base
  • Video Backbone: VideoMAE-base
  • Fusion: Cross-Attention with Gating
  • Task: Binary classification (safe/harmful)
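
As a small illustration of the binary task, the two classifier logits can be turned into a label and a probability with a softmax. The label order below (index 0 = safe, index 1 = harmful) is an assumption for the example and is not stated on this card; check fusion_config.json or the training code for the actual mapping. The helper name predict_label is likewise hypothetical.

```python
import math

# Assumed label order; verify against the model's config before relying on it.
LABELS = ["safe", "harmful"]


def predict_label(logits):
    """Apply a numerically stable softmax to two logits and return (label, probability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    idx = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[idx], probs[idx]


# Example logits are made up for illustration.
label, prob = predict_label([0.2, 1.7])
```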