Energy-Based Constraint Networks
Pretrained weights for "Energy-Based Constraint Networks: Learning Structural Coherence Across Modalities"
Checkpoints
Text Domain (frozen BERT-base-uncased encoder)
| File | Params | CLI flag | Result |
|---|---|---|---|
| nl_bert_constraint_best.pt | 7.4M | n/a | 93.4% on 6 trained corruption types, 87.2% on 9 unseen types |
Vision Domain (frozen DINOv2 ViT-B/14 encoder)
| File | Params | CLI flag | Role |
|---|---|---|---|
| pretrained_paired_best.pt | 3.6M | --struct_checkpoint | Structural branch. Corruption pretrained + FF++ paired fine-tuning. Detects face swaps, expression transfers, identity manipulations |
| freq_only_best.pt | 3.6M | --freq_checkpoint | Frequency branch. Processes DINOv2(frequency heatmap). Detects GAN smoothing, texture loss |
| local_only_best.pt | 3.6M | --local_checkpoint | Local texture branch. Trained on local corruptions + NeuralTextures. Detects neural rendering, localized inconsistencies |
| vision_constraint_celeb.pt | 3.6M | --pretrained | Corruption-only pretrained model (zero deepfake training data). Used as initialization for structural branch. Achieves 0.850 AUC on FF++ Deepfakes without seeing any deepfakes |
Combined Vision Results (structural + frequency + local texture)
| Benchmark | AUC (mean ± std, 5 seeds) |
|---|---|
| FF++ Deepfakes | 0.959 ± 0.001 |
| FF++ Face2Face | 0.909 ± 0.003 |
| FF++ FaceSwap | 0.919 ± 0.003 |
| FF++ NeuralTextures | 0.880 ± 0.005 |
| FF++ FaceShifter | 0.897 ± 0.005 |
| Celeb-DF (cross-dataset, no training data) | 0.870 ± 0.019 |
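For readers who want to reproduce numbers in the same form as the table above, here is a minimal sketch of how an AUC and a "mean ± std" summary can be computed from raw scores. It assumes that a higher energy means "more likely fake" and that the reported std is the sample standard deviation; neither is confirmed by this README. The function names `roc_auc` and `summarize` and the toy scores are illustrative only and are not part of the repository.

```python
from statistics import mean, stdev


def roc_auc(real_scores, fake_scores):
    """AUC as the probability that a fake scores higher than a real sample.

    Ties count as half a win (the Mann-Whitney U formulation).
    Assumption: higher score = more likely fake.
    """
    if not real_scores or not fake_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for f in fake_scores:
        for r in real_scores:
            if f > r:
                wins += 1.0
            elif f == r:
                wins += 0.5
    return wins / (len(real_scores) * len(fake_scores))


def summarize(aucs):
    """Format per-seed AUCs as 'mean ± std' (sample std, an assumption)."""
    return f"{mean(aucs):.3f} ± {stdev(aucs):.3f}"


# Toy energies, made up purely for illustration (not model outputs).
real = [-1.2, -0.8, -0.5, 0.1]
fake = [-0.6, 0.3, 0.9, 1.4]
print(roc_auc(real, fake))  # 0.875 (14 of 16 fake/real pairs are ordered correctly)
```

The pairwise loop is quadratic in the number of samples, which is fine for a sanity check; for full benchmark runs a rank-based implementation such as scikit-learn's `roc_auc_score` is the more practical choice.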
Usage
Text
import torch
import torch.nn as nn
from model import ConstraintNetwork
from bert_encoder import BERTWindowEncoder
# Load encoder and model
encoder = BERTWindowEncoder("bert-base-uncased", "cuda", layer=-2, window_size=8)
model = ConstraintNetwork(d_model=384, d_state=64, vocab_size=None,
                          max_seq_len=32, dropout=0.15, alpha=0.3).cuda()
model.input_proj = nn.Linear(768, 384).cuda()
model.load_state_dict(torch.load("nl_bert_constraint_best.pt", map_location="cuda"))
model.eval()
# Evaluate a paragraph
text = "Marie Curie was born in Warsaw. She studied physics in Paris."
embs = encoder.encode_paragraph(text, max_windows=32).cuda()
if embs.shape[0] < 32:
    pad = torch.zeros(32 - embs.shape[0], 768, device="cuda")
    embs = torch.cat([embs, pad])
embs = embs.unsqueeze(0)
with torch.no_grad():
    energy = model(embs)
print(f"Energy: {energy.item():+.4f}") # lower = more coherent
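The padding step in the text example can be factored into a small helper. The sketch below is not part of the repository: `pad_windows` is a hypothetical name, and it assumes the encoder returns a 2-D tensor of shape `[num_windows, dim]`, as the example above implies. It also truncates inputs longer than `max_windows`, which the original snippet does not need because `encode_paragraph` is already called with `max_windows=32`.

```python
import torch


def pad_windows(embs: torch.Tensor, max_windows: int = 32) -> torch.Tensor:
    """Zero-pad (or truncate) [num_windows, dim] to [1, max_windows, dim].

    Hypothetical helper; the padding is created on the same device and
    with the same dtype as the input.
    """
    if embs.dim() != 2:
        raise ValueError(f"expected a 2-D tensor, got shape {tuple(embs.shape)}")
    num_windows, dim = embs.shape
    if num_windows > max_windows:
        embs = embs[:max_windows]
    elif num_windows < max_windows:
        pad = embs.new_zeros(max_windows - num_windows, dim)
        embs = torch.cat([embs, pad], dim=0)
    return embs.unsqueeze(0)  # add the batch dimension


# Three windows of 768-dim BERT features become one padded batch.
batch = pad_windows(torch.ones(3, 768))
print(batch.shape)  # torch.Size([1, 32, 768])
```

Using `new_zeros` avoids hard-coding `device="cuda"`, so the same helper works on CPU when debugging.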
Vision (three-branch evaluation)
python eval_combined_final.py \
  --struct_checkpoint pretrained_paired_best.pt \
  --freq_checkpoint freq_only_best.pt \
  --local_checkpoint local_only_best.pt \
  --real_dir ff_c23_faces/original \
  --fake_dirs ff_c23_faces/Deepfakes ff_c23_faces/Face2Face \
              ff_c23_faces/FaceSwap ff_c23_faces/NeuralTextures \
              ff_c23_faces/FaceShifter \
  --max_images 5000
Training from scratch
# Text: train on WikiText-103 (embeddings cached after first run)
python train_bert.py
# Vision: train three branches independently
python train_paired.py --real_dir ... --fake_dirs ... --pretrained vision_constraint_celeb.pt
python train_freq_only.py --real_dir ... --fake_dirs ...
python train_local_only.py --real_dir ... --fake_dirs ...
Architecture
Input → SSM (×6) → Dual-head attention (×2) → Energy head → E(x)
- SSM (×6): linear cost
- Dual-head attention (×2): causal + bidirectional, single W per head, no Q/K/V, no KV cache
- Energy head: mean + α·max
- Same architecture for both text and vision — only the input projection layer changes
- Each forward pass is stateless (no KV cache)
- Per-position energy decomposition enables violation localization
- α = 0.3 (insensitive in [0.2, 1.0] across both modalities)
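One way to read the "mean + α·max" energy head together with the per-position decomposition is sketched below. This is an interpretation, not the repository's code: it assumes the head produces one scalar energy per position and pools them as the mean plus α times the maximum. The names `pooled_energy` and `most_violating_position` are hypothetical.

```python
def pooled_energy(position_energies, alpha=0.3):
    """Pool per-position energies as mean + alpha * max (assumed form)."""
    if not position_energies:
        raise ValueError("need at least one position energy")
    mean_e = sum(position_energies) / len(position_energies)
    return mean_e + alpha * max(position_energies)


def most_violating_position(position_energies):
    """Index of the highest-energy position, i.e. the likeliest violation."""
    if not position_energies:
        raise ValueError("need at least one position energy")
    return max(range(len(position_energies)), key=position_energies.__getitem__)


# Toy per-position energies with one clear outlier at index 2.
energies = [0.0, 1.0, 5.0, 0.0]
print(pooled_energy(energies))            # mean 1.5 plus 0.3 * 5.0
print(most_violating_position(energies))  # 2
```

Under this reading, the mean term reflects overall coherence while the max term keeps a single strongly violating position from being averaged away, with α controlling how much that peak counts.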
Requirements
- Python 3.10+
- PyTorch 2.0+
- CUDA GPU
- transformers (BERT, DINOv2)
- datasets (WikiText-103)
- mamba-ssm + causal-conv1d
Citation
@article{shinde2026energy,
title={Energy-Based Constraint Networks: Learning Structural Coherence Across Modalities},
author={Shinde, Chirag},
year={2026},
url={https://github.com/cs-cmyk/energy-constraint-networks}
}