LUCID — DistilBERT for Short-Form Video Manipulation Detection

"You're not addicted. You're being engineered. See how."

lucid-distilbert is a fine-tuned DistilBERT classifier that scores short-form social video text (TikTok captions + transcripts + on-screen overlay OCR) along six research-grounded psychological manipulation dimensions:

Dimension              | Academic grounding
Outrage Bait           | Crockett 2017; Brady et al. 2017, 2021
FOMO Trigger           | Przybylski et al. 2013; Cialdini 2009
Engagement Bait        | Meta 2017; Munger 2020; Mathur et al. 2019
Emotional Manipulation | Cialdini et al. 1987; Small et al. 2007; Kramer et al. 2014
Curiosity Gap          | Loewenstein 1994; Blom & Hansen 2015; Scott 2021
Dopamine Design        | Skinner 1953; Alter 2017; Montag et al. 2019

The model has two parallel output heads on a shared [CLS] representation:

  • A regression head predicting the 0–100 composite Scroll Trap Score (sigmoid × 100).
  • A multi-label head with 6 binary classifiers — one per dimension — each returning P(tactic present).

Per-dimension probabilities are trained with binary cross-entropy against rubric severity labels binarized at severity ≥ 1. The composite regression head is trained with MSE against rubric-derived ground truth.
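
The two-head readout described above can be sketched in plain Python, so that it runs without a deep-learning framework. This is only an illustration under stated assumptions: the 4-dimensional vector stands in for the [CLS] representation, each head is assumed to be a single linear layer, and the weights, `DualHeadOutput`, and `dual_head_forward` are hypothetical names rather than the repository's code.

```python
import math
from dataclasses import dataclass

DIMENSIONS = [
    "outrage_bait", "fomo_trigger", "engagement_bait",
    "emotional_manipulation", "curiosity_gap", "dopamine_design",
]

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def linear(weights, bias, h):
    # One output unit of a linear layer: w . h + b
    return sum(w * x for w, x in zip(weights, h)) + bias

@dataclass
class DualHeadOutput:
    scroll_trap_score: float   # composite, 0-100
    dimension_probs: dict      # P(tactic present) per dimension

def dual_head_forward(cls_vec, composite_w, composite_b, dim_w, dim_b):
    # Regression head: sigmoid squashes to (0, 1), then scale to 0-100.
    composite = 100.0 * sigmoid(linear(composite_w, composite_b, cls_vec))
    # Multi-label head: six independent sigmoids over the same shared vector.
    probs = {
        name: sigmoid(linear(w, b, cls_vec))
        for name, w, b in zip(DIMENSIONS, dim_w, dim_b)
    }
    return DualHeadOutput(composite, probs)

# Toy values (assumptions for illustration only).
cls_vec = [0.2, -0.1, 0.4, 0.0]
composite_w = [0.5, 0.5, 0.5, 0.5]
dim_w = [[0.1 * (i + 1)] * 4 for i in range(6)]
out = dual_head_forward(cls_vec, composite_w, 0.0, dim_w, [0.0] * 6)
print(round(out.scroll_trap_score, 1))  # 56.2
```

Because the six sigmoids are independent, the per-dimension probabilities do not need to sum to 1; a post can exhibit several tactics at once.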


Intended use

Primary use case. Research and educational tool for analyzing short-form video content at the post level. Given a fused text stream from a single TikTok-style post (caption + audio transcript + on-screen text), the model returns the probability that each of the six manipulation tactics is present, plus an aggregate 0–100 composite score.

Users this was built for. Trust & Safety practitioners, platform policy researchers, media literacy educators, and end users who want vocabulary for what a specific post is doing to their attention.

Not intended for.

  • Individual creator moderation or takedowns. The model scores posts, not intent; using it to judge whether a specific creator is acting in bad faith would misread the labels.
  • Demographic profiling of creators or audiences.
  • Any high-stakes automated decision without human review.
  • Content in languages or cultural contexts other than English-language, predominantly US/UK social-media discourse. Manipulation norms are culturally situated; applying the model outside its training distribution requires rubric reconstruction.

Training data

Total labeled corpus: 3,527 items.

Source                      | Approx. size | Purpose
Webis Clickbait Corpus 2017 | ~2,000       | Pretraining-style signal; continuous severity
Stop Clickbait 2016         | ~1,500       | Weak supervision; binary clickbait
TikTok (yt-dlp scrape)      | ~200         | In-domain evaluation + demo gallery

Labeling — LLM-as-judge with human validation

Because the existing datasets carry only a single clickbait label (binary for Stop Clickbait, a continuous severity score for Webis) rather than per-dimension annotations, we used Claude Sonnet 4.5 (Anthropic) as a scalable labeling oracle, prompted with the 6-dimension rubric above (full text in repo docs/RUBRIC.md) and 8 few-shot examples per severity level. This approach is explicitly in the lineage of Constitutional AI / RLAIF (Bai et al. 2022): an LLM prompted with human-written principles produces training labels for a smaller supervised model.

We validate Claude's labels against a 100-post human gold set hand-labeled by the author, reporting per-dimension Spearman rank correlation and Krippendorff's α (ordinal) as agreement metrics.

Agreement numbers will appear in the companion technical report once gold-set labeling is complete.
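
As a rough illustration of the per-dimension agreement check, Spearman rank correlation can be computed as the Pearson correlation of tie-averaged ranks. The sketch below is plain Python; the function names and severity values are hypothetical and are not gold-set data. Krippendorff's α is not covered here.

```python
import math

def average_ranks(values):
    # Rank from 1..n, giving tied values the mean of the ranks they span.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    # Undefined (division by zero) if either rater gives a constant label.
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical severities for one dimension on five posts.
llm_labels = [0, 1, 2, 3, 1]
human_labels = [0, 1, 2, 3, 1]
print(round(spearman(llm_labels, human_labels), 6))  # 1.0
```

Severity labels on a short ordinal scale produce many ties, which is why the ranks are averaged rather than assigned arbitrarily.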


Training

  • Base model. distilbert-base-uncased (Sanh et al. 2019), 66M parameters.
  • Fine-tuning. Full fine-tune (no layer freezing). Dual heads attached to the [CLS] pooled representation.
  • Optimizer. AdamW, lr=2e-5, weight_decay=0.01.
  • Schedule. Linear LR decay with 10% warmup, 4 epochs.
  • Batch size. 32.
  • Max sequence length. 256 tokens.
  • Loss. MSE(composite) + 1.0 × BCEWithLogitsLoss(dimensions).
  • Hardware. Single NVIDIA H100 via Duke Colab credits. Training completed in ~2 minutes.
  • Checkpoint selection. Best epoch by validation composite MAE; saved state is from epoch 4 with val MAE=5.88.
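
The schedule and loss bullets above can be made concrete with a small plain-Python sketch. Several things here are assumptions rather than details from the repository: the schedule follows the common "linear warmup, then linear decay to zero" shape; the loss is written for a single example with the composite on a 0–1 scale; and `lr_at_step`, `joint_loss`, and `total_steps` are hypothetical names.

```python
import math

def lr_at_step(step, total_steps, base_lr=2e-5, warmup_frac=0.10):
    # Linear ramp from 0 to base_lr over the warmup, then linear decay to 0.
    warmup_steps = int(total_steps * warmup_frac)
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(total_steps - warmup_steps, 1)
    return base_lr * max(0.0, (total_steps - step) / remaining)

def bce_with_logits(logit, target):
    # Numerically stable form: max(z, 0) - z*t + log(1 + exp(-|z|))
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def joint_loss(composite_pred, composite_true, dim_logits, dim_severities,
               bce_weight=1.0):
    # Regression term on the composite score (assumed 0-1 scale here).
    mse = (composite_pred - composite_true) ** 2
    # Rubric severities are binarized at severity >= 1, as described above.
    targets = [1.0 if s >= 1 else 0.0 for s in dim_severities]
    bce = sum(bce_with_logits(z, t) for z, t in zip(dim_logits, targets))
    bce /= len(targets)
    return mse + bce_weight * bce

print(lr_at_step(550, 1000))  # halfway through the decay phase, about 1e-05
print(joint_loss(0.5, 0.5, [0.0] * 6, [0, 1, 2, 0, 0, 3]))  # about ln 2
```

With equal weighting (1.0), the relative influence of the two terms depends on the scale chosen for the composite target, which is why the scale assumption matters.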

Reproducibility


Evaluation

Held-out test split of 529 items (stratified 15% of corpus).

Test-set metrics

Metric                                  | Value
Macro F1 (per-dim binary, threshold ≥1) | 0.334
Macro accuracy (per-dim binary)         | 0.904
Composite MAE (0–100 scale)             | 5.90
Composite RMSE                          | 7.12
Composite R²                            | +0.368

How to interpret

  • Positive composite R² (+0.368) means the model explains real variance in the composite score beyond a constant mean predictor. For comparison, the naive keyword-matching baseline has R²=−0.594 and the classical (TF-IDF + XGBoost) baseline has R²=−1.462. The deep model is the only one of the three that beats the mean predictor.
  • The macro F1 of 0.334 is lower than the classical baseline's 0.425. This reflects an intentional calibration difference: the deep model's per-dim probabilities are softer, producing fewer firings but better-calibrated confidences. See the technical report §6 for the full per-dimension breakdown.
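
To make the R² interpretation concrete, the following plain-Python sketch computes MAE, RMSE, and R² on a tiny made-up set of scores (not project data). It shows that a constant mean predictor scores exactly 0, and that a model whose errors exceed the spread of the targets goes negative.

```python
import math

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model error
    ss_tot = sum((t - mean) ** 2 for t in y_true)               # mean-predictor error
    return 1.0 - ss_res / ss_tot

y_true = [10.0, 20.0, 30.0, 40.0]        # toy composite scores
mean_pred = [25.0] * 4                   # constant mean predictor
good_pred = [12.0, 18.0, 33.0, 37.0]     # tracks the targets
bad_pred = [40.0, 30.0, 20.0, 10.0]      # worse than predicting the mean

print(r_squared(y_true, mean_pred))            # 0.0
print(round(r_squared(y_true, good_pred), 3))  # 0.948
print(r_squared(y_true, bad_pred))             # -3.0
print(mae(y_true, good_pred))                  # 2.5
```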

Noise robustness

Character-level noise injection on 100 test items (seed=7), mean |Δ score| on the 0–100 composite:

Noise rate | Mean Δ | Median Δ | Max Δ
5%         | 4.2    | 2.0      | 26
10%        | 5.4    | 4.0      | 27
20%        | 7.7    | 5.5      | 37
35%        | 10.2   | 9.0      | 32

At realistic OCR / transcription noise levels (5–10%), the composite Scroll Trap Score shifts by about 4–5 points on average on the 0–100 scale. This graceful degradation suggests the model has learned semantic rather than purely surface-lexical features.
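
A noise-robustness probe of this kind might look like the plain-Python sketch below. The exact corruption used in the evaluation is not specified here, so random character substitution is an assumption, and `inject_char_noise`, `mean_abs_delta`, and `toy_score` are hypothetical names. `toy_score` is only a stand-in so the example runs without the model.

```python
import random
import string

def inject_char_noise(text, rate, rng):
    # Replace each non-space character with a random lowercase letter
    # with probability `rate` (substitution noise; an assumption).
    chars = list(text)
    for i, ch in enumerate(chars):
        if not ch.isspace() and rng.random() < rate:
            chars[i] = rng.choice(string.ascii_lowercase)
    return "".join(chars)

def mean_abs_delta(score_fn, texts, rate, seed=7):
    # Mean |score(noisy) - score(clean)| over the given texts.
    rng = random.Random(seed)
    deltas = [
        abs(score_fn(inject_char_noise(t, rate, rng)) - score_fn(t))
        for t in texts
    ]
    return sum(deltas) / len(deltas)

def toy_score(text):
    # Stand-in scorer: share of uppercase characters, scaled to 0-100.
    return 100.0 * sum(c.isupper() for c in text) / max(len(text), 1)

texts = ["DON'T SCROLL! HANG ON!", "wait for the end"]
for rate in (0.05, 0.10, 0.20, 0.35):
    print(rate, round(mean_abs_delta(toy_score, texts, rate), 2))
```

In practice `score_fn` would wrap the real predictor's composite output; fixing the seed keeps the corrupted inputs identical across runs.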


Usage

Via HuggingFace transformers

This model has a custom multi-output head (composite_head + dimension_head), so it cannot be loaded with AutoModelForSequenceClassification. Use the repo's inference module:

from backend.inference.deep import DeepPredictor

predictor = DeepPredictor(hf_repo="lindsaygross32/lucid-distilbert")
pred = predictor.predict("DON'T SCROLL! HANG ON! HANG ON!! I have one question...")

print(pred.scroll_trap_score)
# 28
print(pred.dimension_scores)
# {'outrage_bait': 0.11, 'fomo_trigger': 0.23, 'engagement_bait': 0.29,
#  'emotional_manipulation': 0.04, 'curiosity_gap': 0.68, 'dopamine_design': 0.25}
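
To turn the per-dimension probabilities into a short list of flagged tactics, one option is a simple threshold. The sketch below reuses the example scores shown above; the 0.5 cutoff and the `tactics_present` helper are assumptions, not part of the repo's API.

```python
dimension_scores = {
    "outrage_bait": 0.11, "fomo_trigger": 0.23, "engagement_bait": 0.29,
    "emotional_manipulation": 0.04, "curiosity_gap": 0.68, "dopamine_design": 0.25,
}

def tactics_present(scores, threshold=0.5):
    # Keep dimensions at or above the cutoff, highest probability first.
    flagged = [name for name, p in scores.items() if p >= threshold]
    return sorted(flagged, key=lambda name: scores[name], reverse=True)

print(tactics_present(dimension_scores))
# ['curiosity_gap']
print(tactics_present(dimension_scores, threshold=0.25))
# ['curiosity_gap', 'engagement_bait', 'dopamine_design']
```

A lower cutoff surfaces more tactics at the cost of more false positives, so any threshold should be chosen against held-out data rather than taken from this sketch.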

Per-dimension token attribution (Integrated Gradients)

pred, per_dim_tokens = predictor.explain(
    "DON'T SCROLL! HANG ON! Will you be my friend?",
    top_k=8,
)
# per_dim_tokens["engagement_bait"] -> [
#   {"token": "you", "position": 9, "attribution": +0.34},
#   {"token": "question", "position": 14, "attribution": +0.26},
#   ...
# ]

Integrated Gradients (Sundararajan, Taly, Yan 2017) produces signed per-token attributions. Positive attribution → token pushes the head toward "tactic present," negative → toward absent.
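
For intuition, Integrated Gradients can be sketched on a toy logistic model in plain Python. This is not the repo's `explain` implementation: the real model attributes over token embeddings, whereas here three hand-picked features, an all-zeros baseline, and a midpoint Riemann sum are assumed for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def model(x, w, b):
    # Toy "head": P(tactic present) = sigmoid(w . x + b)
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def grad(x, w, b):
    # Analytic gradient of the sigmoid output w.r.t. each input feature.
    p = model(x, w, b)
    return [p * (1.0 - p) * wi for wi in w]

def integrated_gradients(x, baseline, w, b, steps=200):
    # Average the gradient along the straight path baseline -> x
    # (midpoint rule), then scale by the input difference.
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [bl + alpha * (xi - bl) for xi, bl in zip(x, baseline)]
        for i, g in enumerate(grad(point, w, b)):
            total[i] += g
    return [(xi - bl) * t / steps for xi, bl, t in zip(x, baseline, total)]

x = [1.0, 0.5, 2.0]
w = [0.8, -1.2, 0.3]
b = -0.2
baseline = [0.0, 0.0, 0.0]
attr = integrated_gradients(x, baseline, w, b)

# Completeness: attributions sum to f(x) - f(baseline), up to numerical error.
print(abs(sum(attr) - (model(x, w, b) - model(baseline, w, b))) < 1e-4)  # True
```

The second feature has a negative weight, so its attribution is negative: it pushes the output toward "absent", matching the sign convention described above.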

Live demo

https://lucid-seven-pied.vercel.app


Limitations and ethical considerations

  1. Intent vs. effect. The model measures tactic presence, not creator intent. A post using emotional appeals to raise money for a sick family member scores higher on Emotional Manipulation — but that is not a judgment of bad faith. Any downstream tooling built on top of this model must preserve that distinction.

  2. Cultural and linguistic scope. Training data is English-language, predominantly US-origin social content. Manipulation norms vary across cultures; the model should not be used on non-English content or in cultures with meaningfully different rhetorical conventions without rubric reconstruction.

  3. Labeling source bias. Our labels come from a single LLM judge (Claude Sonnet 4.5) validated against a single human annotator. A world where many systems use the same LLM as judge risks correlated labeling errors. Multi-model, multi-annotator labeling would be the right long-term direction.

  4. Small corpus. 3,527 total items is modest for a 6-way multi-label task. Expect larger errors and higher variance on new distributions than the test-set metrics above suggest.

  5. Format–content confounds. The classical baseline over-fires on listicle-format text because training data (Stop Clickbait) conflates listicle format with clickbait manipulation. The deep model is more robust but the underlying confound is not fully eliminated.

  6. Creator-level aggregation risk. This model scores posts. Rolling scores up to the creator level (e.g., "creator X's average Scroll Trap Score") creates harassment vectors and should not be done without additional review.

  7. Not a safety classifier. This is an educational tool for surfacing rhetorical moves, not a hate-speech / harm detector. It makes no claim about whether content is harmful or false.


Citation

If you use lucid-distilbert in academic work, please cite:

@misc{gross2026lucid,
  title        = {LUCID: Multimodal Detection of Short-Form Video Manipulation Tactics},
  author       = {Lindsay Gross},
  year         = {2026},
  howpublished = {\url{https://github.com/lindsaygross/Lucid}},
  note         = {Duke AIPI 540 final project},
}

Academic grounding for the 6-dimension rubric is documented in full in docs/RUBRIC.md.


License

MIT. See the LICENSE in the repo.


Contact

Lindsay Gross — Duke AIPI, Spring 2026 — background in Trust & Safety.

Issues / collaboration: github.com/lindsaygross/Lucid/issues.
