LUCID — DistilBERT for Short-Form Video Manipulation Detection
"You're not addicted. You're being engineered. See how."
lucid-distilbert is a fine-tuned DistilBERT classifier that scores short-form social video text (TikTok captions + transcripts + on-screen overlay OCR) along six research-grounded psychological manipulation dimensions:
| Dimension | Academic grounding |
|---|---|
| Outrage Bait | Crockett 2017; Brady et al. 2017, 2021 |
| FOMO Trigger | Przybylski et al. 2013; Cialdini 2009 |
| Engagement Bait | Meta 2017; Munger 2020; Mathur et al. 2019 |
| Emotional Manipulation | Cialdini et al. 1987; Small et al. 2007; Kramer et al. 2014 |
| Curiosity Gap | Loewenstein 1994; Blom & Hansen 2015; Scott 2021 |
| Dopamine Design | Skinner 1953; Alter 2017; Montag et al. 2019 |
The model has two parallel output heads on a shared [CLS] representation:
- A regression head predicting the 0–100 composite Scroll Trap Score (sigmoid × 100).
- A multi-label head with 6 binary classifiers, one per dimension, each returning P(tactic present).
Per-dimension probabilities are trained with binary cross-entropy against rubric severity labels binarized at severity ≥ 1. The composite regression head is trained with MSE against rubric-derived ground truth.
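The two heads and their joint objective can be illustrated without a deep-learning framework. The following is a minimal sketch, not the repo's implementation: the 4-dimensional stand-in for the [CLS] vector, the all-zero weights, and the choice to compute MSE on the 0–1 sigmoid output (rather than on the 0–100 score) are assumptions made for illustration. Only the head names (composite_head, dimension_head), the sigmoid × 100 scaling, the severity ≥ 1 binarization, and the 1.0 loss weight come from this card.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear(weights, bias, x):
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def forward(cls_vec, composite_head, dimension_head):
    """Return (composite logit, six dimension logits) from one shared vector."""
    w, b = composite_head
    composite_logit = linear(w, b, cls_vec)
    dim_logits = [linear(w_d, b_d, cls_vec) for w_d, b_d in dimension_head]
    return composite_logit, dim_logits

def binarize(severities, threshold=1):
    # Rubric severity labels become binary targets at severity >= 1.
    return [1.0 if s >= threshold else 0.0 for s in severities]

def bce_with_logits(logit, target):
    # Numerically stable binary cross-entropy on a raw logit.
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def joint_loss(composite_logit, dim_logits, composite_target, severities):
    # Assumption: the regression target is rescaled from 0-100 to 0-1.
    mse = (sigmoid(composite_logit) - composite_target / 100.0) ** 2
    labels = binarize(severities)
    bce = sum(bce_with_logits(z, t) for z, t in zip(dim_logits, labels)) / len(labels)
    return mse + 1.0 * bce

# Hypothetical all-zero heads on a toy 4-dimensional "CLS" vector.
zero_head = ([0.0] * 4, 0.0)
composite_logit, dim_logits = forward([0.2, -0.1, 0.4, 0.0], zero_head, [zero_head] * 6)
scroll_trap_score = 100.0 * sigmoid(composite_logit)   # sigmoid x 100
dimension_probs = [sigmoid(z) for z in dim_logits]     # P(tactic present)
loss = joint_loss(composite_logit, dim_logits,
                  composite_target=50.0, severities=[0, 2, 0, 1, 0, 3])
```

With untrained (zero) heads, every output sits at the midpoint: a score of 50 and a probability of 0.5 per dimension, so the loss reduces to the BCE term alone.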
Intended use
Primary use case. Research / educational tool for analyzing short-form video content at the post level. Given a fused text stream from a single TikTok-style post (caption + audio transcript + on-screen text), return a severity score per manipulation dimension and an aggregate 0–100 composite.
Users this was built for. Trust & Safety practitioners, platform policy researchers, media literacy educators, and end users who want vocabulary for what a specific post is doing to their attention.
Not intended for.
- Individual creator moderation or takedowns. The model scores posts, not intent; using it to judge whether a specific creator is acting in bad faith would misread the labels.
- Demographic profiling of creators or audiences.
- Any high-stakes automated decision without human review.
- Content in languages or cultural contexts other than English-language, predominantly US/UK social-media discourse. Manipulation norms are culturally situated; applying the model outside its training distribution requires rubric reconstruction.
Training data
Total labeled corpus: 3,527 items.
| Source | Approx. size | Purpose |
|---|---|---|
| Webis Clickbait Corpus 2017 | ~2,000 | Pretraining-style signal; continuous severity |
| Stop Clickbait 2016 | ~1,500 | Weak supervision; binary clickbait |
| TikTok (yt-dlp scrape) | ~200 | In-domain evaluation + demo gallery |
Labeling — LLM-as-judge with human validation
Because existing datasets carry only binary clickbait labels, we used Claude Sonnet 4.5 (Anthropic) as a scalable labeling oracle, prompted with the 6-dimension rubric above (full text in repo docs/RUBRIC.md) and 8 few-shot examples per severity level. This approach is explicitly in the lineage of Constitutional AI / RLAIF (Bai et al. 2022) — an LLM prompted with human-written principles produces training labels for a smaller supervised model.
We validate Claude's labels against a 100-post human gold set hand-labeled by the author, reporting per-dimension Spearman rank correlation and Krippendorff's α (ordinal) as agreement metrics.
Agreement numbers will be added to the companion technical report once gold-set labeling completes.
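For readers who want to reproduce the agreement check on their own labels, here is a minimal sketch of per-dimension Spearman rank correlation using only the standard library. The severity values are made-up placeholders, not gold-set data, and the function names are hypothetical. Krippendorff's α is not shown.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    # Spearman's rho is the Pearson correlation of the rank vectors.
    return pearson(average_ranks(x), average_ranks(y))

# Placeholder ordinal severities for one dimension (not real data).
human_labels = [0, 1, 1, 2, 3, 0, 2]
llm_labels = [0, 1, 2, 2, 3, 1, 2]
rho = spearman(human_labels, llm_labels)
```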
Training
- Base model. distilbert-base-uncased (Sanh et al. 2019), 66M parameters.
- Fine-tuning. Full fine-tune (no layer freezing). Dual heads attached to the [CLS] pooled representation.
- Optimizer. AdamW, lr=2e-5, weight_decay=0.01.
- Schedule. Linear LR with 10% warmup, 4 epochs.
- Batch size. 32.
- Max sequence length. 256 tokens.
- Loss. MSE(composite) + 1.0 × BCEWithLogitsLoss(dimensions).
- Hardware. Single NVIDIA H100 via Duke Colab credits. Training completed in ~2 minutes.
- Checkpoint selection. Best epoch by validation composite MAE; saved state is from epoch 4 with val MAE=5.88.
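To make the schedule concrete, the sketch below computes a per-step learning rate for the hyperparameters listed above. It assumes that "linear LR" means linear warmup followed by linear decay to zero, and that the training split holds roughly 2,469 items (70% of 3,527); neither is confirmed here, and the function name is hypothetical.

```python
import math

def linear_warmup_decay_lr(step, total_steps, base_lr=2e-5, warmup_frac=0.10):
    """Ramp linearly from 0 to base_lr, then decay linearly back to 0."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

train_items = 2469                               # assumed size of the 70% split
steps_per_epoch = math.ceil(train_items / 32)    # batch size 32
total_steps = steps_per_epoch * 4                # 4 epochs
schedule = [linear_warmup_decay_lr(s, total_steps) for s in range(total_steps + 1)]
```

Under these assumptions the run is 312 optimizer steps long, with the peak learning rate reached at step 31.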
Reproducibility
- Full training notebook: notebooks/train_lucid.ipynb
- Training script (CPU fallback): scripts/train_deep.py
- Random seed: 42
- Splits: 70/15/15 stratified on composite-score bins
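A stratified 70/15/15 split on composite-score bins can be sketched as follows. The bin width of 20 points, the placeholder scores, and the function name are assumptions for illustration; the actual binning lives in the training notebook and script listed above.

```python
import random
from collections import defaultdict

def stratified_split(scores, bin_width=20, fractions=(0.70, 0.15, 0.15), seed=42):
    """Split item indices into train/val/test, preserving score-bin proportions."""
    rng = random.Random(seed)
    last_bin = 100 // bin_width - 1
    bins = defaultdict(list)
    for idx, score in enumerate(scores):
        # A score of exactly 100 is folded into the top bin.
        bins[min(int(score // bin_width), last_bin)].append(idx)
    train, val, test = [], [], []
    for key in sorted(bins):
        members = bins[key]
        rng.shuffle(members)
        n_train = round(len(members) * fractions[0])
        n_val = round(len(members) * fractions[1])
        train += members[:n_train]
        val += members[n_train:n_train + n_val]
        test += members[n_train + n_val:]   # remainder goes to test
    return train, val, test

# Placeholder composite scores in 0-100 (not real labels).
scores = [(i * 37) % 101 for i in range(200)]
train_idx, val_idx, test_idx = stratified_split(scores)
```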
Evaluation
Held-out test split of 529 items (stratified 15% of corpus).
Test-set metrics
| Metric | Value |
|---|---|
| Macro F1 (per-dim binary, threshold ≥1) | 0.334 |
| Macro accuracy (per-dim binary) | 0.904 |
| Composite MAE (0–100 scale) | 5.90 |
| Composite RMSE | 7.12 |
| Composite R² | +0.368 |
How to interpret
- Positive composite R² (+0.368) means the model explains real variance in the composite score beyond a constant mean predictor. For comparison, the naive keyword-matching baseline has R²=−0.594 and the classical (TF-IDF + XGBoost) baseline has R²=−1.462. Of the three, the deep model is the only one that beats the mean predictor.
- The macro F1 of 0.334 is lower than the classical baseline's 0.425. This reflects an intentional calibration difference: the deep model's per-dim probabilities are softer, producing fewer firings but better-calibrated confidences. See the technical report §6 for the full per-dimension breakdown.
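The metrics in the table follow their standard definitions, sketched below with toy numbers (not test-set data). The example shows why R² is zero for a constant mean predictor and turns negative when predictions are worse than the mean. Binary predictions are assumed to be already thresholded; the threshold applied to the model's probabilities is not stated in this card.

```python
import math

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def f1_binary(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0   # convention: F1 = 0 with no positives

def macro_f1(true_by_dim, pred_by_dim):
    # Unweighted mean of the per-dimension F1 scores.
    scores = [f1_binary(t, p) for t, p in zip(true_by_dim, pred_by_dim)]
    return sum(scores) / len(scores)

y_true = [10, 20, 30, 40]
r2_mean_predictor = r2(y_true, [25, 25, 25, 25])   # always predicts the mean
r2_worse_than_mean = r2(y_true, [40, 30, 20, 10])  # anti-correlated predictions
```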
Noise robustness
Character-level noise injection on 100 test items (seed=7), mean |Δ score| on the 0–100 composite:
| Noise rate | Mean Δ | Median Δ | Max Δ |
|---|---|---|---|
| 5% | 4.2 | 2.0 | 26 |
| 10% | 5.4 | 4.0 | 27 |
| 20% | 7.7 | 5.5 | 37 |
| 35% | 10.2 | 9.0 | 32 |
At realistic OCR / transcription noise levels (5–10%), the composite Scroll Trap Score shifts ~4–5 points on a 0–100 scale — graceful degradation, suggesting the model has learned semantic rather than surface-lexical features.
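A robustness probe of this kind can be sketched as below. The noise model (random substitution of non-whitespace characters with lowercase letters) is an assumption, since the card only says "character-level noise", and toy_score is a hypothetical stand-in for the real scorer. In practice score_fn would wrap something like predictor.predict(text).scroll_trap_score.

```python
import random
import string

def inject_char_noise(text, rate, rng):
    """Replace each non-whitespace character with a random letter at the given rate."""
    chars = list(text)
    for i, ch in enumerate(chars):
        if not ch.isspace() and rng.random() < rate:
            chars[i] = rng.choice(string.ascii_lowercase)
    return "".join(chars)

def mean_abs_delta(texts, score_fn, rate, seed=7):
    """Mean |score(noisy) - score(clean)| over a list of texts."""
    rng = random.Random(seed)
    deltas = [abs(score_fn(inject_char_noise(t, rate, rng)) - score_fn(t)) for t in texts]
    return sum(deltas) / len(deltas)

def toy_score(text):
    # Hypothetical scorer for demonstration only: 10 points per "!", capped at 100.
    return min(100, 10 * text.count("!"))

texts = ["DON'T SCROLL! HANG ON!", "Will you be my friend?"]
delta_at_10_percent = mean_abs_delta(texts, toy_score, rate=0.10)
```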
Usage
Via Hugging Face transformers
This model has a custom multi-output head (composite_head + dimension_head), so it cannot be loaded with AutoModelForSequenceClassification. Use the repo's inference module:
from backend.inference.deep import DeepPredictor
predictor = DeepPredictor(hf_repo="lindsaygross32/lucid-distilbert")
pred = predictor.predict("DON'T SCROLL! HANG ON! HANG ON!! I have one question...")
print(pred.scroll_trap_score)
# 28
print(pred.dimension_scores)
# {'outrage_bait': 0.11, 'fomo_trigger': 0.23, 'engagement_bait': 0.29,
# 'emotional_manipulation': 0.04, 'curiosity_gap': 0.68, 'dopamine_design': 0.25}
Per-dimension token attribution (Integrated Gradients)
pred, per_dim_tokens = predictor.explain(
"DON'T SCROLL! HANG ON! Will you be my friend?",
top_k=8,
)
# per_dim_tokens["engagement_bait"] -> [
# {"token": "you", "position": 9, "attribution": +0.34},
# {"token": "question", "position": 14, "attribution": +0.26},
# ...
# ]
Integrated Gradients (Sundararajan, Taly, Yan 2017) produces signed per-token attributions. Positive attribution → token pushes the head toward "tactic present," negative → toward absent.
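The mechanics of Integrated Gradients can be shown on a toy model. The sketch below is not the repo's explain() implementation: it uses a three-feature logistic unit with hand-written gradients, an all-zero baseline, and a midpoint Riemann sum, all chosen for illustration. In the real model the inputs would be token embeddings and the gradients would come from automatic differentiation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def model(x, w, bias):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + bias)

def grad(x, w, bias):
    # d sigmoid(w.x + bias) / dx_i = s * (1 - s) * w_i
    s = model(x, w, bias)
    return [s * (1.0 - s) * wi for wi in w]

def integrated_gradients(x, baseline, w, bias, steps=200):
    """Approximate IG_i = (x_i - x'_i) * integral of dF/dx_i along the straight path."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps   # midpoint rule
        point = [b0 + alpha * (xi - b0) for xi, b0 in zip(x, baseline)]
        total = [t + g for t, g in zip(total, grad(point, w, bias))]
    return [(xi - b0) * t / steps for xi, b0, t in zip(x, baseline, total)]

w, bias = [2.0, -1.0, 0.5], -0.5   # toy weights
x, baseline = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline, w, bias)
```

The attributions are signed (the feature with a negative weight receives a negative attribution) and they sum, up to integration error, to model(x) − model(baseline), which is the completeness property of the method.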
Live demo
https://lucid-seven-pied.vercel.app
Limitations and ethical considerations
Intent vs. effect. The model measures tactic presence, not creator intent. A post using emotional appeals to raise money for a sick family member scores higher on Emotional Manipulation — but that is not a judgment of bad faith. Any downstream tooling built on top of this model must preserve that distinction.
Cultural and linguistic scope. Training data is English-language, predominantly US-origin social content. Manipulation norms vary across cultures; the model should not be used on non-English content or in cultures with meaningfully different rhetorical conventions without rubric reconstruction.
Labeling source bias. Our labels come from a single LLM judge (Claude Sonnet 4.5) validated against a single human annotator. A world where many systems use the same LLM as judge risks correlated labeling errors. Multi-model, multi-annotator labeling would be the right long-term direction.
Small corpus. 3,527 total items is modest for a 6-way multi-label task. On new distributions, expect performance to vary more than the reported test metrics suggest.
Format–content confounds. The classical baseline over-fires on listicle-format text because training data (Stop Clickbait) conflates listicle format with clickbait manipulation. The deep model is more robust but the underlying confound is not fully eliminated.
Creator-level aggregation risk. This model scores posts. Rolling scores up to the creator level (e.g., "creator X's average Scroll Trap Score") creates harassment vectors and should not be done without additional review.
Not a safety classifier. This is an educational tool for surfacing rhetorical moves, not a hate-speech / harm detector. It explicitly says nothing about whether content is harmful or false.
Citation
If you use lucid-distilbert in academic work, please cite:
@misc{gross2026lucid,
title = {LUCID: Multimodal Detection of Short-Form Video Manipulation Tactics},
author = {Lindsay Gross},
year = {2026},
howpublished = {\url{https://github.com/lindsaygross/Lucid}},
note = {Duke AIPI 540 final project},
}
Academic grounding for the 6-dimension rubric is documented in full in docs/RUBRIC.md.
License
MIT. See the LICENSE in the repo.
Contact
Lindsay Gross — Duke AIPI, Spring 2026 — background in Trust & Safety.
Issues / collaboration: github.com/lindsaygross/Lucid/issues.