---
language: en
license: other
library_name: transformers
tags:
- text-classification
- distilbert
- podcast
- ad-detection
- skipr
datasets:
- custom
base_model: distilbert-base-uncased
pipeline_tag: text-classification
---

# Skipr Ad Classifier

Fine-tuned DistilBERT model that detects sponsor/ad segments in YouTube podcast transcript text.

**Also available:** [dkayaaaa/ad-classifier-quantised](https://huggingface.co/dkayaaaa/ad-classifier-quantised), an INT8 ONNX version (~64 MB) for faster, lighter inference with ONNX Runtime.

## Model description

This model classifies a short transcript window as either an ad/sponsor segment or normal podcast content. It was trained as part of the [Skipr](https://github.com/YOUR_ORG/skippy-model-training) pipeline for skipping sponsor segments in YouTube podcasts.

- **Architecture:** `distilbert-base-uncased`
- **Task:** Binary sequence classification
- **Labels:**
  - `0`: not an ad segment
  - `1`: ad/sponsor segment
- **Max sequence length:** 512 tokens
- **Training:** 3 epochs, fine-tuned from `distilbert-base-uncased`

## Intended use

Use this model to classify transcript windows (typically ~20 caption snippets) as ad vs non-ad content. It is designed for use in the Skipr browser extension and related inference services.

**Out of scope:**

- General sentiment or topic classification
- Non-English text (trained on English podcast transcripts)
- Full-video classification without segmentation

## Training data

The model was trained on a mix of real and synthetic transcript windows:

- **Base set (~800 samples):** weak-labeled YouTube podcast segments
  - Positive: segments matching sponsor keywords/brands
  - Negative: normal podcast content
- **Augmented set (~1,200 samples):** synthetic variants generated with Llama 8B via Ollama, preserving the original label.

The synthetic data was generated with these strategies: paraphrase, new scenario, style shift, fragment, and vocabulary shift.

The original labels are heuristic: the model learns from keyword-labeled examples. Synthetic data increases linguistic diversity but inherits the same label assumptions.

## Usage

### Transformers (Python)

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

repo = "dkayaaaa/ad-classifier"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo)

text = "this episode is brought to you by our friends at..."

# Tokenize one transcript window, truncating at the model's 512-token limit.
inputs = tokenizer(
    text,
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=512,
)

# Inference only, so skip gradient tracking.
with torch.no_grad():
    logits = model(**inputs).logits

# Label 1 = ad/sponsor segment, label 0 = not an ad segment.
prediction = logits.argmax(dim=-1).item()
print("ad" if prediction == 1 else "not ad")
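
# --- Optional follow-up (a sketch, not part of the original model card) ---
# argmax above yields a hard label. If you want to trade precision against
# recall, one option is to turn the two logits into a probability and compare
# it with a cutoff. The helper names `ad_probability` and `is_ad`, and the 0.5
# default threshold, are hypothetical choices made for illustration; the card
# does not specify a threshold.
import math


def ad_probability(logit_pair):
    """Softmax probability of label 1 (ad) given [logit_not_ad, logit_ad]."""
    not_ad, ad = logit_pair
    peak = max(not_ad, ad)  # subtract the max for numerical stability
    exp_not_ad = math.exp(not_ad - peak)
    exp_ad = math.exp(ad - peak)
    return exp_ad / (exp_not_ad + exp_ad)


def is_ad(logit_pair, threshold=0.5):
    """True when the ad probability reaches the (assumed) threshold."""
    return ad_probability(logit_pair) >= threshold


# With the tensors computed above, this could be called as:
#   is_ad(logits[0].tolist())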