	---
	language: en
	license: other
	library_name: transformers
	tags:
	- text-classification
	- distilbert
	- podcast
	- ad-detection
	- skipr
	datasets:
	- custom
	base_model: distilbert-base-uncased
	pipeline_tag: text-classification
	---

	# Skipr Ad Classifier

	Fine-tuned DistilBERT model that detects sponsor/ad segments in YouTube podcast transcript text.

	Also available: [dkayaaaa/ad-classifier-quantised](https://huggingface.co/dkayaaaa/ad-classifier-quantised) — INT8 ONNX version (~64 MB) for faster, lighter inference with ONNX Runtime.

	## Model description

	This model classifies a short transcript window as either an ad/sponsor segment or normal podcast content. It was trained as part of the [Skipr](https://github.com/YOUR_ORG/skippy-model-training) pipeline for skipping sponsor segments in YouTube podcasts.

	- Architecture: `distilbert-base-uncased`
	- Task: Binary sequence classification
	- Labels:
	  - `0`: not an ad segment
	  - `1`: ad/sponsor segment
	- Max sequence length: 512 tokens
	- Training: 3 epochs, fine-tuned from `distilbert-base-uncased`
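As a rough illustration of the label scheme above, the sketch below turns a pair of raw logits into a label and an ad probability using a softmax. The `ID2LABEL` strings and the `logits_to_prediction` helper are hypothetical names chosen for this example, and the logit values are made up; only the `0`/`1` meaning comes from this card.

```python
import math

# Label ids as listed in the model description; the strings are illustrative.
ID2LABEL = {0: "not ad", 1: "ad"}


def logits_to_prediction(logits):
    """Convert two raw logits into (label, probability of the ad class)."""
    # Subtract the max logit before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Pick the index with the highest probability (same result as argmax on logits).
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[1]


# Made-up logits, in the order [not ad, ad].
label, p_ad = logits_to_prediction([-1.2, 2.3])
print(label, round(p_ad, 3))
```

Working with the probability rather than the bare argmax makes it possible to apply a stricter threshold before skipping a segment, if false positives are a concern.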

	## Intended use

	Use this model to classify transcript windows (typically ~20 caption snippets) as ad vs non-ad content. It is designed for use in the Skipr browser extension and related inference services.
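One way to build such windows from a list of caption snippets is sketched below. Only the window size of about 20 snippets comes from this card; the `make_windows` helper, the overlapping stride of 10, and joining snippets with spaces are assumptions made for illustration, not a description of how Skipr segments transcripts.

```python
def make_windows(snippets, window_size=20, stride=10):
    """Group caption snippets into overlapping text windows.

    Returns a list of (start_index, window_text) tuples.
    """
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")

    windows = []
    for start in range(0, len(snippets), stride):
        chunk = snippets[start:start + window_size]
        # Join the snippets into one string for the tokenizer.
        windows.append((start, " ".join(s.strip() for s in chunk)))
        # Stop once a window has reached the end of the transcript.
        if start + window_size >= len(snippets):
            break
    return windows


# Toy captions standing in for real transcript snippets.
captions = [f"snippet{i}" for i in range(45)]
for start, text in make_windows(captions):
    print(start, len(text.split()))
```

Each window's text can then be passed to the classifier on its own, and the start index maps a prediction back to a position in the transcript.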

	Out of scope:
	- General sentiment or topic classification
	- Non-English text (trained on English podcast transcripts)
	- Full-video classification without segmentation

	## Training data

	The model was trained on a mix of real and synthetic transcript windows:

	- Base set (~800 samples): weakly labeled YouTube podcast segments
	  - Positive: segments matching sponsor keywords/brands
	  - Negative: normal podcast content
	- Augmented set (~1,200 samples): synthetic variants generated with Llama 8B via Ollama, preserving the original label. The variants were generated using these strategies: paraphrase, new scenario, style shift, fragment, and vocabulary shift.

	Original labels are heuristic — the model learns from keyword-labeled examples. Synthetic data increases linguistic diversity but inherits the same label assumptions.
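To make the weak-labeling idea concrete, here is a minimal sketch of a keyword-based labeler. The actual keyword and brand list used for this model is not given in this card, so the patterns in `SPONSOR_PATTERNS` and the `weak_label` helper are purely hypothetical.

```python
import re

# Hypothetical cue phrases; the real keyword/brand list is not documented here.
SPONSOR_PATTERNS = [
    r"\bsponsored by\b",
    r"\bbrought to you by\b",
    r"\bpromo code\b",
    r"\buse code\b",
]
_SPONSOR_RE = re.compile("|".join(SPONSOR_PATTERNS), flags=re.IGNORECASE)


def weak_label(window_text):
    """Return 1 if any sponsor cue appears in the window, else 0."""
    return 1 if _SPONSOR_RE.search(window_text) else 0


print(weak_label("This episode is brought to you by our friends at..."))
print(weak_label("so anyway, back to the interview"))
```

A heuristic of this kind misses ads that avoid the listed cues and flags ordinary content that happens to contain them, which is the label noise the paragraph above refers to.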

	## Usage

	### Transformers (Python)

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch

	repo = "dkayaaaa/ad-classifier"
	tokenizer = AutoTokenizer.from_pretrained(repo)
	model = AutoModelForSequenceClassification.from_pretrained(repo)

	text = "this episode is brought to you by our friends at..."
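	# Tokenize one transcript window, padded/truncated to the 512-token limit.
	inputs = tokenizer(
	    text,
	    return_tensors="pt",
	    padding="max_length",
	    truncation=True,
	    max_length=512,
	)

	# Inference only, so gradient tracking is disabled.
	with torch.no_grad():
	    logits = model(**inputs).logits
	# Label 1 = ad/sponsor segment, label 0 = not an ad segment.
	prediction = logits.argmax(dim=-1).item()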

	print("ad" if prediction == 1 else "not ad")