	---
	language:
	- multilingual
	base_model: intfloat/multilingual-e5-small
	pipeline_tag: text-classification
	---

	# feed-classifier

	A multilingual binary classifier that scores Bluesky posts by feed-worthiness. It is fine-tuned from `intfloat/multilingual-e5-small` with a sequence-classification head.

	## Usage

	```python
	import torch
	import torch.nn.functional as F
	from transformers import AutoTokenizer, AutoModelForSequenceClassification

	model_id = "Circularmachines/atproto_classifier"
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForSequenceClassification.from_pretrained(model_id)
	model.eval()

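	# The model expects the e5-style "passage: " prefix on every input.
	texts = ["passage: some post text here"]
	inputs = tokenizer(texts, return_tensors="pt", truncation=True, padding=True, max_length=512)

	# Inference only, so disable gradient tracking.
	with torch.no_grad():
	    probs = F.softmax(model(**inputs).logits, dim=-1)

	score = probs[0][1].item()  # P(feed-worthy), class index 1
	label = int(score > 0.5)    # 1 = feed-worthy, 0 = not feed-worthy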
	```
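
When posts come from several sources, the `passage: ` prefix is easy to omit or to apply twice. The following is a minimal sketch of input preparation in plain Python. The helper names `with_passage_prefix` and `batched` are hypothetical and not part of this repository, and the batch size of 32 is an arbitrary assumption.

```python
from typing import Iterable, Iterator, List

PREFIX = "passage: "  # e5-style input prefix described in this model card


def with_passage_prefix(texts: Iterable[str]) -> List[str]:
    """Prepend the prefix to each text unless it is already present."""
    prepared = []
    for text in texts:
        text = text.strip()
        prepared.append(text if text.startswith(PREFIX) else PREFIX + text)
    return prepared


def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical example posts, one of them already prefixed.
posts = ["some post text here", "passage: already prefixed"]
for batch in batched(with_passage_prefix(posts), batch_size=32):
    print(batch)
```

Each yielded batch can be passed as `texts` to the tokenizer call shown in the usage example.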

	## Training

	- Base model: `intfloat/multilingual-e5-small`
	- Architecture: `BertForSequenceClassification` (2 classes: not feed-worthy / feed-worthy)
	- Input prefix: `passage: {text}` (matches e5 training convention)
	- Training data: LLM-inferred labels via a DSPy-optimized Qwen classifier
	- Validation: Human-labeled Bluesky posts (held out, never used in training)
	- Labels: 0 = not feed-worthy, 1 = feed-worthy
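
The held-out human labels can be compared against model scores with standard binary metrics. The sketch below assumes scores are P(feed-worthy) values and applies the same `score > 0.5` rule as the usage example. The function name `binary_metrics` is hypothetical, and the numbers are toy values for illustration only, not results from this model.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], scores: Sequence[float],
                   threshold: float = 0.5) -> Dict[str, float]:
    """Compute accuracy, precision, recall and F1 for the positive class (1)."""
    if len(y_true) != len(scores) or len(y_true) == 0:
        raise ValueError("y_true and scores must be non-empty and equal in length")

    # Same decision rule as the usage example: label 1 when score > threshold.
    y_pred = [int(s > threshold) for s in scores]

    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    return {
        "accuracy": correct / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy values for illustration only; not measurements from this model.
human_labels = [1, 0, 1, 0]
model_scores = [0.9, 0.2, 0.4, 0.7]
print(binary_metrics(human_labels, model_scores))
```

Raising or lowering `threshold` trades precision against recall, which may matter if the feed should favor fewer but more reliable picks.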