---
language:
- multilingual
base_model: intfloat/multilingual-e5-small
pipeline_tag: text-classification
---
# feed-classifier
A multilingual feed-value classifier, fine-tuned from `intfloat/multilingual-e5-small` with a classification head, that scores Bluesky posts by feed-worthiness.
## Usage
```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model_id = "Circularmachines/atproto_classifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()
texts = ["passage: some post text here"]
inputs = tokenizer(texts, return_tensors="pt", truncation=True, padding=True, max_length=512)
with torch.no_grad():
    probs = F.softmax(model(**inputs).logits, dim=-1)
score = probs[0][1].item()  # P(feed-worthy) for the first text
label = int(score > 0.5)    # 1 = feed-worthy, 0 = not feed-worthy
```
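Inputs need the `passage: ` prefix before tokenization. The helper below is a minimal sketch of one way to add it and to split a larger list of posts into batches. The names `add_passage_prefix` and `batched`, and the default batch size of 32, are hypothetical and not part of this repository.

```python
from typing import Iterator, List

PREFIX = "passage: "  # input prefix expected by the model


def add_passage_prefix(text: str) -> str:
    """Prepend the passage prefix unless the text already starts with it."""
    text = text.strip()
    return text if text.startswith(PREFIX) else PREFIX + text


def batched(texts: List[str], batch_size: int = 32) -> Iterator[List[str]]:
    """Yield prefixed texts in lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(texts), batch_size):
        yield [add_passage_prefix(t) for t in texts[start:start + batch_size]]
```

Each yielded batch can be passed to the tokenizer call from the snippet above in place of `texts`.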
## Training
- **Base model**: `intfloat/multilingual-e5-small`
- **Architecture**: `BertForSequenceClassification` (2 classes: not feed-worthy / feed-worthy)
- **Input prefix**: `passage: {text}` (matches the E5 input-prefix convention)
- **Training data**: LLM-inferred labels via a DSPy-optimized Qwen classifier
- **Validation**: Human-labeled Bluesky posts (held out, never used in training)
- **Labels**: 0 = not feed-worthy, 1 = feed-worthy
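
This card does not state which metrics were used on the human-labeled validation set. As a rough sketch, assuming precision, recall, and F1 on the feed-worthy class are the quantities of interest, model scores could be compared with human labels as follows. The function name `binary_metrics` is hypothetical, and the 0.5 threshold simply mirrors the usage snippet.

```python
from typing import Dict, Sequence


def binary_metrics(
    y_true: Sequence[int],
    scores: Sequence[float],
    threshold: float = 0.5,
) -> Dict[str, float]:
    """Precision, recall, and F1 for label 1 (feed-worthy)."""
    if len(y_true) != len(scores):
        raise ValueError("y_true and scores must have the same length")
    # Same decision rule as the usage snippet: score > threshold means label 1.
    preds = [int(s > threshold) for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Here `y_true` would hold the human labels (0 or 1) and `scores` the `P(feed-worthy)` values produced by the model for the same posts.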