---
language:
- multilingual
base_model: intfloat/multilingual-e5-small
pipeline_tag: text-classification
---

# feed-classifier

A multilingual feed-value classifier, fine-tuned from `intfloat/multilingual-e5-small` with a classification head, that scores Bluesky posts by feed-worthiness.

## Usage

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "Circularmachines/atproto_classifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()  # disable dropout for inference

# Inputs carry the e5 "passage: " prefix, as in training
texts = ["passage: some post text here"]
inputs = tokenizer(texts, return_tensors="pt", truncation=True, padding=True, max_length=512)

with torch.no_grad():
    probs = F.softmax(model(**inputs).logits, dim=-1)

score = probs[0][1].item()  # P(feed-worthy)
label = int(score > 0.5)  # 1 = feed-worthy, 0 = not feed-worthy
```
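
The two steps around the model call, adding the `passage: ` prefix and turning the two logits into a score, can be sketched in plain Python. This is a minimal illustration, not part of the released code: the helper names `with_passage_prefix` and `feed_worthy_probability` are hypothetical, and the logits are made up.

```python
import math

PREFIX = "passage: "  # input prefix the model card lists for this model


def with_passage_prefix(texts):
    """Prepend the prefix to each text unless it is already present (hypothetical helper)."""
    return [t if t.startswith(PREFIX) else PREFIX + t.strip() for t in texts]


def feed_worthy_probability(logits):
    """Softmax over a [not feed-worthy, feed-worthy] logit pair; returns P(class 1)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)


texts = with_passage_prefix(["some post text here", "passage: already prefixed"])
score = feed_worthy_probability([-1.0, 1.0])  # made-up logits for illustration
label = int(score > 0.5)  # same 0.5 cut-off as the usage snippet
```

With two classes, the softmax score equals a sigmoid of the logit difference, so raising the 0.5 cut-off is equivalent to requiring a larger gap between the two logits.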

## Training

- **Base model**: `intfloat/multilingual-e5-small`
- **Architecture**: `BertForSequenceClassification` (2 classes: not feed-worthy / feed-worthy)
- **Input prefix**: `passage: {text}` (matches the e5 training convention)
- **Training data**: LLM-inferred labels from a DSPy-optimized Qwen classifier
- **Validation**: Human-labeled Bluesky posts (held out, never used in training)
- **Labels**: 0 = not feed-worthy, 1 = feed-worthy

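
One way to compare model scores against a human-labeled validation set is to compute precision, recall, and F1 at a chosen threshold. The sketch below is an assumption about how such a check could look; the card does not state which metrics were used, the function name `precision_recall_f1` is hypothetical, and the scores and labels are made up for illustration.

```python
def precision_recall_f1(scores, labels, threshold=0.5):
    """Binary metrics for class 1 (feed-worthy) at a given score threshold."""
    preds = [int(s > threshold) for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up scores and human labels, not real validation data.
scores = [0.9, 0.8, 0.4, 0.2]
labels = [1, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(scores, labels)
```

Sweeping `threshold` over a range of values with the same function shows how precision trades off against recall before a cut-off other than 0.5 is adopted.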