WeReCooking/ModernBERT-risk-classifier

Nekochu committed 7 days ago

Commit a5609f8 (verified) · 1 parent: 3678a0e

Create README.md

Files changed (1)

README.md +113 -0

	@@ -0,0 +1,113 @@

+---
+license: apache-2.0
+base_model: answerdotai/ModernBERT-base
+library_name: transformers
+pipeline_tag: text-classification
+tags:
+- modernbert
+- text-classification
+- spam-detection
+- automation-detection
+- long-context
+- pytorch
+- safetensors
+language:
+- en
+metrics:
+- f1
+- precision
+- recall
+---
+# raga
+A tiny spicy ModernBERT classifier for text-risk signals.
+> Potato did not write a README, so this appeared by magic!
+## What does it classify?
+Probably text / account-behavior risk labels, inferred from the eval table:
+- `transactional_spam` — spammy transactional or promo-style content
+- `extractive_presence` — likely copy/extraction/presence-pattern signal
+- `engagement_automation` — botty engagement / automated interaction signal
+- `account_farming` — account-growth or farming behavior signal
+Exact label semantics depend on the training data.
+## Model
+- Base: `answerdotai/ModernBERT-base`
+- Type: ModernBERT sequence classifier
+- Context: up to 8,192 tokens
+- Best for: classification, moderation-ish filters, long text scoring
+## Eval snapshot
+| Label | F1 | Precision | Recall | Notes |
+|---|---:|---:|---:|---|
+| `transactional_spam` | 0.94 | 0.89 | 0.99 | 🟢 Excellent |
+| `extractive_presence` | 0.84 | 0.73 | 0.99 | 🟢 Great recall |
+| `engagement_automation` | 0.65 | 0.53 | 0.85 | 🟡 Precision weak |
+| `account_farming` | 0.62 | 0.61 | 0.63 | 🟡 Hardest label |
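As a quick sanity check, the F1 column can be recomputed from the precision and recall columns with the harmonic-mean formula F1 = 2PR / (P + R). The sketch below is only illustrative: `f1_score` and `table` are hypothetical names, and the numbers are copied from the eval table above.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# label -> (precision, recall, reported F1), copied from the eval table
table = {
    "transactional_spam": (0.89, 0.99, 0.94),
    "extractive_presence": (0.73, 0.99, 0.84),
    "engagement_automation": (0.53, 0.85, 0.65),
    "account_farming": (0.61, 0.63, 0.62),
}

# Recompute F1 to two decimals and compare against the reported value
recomputed = {
    label: round(f1_score(p, r), 2) for label, (p, r, _) in table.items()
}
for label, (p, r, reported) in table.items():
    print(f"{label}: reported={reported:.2f} recomputed={recomputed[label]:.2f}")
```

All four reported F1 values agree with the recomputed ones at two decimals, so the table is internally consistent.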
+## Install
+```bash
+# accelerate is required for device_map="auto" in the inference snippet
+pip install -U "transformers>=4.48.0" torch accelerate
+```
+Optional GPU speedup:
+```bash
+pip install flash-attn
+```
+## Inference
+```python
+import torch
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+model_id = "WeReCooking/raga"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForSequenceClassification.from_pretrained(
+    model_id,
+    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else None,
+    device_map="auto" if torch.cuda.is_available() else None,
+    # attn_implementation="flash_attention_2",  # optional, if installed
+)
+text = "paste text to classify here"
+inputs = tokenizer(
+    text,
+    return_tensors="pt",
+    truncation=True,
+    max_length=getattr(model.config, "max_position_embeddings", 8192),
+)
+# ModernBERT does not need token_type_ids
+inputs.pop("token_type_ids", None)
+inputs = {k: v.to(model.device) for k, v in inputs.items()}
+with torch.no_grad():
+    logits = model(**inputs).logits[0].float()
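+# Normalize id2label keys to int in case they were loaded as strings
+id2label = {int(k): v for k, v in model.config.id2label.items()}
+# Multi-label heads score each label independently (sigmoid);
+# single-label heads make the labels compete (softmax)
+multi = getattr(model.config, "problem_type", None) == "multi_label_classification"
+scores = torch.sigmoid(logits) if multi else torch.softmax(logits, dim=-1)
+# Print labels from highest to lowest score
+for i, score in sorted(enumerate(scores.tolist()), key=lambda x: x[1], reverse=True):
+    print(f"{id2label.get(i, str(i))}: {score:.4f}")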
+```
+## Notes
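+If the model runs in multi-label mode (sigmoid scores), start with a threshold of `0.50` for every label, then tune it per label on held-out data.
+`transactional_spam` looks strong (F1 0.94, recall 0.99).
+`engagement_automation` (precision 0.53) and `account_farming` (F1 0.62) probably need calibration before serious use.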
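One way to tune a per-label threshold is to sweep a grid of candidate values on a held-out validation set and keep the one with the best F1. The sketch below is a minimal, dependency-free illustration of that idea, not the method used to train or evaluate this model. The helpers `precision_recall_f1` and `best_threshold` are hypothetical, and the validation scores and labels are made up.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for one binary label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def best_threshold(scores, y_true, grid=None):
    """Return (threshold, f1) with the highest F1; ties keep the lowest threshold."""
    grid = grid or [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        y_pred = [s >= t for s in scores]
        _, _, f1 = precision_recall_f1(y_true, y_pred)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Made-up validation scores and gold labels for a single label
val_scores = [0.10, 0.35, 0.55, 0.60, 0.72, 0.80, 0.91]
val_labels = [0, 0, 0, 1, 0, 1, 1]

threshold, f1 = best_threshold(val_scores, val_labels)
_, _, f1_at_default = precision_recall_f1(
    val_labels, [s >= 0.5 for s in val_scores]
)
print(f"default 0.50 -> F1 {f1_at_default:.3f}; tuned {threshold:.2f} -> F1 {f1:.3f}")
```

Run this once per label, and pick thresholds on validation data that is separate from the data used to report final metrics, otherwise the tuned numbers will look better than they are.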