---
license: apache-2.0
base_model: answerdotai/ModernBERT-base
library_name: transformers
pipeline_tag: text-classification
tags:
- modernbert
- text-classification
- spam-detection
- automation-detection
- long-context
- pytorch
- safetensors
language:
- en
metrics:
- f1
- precision
- recall
---

# raga - Mahoraga from anime (his ability is to adapt to nature itself)

A tiny, spicy ModernBERT classifier for text-risk signals, made by @PotatoOff.

> Potato did not write a README, so this appeared by magic!

## What does it classify?

Text / account-behavior risk labels, inferred from the eval table:

- `transactional_spam` — spammy transactional or promo-style content
- `extractive_presence` — likely copy/extraction/presence-pattern signal
- `engagement_automation` — botty engagement / automated interaction signal
- `account_farming` — account-growth or farming behavior signal

Exact label semantics depend on the training data.

## Model

- Base: `answerdotai/ModernBERT-base`
- Type: ModernBERT sequence classifier
- Context: up to 8,192 tokens
- Best for: classification, moderation-style filters, long-text scoring

## Eval snapshot

| Label | F1 | Precision | Recall | Notes |
|---|---:|---:|---:|---|
| `transactional_spam` | 0.94 | 0.89 | 0.99 | 🟢 Excellent |
| `extractive_presence` | 0.84 | 0.73 | 0.99 | 🟢 Great recall |
| `engagement_automation` | 0.65 | 0.53 | 0.85 | 🟡 Precision weak |
| `account_farming` | 0.62 | 0.61 | 0.63 | 🟡 Hardest label |

## Install

```bash
pip install -U "transformers>=4.48.0" torch
```

Optional GPU speedup:

```bash
pip install flash-attn
```

## Inference

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "WeReCooking/raga"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else None,
    device_map="auto" if torch.cuda.is_available() else None,
    # attn_implementation="flash_attention_2",  # optional, if flash-attn is installed
)

text = "paste text to classify here"
inputs = tokenizer(
    text,
    return_tensors="pt",
    truncation=True,
    max_length=getattr(model.config, "max_position_embeddings", 8192),
)

# ModernBERT does not use token_type_ids
inputs.pop("token_type_ids", None)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    logits = model(**inputs).logits[0].float()

id2label = {int(k): v for k, v in model.config.id2label.items()}
multi = getattr(model.config, "problem_type", None) == "multi_label_classification"
scores = torch.sigmoid(logits) if multi else torch.softmax(logits, dim=-1)

for i, score in sorted(enumerate(scores.tolist()), key=lambda x: x[1], reverse=True):
    print(f"{id2label.get(i, str(i))}: {score:.4f}")
```

## Notes

For multi-label use, start with a `0.50` threshold on every label, then tune per label. `transactional_spam` looks strong out of the box; `engagement_automation` and `account_farming` probably need calibration before serious use.
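The per-label tuning above can be sketched as a thin post-processing step over the sigmoid scores. The threshold values and the `apply_thresholds` helper below are illustrative assumptions (raising cutoffs on the low-precision labels to trade recall for precision), not part of the released model:

```python
# Hypothetical per-label thresholds -- tune these on a validation set.
# Start from 0.50 everywhere, then raise cutoffs on low-precision labels.
thresholds = {
    "transactional_spam": 0.50,
    "extractive_presence": 0.50,
    "engagement_automation": 0.70,  # assumption: stricter cutoff, precision is weak here
    "account_farming": 0.70,        # assumption: stricter cutoff, hardest label
}

def apply_thresholds(scores: dict[str, float], thresholds: dict[str, float]) -> dict[str, bool]:
    """Map label -> fired decision, using a per-label cutoff (0.50 fallback)."""
    return {label: score >= thresholds.get(label, 0.50) for label, score in scores.items()}

# Example scores as produced by the multi-label (sigmoid) path above.
example_scores = {
    "transactional_spam": 0.91,
    "extractive_presence": 0.12,
    "engagement_automation": 0.55,
    "account_farming": 0.48,
}
print(apply_thresholds(example_scores, thresholds))
```

With the example cutoffs, an `engagement_automation` score of 0.55 no longer fires, which is the point of per-label calibration.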