Nekochu committed
Commit a5609f8 · verified · 1 Parent(s): 3678a0e

Create README.md

Files changed (1)
  1. README.md +113 -0
README.md ADDED
@@ -0,0 +1,113 @@
+ ---
+ license: apache-2.0
+ base_model: answerdotai/ModernBERT-base
+ library_name: transformers
+ pipeline_tag: text-classification
+ tags:
+ - modernbert
+ - text-classification
+ - spam-detection
+ - automation-detection
+ - long-context
+ - pytorch
+ - safetensors
+ language:
+ - en
+ metrics:
+ - f1
+ - precision
+ - recall
+ ---
+
+ # raga
+
+ A tiny spicy ModernBERT classifier for text-risk signals.
+
+ > Potato did not write a README, so this appeared by magic!
+
+ ## What does it classify?
+
+ Probably text / account-behavior risk labels, inferred from the eval table:
+
+ - `transactional_spam`: spammy transactional or promo-style content
+ - `extractive_presence`: likely copy/extraction/presence-pattern signal
+ - `engagement_automation`: botty engagement / automated interaction signal
+ - `account_farming`: account-growth or farming behavior signal
+
+ Exact label semantics depend on the training data.
+
+ ## Model
+
+ - Base: `answerdotai/ModernBERT-base`
+ - Type: ModernBERT sequence classifier
+ - Context: up to 8,192 tokens
+ - Best for: classification, moderation-ish filters, long text scoring
+
+ ## Eval snapshot
+
+ | Label | F1 | Precision | Recall | Notes |
+ |---|---:|---:|---:|---|
+ | `transactional_spam` | 0.94 | 0.89 | 0.99 | 🟢 Excellent |
+ | `extractive_presence` | 0.84 | 0.73 | 0.99 | 🟢 Great recall |
+ | `engagement_automation` | 0.65 | 0.53 | 0.85 | 🟡 Precision weak |
+ | `account_farming` | 0.62 | 0.61 | 0.63 | 🟡 Hardest label |
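
As a quick sanity check on the table above, F1 is the harmonic mean of precision and recall, so each F1 column entry can be recomputed from the other two. The sketch below does that with the rounded values copied from the table; `f1_from` is a helper name made up for this example, and because the inputs are already rounded to two decimals, the recomputed value is only expected to agree to about two decimals.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# (precision, recall, reported F1), copied from the eval table
reported = {
    "transactional_spam": (0.89, 0.99, 0.94),
    "extractive_presence": (0.73, 0.99, 0.84),
    "engagement_automation": (0.53, 0.85, 0.65),
    "account_farming": (0.61, 0.63, 0.62),
}

for label, (p, r, f1) in reported.items():
    print(f"{label}: recomputed F1 = {f1_from(p, r):.2f} (reported {f1:.2f})")
```

All four rows agree with the reported F1 after rounding, which suggests the table is internally consistent.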
+
+ ## Install
+
+ ```bash
+ pip install -U "transformers>=4.48.0" torch accelerate
+ ```
+
+ `accelerate` is needed because the inference snippet passes `device_map="auto"` on GPU.
+
+ Optional GPU speedup:
+
+ ```bash
+ pip install flash-attn
+ ```
+
+ ## Inference
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+ model_id = "WeReCooking/raga"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ # Use bfloat16 and automatic device placement only when a GPU is available.
+ model = AutoModelForSequenceClassification.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16 if torch.cuda.is_available() else None,
+     device_map="auto" if torch.cuda.is_available() else None,  # requires accelerate
+     # attn_implementation="flash_attention_2",  # optional, if installed
+ )
+
+ text = "paste text to classify here"
+
+ # Truncate to the model's context window (8,192 tokens for ModernBERT-base).
+ inputs = tokenizer(
+     text,
+     return_tensors="pt",
+     truncation=True,
+     max_length=getattr(model.config, "max_position_embeddings", 8192),
+ )
+
+ # ModernBERT does not need token_type_ids
+ inputs.pop("token_type_ids", None)
+ inputs = {k: v.to(model.device) for k, v in inputs.items()}
+
+ with torch.no_grad():
+     logits = model(**inputs).logits[0].float()
+
+ id2label = {int(k): v for k, v in model.config.id2label.items()}
+ multi = getattr(model.config, "problem_type", None) == "multi_label_classification"
+
+ # Multi-label heads get independent sigmoid scores; otherwise softmax over labels.
+ scores = torch.sigmoid(logits) if multi else torch.softmax(logits, dim=-1)
+
+ # Print labels from highest to lowest score.
+ for i, score in sorted(enumerate(scores.tolist()), key=lambda x: x[1], reverse=True):
+     print(f"{id2label.get(i, str(i))}: {score:.4f}")
+ ```
+
+ ## Notes
+
+ Use threshold `0.50` for multi-label as a starting point, then tune per label.
+ `transactional_spam` looks strong (F1 0.94).
+ `engagement_automation` (precision 0.53) and `account_farming` (F1 0.62) probably need calibration before serious use.
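
One way to "tune per label", sketched under assumptions: given sigmoid scores and 0/1 ground truth for a single label on a held-out validation set, sweep a grid of thresholds and keep the one with the best F1. The helper names `f1_at` and `best_threshold`, the candidate grid, and the toy scores below are all made up for illustration; maximizing F1 is just one possible objective, and a precision or recall target may suit a moderation filter better.

```python
from typing import Optional, Sequence


def f1_at(scores: Sequence[float], truth: Sequence[int], threshold: float) -> float:
    """F1 for one label when scores >= threshold are treated as positive."""
    tp = sum(1 for s, y in zip(scores, truth) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, truth) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, truth) if s < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(
    scores: Sequence[float],
    truth: Sequence[int],
    candidates: Optional[Sequence[float]] = None,
) -> float:
    """Return the candidate threshold with the highest F1 (lowest one on ties)."""
    if candidates is None:
        candidates = [i / 100 for i in range(5, 100, 5)]  # 0.05, 0.10, ..., 0.95
    return max(candidates, key=lambda t: f1_at(scores, truth, t))


# Toy validation data for ONE label (invented, not from the model).
scores = [0.95, 0.88, 0.74, 0.69, 0.61, 0.55, 0.40, 0.20]
truth = [1, 1, 1, 0, 0, 0, 1, 0]

tuned = best_threshold(scores, truth)
print(f"F1 at 0.50: {f1_at(scores, truth, 0.50):.3f}")
print(f"tuned threshold: {tuned:.2f}, F1: {f1_at(scores, truth, tuned):.3f}")
```

On this toy data the default `0.50` lets three false positives through, while a higher threshold trades one missed positive for much better precision. Running the same sweep separately for each label yields a per-label threshold table.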