NiklasKoch committed on
Commit 1b9f350 · verified · 1 Parent(s): e23e6ce

Upload folder using huggingface_hub

Files changed (3):
  1. README.md +164 -0
  2. adapter_config.json +48 -0
  3. adapter_model.safetensors +3 -0
README.md ADDED
@@ -0,0 +1,164 @@
---
base_model: answerdotai/ModernBERT-base
library_name: peft
tags:
- text-classification
- reddit
- conversation-analysis
- constructive-dialogue
- modernbert
- lora
- transformers
- lightweight
- high-throughput
language:
- en
datasets:
- reddit
pipeline_tag: text-classification
---

# ModernBERT Reddit Discussion Classifier

A lightweight, high-throughput ModernBERT-based model for classifying constructive versus non-constructive conversations in online forums such as Reddit. Optimized for efficiently processing large volumes of Reddit discussion data.

## Model Description

This model is a QLoRA (Quantized LoRA) fine-tuned version of `answerdotai/ModernBERT-base`, designed as a **lightweight** solution for large-scale Reddit discussion analysis.

- **Model Type**: Text Classification (Binary)
- **Base Model**: answerdotai/ModernBERT-base
- **Training Method**: QLoRA with self-training
- **Task**: Binary classification of conversation constructiveness
- **Language**: English

## Intended Uses

### Primary Use Cases
- Classifying Reddit discussions as constructive or non-constructive
- Content moderation assistance
- Large-scale conversation quality analysis
- Social media research

### Direct Use
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel
import torch

# Load base model and tokenizer
base_model_name = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    base_model_name,
    num_labels=2
)

# Load the fine-tuned adapters
model = PeftModel.from_pretrained(model, "NiklasKoch/modernbert-discussion-classifier")
model.eval()

# Classify text (optimized for batch processing)
def classify_text(text):
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        padding=True,
        max_length=4096
    )

    # Move inputs to the same device as the model (important for GPU usage)
    inputs = {k: v.to(next(model.parameters()).device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

    # 0 = non-constructive, 1 = constructive
    predicted_class = torch.argmax(predictions, dim=-1).item()
    confidence = predictions[0][predicted_class].item()

    return {
        'class': 'constructive' if predicted_class == 1 else 'non-constructive',
        'confidence': confidence,
        'scores': {
            'non-constructive': predictions[0][0].item(),
            'constructive': predictions[0][1].item()
        }
    }

# Example usage: a flattened Reddit discussion
text = "[author0] LEGO: What do you think you're doing?!? [author1] I don't get it did he reveal bionicle reboot or smthn? [author2] Not really, he did announce something but was super vague, seems like a sort of passion project we wants to do with the community, he even said it might not even be bionicle. [author1] So is that image fan made or is it one of his passion projects [author2] Those pictures are real and on his insta, he did a stream talking about it I'm sure you can find somewhere, search up Fabre bionicle stream 2020 or something. [author1] OK thanks"
result = classify_text(text)
print(result)
```

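The example input flattens a multi-author thread into `[authorN]`-tagged turns. A minimal helper for producing that format from a list of `(author, text)` turns could look like the sketch below; `flatten_thread` is a hypothetical name, not part of this repository:

```python
def flatten_thread(turns):
    """Join (author, text) turns into a single '[authorN] text ...' string,
    numbering authors in order of first appearance."""
    index = {}   # author name -> stable authorN index
    parts = []
    for author, text in turns:
        if author not in index:
            index[author] = len(index)
        parts.append(f"[author{index[author]}] {text}")
    return " ".join(parts)

thread = [
    ("user_a", "What do you think you're doing?!?"),
    ("user_b", "I don't get it, what was announced?"),
    ("user_a", "Not much, he was super vague."),
]
print(flatten_thread(thread))
# [author0] What do you think you're doing?!? [author1] I don't get it, what was announced? [author0] Not much, he was super vague.
```

Numbering by first appearance keeps the same speaker mapped to the same tag across the whole thread, matching the example above.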
## Training Details

### Training Data
- **Source**: https://archive.org/download/pushshift_reddit_200506_to_202212/
- **Size**: ~1.4 million Reddit threads, filtered for English language and a minimum of 2 authors
- **Labels**: Binary (constructive/non-constructive conversations)
- **Additional Data**: YNACC and IAC datasets for initial supervised training

### Training Procedure
- **Training Method**: Self-training
- **Quantization**: 4-bit QLoRA for efficiency
- **LoRA Config**:
  - `r`: 16
  - `lora_alpha`: 32
  - `lora_dropout`: 0.1
  - Target modules: `Wqkv`, `Wo`, `Wi`, `dense`
- **Loss Function**: Focal Loss with class weighting
- **Max Sequence Length**: 4096 tokens
- **Batch Size**: 64
- **Learning Rate**: 2e-6

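The training script itself is not included in this repository. For reference, focal loss with class weighting along the lines described above is commonly implemented like this (a sketch, not the authors' exact code; the `gamma=2.0` default is an assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    """Cross-entropy scaled by (1 - p_t)^gamma so easy examples contribute less.
    `alpha` is an optional per-class weight tensor (the class weighting)."""
    def __init__(self, alpha=None, gamma=2.0):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, logits, targets):
        # Per-example weighted cross-entropy, kept unreduced
        ce = F.cross_entropy(logits, targets, weight=self.alpha, reduction="none")
        # Probability of the true class (exact when alpha is None)
        pt = torch.exp(-ce)
        return ((1.0 - pt) ** self.gamma * ce).mean()
```

Because `(1 - p_t)^gamma <= 1`, the focal term never exceeds plain cross-entropy; it shifts gradient mass toward hard, misclassified examples, which helps with imbalanced constructive/non-constructive labels.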
### Training Hardware
- 48 hours on 4x NVIDIA A100 40GB GPUs

## Performance

### Evaluation Results

| Dataset | Accuracy | Precision | F1-Score |
|---------|----------|-----------|----------|
| YNACC   | 0.63     | 0.63      | 0.65     |
| IAC     | 0.79     | 0.85      | 0.87     |
| Reddit  | 0.57     | 0.74      | 0.67     |

## Limitations and Bias

- **Language**: English only
- **Bias**: May reflect biases present in Reddit discussions and in the training data

## Ethical Considerations

- Human oversight is recommended for consequential moderation decisions

## Technical Specifications

- **Model Architecture**: ModernBERT + classification head
- **Parameters**: ~150M base + LoRA adapters + classification head
- **Precision**: 4-bit quantized base model with full-precision adapters
- **Framework**: PyTorch, Transformers, PEFT (recent versions; loading the adapters may emit harmless warnings about unrecognized configuration parameters)

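The quick-start above loads the base model in full precision. To mirror the 4-bit setup described here, the base model can instead be loaded through a `BitsAndBytesConfig` before attaching the adapters. This is a sketch assuming `bitsandbytes` is installed and a CUDA device is available; NF4 quantization and bfloat16 compute are typical QLoRA choices, not confirmed settings from this repository:

```python
import torch
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig
from peft import PeftModel

# 4-bit quantization config (NF4 + bfloat16 compute are common QLoRA defaults)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Quantized base model
base = AutoModelForSequenceClassification.from_pretrained(
    "answerdotai/ModernBERT-base",
    num_labels=2,
    quantization_config=bnb_config,
    device_map="auto",
)

# Full-precision LoRA adapters on top of the quantized base
model = PeftModel.from_pretrained(base, "NiklasKoch/modernbert-discussion-classifier")
model.eval()
```

This trades a small amount of accuracy for a much smaller memory footprint, which matters at the throughput this model targets.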
158
+ ## Model Card Authors
159
+
160
+ Niklas Koch, Georg August University of Göttingen
161
+
162
+ ## Model Card Contact
163
+
164
+ niklas.koch01@stud.uni-goettingen.de
adapter_config.json ADDED
@@ -0,0 +1,48 @@
{
  "alpha_pattern": {},
  "auto_mapping": null,
  "base_model_name_or_path": "answerdotai/ModernBERT-base",
  "bias": "none",
  "corda_config": null,
  "eva_config": null,
  "exclude_modules": null,
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": true,
  "layer_replication": null,
  "layers_pattern": null,
  "layers_to_transform": null,
  "loftq_config": {},
  "lora_alpha": 32,
  "lora_bias": false,
  "lora_dropout": 0.1,
  "megatron_config": null,
  "megatron_core": "megatron.core",
  "modules_to_save": [
    "classifier",
    "classifier",
    "score",
    "classifier",
    "score",
    "classifier",
    "score",
    "classifier",
    "score"
  ],
  "peft_type": "LORA",
  "qalora_group_size": 16,
  "r": 16,
  "rank_pattern": {},
  "revision": null,
  "target_modules": [
    "Wi",
    "Wqkv",
    "dense",
    "Wo"
  ],
  "task_type": "SEQ_CLS",
  "trainable_token_indices": null,
  "use_dora": false,
  "use_qalora": false,
  "use_rslora": false
}
adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:46e2e62b7c20299da64c9049672738806def82c7684aa3b0bdefa34be0438ac8
size 13643392