tkbarb10 committed · verified
Commit 24244fc · Parent(s): ddb1970

Update README.md

Files changed (1): README.md (+132 −7)

README.md CHANGED
@@ -4,11 +4,17 @@ license: mit
 base_model: vinai/bertweet-large
 tags:
 - generated_from_trainer
 metrics:
 - accuracy
 model-index:
 - name: BERTweet-large-self-labeling
   results: []
 ---

 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
@@ -16,27 +22,144 @@ should probably proofread and complete it, then remove this comment. -->

 # BERTweet-large-self-labeling

- This model is a fine-tuned version of [vinai/bertweet-large](https://huggingface.co/vinai/bertweet-large) on an unknown dataset.
 It achieves the following results on the evaluation set:
 - Loss: 0.5607
 - Accuracy: 0.7885
- - F1 Macro: 0.7817
 - F1 Weighted: 0.7885

 ## Model description

- More information needed

 ## Intended uses & limitations

- More information needed

 ## Training and evaluation data

- More information needed

 ## Training procedure

 ### Training hyperparameters

 The following hyperparameters were used during training:
@@ -44,7 +167,7 @@ The following hyperparameters were used during training:
 - train_batch_size: 32
 - eval_batch_size: 64
 - seed: 42
- - optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
 - lr_scheduler_type: linear
 - lr_scheduler_warmup_steps: 300
 - num_epochs: 2
@@ -52,6 +175,8 @@ The following hyperparameters were used during training:

 ### Training results

 | Training Loss | Epoch | Step | Validation Loss | Accuracy | F1 Macro | F1 Weighted |
 |:-------------:|:-----:|:----:|:---------------:|:--------:|:--------:|:-----------:|
 | 0.5943        | 1.0   | 1540 | 0.5735          | 0.7708   | 0.7592   | 0.7708     |
@@ -63,4 +188,4 @@
 - Transformers 5.0.0
 - Pytorch 2.10.0+cu128
 - Datasets 4.0.0
- - Tokenizers 0.22.2
 
The updated README.md:

base_model: vinai/bertweet-large
tags:
- generated_from_trainer
- multi_label_classification
metrics:
- accuracy
model-index:
- name: BERTweet-large-self-labeling
  results: []
datasets:
- ADS509/full_experiment_labels
language:
- en
pipeline_tag: text-classification
---

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
 
should probably proofread and complete it, then remove this comment. -->

# BERTweet-large-self-labeling

This model is a fine-tuned version of [vinai/bertweet-large](https://huggingface.co/vinai/bertweet-large), trained on a dataset of social media comments drawn from five separate sources.
It achieves the following results on the evaluation set:

- Loss: 0.5607
- Accuracy: 0.7885
- **F1 Macro: 0.7817**
- F1 Weighted: 0.7885

## Model description

We retrained the classification head of the base model for a multi-class classification task on our self-labeled data.
The model description of the base model can be found at the link above, and the description of the dataset can be found [here](https://huggingface.co/datasets/ADS509/full_experiment_labels). The
fine-tuning parameters are listed below. The initial model used in this experiment was bert-base-uncased. After decent results with it, we decided to
switch to this model because it was pre-trained on a copious amount of Twitter data, which more closely aligns with our dataset. This turned out to be a good
decision, as this model was a **7.2%** improvement over bert-base on the evaluation data.

## Intended uses & limitations

The intended use of this model is to better understand the nature of different social media sites and of the discourse on each,
beyond the usual "positive", "negative", "neutral" sentiment output of most models. The labels for the commentary data are as follows:

- Argumentative
- Opinion
- Informational
- Expressive
- Neutral
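The classifier scores each comment against all five labels and keeps the highest-scoring one. A minimal sketch of that final step (the integer-to-label order here is an assumption for illustration; the authoritative mapping ships in the model's `id2label` config):

```python
# NOTE: this label order is assumed for illustration only; the real
# mapping is stored in the model config (id2label).
id2label = {
    0: "Argumentative",
    1: "Opinion",
    2: "Informational",
    3: "Expressive",
    4: "Neutral",
}

def predict_label(logits: list[float]) -> str:
    """Pick the highest-scoring class (single-label prediction)."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return id2label[best]

print(predict_label([0.2, 3.1, -0.5, 1.0, 0.4]))  # -> "Opinion"
```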

We think there is promise in this approach, and as this is the initial step toward a deeper understanding of social commentary,
there are several limitations to outline:

- As there were a total of 70k records, the data was primarily labeled by language models, with the prompt including correctly labeled examples
  and incorrectly labeled examples shown together with the correct label. Three language models were tasked with labeling, and only the majority-vote
  labels were kept. Three-way-tie samples were set aside. Future iterations would benefit from more models labeling and from more human-labeled
  examples.
- When reviewing records that were ambiguous or that the classifier predicted incorrectly, it was clear that the labeling scheme is fuzzy in
  some instances. For instance, many "Opinion" comments can be viewed as "Expressive" or "Argumentative", leading to ambiguous labels from the models.
  It would be worth exploring a more nuanced labeling scheme, perhaps splitting "Expressive" into 2-3 labels and "Opinion" into another 1 or 2.
- Due to the nature of the project, the commentary data used for training was subject to the following limitations:
  - Queries were isolated to "politics" or "US politics"
  - With one exception, all comment data is dated from Jan 1, 2026 to Feb 12, 2026
  - We set a ceiling and a floor for the number of comments per post: no posts with under 10 comments were used, and for posts with
    more comments, we only pulled the most recent 300

## Training and evaluation data

A full description of the dataset can be found [here](https://huggingface.co/datasets/ADS509/full_experiment_labels).

## Training procedure

The full code used for training is below. We found overfitting to occur after 2 epochs.

```python
import numpy as np
import torch
from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

MODEL_ID = "vinai/bertweet-large"

# Label ids assumed to follow the order of the labels listed above
label2id = {
    "Argumentative": 0,
    "Opinion": 1,
    "Informational": 2,
    "Expressive": 3,
    "Neutral": 4,
}
id2label = {v: k for k, v in label2id.items()}

# Self-labeled dataset referenced in the card metadata
dataset = load_dataset("ADS509/full_experiment_labels")

# Tokenizer must match the model being fine-tuned
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Function to tokenize data with
def tokenize_function(batch):
    return tokenizer(
        batch['text'],
        truncation=True,
        max_length=512  # Can't be greater than model max length
    )

# Tokenize data
train_data = dataset['train'].map(tokenize_function, batched=True)
test_data = dataset['test'].map(tokenize_function, batched=True)
valid_data = dataset['valid'].map(tokenize_function, batched=True)

# Convert lists to tensors
train_data.set_format("torch", columns=['input_ids', 'attention_mask', 'label'])
test_data.set_format("torch", columns=['input_ids', 'attention_mask', 'label'])
valid_data.set_format("torch", columns=['input_ids', 'attention_mask', 'label'])

model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID,
    num_labels=5,  # adjust this based on number of labels you're training on
    device_map='cuda',
    dtype='auto',
    label2id=label2id,
    id2label=id2label
)

# Metric function for evaluation in Trainer
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)

    return {
        'accuracy': accuracy_score(labels, predictions),
        'f1_macro': f1_score(labels, predictions, average='macro'),
        'f1_weighted': f1_score(labels, predictions, average='weighted')
    }

# Data collator to handle padding dynamically per batch
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

training_args = TrainingArguments(
    output_dir='./bert-comment',
    num_train_epochs=2,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_steps=300,

    # Evaluation & saving
    eval_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
    metric_for_best_model='f1_macro',

    # Logging
    logging_steps=100,
    report_to='tensorboard',

    # Other
    seed=42,
    fp16=torch.cuda.is_available(),  # Mixed precision if GPU available
)

# Set up Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=valid_data,
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

# Train!
trainer.train()

# Evaluate
eval_results = trainer.evaluate()
print(eval_results)
```

### Training hyperparameters

The following hyperparameters were used during training:
- train_batch_size: 32
- eval_batch_size: 64
- seed: 42
- optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 300
- num_epochs: 2
 

### Training results

As this is a multi-class classification problem and there is class imbalance, the main metric we evaluate this model by is `f1_macro`.

| Training Loss | Epoch | Step | Validation Loss | Accuracy | F1 Macro | F1 Weighted |
|:-------------:|:-----:|:----:|:---------------:|:--------:|:--------:|:-----------:|
| 0.5943        | 1.0   | 1540 | 0.5735          | 0.7708   | 0.7592   | 0.7708      |
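Why prefer macro over weighted F1 here? Macro-F1 averages per-class F1 equally, so a poorly served minority class drags it down, while weighted-F1 is dominated by the majority class. A toy worked example (fabricated counts, purely illustrative):

```python
# Imbalanced pair of labels: 8 "Neutral" vs 2 "Opinion" comments.
# The classifier misses the minority class entirely.
y_true = ["Neutral"] * 8 + ["Opinion"] * 2
y_pred = ["Neutral"] * 10

def f1_per_class(cls):
    # Per-class F1 = 2*TP / (2*TP + FP + FN)
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

classes = ["Neutral", "Opinion"]
f1s = [f1_per_class(c) for c in classes]            # [0.889, 0.0]
support = [sum(t == c for t in y_true) for c in classes]

macro = sum(f1s) / len(f1s)                          # ~0.44: exposes the miss
weighted = sum(f * s for f, s in zip(f1s, support)) / len(y_true)  # ~0.71: hides it
print(macro, weighted)
```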
 
- Transformers 5.0.0
- Pytorch 2.10.0+cu128
- Datasets 4.0.0
- Tokenizers 0.22.2