tkbarb10 committed · verified
Commit 24244fc · Parent(s): ddb1970

Update README.md

Files changed (1): README.md (+132 −7)

README.md CHANGED
@@ -4,11 +4,17 @@ license: mit
 base_model: vinai/bertweet-large
 tags:
 - generated_from_trainer
 metrics:
 - accuracy
 model-index:
 - name: BERTweet-large-self-labeling
   results: []
 ---

 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
@@ -16,27 +22,144 @@ should probably proofread and complete it, then remove this comment. -->

 # BERTweet-large-self-labeling

- This model is a fine-tuned version of [vinai/bertweet-large](https://huggingface.co/vinai/bertweet-large) on an unknown dataset.
 It achieves the following results on the evaluation set:
 - Loss: 0.5607
 - Accuracy: 0.7885
- - F1 Macro: 0.7817
 - F1 Weighted: 0.7885

 ## Model description

- More information needed

 ## Intended uses & limitations

- More information needed

 ## Training and evaluation data

- More information needed

 ## Training procedure

 ### Training hyperparameters

 The following hyperparameters were used during training:
@@ -44,7 +167,7 @@ The following hyperparameters were used during training:
 - train_batch_size: 32
 - eval_batch_size: 64
 - seed: 42
- - optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
 - lr_scheduler_type: linear
 - lr_scheduler_warmup_steps: 300
 - num_epochs: 2
@@ -52,6 +175,8 @@ The following hyperparameters were used during training:

 ### Training results

 | Training Loss | Epoch | Step | Validation Loss | Accuracy | F1 Macro | F1 Weighted |
 |:-------------:|:-----:|:----:|:---------------:|:--------:|:--------:|:-----------:|
 | 0.5943        | 1.0   | 1540 | 0.5735          | 0.7708   | 0.7592   | 0.7708     |
@@ -63,4 +188,4 @@
 - Transformers 5.0.0
 - Pytorch 2.10.0+cu128
 - Datasets 4.0.0
- - Tokenizers 0.22.2
 
The updated README.md:

base_model: vinai/bertweet-large
tags:
- generated_from_trainer
- multi_label_classification
metrics:
- accuracy
model-index:
- name: BERTweet-large-self-labeling
  results: []
datasets:
- ADS509/full_experiment_labels
language:
- en
pipeline_tag: text-classification
---

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
 
should probably proofread and complete it, then remove this comment. -->

# BERTweet-large-self-labeling

This model is a fine-tuned version of [vinai/bertweet-large](https://huggingface.co/vinai/bertweet-large), trained on a dataset of social media comments drawn from five separate sources.
It achieves the following results on the evaluation set:

- Loss: 0.5607
- Accuracy: 0.7885
- **F1 Macro: 0.7817**
- F1 Weighted: 0.7885

## Model description

We retrained the classification head of the base model for a multi-class classification task on our self-labeled data.
The model description of the base model can be found at the link above, and the description of the dataset can be found [here](https://huggingface.co/datasets/ADS509/full_experiment_labels). The
fine-tuning parameters are listed below. The initial model used in this experiment was bert-base-uncased. After decent results with it, we decided to
switch to this model because it was pre-trained on a copious amount of Twitter data, which more closely aligns with our dataset. This turned out to be a good
decision, as this model was a **7.2%** improvement over bert-base on the evaluation data.

## Intended uses & limitations

The intended use of this model is to better understand the nature of different social media sites and of the discourse on each,
beyond the usual "positive", "negative", "neutral" sentiment output of most models. The labels for the commentary data are as follows:

- Argumentative
- Opinion
- Informational
- Expressive
- Neutral
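The classifier scores each comment against all five labels and keeps the highest-scoring one. A minimal sketch of that final step (the integer-to-label order here is an assumption for illustration; the authoritative mapping ships in the model's `id2label` config):

```python
# NOTE: this label order is assumed for illustration only; the real
# mapping is stored in the model config (id2label).
id2label = {
    0: "Argumentative",
    1: "Opinion",
    2: "Informational",
    3: "Expressive",
    4: "Neutral",
}

def predict_label(logits: list[float]) -> str:
    """Pick the highest-scoring class (single-label prediction)."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return id2label[best]

print(predict_label([0.2, 3.1, -0.5, 1.0, 0.4]))  # -> "Opinion"
```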

We think there is promise in this approach, and as this is the initial step toward a deeper understanding of social commentary,
there are several limitations to outline:

- As there were a total of 70k records, the data was primarily labeled by language models, with the prompt including correctly labeled examples
  and incorrectly labeled examples shown together with the correct label. Three language models were tasked with labeling, and only the majority-vote
  labels were kept. Three-way-tie samples were set aside. Future iterations would benefit from more models labeling and from more human-labeled
  examples.
- When reviewing records that were ambiguous or that the classifier predicted incorrectly, it was clear that the labeling scheme is fuzzy in
  some instances. For instance, many "Opinion" comments can be viewed as "Expressive" or "Argumentative", leading to ambiguous labels from the models.
  It would be worth exploring a more nuanced labeling scheme, perhaps splitting "Expressive" into 2-3 labels and "Opinion" into another 1 or 2.
- Due to the nature of the project, the commentary data used for training was subject to the following limitations:
  - Queries were isolated to "politics" or "US politics"
  - With one exception, all comment data is dated from Jan 1, 2026 to Feb 12, 2026
  - We set a ceiling and a floor for the number of comments per post: no posts with under 10 comments were used, and for posts with
    more comments, we only pulled the most recent 300

## Training and evaluation data

A full description of the dataset can be found [here](https://huggingface.co/datasets/ADS509/full_experiment_labels).

## Training procedure

The full code used for training is below. We found overfitting to occur after 2 epochs.

```python
import numpy as np
import torch
from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

MODEL_ID = "vinai/bertweet-large"

# Label ids assumed to follow the order of the labels listed above
label2id = {
    "Argumentative": 0,
    "Opinion": 1,
    "Informational": 2,
    "Expressive": 3,
    "Neutral": 4,
}
id2label = {v: k for k, v in label2id.items()}

# Self-labeled dataset referenced in the card metadata
dataset = load_dataset("ADS509/full_experiment_labels")

# Tokenizer must match the model being fine-tuned
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Function to tokenize data with
def tokenize_function(batch):
    return tokenizer(
        batch['text'],
        truncation=True,
        max_length=512  # Can't be greater than model max length
    )

# Tokenize data
train_data = dataset['train'].map(tokenize_function, batched=True)
test_data = dataset['test'].map(tokenize_function, batched=True)
valid_data = dataset['valid'].map(tokenize_function, batched=True)

# Convert lists to tensors
train_data.set_format("torch", columns=['input_ids', 'attention_mask', 'label'])
test_data.set_format("torch", columns=['input_ids', 'attention_mask', 'label'])
valid_data.set_format("torch", columns=['input_ids', 'attention_mask', 'label'])

model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID,
    num_labels=5,  # adjust this based on number of labels you're training on
    device_map='cuda',
    dtype='auto',
    label2id=label2id,
    id2label=id2label
)

# Metric function for evaluation in Trainer
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)

    return {
        'accuracy': accuracy_score(labels, predictions),
        'f1_macro': f1_score(labels, predictions, average='macro'),
        'f1_weighted': f1_score(labels, predictions, average='weighted')
    }

# Data collator to handle padding dynamically per batch
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

training_args = TrainingArguments(
    output_dir='./bert-comment',
    num_train_epochs=2,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_steps=300,

    # Evaluation & saving
    eval_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
    metric_for_best_model='f1_macro',

    # Logging
    logging_steps=100,
    report_to='tensorboard',

    # Other
    seed=42,
    fp16=torch.cuda.is_available(),  # Mixed precision if GPU available
)

# Set up Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=valid_data,
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

# Train!
trainer.train()

# Evaluate
eval_results = trainer.evaluate()
print(eval_results)
```

### Training hyperparameters

The following hyperparameters were used during training:
- train_batch_size: 32
- eval_batch_size: 64
- seed: 42
- optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 300
- num_epochs: 2
 

### Training results

As this is a multi-class classification problem and there is class imbalance, the main metric we evaluate this model by is `f1_macro`.

| Training Loss | Epoch | Step | Validation Loss | Accuracy | F1 Macro | F1 Weighted |
|:-------------:|:-----:|:----:|:---------------:|:--------:|:--------:|:-----------:|
| 0.5943        | 1.0   | 1540 | 0.5735          | 0.7708   | 0.7592   | 0.7708      |
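Why prefer macro over weighted F1 here? Macro-F1 averages per-class F1 equally, so a poorly served minority class drags it down, while weighted-F1 is dominated by the majority class. A toy worked example (fabricated counts, purely illustrative):

```python
# Imbalanced pair of labels: 8 "Neutral" vs 2 "Opinion" comments.
# The classifier misses the minority class entirely.
y_true = ["Neutral"] * 8 + ["Opinion"] * 2
y_pred = ["Neutral"] * 10

def f1_per_class(cls):
    # Per-class F1 = 2*TP / (2*TP + FP + FN)
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

classes = ["Neutral", "Opinion"]
f1s = [f1_per_class(c) for c in classes]            # [0.889, 0.0]
support = [sum(t == c for t in y_true) for c in classes]

macro = sum(f1s) / len(f1s)                          # ~0.44: exposes the miss
weighted = sum(f * s for f, s in zip(f1s, support)) / len(y_true)  # ~0.71: hides it
print(macro, weighted)
```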
 
- Transformers 5.0.0
- Pytorch 2.10.0+cu128
- Datasets 4.0.0
- Tokenizers 0.22.2