---
library_name: transformers
license: mit
base_model: vinai/bertweet-large
tags:
- generated_from_trainer
- multi_label_classification
metrics:
- accuracy
model-index:
- name: BERTweet-large-self-labeling
  results: []
datasets:
- ADS509/full_experiment_labels
language:
- en
pipeline_tag: text-classification
---

# BERTweet-large-self-labeling

This model is a fine-tuned version of [vinai/bertweet-large](https://huggingface.co/vinai/bertweet-large) on a dataset consisting of social media comments from 5 separate sources.
It achieves the following results on the evaluation set:

- Loss: 0.5607
- Accuracy: 0.7885
- **F1 Macro: 0.7817**
- F1 Weighted: 0.7885

## Model description

We retrained the classification layer of BERTweet-large for a multi-label classification task on our self-labeled data.
The model description of the base model can be found at the link above, and the description of the dataset can be found [here](https://huggingface.co/datasets/ADS509/full_experiment_labels).
The fine-tuning parameters are listed below. The initial model used in this experiment was bert-base-uncased. After seeing decent results, we decided to
switch to this model, as it was pre-trained on a copious amount of Twitter data, which more closely aligned with our dataset. This turned out to be a good
decision, as this model was a **7.2%** improvement over bert-base on the evaluation data.

## Intended uses & limitations

The intended use of this model is to better understand the nature of different social media websites and the discourse on each
site, beyond the usual "positive", "negative", "neutral" sentiment of most models. The labels for the commentary data are as follows:

- Argumentative
- Opinion
- Informational
- Expressive
- Neutral
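
For quick experimentation, the model can be loaded through the `pipeline` API. Below is a minimal sketch; the repo ID `ADS509/BERTweet-large-self-labeling` is an assumption based on the model name and the dataset namespace, so substitute the actual ID if it differs.

```python
from transformers import pipeline

# NOTE: repo ID is assumed from the card's naming; replace with the real one
clf = pipeline("text-classification", model="ADS509/BERTweet-large-self-labeling")

comment = "There is no way this bill passes, the votes simply are not there."
print(clf(comment))  # expected form: [{'label': 'Argumentative', 'score': ...}]
```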

We think there is promise in this approach, and as this is the initial step toward a deeper understanding of social commentary,
there are several limitations to outline:

- As there were a total of 70k records, the data was primarily labeled by language models, with the prompt including correctly labeled
  examples and incorrectly labeled examples paired with their correct labels. Three language models were tasked with labeling, and only
  the majority-vote labels were kept; three-way ties were set aside (see the sketch after this list). Future iterations would benefit
  from more models labeling and more human-labeled examples.
- When reviewing records that were ambiguous or that the classifier predicted incorrectly, it was clear that the labeling scheme is fuzzy
  in some instances. For example, many "Opinion" comments can also be viewed as "Expressive" or "Argumentative", leading to ambiguous
  labels from the models. It would be worth exploring a more nuanced labeling scheme, perhaps splitting "Expressive" into 2-3 labels and
  "Opinion" into another 1 or 2.
- Due to the nature of the project, the commentary data used for training is subject to the following limitations:
  - Queries were isolated to "politics" or "US politics"
  - All comment data is dated from Jan 1, 2025 to Feb 12, 2026, with the majority originating in 2026
  - We set a ceiling and a floor on the number of comments per post: no posts with under 10 comments were used, and the number of
    comments scraped per post was capped at 300
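
The vote-aggregation step described in the first bullet above can be expressed compactly. This is an illustrative sketch (the `votes` data and function name are hypothetical), not the exact labeling code:

```python
from collections import Counter

def majority_vote(labels):
    """Keep a sample's label only if at least two of the three annotating models agree."""
    (top_label, count), = Counter(labels).most_common(1)
    return top_label if count >= 2 else None  # None marks a three-way tie, set aside

# Hypothetical labels from the three language-model annotators for two samples
votes = [
    ["Opinion", "Opinion", "Expressive"],   # majority vote kept as "Opinion"
    ["Neutral", "Opinion", "Expressive"],   # three-way tie -> discarded
]
print([majority_vote(v) for v in votes])    # ['Opinion', None]
```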

## Training and evaluation data

A full description of the dataset can be found [here](https://huggingface.co/datasets/ADS509/full_experiment_labels).

## Training procedure

The full code used for training is below. We found overfitting to occur after 2 epochs.
```python
import numpy as np
import torch
from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

MODEL_ID = "vinai/bertweet-large"

# Label set from the card; the id ordering shown here is illustrative
id2label = {0: "Argumentative", 1: "Opinion", 2: "Informational", 3: "Expressive", 4: "Neutral"}
label2id = {label: idx for idx, label in id2label.items()}

dataset = load_dataset("ADS509/full_experiment_labels")

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Function to tokenize data with
def tokenize_function(batch):
    return tokenizer(
        batch['text'],
        truncation=True,
        max_length=512  # Can't be greater than model max length
    )

# Tokenize Data
train_data = dataset['train'].map(tokenize_function, batched=True)
test_data = dataset['test'].map(tokenize_function, batched=True)
valid_data = dataset['valid'].map(tokenize_function, batched=True)

# Convert lists to tensors
train_data.set_format("torch", columns=['input_ids', 'attention_mask', 'label'])
test_data.set_format("torch", columns=['input_ids', 'attention_mask', 'label'])
valid_data.set_format("torch", columns=['input_ids', 'attention_mask', 'label'])

model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID,
    num_labels=5,  # adjust this based on the number of labels you're training on
    device_map='cuda',
    dtype='auto',
    label2id=label2id,
    id2label=id2label
)

# Metric function for evaluation in Trainer
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)

    return {
        'accuracy': accuracy_score(labels, predictions),
        'f1_macro': f1_score(labels, predictions, average='macro'),
        'f1_weighted': f1_score(labels, predictions, average='weighted')
    }

# Data collator to handle padding dynamically per batch
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

training_args = TrainingArguments(
    output_dir='./bert-comment',
    num_train_epochs=2,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_steps=300,

    # Evaluation & saving
    eval_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
    metric_for_best_model='f1_macro',

    # Logging
    logging_steps=100,
    report_to='tensorboard',

    # Other
    seed=42,
    fp16=torch.cuda.is_available(),  # Mixed precision if GPU available
)

# Set up Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=valid_data,
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

# Train!
trainer.train()

# Evaluate
eval_results = trainer.evaluate()
print(eval_results)
```
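
Since the held-out `test_data` split is tokenized above but not used in the snippet, a natural follow-up (a sketch, not part of the original run) is to score the best checkpoint, which `load_best_model_at_end` restores, on that split:

```python
# Evaluate the best checkpoint on the held-out test split
test_results = trainer.evaluate(eval_dataset=test_data, metric_key_prefix="test")
print(test_results)
```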

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 32
- eval_batch_size: 64
- seed: 42
- optimizer: AdamW (torch fused) with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 300
- num_epochs: 2
- mixed_precision_training: Native AMP

### Training results

As this is a multi-label classification problem with class imbalance, the main metric we evaluate this model by is `f1_macro`.

| Training Loss | Epoch | Step | Validation Loss | Accuracy | F1 Macro | F1 Weighted |
|:-------------:|:-----:|:----:|:---------------:|:--------:|:--------:|:-----------:|
| 0.5943        | 1.0   | 1540 | 0.5735          | 0.7708   | 0.7592   | 0.7708      |
| 0.3951        | 2.0   | 3080 | 0.5607          | 0.7885   | 0.7817   | 0.7885      |
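
To illustrate why `f1_macro` is the headline metric, here is a small toy example (hypothetical data) showing how a majority-class-biased classifier can look acceptable on weighted F1 while macro F1, which averages per-class scores equally, exposes the failure on minority classes:

```python
from sklearn.metrics import f1_score

# Hypothetical predictions from a classifier that always outputs the majority class
y_true = ["Neutral"] * 8 + ["Opinion", "Argumentative"]
y_pred = ["Neutral"] * 10

print(f1_score(y_true, y_pred, average="macro", zero_division=0))     # low: minority classes score 0
print(f1_score(y_true, y_pred, average="weighted", zero_division=0))  # higher: dominated by "Neutral"
```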

### Framework versions

- Transformers 5.0.0
- Pytorch 2.10.0+cu128
- Datasets 4.0.0
- Tokenizers 0.22.2