---
library_name: transformers
license: mit
base_model: vinai/bertweet-large
tags:
- generated_from_trainer
- multi_label_classification
metrics:
- accuracy
model-index:
- name: BERTweet-large-self-labeling
  results: []
datasets:
- ADS509/full_experiment_labels
language:
- en
pipeline_tag: text-classification
---

# BERTweet-large-self-labeling

This model is a fine-tuned version of [vinai/bertweet-large](https://huggingface.co/vinai/bertweet-large) on a dataset consisting of social media comments from 5 separate sources.
It achieves the following results on the evaluation set:

- Loss: 0.5607
- Accuracy: 0.7885
- **F1 Macro: 0.7817**
- F1 Weighted: 0.7885

## Model description

We retrained the classification layer of BERTweet-large for a multi-label classification task on our self-labeled data.
The model description of the base model can be found at the link above, and the description of the dataset can be found [here](https://huggingface.co/datasets/ADS509/full_experiment_labels).
The fine-tuning parameters are listed below. The initial model used in this experiment was bert-base-uncased. After seeing decent results, we decided to
switch to this model, as it was pre-trained on a copious amount of Twitter data, which more closely aligned with our dataset. This turned out to be a good
decision, as this model was a **7.2%** improvement over bert-base on the evaluation data.

## Intended uses & limitations

The intended use of this model is to better understand the nature of different social media websites and the discourse on each
site, beyond the usual "positive", "negative", "neutral" sentiment of most models. The labels for the commentary data are as follows:

- Argumentative
- Opinion
- Informational
- Expressive
- Neutral
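
For quick experimentation, the model can be loaded through the `pipeline` API. Below is a minimal sketch; the repo ID `ADS509/BERTweet-large-self-labeling` is an assumption based on the model name and the dataset namespace, so substitute the actual ID if it differs.

```python
from transformers import pipeline

# NOTE: repo ID is assumed from the card's naming; replace with the real one
clf = pipeline("text-classification", model="ADS509/BERTweet-large-self-labeling")

comment = "There is no way this bill passes, the votes simply are not there."
print(clf(comment))  # expected form: [{'label': 'Argumentative', 'score': ...}]
```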

We think there is promise in this approach, and as this is the initial step toward a deeper understanding of social commentary,
there are several limitations to outline:

- As there were a total of 70k records, the data was primarily labeled by language models, with the prompt including correctly labeled
  examples and incorrectly labeled examples paired with their correct labels. Three language models were tasked with labeling, and only
  the majority-vote labels were kept; three-way ties were set aside (see the sketch after this list). Future iterations would benefit
  from more models labeling and more human-labeled examples.
- When reviewing records that were ambiguous or that the classifier predicted incorrectly, it was clear that the labeling scheme is fuzzy
  in some instances. For example, many "Opinion" comments can also be viewed as "Expressive" or "Argumentative", leading to ambiguous
  labels from the models. It would be worth exploring a more nuanced labeling scheme, perhaps splitting "Expressive" into 2-3 labels and
  "Opinion" into another 1 or 2.
- Due to the nature of the project, the commentary data used for training is subject to the following limitations:
  - Queries were isolated to "politics" or "US politics"
  - All comment data is dated from Jan 1, 2025 to Feb 12, 2026, with the majority originating in 2026
  - We set a ceiling and a floor on the number of comments per post: no posts with under 10 comments were used, and the number of
    comments scraped per post was capped at 300
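
The vote-aggregation step described in the first bullet above can be expressed compactly. This is an illustrative sketch (the `votes` data and function name are hypothetical), not the exact labeling code:

```python
from collections import Counter

def majority_vote(labels):
    """Keep a sample's label only if at least two of the three annotating models agree."""
    (top_label, count), = Counter(labels).most_common(1)
    return top_label if count >= 2 else None  # None marks a three-way tie, set aside

# Hypothetical labels from the three language-model annotators for two samples
votes = [
    ["Opinion", "Opinion", "Expressive"],   # majority vote kept as "Opinion"
    ["Neutral", "Opinion", "Expressive"],   # three-way tie -> discarded
]
print([majority_vote(v) for v in votes])    # ['Opinion', None]
```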

## Training and evaluation data

A full description of the dataset can be found [here](https://huggingface.co/datasets/ADS509/full_experiment_labels).

## Training procedure

The full code used for training is below. We found overfitting to occur after 2 epochs.
```python
import numpy as np
import torch
from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

MODEL_ID = "vinai/bertweet-large"

# Label set from the card; the id ordering shown here is illustrative
id2label = {0: "Argumentative", 1: "Opinion", 2: "Informational", 3: "Expressive", 4: "Neutral"}
label2id = {label: idx for idx, label in id2label.items()}

dataset = load_dataset("ADS509/full_experiment_labels")

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Function to tokenize data with
def tokenize_function(batch):
    return tokenizer(
        batch['text'],
        truncation=True,
        max_length=512  # Can't be greater than model max length
    )

# Tokenize Data
train_data = dataset['train'].map(tokenize_function, batched=True)
test_data = dataset['test'].map(tokenize_function, batched=True)
valid_data = dataset['valid'].map(tokenize_function, batched=True)

# Convert lists to tensors
train_data.set_format("torch", columns=['input_ids', 'attention_mask', 'label'])
test_data.set_format("torch", columns=['input_ids', 'attention_mask', 'label'])
valid_data.set_format("torch", columns=['input_ids', 'attention_mask', 'label'])

model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID,
    num_labels=5,  # adjust this based on the number of labels you're training on
    device_map='cuda',
    dtype='auto',
    label2id=label2id,
    id2label=id2label
)

# Metric function for evaluation in Trainer
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)

    return {
        'accuracy': accuracy_score(labels, predictions),
        'f1_macro': f1_score(labels, predictions, average='macro'),
        'f1_weighted': f1_score(labels, predictions, average='weighted')
    }

# Data collator to handle padding dynamically per batch
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

training_args = TrainingArguments(
    output_dir='./bert-comment',
    num_train_epochs=2,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_steps=300,

    # Evaluation & saving
    eval_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
    metric_for_best_model='f1_macro',

    # Logging
    logging_steps=100,
    report_to='tensorboard',

    # Other
    seed=42,
    fp16=torch.cuda.is_available(),  # Mixed precision if GPU available
)

# Set up Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=valid_data,
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

# Train!
trainer.train()

# Evaluate
eval_results = trainer.evaluate()
print(eval_results)
```
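
Since the held-out `test_data` split is tokenized above but not used in the snippet, a natural follow-up (a sketch, not part of the original run) is to score the best checkpoint, which `load_best_model_at_end` restores, on that split:

```python
# Evaluate the best checkpoint on the held-out test split
test_results = trainer.evaluate(eval_dataset=test_data, metric_key_prefix="test")
print(test_results)
```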

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 32
- eval_batch_size: 64
- seed: 42
- optimizer: AdamW (torch fused) with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 300
- num_epochs: 2
- mixed_precision_training: Native AMP

### Training results

As this is a multi-label classification problem with class imbalance, the main metric we evaluate this model by is `f1_macro`.

| Training Loss | Epoch | Step | Validation Loss | Accuracy | F1 Macro | F1 Weighted |
|:-------------:|:-----:|:----:|:---------------:|:--------:|:--------:|:-----------:|
| 0.5943        | 1.0   | 1540 | 0.5735          | 0.7708   | 0.7592   | 0.7708      |
| 0.3951        | 2.0   | 3080 | 0.5607          | 0.7885   | 0.7817   | 0.7885      |
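
To illustrate why `f1_macro` is the headline metric, here is a small toy example (hypothetical data) showing how a majority-class-biased classifier can look acceptable on weighted F1 while macro F1, which averages per-class scores equally, exposes the failure on minority classes:

```python
from sklearn.metrics import f1_score

# Hypothetical predictions from a classifier that always outputs the majority class
y_true = ["Neutral"] * 8 + ["Opinion", "Argumentative"]
y_pred = ["Neutral"] * 10

print(f1_score(y_true, y_pred, average="macro", zero_division=0))     # low: minority classes score 0
print(f1_score(y_true, y_pred, average="weighted", zero_division=0))  # higher: dominated by "Neutral"
```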

### Framework versions

- Transformers 5.0.0
- Pytorch 2.10.0+cu128
- Datasets 4.0.0
- Tokenizers 0.22.2