Fine-tuning model with Huggingface Trainer
Hi,
I'm trying to fine-tune this model on my own dataset using the Huggingface Trainer, but I get the following error in the training step: ValueError: The model did not return a loss from the inputs, only the following keys: logits,past_key_values. For reference, the inputs it received are input_ids,attention_mask,pixel_values..
I'm using the following code:
from transformers import GitForCausalLM
model = GitForCausalLM.from_pretrained("microsoft/git-large-coco")
import torch
from torch.utils.data import Dataset
from PIL import Image
class VCSDatasetProcessor(Dataset):
    def __init__(self, root_dir, df, processor, max_target_length=128):
        self.root_dir = root_dir
        self.df = df
        self.processor = processor
        self.max_target_length = max_target_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # get file name + text
        file_name = self.df["image_id"][idx]
        text = self.df["question"][idx]
        # prepare image (i.e. resize + normalize)
        image = Image.open(self.root_dir + file_name).convert("RGB")
        encoding = self.processor(images=image, text=text, padding="max_length",
                                  return_tensors="pt", max_length=self.max_target_length)
        encoding = {k: v.squeeze() for k, v in encoding.items()}
        return encoding
from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained("microsoft/git-large-coco")
train_set = VCSDatasetProcessor(root_dir="data/images/", df=train_df, processor=processor)
valid_set = VCSDatasetProcessor(root_dir="data/images/", df=valid_df, processor=processor)
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy="steps",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    fp16=False,
    output_dir="./git-hf/",
    logging_steps=200,
    save_strategy="epoch",
    eval_steps=200,
    num_train_epochs=4,
)
def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
    rouge = rouge_metric.compute(predictions=pred_str, references=label_str)
    return {"rouge": rouge["rougeL"].mid.fmeasure}
from transformers import default_data_collator
# instantiate trainer
trainer = Seq2SeqTrainer(
    model=model,
    tokenizer=processor,
    args=training_args,
    #compute_metrics=compute_metrics,
    train_dataset=train_set,
    eval_dataset=valid_set,
    data_collator=default_data_collator,
)
trainer.train()
The full error trace is the following:
***** Running training *****
Num examples = 8886
Num Epochs = 4
Instantaneous batch size per device = 4
Total train batch size (w. parallel, distributed & accumulation) = 4
Gradient Accumulation steps = 1
Total optimization steps = 8888
Number of trainable parameters = 394196026
vision_config is None. initializing the GitVisionConfig with default values.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In [26], line 1
----> 1 trainer.train()
File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:1570, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1565 self.model_wrapped = self.model
1567 inner_training_loop = find_executable_batch_size(
1568 self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
1569 )
-> 1570 return inner_training_loop(
1571 args=args,
1572 resume_from_checkpoint=resume_from_checkpoint,
1573 trial=trial,
1574 ignore_keys_for_eval=ignore_keys_for_eval,
1575 )
File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:1835, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
1833 tr_loss_step = self.training_step(model, inputs)
1834 else:
-> 1835 tr_loss_step = self.training_step(model, inputs)
1837 if (
1838 args.logging_nan_inf_filter
1839 and not is_torch_tpu_available()
1840 and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
1841 ):
1842 # if loss is nan or inf simply add the average of previous logged losses
1843 tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)
File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:2583, in Trainer.training_step(self, model, inputs)
2580 return loss_mb.reduce_mean().detach().to(self.args.device)
2582 with self.compute_loss_context_manager():
-> 2583 loss = self.compute_loss(model, inputs)
2585 if self.args.n_gpu > 1:
2586 loss = loss.mean() # mean() to average on multi-gpu parallel training
File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:2628, in Trainer.compute_loss(self, model, inputs, return_outputs)
2626 else:
2627 if isinstance(outputs, dict) and "loss" not in outputs:
-> 2628 raise ValueError(
2629 "The model did not return a loss from the inputs, only the following keys: "
2630 f"{','.join(outputs.keys())}. For reference, the inputs it received are {','.join(inputs.keys())}."
2631 )
2632 # We don't use .loss here since the model may return tuples instead of ModelOutput.
2633 loss = outputs["loss"] if isinstance(outputs, dict) else outputs[0]
ValueError: The model did not return a loss from the inputs, only the following keys: logits,past_key_values. For reference, the inputs it received are input_ids,attention_mask,pixel_values.
I've also tried to fine-tune the model using the code in this tutorial: https://colab.research.google.com/drive/1HLxgrG7xZJ9FvXckNG61J72FkyrbqKAA. That seemed to work, but when I tried to save a checkpoint every epoch, those checkpoints produced strange output on my test set: the model gave the same output for every input image. The output of the last checkpoint also differed from that of the final model that was still in memory after the training loop finished (that one actually produced different output for different images), so something seemed to go wrong there.
I used the same code as in the tutorial; it only differed here:
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

model.train()
for epoch in range(10):
    print("Epoch:", epoch)
    for idx, batch in enumerate(train_dataloader):
        input_ids = batch.pop("input_ids").to(device)
        pixel_values = batch.pop("pixel_values").to(device)

        outputs = model(input_ids=input_ids,
                        pixel_values=pixel_values,
                        labels=input_ids)

        loss = outputs.loss
        if idx % 100 == 0:
            print(str(idx) + "/ " + str(len(train_dataloader)) + " Loss:", loss.item())

        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    model.save_pretrained("git2/checkpoint-" + str(epoch))
Can you help me with these problems, or at least with one of the two of them? :)
Best wishes
Hi,
Thanks for your interest in GIT!
You don't need to use the Seq2SeqTrainer for a decoder-only model like GIT.
Seq2SeqTrainer is only meant for Seq2Seq models like T5, BART, PEGASUS, etc.
Hi Niels,
Thanks for the quick reply. With the regular Trainer and TrainingArguments, I get the same problem. How should I use them differently?
I see that you're not preparing any labels for the model. Hence it will be difficult for the model to train ;)
In your PyTorch Dataset, you prepare each image + text pair using the processor, which is great. However, you also need to add a labels key to the encoding dictionary, as the model requires pixel_values, input_ids (the prepared image + text pair) and labels (the ground truth targets to produce) in order to compute a loss. The labels are simply a copy of the input_ids:
encoding["labels"] = input_ids.copy()
as the model internally will shift them one position to compute the cross-entropy loss.
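To make that internal shift concrete, here is a small pure-Python sketch; the token ids are made up for illustration (real ids come from the processor's tokenizer):

```python
# Toy illustration of how labels relate to input_ids for causal LM training.
# The ids below are hypothetical placeholders, not real tokenizer output.
input_ids = [101, 2023, 2003, 1037, 6187, 102]

# labels start out as a plain copy of input_ids
labels = list(input_ids)

# Internally, the model shifts by one position: the logits at position t
# are scored against labels[t + 1], so it learns to predict the next token.
targets_per_position = labels[1:]          # what each position must predict
positions_with_a_target = input_ids[:-1]   # the final position has no target

print(targets_per_position)   # [2023, 2003, 1037, 6187, 102]
```

So passing labels identical to input_ids is not redundant: the offset happens inside the model, which is why a verbatim copy is exactly what the cross-entropy loss needs.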
Ah, that makes a lot of sense and works much better, thank you!
However, now I'm wondering: does this conceptually mean that the model is trained to expect an (image, text) pair as input? Because the "text" in this pair is the ground truth caption. It works fine now at inference time (with only the image as input), but I'm just trying to check whether I understand the concepts correctly. I'm especially confused because I already trained BlipForConditionalGeneration in the same way, without providing labels, and that seemed to work fine. How is that possible?
That's a good point; it's because of these lines of code, which automatically compute the labels for BLIP in case the user doesn't provide them.
Opened an issue here to remove this unexpected behaviour: https://github.com/huggingface/transformers/issues/21510
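Simplified, BLIP's fallback behaves roughly like the sketch below; this is a hypothetical, stripped-down stand-in, not the actual transformers code:

```python
# Hypothetical, stripped-down version of BLIP's label fallback:
# if the caller passes no labels, input_ids are silently reused as labels,
# so a loss is computed even though the user never asked for one.
def blip_style_forward(input_ids, labels=None):
    if labels is None:
        labels = list(input_ids)  # implicit copy of input_ids
    return labels

# With no labels given, the model still gets training targets:
print(blip_style_forward([101, 2023, 102]))  # [101, 2023, 102]
```

That implicit copy is why BLIP trained fine without explicit labels while GIT raised the "did not return a loss" error.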
Thank you! That explained a lot for me. Closing this issue now as it's resolved!