Fine-tuning model with Hugging Face Trainer

#1
by rubenjanss - opened

Hi,

I'm trying to fine-tune this model on my own dataset using the Hugging Face Trainer, but the training step fails with: ValueError: The model did not return a loss from the inputs, only the following keys: logits,past_key_values. For reference, the inputs it received are input_ids,attention_mask,pixel_values.

I'm using the following code:

from transformers import GitForCausalLM
model = GitForCausalLM.from_pretrained("microsoft/git-large-coco")

import torch
from torch.utils.data import Dataset
from PIL import Image

class VCSDatasetProcessor(Dataset):
    def __init__(self, root_dir, df, processor, max_target_length=128):
        self.root_dir = root_dir
        self.df = df
        self.processor = processor
        self.max_target_length = max_target_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # get file name + text 
        file_name = self.df["image_id"][idx]
        text = self.df["question"][idx]
        # prepare image (i.e. resize + normalize)
        image = Image.open(self.root_dir + file_name).convert("RGB")

        encoding = self.processor(
            images=image,
            text=text,
            padding="max_length",
            max_length=self.max_target_length,
            return_tensors="pt",
        )

        # remove the batch dimension added by the processor
        encoding = {k: v.squeeze() for k, v in encoding.items()}
        
        return encoding

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("microsoft/git-large-coco")

train_set = VCSDatasetProcessor(root_dir="data/images/", df=train_df, processor=processor)
valid_set = VCSDatasetProcessor(root_dir="data/images/", df=valid_df, processor=processor)

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy="steps",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    fp16=False, 
    output_dir="./git-hf/",
    logging_steps=200,
    save_strategy="epoch",
    eval_steps=200,
    num_train_epochs=4,
)

import evaluate

# load ROUGE and grab the processor's tokenizer for decoding
rouge_metric = evaluate.load("rouge")
tokenizer = processor.tokenizer

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    rouge = rouge_metric.compute(predictions=pred_str, references=label_str)

    # evaluate's rouge returns plain floats per ROUGE type
    return {"rouge": rouge["rougeL"]}

from transformers import default_data_collator

# instantiate trainer
trainer = Seq2SeqTrainer(
    model=model,
    tokenizer=processor,
    args=training_args,
    #compute_metrics=compute_metrics,
    train_dataset=train_set,
    eval_dataset=valid_set,
    data_collator=default_data_collator,
)

trainer.train()

The full error trace is the following:

***** Running training *****
  Num examples = 8886
  Num Epochs = 4
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 8888
  Number of trainable parameters = 394196026
vision_config is None. initializing the GitVisionConfig with default values.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [26], line 1
----> 1 trainer.train()

File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:1570, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1565     self.model_wrapped = self.model
   1567 inner_training_loop = find_executable_batch_size(
   1568     self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
   1569 )
-> 1570 return inner_training_loop(
   1571     args=args,
   1572     resume_from_checkpoint=resume_from_checkpoint,
   1573     trial=trial,
   1574     ignore_keys_for_eval=ignore_keys_for_eval,
   1575 )

File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:1835, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   1833         tr_loss_step = self.training_step(model, inputs)
   1834 else:
-> 1835     tr_loss_step = self.training_step(model, inputs)
   1837 if (
   1838     args.logging_nan_inf_filter
   1839     and not is_torch_tpu_available()
   1840     and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
   1841 ):
   1842     # if loss is nan or inf simply add the average of previous logged losses
   1843     tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)

File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:2583, in Trainer.training_step(self, model, inputs)
   2580     return loss_mb.reduce_mean().detach().to(self.args.device)
   2582 with self.compute_loss_context_manager():
-> 2583     loss = self.compute_loss(model, inputs)
   2585 if self.args.n_gpu > 1:
   2586     loss = loss.mean()  # mean() to average on multi-gpu parallel training

File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:2628, in Trainer.compute_loss(self, model, inputs, return_outputs)
   2626 else:
   2627     if isinstance(outputs, dict) and "loss" not in outputs:
-> 2628         raise ValueError(
   2629             "The model did not return a loss from the inputs, only the following keys: "
   2630             f"{','.join(outputs.keys())}. For reference, the inputs it received are {','.join(inputs.keys())}."
   2631         )
   2632     # We don't use .loss here since the model may return tuples instead of ModelOutput.
   2633     loss = outputs["loss"] if isinstance(outputs, dict) else outputs[0]

ValueError: The model did not return a loss from the inputs, only the following keys: logits,past_key_values. For reference, the inputs it received are input_ids,attention_mask,pixel_values.

I've also tried to fine-tune the model using the code in this tutorial: https://colab.research.google.com/drive/1HLxgrG7xZJ9FvXckNG61J72FkyrbqKAA. That seemed to work, but when I saved a checkpoint every epoch, those checkpoints produced strange output on my test set: each checkpoint gave the same output for every input image. The final model that was still in memory after the training loop finished behaved differently, and actually produced different output for different images, so something seems to have gone wrong with saving.

I used the same code as in that tutorial; it only differed here:

import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

model.train()

for epoch in range(10):
    print("Epoch:", epoch)
    for idx, batch in enumerate(train_dataloader):
        input_ids = batch.pop("input_ids").to(device)
        pixel_values = batch.pop("pixel_values").to(device)

        outputs = model(input_ids=input_ids,
                        pixel_values=pixel_values,
                        labels=input_ids)

        loss = outputs.loss

        if idx % 100 == 0:
            print(str(idx) + "/ " + str(len(train_dataloader)) + " Loss:", loss.item())

        loss.backward()

        optimizer.step()
        optimizer.zero_grad()
        
    model.save_pretrained("git2/checkpoint-"+str(epoch))

Can you help me with these problems, or at least with one of the two of them? :)

Best wishes

Hi,

Thanks for your interest in GIT!

You don't need to use the Seq2SeqTrainer for a decoder-only model like GIT.

Seq2SeqTrainer is only meant for Seq2Seq models like T5, BART, PEGASUS, etc.
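For reference, a hedged sketch of the same setup with the regular Trainer (the argument values mirror the Seq2SeqTrainingArguments from the question, minus predict_with_generate, which is a Seq2SeqTrainer-only option; `model`, `train_set`, and `valid_set` are the objects defined earlier in the thread, wrapped in a hypothetical helper here so the pieces are explicit):

```python
from transformers import Trainer, TrainingArguments

def build_trainer(model, train_set, valid_set, processor=None):
    # Same values as in the question, but with the plain
    # TrainingArguments instead of Seq2SeqTrainingArguments.
    training_args = TrainingArguments(
        output_dir="./git-hf/",
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        evaluation_strategy="steps",
        eval_steps=200,
        logging_steps=200,
        save_strategy="epoch",
        num_train_epochs=4,
    )
    return Trainer(
        model=model,
        args=training_args,
        tokenizer=processor,
        train_dataset=train_set,
        eval_dataset=valid_set,
    )
```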

Hi Niels,

Thanks for the quick reply. With the regular Trainer and TrainingArguments, I get the same problem. How should I use them differently?

I see that you're not preparing any labels for the model. Hence it will be difficult for the model to train ;)

In your PyTorch Dataset, you prepare each image + text pair using the processor, which is great. However, you also need to add a labels key to the encoding dictionary: the model requires pixel_values, input_ids (the prepared image + text pair), and labels (the ground-truth targets to produce) to compute a loss. The labels are simply a copy of the input_ids:

encoding["labels"] = encoding["input_ids"].clone()

as the model internally will shift them one position to compute the cross-entropy loss.
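A minimal sketch of that fix with stand-in tensors in place of the real processor output (shapes and token ids here are illustrative), plus a toy version of the internal shift to show why a plain copy of input_ids is enough:

```python
import torch
import torch.nn.functional as F

# Stand-ins for what the processor returns with return_tensors="pt"
# (shapes are illustrative, not the real GIT shapes).
encoding = {
    "input_ids": torch.tensor([101, 2023, 2003, 102]),
    "attention_mask": torch.ones(4, dtype=torch.long),
    "pixel_values": torch.zeros(3, 224, 224),
}

# The fix inside __getitem__: labels are a copy of input_ids.
# The processor returns tensors, so use .clone() rather than .copy().
encoding["labels"] = encoding["input_ids"].clone()

# Toy illustration of the shift the model performs internally:
# the logit at position t is scored against the token at t + 1.
vocab_size = 30522                                   # illustrative
labels = encoding["labels"].unsqueeze(0)             # (1, seq_len)
logits = torch.randn(1, labels.size(1), vocab_size)  # fake model output

loss = F.cross_entropy(
    logits[:, :-1, :].reshape(-1, vocab_size),  # predictions for t + 1
    labels[:, 1:].reshape(-1),                  # targets shifted left
)
```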

Ah, that makes a lot of sense and works much better, thank you!

However, now I'm wondering: does this conceptually mean that the model is trained to expect an (image, text) pair as input? Because the "text" in this pair is the ground-truth caption. It works fine at inference time (with only the image as input), but I want to make sure I understand the concepts correctly. I'm especially confused because I already trained BlipForConditionalGeneration in the same way, without providing labels, and that seemed to work fine. How is that possible?

That's a good point. It's because of these lines of code, which automatically compute the labels for BLIP in case the user doesn't provide them.

Opened an issue here to remove this unexpected behaviour: https://github.com/huggingface/transformers/issues/21510
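The pattern the reply describes can be paraphrased like this (illustrative only, not BLIP's exact source; the pad-to--100 masking is an assumption about how ignored positions are handled):

```python
import torch

def auto_labels(input_ids, labels=None, pad_token_id=0):
    # If the caller passes no labels, fall back to input_ids, with
    # pad positions masked to -100 so the loss ignores them.
    if labels is None:
        labels = input_ids.masked_fill(input_ids == pad_token_id, -100)
    return labels

input_ids = torch.tensor([[101, 2023, 102, 0, 0]])
labels = auto_labels(input_ids)
```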

Thank you! That explained a lot for me. Closing this issue now as it's resolved!

rubenjanss changed discussion status to closed
