Catastrophic Forgetting in LLMs: A Comparative Analysis Across Language Tasks

Source: https://arxiv.org/html/2504.01241

Naimul Haque
Alfred University, New York, USA
naimul011@gmail.com

Abstract

Large Language Models (LLMs) have significantly advanced Natural Language Processing (NLP), particularly in Natural Language Understanding (NLU) tasks. As we progress toward an agentic world where LLM-based agents autonomously handle specialized tasks, it becomes crucial for these models to adapt to new tasks without forgetting previously learned information—a challenge known as catastrophic forgetting. This study evaluates the continual fine-tuning of various open-source LLMs with different parameter sizes (specifically models under 10 billion parameters) on key NLU tasks from the GLUE benchmark, including SST-2, MRPC, CoLA, and MNLI. By employing prompt engineering and task-specific adjustments, we assess and compare the models’ abilities to retain prior knowledge while learning new tasks. Our results indicate that models such as Phi-3.5-mini exhibit minimal forgetting while maintaining strong learning capabilities, making them well-suited for continual learning environments. Additionally, models like Orca-2-7b and Qwen2.5-7B demonstrate impressive learning abilities and overall performance after fine-tuning. This work contributes to understanding catastrophic forgetting in LLMs and highlights prompt engineering as a way to optimize model performance for continual learning scenarios.

1 Introduction

Large Language Models (LLMs) Vaswani et al. (2017) have transformed Natural Language Processing (NLP), delivering state-of-the-art performance on various tasks like sentiment analysis, paraphrase detection, and natural language inference Bubeck et al. (2023). Open-source models such as Llama 3 Llama Team (2024) have become essential in advancing Natural Language Understanding (NLU) Wang et al. (2018) tasks, which are critical for interpreting and generating human language.

As we move towards an agentic world Kapoor et al. (2024) where LLM-based agents autonomously handle specialized tasks, the ability to fine-tune models on multiple tasks without losing accuracy or forgetting previously learned information is crucial. Continual fine-tuning (CF) Fawi (2024) enables models to adapt and excel in varied environments while maintaining performance across tasks.

While prior research has focused on text generation capabilities Luo et al. (2024), less attention has been given to addressing CF in NLU tasks Wang et al. (2018), which are vital for real-world applications. Furthermore, there has been limited exploration of how models with different parameter sizes handle CF during continual fine-tuning, especially models under 10 billion parameters.

This study aims to fill these gaps by evaluating forgetting across various models with different parameter sizes on key NLU tasks from the GLUE benchmark Wang et al. (2018), including SST-2, MRPC, CoLA, and MNLI. By conducting extensive experimentation with prompt engineering and task-specific adjustments, we provide comparative insights into how models like Orca-2-7b Microsoft (2023), Llama-3.1-8B Llama Team (2024), and Phi-3.5-mini Microsoft (2024) perform under sequential fine-tuning.

Our results show that models such as Phi-3.5-mini Microsoft (2024) continue to exhibit minimal forgetting while maintaining strong learning capabilities, making them ideal for continual learning environments. Similarly, Orca-2-7b Microsoft (2023) and Qwen2.5-7B Hui et al. (2024) demonstrate impressive learning abilities and overall performance after fine-tuning.

In summary, this work contributes by:

  • Evaluating catastrophic forgetting and task learning across diverse LLMs using sequential fine-tuning on specific NLU tasks.
  • Highlighting the importance of prompt engineering and fine-tuning strategies for optimizing model performance.
  • Providing insights into the performance of models with different parameter sizes, identifying those best suited for continual learning.
  • Proposing continual fine-tuning as a key strategy for future LLM agents to handle multiple tasks without sacrificing accuracy.

2 Related Work

Catastrophic forgetting, a phenomenon first identified by McCloskey and Cohen (1989), remains a fundamental challenge in sequential learning tasks for neural networks. When models are trained on multiple tasks in sequence, they tend to overwrite previously acquired knowledge, leading to significant performance degradation on earlier tasks. Various methods have been proposed to mitigate this issue, such as Elastic Weight Consolidation (EWC) Kirkpatrick et al. (2017), which regularizes weight updates to protect crucial parameters learned from previous tasks. Similarly, memory-based methods like Gradient Episodic Memory (GEM) Lopez-Paz and Ranzato (2017) address forgetting by storing and replaying examples from past tasks during training, thereby reducing interference.

While recent LLMs show strong zero-shot performance, they often struggle with tasks outside their training and evaluation sets. To address this, Scialom et al. (2022) propose Continual-T0 (CT0), a fine-tuned LLM capable of learning new tasks while retaining prior knowledge, largely due to the self-supervised pre-training process. Kemker et al. (2017) showed that deep neural networks struggle to learn new tasks without forgetting old ones; various methods have been proposed to mitigate this, but their effectiveness varies depending on the training paradigm and data type. Huang et al. (2024) conducted an empirical study on catastrophic forgetting in LLMs, finding that forgetting becomes more severe as model size increases, especially in models ranging from 1B to 7B parameters, during continual fine-tuning across domains like reasoning and reading comprehension.

3 Methodology

We employed a Continual Instruction Fine-tuning Huang et al. (2024) approach, shown in Figure 1, sequentially adapting the base model $M_0$ on tasks $\{T_1, T_2, \dots, T_n\}$ from the GLUE benchmark Wang et al. (2018). The goal was to evaluate how well the model $M_i$, fine-tuned on task $T_i$, retained knowledge from previous tasks $\{T_1, \dots, T_{i-1}\}$.
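This sequential workflow can be sketched in a few lines. The `fine_tune` and `evaluate` callables below are hypothetical stand-ins for the actual training and evaluation routines, so this is a structural sketch of the loop rather than the paper's implementation:

```python
def continual_fine_tune(base_model, tasks, fine_tune, evaluate):
    """Sequentially adapt the base model M_0 on tasks T_1..T_n,
    evaluating each intermediate model M_i on all tasks seen so far."""
    model = base_model
    history = []  # history[i] maps each seen task to the accuracy of M_{i+1}
    for i, task in enumerate(tasks):
        model = fine_tune(model, task)  # M_i -> M_{i+1}
        # Continual evaluation: score M_{i+1} on T_1..T_{i+1}.
        scores = {t: evaluate(model, t) for t in tasks[: i + 1]}
        history.append(scores)
    return model, history
```

The returned `history` is exactly the accuracy record needed later to quantify forgetting and learning per task.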


Figure 1: The figure illustrates the Continual Fine-tuning workflow. $M_0$ represents the base language model, and subsequent models $M_1, M_2, \dots, M_n$ denote the fine-tuned versions after training on tasks $T_1, T_2, \dots, T_n$. The figure also highlights the process of generating task-specific prompts and the continual evaluation to assess the model’s retention.

The methodology comprised two main stages. First, we prepared the dataset for each task $T_i$ using prompt engineering (PE) Chen et al. (2024). Let $X$ be the original dataset for $T_i$, transformed into a structured prompt dataset $X'$ as follows:

$$X' = \text{PE}(X)$$

where $\text{PE}(\cdot)$ denotes the prompt engineering function. Prompts were designed to guide the model through task-specific instructions; the exact prompts and their expected outputs are shown in Table 1. Second, we sequentially fine-tuned the model on each prompt dataset and evaluated it after every episode, as described in Section 4.
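As an illustration, $\text{PE}(\cdot)$ can be implemented as template filling over the raw examples. The templates below are illustrative placeholders, not the paper's exact prompts (those appear in Table 1):

```python
# Hypothetical prompt templates; the paper's actual instructions are in Table 1.
TEMPLATES = {
    "sst2": ("Classify the sentiment of the sentence as positive or negative.\n"
             "Sentence: {sentence}\nAnswer:"),
    "mrpc": ("Do the following two sentences have the same meaning? "
             "Answer yes or no.\nSentence 1: {sentence1}\n"
             "Sentence 2: {sentence2}\nAnswer:"),
}

def pe(examples, task):
    """PE(X): turn raw examples X into the structured prompt dataset X'."""
    template = TEMPLATES[task]
    return [template.format(**ex) for ex in examples]
```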

Table 1: This table presents the instructions used for fine-tuning models on specific tasks such as sentiment analysis (SST-2), paraphrase detection (MRPC), grammatical acceptability (CoLA), and natural language inference (MNLI).

4 Evaluation and Measurement of Catastrophic Forgetting and Learning

After each fine-tuning episode, model $M_i$ was evaluated on all previous tasks $\{T_1, T_2, \dots, T_i\}$ to assess catastrophic forgetting and learning. Accuracy was used as the evaluation metric, comparing post-fine-tuning performance to the base performance to detect any degradation or improvement. Forgetting was quantified as the difference between the maximum accuracy on each task during fine-tuning and the final accuracy:

$$\text{Forgetting for Task } t = \max_{0 \leq k \leq T}(a_{k,t}) - a_{T,t}$$

where $a_{k,t}$ is the accuracy on task $t$ after fine-tuning on task $k$, and $a_{T,t}$ is the final accuracy on task $t$.

Learning was calculated as the improvement over the base performance:

$$\text{Learning for Task } t = \max_{k \leq T}(a_{k,t}) - a_{0,t}$$

where $a_{0,t}$ is the base accuracy and $\max_{k \leq T}(a_{k,t})$ is the highest accuracy during fine-tuning. Both metrics provide insight into the model’s behavior across tasks.
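Both metrics can be computed directly from an accuracy matrix `acc`, where `acc[k][t]` holds the accuracy on task `t` after fine-tuning on task `k` and row 0 holds the base model's accuracies $a_{0,t}$. This is a minimal sketch under that layout assumption, not the paper's evaluation code:

```python
def forgetting(acc, t):
    """Forgetting for task t: max_{0<=k<=T} a_{k,t} - a_{T,t}."""
    return max(row[t] for row in acc) - acc[-1][t]

def learning(acc, t):
    """Learning for task t: max_{k<=T} a_{k,t} - a_{0,t}."""
    return max(row[t] for row in acc) - acc[0][t]
```

A forgetting value of 0 means the final model matches its own best accuracy on that task; larger values indicate performance lost during later fine-tuning episodes.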

4.1 Selected Tasks

The selected tasks include:

  1. SST-2 (Stanford Sentiment Treebank): A binary classification task for determining whether a sentence has a positive or negative sentiment.
  2. MRPC (Microsoft Research Paraphrase Corpus): A binary classification task to predict whether two sentences are paraphrases of each other.
  3. CoLA (Corpus of Linguistic Acceptability): A binary classification task for determining whether a sentence is grammatically correct.
  4. MNLI (Multi-Genre Natural Language Inference): A three-class classification task where the goal is to determine whether a premise sentence entails, contradicts, or is neutral with respect to a hypothesis sentence across multiple genres of text.
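Because the models are generative, free-text answers must be mapped back to these label spaces before accuracy can be computed. The helper below is a hypothetical sketch of such a mapping; the label names follow the standard GLUE conventions, and `parse_prediction` is our illustrative helper, not a function from the paper:

```python
# Standard GLUE label spaces for the four tasks (binary for SST-2, MRPC,
# and CoLA; three-way for MNLI).
LABELS = {
    "sst2": ["negative", "positive"],
    "mrpc": ["not_equivalent", "equivalent"],
    "cola": ["unacceptable", "acceptable"],
    "mnli": ["entailment", "neutral", "contradiction"],
}

def parse_prediction(text, task):
    """Map a generated answer string to a GLUE label index (hypothetical helper)."""
    norm = text.strip().lower().replace(" ", "_")
    for idx, label in enumerate(LABELS[task]):
        if norm == label:
            return idx
    return -1  # unparsable output, counted as incorrect
```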

5 Experimental Results

Table 2 presents a comparison of various models’ performance across four key metrics: Pretrained Performance, Forgetting, Learning, and overall Training Performance after sequential task fine-tuning.

Table 2: Comparison of the pre-trained performance, forgetting, learning capabilities, and training performance of various language models. Parameter Size (B) indicates the number of parameters (in billions) of each model. The Pretrained Performance refers to the initial performance of each model before any task-specific training.

Orca-2-7b Microsoft (2023) achieved the highest pretraining performance at 0.76, demonstrating strong initial capabilities, closely followed by Qwen2.5-14B Hui et al. (2024) at 0.71, indicating a solid foundation before fine-tuning. In terms of catastrophic forgetting, Phi-3.5-mini Microsoft (2024) and Phi-2 showed minimal forgetting with values of 0.02 and 0.1, respectively. On the other hand, Qwen2.5-14B and Llama-3.1-8B Llama Team (2024) exhibited high forgetting rates of 0.935 and 0.59. After fine-tuning, Orca-2-7b stood out with the best average performance of 0.81, followed by Qwen2.5-14B at 0.80. Models like Phi-3.5-mini and Orca-2-7b performed better overall, balancing low forgetting and high learning rates, offering valuable insights into mitigating catastrophic forgetting.

Results from continual fine-tuning on the task SST-2 (Figure 2) show Qwen2.5-7B Hui et al. (2024) leading in accuracy and learning, while Orca-2-7b Microsoft (2023) exhibited the least forgetting across tasks.


Figure 2: Performance of various models across continual fine-tuning episodes for the task SST-2. The solid blue line highlights the model with the highest overall performance, while the solid orange line represents the model with the least amount of forgetting (the smallest drop in performance between tasks). Dashed lines indicate the performance of other models. This diagram illustrates both the learning capacity and retention ability of each model over successive tasks.

Figure 3 shows a clear pattern where larger models, like Qwen2.5-7B and Llama-3.1-8B, exhibit higher learning rates, often at the cost of increased forgetting. In contrast, smaller models, such as Phi-3.5-mini and Phi-2, manage to balance low forgetting with moderate learning gains. This suggests a trade-off between capacity and stability across models.


Figure 3: Bar graph displaying model performance for the task SST-2 on catastrophic forgetting (reversed) and learning rates, with larger models showing more significant trade-offs. Phi-3.5-mini stands out with minimal forgetting and moderate learning.

6 Conclusion

This study explored catastrophic forgetting in large language models during sequential fine-tuning on multiple NLU tasks. We found that smaller models like Phi-3.5-mini effectively minimize forgetting while maintaining learning capabilities. Prompt engineering and fine-tuning strategies significantly impact model performance in continual learning settings. Models such as Orca-2-7b and Qwen2.5-7B showed strong learning abilities but varied in forgetting. Careful model selection and tuning can enhance handling multiple tasks without sacrificing accuracy, which is crucial for developing autonomous LLM-based agents. Future work should explore more advanced continual learning techniques to mitigate catastrophic forgetting.

7 Limitations

This study focused on models under 10 billion parameters and specific NLU tasks from the GLUE benchmark, so results may not generalize to larger models or other tasks. We only used sequential fine-tuning without exploring other continual learning strategies such as memory replay or regularization methods. Relying on prompt engineering may introduce biases affecting performance comparisons. We also did not consider the computational constraints of continual fine-tuning, which could impact practical deployment. Finally, using accuracy as the sole metric might not capture all aspects of model performance.

References

Appendix A Example Appendix

This is an appendix.
