Catastrophic Forgetting in LLMs: A Comparative Analysis Across Language Tasks

Source: https://arxiv.org/html/2504.01241

Naimul Haque
Alfred University, New York, USA
naimul011@gmail.com

Abstract

Large Language Models (LLMs) have significantly advanced Natural Language Processing (NLP), particularly in Natural Language Understanding (NLU) tasks. As we progress toward an agentic world where LLM-based agents autonomously handle specialized tasks, it becomes crucial for these models to adapt to new tasks without forgetting previously learned information—a challenge known as catastrophic forgetting. This study evaluates the continual fine-tuning of various open-source LLMs with different parameter sizes (specifically models under 10 billion parameters) on key NLU tasks from the GLUE benchmark, including SST-2, MRPC, CoLA, and MNLI. By employing prompt engineering and task-specific adjustments, we assess and compare the models’ abilities to retain prior knowledge while learning new tasks. Our results indicate that models such as Phi-3.5-mini exhibit minimal forgetting while maintaining strong learning capabilities, making them well-suited for continual learning environments. Additionally, models like Orca-2-7b and Qwen2.5-7B demonstrate impressive learning abilities and overall performance after fine-tuning. This work contributes to understanding catastrophic forgetting in LLMs and highlights prompt engineering as a way to optimize model performance for continual learning scenarios.

1 Introduction

Large Language Models (LLMs) Vaswani et al. (2017) have transformed Natural Language Processing (NLP), delivering state-of-the-art performance on various tasks like sentiment analysis, paraphrase detection, and natural language inference Bubeck et al. (2023). Open-source models such as Llama 3 Llama Team (2024) have become essential in advancing Natural Language Understanding (NLU) Wang et al. (2018) tasks, which are critical for interpreting and generating human language.

As we move towards an agentic world Kapoor et al. (2024) where LLM-based agents autonomously handle specialized tasks, the ability to fine-tune models on multiple tasks without losing accuracy or forgetting previously learned information is crucial. Continual fine-tuning (CF) Fawi (2024) enables models to adapt and excel in varied environments while maintaining performance across tasks.

While prior research has focused on text generation capabilities Luo et al. (2024), less attention has been given to addressing CF in NLU tasks Wang et al. (2018), which are vital for real-world applications. Furthermore, there has been limited exploration of how models with different parameter sizes handle CF during continual fine-tuning, especially models under 10 billion parameters.

This study aims to fill these gaps by evaluating forgetting across various models with different parameter sizes on key NLU tasks from the GLUE benchmark Wang et al. (2018), including SST-2, MRPC, CoLA, and MNLI. By conducting extensive experimentation with prompt engineering and task-specific adjustments, we provide comparative insights into how models like Orca-2-7b Microsoft (2023), Llama-3.1-8B Llama Team (2024), and Phi-3.5-mini Microsoft (2024) perform under sequential fine-tuning.

Our results show that models such as Phi-3.5-mini Microsoft (2024) continue to exhibit minimal forgetting while maintaining strong learning capabilities, making them ideal for continual learning environments. Similarly, Orca-2-7b Microsoft (2023) and Qwen2.5-7B Hui et al. (2024) demonstrate impressive learning abilities and overall performance after fine-tuning.

In summary, this work contributes by:

  • Evaluating catastrophic forgetting and task learning across diverse LLMs using sequential fine-tuning on specific NLU tasks.
  • Highlighting the importance of prompt engineering and fine-tuning strategies for optimizing model performance.
  • Providing insights into the performance of models with different parameter sizes, identifying those best suited for continual learning.
  • Proposing continual fine-tuning as a key strategy for future LLM agents to handle multiple tasks without sacrificing accuracy.

2 Related Work

Catastrophic forgetting, a phenomenon first identified by McCloskey and Cohen (1989), remains a fundamental challenge in sequential learning tasks for neural networks. When models are trained on multiple tasks in sequence, they tend to overwrite previously acquired knowledge, leading to significant performance degradation on earlier tasks. Various methods have been proposed to mitigate this issue, such as Elastic Weight Consolidation (EWC) Kirkpatrick et al. (2017), which regularizes weight updates to protect crucial parameters learned from previous tasks. Similarly, memory-based methods like Gradient Episodic Memory (GEM) Lopez-Paz and Ranzato (2017) address forgetting by storing and replaying examples from past tasks during training, thereby reducing interference.

While recent LLMs show strong zero-shot performance, they often struggle with tasks outside their training and evaluation sets. To address this, Scialom et al. (2022) propose Continual-T0 (CT0), a fine-tuned LLM capable of learning new tasks while retaining prior knowledge, largely due to the self-supervised pre-training process. Kemker et al. (2017) showed that deep neural networks struggle to learn new tasks without forgetting old ones; various methods have been proposed to mitigate this, but their effectiveness varies depending on the training paradigm and data type. Huang et al. (2024) conducted an empirical study on catastrophic forgetting in LLMs, finding that forgetting becomes more severe as model size increases, especially in models ranging from 1B to 7B parameters, during continual fine-tuning across domains like reasoning and reading comprehension.

3 Methodology

We employed a Continual Instruction Fine-tuning Huang et al. (2024) approach, shown in Figure 1, sequentially adapting the base model $M_0$ on tasks $\{T_1, T_2, \dots, T_n\}$ from the GLUE benchmark Wang et al. (2018). The goal was to evaluate how well the model $M_i$, fine-tuned on task $T_i$, retained knowledge from previous tasks $\{T_1, \dots, T_{i-1}\}$.
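This sequential workflow can be sketched in a few lines. The `fine_tune` and `evaluate` callables below are hypothetical stand-ins for the actual training and evaluation routines, so this is a structural sketch of the loop rather than the paper's implementation:

```python
def continual_fine_tune(base_model, tasks, fine_tune, evaluate):
    """Sequentially adapt the base model M_0 on tasks T_1..T_n,
    evaluating each intermediate model M_i on all tasks seen so far."""
    model = base_model
    history = []  # history[i] maps each seen task to the accuracy of M_{i+1}
    for i, task in enumerate(tasks):
        model = fine_tune(model, task)  # M_i -> M_{i+1}
        # Continual evaluation: score M_{i+1} on T_1..T_{i+1}.
        scores = {t: evaluate(model, t) for t in tasks[: i + 1]}
        history.append(scores)
    return model, history
```

The returned `history` is exactly the accuracy record needed later to quantify forgetting and learning per task.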


Figure 1: The figure illustrates the Continual Fine-tuning workflow. $M_0$ represents the base language model, and subsequent models $M_1, M_2, \dots, M_n$ denote the fine-tuned versions after training on tasks $T_1, T_2, \dots, T_n$. The figure also highlights the process of generating task-specific prompts and the continual evaluation to assess the model’s retention.

The methodology comprised two main stages. First, we prepared the dataset for each task $T_i$ using prompt engineering (PE) Chen et al. (2024). Let $X$ be the original dataset for $T_i$, transformed into a structured prompt dataset $X'$ as follows:

$$X' = \text{PE}(X)$$

where $\text{PE}(\cdot)$ denotes the prompt engineering function. Prompts were designed to guide the model through task-specific instructions; the exact prompts and their expected outputs are shown in Table 1. Second, we sequentially fine-tuned the model on each prompt dataset and evaluated it after every episode, as described in Section 4.
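As an illustration, $\text{PE}(\cdot)$ can be implemented as template filling over the raw examples. The templates below are illustrative placeholders, not the paper's exact prompts (those appear in Table 1):

```python
# Hypothetical prompt templates; the paper's actual instructions are in Table 1.
TEMPLATES = {
    "sst2": ("Classify the sentiment of the sentence as positive or negative.\n"
             "Sentence: {sentence}\nAnswer:"),
    "mrpc": ("Do the following two sentences have the same meaning? "
             "Answer yes or no.\nSentence 1: {sentence1}\n"
             "Sentence 2: {sentence2}\nAnswer:"),
}

def pe(examples, task):
    """PE(X): turn raw examples X into the structured prompt dataset X'."""
    template = TEMPLATES[task]
    return [template.format(**ex) for ex in examples]
```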

Table 1: This table presents the instructions used for fine-tuning models on specific tasks such as sentiment analysis (SST-2), paraphrase detection (MRPC), grammatical acceptability (CoLA), and natural language inference (MNLI).

4 Evaluation and Measurement of Catastrophic Forgetting and Learning

After each fine-tuning episode, model $M_i$ was evaluated on all previous tasks $\{T_1, T_2, \dots, T_i\}$ to assess catastrophic forgetting and learning. Accuracy was used as the evaluation metric, comparing post-fine-tuning performance to the base performance to detect any degradation or improvement. Forgetting was quantified as the difference between the maximum accuracy on each task during fine-tuning and the final accuracy:

$$\text{Forgetting for Task } t = \max_{0 \leq k \leq T}(a_{k,t}) - a_{T,t}$$

where $a_{k,t}$ is the accuracy on task $t$ after fine-tuning on task $k$, and $a_{T,t}$ is the final accuracy on task $t$.

Learning was calculated as the improvement over the base performance:

$$\text{Learning for Task } t = \max_{k \leq T}(a_{k,t}) - a_{0,t}$$

where $a_{0,t}$ is the base accuracy and $\max_{k \leq T}(a_{k,t})$ is the highest accuracy during fine-tuning. Both metrics provide insight into the model’s behavior across tasks.
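Both metrics can be computed directly from an accuracy matrix `acc`, where `acc[k][t]` holds the accuracy on task `t` after fine-tuning on task `k` and row 0 holds the base model's accuracies $a_{0,t}$. This is a minimal sketch under that layout assumption, not the paper's evaluation code:

```python
def forgetting(acc, t):
    """Forgetting for task t: max_{0<=k<=T} a_{k,t} - a_{T,t}."""
    return max(row[t] for row in acc) - acc[-1][t]

def learning(acc, t):
    """Learning for task t: max_{k<=T} a_{k,t} - a_{0,t}."""
    return max(row[t] for row in acc) - acc[0][t]
```

A forgetting value of 0 means the final model matches its own best accuracy on that task; larger values indicate performance lost during later fine-tuning episodes.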

4.1 Selected Tasks

The selected tasks include:

  1. SST-2 (Stanford Sentiment Treebank): A binary classification task for determining whether a sentence has a positive or negative sentiment.
  2. MRPC (Microsoft Research Paraphrase Corpus): A binary classification task to predict whether two sentences are paraphrases of each other.
  3. CoLA (Corpus of Linguistic Acceptability): A binary classification task for determining whether a sentence is grammatically correct.
  4. MNLI (Multi-Genre Natural Language Inference): A three-class classification task where the goal is to determine whether a premise sentence entails, contradicts, or is neutral with respect to a hypothesis sentence across multiple genres of text.
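Because the models are generative, free-text answers must be mapped back to these label spaces before accuracy can be computed. The helper below is a hypothetical sketch of such a mapping; the label names follow the standard GLUE conventions, and `parse_prediction` is our illustrative helper, not a function from the paper:

```python
# Standard GLUE label spaces for the four tasks (binary for SST-2, MRPC,
# and CoLA; three-way for MNLI).
LABELS = {
    "sst2": ["negative", "positive"],
    "mrpc": ["not_equivalent", "equivalent"],
    "cola": ["unacceptable", "acceptable"],
    "mnli": ["entailment", "neutral", "contradiction"],
}

def parse_prediction(text, task):
    """Map a generated answer string to a GLUE label index (hypothetical helper)."""
    norm = text.strip().lower().replace(" ", "_")
    for idx, label in enumerate(LABELS[task]):
        if norm == label:
            return idx
    return -1  # unparsable output, counted as incorrect
```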

5 Experimental Results

Table 2 presents a comparison of various models’ performance across four key metrics: Pretrained Performance, Forgetting, Learning, and overall Training Performance after sequential task fine-tuning.

Table 2: Comparison of the pre-trained performance, forgetting, learning capabilities, and training performance of various language models. Parameter Size (B) indicates the number of parameters (in billions) of each model. The Pretrained Performance refers to the initial performance of each model before any task-specific training.

Orca-2-7b Microsoft (2023) achieved the highest pretraining performance at 0.76, demonstrating strong initial capabilities, closely followed by Qwen2.5-14B Hui et al. (2024) at 0.71, indicating a solid foundation before fine-tuning. In terms of catastrophic forgetting, Phi-3.5-mini Microsoft (2024) and Phi-2 showed minimal forgetting with values of 0.02 and 0.1, respectively. On the other hand, Qwen2.5-14B and Llama-3.1-8B Llama Team (2024) exhibited high forgetting rates of 0.935 and 0.59. After fine-tuning, Orca-2-7b stood out with the best average performance of 0.81, followed by Qwen2.5-14B at 0.80. Models like Phi-3.5-mini and Orca-2-7b performed better overall, balancing low forgetting and high learning rates, offering valuable insights into mitigating catastrophic forgetting.

Results from continual fine-tuning on the task SST-2 (Figure 2) show Qwen2.5-7B Hui et al. (2024) leading in accuracy and learning, while Orca-2-7b Microsoft (2023) exhibited the least forgetting across tasks.


Figure 2: Performance of various models across continual fine-tuning episodes for the task SST-2. The solid blue line highlights the model with the highest overall performance, while the solid orange line represents the model with the least amount of forgetting (the smallest drop in performance between tasks). Dashed lines indicate the performance of other models. This diagram illustrates both the learning capacity and retention ability of each model over successive tasks.

Figure 3 shows a clear pattern where larger models, like Qwen2.5-7B and Llama-3.1-8B, exhibit higher learning rates, often at the cost of increased forgetting. In contrast, smaller models, such as Phi-3.5-mini and Phi-2, manage to balance low forgetting with moderate learning gains. This suggests a trade-off between capacity and stability across models.


Figure 3: Bar graph displaying model performance for the task SST-2 on catastrophic forgetting (reversed) and learning rates, with larger models showing more significant trade-offs. Phi-3.5-mini stands out with minimal forgetting and moderate learning.

6 Conclusion

This study explored catastrophic forgetting in large language models during sequential fine-tuning on multiple NLU tasks. We found that smaller models like Phi-3.5-mini effectively minimize forgetting while maintaining learning capabilities. Prompt engineering and fine-tuning strategies significantly impact model performance in continual learning settings. Models such as Orca-2-7b and Qwen2.5-7B showed strong learning abilities but varied in forgetting. Careful model selection and tuning can enhance handling multiple tasks without sacrificing accuracy, which is crucial for developing autonomous LLM-based agents. Future work should explore more advanced continual learning techniques to mitigate catastrophic forgetting.

7 Limitations

This study focused on models under 10 billion parameters and specific NLU tasks from the GLUE benchmark, so results may not generalize to larger models or other tasks. We only used sequential fine-tuning without exploring other continual learning strategies such as memory replay or regularization methods. Relying on prompt engineering may introduce biases affecting performance comparisons. We also did not consider the computational constraints of continual fine-tuning, which could impact practical deployment. Finally, using accuracy as the sole metric might not capture all aspects of model performance.

References

Appendix A Example Appendix

This is an appendix.
