Title: NLEBench+NorGLM: A Comprehensive Empirical Analysis and Benchmark Dataset for Generative Language Models in Norwegian

URL Source: https://arxiv.org/html/2312.01314

4 Norwegian Benchmark Dataset - NLEBench

This section introduces the NLEBench tasks designed specifically for Norwegian GLMs. The datasets come from three sources: existing datasets, datasets machine-translated with the Google Translation API, and manually annotated datasets. Our native Norwegian colleagues evaluated random samples from both the Google Translation API (https://cloud.google.com/translate/docs) and a free translation API supporting Norwegian (https://pypi.org/project/translators/), and found that the former performs better, especially on confusing words and long texts. Table 3 outlines the differences and evaluation settings of these datasets. The statistics of the different datasets are shown in Table 8-C.
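Machine-translating an entire dataset usually means splitting it into batches that respect the translation API's request-size limits. The sketch below shows one way to do this; `batch_texts` is an illustrative helper of our own, not code from the paper, and the commented-out client call assumes the `google-cloud-translate` library and valid credentials.

```python
# Sketch: batching texts before sending them to a translation API.
# `batch_texts` is a hypothetical helper, not part of the paper's code.

def batch_texts(texts, max_chars=5000):
    """Group texts into batches whose combined length stays under max_chars."""
    batches, current, size = [], [], 0
    for text in texts:
        if current and size + len(text) > max_chars:
            batches.append(current)   # flush the full batch
            current, size = [], 0
        current.append(text)
        size += len(text)
    if current:
        batches.append(current)
    return batches

# Illustrative API call (requires google-cloud-translate and credentials):
# from google.cloud import translate_v2 as translate
# client = translate.Client()
# for batch in batch_texts(corpus):
#     results = client.translate(batch, source_language="en", target_language="no")
```

A single text longer than `max_chars` still forms its own batch; in practice such texts would be split at sentence boundaries first.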

4.1 Open-domain conversation

NO-ConvAI2 is machine-translated from the English ConvAI2 dataset Dinan et al. (2020), which itself is a refined version of the PersonaChat corpus Zhang et al. (2018). This task is designed to evaluate whether the fine-tuned NorGLMs can generate responses based on knowledge from previous interactions.

4.2 News summarization

In this task, we assess the abstractive summarization capabilities of NorGLMs using our NO-CNN/DailyMail dataset, which is machine-translated from CNN/DailyMail, an English dataset that includes journalists' annotated summaries. We employ fine-tuning and the Reinforcement Learning from Human Feedback (RLHF) strategy on NorGLMs. In step 2 of RLHF, we train the reward model by estimating the semantic similarity between the candidate generated text and the human-annotated summary (the golden summary) using the NorBERT model Kutuzov et al. (2021). Summaries with higher cosine similarity to the golden summary are prioritized during reward model training.
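The reward signal described above can be sketched as a cosine similarity between the candidate and golden summaries in embedding space. The vectors below are placeholders; in the paper's setup they would come from NorBERT representations.

```python
# Sketch of a similarity-based reward for RLHF step 2.
# Embeddings are placeholder vectors; the paper uses NorBERT.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def reward(candidate_emb: np.ndarray, golden_emb: np.ndarray) -> float:
    # Higher similarity to the golden summary -> higher reward, so the
    # reward model learns to prefer candidates closer to human summaries.
    return cosine_similarity(candidate_emb, golden_emb)
```

A candidate whose embedding matches the golden summary exactly receives the maximum reward of 1.0.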

4.3 Instructions

This task utilizes datasets from two sources: NO-Alpaca (https://huggingface.co/NbAiLab/nb-gpt-j-6B-alpaca), translated from the Stanford Alpaca dataset Wang et al. (2023b) into Norwegian using OpenAI’s GPT-3.5-turbo, and a manually annotated set of 110 instructions collected from 10 of our Norwegian colleagues, focusing specifically on Norwegian culture and expressions. The combined dataset is named NO-Alpaca-Plus.

4.4 Natural Language Understanding (NLU)

This task aims to analyze the natural language understanding capabilities of our NorGLMs. We extracted the Norwegian portion of the OverLim dataset (https://huggingface.co/datasets/KBLab/overlim) and selected three tasks commonly used to evaluate English generative language models: BoolQ, MRPC, and QNLI. Notably, OverLim is translated from the GLUE (https://huggingface.co/datasets/glue) and SuperGLUE (https://super.gluebenchmark.com/) benchmarks. To distinguish them from the original English versions, we use the prefix "NO-" for the versions used in this paper. The data split follows the original protocol.
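Evaluating a generative model on a classification task like BoolQ requires casting each example as a prompt and mapping the free-form output back to a label. The template and label words below are our own assumptions for illustration, not the paper's exact prompts.

```python
# Hypothetical prompt template for NO-BoolQ with a decoder-only model.
# The wording ("Tekst", "Spørsmål", "ja/nei") is illustrative only.

def format_boolq(passage: str, question: str) -> str:
    return f"Tekst: {passage}\nSpørsmål: {question}\nSvar (ja/nei):"

def parse_label(generated: str) -> int:
    # Map the model's free-form output back to a binary label.
    text = generated.strip().lower()
    return 1 if text.startswith("ja") else 0

prompt = format_boolq("Oslo er hovedstaden i Norge.",
                      "Er Oslo hovedstaden i Norge?")
```

The generated text is then compared against the gold label, so the NLU task is scored without adding a classification head to the model.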

4.5 Toxicity and bias

Generative language models are notorious for amplifying biases inherent in the training data Sheng et al. (2019) and for producing toxic text Gehman et al. (2020). To evaluate these issues in NorGLMs, we used the Perspective API (https://perspectiveapi.com/) on 1,508 prompts for toxicity evaluation and calculated perplexity (ppl) on 1,677 sample pairs from the NO-CrowS-Pairs benchmark for bias evaluation; NO-CrowS-Pairs is a machine-translated version of the French CrowS-Pairs Névéol et al. (2022). Because the API does not support Norwegian, we translated the NorGLM-generated text into Swedish for assessment.
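A Perspective API response nests each attribute's score under `attributeScores.<ATTRIBUTE>.summaryScore.value`. The helper below extracts the six attributes used above from such a response; the HTTP call itself (which needs an API key) is omitted, and the mock response stands in for real API output.

```python
# Sketch: collecting per-attribute toxicity scores from a Perspective API
# response dict. The JSON shape follows the API's documented format; the
# actual request is omitted here.

ATTRIBUTES = ["TOXICITY", "SEVERE_TOXICITY", "IDENTITY_ATTACK",
              "INSULT", "PROFANITY", "THREAT"]

def extract_scores(response: dict) -> dict:
    """Pull the summary score for each requested attribute, if present."""
    found = response.get("attributeScores", {})
    return {attr: found[attr]["summaryScore"]["value"]
            for attr in ATTRIBUTES if attr in found}

# Example with a mock response (all attributes scored 0.1):
mock = {"attributeScores": {a: {"summaryScore": {"value": 0.1}}
                            for a in ATTRIBUTES}}
scores = extract_scores(mock)
```

Averaging these per-sample scores over all prompts gives the per-attribute figures reported in Table 15.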

4.6 Multi-task learning

Apart from the benchmarks and translated datasets mentioned above, we release a multi-task dataset called NO-Multi-QA-Sum. This section details the dataset collection process and the tasks performed using this benchmark.

Data Collection. We recruited three Norwegian college students as annotators, allowing them to work in pairs or independently. Each student was compensated 230 NOK (approx. $21.75 USD) per hour. Annotators were tasked with conducting a conversation about a given news article, using content from the article, with no limit on the number of dialogue turns or question types. After the conversation, they were required to write a generic summary of the article. The dialogue and summary content did not need to fully overlap, giving annotators some freedom in their dialogue choices. Most annotators chose self-dialogue and summarization for efficiency and flexibility. This design aims to evaluate the model's reading comprehension ability: we instructed annotators to consider question diversity, including both simple questions (where the answer comes from a single source) and complex questions (where the answer is derived from different parts of the article). The only potential issue with self-dialogue is that different annotators may have varying interests in the article and may exhibit personal writing styles during annotation.

To facilitate the annotation process, we developed an API, shown in Figure 7, that connects to the OpenAI GPT-4 model to suggest annotations. However, annotators were required to verify the fidelity and usability of the suggested texts. To ensure quality, each annotation was cross-validated and corrected by two other annotators, achieving one hundred percent internal consensus on the final annotations. The cross-validation covered the rationality of question-answer pairs, factual consistency, and language fluency. Many annotators reported that while GPT-4 (specifically gpt-4-0613; https://platform.openai.com/docs/models/gpt-4-turbo-and-gpt-4) was good at generating suggested questions and summaries, it struggled to produce high-quality answers, so human effort was needed to maintain annotation quality.

Tasks. For this dataset, we primarily explored two tasks using the Chain-of-Thought (CoT) method, both grounded in a given news article: 1) the model first answers the annotated questions, and then generates a summary of the article based on the article, the questions, and its own answers; 2) the model first generates a summary, and then answers the questions based on the article and its own summary. We tested these tasks on NorGPT-3B/23B and NB-GPT-J-6B, which were fine-tuned on the NO-CNN/DailyMail and NO-ConvAI2 datasets, and on GPT-3.5-Turbo. These tasks are designed around the hypothesis that document-grounded question answering (DGQA) and summarization are inherently correlated, and that synergies between these tasks may influence the model's performance on each individual task. To address potential annotator oversight in associating content with the summarization task during question answering, we instructed annotators to manually categorize the data based on whether the question-answering content includes or excludes a summary, and experiments were conducted on each subset.
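The two-step ordering of task one can be sketched as a pair of prompt templates: one to elicit answers, and a second that folds the article and the model's own Q&A back into a summarization prompt. These English templates are illustrative assumptions, not the paper's exact prompts.

```python
# Hypothetical CoT prompt templates for task one (answer first, then
# summarize using the article plus the generated Q&A).

def task_one_prompts(article, questions):
    # Step 1: ask the model to answer the annotated questions.
    qa_prompt = (f"Article: {article}\nAnswer the following questions:\n"
                 + "\n".join(f"- {q}" for q in questions))

    # Step 2: once answers are generated, build the summarization prompt.
    def summary_prompt(answers):
        qa = "\n".join(f"Q: {q}\nA: {a}" for q, a in zip(questions, answers))
        return f"Article: {article}\n{qa}\nWrite a summary of the article:"

    return qa_prompt, summary_prompt

qa_prompt, make_summary_prompt = task_one_prompts(
    "Example article text.", ["What happened?"])
```

Task two simply reverses the chain: generate the summary first, then append it to the question-answering prompt.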

Wang et al. (2023a) developed an element-aware summarization method using a CoT approach, instructing an LLM to generate four key elements (Entity, Date, Event, and Result) to be integrated into the summary, and evaluated the method on 200 annotated samples. However, we argue that human-written summaries exhibit greater diversity and flexibility beyond these four elements. In contrast to their work, our task aims to investigate potential correlations among the benchmark datasets proposed in this paper, with the goal of enhancing language model performance across various tasks.

5 Experimental Results

Due to the page limit, this section lists only key results for the benchmark datasets; more results are provided in the Appendix.

5.1 Evaluation Metrics

We aim to comprehensively evaluate our models across various tasks using widely used NLP metrics, including BLEU Papineni et al. (2002), ROUGE Lin (2004), Distinct Li et al. (2016), and MAUVE, which assesses generated and human-written text based on the differences between their probability distributions Pillutla et al. (2021). Furthermore, following the work of Xie et al. (2023), to measure faithfulness and factual consistency in multi-task learning, we utilize Entailment scores from a NorBERT model fine-tuned on the VitaminC dataset Schuster et al. (2021), which we translated into Norwegian with the Google Cloud Translation API.
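Of these metrics, Distinct-n (Li et al., 2016) is simple enough to state directly: the ratio of unique n-grams to total n-grams in the generated text, so higher values mean more diverse output. This is a minimal sketch of the common formulation; implementations differ in tokenization details.

```python
# Minimal Distinct-n: unique n-grams divided by total n-grams.

def distinct_n(tokens, n=1):
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams)

tokens = "the cat sat on the mat".split()
d1 = distinct_n(tokens, 1)  # 5 unique unigrams out of 6 -> 0.833...
d2 = distinct_n(tokens, 2)  # all 5 bigrams are unique -> 1.0
```

Repetitive generations ("the the the ...") drive Distinct-1 towards zero, which is why the metric complements overlap-based scores like BLEU and ROUGE.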

Table 2: Experimental Results on the Conversation Task.

Table 3: Experimental Results on the News Summarization Task.

5.2 Evaluation Results on NO-ConvAI2

As shown in Table 2, all models, except for GPT-3.5-Turbo, perform quite similarly. Notably, the NorGPT-3B model achieves the best results across multiple evaluation metrics, while the NorGPT-23B model only shows an advantage in BLEU scores. GPT-3.5-Turbo, although specifically curated for conversational purposes, did not exhibit the advantages expected from its extensive knowledge base. This may be because the knowledge of other languages in GPT-3.5-Turbo cannot be directly transferred to understanding Norwegian conversations, highlighting the unique linguistic properties of the Norwegian language.

5.3 Evaluation Results on NO-CNN/DailyMail

In Table 3, GPT-3.5-Turbo and NB-GPT-J-6B outperform our NorGPTs on the BLEU and ROUGE metrics. This suggests that their pre-training datasets contain a substantial number of expression patterns resembling news articles, which is plausible given that those datasets likely include a diverse range of newspapers, magazines, and government reports. This trend is also evident in common test samples, where GPT-3.5-Turbo tends to generate more formal rather than conversational language. Nevertheless, we observed that the models' performance improves after reinforcement learning, especially in replicating the word distribution of human writing and generating summaries of similar length, as supported by the highest MAUVE and BLEU scores. Although the model with reinforcement learning may not always surpass the fine-tuned model in accuracy, it actively strives to mimic human writing patterns.

5.4 Evaluation Results on NO-Alpaca-Plus

Table 13 shows the performance of our baseline models after fine-tuning on the NO-Alpaca dataset. Because this dataset was translated using GPT-3.5-Turbo, we could not use GPT-3.5-Turbo as a baseline under OpenAI's terms and policies (https://openai.com/policies/). NB-GPT-J-6B outperforms the other models on most evaluation metrics, likely due to its pre-training on a set of self-annotated Norwegian instructions, as described on its model webpage. Among our NorGLM models, NorLlama-3B achieved better BLEU and ROUGE scores than the others, but worse MAUVE and perplexity scores. This is an interesting phenomenon: NorLlama-3B's outputs hit the most n-grams, yet their token probability distribution deviates the most from the human-annotated results. A case study revealed that while NorLlama-3B generates words and phrases that overlap with the golden answer, it sometimes lacks logical coherence between sentences, and the meanings of sentences can even be mutually exclusive, as shown in Figure 2.

Meanwhile, from our 110 self-annotated instructions, we selected two typical cases generated by GPT-3.5-Turbo, related to Norwegian culture and to a special Norwegian expression, shown in Figure 3 and Figure 4 respectively. Figure 3 shows a factual inconsistency in the generated text. In Figure 4, the input prompt asks who uses a word, but the model explains the meaning of the word instead of answering the question. Even with limited annotated data, we can thus expose limitations in the model's understanding of the specific culture behind the language.

5.5 Evaluation Results on NLU tasks

Table 14 reports the results on the NLU tasks. Among the NorGLMs, the NorGPT-23B model consistently outperforms the others on the different NLU datasets across both evaluation metrics. However, NB-GPT-J-6B performs better on the NO-QNLI benchmark and achieves a higher F1-score on the NO-MRPC benchmark.

5.6 Evaluation Results on Toxicity and Bias

The average toxicity scores from six perspectives (Toxicity, Severe toxicity, Identity attack, Insult, Profanity, and Threat) are shown in Table 15. All toxicity scores range from 0 to 1, with lower values indicating less toxic generated text. Although NorLlama-3B exhibits the lowest values across all metrics, a significant portion of its generated text consists of meaningless characters or words. We randomly sampled texts generated by the GPT models with high toxicity values and traced the hazardous words back to the pre-training dataset. Surprisingly, most of these hazardous words did not originate from social media, as commonly assumed, but from daily news articles. For instance, the phrase "tok livet av" ("took the life of", i.e., killed) often appeared in news reports describing murders, as illustrated in Figure 1. These original news articles did not convey toxic information but were factual descriptions of criminal events. This discovery underscores the importance not only of filtering out toxic inputs during pre-training but also of considering which prompts may lead the model to generate toxic text.

Table 16 presents findings from stereotype and bias detection using the NO-CrowS-Pairs dataset. The dataset covers nine categories: gender, religion, race/color, sexual orientation, age, nationality, disability, physical appearance, and socioeconomic status. Each sample pairs a stereotypical sentence (sent_more) with an anti-stereotypical one (sent_less). Following the work of Touvron et al. (2023), model bias is assessed by comparing perplexity scores between these pairs and reporting the percentage of pairs for which the model assigns lower perplexity to sent_more. Higher values indicate a stronger bias towards public stereotypes. Overall, the benchmark models demonstrated robust performance across most bias categories. However, they exhibited a bias towards sent_less for religion, suggesting a relative bias in this specific category.
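The bias score just described reduces to a simple comparison per pair. In the sketch below the perplexities are placeholder numbers; in practice each value comes from scoring the sentence with the evaluated model.

```python
# Sketch of the CrowS-Pairs-style bias metric: the percentage of pairs
# where the model assigns lower perplexity to the stereotypical sentence.
# Perplexities here are placeholders, not real model outputs.

def stereotype_rate(ppl_more, ppl_less):
    """Percentage of pairs where ppl(sent_more) < ppl(sent_less)."""
    assert len(ppl_more) == len(ppl_less)
    prefers_more = sum(m < l for m, l in zip(ppl_more, ppl_less))
    return 100.0 * prefers_more / len(ppl_more)

# 50% indicates no systematic preference; higher values indicate bias
# towards the stereotype.
rate = stereotype_rate([10.0, 30.0, 12.0, 50.0], [20.0, 25.0, 18.0, 40.0])
```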

Table 4: Experimental Results on task one using NO-Multi-QA-Sum dataset for summarization task.

Table 5: Experimental Results on task two using NO-Multi-QA-Sum dataset for document-grounded question answering task.

5.7 Evaluation with CoT

In this task, all baseline models except GPT-3.5 were fine-tuned on the NO-CNN/DailyMail and NO-ConvAI2 datasets, enabling them to handle related tasks effectively. However, none of these models were fine-tuned using document-grounded question answering datasets or similar CoT tasks investigated in this study. Table 4 and Table 5 present the outcomes of the multi-task dataset under different scenarios. The tables distinguish datasets where the question answering content includes or excludes a summary, labeled as "contain" and "not contain" respectively. For both tasks, we utilized different prompt templates and reported the optimal performance in the tables. From the results, we draw several observations:

In task one, we observed that GPT-3.5 significantly improved in summarization performance with the CoT method, while other models saw a degradation in this aspect. For DGQA, NorGPT-3B and NorGPT-23B models showed improvements through CoT, whereas NB-GPT-J-6B exhibited mixed results across different datasets. Analyzing these results solely based on the tables proved challenging, as there was no clear correlation between CoT improvements and model sizes or pre-training dataset sizes. This contrasts with prior findings suggesting CoT benefits are more pronounced with larger models Wei et al. (2022). Combining results from Table 2 and Table 3, we observed models that initially performed well in their tasks showed further enhancement with CoT adaptations. For instance, GPT-3.5 excelled in summarization on the NO-CNN/DailyMail dataset after CoT, and NorGPT-3B and NorGPT-23B models improved in document-grounded question answering on the NO-ConvAI2 dataset. Figure 5 illustrates an example where CoT-generated summaries closely approximate human-written summaries compared to direct prompts for the model to generate summaries. The English translation is shown in Figure 6.

While we observe that the synergy between the two tasks enhances the model’s performance on both, we also find that incorporating a summary into a QA task improves the quality of the generated summary compared to QA tasks without one. However, the reverse scenario is not necessarily true. We speculate that QA breaks down the summarization task into smaller components, enabling the model to better comprehend the input text. This process mirrors the human learning process.

Moreover, as shown in both Table 4 and Table 5, we find that after CoT, the Entailment scores of most models increased, indicating that the answers and summaries generated by the models are more aligned with the context described in the article. Therefore, CoT has the potential to enhance the factual consistency of the generated outputs.

Table 6: Human evaluation results on the quality of machine translated datasets in NLEBench.

5.8 Human Evaluation

To evaluate the quality of the translated datasets, we conducted a human evaluation on three datasets translated by the Google API: NO-ConvAI2, NO-CNN/DailyMail, and NO-CrowS-Pairs. Given the constraints of time and cost, we randomly selected 50 samples from each of the three datasets. We recruited three native Norwegian speakers, all college students, to independently score the Adequacy and Fluency of each text. Adequacy measures whether the translated text accurately conveys the meaning of the original text, while Fluency assesses whether the expression of the translated text aligns with native Norwegian usage. Scores range from 1 to 5, with 1 representing non-compliance and 5 representing full compliance. In addition, we used the Claude 3 Opus model (https://www.anthropic.com/news/claude-3-family) to translate the same 150 samples, adhering strictly to the model settings described in Enis and Hopkins (2024). (Note that, at the time of this research, Claude 3 Opus had not yet been published.) The experimental results are shown in Table 6, and the detailed instructions given to the evaluators are shown in Figure 8.

The results show that Claude 3 Opus outperforms the Google API on both the Adequacy and Fluency indicators. Both Google Translation and Claude translation accurately convey most of the meaning of the original text and include some native, or even good native, expressions. We adopt Fleiss' kappa (κ) to measure inter-rater agreement among the three raters for each evaluation metric and dataset. We observed high consistency among evaluators in adequacy assessments, while fluency evaluations showed low consistency (https://www.ncbi.nlm.nih.gov/books/NBK92287/table/executivesummary.t2/?report=objectonly). By comparing individual scores with the types of translation errors the evaluators annotated, we found bias among the evaluators. For the same translated text, although all evaluators marked the translated expression as incorrect, some evaluators who gave higher scores believed that, despite not conforming to Norwegian expression habits, the translation still conveyed the original meaning; another evaluator believed that the incorrect word choices significantly affected the text's fluency and gave a lower score. Furthermore, based on the annotations, the most frequent translation errors in the sample dataset were misuse of words, followed by missing words, incorrect word order, and extra words.
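Fleiss' kappa, used above for inter-rater agreement, can be computed from an items-by-categories matrix of rater counts. The implementation below follows the standard formula; the two-item example is synthetic, not data from the paper.

```python
# Minimal Fleiss' kappa. `ratings` is an items x categories matrix of
# rater counts; each row sums to the number of raters.
import numpy as np

def fleiss_kappa(ratings: np.ndarray) -> float:
    n_items, _ = ratings.shape
    n_raters = ratings[0].sum()
    # Overall proportion of assignments to each category.
    p_j = ratings.sum(axis=0) / (n_items * n_raters)
    # Per-item agreement among raters.
    P_i = (np.square(ratings).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return float((P_bar - P_e) / (1 - P_e))

# Three raters, perfect agreement on two items split across two categories:
kappa = fleiss_kappa(np.array([[3, 0], [0, 3]]))  # -> 1.0
```

Kappa corrects the raw agreement P_bar by the agreement P_e expected under chance, which is why it is preferred over simple percent agreement when rating scales are imbalanced.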

6 Discussion

In this section, we present observations from the longitudinal comparison of different models on the downstream tasks detailed in Section 5: 1) While NB-GPT-J-6B did not achieve the highest scores across all tasks, it showed consistent performance and the best perplexity scores compared to our NorGLMs on nearly all tasks. This consistency is likely due to its initial training on large English datasets before continued training on Norwegian data. 2) The 23B model did not show the expected absolute advantage in downstream tasks. We find that, with a small-scale pre-training dataset, a larger model cannot demonstrate its ability to better cope with complex problems, which supports the findings of Hoffmann et al. (2022). 3) The results highlight the promising abilities of smaller language models on specific tasks. However, these models often lack consistency in generating high-quality, meaningful text. 4) A comparison between Table 3 and Table 4 reveals significant differences between summaries written by journalists and those generated by GPT-3.5 or non-professionals. However, the model's performance on the latter datasets appears to be proportional to its size. GPT-3.5's performance on NO-Multi-QA-Sum improved significantly, possibly due to the similarity of frameworks and the training-data overlap between GPT-3.5 and GPT-4. 5) GPT-3.5's difficulties with specialized Norwegian instructions highlight the unique complexities of the Norwegian language, which are challenging for English-dominated models. This emphasizes the need to focus on low-resource languages to better understand their cultural nuances. (We have released more Norwegian foundation models and datasets and will continue to update and integrate Norwegian-related resources; please follow our GitHub repository for more information.)

7 Conclusion

In this paper, we introduced a suite of Norwegian Generative Language Models and a comprehensive benchmark with seven tasks tailored for the underrepresented Norwegian language. Through extensive analysis, we uncovered insights not previously revealed by existing benchmarks. Our evaluation of the NO-Multi-QA-Sum dataset highlighted the effectiveness of multi-task datasets in assessing natural language understanding through complex tasks like Chain-of-Thought (CoT). We also noted differences between human-annotated summaries and those generated by GPT-3.5, providing valuable insights for future abstractive summarization advancements. Furthermore, our study emphasized the unique linguistic and cultural aspects of Norwegian, suggesting that mainstream benchmarks may not fully capture the performance of language models on low-resource languages. Thus, developing benchmarks specific to these languages is essential for accurate evaluation and development.

8 Limitations

Although NLEBench is currently the most comprehensive benchmark for Norwegian, its coverage of applications and downstream tasks remains limited; the benchmark is open-ended and inevitably cannot cover everything about Norwegian. Nevertheless, we believe that the published resources will significantly aid research on generative language models in low-resource scenarios. While Balahur and Turchi (2014) suggested that translation systems produce good-quality data, translation errors and misconceptions persist. Due to budget constraints and the large volume of translated samples, ensuring the quality of our translated datasets was challenging. However, the value of machine-translated datasets should not be dismissed: we use NO-ConvAI2 to fine-tune the model, endowing it with conversational capabilities, and NO-Alpaca includes general knowledge about Norway, such as "The capital of Norway is Oslo", although the coverage remains limited.

Another constraint is the scarcity of human-annotated samples in our benchmark, largely attributable to the extensive time and financial resources required for their collection. Notably, the process of amassing over 500 samples for the NO-Multi-QA-Sum dataset was time-intensive and necessitated thorough quality control measures before implementation. Moreover, acquiring sufficient Norwegian pre-training data and considering the copyright issues of data poses a formidable challenge. The current difficulty lies in obtaining a training dataset of comparable size to those available for English, severely constraining the performance of our pre-trained models. Despite our efforts to procure data from diverse sources and provide pertinent statistical insights, certain data cannot be redistributed, complicating efforts to replicate our pretraining phase. Looking ahead, we aim to mitigate the shortage of textual data through manual annotation efforts or by integrating multi-modal data, thereby fostering advancements in low-resource language model development within the broader research community.

Acknowledgements

This publication has been funded by SFI NorwAI (Centre for Research-based Innovation, 309834). The authors gratefully acknowledge the financial support from the Research Council of Norway and the partners of SFI NorwAI.

We extend our thanks to the organizers of EMNLP 2024 and the reviewers for their valuable feedback. Special thanks to the IDUN team at NTNU Själander et al. (2019) for providing essential computational resources, and to Schibsted and the National Library of Norway (Nasjonalbiblioteket) for supplying the crucial dataset for our research.

References

Appendix A NorGLM Model Parameter Settings

Table 7: The training parameter settings of NorGLMs

Appendix B The Statistics of Benchmark Datasets

Data statistics are in Table 8-C.

Appendix C Case Study on the Instruction Finetuning Task

Examples of generated responses for the instructions in the NO-Alpaca(-Plus) benchmark are shown in Figure 2-4.

Image 1: Refer to caption

Figure 2: Example of NorLlama-3B on NO-Alpaca benchmark. The texts that coincide between the generated and annotated text are highlighted in red. Translations are in the brackets.

Image 2: Refer to caption

Figure 3: Example of generated performance of GPT-3.5 on Norwegian culture instruction of NO-Alpaca-Plus. Translations are on the right.

Image 3: Refer to caption

Figure 4: Example of generated performance of GPT-3.5 on Norwegian special expression instruction of NO-Alpaca-Plus. Translations are on the right.

Table 8: Statistics on NO-Alpaca, NO-CNN/DailyMail dataset, where P denotes prompt, A denotes answer, N is news article and S is summary.

Table 9: Statistics on NO-ConvAI2 dataset.

Table 10: Statistics on NO-Multi-QA-Sum dataset.

Type: Zero-shot. #articles: 467; #dialogues: 2,755; avg. turns per dialogue: 5.90.

| Unit | #total words | #avg words per unit | #total tokens | #avg tokens per unit |
| --- | --- | --- | --- | --- |
| Articles | 203,606 | 435.99 | 276,708 | 592.52 |
| Questions | 24,767 | 8.99 | 33,967 | 12.33 |
| Answers | 43,165 | 15.67 | 58,176 | 21.12 |
| Summaries | 28,167 | 60.31 | 37,309 | 79.89 |

Appendix D Efficiency Benchmarks

In this section, we report our NorGLM pre-training specifications and the results are shown in Table 11. We estimated the energy consumption in the model training according to Eq. (1):

$$\mathrm{kWh}=\frac{\text{Hours to train}\times\text{Number of Processors}\times APP\times PUE}{1000}\qquad(1)$$

The NVIDIA A100 40G and 80G GPUs are reported to have Thermal Design Powers (TDP) of 250 W and 300 W, respectively (https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-us-nvidia-1758950-r4-web.pdf). We used these TDP values as the Average Power per Processor (APP) in our calculations. Power usage effectiveness (PUE) is a metric describing data center efficiency, calculated as the total energy use divided by the energy directly consumed by a data center's computing equipment. The average industry data center PUE in 2020 was 1.58 Patterson et al. (2021), and we used this PUE value in our calculations.
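Eq. (1) with these constants is a one-line calculation; the example figures below (100 hours, 10 GPUs) are arbitrary and chosen only to illustrate the formula.

```python
# Eq. (1): estimated training energy in kWh, with APP in watts (the TDP
# values above) and the 2020 industry-average PUE of 1.58.

def training_kwh(hours, n_processors, app_watts, pue=1.58):
    return hours * n_processors * app_watts * pue / 1000.0

# e.g. a hypothetical 100-hour run on 10 A100 80G GPUs (300 W TDP):
energy = training_kwh(100, 10, 300)  # -> 474.0 kWh
```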

It is widely acknowledged that large-scale pre-training demands a significant amount of computational resources, and larger models typically require more computational resources and energy consumption to achieve convergence given the same pre-training dataset. When training the 3B models, we note that NorLlama-3B took less time than NorGPT-3B to converge. This may be related to the different model architectures and different training platforms.

We can also see that the estimated energy consumption grows significantly with the model size (number of parameters). The number of parameters grows with a factor of 8.1 when we go from NorGPT-369M to the 3B models. However, the energy consumption grows only with a factor of 2.5 (NorGPT-3B) and 2.1 (NorLlama-3B). When we compare the 3B and 23B models, we have a growth factor of only 7.7 in parameter size, but a growth factor of 20.0 (NorGPT-3B vs. NorGPT-23B) and 24.6 (NorLlama-3B vs. NorGPT-23B) in energy consumption.

Efficiency is also measured on downstream tasks. For simplicity, we use the NO-CNN/DailyMail benchmark and report run times in Table 12 to compare fine-tuning efficiency. To ensure a fair comparison, all models were fine-tuned on the same platform on 4 A100 80G GPUs. We observe that despite having roughly the same number of parameters, NorLlama-3B is nearly 10 times slower than NorGPT-3B and even lags behind the NB-GPT-J-6B model in fine-tuning speed. However, this pattern is not common across other downstream tasks. It is worth noting that the reported values are heavily conditioned on hardware and implementation details.

The smallest model, NorGPT-369M, uses more time and energy than the larger NorGPT-3B in this downstream task. We have a growth factor of 34.2 when we compare the energy consumption of NorGPT-3B and NorGPT-23B. This is significantly larger than what we had in the pre-training phase.

Table 11: Pre-training efficiency of NorGLMs. NorGPT-369M was trained on NVIDIA A100 40G, and other models were trained on NVIDIA A100 80G GPUs.

Table 12: Experimental results on the efficiency of fine-tuning for news summarization tasks. All models were fine-tuned with initial lr (learning rate) as 9E-08 and batch size as 8. Total training epoch is set to 1.

Table 13: Experimental Results on the Instruction Finetuning Task.

Table 14: Experimental Results on the NLU Tasks.

Table 15: Experimental Results on the Toxicity of Norwegian Generative Language Models. Scores were obtained using the Perspective API, with higher scores indicating more toxic generations.

Table 16: Experimental Results on the Bias of Norwegian Generative Language Models. Scores represent the percentage of pairs for which perplexity favors sent_more.

Image 4: Refer to caption

Figure 5: Example of Task One in the NO-Multi-QA-Sum benchmark.

Image 5: Refer to caption

Figure 6: English translation of the example of Task One in the NO-Multi-QA-Sum benchmark.

Image 6: Refer to caption

Figure 7: API appearance for multi-task benchmark annotation.

Image 7: Refer to caption

Figure 8: Instructions for the human evaluation of the quality of translated datasets in NLEBench.
