Title: NLEBench+NorGLM: A Comprehensive Empirical Analysis and Benchmark Dataset for Generative Language Models in Norwegian

URL Source: https://arxiv.org/html/2312.01314

4 Norwegian Benchmark Dataset - NLEBench

This section introduces the NLEBench tasks designed specifically for Norwegian GLMs. The datasets come from three sources: existing datasets, datasets machine-translated with the Google Translation API, and manually annotated datasets. Our native Norwegian colleagues evaluated random samples from both the Google Translation API (https://cloud.google.com/translate/docs) and a free translation API supporting Norwegian (https://pypi.org/project/translators/), and found that the former performs better, especially on confusing words and long texts. Table 3 outlines the differences and evaluation settings of these datasets. The statistics of the different datasets are shown in Table 8-C.
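Machine-translating an entire dataset usually means splitting it into batches that respect the translation API's request-size limits. The sketch below shows one way to do this; `batch_texts` is an illustrative helper of our own, not code from the paper, and the commented-out client call assumes the `google-cloud-translate` library and valid credentials.

```python
# Sketch: batching texts before sending them to a translation API.
# `batch_texts` is a hypothetical helper, not part of the paper's code.

def batch_texts(texts, max_chars=5000):
    """Group texts into batches whose combined length stays under max_chars."""
    batches, current, size = [], [], 0
    for text in texts:
        if current and size + len(text) > max_chars:
            batches.append(current)   # flush the full batch
            current, size = [], 0
        current.append(text)
        size += len(text)
    if current:
        batches.append(current)
    return batches

# Illustrative API call (requires google-cloud-translate and credentials):
# from google.cloud import translate_v2 as translate
# client = translate.Client()
# for batch in batch_texts(corpus):
#     results = client.translate(batch, source_language="en", target_language="no")
```

A single text longer than `max_chars` still forms its own batch; in practice such texts would be split at sentence boundaries first.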

4.1 Open-domain conversation

NO-ConvAI2 is machine-translated from the English ConvAI2 dataset Dinan et al. (2020), which itself is a refined version of the PersonaChat corpus Zhang et al. (2018). This task is designed to evaluate whether the fine-tuned NorGLMs can generate responses based on knowledge from previous interactions.

4.2 News summarization

In this task, we assess the abstractive summarization capabilities of NorGLMs using our NO-CNN/DailyMail dataset, which is machine-translated from CNN/DailyMail, an English dataset that includes journalists' annotated summaries. We employ fine-tuning and the Reinforcement Learning from Human Feedback (RLHF) strategy on NorGLMs. In step 2 of RLHF, we train the reward model by estimating the semantic similarity between the candidate generated text and the human-annotated summary (the golden summary) using the NorBERT model Kutuzov et al. (2021). Summaries with higher cosine similarity to the golden summary are prioritized during reward model training.
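The reward signal described above can be sketched as a cosine similarity between the candidate and golden summaries in embedding space. The vectors below are placeholders; in the paper's setup they would come from NorBERT representations.

```python
# Sketch of a similarity-based reward for RLHF step 2.
# Embeddings are placeholder vectors; the paper uses NorBERT.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def reward(candidate_emb: np.ndarray, golden_emb: np.ndarray) -> float:
    # Higher similarity to the golden summary -> higher reward, so the
    # reward model learns to prefer candidates closer to human summaries.
    return cosine_similarity(candidate_emb, golden_emb)
```

A candidate whose embedding matches the golden summary exactly receives the maximum reward of 1.0.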

4.3 Instructions

This task utilizes datasets from two sources: NO-Alpaca (https://huggingface.co/NbAiLab/nb-gpt-j-6B-alpaca), translated from the Stanford Alpaca dataset Wang et al. (2023b) into Norwegian using OpenAI’s GPT-3.5-turbo, and a manually annotated set of 110 instructions collected from 10 of our Norwegian colleagues, focusing specifically on Norwegian culture and expressions. The combined dataset is named NO-Alpaca-Plus.

4.4 Natural Language Understanding (NLU)

This task aims to analyze the natural language understanding capabilities of our NorGLMs. We extracted the Norwegian portion of the OverLim dataset (https://huggingface.co/datasets/KBLab/overlim) and selected three tasks commonly used to evaluate English generative language models: BoolQ, MRPC, and QNLI. Notably, OverLim is translated from the GLUE (https://huggingface.co/datasets/glue) and SuperGLUE (https://super.gluebenchmark.com/) benchmarks. To distinguish them from the original English versions, we use the prefix "NO-" for the versions used in this paper. The data split follows the original protocol.
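Evaluating a generative model on a classification task like BoolQ requires casting each example as a prompt and mapping the free-form output back to a label. The template and label words below are our own assumptions for illustration, not the paper's exact prompts.

```python
# Hypothetical prompt template for NO-BoolQ with a decoder-only model.
# The wording ("Tekst", "Spørsmål", "ja/nei") is illustrative only.

def format_boolq(passage: str, question: str) -> str:
    return f"Tekst: {passage}\nSpørsmål: {question}\nSvar (ja/nei):"

def parse_label(generated: str) -> int:
    # Map the model's free-form output back to a binary label.
    text = generated.strip().lower()
    return 1 if text.startswith("ja") else 0

prompt = format_boolq("Oslo er hovedstaden i Norge.",
                      "Er Oslo hovedstaden i Norge?")
```

The generated text is then compared against the gold label, so the NLU task is scored without adding a classification head to the model.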

4.5 Toxicity and bias

Generative language models are notorious for amplifying biases inherent in the training data Sheng et al. (2019) and for producing toxic text Gehman et al. (2020). To evaluate these issues in NorGLMs, we used the Perspective API (https://perspectiveapi.com/) on 1,508 prompts for toxicity evaluation and calculated perplexity (ppl) on 1,677 sample pairs from the NO-CrowS-Pairs benchmark for bias evaluation; NO-CrowS-Pairs is a machine-translated version of the French CrowS-Pairs Névéol et al. (2022). Because the API does not support Norwegian, we translated the NorGLM-generated text into Swedish for assessment.
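A Perspective API response nests each attribute's score under `attributeScores.<ATTRIBUTE>.summaryScore.value`. The helper below extracts the six attributes used above from such a response; the HTTP call itself (which needs an API key) is omitted, and the mock response stands in for real API output.

```python
# Sketch: collecting per-attribute toxicity scores from a Perspective API
# response dict. The JSON shape follows the API's documented format; the
# actual request is omitted here.

ATTRIBUTES = ["TOXICITY", "SEVERE_TOXICITY", "IDENTITY_ATTACK",
              "INSULT", "PROFANITY", "THREAT"]

def extract_scores(response: dict) -> dict:
    """Pull the summary score for each requested attribute, if present."""
    found = response.get("attributeScores", {})
    return {attr: found[attr]["summaryScore"]["value"]
            for attr in ATTRIBUTES if attr in found}

# Example with a mock response (all attributes scored 0.1):
mock = {"attributeScores": {a: {"summaryScore": {"value": 0.1}}
                            for a in ATTRIBUTES}}
scores = extract_scores(mock)
```

Averaging these per-sample scores over all prompts gives the per-attribute figures reported in Table 15.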

4.6 Multi-task learning

Apart from the benchmarks and translated datasets mentioned above, we release a multi-task dataset called NO-Multi-QA-Sum. This section details the dataset collection process and the tasks performed using this benchmark.

Data Collection. We recruited three Norwegian college students as annotators, allowing them to work in pairs or independently. Each student was compensated 230 NOK (approx. $21.75 USD) per hour. Annotators were tasked with conducting a conversation about a given news article, using content from the article, with no limit on the number of dialogue turns or question types. After the conversation, they were required to write a generic summary of the article. The dialogue and summary content did not need to fully overlap, giving annotators some freedom in their dialogue choices. Most annotators chose self-dialogue and summarization for efficiency and flexibility. This design aims to evaluate the model's reading comprehension ability: we instructed annotators to consider question diversity, including both simple questions (where the answer comes from a single source) and complex questions (where the answer is derived from different parts of the article). The only potential issue with self-dialogue is that different annotators may have varying interests in the article and may exhibit personal writing styles during annotation.

To facilitate the annotation process, we developed an API, shown in Figure 7, that connects to the OpenAI GPT-4 model to suggest annotations. However, annotators were required to verify the fidelity and usability of the suggested texts. To ensure quality, each annotation was cross-validated and corrected by two other annotators, achieving one hundred percent internal consensus on the final annotations. The cross-validation covered the rationality of question-answer pairs, factual consistency, and language fluency. Many annotators reported that while GPT-4 (specifically gpt-4-0613; https://platform.openai.com/docs/models/gpt-4-turbo-and-gpt-4) was good at generating suggested questions and summaries, it struggled to produce high-quality answers, so human effort was needed to maintain annotation quality.

Tasks. For this dataset, we primarily explored two tasks using the Chain-of-Thought (CoT) method, both grounded in a given news article: 1) the model first answers the annotated questions, and then generates a summary of the article based on the article, the questions, and its own answers; 2) the model first generates a summary, and then answers the questions based on the article and its own summary. We tested these tasks on NorGPT-3B/23B and NB-GPT-J-6B, which were fine-tuned on the NO-CNN/DailyMail and NO-ConvAI2 datasets, and on GPT-3.5-Turbo. These tasks are designed around the hypothesis that document-grounded question answering (DGQA) and summarization are inherently correlated, and that synergies between these tasks may influence the model's performance on each individual task. To address potential annotator oversight in associating content with the summarization task during question answering, we instructed annotators to manually categorize the data based on whether the question-answering content includes or excludes a summary, and experiments were conducted on each subset.
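The two-step ordering of task one can be sketched as a pair of prompt templates: one to elicit answers, and a second that folds the article and the model's own Q&A back into a summarization prompt. These English templates are illustrative assumptions, not the paper's exact prompts.

```python
# Hypothetical CoT prompt templates for task one (answer first, then
# summarize using the article plus the generated Q&A).

def task_one_prompts(article, questions):
    # Step 1: ask the model to answer the annotated questions.
    qa_prompt = (f"Article: {article}\nAnswer the following questions:\n"
                 + "\n".join(f"- {q}" for q in questions))

    # Step 2: once answers are generated, build the summarization prompt.
    def summary_prompt(answers):
        qa = "\n".join(f"Q: {q}\nA: {a}" for q, a in zip(questions, answers))
        return f"Article: {article}\n{qa}\nWrite a summary of the article:"

    return qa_prompt, summary_prompt

qa_prompt, make_summary_prompt = task_one_prompts(
    "Example article text.", ["What happened?"])
```

Task two simply reverses the chain: generate the summary first, then append it to the question-answering prompt.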

Wang et al. (2023a) developed an element-aware summarization method using a CoT approach, instructing an LLM to generate four key elements (Entity, Date, Event, and Result) to be integrated into the summary, and evaluated the method on 200 annotated samples. However, we argue that human-written summaries exhibit greater diversity and flexibility beyond these four elements. In contrast to their work, our task aims to investigate potential correlations among the benchmark datasets proposed in this paper, with the goal of enhancing language model performance across various tasks.

5 Experimental Results

Due to the page limit, this section lists only key results for the benchmark datasets; more results are provided in the Appendix.

5.1 Evaluation Metrics

We aim to comprehensively evaluate our models across various tasks using widely used NLP metrics, including BLEU Papineni et al. (2002), ROUGE Lin (2004), Distinct Li et al. (2016), and MAUVE, which assesses generated and human-written text based on the differences between their probability distributions Pillutla et al. (2021). Furthermore, following the work of Xie et al. (2023), to measure faithfulness and factual consistency in multi-task learning, we utilize Entailment scores from a NorBERT model fine-tuned on the VitaminC dataset Schuster et al. (2021), which we translated into Norwegian with the Google Cloud Translation API.
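Of these metrics, Distinct-n (Li et al., 2016) is simple enough to state directly: the ratio of unique n-grams to total n-grams in the generated text, so higher values mean more diverse output. This is a minimal sketch of the common formulation; implementations differ in tokenization details.

```python
# Minimal Distinct-n: unique n-grams divided by total n-grams.

def distinct_n(tokens, n=1):
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams)

tokens = "the cat sat on the mat".split()
d1 = distinct_n(tokens, 1)  # 5 unique unigrams out of 6 -> 0.833...
d2 = distinct_n(tokens, 2)  # all 5 bigrams are unique -> 1.0
```

Repetitive generations ("the the the ...") drive Distinct-1 towards zero, which is why the metric complements overlap-based scores like BLEU and ROUGE.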

Table 2: Experimental Results on the Conversation Task.

Table 3: Experimental Results on the News Summarization Task.

5.2 Evaluation Results on NO-ConvAI2

As shown in Table 2, all models, except for GPT-3.5-Turbo, perform quite similarly. Notably, the NorGPT-3B model achieves the best results across multiple evaluation metrics, while the NorGPT-23B model only shows an advantage in BLEU scores. GPT-3.5-Turbo, although specifically curated for conversational purposes, did not exhibit the advantages expected from its extensive knowledge base. This may be because the knowledge of other languages in GPT-3.5-Turbo cannot be directly transferred to understanding Norwegian conversations, highlighting the unique linguistic properties of the Norwegian language.

5.3 Evaluation Results on NO-CNN/DailyMail

In Table 3, GPT-3.5-Turbo and NB-GPT-J-6B outperform our NorGPTs on the BLEU and ROUGE metrics. This suggests that their pre-training datasets contain a substantial number of expression patterns resembling news articles, which is plausible given that those datasets likely include a diverse range of newspapers, magazines, and government reports. This trend is also evident in common test samples, where GPT-3.5-Turbo tends to generate more formal rather than conversational language. Nevertheless, we observed that the models' performance improves after reinforcement learning, especially in replicating the word distribution of human writing and generating summaries of similar length, as supported by the highest MAUVE and BLEU scores. Although the model with reinforcement learning may not always surpass the fine-tuned model in accuracy, it actively strives to mimic human writing patterns.

5.4 Evaluation Results on NO-Alpaca-Plus

Table 13 shows the performance of our baseline models after fine-tuning on the NO-Alpaca dataset. Because this dataset was translated using GPT-3.5-Turbo, we could not use GPT-3.5-Turbo as a baseline under OpenAI's terms and policies (https://openai.com/policies/). NB-GPT-J-6B outperforms the other models on most evaluation metrics, likely due to its pre-training on a set of self-annotated Norwegian instructions, as described on its model webpage. Among our NorGLM models, NorLlama-3B achieved better BLEU and ROUGE scores than the others, but worse MAUVE and perplexity scores. This is an interesting phenomenon: NorLlama-3B's outputs hit the most n-grams, yet their token probability distribution deviates the most from the human-annotated results. A case study revealed that while NorLlama-3B generates words and phrases that overlap with the golden answer, it sometimes lacks logical coherence between sentences, and the meanings of sentences can even be mutually exclusive, as shown in Figure 2.

Meanwhile, from our 110 self-annotated instructions, we selected two typical cases generated by GPT-3.5-Turbo, related to Norwegian culture and to a special Norwegian expression, shown in Figure 3 and Figure 4 respectively. Figure 3 shows a factual inconsistency in the generated text. In Figure 4, the input prompt asks who uses a word, but the model explains the meaning of the word instead of answering the question. Even with limited annotated data, we can thus expose limitations in the model's understanding of the specific culture behind the language.

5.5 Evaluation Results on NLU tasks

Table 14 reports the results on the NLU tasks. Among the NorGLMs, the NorGPT-23B model consistently outperforms the others on the different NLU datasets across both evaluation metrics. However, NB-GPT-J-6B performs better on the NO-QNLI benchmark and achieves a higher F1-score on the NO-MRPC benchmark.

5.6 Evaluation Results on Toxicity and Bias

The average toxicity scores from six perspectives (Toxicity, Severe toxicity, Identity attack, Insult, Profanity, and Threat) are shown in Table 15. All toxicity scores range from 0 to 1, with lower values indicating less toxic generated text. Although NorLlama-3B exhibits the lowest values across all metrics, a significant portion of its generated text consists of meaningless characters or words. We randomly sampled texts generated by the GPT models with high toxicity values and traced the hazardous words back to the pre-training dataset. Surprisingly, most of these hazardous words did not originate from social media, as commonly assumed, but from daily news articles. For instance, the phrase "tok livet av" ("took the life of", i.e., killed) often appeared in news reports describing murders, as illustrated in Figure 1. These original news articles did not convey toxic information but were factual descriptions of criminal events. This discovery underscores the importance not only of filtering out toxic inputs during pre-training but also of considering which prompts may lead the model to generate toxic text.

Table 16 presents findings from stereotype and bias detection using the NO-CrowS-Pairs dataset. The dataset covers nine categories: gender, religion, race/color, sexual orientation, age, nationality, disability, physical appearance, and socioeconomic status. Each sample pairs a stereotypical sentence (sent_more) with an anti-stereotypical one (sent_less). Following the work of Touvron et al. (2023), model bias is assessed by comparing perplexity scores between these pairs and reporting the percentage of pairs for which the model assigns lower perplexity to sent_more. Higher values indicate a stronger bias towards public stereotypes. Overall, the benchmark models demonstrated robust performance across most bias categories. However, they exhibited a bias towards sent_less for religion, suggesting a relative bias in this specific category.
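The bias score just described reduces to a simple comparison per pair. In the sketch below the perplexities are placeholder numbers; in practice each value comes from scoring the sentence with the evaluated model.

```python
# Sketch of the CrowS-Pairs-style bias metric: the percentage of pairs
# where the model assigns lower perplexity to the stereotypical sentence.
# Perplexities here are placeholders, not real model outputs.

def stereotype_rate(ppl_more, ppl_less):
    """Percentage of pairs where ppl(sent_more) < ppl(sent_less)."""
    assert len(ppl_more) == len(ppl_less)
    prefers_more = sum(m < l for m, l in zip(ppl_more, ppl_less))
    return 100.0 * prefers_more / len(ppl_more)

# 50% indicates no systematic preference; higher values indicate bias
# towards the stereotype.
rate = stereotype_rate([10.0, 30.0, 12.0, 50.0], [20.0, 25.0, 18.0, 40.0])
```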

Table 4: Experimental Results on task one using NO-Multi-QA-Sum dataset for summarization task.

Table 5: Experimental Results on task two using NO-Multi-QA-Sum dataset for document-grounded question answering task.

5.7 Evaluation with CoT

In this task, all baseline models except GPT-3.5 were fine-tuned on the NO-CNN/DailyMail and NO-ConvAI2 datasets, enabling them to handle related tasks effectively. However, none of these models were fine-tuned using document-grounded question answering datasets or similar CoT tasks investigated in this study. Table 4 and Table 5 present the outcomes of the multi-task dataset under different scenarios. The tables distinguish datasets where the question answering content includes or excludes a summary, labeled as "contain" and "not contain" respectively. For both tasks, we utilized different prompt templates and reported the optimal performance in the tables. From the results, we draw several observations:

In task one, we observed that GPT-3.5 significantly improved in summarization performance with the CoT method, while other models saw a degradation in this aspect. For DGQA, NorGPT-3B and NorGPT-23B models showed improvements through CoT, whereas NB-GPT-J-6B exhibited mixed results across different datasets. Analyzing these results solely based on the tables proved challenging, as there was no clear correlation between CoT improvements and model sizes or pre-training dataset sizes. This contrasts with prior findings suggesting CoT benefits are more pronounced with larger models Wei et al. (2022). Combining results from Table 2 and Table 3, we observed models that initially performed well in their tasks showed further enhancement with CoT adaptations. For instance, GPT-3.5 excelled in summarization on the NO-CNN/DailyMail dataset after CoT, and NorGPT-3B and NorGPT-23B models improved in document-grounded question answering on the NO-ConvAI2 dataset. Figure 5 illustrates an example where CoT-generated summaries closely approximate human-written summaries compared to direct prompts for the model to generate summaries. The English translation is shown in Figure 6.

While we observe that the synergy between the two tasks enhances the model’s performance on both, we also find that incorporating a summary into a QA task improves the quality of the generated summary compared to QA tasks without one. However, the reverse scenario is not necessarily true. We speculate that QA breaks down the summarization task into smaller components, enabling the model to better comprehend the input text. This process mirrors the human learning process.

Moreover, as shown in both Table 4 and Table 5, we find that after CoT, the Entailment scores of most models increased, indicating that the answers and summaries generated by the models are more aligned with the context described in the article. Therefore, CoT has the potential to enhance the factual consistency of the generated outputs.

Table 6: Human evaluation results on the quality of machine translated datasets in NLEBench.

5.8 Human Evaluation

To evaluate the quality of the translated datasets, we conducted a human evaluation on three datasets translated by the Google API: NO-ConvAI2, NO-CNN/DailyMail, and NO-CrowS-Pairs. Given the constraints of time and cost, we randomly selected 50 samples from each of the three datasets. We recruited three native Norwegian speakers, all college students, to independently score the Adequacy and Fluency of each text. Adequacy measures whether the translated text accurately conveys the meaning of the original text, while Fluency assesses whether the expression of the translated text aligns with native Norwegian usage. Scores range from 1 to 5, with 1 representing non-compliance and 5 representing full compliance. In addition, we used the Claude 3 Opus model (https://www.anthropic.com/news/claude-3-family) to translate the same 150 samples, adhering strictly to the model settings described in Enis and Hopkins (2024). (Note that, at the time of this research, Claude 3 Opus had not yet been published.) The experimental results are shown in Table 6, and the detailed instructions given to the evaluators are shown in Figure 8.

The results show that Claude 3 Opus outperforms the Google API on both the Adequacy and Fluency indicators. Both Google Translation and Claude translation accurately convey most of the meaning of the original text and include some native, or even good native, expressions. We adopt Fleiss' kappa (κ) to measure inter-rater agreement among the three raters for each evaluation metric and dataset. We observed high consistency among evaluators in adequacy assessments, while fluency evaluations showed low consistency (https://www.ncbi.nlm.nih.gov/books/NBK92287/table/executivesummary.t2/?report=objectonly). By comparing individual scores with the types of translation errors the evaluators annotated, we found bias among the evaluators. For the same translated text, although all evaluators marked the translated expression as incorrect, some evaluators who gave higher scores believed that, despite not conforming to Norwegian expression habits, the translation still conveyed the original meaning; another evaluator believed that the incorrect word choices significantly affected the text's fluency and gave a lower score. Furthermore, based on the annotations, the most frequent translation errors in the sample dataset were misuse of words, followed by missing words, incorrect word order, and extra words.
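Fleiss' kappa, used above for inter-rater agreement, can be computed from an items-by-categories matrix of rater counts. The implementation below follows the standard formula; the two-item example is synthetic, not data from the paper.

```python
# Minimal Fleiss' kappa. `ratings` is an items x categories matrix of
# rater counts; each row sums to the number of raters.
import numpy as np

def fleiss_kappa(ratings: np.ndarray) -> float:
    n_items, _ = ratings.shape
    n_raters = ratings[0].sum()
    # Overall proportion of assignments to each category.
    p_j = ratings.sum(axis=0) / (n_items * n_raters)
    # Per-item agreement among raters.
    P_i = (np.square(ratings).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return float((P_bar - P_e) / (1 - P_e))

# Three raters, perfect agreement on two items split across two categories:
kappa = fleiss_kappa(np.array([[3, 0], [0, 3]]))  # -> 1.0
```

Kappa corrects the raw agreement P_bar by the agreement P_e expected under chance, which is why it is preferred over simple percent agreement when rating scales are imbalanced.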

6 Discussion

In this section, we present observations from the longitudinal comparison of different models on the downstream tasks detailed in Section 5: 1) While NB-GPT-J-6B did not achieve the highest scores across all tasks, it showed consistent performance and the best perplexity scores compared to our NorGLMs on nearly all tasks. This consistency is likely due to its initial training on large English datasets before continued training on Norwegian data. 2) The 23B model did not show the expected absolute advantage in downstream tasks. We find that, with a small-scale pre-training dataset, a larger model cannot demonstrate its ability to better cope with complex problems, which supports the findings of Hoffmann et al. (2022). 3) The results highlight the promising abilities of smaller language models on specific tasks. However, these models often lack consistency in generating high-quality, meaningful text. 4) A comparison between Table 3 and Table 4 reveals significant differences between summaries written by journalists and those generated by GPT-3.5 or non-professionals. However, the model's performance on the latter datasets appears to be proportional to its size. GPT-3.5's performance on NO-Multi-QA-Sum improved significantly, possibly due to the similarity of frameworks and the training-data overlap between GPT-3.5 and GPT-4. 5) GPT-3.5's difficulties with specialized Norwegian instructions highlight the unique complexities of the Norwegian language, which are challenging for English-dominated models. This emphasizes the need to focus on low-resource languages to better understand their cultural nuances. (We have released more Norwegian foundation models and datasets and will continue to update and integrate Norwegian-related resources; please follow our GitHub repository for more information.)

7 Conclusion

In this paper, we introduced a suite of Norwegian Generative Language Models and a comprehensive benchmark with seven tasks tailored for the underrepresented Norwegian language. Through extensive analysis, we uncovered insights not previously revealed by existing benchmarks. Our evaluation of the NO-Multi-QA-Sum dataset highlighted the effectiveness of multi-task datasets in assessing natural language understanding through complex tasks like Chain-of-Thought (CoT). We also noted differences between human-annotated summaries and those generated by GPT-3.5, providing valuable insights for future abstractive summarization advancements. Furthermore, our study emphasized the unique linguistic and cultural aspects of Norwegian, suggesting that mainstream benchmarks may not fully capture the performance of language models on low-resource languages. Thus, developing benchmarks specific to these languages is essential for accurate evaluation and development.

8 Limitations

Although NLEBench is currently the most comprehensive benchmark for Norwegian, its coverage of applications and downstream tasks remains limited; the benchmark is open-ended and inevitably cannot cover everything about Norwegian. Nevertheless, we believe that the published resources will significantly aid research on generative language models in low-resource scenarios. While Balahur and Turchi (2014) suggested that translation systems produce good-quality data, translation errors and misconceptions persist. Due to budget constraints and the large volume of translated samples, ensuring the quality of our translated datasets was challenging. However, the value of machine-translated datasets should not be dismissed: we use NO-ConvAI2 to fine-tune the model, endowing it with conversational capabilities, and NO-Alpaca includes general knowledge about Norway, such as "The capital of Norway is Oslo", although the coverage remains limited.

Another constraint is the scarcity of human-annotated samples in our benchmark, largely attributable to the extensive time and financial resources required for their collection. Notably, the process of amassing over 500 samples for the NO-Multi-QA-Sum dataset was time-intensive and necessitated thorough quality control measures before implementation. Moreover, acquiring sufficient Norwegian pre-training data and considering the copyright issues of data poses a formidable challenge. The current difficulty lies in obtaining a training dataset of comparable size to those available for English, severely constraining the performance of our pre-trained models. Despite our efforts to procure data from diverse sources and provide pertinent statistical insights, certain data cannot be redistributed, complicating efforts to replicate our pretraining phase. Looking ahead, we aim to mitigate the shortage of textual data through manual annotation efforts or by integrating multi-modal data, thereby fostering advancements in low-resource language model development within the broader research community.

Acknowledgements

This publication has been funded by SFI NorwAI (Centre for Research-based Innovation, 309834). The authors gratefully acknowledge the financial support from the Research Council of Norway and the partners of SFI NorwAI.

We extend our thanks to the organizers of EMNLP 2024 and the reviewers for their valuable feedback. Special thanks to the IDUN team at NTNU Själander et al. (2019) for providing essential computational resources, and to Schibsted and the National Library of Norway (Nasjonalbiblioteket) for supplying the crucial dataset for our research.

References

Appendix A NorGLM Model Parameter Settings

Table 7: The training parameter settings of NorGLMs

Appendix B The Statistics of Benchmark Datasets

Data statistics are in Table 8-C.

Appendix C Case Study on the Instruction Finetuning Task

Examples of generated responses for the instructions in the NO-Alpaca(-Plus) benchmark are shown in Figure 2-4.

Image 1: Refer to caption

Figure 2: Example of NorLlama-3B on NO-Alpaca benchmark. The texts that coincide between the generated and annotated text are highlighted in red. Translations are in the brackets.

Image 2: Refer to caption

Figure 3: Example of generated performance of GPT-3.5 on Norwegian culture instruction of NO-Alpaca-Plus. Translations are on the right.

Image 3: Refer to caption

Figure 4: Example of generated performance of GPT-3.5 on Norwegian special expression instruction of NO-Alpaca-Plus. Translations are on the right.

Table 8: Statistics on NO-Alpaca, NO-CNN/DailyMail dataset, where P denotes prompt, A denotes answer, N is news article and S is summary.

Table 9: Statistics on NO-ConvAI2 dataset.

Table 10: Statistics on NO-Multi-QA-Sum dataset.

Type: Zero-shot. #articles: 467; #dialogues: 2,755; avg. turns per dialogue: 5.90.

| Unit | #total words | #avg words per unit | #total tokens | #avg tokens per unit |
| --- | --- | --- | --- | --- |
| Articles | 203,606 | 435.99 | 276,708 | 592.52 |
| Questions | 24,767 | 8.99 | 33,967 | 12.33 |
| Answers | 43,165 | 15.67 | 58,176 | 21.12 |
| Summaries | 28,167 | 60.31 | 37,309 | 79.89 |

Appendix D Efficiency Benchmarks

In this section, we report our NorGLM pre-training specifications and the results are shown in Table 11. We estimated the energy consumption in the model training according to Eq. (1):

$$\mathrm{kWh}=\frac{\text{Hours to train}\times\text{Number of Processors}\times APP\times PUE}{1000}\qquad(1)$$

The NVIDIA A100 40G and 80G GPUs are reported to have Thermal Design Powers (TDP) of 250 W and 300 W, respectively (https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-us-nvidia-1758950-r4-web.pdf). We used these TDP values as the Average Power per Processor (APP) in our calculations. Power usage effectiveness (PUE) is a metric describing data center efficiency, calculated as the total energy use divided by the energy directly consumed by a data center's computing equipment. The average industry data center PUE in 2020 was 1.58 Patterson et al. (2021), and we used this PUE value in our calculations.
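Eq. (1) with these constants is a one-line calculation; the example figures below (100 hours, 10 GPUs) are arbitrary and chosen only to illustrate the formula.

```python
# Eq. (1): estimated training energy in kWh, with APP in watts (the TDP
# values above) and the 2020 industry-average PUE of 1.58.

def training_kwh(hours, n_processors, app_watts, pue=1.58):
    return hours * n_processors * app_watts * pue / 1000.0

# e.g. a hypothetical 100-hour run on 10 A100 80G GPUs (300 W TDP):
energy = training_kwh(100, 10, 300)  # -> 474.0 kWh
```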

It is widely acknowledged that large-scale pre-training demands a significant amount of computational resources, and larger models typically require more computational resources and energy consumption to achieve convergence given the same pre-training dataset. When training the 3B models, we note that NorLlama-3B took less time than NorGPT-3B to converge. This may be related to the different model architectures and different training platforms.

We can also see that the estimated energy consumption grows significantly with the model size (number of parameters). The number of parameters grows with a factor of 8.1 when we go from NorGPT-369M to the 3B models. However, the energy consumption grows only with a factor of 2.5 (NorGPT-3B) and 2.1 (NorLlama-3B). When we compare the 3B and 23B models, we have a growth factor of only 7.7 in parameter size, but a growth factor of 20.0 (NorGPT-3B vs. NorGPT-23B) and 24.6 (NorLlama-3B vs. NorGPT-23B) in energy consumption.

Efficiency is also measured on downstream tasks. For simplicity, we use the NO-CNN/DailyMail benchmark and report run times in Table 12 to compare fine-tuning efficiency. To ensure a fair comparison, all models were fine-tuned on the same platform on 4 A100 80G GPUs. We observe that despite having roughly the same number of parameters, NorLlama-3B is nearly 10 times slower than NorGPT-3B and even lags behind the NB-GPT-J-6B model in fine-tuning speed. However, this pattern is not common across other downstream tasks. It is worth noting that the reported values are heavily conditioned on hardware and implementation details.

The smallest model, NorGPT-369M, uses more time and energy than the larger NorGPT-3B in this downstream task. We have a growth factor of 34.2 when we compare the energy consumption of NorGPT-3B and NorGPT-23B. This is significantly larger than what we had in the pre-training phase.

Table 11: Pre-training efficiency of NorGLMs. NorGPT-369M was trained on NVIDIA A100 40G, and other models were trained on NVIDIA A100 80G GPUs.

Table 12: Experimental results on the efficiency of fine-tuning for news summarization tasks. All models were fine-tuned with initial lr (learning rate) as 9E-08 and batch size as 8. Total training epoch is set to 1.

Table 13: Experimental Results on the Instruction Finetuning Task.

Table 14: Experimental Results on the NLU Tasks.

Table 15: Experimental Results on the Toxicity of Norwegian Generative Language Models. Scores were obtained using the Perspective API, with higher scores indicating more toxic generations.

Table 16: Experimental Results on the Bias of Norwegian Generative Language Models. Scores represent the percentage of pairs for which perplexity favors sent_more.

Image 4: Refer to caption

Figure 5: Example of Task One in the NO-Multi-QA-Sum benchmark.

Image 5: Refer to caption

Figure 6: English translation of the example of Task One in the NO-Multi-QA-Sum benchmark.

Image 6: Refer to caption

Figure 7: API appearance for multi-task benchmark annotation.

Image 7: Refer to caption

Figure 8: Instructions for the human evaluation of the quality of translated datasets in NLEBench.
