---
library_name: transformers
tags: []
---

# Model D2


This model is trained with a causal language modeling objective, which constrains
how the model accesses the tokens that precede the current position in the input
sequence. Unlike masked language modeling in a sequence-to-sequence model, causal
language modeling predicts the single next token by conditioning on all previous
tokens in the sequence, so the model only has access to prior tokens and never to
future ones.
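
Concretely, the training objective is the standard next-token cross-entropy: for a token sequence of length T, the model minimizes

$$
\mathcal{L} = -\sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t}),
$$

so the prediction at each position is conditioned only on the tokens that precede it.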


## Model Details

For the experiments with a decoder-only model, we selected BLOOM as the
architecture.

### Model Description


This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.

- **Developed by:** Ronny Paul
- **Model type:** BLOOM
- **Language(s) (NLP):** Northern Sami
- **Finetuned from model:** TurkuNLP/gpt3-finnish-xl



## Uses

The model serves as a foundational model and is used for plagiarism detection. It can also support fine-tuning on a downstream task with Northern Sami data.
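
A minimal fine-tuning sketch for such a downstream task, assuming the tokenizer is available from this repository and a hypothetical downstream corpus `my_sami_corpus` with a `text` column; the dataset name, hyperparameters, and sequence length are illustrative only:

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    BloomForCausalLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("rpa020/D2")
model = BloomForCausalLM.from_pretrained("rpa020/D2")

# Hypothetical downstream corpus with a "text" column.
dataset = load_dataset("my_sami_corpus", split="train")
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="d2-finetuned",
        per_device_train_batch_size=4,
        num_train_epochs=1,
    ),
    train_dataset=dataset,
    # mlm=False gives the causal (next-token) language modeling collator.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```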


## Dataset

The model is trained on the rpa020/SALT dataset. The formatted dataset, named the SAmi LLM Token (SALT) dataset, contains around 22 million tokens and approximately 2 million sentences; on average, each sentence consists of around ten tokens. The dataset was designed to support the pretraining phase of foundational model development.
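
If the dataset is publicly available on the Hub, it can be inspected directly; the split name below is an assumption.

```python
from datasets import load_dataset

# Assumes the default configuration and a "train" split.
salt = load_dataset("rpa020/SALT", split="train")
print(salt)
```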



## How to Get Started with the Model

```python
from transformers import BloomForCausalLM

model = BloomForCausalLM.from_pretrained("rpa020/D2")
```
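
Building on the line above, a short generation sketch; the tokenizer source, prompt, and decoding settings are assumptions rather than part of this model card (if no tokenizer is published with this model, the one from the base model TurkuNLP/gpt3-finnish-xl may be needed).

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("rpa020/D2")

inputs = tokenizer("Bures,", return_tensors="pt")  # example Northern Sami prompt
output_ids = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_p=0.95)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```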

## Performance

- **CE Loss:** 4.27
- **Perplexity:** 71.6
- **Self-BLEU:** 0.32
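
As a sanity check, perplexity is the exponential of the per-token cross-entropy loss, and the two reported values are consistent:

```python
import math

print(math.exp(4.27))  # ≈ 71.5, in line with the reported perplexity of 71.6
```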