|
|
--- |
|
|
library_name: transformers |
|
|
tags: [] |
|
|
--- |
|
|
|
|
|
# Model D1 |
|
|
|
|
|
|
|
|
This model uses a causal language modeling objective during training. This objective constrains how the model accesses and processes the tokens that precede the current position in the input sequence. Unlike masked language modeling in a sequence-to-sequence model, causal language modeling focuses on predicting the single next token: it conditions on all previous tokens in the sequence, ensuring that the model only has access to prior tokens and never to future ones.
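
A minimal sketch of this constraint (illustrative only, not taken from this repository's training code): the lower-triangular attention mask blocks future positions, and the training labels are simply the inputs shifted by one position.

```python
import torch

# Lower-triangular mask: position t may attend only to positions <= t.
seq_len = 5
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal_mask)

# Training objective: the logits at position t are scored against the
# token at position t + 1, so the inputs act as their own shifted labels.
token_ids = torch.tensor([12, 7, 99, 3, 41])
inputs, labels = token_ids[:-1], token_ids[1:]
```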
|
|
|
|
|
|
|
|
## Model Details |
|
|
|
|
|
For our experiments with a decoder-only model, we selected BLOOM as the architecture.
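
The card does not state the exact model size. As a purely hypothetical illustration of instantiating a fresh BLOOM decoder for pretraining (all hyperparameter values below are placeholders, not the configuration of rpa020/D1):

```python
from transformers import BloomConfig, BloomForCausalLM

# Hypothetical configuration; the actual layer count, hidden size, and
# vocabulary size of rpa020/D1 are not stated in this card.
config = BloomConfig(
    vocab_size=50000,
    hidden_size=512,
    n_layer=8,
    n_head=8,
)
model = BloomForCausalLM(config)  # randomly initialized, ready for pretraining
```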
|
|
|
|
|
### Model Description |
|
|
|
|
|
|
|
|
|
|
This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.
|
|
|
|
|
- **Developed by:** Ronny Paul |
|
|
- **Model type:** BLOOM |
|
|
- **Language(s) (NLP):** Northern Sami |
|
|
|
|
|
|
|
|
|
|
|
## Uses |
|
|
|
|
|
This model was used in an experiment to determine which architecture is favourable in a low-resource setting for Northern Sami.
|
|
|
|
|
|
|
|
## Dataset |
|
|
|
|
|
The model was trained on the rpa020/SALT dataset. The SAmi LLM Token (SALT) dataset contains around 22 million tokens across approximately 2 million sentences, i.e. around ten tokens per sentence on average. It was designed to support the pretraining phase of foundational model development.
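
A minimal loading sketch, assuming rpa020/SALT is hosted on the Hub in a format that 🤗 Datasets can read directly:

```python
from datasets import load_dataset

# Assumes a default "train" split; adjust if the dataset defines other splits.
salt = load_dataset("rpa020/SALT", split="train")
print(salt)
```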
|
|
|
|
|
|
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
|
|
```python
from transformers import BloomForCausalLM

model = BloomForCausalLM.from_pretrained("rpa020/D1")
```
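
For text generation, a minimal sketch assuming the repository also hosts a compatible tokenizer (the Northern Sami prompt below is only an example):

```python
from transformers import AutoTokenizer, BloomForCausalLM

tokenizer = AutoTokenizer.from_pretrained("rpa020/D1")
model = BloomForCausalLM.from_pretrained("rpa020/D1")

# Encode a prompt and sample a short continuation.
inputs = tokenizer("Sámegiella lea", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```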
|
|
|
|
|
## Performance |
|
|
|
|
|
- **CE loss:** 7.66
- **Perplexity:** 2130
- **Self-BLEU:** 0.40
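
For reference, perplexity is the exponential of the cross-entropy loss (in nats): exp(7.66) ≈ 2123, which agrees with the reported 2130 up to rounding of the loss.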