---
license: mit
library_name: transformers
---
# SpiralAI Spiral-RetNet-3b-base

We pre-trained a 3B-parameter RetNet (https://arxiv.org/abs/2307.08621) model from scratch on a mixed dataset of Japanese and English text.
This model is released primarily for basic research on the retention mechanism.
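As background (summarizing the RetNet paper linked above, not anything specific to this checkpoint), the retention mechanism admits a recurrent form in which a fixed-size state carries the context:

$$
S_n = \gamma\, S_{n-1} + K_n^{\top} V_n, \qquad \mathrm{Retention}(X_n) = Q_n S_n,
$$

where $\gamma$ is a fixed decay. Because the state does not grow with sequence length, inference cost per token is constant, unlike the growing KV cache of standard attention.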

# Model Description

- **Developed by:** [SpiralAI](https://go-spiral.ai/)
- **Model type:** `SpiralAI Spiral-RetNet-3b-base` is a language model equipped with a retention mechanism. It uses the `cyberagent/calm2-7b-chat` tokenizer.
- **Languages:** Japanese, English.
- **License:** MIT
- **Training:** Trained on 80B tokens.
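Since the card declares `library_name: transformers`, loading should follow the standard Auto-class pattern. A minimal usage sketch, assuming the checkpoint is published under the repo id used on this card and ships custom RetNet modeling code (hence `trust_remote_code=True`; review the remote code before enabling it):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Spiral-AI/Spiral-RetNet-3b-base"


def generate_text(prompt: str, max_new_tokens: int = 64) -> str:
    # The tokenizer bundled with the checkpoint is assumed to be the
    # cyberagent/calm2-7b-chat tokenizer mentioned above.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # RetNet is not a built-in transformers architecture, so the repo is
    # assumed to ship its own modeling code.
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


if __name__ == "__main__":
    print(generate_text("こんにちは、"))
```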

## Test loss comparison

We compared the test loss of `Spiral-AI/Spiral-RetNet-3b-base` and `cyberagent/open-calm-3b` across different context lengths.
The test dataset consists of the first 100 examples extracted from `wikipedia-ja`.

![Test loss comparison](figs/test_loss.png)

Key findings are:

- The test loss of `Spiral-AI/Spiral-RetNet-3b-base` goes as low as that of `cyberagent/open-calm-3b`, showing the effectiveness of the retention mechanism.
- The explosion of test loss is suppressed in `Spiral-AI/Spiral-RetNet-3b-base` when the context length exceeds 2,048 tokens, the maximum context length of the training data. (Note that `cyberagent/open-calm-3b` was trained on the same context length.)
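For readers reproducing the curve, the quantity plotted is the usual language-model test loss: mean negative log-likelihood per token, recomputed after truncating each test example to the given context length. A minimal sketch of that bookkeeping with toy numbers (the probabilities and helper names below are illustrative, not taken from the actual evaluation code):

```python
import math


def mean_nll(token_probs):
    """Test loss of one example: average negative log-likelihood per token."""
    return sum(-math.log(p) for p in token_probs) / len(token_probs)


def loss_by_context_length(examples, lengths):
    """Average test loss over examples, each truncated to the given length."""
    result = {}
    for length in lengths:
        losses = [mean_nll(probs[:length]) for probs in examples if len(probs) >= length]
        result[length] = sum(losses) / len(losses)
    return result


# Toy per-token probabilities for two short "examples" (illustrative only).
examples = [
    [0.5, 0.25, 0.125, 0.0625],
    [0.5, 0.5, 0.25, 0.25],
]
print(loss_by_context_length(examples, [2, 4]))
```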

# Training Datasets