---
license: mit
library_name: transformers
---
# SpiralAI Spiral-RetNet-3b-base

We pre-trained a 3B-parameter RetNet (https://arxiv.org/abs/2307.08621) model from scratch on a mixed dataset of Japanese and English text.
This model is released primarily for basic research on the retention mechanism.
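As background (summarizing the RetNet paper linked above, not anything specific to this checkpoint), the retention mechanism admits a recurrent form in which a fixed-size state carries the context:

$$
S_n = \gamma\, S_{n-1} + K_n^{\top} V_n, \qquad \mathrm{Retention}(X_n) = Q_n S_n,
$$

where $\gamma$ is a fixed decay. Because the state does not grow with sequence length, inference cost per token is constant, unlike the growing KV cache of standard attention.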

# Model Description

- **Developed by:** [SpiralAI](https://go-spiral.ai/)
- **Model type:** `SpiralAI Spiral-RetNet-3b-base` is a language model equipped with a retention mechanism. It uses the `cyberagent/calm2-7b-chat` tokenizer.
- **Languages:** Japanese, English.
- **License:** MIT
- **Training:** Trained on 80B tokens.
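Since the card declares `library_name: transformers`, loading should follow the standard Auto-class pattern. A minimal usage sketch, assuming the checkpoint is published under the repo id used on this card and ships custom RetNet modeling code (hence `trust_remote_code=True`; review the remote code before enabling it):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Spiral-AI/Spiral-RetNet-3b-base"


def generate_text(prompt: str, max_new_tokens: int = 64) -> str:
    # The tokenizer bundled with the checkpoint is assumed to be the
    # cyberagent/calm2-7b-chat tokenizer mentioned above.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # RetNet is not a built-in transformers architecture, so the repo is
    # assumed to ship its own modeling code.
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


if __name__ == "__main__":
    print(generate_text("こんにちは、"))
```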

## Test loss comparison

We compared the test loss of `Spiral-AI/Spiral-RetNet-3b-base` and `cyberagent/open-calm-3b` across different context lengths.
The test dataset consists of the first 100 examples extracted from `wikipedia-ja`.

![Test loss comparison](figs/test_loss.png)

Key findings are:

- The test loss of `Spiral-AI/Spiral-RetNet-3b-base` goes as low as that of `cyberagent/open-calm-3b`, showing the effectiveness of the retention mechanism.
- The explosion of test loss is suppressed in `Spiral-AI/Spiral-RetNet-3b-base` when the context length exceeds 2,048 tokens, the maximum context length of the training data. (Note that `cyberagent/open-calm-3b` was trained on the same context length.)
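For readers reproducing the curve, the quantity plotted is the usual language-model test loss: mean negative log-likelihood per token, recomputed after truncating each test example to the given context length. A minimal sketch of that bookkeeping with toy numbers (the probabilities and helper names below are illustrative, not taken from the actual evaluation code):

```python
import math


def mean_nll(token_probs):
    """Test loss of one example: average negative log-likelihood per token."""
    return sum(-math.log(p) for p in token_probs) / len(token_probs)


def loss_by_context_length(examples, lengths):
    """Average test loss over examples, each truncated to the given length."""
    result = {}
    for length in lengths:
        losses = [mean_nll(probs[:length]) for probs in examples if len(probs) >= length]
        result[length] = sum(losses) / len(losses)
    return result


# Toy per-token probabilities for two short "examples" (illustrative only).
examples = [
    [0.5, 0.25, 0.125, 0.0625],
    [0.5, 0.5, 0.25, 0.25],
]
print(loss_by_context_length(examples, [2, 4]))
```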

# Training Datasets