Update README.md
README.md
CHANGED
@@ -1,3 +1,18 @@
 ---
 license: apache-2.0
 ---
+
+
+# Base checkpoint
+augmxnt/shisa-7b-v1
+* Mistral-7B base
+* Pre-trained on 8B tokens of MADLAD-Ja
+* Fine-tuned on Japanese instructions
+* Highest-scoring 7B model on the JA MT-Bench conversation benchmark
+
+# Training datasets (total ~7B tokens)
+* Aozora Bunko
+* Japanese Law Precedent Dataset
+* Japanese Wikipedia
+* .lg.jp, .go.jp, and .ac.jp domain web scrapes from CulturaX (documents sharing the same first 25 characters were de-duplicated)
+* English Ultrachat200K-gen (so that the model does not forget the English and chat ability learned in the base checkpoint)
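The CulturaX bullet describes two steps: keeping only pages from the .lg.jp, .go.jp, and .ac.jp domains, and de-duplicating documents that share the same first 25 characters. A minimal sketch of how that could look is below. It is not the author's actual pipeline: the record layout (a dict with `url` and `text` keys) and the function names `is_allowed_domain` and `filter_and_dedupe` are assumptions made for illustration, and the sketch keeps the first document seen for each prefix.

```python
from urllib.parse import urlparse

# Domain suffixes and prefix length are taken from the README text.
ALLOWED_SUFFIXES = (".lg.jp", ".go.jp", ".ac.jp")
PREFIX_LEN = 25


def is_allowed_domain(url: str) -> bool:
    """Return True if the URL's hostname ends with one of the allowed suffixes."""
    host = urlparse(url).hostname or ""
    return host.endswith(ALLOWED_SUFFIXES)


def filter_and_dedupe(records):
    """Keep allowed-domain records, dropping any whose first 25 characters
    match a record that was already kept.

    Assumption: each record is a dict with "url" and "text" keys.
    """
    seen_prefixes = set()
    kept = []
    for record in records:
        if not is_allowed_domain(record["url"]):
            continue
        prefix = record["text"][:PREFIX_LEN]
        if prefix in seen_prefixes:
            continue  # same opening 25 characters as an earlier document
        seen_prefixes.add(prefix)
        kept.append(record)
    return kept
```

Matching on a short prefix is a cheap way to drop pages that open with identical text, but it can also discard distinct documents that merely start the same way, so the prefix length is a trade-off.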