---

### Overview

- ModernBertMultilingual is a multilingual model trained from scratch using the [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) architecture
- It supports four languages and their variants, including `Simplified Chinese`, `Traditional Chinese`, `English`, `Japanese`, and `Korean`
- It can effectively handle mixed-language text tasks in East Asian languages

### Technical Metrics

- Uses a slightly modified vocabulary from the Qwen2.5 series to support multilingual capabilities
- Trained for approximately `100` hours on `L40*7` devices, with a training volume of about `60B` tokens
- Main training parameters:
  - Batch Size: 1792
  - Learning Rate: 5e-04
  - Optimizer: adamw_torch
  - LR Scheduler: warmup_stable_decay
  - Train Precision: bf16 mixed
- For additional technical metrics, please refer to the original release information and papers for [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base)
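The numbers above can be tied together with a little arithmetic. Below is a minimal sketch of how many optimizer steps a `60B`-token budget implies at batch size 1792, and of the general shape of a `warmup_stable_decay` schedule (linear warmup, flat plateau, then decay to zero). The sequence length of `1024` is an assumption for illustration only, as the card does not state one, and the exact decay curve varies between implementations:

```python
PEAK_LR = 5e-4                  # Learning Rate from the card
BATCH_SIZE = 1792               # sequences per optimizer step
TOTAL_TOKENS = 60_000_000_000   # ~60B training tokens
SEQ_LEN = 1024                  # ASSUMPTION: not stated on the card


def optimizer_steps(total_tokens: int, batch_size: int, seq_len: int) -> int:
    """Optimizer steps needed to consume the full token budget."""
    return total_tokens // (batch_size * seq_len)


def wsd_lr(step: int, peak_lr: float, warmup: int, stable: int, decay: int) -> float:
    """Warmup-Stable-Decay: linear warmup to peak_lr, a flat plateau,
    then (here) a linear decay to zero; decay shape differs by implementation."""
    if step < warmup:
        return peak_lr * step / warmup
    if step < warmup + stable:
        return peak_lr
    remaining = max(0.0, 1.0 - (step - warmup - stable) / decay)
    return peak_lr * remaining


steps = optimizer_steps(TOTAL_TOKENS, BATCH_SIZE, SEQ_LEN)  # ~32.7k steps under this assumption
```

Because the schedule holds the learning rate flat between warmup and decay, a checkpoint taken at the end of the plateau (the `nodecay` release described in the next section) can still be annealed on new data.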

### Release Versions

- Three different weight versions are provided:
  - base: the version trained on general base data, suitable for texts from various domains (default)
  - nodecay: the checkpoint taken before the annealing phase begins, so you can add domain-specific data for annealing to better adapt to a target domain
  - keyword_gacha_multilingual: the version annealed on ACGN-related texts (e.g., `light novels`, `game scripts`, `comic scripts`)

| Model | Version | Description |
| :--: | :--: | :--: |
| [keyword_gacha_base_multilingual](https://huggingface.co/neavo/keyword_gacha_base_multilingual) | 20250128 | keyword_gacha_multilingual |
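To try a released checkpoint, the standard `transformers` Auto classes should suffice. This is a sketch, not an official usage snippet: it assumes a recent `transformers` release that includes ModernBERT support, and uses the repo id from the table above:

```python
# Repo id taken from the Release Versions table above.
REPO_ID = "neavo/keyword_gacha_base_multilingual"


def load_checkpoint(repo_id: str = REPO_ID):
    """Download and load the tokenizer and encoder weights from the Hub."""
    # Import is deferred so the sketch can be read (and the repo id reused)
    # without transformers installed; loading itself requires it.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    # AutoModel yields the bare encoder (e.g. for embeddings); swap in a
    # task-specific Auto class such as AutoModelForMaskedLM to fine-tune.
    model = AutoModel.from_pretrained(repo_id)
    return tokenizer, model
```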

### Others

- Training script available on [GitHub](https://github.com/neavo/KeywordGachaModel)