---

### Overview

- ModernBertMultilingual is a multilingual model trained from scratch using the [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) architecture
- It supports four languages and their variants, including `Simplified Chinese`, `Traditional Chinese`, `English`, `Japanese`, and `Korean`
- It can effectively handle mixed-language text tasks in East Asian languages

### Technical Metrics

- Uses a slightly modified vocabulary from the Qwen2.5 series to support multilingual capabilities
- Trained for approximately `100` hours on `L40*7` devices, with a training volume of about `60B` tokens
- Main training parameters:
  - Batch Size: 1792
  - Learning Rate: 5e-04
  - Optimizer: adamw_torch
  - LR Scheduler: warmup_stable_decay
  - Train Precision: bf16 mixed
- For additional technical metrics, please refer to the original release information and papers for [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base)
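The numbers above can be tied together with a little arithmetic. Below is a minimal sketch of how many optimizer steps a `60B`-token budget implies at batch size 1792, and of the general shape of a `warmup_stable_decay` schedule (linear warmup, flat plateau, then decay to zero). The sequence length of `1024` is an assumption for illustration only, as the card does not state one, and the exact decay curve varies between implementations:

```python
PEAK_LR = 5e-4                  # Learning Rate from the card
BATCH_SIZE = 1792               # sequences per optimizer step
TOTAL_TOKENS = 60_000_000_000   # ~60B training tokens
SEQ_LEN = 1024                  # ASSUMPTION: not stated on the card


def optimizer_steps(total_tokens: int, batch_size: int, seq_len: int) -> int:
    """Optimizer steps needed to consume the full token budget."""
    return total_tokens // (batch_size * seq_len)


def wsd_lr(step: int, peak_lr: float, warmup: int, stable: int, decay: int) -> float:
    """Warmup-Stable-Decay: linear warmup to peak_lr, a flat plateau,
    then (here) a linear decay to zero; decay shape differs by implementation."""
    if step < warmup:
        return peak_lr * step / warmup
    if step < warmup + stable:
        return peak_lr
    remaining = max(0.0, 1.0 - (step - warmup - stable) / decay)
    return peak_lr * remaining


steps = optimizer_steps(TOTAL_TOKENS, BATCH_SIZE, SEQ_LEN)  # ~32.7k steps under this assumption
```

Because the schedule holds the learning rate flat between warmup and decay, a checkpoint taken at the end of the plateau (the `nodecay` release described in the next section) can still be annealed on new data.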

### Release Versions

- Three different weight versions are provided:
  - base: the version trained on general base data, suitable for texts from various domains (default)
  - nodecay: the checkpoint taken before the annealing phase begins, so you can add domain-specific data for annealing to better adapt to a target domain
  - keyword_gacha_multilingual: the version annealed on ACGN-related texts (e.g., `light novels`, `game scripts`, `comic scripts`)

| Model | Version | Description |
| :--: | :--: | :--: |
| [keyword_gacha_base_multilingual](https://huggingface.co/neavo/keyword_gacha_base_multilingual) | 20250128 | keyword_gacha_multilingual |
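To try a released checkpoint, the standard `transformers` Auto classes should suffice. This is a sketch, not an official usage snippet: it assumes a recent `transformers` release that includes ModernBERT support, and uses the repo id from the table above:

```python
# Repo id taken from the Release Versions table above.
REPO_ID = "neavo/keyword_gacha_base_multilingual"


def load_checkpoint(repo_id: str = REPO_ID):
    """Download and load the tokenizer and encoder weights from the Hub."""
    # Import is deferred so the sketch can be read (and the repo id reused)
    # without transformers installed; loading itself requires it.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    # AutoModel yields the bare encoder (e.g. for embeddings); swap in a
    # task-specific Auto class such as AutoModelForMaskedLM to fine-tune.
    model = AutoModel.from_pretrained(repo_id)
    return tokenizer, model
```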

### Others

- Training script available on [GitHub](https://github.com/neavo/KeywordGachaModel)