Mxode/NanoLM-365M-Base

@@ -1,39 +1,45 @@
----
-license: gpl-3.0
-language:
-- en
----
-# NanoLM-365M-base
-English | [简体中文](README_zh-CN.md)
-## Introduction
-Based on [Qwen2-0.5B](https://huggingface.co/Qwen/Qwen2-0.5B), the tokenizer has been replaced with [BilingualTokenizer-8K](https://huggingface.co/Mxode/Bilingual-Tokenizer) to reduce the number of parameters. The total parameters have been reduced from 0.5B to 365M.
-## Details
-To recover some performance and facilitate fine-tuning for downstream tasks, I chose to freeze the backbone parameters and only train the embedding part after replacing the tokenizer. Training was conducted for 40,000 steps on [wikipedia-zh](https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered) and [cosmopedia-100k](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia-100k).
-|                             |                            Value                             |
-| :-------------------------: | :----------------------------------------------------------: |
-|        Total Params         |                            365 M                             |
-|      Trainable Params       |                            < 10 M                            |
-|       Trainable Parts       |                     `model.embed_tokens`                     |
-|       Training Steps        |                            40,000                            |
-|      Training Dataset       | [wikipedia-zh](https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered), [cosmopedia-100k](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia-100k) |
-|          Optimizer          |                         adamw_torch                          |
-|        Learning Rate        |                             2e-4                             |
-|        LR Scheduler         |                            cosine                            |
-|        Weight Decay         |                             0.1                              |
-|        Warm-up Ratio        |                             0.03                             |
-|         Batch Size          |                              16                              |
-| Gradient Accumulation Steps |                              1                               |
-|           Seq Len           |                             4096                             |
-|            Dtype            |                             bf16                             |
-|       Peak GPU Memory       |                           < 48 GB                            |
-|           Device            |                    NVIDIA A100-SXM4-80GB                     |
-The specific training records are as follows:
-![result](static/results.png)

+---
+license: gpl-3.0
+language:
+- en
+datasets:
+- HuggingFaceTB/cosmopedia-100k
+- pleisto/wikipedia-cn-20230720-filtered
+pipeline_tag: text-generation
+tags:
+- text-generation-inference
+---
+# NanoLM-365M-base
+English | [简体中文](README_zh-CN.md)
+## Introduction
+Based on [Qwen2-0.5B](https://huggingface.co/Qwen/Qwen2-0.5B), the tokenizer has been replaced with [BilingualTokenizer-8K](https://huggingface.co/Mxode/Bilingual-Tokenizer) to reduce the number of parameters. The total parameters have been reduced from 0.5B to 365M.
+## Details
+To recover some performance and facilitate fine-tuning for downstream tasks, I chose to freeze the backbone parameters and only train the embedding part after replacing the tokenizer. Training was conducted for 40,000 steps on [wikipedia-zh](https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered) and [cosmopedia-100k](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia-100k).
+|                             |                            Value                             |
+| :-------------------------: | :----------------------------------------------------------: |
+|        Total Params         |                            365 M                             |
+|      Trainable Params       |                            < 10 M                            |
+|       Trainable Parts       |                     `model.embed_tokens`                     |
+|       Training Steps        |                            40,000                            |
+|      Training Dataset       | [wikipedia-zh](https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered), [cosmopedia-100k](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia-100k) |
+|          Optimizer          |                         adamw_torch                          |
+|        Learning Rate        |                             2e-4                             |
+|        LR Scheduler         |                            cosine                            |
+|        Weight Decay         |                             0.1                              |
+|        Warm-up Ratio        |                             0.03                             |
+|         Batch Size          |                              16                              |
+| Gradient Accumulation Steps |                              1                               |
+|           Seq Len           |                             4096                             |
+|            Dtype            |                             bf16                             |
+|       Peak GPU Memory       |                           < 48 GB                            |
+|           Device            |                    NVIDIA A100-SXM4-80GB                     |
+The specific training records are as follows:
+![result](static/results.png)
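
The parameter and training figures in the table can be sanity-checked with a little arithmetic. The sketch below is a rough check, not the author's script. It assumes values the README does not state: that Qwen2-0.5B has a hidden size of 896, a 151,936-entry vocabulary, tied input/output embeddings, and roughly 494M parameters in total, and that the "8K" tokenizer has about 8,192 entries. The token count is an upper bound that further assumes a single device and every sequence filled to the full 4096 tokens.

```python
# Back-of-envelope check of the figures in the README table.
# ASSUMPTIONS (not stated in the README): the Qwen2-0.5B shape values and
# the replacement vocabulary size below are illustrative, not confirmed.

HIDDEN_SIZE = 896          # assumed Qwen2-0.5B hidden size
OLD_VOCAB = 151_936        # assumed Qwen2-0.5B vocabulary size
NEW_VOCAB = 8_192          # hypothetical size for the "8K" tokenizer
OLD_TOTAL = 494_000_000    # approximate Qwen2-0.5B total (tied embeddings assumed)

# Values taken directly from the README table.
STEPS = 40_000
BATCH_SIZE = 16
GRAD_ACCUM = 1
SEQ_LEN = 4096
WARMUP_RATIO = 0.03


def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Number of weights in a vocab_size x hidden_size embedding table."""
    return vocab_size * hidden_size


# Parameter budget: swap the large embedding table for the small one.
old_embed = embedding_params(OLD_VOCAB, HIDDEN_SIZE)
backbone = OLD_TOTAL - old_embed            # frozen during training
new_embed = embedding_params(NEW_VOCAB, HIDDEN_SIZE)  # the trainable part
new_total = backbone + new_embed

# Training budget: tokens seen per optimizer step and over the whole run.
tokens_per_step = BATCH_SIZE * GRAD_ACCUM * SEQ_LEN
max_tokens = tokens_per_step * STEPS        # upper bound, see assumptions
warmup_steps = round(STEPS * WARMUP_RATIO)

print(f"old embedding params: {old_embed / 1e6:.1f} M")
print(f"new embedding params: {new_embed / 1e6:.1f} M")
print(f"estimated new total:  {new_total / 1e6:.1f} M")
print(f"tokens per step:      {tokens_per_step:,}")
print(f"max tokens seen:      {max_tokens / 1e9:.2f} B")
print(f"warm-up steps:        {warmup_steps}")
```

Under these assumptions the new embedding table comes to about 7.3M parameters, which is consistent with the "< 10 M" trainable figure, and the estimated total lands near the reported 365M.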