Update README.md
README.md
CHANGED
```diff
@@ -4,18 +4,11 @@ datasets:
 - nvidia/Nemotron-Post-Training-Dataset-v1
 pipeline_tag: text-generation
 library_name: transformers
-base_model:
-- Qwen/Qwen3-Next-80B-A3B-Instruct
 ---
 **Model Introduction**
 
 EPT-ZeRo is an sLM designed by the Research Project ICT I team from Singapore Korean International School for on-device/edge environments, prioritizing lower memory usage and efficient inference.
-To achieve this, the
-EPT-ZeRo and its derivatives(i.g. EPT-I) are created by modifying
+To achieve this, the EPT series implements Rotary Positional Embeddings (RoPE), SwiGLU activation combined with causal-convolution-based FFNs, weight tying, and RMS layer normalization (RMSNorm), along with Multi-Head Latent Attention (MLA) for better expressive capability per parameter and a lower memory footprint.
+EPT-ZeRo and its derivatives (e.g. EPT-I) are created by modifying DeepSeek-V3's modeling code: converting the model into a dense model instead of a Mixture-of-Experts (MoE) model, reducing the total parameter count to the original model's number of active parameters, and adjusting the configuration to suit the new architecture.
 
-EPT-ZeRo is the prototype of the EPT family, which is the base model that was only pretrained and did not undergo post-training including SFT and Alignment.
-
-**Caution**
-
-Note that the EPT series may not support conventional optimization kernels such as FlashAttention, because it implements Power Retention instead of Scaled Dot-Product Attention.
-Therefore, users should not pass the `attn_implementation` parameter when loading the model with `AutoModelForCausalLM`. Though not tested, using FlashAttention or SDPA to load ISAC may cause an error.
+EPT-ZeRo is the prototype of the EPT family: the base model that was only pretrained and did not undergo post-training, including SFT and alignment.
```
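Two of the standard components the updated card names, RMSNorm and the SwiGLU feed-forward block, can be sketched in a few lines. This is an illustrative NumPy sketch with assumed shapes, not the EPT modeling code.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: rescale by the reciprocal root-mean-square of the features;
    # unlike LayerNorm there is no mean subtraction and no bias term.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU feed-forward: SiLU(x @ W_gate) element-wise gates (x @ W_up),
    # then W_down projects back to the model dimension.
    silu = lambda z: z / (1.0 + np.exp(-z))
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16          # toy sizes, not EPT-ZeRo's
x = rng.standard_normal((2, d_model))
h = rms_norm(x, np.ones(d_model))
y = swiglu_ffn(h,
               rng.standard_normal((d_model, d_ff)),
               rng.standard_normal((d_model, d_ff)),
               rng.standard_normal((d_ff, d_model)))
print(y.shape)  # (2, 8)
```

In a real decoder layer these would sit around the attention block, with the normalization applied pre-attention and pre-FFN.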
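The MoE-to-dense conversion the card describes amounts to dropping the expert-routing machinery from the configuration. In this sketch the field names mirror DeepSeek-V3's `config.json` (`n_routed_experts`, `num_experts_per_tok`, `moe_intermediate_size`, `first_k_dense_replace`), but the numeric values and the `densify` helper are illustrative assumptions, not EPT-ZeRo's real settings.

```python
# Toy DeepSeek-V3-style config; values are illustrative only.
moe_config = {
    "hidden_size": 1024,
    "intermediate_size": 4096,      # dense FFN width
    "moe_intermediate_size": 512,   # per-expert FFN width
    "n_routed_experts": 64,
    "num_experts_per_tok": 6,
    "first_k_dense_replace": 1,     # layers that are dense even in the MoE
}

def densify(cfg):
    # Remove every MoE routing field so all layers take the dense FFN path;
    # total parameters then shrink toward the MoE's active-parameter count.
    moe_keys = ("moe_intermediate_size", "n_routed_experts",
                "num_experts_per_tok", "first_k_dense_replace")
    return {k: v for k, v in cfg.items() if k not in moe_keys}

dense_config = densify(moe_config)
```

The modeling code would need the matching change: a single SwiGLU FFN per layer instead of the router plus expert bank.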
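The caution removed in this commit advised against forcing an attention kernel, since the EPT series implements Power Retention rather than scaled dot-product attention. A minimal sketch of that advice follows; the repo id is a hypothetical placeholder, as the card does not give one.

```python
# Do NOT pass attn_implementation ("flash_attention_2" or "sdpa") when
# loading: models with a custom attention mechanism may error if a
# kernel is forced. The kwargs below deliberately omit it.
# "EPT-org/EPT-ZeRo" is a hypothetical placeholder repo id.

def load_kwargs():
    # Keyword arguments for from_pretrained; no "attn_implementation" key.
    return {"trust_remote_code": True}

# Usage (requires the transformers library and network access):
# from transformers import AutoModelForCausalLM, AutoTokenizer
# model = AutoModelForCausalLM.from_pretrained("EPT-org/EPT-ZeRo", **load_kwargs())
# tokenizer = AutoTokenizer.from_pretrained("EPT-org/EPT-ZeRo", **load_kwargs())
```

`trust_remote_code=True` is needed because the custom modeling code ships with the checkpoint rather than with the transformers library.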