pathcosmos committed
Commit 42ca925 · verified · 1 Parent(s): 6a4023f

Upload folder using huggingface_hub

Files changed (6):
  1. README.md +162 -0
  2. config.json +28 -0
  3. generation_config.json +10 -0
  4. model.safetensors +3 -0
  5. tokenizer.json +0 -0
  6. tokenizer_config.json +10 -0
README.md ADDED
@@ -0,0 +1,162 @@
---
language:
- ko
- en
license: mit
tags:
- mamba2
- hybrid
- korean
- causal-lm
pipeline_tag: text-generation
---

# EVAFRILL-Mo-3B

EVAFRILL-Mo-3B is a 2.94B-parameter **hybrid Mamba-2 + Transformer** language model optimised for Korean, trained from scratch on 55 billion tokens of Korean-dominant multilingual text.

> **EVAFRILL-Mo** stands for *Efficient Variably-Architected Fusion of Recurrent and Integrated Linear Layers for Language Model-based Output* — a custom architecture inspired by [Nemotron-H](https://arxiv.org/abs/2501.14587) that replaces most self-attention layers with Mamba-2 SSM blocks, achieving linear-time inference without sacrificing generation quality.

---

## Architecture

| Property | Value |
|---|---|
| Total parameters | ~2.94 B |
| Layers | 26 (24 × Mamba-2 + 2 × Attention) |
| Hidden size | 3072 |
| Attention heads | 24 (GQA, 8 KV heads) |
| FFN size | 9 216 |
| Mamba-2 state dim | 128 |
| Mamba-2 head dim | 64 |
| Vocab size | 64 000 |
| Max sequence length | 4 096 |
| RoPE theta | 500 000 |

The layer pattern places attention blocks at positions 12 and 24 (zero-indexed), matching the `hybrid_pattern` string in `config.json` and mirroring the Nemotron-H 8B dense design scaled down to 3B parameters. All other layers use Mamba-2 with SwiGLU FFN (mamba_d_ffn = 4 608); attention layers use the full SwiGLU FFN (d_ffn = 9 216).
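The attention placement can be read straight from the `hybrid_pattern` string shipped in `config.json`; a quick sketch:

```python
# Parse the hybrid layer pattern from config.json:
# "M" = Mamba-2 block, "A" = self-attention block.
hybrid_pattern = "M M M M M M M M M M M M A M M M M M M M M M M M A M"

layers = hybrid_pattern.split()
attn_positions = [i for i, kind in enumerate(layers) if kind == "A"]

print(f"{len(layers)} layers: {layers.count('M')} Mamba-2, "
      f"attention at {attn_positions}")
# 26 layers: 24 Mamba-2, attention at [12, 24]
```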

---

## Training

### Pretraining
- **Tokens**: 55 B (319 772 steps, effective batch ≈ 172 K tokens)
- **Hardware**: 8× NVIDIA B200 (183 GB each), ~62 hours
- **Optimizer**: AdamW, lr=2e-4, cosine decay, warmup 2 000 steps
- **Precision**: FP8 (TransformerEngine MXFP8) + BF16 embedding
- **Data**: Korean web corpus, Wikipedia, books, code (Korean-dominant)
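These figures are internally consistent: at roughly 172 000 tokens per optimizer step (the card only states "≈ 172 K", so the exact value is assumed here), the step count reproduces the 55 B-token budget:

```python
# Sanity check: steps × effective tokens per step ≈ stated 55 B pretraining tokens.
steps = 319_772
tokens_per_step = 172_000  # "≈ 172 K"; exact value not published, assumed here

total_tokens = steps * tokens_per_step
print(f"≈ {total_tokens / 1e9:.1f} B tokens")  # ≈ 55.0 B
```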

### Supervised Fine-Tuning (SFT)
- **Steps**: 65 000 (≈ 1 epoch over 2.44 M instruction samples)
- **Effective batch**: 56 (2 per GPU × 7 GPUs × 4 grad_accum)
- **LR**: 1e-5 (1/20 of the pretraining LR, as a catastrophic-forgetting guard)
- **NEFTune alpha**: 5.0 (mitigates repetition degeneracy)
- **Data**: combined Korean instruction set (filtered, 2.44 M samples)

### Direct Preference Optimisation (DPO)
- **Rounds**: 2-round DPO (Nemotron-H style)
  - Round 1: 3 000 steps, beta=0.1, lr=5e-7, LoRA rank=32
  - Round 2: 2 000 steps, beta=0.05, lr=1e-7, LoRA rank=32
- **Hardware**: 1× NVIDIA H100 MIG 3g.40gb (~42 GB VRAM)
- **Method**: native LoRA DPO (no TRL dependency)

### SLERP Merge
The final checkpoint is produced by **spherical linear interpolation (SLERP)** between the SFT-v2 and DPO-round-2 checkpoints (ratio 0.5), combining the instruction-following strengths of both stages.
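The merge can be sketched per tensor as below (a minimal NumPy illustration under assumed conventions; the project's actual merge script, tensor iteration, and edge-case handling are not published):

```python
import numpy as np

def slerp(a: np.ndarray, b: np.ndarray, t: float = 0.5, eps: float = 1e-8) -> np.ndarray:
    """Spherical linear interpolation between two weight tensors of equal shape."""
    a_flat, b_flat = a.ravel(), b.ravel()
    # Angle between the two weight directions.
    dot = np.clip(
        (a_flat @ b_flat) / (np.linalg.norm(a_flat) * np.linalg.norm(b_flat) + eps),
        -1.0, 1.0,
    )
    omega = np.arccos(dot)
    if omega < eps:  # nearly parallel: plain LERP is numerically safer
        return (1 - t) * a + t * b
    so = np.sin(omega)
    out = (np.sin((1 - t) * omega) / so) * a_flat + (np.sin(t * omega) / so) * b_flat
    return out.reshape(a.shape)

# Toy stand-ins for one tensor from the SFT-v2 and DPO-round-2 checkpoints.
sft_w = np.array([[1.0, 0.0], [0.0, 1.0]])
dpo_w = np.array([[0.0, 1.0], [1.0, 0.0]])
merged = slerp(sft_w, dpo_w, t=0.5)  # ratio 0.5, as in the model card
```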

---

## Evaluation (SLERP checkpoint, lm-eval-harness)

| Benchmark | Metric | Score |
|---|---|---|
| HellaSwag | acc_norm | 0.42 |
| ARC-Challenge | acc_norm | 0.22 |
| ARC-Easy | acc_norm | 0.28 |
| Belebele (kor_Hang) | acc | 0.30 |
| Global-MMLU-ko (full) | acc | 0.233 |
| — Humanities | acc | 0.242 |
| — STEM | acc | 0.237 |
| — Social Sciences | acc | 0.221 |
| — Other | acc | 0.229 |

*Evaluated on 100-sample subsets per task. Numbers reflect the final SLERP-merged checkpoint.*

---

## Usage

```python
# Requires: pip install transformers tokenizers safetensors
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "pathcosmos/EVAFRILL-Mo-3B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Note: the custom `evafrill-mo` architecture is not part of stock
# transformers; loading may additionally require trust_remote_code=True
# or the original project code (see Limitations).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Chat-style prompt ("Hello! Please introduce yourself.")
prompt = "<|user|>\n안녕하세요! 자기소개를 해 주세요.\n<|assistant|>\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.8,
        top_p=0.9,
        do_sample=True,
        repetition_penalty=1.1,
    )

print(tokenizer.decode(output[0], skip_special_tokens=False))
```

---

## Limitations

- This model is an **experimental research checkpoint**, not a production system.
- Korean is the dominant language; English and other languages are secondary.
- The custom architecture (`evafrill-mo`) requires either (a) the original project code for full inference, or (b) a compatible HuggingFace integration that understands Mamba-2 hybrid layers. The exported `model.safetensors` preserves the native weight layout.
- Benchmark numbers were evaluated on small (100-sample) subsets and should be treated as rough estimates.

---

## Citation

```bibtex
@misc{evafrill-mo-3b-2026,
  title  = {EVAFRILL-Mo-3B: A Hybrid Mamba-2 + Transformer LLM for Korean},
  author = {pathcosmos},
  year   = {2026},
  url    = {https://huggingface.co/pathcosmos/EVAFRILL-Mo-3B},
}
```

---

## License

[MIT](LICENSE)
config.json ADDED
@@ -0,0 +1,28 @@
{
  "vocab_size": 64000,
  "d_model": 3072,
  "n_layers": 26,
  "n_heads": 24,
  "n_kv_heads": 8,
  "d_ffn": 9216,
  "max_seq_len": 4096,
  "rope_theta": 500000.0,
  "dropout": 0.0,
  "bias": false,
  "use_flash_attn": true,
  "use_fp8": false,
  "use_hybrid": true,
  "hybrid_pattern": "M M M M M M M M M M M M A M M M M M M M M M M M A M",
  "mamba_d_state": 128,
  "mamba_head_dim": 64,
  "mamba_expand": 2,
  "mamba_conv_kernel": 4,
  "mamba_n_groups": 8,
  "mamba_d_ffn": 4608,
  "mamba_chunk_size": 256,
  "model_type": "evafrill-mo",
  "architectures": [
    "EvafrillMoForCausalLM"
  ],
  "torch_dtype": "bfloat16"
}
generation_config.json ADDED
@@ -0,0 +1,10 @@
{
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 0,
  "max_new_tokens": 512,
  "temperature": 0.8,
  "top_p": 0.9,
  "repetition_penalty": 1.1,
  "do_sample": true
}
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b7fedbd0d0f8e33a1fb5e6c4e8e9393f729cc77b364d431e522857ce6a1c8d56
size 6301164272
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,10 @@
{
  "model_type": "evafrill-mo",
  "tokenizer_class": "PreTrainedTokenizerFast",
  "bos_token": "<s>",
  "eos_token": "</s>",
  "unk_token": "<unk>",
  "pad_token": "<pad>",
  "clean_up_tokenization_spaces": false,
  "chat_template": "<|user|>\n{{ message }}\n<|assistant|>\n"
}
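The chat template above is a single-turn Jinja string with one `message` variable, rather than the usual loop over a `messages` list, so `tokenizer.apply_chat_template` with role/content dictionaries may not apply directly. A plain-substitution sketch of what it renders:

```python
# Render the single-turn chat template by direct substitution
# (stand-in for Jinja; the template has exactly one {{ message }} slot).
chat_template = "<|user|>\n{{ message }}\n<|assistant|>\n"

def render(message: str) -> str:
    return chat_template.replace("{{ message }}", message)

# "Hello! Please introduce yourself." (matches the README usage example)
prompt = render("안녕하세요! 자기소개를 해 주세요.")
print(prompt)
```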