pathcosmos committed
Commit 42ca925 · verified · 1 Parent(s): 6a4023f

Upload folder using huggingface_hub

Files changed (6):
  1. README.md +162 -0
  2. config.json +28 -0
  3. generation_config.json +10 -0
  4. model.safetensors +3 -0
  5. tokenizer.json +0 -0
  6. tokenizer_config.json +10 -0
README.md ADDED
@@ -0,0 +1,162 @@
---
language:
- ko
- en
license: mit
tags:
- mamba2
- hybrid
- korean
- causal-lm
pipeline_tag: text-generation
---

# EVAFRILL-Mo-3B

EVAFRILL-Mo-3B is a 2.94B-parameter **hybrid Mamba-2 + Transformer** language model optimised for Korean, trained from scratch on 55 billion tokens of Korean-dominant multilingual text.

> **EVAFRILL-Mo** stands for *Efficient Variably-Architected Fusion of Recurrent and Integrated Linear Layers for Language Model-based Output* — a custom architecture inspired by [Nemotron-H](https://arxiv.org/abs/2501.14587) that replaces most self-attention layers with Mamba-2 SSM blocks, achieving linear-time inference without sacrificing generation quality.

---

## Architecture

| Property | Value |
|---|---|
| Total parameters | ~2.94 B |
| Layers | 26 (24 × Mamba-2 + 2 × Attention) |
| Hidden size | 3072 |
| Attention heads | 24 (GQA, 8 KV heads) |
| FFN size | 9 216 |
| Mamba-2 state dim | 128 |
| Mamba-2 head dim | 64 |
| Vocab size | 64 000 |
| Max sequence length | 4 096 |
| RoPE theta | 500 000 |

The layer pattern places attention blocks at positions 12 and 24 (zero-indexed), matching the `hybrid_pattern` string in `config.json` and mirroring the Nemotron-H 8B dense design scaled down to 3B parameters. All other layers use Mamba-2 with SwiGLU FFN (mamba_d_ffn = 4 608); attention layers use the full SwiGLU FFN (d_ffn = 9 216).
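The attention placement can be read straight from the `hybrid_pattern` string shipped in `config.json`; a quick sketch:

```python
# Parse the hybrid layer pattern from config.json:
# "M" = Mamba-2 block, "A" = self-attention block.
hybrid_pattern = "M M M M M M M M M M M M A M M M M M M M M M M M A M"

layers = hybrid_pattern.split()
attn_positions = [i for i, kind in enumerate(layers) if kind == "A"]

print(f"{len(layers)} layers: {layers.count('M')} Mamba-2, "
      f"attention at {attn_positions}")
# 26 layers: 24 Mamba-2, attention at [12, 24]
```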

---

## Training

### Pretraining
- **Tokens**: 55 B (319 772 steps, effective batch ≈ 172 K tokens)
- **Hardware**: 8× NVIDIA B200 (183 GB each), ~62 hours
- **Optimizer**: AdamW, lr=2e-4, cosine decay, warmup 2 000 steps
- **Precision**: FP8 (TransformerEngine MXFP8) + BF16 embedding
- **Data**: Korean web corpus, Wikipedia, books, code (Korean-dominant)
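These figures are internally consistent: at roughly 172 000 tokens per optimizer step (the card only states "≈ 172 K", so the exact value is assumed here), the step count reproduces the 55 B-token budget:

```python
# Sanity check: steps × effective tokens per step ≈ stated 55 B pretraining tokens.
steps = 319_772
tokens_per_step = 172_000  # "≈ 172 K"; exact value not published, assumed here

total_tokens = steps * tokens_per_step
print(f"≈ {total_tokens / 1e9:.1f} B tokens")  # ≈ 55.0 B
```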

### Supervised Fine-Tuning (SFT)
- **Steps**: 65 000 (≈ 1 epoch over 2.44 M instruction samples)
- **Effective batch**: 56 (2 per GPU × 7 GPUs × 4 grad_accum)
- **LR**: 1e-5 (1/20 of the pretraining LR, as a catastrophic-forgetting guard)
- **NEFTune alpha**: 5.0 (mitigates repetition degeneracy)
- **Data**: combined Korean instruction set (filtered, 2.44 M samples)

### Direct Preference Optimisation (DPO)
- **Rounds**: 2-round DPO (Nemotron-H style)
  - Round 1: 3 000 steps, beta=0.1, lr=5e-7, LoRA rank=32
  - Round 2: 2 000 steps, beta=0.05, lr=1e-7, LoRA rank=32
- **Hardware**: 1× NVIDIA H100 MIG 3g.40gb (~42 GB VRAM)
- **Method**: native LoRA DPO (no TRL dependency)

### SLERP Merge
The final checkpoint is produced by **spherical linear interpolation (SLERP)** between the SFT-v2 and DPO-round-2 checkpoints (ratio 0.5), combining the instruction-following strengths of both stages.
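The merge can be sketched per tensor as below (a minimal NumPy illustration under assumed conventions; the project's actual merge script, tensor iteration, and edge-case handling are not published):

```python
import numpy as np

def slerp(a: np.ndarray, b: np.ndarray, t: float = 0.5, eps: float = 1e-8) -> np.ndarray:
    """Spherical linear interpolation between two weight tensors of equal shape."""
    a_flat, b_flat = a.ravel(), b.ravel()
    # Angle between the two weight directions.
    dot = np.clip(
        (a_flat @ b_flat) / (np.linalg.norm(a_flat) * np.linalg.norm(b_flat) + eps),
        -1.0, 1.0,
    )
    omega = np.arccos(dot)
    if omega < eps:  # nearly parallel: plain LERP is numerically safer
        return (1 - t) * a + t * b
    so = np.sin(omega)
    out = (np.sin((1 - t) * omega) / so) * a_flat + (np.sin(t * omega) / so) * b_flat
    return out.reshape(a.shape)

# Toy stand-ins for one tensor from the SFT-v2 and DPO-round-2 checkpoints.
sft_w = np.array([[1.0, 0.0], [0.0, 1.0]])
dpo_w = np.array([[0.0, 1.0], [1.0, 0.0]])
merged = slerp(sft_w, dpo_w, t=0.5)  # ratio 0.5, as in the model card
```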

---

## Evaluation (SLERP checkpoint, lm-eval-harness)

| Benchmark | Metric | Score |
|---|---|---|
| HellaSwag | acc_norm | 0.42 |
| ARC-Challenge | acc_norm | 0.22 |
| ARC-Easy | acc_norm | 0.28 |
| Belebele (kor_Hang) | acc | 0.30 |
| Global-MMLU-ko (full) | acc | 0.233 |
| — Humanities | acc | 0.242 |
| — STEM | acc | 0.237 |
| — Social Sciences | acc | 0.221 |
| — Other | acc | 0.229 |

*Evaluated on 100-sample subsets per task. Numbers reflect the final SLERP-merged checkpoint.*

---

## Usage

```python
# Requires: pip install transformers tokenizers safetensors
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "pathcosmos/EVAFRILL-Mo-3B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Note: the custom `evafrill-mo` architecture is not part of stock
# transformers; loading may additionally require trust_remote_code=True
# or the original project code (see Limitations).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Chat-style prompt ("Hello! Please introduce yourself.")
prompt = "<|user|>\n안녕하세요! 자기소개를 해 주세요.\n<|assistant|>\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.8,
        top_p=0.9,
        do_sample=True,
        repetition_penalty=1.1,
    )

print(tokenizer.decode(output[0], skip_special_tokens=False))
```

---

## Limitations

- This model is an **experimental research checkpoint**, not a production system.
- Korean is the dominant language; English and other languages are secondary.
- The custom architecture (`evafrill-mo`) requires either (a) the original project code for full inference, or (b) a compatible HuggingFace integration that understands Mamba-2 hybrid layers. The exported `model.safetensors` preserves the native weight layout.
- Benchmark numbers were evaluated on small (100-sample) subsets and should be treated as rough estimates.

---

## Citation

```bibtex
@misc{evafrill-mo-3b-2026,
  title  = {EVAFRILL-Mo-3B: A Hybrid Mamba-2 + Transformer LLM for Korean},
  author = {pathcosmos},
  year   = {2026},
  url    = {https://huggingface.co/pathcosmos/EVAFRILL-Mo-3B},
}
```

---

## License

[MIT](LICENSE)
config.json ADDED
@@ -0,0 +1,28 @@
{
  "vocab_size": 64000,
  "d_model": 3072,
  "n_layers": 26,
  "n_heads": 24,
  "n_kv_heads": 8,
  "d_ffn": 9216,
  "max_seq_len": 4096,
  "rope_theta": 500000.0,
  "dropout": 0.0,
  "bias": false,
  "use_flash_attn": true,
  "use_fp8": false,
  "use_hybrid": true,
  "hybrid_pattern": "M M M M M M M M M M M M A M M M M M M M M M M M A M",
  "mamba_d_state": 128,
  "mamba_head_dim": 64,
  "mamba_expand": 2,
  "mamba_conv_kernel": 4,
  "mamba_n_groups": 8,
  "mamba_d_ffn": 4608,
  "mamba_chunk_size": 256,
  "model_type": "evafrill-mo",
  "architectures": [
    "EvafrillMoForCausalLM"
  ],
  "torch_dtype": "bfloat16"
}
generation_config.json ADDED
@@ -0,0 +1,10 @@
{
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 0,
  "max_new_tokens": 512,
  "temperature": 0.8,
  "top_p": 0.9,
  "repetition_penalty": 1.1,
  "do_sample": true
}
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b7fedbd0d0f8e33a1fb5e6c4e8e9393f729cc77b364d431e522857ce6a1c8d56
size 6301164272
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,10 @@
{
  "model_type": "evafrill-mo",
  "tokenizer_class": "PreTrainedTokenizerFast",
  "bos_token": "<s>",
  "eos_token": "</s>",
  "unk_token": "<unk>",
  "pad_token": "<pad>",
  "clean_up_tokenization_spaces": false,
  "chat_template": "<|user|>\n{{ message }}\n<|assistant|>\n"
}
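The chat template above is a single-turn Jinja string with one `message` variable, rather than the usual loop over a `messages` list, so `tokenizer.apply_chat_template` with role/content dictionaries may not apply directly. A plain-substitution sketch of what it renders:

```python
# Render the single-turn chat template by direct substitution
# (stand-in for Jinja; the template has exactly one {{ message }} slot).
chat_template = "<|user|>\n{{ message }}\n<|assistant|>\n"

def render(message: str) -> str:
    return chat_template.replace("{{ message }}", message)

# "Hello! Please introduce yourself." (matches the README usage example)
prompt = render("안녕하세요! 자기소개를 해 주세요.")
print(prompt)
```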