anthonym21 committed on
Commit 02130a3 · verified · Parent: 2cb3102

Upload folder using huggingface_hub

Files changed (5)
  1. README.md +108 -44
  2. configuration_eve.py +50 -0
  3. modeling_eve.py +1 -43
  4. push_to_hub.py +17 -0
  5. tokenizer_config.json +1 -1
README.md CHANGED
@@ -1,83 +1,147 @@
1
  ---
2
  license: mit
3
- language:
4
- - en
5
- pipeline_tag: text-generation
6
  tags:
7
  - pytorch
8
- - safetensors
9
  - text-generation
10
- - moe
11
- - peft
12
  - lora
13
- - instruct
14
- - custom-architecture
15
- - trust_remote_code
16
  base_model: anthonym21/Eve-2-MoE-272M
17
- datasets: []
18
  ---
19
 
20
- # Model Card for Eve-2-MoE-IT-272M
21
 
22
- <!-- Provide a quick summary of what the model is/does. -->
23
- Eve-2-MoE-IT-272M is an instruction-tuned (IT) variant of Eve-2-MoE-272M, packaged as a merged checkpoint for direct inference.
24
 
25
- ## Model Details
26
 
27
- ### Model Description
28
 
29
- <!-- Provide a longer summary of what this model is. -->
30
- This repository contains a custom MoE causal language model implemented with `transformers` remote code (see `modeling_eve.py`) and weights in `model.safetensors`.
31
 
32
- - **Developed by:** Anthony Maio / Making Minds AI (Independent)
33
- - **Model type:** Causal language model, Mixture-of-Experts (MoE)
34
- - **Language(s) (NLP):** English
35
- - **License:** MIT
36
- - **Finetuned from model [optional]:** `anthonym21/Eve-2-MoE-272M`
37
 
38
- ### Model Sources [optional]
39
 
40
- - **Repository:** https://huggingface.co/anthonym21/Eve-2-MoE-IT-272M
41
- - **Base model:** https://huggingface.co/anthonym21/Eve-2-MoE-272M
42
 
43
- ## Uses
44
 
45
- ### Direct Use
46
 
47
- - Instruction-following text generation (experimental).
48
- - Research on small MoE models and adapter-based specialization.
49
 
50
- ### Downstream Use [optional]
51
 
52
- - Train fresh LoRA adapters on top of this IT checkpoint for narrow tasks (e.g., coding, agent-style tool use, classification-style prompting).
53
 
54
- ### Out-of-Scope Use
55
 
56
- - Safety-critical or high-stakes domains (medical, legal, financial advice).
57
- - Any use requiring guarantees about factuality, bias, or safety alignment.
58
 
59
- ## Bias, Risks, and Limitations
60
 
61
- This is a small model and may hallucinate, produce incorrect information, or follow instructions unreliably. It is not presented as safety-aligned, and users should implement their own safety and validation layers.
62
 
63
- ### Recommendations
64
 
65
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model and evaluate it for their specific use case.
66
 
67
- ## How to Get Started with the Model
68
 
69
- Use the code below to get started with the model.
70
 
71
  ```python
72
  from transformers import AutoTokenizer, AutoModelForCausalLM
73
 
74
  model_id = "anthonym21/Eve-2-MoE-IT-272M"
75
-
76
  tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
77
  model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, device_map="auto")
78
 
79
- prompt = "Write a short function that reverses a string in Python."
80
  inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
81
 
82
- out = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
83
- print(tokenizer.decode(out, skip_special_tokens=True))
 
1
  ---
2
  license: mit
3
  tags:
4
+ - moe
5
+ - deepseek
6
+ - instruction-tuned
7
+ - nvidia-h200
8
  - pytorch
 
9
  - text-generation
10
+ - nano-lm
11
+ - edge-ai
12
  - lora
13
+ - sft
14
+ datasets:
15
+ - mlabonne/open-perfectblend
16
  base_model: anthonym21/Eve-2-MoE-272M
17
+ language:
18
+ - en
19
+ pipeline_tag: text-generation
20
+ library_name: transformers
21
  ---
22
 
23
+ # Eve-2-MoE-IT-272M
24
+
25
+ **Instruction-tuned** version of [Eve-2-MoE-272M](https://huggingface.co/anthonym21/Eve-2-MoE-272M), fine-tuned via **heavy LoRA** and **merged** into a standalone model.
26
 
27
+ This is the foundation for **Eve specialist adapters**: narrow, measurable transforms that run on CPU or low-VRAM hardware.
 
28
 
29
+ **Author:** [Anthony Maio](https://making-minds.ai/) / Making Minds AI Research
30
 
31
+ ## Specialist Use Cases
32
 
33
+ The community adopts small models when (a) the task is **narrow**, (b) quality is **measurable**, and (c) deployment is **easy** (CPU/low VRAM). The best targets are "deterministic-ish transforms" with clear pass/fail.
 
34
 
35
+ ### Top 5 Eve Adapters (train in ~20 min each on RTX 4080)
36
 
37
+ | Adapter | Task | Metrics |
38
+ |---------|------|---------|
39
+ | **Eve-JSON** | Strict structured output (function calling lite) | Parse rate, schema-valid rate, field accuracy |
40
+ | **Eve-Extract** | Text → structured extraction (receipts, tickets, logs → JSON) | Exact match per field, F1 entities, parse+schema rate |
41
+ | **Eve-Repair** | Fix invalid JSON, CSV quoting, normalize formats | Parse success, diff-to-gold, validator pass rate |
42
+ | **Eve-Format** | Constraint obeyer (one paragraph, max N chars, bullet lists) | Constraint compliance %, length adherence |
43
+ | **Eve-Router** | Intent classifier (which specialist to call + confidence) | Accuracy, calibration (ECE), abstain correctness |
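Metrics like these can be scored mechanically. As an illustration only (not shipped in this repo), a minimal parse-rate scorer in the spirit of the Eve-JSON column might look like:

```python
import json

def parse_rate(outputs):
    """Fraction of model outputs that are valid JSON (hypothetical Eve-JSON metric)."""
    ok = 0
    for text in outputs:
        try:
            json.loads(text)
            ok += 1  # parsed cleanly
        except json.JSONDecodeError:
            pass
    return ok / len(outputs)

# Two of the three candidate outputs parse.
print(parse_rate(['{"name": "John Doe"}', "not json", "[1, 2]"]))
```

Schema-valid rate and field accuracy would layer on top of this, e.g. with a `jsonschema` validator.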
44
 
45
+ **Why these?** Crisp evals, immediate usefulness, CPU deployment. Train `r=16` LoRAs on top of this IT base.
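For scale, an adapter's size is easy to estimate: a rank-r LoRA adds r·(d_in + d_out) parameters per adapted weight matrix. A back-of-envelope sketch (treating an adapted projection as a square 512×512 matrix for illustration; actual module shapes vary):

```python
def lora_param_count(d_in, d_out, r):
    # LoRA factorizes the weight update as B @ A:
    # (d_out x r) + (r x d_in) trainable parameters.
    return r * (d_in + d_out)

d = 512  # hidden size of Eve-2-MoE
for r in (16, 128):
    per_module = lora_param_count(d, d, r)
    print(f"r={r}: {per_module:,} params per 512x512 module")
```

At `r=16` each square module costs ~16K parameters, versus ~131K at the `r=128` used for this SFT merge, which is why specialist adapters stay CPU-friendly.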
 
46
 
47
+ ## Training Details
48
 
49
+ ### Training Data
50
 
51
+ [mlabonne/open-perfectblend](https://huggingface.co/datasets/mlabonne/open-perfectblend): ~1.2M instruction examples (math, code, chat, reasoning).
 
52
 
53
+ ### Training Procedure
54
 
55
+ **Supervised fine-tuning (SFT)** via heavy LoRA, then merged.
56
 
57
+ | Parameter | Value |
58
+ |-----------|-------|
59
+ | **Base Model** | [Eve-2-MoE-272M](https://huggingface.co/anthonym21/Eve-2-MoE-272M) |
60
+ | **LoRA Rank** | 128 |
61
+ | **LoRA Alpha** | 256 |
62
+ | **LoRA Dropout** | 0.05 |
63
+ | **Target Modules** | `c_attn`, `c_proj`, `w1`, `w2`, `router` |
64
+ | **NOT Targeted** | `lm_head` |
65
 
66
+ ### Training Hyperparameters
 
67
 
68
+ | Parameter | Value |
69
+ |-----------|-------|
70
71
+ | **Hardware** | NVIDIA H200 SXM (141 GB VRAM) |
72
+ | **Precision** | BFloat16 |
73
+ | **Epochs** | 1 |
74
+ | **Batch Size** | 128 (per device, no grad accum) |
75
+ | **Learning Rate** | 5e-5 |
76
+ | **LR Schedule** | Cosine (3% warmup) |
77
+ | **Weight Decay** | 0.01 |
78
+ | **Optimizer** | AdamW |
79
+ | **Total Steps** | ~37,000 |
80
 
81
+ ### Speeds & Sizes
82
 
83
+ | Metric | Value |
84
+ |--------|-------|
85
+ | **Training Time** | ~1.7 hours (1× H200) |
86
+ | **Throughput** | ~6 it/s (batch 128) |
87
+ | **Model Size** | 272M params (1.09 GB bf16) |
88
+ | **Cost** | ~$5 RunPod |
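The reported throughput and wall-clock time are mutually consistent, as a quick sanity check shows:

```python
steps = 37_000   # total optimizer steps reported above
it_per_s = 6     # iterations per second at batch 128
hours = steps / it_per_s / 3600
print(f"~{hours:.1f} h")  # matches the ~1.7 h training time
```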
89
 
90
+ ## Lessons Learned
91
 
92
+ - **Don't target `lm_head`**: Causes blank outputs despite low loss.
93
+ - **r=128 @ 2e-4 LR**: Loss → 0.01 but learns nothing. Use 5e-5.
94
+ - **Custom model must expose embeddings hooks**: PEFT checkpointing requires `get_input_embeddings()` / `get_output_embeddings()`.
95
+ - **RunPod: Don't `pip install torch`**: Silently breaks CUDA.
96
+ - **Broken tokenizer = fake zero loss**: Verify vocab size matches `config.vocab_size`.
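The last point is worth automating. A hedged sketch of the check (50,257 is the standard GPT-2 tokenizer size; the assumption here is that the model pads its embedding table to 50,304 for efficiency):

```python
def check_vocab(tokenizer_vocab, model_vocab):
    # A tokenizer larger than the model's embedding table silently
    # corrupts training targets, producing "fake" loss curves.
    if tokenizer_vocab > model_vocab:
        raise ValueError(
            f"tokenizer has {tokenizer_vocab} tokens but model only {model_vocab}"
        )
    return model_vocab - tokenizer_vocab  # unused padding rows

print(check_vocab(50257, 50304))  # 47 padding slots
```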
97
 
98
+ ## Architecture
99
+
100
+ | Parameter | Value |
101
+ |-----------|-------|
102
+ | **Params** | 272M |
103
+ | **MoE** | 8 routed + 1 shared (top-2) |
104
+ | **Active/Token** | ~80M |
105
+ | **Layers** | 12 |
106
+ | **Hidden** | 512 |
107
+ | **Heads** | 8×64 (RoPE) |
108
+ | **FFN** | SwiGLU 1408 |
109
+ | **Context** | 2048 |
110
+ | **Vocab** | 50,304 (GPT-2) |
111
+
112
+ ## Usage
113
 
114
  ```python
115
  from transformers import AutoTokenizer, AutoModelForCausalLM
116
 
117
  model_id = "anthonym21/Eve-2-MoE-IT-272M"
 
118
  tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
119
  model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, device_map="auto")
120
 
121
+ prompt = "User: Extract name and amount from: 'Paid John Doe $150.23'\nAssistant:"
122
  inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
123
+ out = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
124
+ print(tokenizer.decode(out[0], skip_special_tokens=True))
125
+ ```
126
+
127
+ **Prompt format:** `User: ... \nAssistant:`
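If you script many calls, keeping the template in one helper avoids format drift. A small sketch using the prompt format above (`make_prompt` is a hypothetical helper, not part of this repo):

```python
def make_prompt(user_message: str) -> str:
    # The "User: ... \nAssistant:" template this card specifies.
    return f"User: {user_message}\nAssistant:"

prompt = make_prompt("Extract name and amount from: 'Paid John Doe $150.23'")
print(prompt)
```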
128
+
129
+ ## Limitations
130
+
131
+ At 272M parameters the model makes factual errors, cannot perform complex reasoning, and has limited world knowledge. **Use it as a specialist base only.**
132
+
133
+ ## Citation
134
+
135
+ ```bibtex
136
+ @misc{maio2026eve2moeit,
137
+ author = {Maio, Anthony},
138
+ title = {Eve-2-MoE-IT-272M: Nano-MoE for Measurable Specialist Tasks},
139
+ year = {2026},
140
+ url = {https://huggingface.co/anthonym21/Eve-2-MoE-IT-272M}
141
+ }
142
+ ```
143
+
144
+ ## License
145
 
146
+ MIT
147
configuration_eve.py ADDED
@@ -0,0 +1,50 @@
1
+ # configuration_eve.py
2
+ from __future__ import annotations
3
+
4
+ from typing import Any, Optional
5
+ from transformers import PretrainedConfig
6
+
7
+
8
+ class EveConfig(PretrainedConfig):
9
+ model_type = "eve_moe"
10
+ attribute_map = {
11
+ "num_hidden_layers": "n_layer",
12
+ "num_attention_heads": "n_head",
13
+ "hidden_size": "n_embd",
14
+ "max_position_embeddings": "block_size",
15
+ }
16
+
17
+ def __init__(
18
+ self,
19
+ vocab_size: int = 50304,
20
+ n_layer: int = 12,
21
+ n_embd: int = 512,
22
+ n_head: int = 8,
23
+ head_dim: int = 64,
24
+ block_size: int = 2048,
25
+ num_experts: int = 8,
26
+ top_k: int = 2,
27
+ expert_intermediate_size: int = 1408,
28
+ shared_expert_intermediate_size: int = 1408,
29
+ router_aux_loss_coef: float = 0.01,
30
+ use_checkpointing: bool = False,
31
+ rope_theta: float = 10000.0,
32
+ **kwargs: Any,
33
+ ):
34
+ self.vocab_size = vocab_size
35
+ self.n_layer = n_layer
36
+ self.n_embd = n_embd
37
+ self.n_head = n_head
38
+ self.head_dim = head_dim
39
+ self.block_size = block_size
40
+ self.num_experts = num_experts
41
+ self.top_k = top_k
42
+ self.expert_intermediate_size = expert_intermediate_size
43
+ self.shared_expert_intermediate_size = shared_expert_intermediate_size
44
+ self.router_aux_loss_coef = router_aux_loss_coef
45
+ self.use_checkpointing = use_checkpointing
46
+ self.rope_theta = rope_theta
47
+ super().__init__(**kwargs)
48
+
49
+
50
+ __all__ = ["EveConfig"]
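The `attribute_map` above lets generic `transformers` code read standard names like `num_hidden_layers` while the config stores `n_layer`. A dependency-free mimic of that read-side behaviour, for illustration only (the real logic lives in `PretrainedConfig`):

```python
class MiniConfig:
    # Mimics PretrainedConfig's attribute_map alias resolution (illustration only).
    attribute_map = {
        "num_hidden_layers": "n_layer",
        "hidden_size": "n_embd",
    }

    def __init__(self):
        self.n_layer = 12
        self.n_embd = 512

    def __getattr__(self, name):
        # Only called when normal lookup fails, i.e. for the mapped aliases.
        mapped = type(self).attribute_map.get(name)
        if mapped is None:
            raise AttributeError(name)
        return getattr(self, mapped)

cfg = MiniConfig()
print(cfg.num_hidden_layers, cfg.hidden_size)  # 12 512
```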
modeling_eve.py CHANGED
@@ -25,49 +25,7 @@ from transformers import PreTrainedModel, PretrainedConfig, GenerationMixin
25
  from transformers.modeling_outputs import CausalLMOutputWithPast
26
 
27
 
28
- class EveConfig(PretrainedConfig):
29
- model_type = "eve_moe"
30
- # Mapping for Transformers compatibility
31
- attribute_map = {
32
- "num_hidden_layers": "n_layer",
33
- "num_attention_heads": "n_head",
34
- "hidden_size": "n_embd",
35
- "max_position_embeddings": "block_size",
36
- }
37
-
38
- def __init__(
39
- self,
40
- vocab_size: int = 50304,
41
- n_layer: int = 12,
42
- n_embd: int = 512,
43
- n_head: int = 8,
44
- head_dim: int = 64,
45
- block_size: int = 2048,
46
- num_experts: int = 8,
47
- top_k: int = 2,
48
- expert_intermediate_size: int = 1408,
49
- shared_expert_intermediate_size: int = 1408,
50
- router_aux_loss_coef: float = 0.01,
51
- use_checkpointing: bool = False,
52
- rope_theta: float = 10000.0,
53
- **kwargs: Any,
54
- ):
55
- self.vocab_size = vocab_size
56
- self.n_layer = n_layer
57
- self.n_embd = n_embd
58
- self.n_head = n_head
59
- self.head_dim = head_dim
60
- self.block_size = block_size
61
-
62
- self.num_experts = num_experts
63
- self.top_k = top_k
64
- self.expert_intermediate_size = expert_intermediate_size
65
- self.shared_expert_intermediate_size = shared_expert_intermediate_size
66
- self.router_aux_loss_coef = router_aux_loss_coef
67
-
68
- self.use_checkpointing = use_checkpointing
69
- self.rope_theta = rope_theta
70
- super().__init__(**kwargs)
71
 
72
 
73
  class RMSNorm(nn.Module):
 
25
  from transformers.modeling_outputs import CausalLMOutputWithPast
26
 
27
 
28
+ from .configuration_eve import EveConfig
29
 
30
 
31
  class RMSNorm(nn.Module):
push_to_hub.py ADDED
@@ -0,0 +1,17 @@
1
+ from huggingface_hub import HfApi
2
+
3
+ api = HfApi()
4
+
5
+ repo_id = "anthonym21/Eve-2-MoE-IT-272M"
6
+ folder_path = "."
7
+
8
+ print(f"Uploading {folder_path} to {repo_id}...")
9
+
10
+ api.upload_folder(
11
+ folder_path=folder_path,
12
+ repo_id=repo_id,
13
+ repo_type="model",
14
+ ignore_patterns=[".git", ".cache", "__pycache__", "*.ipynb", "*.lock", ".DS_Store"],
15
+ )
16
+
17
+ print("Upload complete! You can now reload the model in your notebook.")
tokenizer_config.json CHANGED
@@ -5,7 +5,7 @@
5
  "eos_token": "<|endoftext|>",
6
  "errors": "replace",
7
  "is_local": false,
8
- "model_max_length": 1024,
9
  "pad_token": "<|endoftext|>",
10
  "tokenizer_class": "GPT2Tokenizer",
11
  "unk_token": "<|endoftext|>"
 
5
  "eos_token": "<|endoftext|>",
6
  "errors": "replace",
7
  "is_local": false,
8
+ "model_max_length": 2048,
9
  "pad_token": "<|endoftext|>",
10
  "tokenizer_class": "GPT2Tokenizer",
11
  "unk_token": "<|endoftext|>"