deepanshupillm committed on
Commit b487e83 · verified · 1 Parent(s): 8ec4403

Update README.md

Files changed (1):
  1. README.md +112 -32

README.md CHANGED
@@ -44,24 +44,26 @@ Alpie-Core is one of the world's first fine-tuned 4-bit reasoning models, provin
 
 ## 5. Benchmark Results
 
- | Benchmark | Alpie-Core (32B-4bit) | DeepSeek-V2 (236B) | Qwen2.5 72B | Llama 3.1 405B | Llama 3.1 70B | Gemma-3 27B-PT | Category |
- |-----------|----------------------|-------------------|-------------|---------------|---------------|----------------|----------|
- | MMLU (5-shot) | **81.28%** | 78.4% | 85.0% | 84.4% | 79.3% | 78.6% | General Knowledge |
- | GSM8K (8-shot) | **92.75%** | 81.6% | 88.3% | 83.5% | - | 82.2% | Mathematical Reasoning |
- | BBH (3-shot) | **85.12%** | 78.8% | 79.8% | 82.9% | 81.6% | 77.7% | Complex Reasoning |
- | MMLU-Pro (5-shot) | **64.78%** | 51.4% | 58.3% | 52.8% | 53.8% | 52.2% | Advanced Reasoning |
- | MBPP (pass@1) | **75.20%** | 65.0% | 72.6% | 68.4% | - | 65.6% | Code Generation |
- | HumanEval (pass@1) | **57.23%** | 43.3% | 53.0% | 54.9% | - | 48.8% | Code Generation |
- | SWE-Bench Verified | **57.8%** | - | - | - | - | - | Software Engineering |
- | AIME | **47.34%** | - | - | - | - | - | Advanced Mathematics |
- | GPQA (Diamond) | **40.91%** | - | - | - | - | - | Graduate-level QA |
- | TruthfulQA (MC2) | **60.05%** | - | - | - | - | - | Truthfulness |
- | HellaSwag | **84.66%** | - | - | - | - | - | Commonsense |
- | PIQA | **83.24%** | - | - | - | - | - | Physical Reasoning |
- | ARC Challenge | **67.58%** | - | - | - | - | - | Science QA |
- | CommonSenseQA | **87.06%** | - | - | - | - | - | Commonsense |
- | AGIEval | **64.98%** | - | - | - | - | - | General Intelligence |
- | Winogrande | **79.53%** | - | - | - | - | - | Commonsense Reasoning |
 
 ### Humanity's Last Exam Leaderboard Performance
 
@@ -76,6 +78,20 @@ Alpie-Core is one of the world's first fine-tuned 4-bit reasoning models, provin
 | 7 | DeepSeek V3 | 4.55 | Below Alpie |
 | 8 | Gemini 1.5 Pro 002 | 4.55 | Below Alpie |
 
 ## 6. Training Details
 
 - **Hardware**: 8× NVIDIA A100-80GB GPUs
@@ -101,7 +117,7 @@ Alpie-Core is one of the world's first fine-tuned 4-bit reasoning models, provin
 - Experimental design optimization
 
 ### Advanced Coding and Software Engineering
- - 57.8% SWE-Bench Verified score (12% above nearest competitor)
 - Automated bug detection and GitHub issue resolution
 - Competitive programming and algorithm design
 - Enterprise software development and architecture design
@@ -130,22 +146,86 @@ Unlike the base DeepSeek model, Alpie-Core provides factual, balanced responses
 
 ## 10. How to Use
 
- ### Installation
 ```python
- from transformers import AutoTokenizer, AutoModelForCausalLM
-
- model_id = "alpie/Alpie-Core-4bit"
- tokenizer = AutoTokenizer.from_pretrained(model_id)
- model = AutoModelForCausalLM.from_pretrained(
-     model_id,
-     device_map="auto",
-     torch_dtype="auto"
 )
 
- messages = [{"role": "user", "content": "Solve 2x^2 + 3x + 5 = 0"}]
- inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
- outputs = model.generate(**inputs, max_new_tokens=512)
- print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```
 
 ### Deployment Options
 
 
 ## 5. Benchmark Results
 
+ | Benchmark | Alpie-Core (32B-4bit) | DeepSeek-V2 (236B) | Qwen2.5 72B | Llama 3.1 405B | Llama 3.1 70B | Gemma-3 27B-PT | Mistral-Small-24B-Base-2501 |
+ |-----------|----------------------|-------------------|-------------|---------------|---------------|----------------|----------------------------|
+ | MMLU (5-shot) | **81.28%** | 78.4% | 85.0% | 84.4% | 79.3% | 78.6% | 80.73% |
+ | GSM8K (8-shot) | **92.75%** | 81.6% | 88.3% | 83.5% | - | 82.2% | 80.73% |
+ | BBH (3-shot) | **85.12%** | 78.8% | 79.8% | 82.9% | 81.6% | 77.7% | - |
+ | MMLU-Pro (5-shot) | **64.78%** | 51.4% | 58.3% | 52.8% | 53.8% | 52.2% | 54.37% |
+ | MBPP (pass@1) | **75.20%** | 65.0% | 72.6% | 68.4% | - | 65.6% | 69.64% |
+ | HumanEval (pass@1) | **57.23%** | 43.3% | 53.0% | 54.9% | - | 48.8% | - |
+
+ ### SWE-Bench Verified Performance
+
+ | Rank | Model | Accuracy (%) | Performance vs Alpie |
+ |------|-------|-------------|---------------------|
+ | **1** | **Alpie Core** | **57.8** | **Alpie** |
+ | 2 | Qwen3-Coder-30B-A3B-Instruct | 51.6 | Below Alpie |
+ | 3 | o3-mini (high) | 49.3 | Below Alpie |
+ | 4 | DeepSeek R1 | 49.2 | Below Alpie |
+ | 5 | Claude 3.5 Sonnet | 49.0 | Below Alpie |
+ | 6 | o1 | 48.9 | Below Alpie |
+ | 7 | Devstral | 46.8 | Below Alpie |
 
 ### Humanity's Last Exam Leaderboard Performance
 
 
 | 7 | DeepSeek V3 | 4.55 | Below Alpie |
 | 8 | Gemini 1.5 Pro 002 | 4.55 | Below Alpie |
 
+ ### Additional Benchmarks
+
+ | Benchmark | Alpie-Core (32B-4bit) | Category |
+ |-----------|----------------------|----------|
+ | AIME | **47.34%** | Advanced Mathematics |
+ | GPQA (Diamond) | **40.91%** | Graduate-level QA |
+ | TruthfulQA (MC2) | **60.05%** | Truthfulness |
+ | HellaSwag | **84.66%** | Commonsense |
+ | PIQA | **83.24%** | Physical Reasoning |
+ | ARC Challenge | **67.58%** | Science QA |
+ | CommonSenseQA | **87.06%** | Commonsense |
+ | AGIEval | **64.98%** | General Intelligence |
+ | Winogrande | **79.53%** | Commonsense Reasoning |
+
 ## 6. Training Details
 
 - **Hardware**: 8× NVIDIA A100-80GB GPUs
 
 - Experimental design optimization
 
 ### Advanced Coding and Software Engineering
+ - 57.8% SWE-Bench Verified score (6.2 points above the nearest competitor)
 - Automated bug detection and GitHub issue resolution
 - Competitive programming and algorithm design
 - Enterprise software development and architecture design
 
 ## 10. How to Use
 
+ ### Non-Streaming Inference
 ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from peft import PeftModel, PeftConfig
+ import torch
+
+ # Load LoRA adapter configuration to find the base model
+ peft_model_id = "169Pi/Alpie-core"
+ config = PeftConfig.from_pretrained(peft_model_id)
+
+ # Load the base model
+ base_model = AutoModelForCausalLM.from_pretrained(
+     config.base_model_name_or_path,
+     torch_dtype=torch.float16,
+     device_map="auto"
 )
 
+ # Load tokenizer
+ tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
+
+ # Load LoRA weights
+ model = PeftModel.from_pretrained(base_model, peft_model_id)
+
+ # Ensure evaluation mode
+ model.eval()
+
+ # Sample inference
+ prompt = "Solve the Riemann Hypothesis and provide a final answer?"
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+
+ with torch.no_grad():
+     outputs = model.generate(**inputs, max_new_tokens=1000)
+     response = tokenizer.decode(outputs[0], skip_special_tokens=True)
+
+ print("Response:\n", response)
+ ```
+
+ ### Streaming Inference
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
+ from peft import PeftModel, PeftConfig
+ import torch
+
+ # Load LoRA adapter configuration to find the base model
+ peft_model_id = "169Pi/Alpie-core"
+ config = PeftConfig.from_pretrained(peft_model_id)
+
+ # Load the base model
+ base_model = AutoModelForCausalLM.from_pretrained(
+     config.base_model_name_or_path,
+     torch_dtype=torch.float16,
+     device_map="auto"
+ )
+
+ # Load tokenizer
+ tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
+
+ # Load LoRA weights
+ model = PeftModel.from_pretrained(base_model, peft_model_id)
+
+ # Ensure evaluation mode
+ model.eval()
+
+ # Initialize streamer
+ streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
+
+ # Sample streaming inference
+ prompt = "Solve the Riemann Hypothesis and provide a final answer?"
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+
+ print("Streaming Response:")
+ with torch.no_grad():
+     outputs = model.generate(
+         **inputs,
+         max_new_tokens=1000,
+         streamer=streamer,
+         do_sample=True,
+         temperature=0.7,
+         top_p=0.9
+     )
 ```
 
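A note on capacity planning for the snippets above: they load the base model in `torch.float16`, i.e. 2 bytes per parameter, so weights alone dominate the footprint. A rough weight-only estimate is parameters × bits ÷ 8; the sketch below is back-of-the-envelope arithmetic only (the 32B figure is taken from the model's own naming, and it ignores activations, KV cache, and per-block quantization constants):

```python
def weight_footprint_gib(n_params: float, bits_per_param: int) -> float:
    """Approximate weight-only memory: parameters * bits / 8 bytes, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3

N_PARAMS = 32e9  # ~32B parameters, per the model name

fp16 = weight_footprint_gib(N_PARAMS, 16)  # the float16 load used above
int4 = weight_footprint_gib(N_PARAMS, 4)   # a 4-bit quantized load

print(f"fp16 weights: ~{fp16:.0f} GiB")   # ~60 GiB
print(f"4-bit weights: ~{int4:.0f} GiB")  # ~15 GiB
```

The fp16 load fits on a single 80 GB A100; the 4× smaller 4-bit footprint is what makes smaller-GPU deployment plausible.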
231
  ### Deployment Options
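As background on what the "4-bit" in Alpie-Core's name means for deployment: each weight is stored as one of 16 levels plus a shared per-block scale. The toy below is a generic symmetric absmax scheme, not the exact NF4 codebook used by 4-bit toolchains such as bitsandbytes, but it shows the mechanics: a 4× smaller footprint with reconstruction error bounded by half a quantization step.

```python
def quantize_4bit(block):
    """Symmetric absmax 4-bit quantization: map floats to integers in -7..7."""
    scale = max(abs(x) for x in block) / 7  # largest weight maps to +/-7
    q = [max(-7, min(7, round(x / scale))) for x in block]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from 4-bit codes and the block scale."""
    return [v * scale for v in q]

weights = [0.12, -0.40, 0.07, 0.33, -0.21, 0.02, -0.05, 0.28]
q, s = quantize_4bit(weights)
recovered = dequantize(q, s)

# Every code fits in 4 bits (one of 15 signed levels).
assert all(-7 <= v <= 7 for v in q)
max_err = max(abs(a - b) for a, b in zip(weights, recovered))
print(f"max reconstruction error: {max_err:.3f}")  # bounded by scale/2
```

Production 4-bit schemes refine this with non-uniform level spacing and small block sizes, which is why quality loss stays modest at a quarter of the memory.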