GSM8K Dense LLM v1.1 (700M Parameters)
A custom Dense LLM architecture pretrained for mathematical reasoning tasks, particularly GSM8K-style problems.
Model Highlights
| Attribute | Value |
|---|---|
| Parameters | ~796M |
| Training Data | ~5 billion tokens |
| Training Duration | 2 epochs |
| Focus | Mathematical word problems and reasoning |
| Proficiency Level | 9-10 (Advanced/Mastery) |
What's New in v1.1
After further training, v1.1 substantially outperforms v1.0, demonstrating stronger mathematical reasoning and more reliable problem solving across a wider range of GSM8K-style questions.
Benchmark Results
GSM8K Proficiency Evaluation
The model was evaluated on a 12-question graded assessment spanning difficulty levels 1-10 (including two additional mastery-level variants), testing progressive mathematical reasoning skills.
| Level | Category | Problem Type | Result |
|---|---|---|---|
| 1 | Basic | Simple subtraction | ✅ |
| 2 | Basic | Single multiplication | ✅ |
| 3 | Elementary | Division word problems | ✅ |
| 4 | Elementary | Multi-step addition/subtraction | ✅ |
| 5 | Intermediate | Sequential fractions | ✅ |
| 6 | Intermediate | Multi-item purchases with discounts | ✅ |
| 7 | Advanced | Overtime wage calculations | ✅ |
| 8 | Advanced | Combined rate problems (additive) | ✅ |
| 9 | Expert | Sequential percentage markup/discount | ✅ |
| 10 | Mastery | Combined work-rate (subtraction) | ❌ |
| 10A | Mastery | Mixture problems / Systems of equations | ✅ |
| 10C | Mastery | Round-trip rate/time/distance | ✅ |
Overall Score: 11/12 (92%)
Proficiency Level: 9-10 (Advanced to Mastery)
Example Solutions
The following examples demonstrate the model's step-by-step reasoning capabilities on mastery-level problems.
Example 1: Mixture Problem (Level 10A - Mastery) ✅
Problem:
A coffee shop mixes two types of beans. Type A costs $12 per pound and Type B costs $18 per pound. The shop creates a 50-pound mixture that sells for $14.40 per pound. How many pounds of each type of bean are in the mixture?
Model Output:
Step 1: Let x be the pounds of Type A beans and y be the pounds of Type B beans.
Step 2: Total weight equation: x + y = 50.
Step 3: Cost equation: 12x + 18y = 14.40 × 50 = 720.
Step 4: Simplify cost equation: divide both sides by 6 → 2x + 3y = 120.
Step 5: Solve the system of equations:
x + y = 50
2x + 3y = 120
Multiply the first equation by 3 (3x + 3y = 150) and subtract the second: x = 30.
Then y = 50 - 30 = 20.
Step 6: Verify total cost: 12 × 30 + 18 × 20 = 360 + 360 = 720, which matches the given total.
Therefore, the mixture contains 30 pounds of Type A beans and 20 pounds of Type B beans.
#### 30
✅ Correct Answer: 30 pounds Type A, 20 pounds Type B
Skills Demonstrated:
- Variable setup with clear definitions
- System of equations construction
- Algebraic simplification (dividing by 6 for cleaner numbers)
- Correct solution method
- Self-verification step
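As a quick sanity check on the worked solution above, here is a minimal sketch (using only the numbers from the problem statement, not any code from the repository) that solves the same system by substitution:

```python
# Sanity check for the mixture solution above.
total_lb = 50                  # pounds in the blend
price_a, price_b = 12, 18      # $/lb for Type A and Type B
blend_cost = 14.40 * total_lb  # total value of the mixture: $720

# Substitute y = total_lb - x into 12x + 18y = 720 and solve for x.
x = (blend_cost - price_b * total_lb) / (price_a - price_b)
y = total_lb - x
print(x, y)  # 30.0 20.0
```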
Example 2: Round-Trip Rate Problem (Level 10C - Mastery) ✅
Problem:
Emma drives from City A to City B at 60 mph. She immediately returns from City B to City A at 40 mph. If the entire round trip took 5 hours, what is the distance between City A and City B?
Model Output:
1. Let d be the distance between cities in miles.
2. Time to go from A to B: d/60 hours.
3. Time to return from B to A: d/40 hours.
4. Total time: d/60 + d/40 = 5.
5. Find a common denominator (120): 2d/120 + 3d/120 = 5.
6. Combine like terms: 5d/120 = d/24.
7. So, d/24 = 5.
8. Multiply both sides by 24: d = 5 × 24 = 120.
Thus, the distance between cities is 120 miles.
#### 120
✅ Correct Answer: 120 miles
Skills Demonstrated:
- Correct rate/time/distance setup (time = distance/speed)
- Proper fraction arithmetic with common denominators
- Avoided the "average speed trap" (did NOT incorrectly average 60 and 40 to get 50 mph)
- Clean algebraic manipulation
- Correct final answer
⚠️ Note on the "Average Speed Trap": A common failure mode where models incorrectly compute: "Average speed = (60+40)/2 = 50 mph, so distance = 50 × 2.5 = 125 miles". This model correctly recognized that speeds cannot be directly averaged for equal distances, demonstrating true mathematical reasoning rather than pattern matching.
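The contrast can be reproduced in a few lines; a minimal sketch using the numbers from Example 2:

```python
# Correct setup: d/60 + d/40 = 5  =>  d = 5 / (1/60 + 1/40).
t_total, v_out, v_back = 5, 60, 40
d = t_total / (1 / v_out + 1 / v_back)

# The "average speed trap": averaging the speeds and splitting the time.
d_trap = (v_out + v_back) / 2 * (t_total / 2)

print(round(d), round(d_trap))  # 120 125 -- only the first is correct
```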
Demonstrated Capabilities
| Skill | Status |
|---|---|
| Multi-step arithmetic | ✅ Strong |
| Fraction operations | ✅ Strong |
| Percentage calculations | ✅ Strong |
| Systems of linear equations | ✅ Strong |
| Rate/time/distance problems | ✅ Strong |
| Algebraic word problem setup | ✅ Strong |
| Solution verification | ✅ Strong |
| Avoiding common traps (e.g., speed averaging) | ✅ Strong |
| Combined work-rate with subtraction | ⚠️ Gap |
Key Findings
- True reasoning ability: Model correctly avoids common mathematical traps (e.g., incorrectly averaging speeds for round-trip problems)
- Clean solution formatting: Produces well-structured step-by-step solutions with verification steps
- Identified training gap: Narrow gap on the "combined work-rate with subtraction" problem archetype (e.g., "Three workers together complete a task in X days; Worker A alone takes Y days..."); this appears to be a data coverage issue rather than a reasoning limitation
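For reference, the missed archetype reduces to subtracting reciprocal rates. A hypothetical instance (the numbers below are invented here for a clean answer, not taken from the quiz):

```python
from fractions import Fraction

# Hypothetical instance of the missed archetype: workers A, B, C together
# finish a job in 4 days; A alone takes 12 days. How long do B and C need?
t_all, t_a = Fraction(4), Fraction(12)

rate_bc = 1 / t_all - 1 / t_a  # combined rate minus A's rate: 1/4 - 1/12 = 1/6
t_bc = 1 / rate_bc             # invert the rate to get a completion time
print(t_bc)  # 6 (days)
```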
Architecture
- Custom DenseLLM with Grouped-Query Attention (GQA)
- RoPE positional embeddings
- SwiGLU activation
- RMSNorm normalization
- Weight tying (embedding & output)
Technical Specifications
| Parameter | Value |
|---|---|
| d_model | 1280 |
| n_layers | 32 |
| n_heads | 20 |
| n_kv_heads | 4 |
| ff_mult | 4.0 |
| max_seq_len | 2048 |
| vocab_size | 32,064 |
| Total params | ~796,099,840 |
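As a sanity check, the reported total is reproducible from the table above, assuming three SwiGLU projections with hidden size ff_mult × d_model, two RMSNorms per block plus a final norm, and tied embeddings (structural assumptions inferred from the Architecture section, not confirmed against the repository code):

```python
d_model, n_layers, n_heads, n_kv_heads = 1280, 32, 20, 4
ff_mult, vocab_size = 4.0, 32_064

head_dim = d_model // n_heads      # 64
kv_dim = n_kv_heads * head_dim     # 256 -- GQA: 4 KV heads serve 20 query heads
ff_hidden = int(ff_mult * d_model) # 5120

attn = 2 * d_model * d_model + 2 * d_model * kv_dim  # Q, O full-width; K, V reduced
ffn = 3 * d_model * ff_hidden                        # gate, up, down (SwiGLU)
norms = 2 * d_model                                  # two RMSNorms per block
per_layer = attn + ffn + norms

embed = vocab_size * d_model                     # tied input/output embedding
total = n_layers * per_layer + embed + d_model   # final RMSNorm adds d_model
print(f"{total:,}")  # 796,099,840 -- matches the table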
Training
This model was pretrained (not fine-tuned) on mathematical reasoning data including GSM8K-style problems. It performs text completion rather than instruction-following.
Usage
This model uses a custom architecture. See the repository for loading code.
Loading the Model
```python
import torch
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("AlgoDriveAI/GSM8K-700M-v1.1")

# Load model (requires the modeling_dense_llm.py file)
from modeling_dense_llm import load_model

model = load_model("pytorch_model.bin", "config.json", device="cuda")

# Or manually:
import json

from modeling_dense_llm import DenseLLM

with open("config.json") as f:
    config = json.load(f)

model = DenseLLM(
    vocab_size=config["vocab_size"],
    d_model=config["d_model"],
    n_layers=config["n_layers"],
    n_heads=config["n_heads"],
    n_kv_heads=config["n_kv_heads"],
    ff_hidden_mult=config["ff_hidden_mult"],
    qk_norm=config["qk_norm"],
    parallel_residual=config["parallel_residual"],
    max_seq_len=config["max_seq_len"],
)
state_dict = torch.load("pytorch_model.bin", map_location="cuda")
model.load_state_dict(state_dict)
model = model.cuda().bfloat16().eval()
```
Generation Example
```python
# Math problem completion
prompt = """Question: Sarah has 5 apples. She gives 2 to her friend and then buys 3 more. How many apples does Sarah have now?
Let's solve this step by step:"""

input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"].cuda()
output = model.generate(
    input_ids,
    max_new_tokens=256,
    temperature=0.7,
    top_k=50,
    top_p=0.9,
    repetition_penalty=1.1,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0]))
```
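Since completions end with a GSM8K-style `#### <answer>` line (as in the example solutions above), the final answer can be pulled out with a small regex. `extract_answer` is a hypothetical helper sketched here, not part of the repository:

```python
import re

def extract_answer(text: str):
    """Return the number after the final '#### ' marker, or None."""
    match = re.search(r"####\s*(-?[\d.,]+)", text)
    return match.group(1).replace(",", "") if match else None

completion = "Sarah has 5 - 2 + 3 = 6 apples.\n#### 6"
print(extract_answer(completion))  # 6
```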
Intended Use
- Mathematical reasoning and problem solving
- Text completion for math-related content
- Research on efficient LLM architectures
- Benchmarking small-scale math reasoning models
Limitations
- This is a pretrained model, not instruction-tuned
- Best used for text completion, not chat/QA
- May require fine-tuning for specific downstream tasks
- Known gap: Combined work-rate problems involving subtraction of reciprocal rates (fixable with targeted training data)
Requirements
```
torch>=2.0
transformers
einops
```
Citation
If you use this model, please cite:
```bibtex
@misc{gsm8k-densellm-700m-v1.1,
  author    = {AlgoDriveAI and Christopher Smith},
  title     = {GSM8K Dense LLM 700M v1.1},
  email     = {csm4211@wgu.edu},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/AlgoDriveAI/GSM8K-700M-v1.1}
}
```