GSM8K Dense LLM v1.1 (700M Parameters)

A custom Dense LLM architecture pretrained for mathematical reasoning tasks, particularly GSM8K-style problems.

Model Highlights

Parameters: ~796M
Training Data: ~5 billion tokens
Training Duration: 2 epochs
Focus: Mathematical word problems and reasoning
Proficiency Level: 9-10 (Advanced/Mastery)

What's New in v1.1

After further training, v1.1 shows substantially improved performance over v1.0, with stronger mathematical reasoning and more reliable problem-solving across a wider range of GSM8K-style questions.


Benchmark Results

GSM8K Proficiency Evaluation

The model was evaluated on a 12-question graded assessment spanning difficulty levels 1-10 (including three mastery-level variants at level 10), testing progressively harder mathematical reasoning skills.

Level Category Problem Type Result
1 Basic Simple subtraction βœ…
2 Basic Single multiplication βœ…
3 Elementary Division word problems βœ…
4 Elementary Multi-step addition/subtraction βœ…
5 Intermediate Sequential fractions βœ…
6 Intermediate Multi-item purchases with discounts βœ…
7 Advanced Overtime wage calculations βœ…
8 Advanced Combined rate problems (additive) βœ…
9 Expert Sequential percentage markup/discount βœ…
10 Mastery Combined work-rate (subtraction) ❌
10A Mastery Mixture problems / Systems of equations βœ…
10C Mastery Round-trip rate/time/distance βœ…

Overall Score: 11/12 (92%)
Proficiency Level: 9-10 (Advanced to Mastery)


Example Solutions

The following examples demonstrate the model's step-by-step reasoning capabilities on mastery-level problems.

Example 1: Mixture Problem (Level 10A - Mastery) βœ…

Problem:

A coffee shop mixes two types of beans. Type A costs $12 per pound and Type B costs $18 per pound. The shop creates a 50-pound mixture that sells for $14.40 per pound. How many pounds of each type of bean are in the mixture?

Model Output:

Step 1: Let x be the pounds of Type A beans and y be the pounds of Type B beans.
Step 2: Total weight equation: x + y = 50.
Step 3: Cost equation: 12x + 18y = 14.40 Γ— 50 = 720.
Step 4: Simplify cost equation: divide both sides by 6 β†’ 2x + 3y = 120.
Step 5: Solve the system of equations:
  x + y = 50
  2x + 3y = 120
Subtract the first from the second: x = 30.
Then y = 20.
Step 6: Verify total cost: 12 Γ— 30 + 18 Γ— 20 = 360 + 360 = 720, which matches the given total.
Therefore, the mixture contains 30 pounds of Type A beans and 20 pounds of Type B beans.
#### 30

βœ… Correct Answer: 30 pounds Type A, 20 pounds Type B

Skills Demonstrated:

  • Variable setup with clear definitions
  • System of equations construction
  • Algebraic simplification (dividing by 6 for cleaner numbers)
  • Correct solution method
  • Self-verification step
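The model's answer to the mixture problem can be checked independently. The sketch below (plain Python, no dependencies; the variable names are illustrative, not from the repository) solves the same 2×2 system by substitution:

```python
# Verify the mixture problem: x + y = 50, 12x + 18y = 14.40 * 50
total_weight = 50
price_a, price_b, price_mix = 12.0, 18.0, 14.40

total_cost = price_mix * total_weight                  # 720.0
# Substitute y = total_weight - x into the cost equation:
#   price_a*x + price_b*(total_weight - x) = total_cost
x = (total_cost - price_b * total_weight) / (price_a - price_b)
y = total_weight - x

print(x, y)  # 30.0 20.0 -- matches the model's answer
```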

Example 2: Round-Trip Rate Problem (Level 10C - Mastery) βœ…

Problem:

Emma drives from City A to City B at 60 mph. She immediately returns from City B to City A at 40 mph. If the entire round trip took 5 hours, what is the distance between City A and City B?

Model Output:

1. Let d be the distance between cities in miles.
2. Time to go from A to B: d/60 hours.
3. Time to return from B to A: d/40 hours.
4. Total time: d/60 + d/40 = 5.
5. Find a common denominator (120): 2d/120 + 3d/120 = 5.
6. Combine like terms: 5d/120 = d/24.
7. So, d/24 = 5.
8. Multiply both sides by 24: d = 5 Γ— 24 = 120.
Thus, the distance between cities is 120 miles.
#### 120

βœ… Correct Answer: 120 miles

Skills Demonstrated:

  • Correct rate/time/distance setup (time = distance/speed)
  • Proper fraction arithmetic with common denominators
  • Avoided the "average speed trap" (did NOT incorrectly average 60 and 40 to get 50 mph)
  • Clean algebraic manipulation
  • Correct final answer

⚠️ Note on the "Average Speed Trap": A common failure mode in which models incorrectly compute "average speed = (60+40)/2 = 50 mph, so distance = 50 × 2.5 = 125 miles". This model correctly recognized that speeds cannot be directly averaged over equal distances, demonstrating genuine mathematical reasoning rather than pattern matching.
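For equal-distance legs, the correct average speed is the harmonic mean of the two leg speeds. A quick sketch (plain Python; variable names are illustrative) contrasting the trap with the correct computation:

```python
v1, v2, total_time = 60.0, 40.0, 5.0

# Trap: arithmetic mean of the two speeds
wrong_avg = (v1 + v2) / 2                      # 50 mph
wrong_distance = wrong_avg * total_time / 2    # 125 miles (incorrect)

# Correct: harmonic mean applies when the two legs cover equal distance
true_avg = 2 * v1 * v2 / (v1 + v2)             # 48 mph
distance = true_avg * total_time / 2           # one-way distance

print(distance)  # 120.0 -- matches the model's answer
```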


Demonstrated Capabilities

Skill Status
Multi-step arithmetic βœ… Strong
Fraction operations βœ… Strong
Percentage calculations βœ… Strong
Systems of linear equations βœ… Strong
Rate/time/distance problems βœ… Strong
Algebraic word problem setup βœ… Strong
Solution verification βœ… Strong
Avoiding common traps (e.g., speed averaging) βœ… Strong
Combined work-rate with subtraction ⚠️ Gap

Key Findings

  • True reasoning ability: Model correctly avoids common mathematical traps (e.g., incorrectly averaging speeds for round-trip problems)
  • Clean solution formatting: Produces well-structured step-by-step solutions with verification steps
  • Identified training gap: Narrow gap on the "combined work-rate with subtraction" problem archetype (e.g., "Three workers together complete a task in X days; Worker A alone takes Y days..."); this appears to be a data coverage issue rather than a reasoning limitation
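The missed archetype reduces to subtracting reciprocal rates. As an illustration of the required arithmetic (hypothetical numbers, not the actual test question):

```python
from fractions import Fraction

# Hypothetical instance of the missed archetype:
# A and B together finish a job in 4 days; A alone takes 6 days.
# How long does B take alone?
rate_together = Fraction(1, 4)   # jobs per day
rate_a = Fraction(1, 6)

rate_b = rate_together - rate_a  # 1/4 - 1/6 = 1/12 jobs per day
days_b = 1 / rate_b              # reciprocal of B's rate

print(days_b)  # 12
```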

Architecture

  • Custom DenseLLM with Grouped-Query Attention (GQA)
  • RoPE positional embeddings
  • SwiGLU activation
  • RMSNorm normalization
  • Weight tying (embedding & output)

Technical Specifications

Parameter Value
d_model 1280
n_layers 32
n_heads 20
n_kv_heads 4
ff_mult 4.0
max_seq_len 2048
vocab_size 32,064
Total params ~796,099,840
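The specification table is internally consistent with the quoted parameter count. A back-of-envelope tally, assuming bias-free linear layers, a three-projection SwiGLU FFN at 4× d_model, two RMSNorms per layer plus one final norm, and tied input/output embeddings (all inferred from the tables above, not confirmed against the code):

```python
d_model, n_layers = 1280, 32
n_heads, n_kv_heads = 20, 4
ff_mult, vocab_size = 4.0, 32064

head_dim = d_model // n_heads                         # 64
kv_dim = n_kv_heads * head_dim                        # 256 (GQA: 5 query heads share each KV head)

embedding = vocab_size * d_model                      # tied with the output head
attn = 2 * d_model * d_model + 2 * d_model * kv_dim   # Q/O full-width, K/V reduced
ffn_hidden = int(ff_mult * d_model)                   # 5120
ffn = 3 * d_model * ffn_hidden                        # gate, up, down (SwiGLU)
norms = 2 * d_model                                   # pre-attention and pre-FFN RMSNorm

total = embedding + n_layers * (attn + ffn + norms) + d_model  # + final norm
print(total)  # 796099840 -- matches the spec table
```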

Training

This model was pretrained (not fine-tuned) on mathematical reasoning data including GSM8K-style problems. It performs text completion rather than instruction-following.


Usage

This model uses a custom architecture. See the repository for loading code.

Loading the Model

import torch
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("AlgoDriveAI/GSM8K-700M-v1.1")

# Load model (requires the modeling_dense_llm.py file)
from modeling_dense_llm import load_model
model = load_model("pytorch_model.bin", "config.json", device="cuda")

# Or manually:
from modeling_dense_llm import DenseLLM
import json

with open("config.json") as f:
    config = json.load(f)

model = DenseLLM(
    vocab_size=config["vocab_size"],
    d_model=config["d_model"],
    n_layers=config["n_layers"],
    n_heads=config["n_heads"],
    n_kv_heads=config["n_kv_heads"],
    ff_hidden_mult=config["ff_hidden_mult"],
    qk_norm=config["qk_norm"],
    parallel_residual=config["parallel_residual"],
    max_seq_len=config["max_seq_len"],
)

state_dict = torch.load("pytorch_model.bin", map_location="cuda")
model.load_state_dict(state_dict)
model = model.cuda().bfloat16().eval()

Generation Example

# Math problem completion
prompt = """Question: Sarah has 5 apples. She gives 2 to her friend and then buys 3 more. How many apples does Sarah have now?
Let's solve this step by step:"""

input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"].cuda()

output = model.generate(
    input_ids,
    max_new_tokens=256,
    temperature=0.7,
    top_k=50,
    top_p=0.9,
    repetition_penalty=1.1,
    eos_token_id=tokenizer.eos_token_id,
)

print(tokenizer.decode(output[0]))
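Since completions end in a GSM8K-style `#### <answer>` line (as in the example solutions above), the final answer can be pulled out with a small helper. `extract_answer` is a hypothetical utility, not part of the repository:

```python
import re

def extract_answer(completion: str):
    """Return the value after the last '#### ' marker, or None if absent."""
    matches = re.findall(r"####\s*([-\d.,]+)", completion)
    return matches[-1].replace(",", "") if matches else None

sample = "Step 6: Verify the total cost matches.\nTherefore, 30 pounds of Type A.\n#### 30"
print(extract_answer(sample))  # 30
```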

Intended Use

  • Mathematical reasoning and problem solving
  • Text completion for math-related content
  • Research on efficient LLM architectures
  • Benchmarking small-scale math reasoning models

Limitations

  • This is a pretrained model, not instruction-tuned
  • Best used for text completion, not chat/QA
  • May require fine-tuning for specific downstream tasks
  • Known gap: Combined work-rate problems involving subtraction of reciprocal rates (fixable with targeted training data)

Requirements

torch>=2.0
transformers
einops

Citation

If you use this model, please cite:

@misc{gsm8k-densellm-700m-v1.1,
  author = {AlgoDriveAI, Christopher Smith},
  title = {GSM8K Dense LLM 700M v1.1},
  email = {csm4211@wgu.edu},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/AlgoDriveAI/GSM8K-700M-v1.1}
}
