GSM8K Dense LLM v1.1 (700M Parameters)

A custom Dense LLM architecture pretrained for mathematical reasoning tasks, particularly GSM8K-style problems.

Model Highlights

Parameters: ~796M
Training Data: ~5 billion tokens
Training Duration: 2 epochs
Focus: Mathematical word problems and reasoning
Proficiency Level: 9-10 (Advanced/Mastery)

What's New in v1.1

After further training, v1.1 shows substantially improved performance over v1.0, with stronger mathematical reasoning and more reliable problem-solving across a wider range of GSM8K-style questions.


Benchmark Results

GSM8K Proficiency Evaluation

The model was evaluated on a 12-question graded assessment spanning difficulty levels 1-10 (including three mastery-level variants at level 10), testing progressively harder mathematical reasoning skills.

Level Category Problem Type Result
1 Basic Simple subtraction βœ…
2 Basic Single multiplication βœ…
3 Elementary Division word problems βœ…
4 Elementary Multi-step addition/subtraction βœ…
5 Intermediate Sequential fractions βœ…
6 Intermediate Multi-item purchases with discounts βœ…
7 Advanced Overtime wage calculations βœ…
8 Advanced Combined rate problems (additive) βœ…
9 Expert Sequential percentage markup/discount βœ…
10 Mastery Combined work-rate (subtraction) ❌
10A Mastery Mixture problems / Systems of equations βœ…
10C Mastery Round-trip rate/time/distance βœ…

Overall Score: 11/12 (92%)
Proficiency Level: 9-10 (Advanced to Mastery)


Example Solutions

The following examples demonstrate the model's step-by-step reasoning capabilities on mastery-level problems.

Example 1: Mixture Problem (Level 10A - Mastery) βœ…

Problem:

A coffee shop mixes two types of beans. Type A costs $12 per pound and Type B costs $18 per pound. The shop creates a 50-pound mixture that sells for $14.40 per pound. How many pounds of each type of bean are in the mixture?

Model Output:

Step 1: Let x be the pounds of Type A beans and y be the pounds of Type B beans.
Step 2: Total weight equation: x + y = 50.
Step 3: Cost equation: 12x + 18y = 14.40 Γ— 50 = 720.
Step 4: Simplify cost equation: divide both sides by 6 β†’ 2x + 3y = 120.
Step 5: Solve the system of equations:
  x + y = 50
  2x + 3y = 120
Subtract the first from the second: x = 30.
Then y = 20.
Step 6: Verify total cost: 12 Γ— 30 + 18 Γ— 20 = 360 + 360 = 720, which matches the given total.
Therefore, the mixture contains 30 pounds of Type A beans and 20 pounds of Type B beans.
#### 30

βœ… Correct Answer: 30 pounds Type A, 20 pounds Type B

Skills Demonstrated:

  • Variable setup with clear definitions
  • System of equations construction
  • Algebraic simplification (dividing by 6 for cleaner numbers)
  • Correct solution method
  • Self-verification step
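The model's answer to the mixture problem can be checked independently. The sketch below (plain Python, no dependencies; the variable names are illustrative, not from the repository) solves the same 2×2 system by substitution:

```python
# Verify the mixture problem: x + y = 50, 12x + 18y = 14.40 * 50
total_weight = 50
price_a, price_b, price_mix = 12.0, 18.0, 14.40

total_cost = price_mix * total_weight                  # 720.0
# Substitute y = total_weight - x into the cost equation:
#   price_a*x + price_b*(total_weight - x) = total_cost
x = (total_cost - price_b * total_weight) / (price_a - price_b)
y = total_weight - x

print(x, y)  # 30.0 20.0 -- matches the model's answer
```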

Example 2: Round-Trip Rate Problem (Level 10C - Mastery) βœ…

Problem:

Emma drives from City A to City B at 60 mph. She immediately returns from City B to City A at 40 mph. If the entire round trip took 5 hours, what is the distance between City A and City B?

Model Output:

1. Let d be the distance between cities in miles.
2. Time to go from A to B: d/60 hours.
3. Time to return from B to A: d/40 hours.
4. Total time: d/60 + d/40 = 5.
5. Find a common denominator (120): 2d/120 + 3d/120 = 5.
6. Combine like terms: 5d/120 = d/24.
7. So, d/24 = 5.
8. Multiply both sides by 24: d = 5 Γ— 24 = 120.
Thus, the distance between cities is 120 miles.
#### 120

βœ… Correct Answer: 120 miles

Skills Demonstrated:

  • Correct rate/time/distance setup (time = distance/speed)
  • Proper fraction arithmetic with common denominators
  • Avoided the "average speed trap" (did NOT incorrectly average 60 and 40 to get 50 mph)
  • Clean algebraic manipulation
  • Correct final answer

⚠️ Note on the "Average Speed Trap": A common failure mode in which models incorrectly compute "average speed = (60+40)/2 = 50 mph, so distance = 50 × 2.5 = 125 miles". This model correctly recognized that speeds cannot be directly averaged over equal distances, demonstrating genuine mathematical reasoning rather than pattern matching.
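For equal-distance legs, the correct average speed is the harmonic mean of the two leg speeds. A quick sketch (plain Python; variable names are illustrative) contrasting the trap with the correct computation:

```python
v1, v2, total_time = 60.0, 40.0, 5.0

# Trap: arithmetic mean of the two speeds
wrong_avg = (v1 + v2) / 2                      # 50 mph
wrong_distance = wrong_avg * total_time / 2    # 125 miles (incorrect)

# Correct: harmonic mean applies when the two legs cover equal distance
true_avg = 2 * v1 * v2 / (v1 + v2)             # 48 mph
distance = true_avg * total_time / 2           # one-way distance

print(distance)  # 120.0 -- matches the model's answer
```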


Demonstrated Capabilities

Skill Status
Multi-step arithmetic βœ… Strong
Fraction operations βœ… Strong
Percentage calculations βœ… Strong
Systems of linear equations βœ… Strong
Rate/time/distance problems βœ… Strong
Algebraic word problem setup βœ… Strong
Solution verification βœ… Strong
Avoiding common traps (e.g., speed averaging) βœ… Strong
Combined work-rate with subtraction ⚠️ Gap

Key Findings

  • True reasoning ability: Model correctly avoids common mathematical traps (e.g., incorrectly averaging speeds for round-trip problems)
  • Clean solution formatting: Produces well-structured step-by-step solutions with verification steps
  • Identified training gap: Narrow gap on the "combined work-rate with subtraction" problem archetype (e.g., "Three workers together complete a task in X days; Worker A alone takes Y days..."); this appears to be a data coverage issue rather than a reasoning limitation
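The missed archetype reduces to subtracting reciprocal rates. As an illustration of the required arithmetic (hypothetical numbers, not the actual test question):

```python
from fractions import Fraction

# Hypothetical instance of the missed archetype:
# A and B together finish a job in 4 days; A alone takes 6 days.
# How long does B take alone?
rate_together = Fraction(1, 4)   # jobs per day
rate_a = Fraction(1, 6)

rate_b = rate_together - rate_a  # 1/4 - 1/6 = 1/12 jobs per day
days_b = 1 / rate_b              # reciprocal of B's rate

print(days_b)  # 12
```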

Architecture

  • Custom DenseLLM with Grouped-Query Attention (GQA)
  • RoPE positional embeddings
  • SwiGLU activation
  • RMSNorm normalization
  • Weight tying (embedding & output)

Technical Specifications

Parameter Value
d_model 1280
n_layers 32
n_heads 20
n_kv_heads 4
ff_mult 4.0
max_seq_len 2048
vocab_size 32,064
Total params ~796,099,840
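The specification table is internally consistent with the quoted parameter count. A back-of-envelope tally, assuming bias-free linear layers, a three-projection SwiGLU FFN at 4× d_model, two RMSNorms per layer plus one final norm, and tied input/output embeddings (all inferred from the tables above, not confirmed against the code):

```python
d_model, n_layers = 1280, 32
n_heads, n_kv_heads = 20, 4
ff_mult, vocab_size = 4.0, 32064

head_dim = d_model // n_heads                         # 64
kv_dim = n_kv_heads * head_dim                        # 256 (GQA: 5 query heads share each KV head)

embedding = vocab_size * d_model                      # tied with the output head
attn = 2 * d_model * d_model + 2 * d_model * kv_dim   # Q/O full-width, K/V reduced
ffn_hidden = int(ff_mult * d_model)                   # 5120
ffn = 3 * d_model * ffn_hidden                        # gate, up, down (SwiGLU)
norms = 2 * d_model                                   # pre-attention and pre-FFN RMSNorm

total = embedding + n_layers * (attn + ffn + norms) + d_model  # + final norm
print(total)  # 796099840 -- matches the spec table
```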

Training

This model was pretrained (not fine-tuned) on mathematical reasoning data including GSM8K-style problems. It performs text completion rather than instruction-following.


Usage

This model uses a custom architecture. See the repository for loading code.

Loading the Model

import torch
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("AlgoDriveAI/GSM8K-700M-v1.1")

# Load model (requires the modeling_dense_llm.py file)
from modeling_dense_llm import load_model
model = load_model("pytorch_model.bin", "config.json", device="cuda")

# Or manually:
from modeling_dense_llm import DenseLLM
import json

with open("config.json") as f:
    config = json.load(f)

model = DenseLLM(
    vocab_size=config["vocab_size"],
    d_model=config["d_model"],
    n_layers=config["n_layers"],
    n_heads=config["n_heads"],
    n_kv_heads=config["n_kv_heads"],
    ff_hidden_mult=config["ff_hidden_mult"],
    qk_norm=config["qk_norm"],
    parallel_residual=config["parallel_residual"],
    max_seq_len=config["max_seq_len"],
)

state_dict = torch.load("pytorch_model.bin", map_location="cuda")
model.load_state_dict(state_dict)
model = model.cuda().bfloat16().eval()

Generation Example

# Math problem completion
prompt = """Question: Sarah has 5 apples. She gives 2 to her friend and then buys 3 more. How many apples does Sarah have now?
Let's solve this step by step:"""

input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"].cuda()

output = model.generate(
    input_ids,
    max_new_tokens=256,
    temperature=0.7,
    top_k=50,
    top_p=0.9,
    repetition_penalty=1.1,
    eos_token_id=tokenizer.eos_token_id,
)

print(tokenizer.decode(output[0]))
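Since completions end in a GSM8K-style `#### <answer>` line (as in the example solutions above), the final answer can be pulled out with a small helper. `extract_answer` is a hypothetical utility, not part of the repository:

```python
import re

def extract_answer(completion: str):
    """Return the value after the last '#### ' marker, or None if absent."""
    matches = re.findall(r"####\s*([-\d.,]+)", completion)
    return matches[-1].replace(",", "") if matches else None

sample = "Step 6: Verify the total cost matches.\nTherefore, 30 pounds of Type A.\n#### 30"
print(extract_answer(sample))  # 30
```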

Intended Use

  • Mathematical reasoning and problem solving
  • Text completion for math-related content
  • Research on efficient LLM architectures
  • Benchmarking small-scale math reasoning models

Limitations

  • This is a pretrained model, not instruction-tuned
  • Best used for text completion, not chat/QA
  • May require fine-tuning for specific downstream tasks
  • Known gap: Combined work-rate problems involving subtraction of reciprocal rates (fixable with targeted training data)

Requirements

torch>=2.0
transformers
einops

Citation

If you use this model, please cite:

@misc{gsm8k-densellm-700m-v1.1,
  author = {AlgoDriveAI, Christopher Smith},
  title = {GSM8K Dense LLM 700M v1.1},
  email = {csm4211@wgu.edu},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/AlgoDriveAI/GSM8K-700M-v1.1}
}
