GSM8K Dense LLM v1.2 (700M Parameters)

A custom Dense LLM architecture pretrained for mathematical reasoning tasks, particularly GSM8K-style problems.

Model Highlights

Parameters: ~796M
Training Data: ~5 billion tokens
Training Duration: 2 epochs
Focus: Mathematical word problems and reasoning
Proficiency Level: 9-10 (Advanced/Mastery)

What's New in v1.2

This release maintains the mastery-level performance of v1.1 while broadening edge-case coverage for robustness. The model now handles unusual problem variations, ambiguous wording, and numerical edge cases more reliably. Training included targeted synthetic examples of:

  • Problems with awkward phrasing or unconventional wording
  • Edge cases with very small/large numbers
  • Scenarios requiring careful unit conversions
  • Problems with extraneous information
  • Boundary conditions in rate/time/distance problems

The result is a more robust model that maintains high accuracy even when problems deviate from standard formulations.
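The categories above suggest a simple augmentation recipe. A minimal sketch of how such synthetic variants might be generated (hypothetical templates for illustration; the actual data pipeline is not published):

```python
import random

# Hypothetical sketch: perturb a base problem along the axes listed above
# (unconventional phrasing, extraneous information). Templates are invented
# for illustration only.
PHRASINGS = [
    "{name} has {n} apples and gives away {m}. How many remain?",
    "Of the {n} apples {name} owned, {m} were given away; what is left?",
]
DISTRACTORS = ["{name} also owns 3 hats. ", ""]  # extraneous information

def make_variant(name="Sam", n=5, m=2, rng=random.Random(0)):
    problem = rng.choice(PHRASINGS).format(name=name, n=n, m=m)
    return rng.choice(DISTRACTORS).format(name=name) + problem

print(make_variant())
```

Scaling the other axes (very small/large numbers, unit conversions, rate boundary conditions) follows the same pattern of templated perturbation.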


Benchmark Results

GSM8K Proficiency Evaluation

The model was evaluated on a graded assessment of 12 questions spanning difficulty levels 1-10 (with additional mastery-level variants), testing progressive mathematical reasoning skills.

| Level | Category | Problem Type | Result |
|-------|----------|--------------|--------|
| 1 | Basic | Simple subtraction | βœ… |
| 2 | Basic | Single multiplication | βœ… |
| 3 | Elementary | Division word problems | βœ… |
| 4 | Elementary | Multi-step addition/subtraction | βœ… |
| 5 | Intermediate | Sequential fractions | βœ… |
| 6 | Intermediate | Multi-item purchases with discounts | βœ… |
| 7 | Advanced | Overtime wage calculations | βœ… |
| 8 | Advanced | Combined rate problems (additive) | βœ… |
| 9 | Expert | Sequential percentage markup/discount | βœ… |
| 10 | Mastery | Combined work-rate (subtraction) | βœ… |
| 10A | Mastery | Mixture problems / Systems of equations | βœ… |
| 10C | Mastery | Round-trip rate/time/distance | βœ… |

Overall Score: 11/12 (92%)
Proficiency Level: 9-10 (Advanced to Mastery)


Example Solutions

The following examples demonstrate the model's step-by-step reasoning capabilities on mastery-level problems.

Example 1: Mixture Problem (Level 10A - Mastery) βœ…

Problem:

A coffee shop mixes two types of beans. Type A costs $12 per pound and Type B costs $18 per pound. The shop creates a 50-pound mixture that sells for $14.40 per pound. How many pounds of each type of bean are in the mixture?

Model Output:

Step 1: Let x be the pounds of Type A beans and y be the pounds of Type B beans.
Step 2: Total weight equation: x + y = 50.
Step 3: Cost equation: 12x + 18y = 14.40 Γ— 50 = 720.
Step 4: Simplify cost equation: divide both sides by 6 β†’ 2x + 3y = 120.
Step 5: Solve the system of equations:
  x + y = 50
  2x + 3y = 120
Subtract the first from the second: x = 30.
Then y = 20.
Step 6: Verify total cost: 12 Γ— 30 + 18 Γ— 20 = 360 + 360 = 720, which matches the given total.
Therefore, the mixture contains 30 pounds of Type A beans and 20 pounds of Type B beans.
#### 30

βœ… Correct Answer: 30 pounds Type A, 20 pounds Type B

Skills Demonstrated:

  • Variable setup with clear definitions
  • System of equations construction
  • Algebraic simplification (dividing by 6 for cleaner numbers)
  • Correct solution method
  • Self-verification step

Example 2: Round-Trip Rate Problem (Level 10C - Mastery) βœ…

Problem:

Emma drives from City A to City B at 60 mph. She immediately returns from City B to City A at 40 mph. If the entire round trip took 5 hours, what is the distance between City A and City B?

Model Output:

1. Let d be the distance between cities in miles.
2. Time to go from A to B: d/60 hours.
3. Time to return from B to A: d/40 hours.
4. Total time: d/60 + d/40 = 5.
5. Find a common denominator (120): 2d/120 + 3d/120 = 5.
6. Combine like terms: 5d/120 = d/24.
7. So, d/24 = 5.
8. Multiply both sides by 24: d = 5 Γ— 24 = 120.
Thus, the distance between cities is 120 miles.
#### 120

βœ… Correct Answer: 120 miles

Skills Demonstrated:

  • Correct rate/time/distance setup (time = distance/speed)
  • Proper fraction arithmetic with common denominators
  • Avoided the "average speed trap" (did NOT incorrectly average 60 and 40 to get 50 mph)
  • Clean algebraic manipulation
  • Correct final answer

⚠️ Note on the "Average Speed Trap": a common failure mode is to compute "average speed = (60 + 40) / 2 = 50 mph, so distance = 50 Γ— 2.5 = 125 miles". This model correctly recognized that speeds cannot be directly averaged when the two legs cover equal distancesβ€”demonstrating genuine mathematical reasoning rather than pattern matching.
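The trap note above can be made concrete: solving d/60 + d/40 = 5 exactly, and comparing against the harmonic mean, which is the correct "average speed" for equal distances (a sketch, not code from the repository):

```python
from fractions import Fraction

# Exact solution of d/60 + d/40 = 5:  d = 5 / (1/60 + 1/40).
d = Fraction(5) / (Fraction(1, 60) + Fraction(1, 40))
print(d)  # 120

# For equal distances the correct average speed is the harmonic mean:
harmonic = 2 * 60 * 40 / (60 + 40)   # 48.0 mph, not the naive 50 mph
assert float(d) == harmonic * 5 / 2  # 48 mph over a 5 h round trip -> 120 mi each way
```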


Demonstrated Capabilities

| Skill | Status |
|-------|--------|
| Multi-step arithmetic | βœ… Strong |
| Fraction operations | βœ… Strong |
| Percentage calculations | βœ… Strong |
| Systems of linear equations | βœ… Strong |
| Rate/time/distance problems | βœ… Strong |
| Algebraic word problem setup | βœ… Strong |
| Solution verification | βœ… Strong |
| Avoiding common traps (e.g., speed averaging) | βœ… Strong |
| Edge case handling (unusual phrasing, boundary conditions) | βœ… Improved |
| Combined work-rate with subtraction | βœ… Improved |

Key Findings

  • True reasoning ability: Model correctly avoids common mathematical traps (e.g., incorrectly averaging speeds for round-trip problems)
  • Enhanced robustness: v1.2 handles edge cases and non-standard problem formulations more reliably
  • Clean solution formatting: Produces well-structured step-by-step solutions with verification steps
  • Identified training gap: Narrow gap on "combined work-rate with subtraction" problem archetype (e.g., "Three workers together complete a task in X days, Worker A alone takes Y days...")β€”this appears to be a data coverage issue rather than a reasoning limitation

Architecture

  • Custom DenseLLM with Grouped-Query Attention (GQA)
  • RoPE positional embeddings
  • SwiGLU activation
  • RMSNorm normalization
  • Weight tying (embedding & output)

Technical Specifications

| Parameter | Value |
|-----------|-------|
| d_model | 1280 |
| n_layers | 32 |
| n_heads | 20 |
| n_kv_heads | 4 |
| ff_mult | 4.0 |
| max_seq_len | 2048 |
| vocab_size | 32,064 |
| Total params | ~796,099,840 |
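The ~796M figure can be reproduced from this config, assuming a SwiGLU FFN with hidden size 4 Γ— d_model and three projections, two RMSNorms per layer plus a final norm, tied embeddings, and no bias or qk-norm parameters (a back-of-the-envelope sketch, not the repository's code):

```python
# Back-of-the-envelope parameter count from the config listed above.
d, n_layers, n_heads, n_kv_heads, vocab = 1280, 32, 20, 4, 32064
head_dim = d // n_heads         # 64
kv_dim = n_kv_heads * head_dim  # 256 (GQA: 4 KV heads)

attn = d * d + 2 * d * kv_dim + d * d  # Q, K, V, O projections
ffn = 3 * d * (4 * d)                  # SwiGLU: gate, up, down (hidden = 4 * d_model)
norms = 2 * d                          # two RMSNorms per layer
per_layer = attn + ffn + norms

total = n_layers * per_layer + vocab * d + d  # + tied embedding + final norm
print(f"{total:,}")  # 796,099,840
```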

Training

This model was pretrained (not fine-tuned) on mathematical reasoning data including GSM8K-style problems. It performs text completion rather than instruction-following.


Usage

This model uses a custom architecture. See the repository for loading code.

Loading the Model

import torch
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("AlgoDriveAI/GSM8K-700M-v1.2")

# Load model (requires the modeling_dense_llm.py file)
from modeling_dense_llm import load_model
model = load_model("pytorch_model.bin", "config.json", device="cuda")

# Or manually:
from modeling_dense_llm import DenseLLM
import json

with open("config.json") as f:
    config = json.load(f)

model = DenseLLM(
    vocab_size=config["vocab_size"],
    d_model=config["d_model"],
    n_layers=config["n_layers"],
    n_heads=config["n_heads"],
    n_kv_heads=config["n_kv_heads"],
    ff_hidden_mult=config["ff_hidden_mult"],
    qk_norm=config["qk_norm"],
    parallel_residual=config["parallel_residual"],
    max_seq_len=config["max_seq_len"],
)

state_dict = torch.load("pytorch_model.bin", map_location="cuda")
model.load_state_dict(state_dict)
model = model.cuda().bfloat16().eval()

Generation Example

# Math problem completion
prompt = """Question: Sarah has 5 apples. She gives 2 to her friend and then buys 3 more. How many apples does Sarah have now?
Let's solve this step by step:"""

input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"].cuda()

output = model.generate(
    input_ids,
    max_new_tokens=256,
    temperature=0.7,
    top_k=50,
    top_p=0.9,
    repetition_penalty=1.1,
    eos_token_id=tokenizer.eos_token_id,
)

print(tokenizer.decode(output[0]))
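As the example solutions show, completions terminate with a GSM8K-style `#### <answer>` line. A small helper (hypothetical, not part of the repository) can extract the numeric answer for automated scoring:

```python
import re

def extract_answer(completion: str):
    """Extract the final numeric answer from a '#### <answer>' line."""
    match = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", completion)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))

print(extract_answer("She has 6 apples left.\n#### 6"))  # 6.0
```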

Intended Use

  • Mathematical reasoning and problem solving
  • Text completion for math-related content
  • Research on efficient LLM architectures
  • Benchmarking small-scale math reasoning models

Limitations

  • This is a pretrained model, not instruction-tuned
  • Best used for text completion, not chat/QA
  • May require fine-tuning for specific downstream tasks
  • Known gap: Combined work-rate problems involving subtraction of reciprocal rates (fixable with targeted training data)

Requirements

torch>=2.0
transformers
einops

Citation

If you use this model, please cite:

@misc{gsm8k-densellm-700m-v1.2,
  author    = {Smith, Christopher and {AlgoDriveAI}},
  title     = {GSM8K Dense LLM 700M v1.2},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/AlgoDriveAI/GSM8K-700M-v1.2},
  note      = {Contact: csm4211@wgu.edu}
}
Evaluation results

  β€’ Proficiency Score on GSM8K Proficiency Quiz (10 levels): 92.0 (self-reported)
  β€’ Reasoning Level (1-10 scale) on GSM8K Proficiency Quiz (10 levels): 9-10 (self-reported)