---
license: apache-2.0
datasets:
- unsloth/OpenMathReasoning-mini
- nvidia/OpenCodeReasoning
language:
- en
base_model:
- Qwen/Qwen3-4B-Base
pipeline_tag: text-generation
---

# QTM7-4B

QTM7-4B is a proof-of-concept math & code reasoning model finetuned from Qwen/Qwen3-4B-Base. It was trained for ~4 hours on a single A100 GPU, using lightweight datasets focused on mathematical reasoning and structured problem solving. The project demonstrates what can be achieved on minimal compute and budget (≈$20 total cost).

---

## UPDATE: Observed Performance Shift

This model was explicitly trained on math and code datasets with the intent of achieving higher performance in structured reasoning than the base Qwen3-4B model. While quantitative GSM8K metrics show improved math ability, recent qualitative testing suggests an unexpected side effect:

**QTM7-4B exhibits significantly enhanced performance in creative writing, narrative generation, and descriptive tasks compared to the Qwen3-4B base model.**

The focused finetuning appears to have improved the model's handling of complex instructions and structure, which translates into more cohesive and evocative creative output.

---

## Model Details

- **Developed by:** Independent researcher (solo project)
- **Funding:** Self-funded (~$20 total compute cost)
- **Model type:** Decoder-only transformer for text generation
- **Language(s):** English
- **License:** Apache-2.0
- **Finetuned from:** [Qwen/Qwen3-4B-Base](https://huggingface.co/Qwen/Qwen3-4B-Base)

### Sources

- **Repository:** [Ma7ee7/QTM7-4b-2hr-checkpoint](https://huggingface.co/Ma7ee7/QTM7-4b-2hr-checkpoint)

---

## Uses

### Direct Use

- Research into math & code reasoning
- Proof-of-concept for low-budget finetuning of large language models
- **New focus:** Evaluating how low-resource finetuning affects creative writing and narrative coherence

### Downstream Use

- Potential basis for math problem solvers or code reasoning assistants
- Experiments in lightweight alignment or evaluation pipelines

### Out-of-Scope

- Not suitable for safety-critical, legal, or medical applications
- Not RLHF-aligned; outputs may be unfiltered or ungrounded

---

## Bias, Risks, and Limitations

- Inherits biases from Qwen3-4B-Base
- Untested on broader NLP benchmarks (MMLU, ARC, etc.)
- Training was short (~2 hours net, ~4 GPU hours total), so coverage is shallow
- General conversational ability remains at base-model level

**Recommendation:** Treat outputs as experimental. Do not deploy in production or decision-making contexts.

---

## Training Details

### Training Data

- **unsloth/OpenMathReasoning-mini** — math reasoning dataset
- **nvidia/OpenCodeReasoning** — code reasoning tasks
- No GSM8K contamination was found in either the training or post-training data.

### Procedure

- Mixed precision: **fp16**
- Optimizer: AdamW (standard defaults)
- Duration: ~4 hours on **1x NVIDIA A100**
- Checkpoint size: ~16 GB (fp16)

---

## Evaluation

### Setup

- Compared against **Qwen/Qwen3-4B** (post-trained version)
- Dataset: **GSM8K test split** (subset of 300 “hard” problems)
- Metrics: Exact match on the final numeric answer; a sketch of the scoring logic is shown below.
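The evaluation harness itself is not published with this card; the following is a minimal Python sketch of how the per-problem outcomes reported below (correct / mismatch / truncated) could be assigned, assuming a simple "last number in the completion" extraction rule. The function names and the truncation heuristic are illustrative, not the exact scoring code used.

```python
import re

# Matches integers and decimals, with optional thousands separators (e.g. "3,000", "-2.5").
NUM_RE = re.compile(r"-?\d+(?:,\d{3})*(?:\.\d+)?")


def extract_final_number(text):
    """Return the last number mentioned in a string, or None if there is none."""
    matches = NUM_RE.findall(text)
    return float(matches[-1].replace(",", "")) if matches else None


def score_completion(completion, gold_answer, hit_token_limit):
    """Classify one GSM8K completion as 'correct', 'mismatch', or 'truncated'."""
    pred = extract_final_number(completion)
    gold = extract_final_number(gold_answer)
    if pred is not None and gold is not None and pred == gold:
        return "correct"
    # Generations cut off by the length limit before reaching a final answer are
    # counted separately from ordinary wrong answers (assumed convention).
    if hit_token_limit:
        return "truncated"
    return "mismatch"
```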
### Results

**Training Loss Curve**

Stable convergence toward ~0.63 by step 1750, even as difficulty increased.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6466047a326128fd2c693cfa/EkC0a3csyiasVol9IcUMr.png)

---

**GSM8K Accuracy (Sampled)**

QTM7-4B* scored ~**80.7%** vs Qwen3-4B’s ~**28.0%**.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6466047a326128fd2c693cfa/C_qw2qw2sqg4tkcCAqOwI.png)

---

**Head-to-Head Outcomes**

QTM7-4B* won most direct comparisons.

- **Only QTM7-4B\*** correct → 171
- **Both** correct → 71
- **Both** wrong → 45
- **Only Qwen** correct → 13

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6466047a326128fd2c693cfa/T0oA_n-aQqYuQLOxCGQ22.png)

---

**Outcome Breakdown by Model (GSM8K subset)**

Side-by-side percentages for correctness vs error types.

- **QTM7-4B\***: 80.7% correct, 7.3% mismatch, **12.0% truncated**
- **Qwen3-4B**: 28.0% correct, **72.0% mismatch**, 0% truncated

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6466047a326128fd2c693cfa/MnG3Dx6Ud3l4lpcTIRsGp.png)

---

\* **QTM7-4B = 2hr checkpoint**

---

## Environmental Impact

Estimated using the [MLCO2 Impact Calculator](https://mlco2.github.io/impact#compute):

- **Hardware:** NVIDIA A100 (80GB)
- **GPU hours:** ~4
- **Cloud Provider:** Google Colab (us-central assumed)
- **Carbon emitted:** ≈ **1.2 kg CO2eq** *(about the same as driving ~5 km in a gasoline car)*

---

## Technical Specifications

- **Architecture:** Qwen3-4B transformer (4B params, decoder-only, rotary embeddings, SwiGLU, grouped query attention)
- **Objective:** Causal LM finetuning on reasoning tasks
- **Software:** PyTorch + Hugging Face Transformers + Datasets

---

## Summary

QTM7-4B is a minimal-budget proof-of-concept showing that:

- **Small compute can still move the needle** on reasoning with focused datasets.
- Math reasoning gains were observed even with short finetunes.
- The model is not benchmarked broadly, but shows promise as a low-resource experiment.
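---

## How to Get Started

A minimal generation sketch using Hugging Face Transformers. The repository id is the checkpoint listed under Sources; the prompt, dtype, and decoding settings are illustrative and may need adjusting for your hardware.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Ma7ee7/QTM7-4b-2hr-checkpoint"  # checkpoint listed under Sources

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # the card lists the checkpoint as fp16
    device_map="auto",
)

# Illustrative math word problem; the model was finetuned from a base model,
# so a plain completion-style prompt is used here.
prompt = (
    "A bakery sells 24 muffins in the morning and half as many in the afternoon. "
    "How many muffins does it sell in total? Think step by step."
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```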