---
license: apache-2.0
datasets:
- unsloth/OpenMathReasoning-mini
- nvidia/OpenCodeReasoning
language:
- en
base_model:
- Qwen/Qwen3-4B-Base
pipeline_tag: text-generation
---
# QTM7-4B
QTM7-4B is a proof-of-concept math & code reasoning model finetuned from Qwen/Qwen3-4B-Base.
It was trained for ~4 hours on a single A100 GPU, using lightweight datasets focused on mathematical reasoning and structured problem solving.
This project demonstrates what can be achieved on minimal compute and budget (≈$20 total cost).
---
## UPDATE: Observed Performance Shift
This model was explicitly trained using math and code datasets with the intent of achieving higher performance in structured reasoning compared to the base Qwen3-4B model. While quantitative GSM8K metrics show improved math ability, recent qualitative testing suggests an unexpected side effect:
**QTM7-4B exhibits significantly enhanced performance in creative writing, narrative generation, and descriptive tasks compared to the Qwen3-4B base model.**
The focused finetuning appears to have improved the model's handling of complex instructions and structured output, which translates into a stronger ability to generate cohesive and evocative creative content.
---
## Model Details
- **Developed by:** Independent researcher (solo project)
- **Funding:** Self-funded (~$20 total compute cost)
- **Model type:** Decoder-only transformer for text generation
- **Language(s):** English
- **License:** Apache-2.0
- **Finetuned from:** [Qwen/Qwen3-4B-Base](https://huggingface.co/Qwen/Qwen3-4B-Base)
### Sources
- **Repository:** [Ma7ee7/QTM7-4b-2hr-checkpoint](https://huggingface.co/Ma7ee7/QTM7-4b-2hr-checkpoint)
---
## Uses
### Direct Use
- Research into math & code reasoning
- Proof-of-concept for low-budget finetuning on large language models
- **New Focus:** Evaluation of low-resource impact on creative writing and narrative coherence (a minimal loading sketch follows this list)
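
A minimal loading sketch, assuming the checkpoint loads with the standard Transformers auto classes; the prompt, dtype, and generation settings below are illustrative, not tuned recommendations:

```python
# Minimal inference sketch; dtype, device placement, and generation
# settings are illustrative assumptions, not tuned recommendations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Ma7ee7/QTM7-4b-2hr-checkpoint"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Solve step by step: a train travels 120 km in 1.5 hours. What is its average speed?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```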
### Downstream Use
- Potential basis for math problem solvers or code reasoning assistants
- Experiments in lightweight alignment or evaluation pipelines
### Out-of-Scope
- Not suitable for safety-critical, legal, or medical applications
- Not RLHF-aligned; outputs may be unfiltered or ungrounded
---
## Bias, Risks, and Limitations
- Inherits biases from Qwen3-4B-Base
- Untested on broader NLP benchmarks (MMLU, ARC, etc.)
- Training was short (~2 hours net, ~4 GPU hours total), so coverage is shallow
- General conversational ability remains base-model level
**Recommendation:** Treat outputs as experimental. Do not deploy in production or decision-making contexts.
---
## Training Details
### Training Data
- **unsloth/OpenMathReasoning-mini** — math reasoning dataset
- **nvidia/OpenCodeReasoning** — code reasoning tasks
- No GSM8K contamination was found in either the training or post-training data.
### Procedure
- Mixed precision: **fp16**
- Optimizer: AdamW (standard defaults)
- Duration: ~4 hours on **1x NVIDIA A100**
- Checkpoint size: ~16 GB (fp16)
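
The exact training script is not published; the sketch below is a rough reproduction under assumptions. TRL's `SFTTrainer`, the `train` split name, batch size, and learning rate are all illustrative; fp16 and default AdamW match what is described above.

```python
# Illustrative reproduction sketch, NOT the exact training script.
# Hyperparameters, split names, and trainer choice are assumptions.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

train_data = load_dataset("unsloth/OpenMathReasoning-mini", split="train")  # split name may differ

config = SFTConfig(
    output_dir="qtm7-4b-sft",
    per_device_train_batch_size=2,   # assumed; sized for a single 80GB A100
    gradient_accumulation_steps=8,   # assumed
    learning_rate=2e-5,              # assumed
    fp16=True,                       # mixed precision, as stated above
    logging_steps=50,
)

trainer = SFTTrainer(
    model="Qwen/Qwen3-4B-Base",
    train_dataset=train_data,
    args=config,
)
trainer.train()
```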
---
## Evaluation
### Setup
- Compared against **Qwen/Qwen3-4B** (post-trained version)
- Dataset: **GSM8K test split** (subset of 300 “hard” problems)
- Metrics: Exact match on final numeric answer
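
The grading step can be sketched as below; the regex-based extraction of the final number is an assumption about how exact match was computed, not the published evaluation harness:

```python
# Sketch of exact-match grading on the final numeric answer (the actual
# harness is not published; this extraction rule is an assumption).
import re

def final_number(text: str) -> str | None:
    """Return the last number in a completion, with thousands commas stripped."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return matches[-1].replace(",", "") if matches else None

def exact_match(prediction: str, reference: str) -> bool:
    pred, ref = final_number(prediction), final_number(reference)
    return pred is not None and pred == ref

# GSM8K references end with "#### <answer>", so final_number() also works on them.
assert exact_match("...so the answer is 42.", "#### 42")
```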
### Results
**Training Loss Curve**
Stable convergence toward ~0.63 by step 1750, even as difficulty increased.

---
**GSM8K Accuracy (Sampled)**
QTM7-4B* scored ~**80.7%** vs Qwen3-4B’s ~**28.0%**.

---
**Head-to-Head Outcomes**
QTM7-4B* won most direct comparisons. The four outcome counts below cover the full 300-problem subset and reproduce the sampled accuracies: 171 + 71 = 242/300 ≈ 80.7% for QTM7-4B, and 71 + 13 = 84/300 = 28.0% for Qwen3-4B.
- **Only QTM7-4B\*** correct → 171
- **Both** correct → 71
- **Both** wrong → 45
- **Only Qwen** correct → 13

---
**Outcome Breakdown by Model (GSM8K subset)**
Side-by-side percentages for correct answers vs. error types (mismatch: the final numeric answer does not match the reference; truncated: generation was cut off before producing a final answer).
- **QTM7-4B\***: 80.7% correct, 7.3% mismatch, **12.0% truncated**
- **Qwen3-4B**: 28.0% correct, **72.0% mismatch**, 0% truncated

---
\* **QTM7-4B = 2hr checkpoint**
---
## Environmental Impact
Estimated using [MLCO2 Impact Calculator](https://mlco2.github.io/impact#compute):
- **Hardware:** NVIDIA A100 (80GB)
- **GPU hours:** ~4
- **Cloud Provider:** Google Colab (us-central assumed)
- **Carbon emitted:** ≈ **1.2 kg CO2eq**
*(About the same as driving ~5 km in a gasoline car.)*
---
## Technical Specifications
- **Architecture:** Qwen3-4B transformer (4B params, decoder-only, rotary embeddings, SwiGLU, grouped query attention)
- **Objective:** Causal LM finetuning on reasoning tasks
- **Software:** PyTorch + Hugging Face Transformers + Datasets
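
These details can be checked directly from the checkpoint's config using standard Transformers config fields; a quick sketch:

```python
# Inspect the architecture from the checkpoint config (no weights downloaded).
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Ma7ee7/QTM7-4b-2hr-checkpoint")
print(config.model_type)                                        # expected: "qwen3"
print(config.num_hidden_layers, config.hidden_size)             # depth and width
print(config.num_attention_heads, config.num_key_value_heads)   # grouped-query attention
```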
---
## Summary
QTM7-4B is a minimal-budget proof-of-concept showing that:
- **Small compute can still move the needle** on reasoning with focused datasets.
- Math reasoning gains were observed even with short finetunes.
- The model is not benchmarked broadly, but shows promise as a low-resource experiment.