---
license: apache-2.0
datasets:
- unsloth/OpenMathReasoning-mini
- nvidia/OpenCodeReasoning
language:
- en
base_model:
- Qwen/Qwen3-4B-Base
pipeline_tag: text-generation
---
# QTM7-4B

QTM7-4B is a proof-of-concept math and code reasoning model finetuned from Qwen/Qwen3-4B-Base.  
It was trained for ~4 hours on a single A100 GPU, using lightweight datasets focused on mathematical reasoning and structured problem solving.  
The project demonstrates what can be achieved on minimal compute and budget (≈$20 total cost).

---

## UPDATE: Observed Performance Shift

This model was trained on math and code datasets with the explicit goal of achieving stronger structured reasoning than the base Qwen3-4B model. While quantitative GSM8K metrics confirm the improved math ability, recent qualitative testing suggests an unexpected side effect:

**QTM7-4B exhibits significantly enhanced performance in creative writing, narrative generation, and descriptive tasks compared to the Qwen3-4B base model.**

The focused finetuning appears to have improved the model's handling of complex instructions and structure, and this has carried over into a stronger ability to generate cohesive, evocative creative content.

---

## Model Details

- **Developed by:** Independent researcher (solo project)  
- **Funding:** Self-funded (~$20 total compute cost)  
- **Model type:** Decoder-only transformer for text generation  
- **Language(s):** English  
- **License:** Apache-2.0  
- **Finetuned from:** [Qwen/Qwen3-4B-Base](https://huggingface.co/Qwen/Qwen3-4B-Base)  

### Sources
- **Repository:** [Ma7ee7/QTM7-4b-2hr-checkpoint](https://huggingface.co/Ma7ee7/QTM7-4b-2hr-checkpoint)

---

## Uses

### Direct Use
- Research into math & code reasoning  
- Proof-of-concept for low-budget finetuning of large language models  
- **New focus:** evaluating how low-resource finetuning affects creative writing and narrative coherence (a loading sketch follows below)
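
For quick experimentation, here is a minimal loading-and-inference sketch with Hugging Face Transformers. The repo ID comes from this card; the dtype, device placement, and generation settings are illustrative assumptions, not the settings used in the evaluation below.

```python
# Minimal inference sketch; requires transformers, torch, and accelerate
# (the latter for device_map="auto"). Generation settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "Ma7ee7/QTM7-4b-2hr-checkpoint"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.float16, device_map="auto"
)

prompt = "A train travels 120 km in 2 hours. What is its average speed?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, not the echoed prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```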

### Downstream Use
- Potential basis for math problem solvers or code reasoning assistants  
- Experiments in lightweight alignment or evaluation pipelines  

### Out-of-Scope
- Not suitable for safety-critical, legal, or medical applications  
- Not RLHF-aligned; outputs may be unfiltered or ungrounded  

---

## Bias, Risks, and Limitations
- Inherits biases from Qwen3-4B-Base  
- Untested on broader NLP benchmarks (MMLU, ARC, etc.)  
- Training was short (~2 hours net, ~4 GPU hours total), so coverage is shallow  
- General conversational ability remains base-model level  

**Recommendation:** Treat outputs as experimental. Do not deploy in production or decision-making contexts.

---

## Training Details

### Training Data
- **unsloth/OpenMathReasoning-mini** — math reasoning dataset  
- **nvidia/OpenCodeReasoning** — code reasoning tasks  
- No GSM8K contamination was found in either the training or post-training data.

### Procedure
- Mixed precision: **fp16**  
- Optimizer: AdamW (standard defaults)  
- Duration: ~4 hours on **1x NVIDIA A100**  
- Checkpoint size: ~16 GB (fp16)  
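
For reference, below is a minimal sketch of the kind of run described above. The card only specifies fp16, AdamW defaults, and the hardware; the batch size, learning rate, sequence length, and step count (aside from the ~1750 steps visible in the loss curve later on) are assumptions, and the tiny in-memory dataset stands in for the actual reasoning corpora.

```python
# Minimal causal-LM finetuning sketch. Only fp16 + AdamW are stated on the
# card; all other hyperparameters here are assumptions for illustration.
import torch
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "Qwen/Qwen3-4B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             torch_dtype=torch.float16)

# Stand-in example; the real run used unsloth/OpenMathReasoning-mini
# and nvidia/OpenCodeReasoning.
texts = ["Q: 2 + 2 = ?\nA: Let's reason step by step. 2 + 2 = 4."]
ds = Dataset.from_dict({"text": texts}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
    remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="qtm7-4b",
    per_device_train_batch_size=1,   # assumption
    gradient_accumulation_steps=8,   # assumption
    learning_rate=2e-5,              # assumption
    max_steps=1750,                  # matches the loss curve's final step
    fp16=True,                       # stated on the card
    optim="adamw_torch",             # "AdamW (standard defaults)"
    logging_steps=50,
)

Trainer(
    model=model,
    args=args,
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```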

---

## Evaluation

### Setup
- Compared against **Qwen/Qwen3-4B** (post-trained version)  
- Dataset: **GSM8K test split** (subset of 300 “hard” problems)  
- Metrics: Exact match on final numeric answer  
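
The exact-match metric can be reproduced with a short script. A sketch, assuming the standard GSM8K gold format (the final answer follows "####"); the 300-problem "hard" subset is not published, so the full test split is used here:

```python
# Exact-match scoring sketch for GSM8K-style answers. Selection of the
# card's 300-problem "hard" subset is left out, as it is not published.
import re
from datasets import load_dataset

def final_number(text):
    """Return the last number appearing in `text`, commas stripped."""
    nums = re.findall(r"-?\d+\.?\d*", text.replace(",", ""))
    return nums[-1] if nums else None

def exact_match(prediction, gold_answer):
    # GSM8K gold answers end with "#### <number>".
    gold = gold_answer.split("####")[-1].strip().replace(",", "")
    return final_number(prediction) == gold

gsm8k = load_dataset("openai/gsm8k", "main", split="test")
# Example usage with a hypothetical generate_answer(question) function:
# score = sum(exact_match(generate_answer(ex["question"]), ex["answer"])
#             for ex in gsm8k) / len(gsm8k)
```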

### Results

**Training Loss Curve**  
Stable convergence toward ~0.63 by step 1750, even as difficulty increased.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6466047a326128fd2c693cfa/EkC0a3csyiasVol9IcUMr.png)

---

**GSM8K Accuracy (Sampled)**  
QTM7-4B* scored ~**80.7%** vs Qwen3-4B’s ~**28.0%**.  

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6466047a326128fd2c693cfa/C_qw2qw2sqg4tkcCAqOwI.png)

---

**Head-to-Head Outcomes**  
QTM7-4B* won most direct comparisons.  
- **Only QTM7-4B\*** correct → 171  
- **Both** correct → 71  
- **Both** wrong → 45  
- **Only Qwen** correct → 13

These counts are consistent with the sampled accuracies above: (171 + 71)/300 ≈ 80.7% for QTM7-4B and (71 + 13)/300 = 28.0% for Qwen3-4B.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6466047a326128fd2c693cfa/T0oA_n-aQqYuQLOxCGQ22.png)

---

**Outcome Breakdown by Model (GSM8K subset)**  
Side-by-side percentages for correctness vs error types.

- **QTM7-4B\***: 80.7% correct, 7.3% mismatch, **12.0% truncated**  
- **Qwen3-4B**: 28.0% correct, **72.0% mismatch**, 0% truncated  

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6466047a326128fd2c693cfa/MnG3Dx6Ud3l4lpcTIRsGp.png)

---

\* **QTM7-4B = 2hr checkpoint**  

---

## Environmental Impact

Estimated using [MLCO2 Impact Calculator](https://mlco2.github.io/impact#compute):

- **Hardware:** NVIDIA A100 (80GB)  
- **GPU hours:** ~4  
- **Cloud Provider:** Google Colab (us-central assumed)  
- **Carbon emitted:** 1.2 kg CO2eq

*(About the same as driving ~5 km in a gasoline car.)*

---

## Technical Specifications

- **Architecture:** Qwen3-4B transformer (4B params, decoder-only, rotary embeddings, SwiGLU, grouped query attention)  
- **Objective:** Causal LM finetuning on reasoning tasks  
- **Software:** PyTorch + Hugging Face Transformers + Datasets  
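
These architectural settings can be verified from the model config without downloading any weights; a small sketch (field names follow the Transformers Qwen3 config class):

```python
# Inspect architecture fields from the config only (no weight download).
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("Qwen/Qwen3-4B-Base")
print(cfg.num_hidden_layers, cfg.hidden_size)
# Grouped query attention: fewer KV heads than attention heads.
print(cfg.num_attention_heads, cfg.num_key_value_heads)
print(cfg.rope_theta)  # rotary position embedding base
```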

---

## Summary

QTM7-4B is a minimal-budget proof of concept showing that:  
- **Small compute can still move the needle** on reasoning when paired with focused datasets.  
- Measurable math reasoning gains appeared even after a short finetune.  
- The model has not been benchmarked broadly, but it shows promise as a low-resource experiment.