---
license: apache-2.0
datasets:
- unsloth/OpenMathReasoning-mini
- nvidia/OpenCodeReasoning
language:
- en
base_model:
- Qwen/Qwen3-4B-Base
pipeline_tag: text-generation
---
# QTM7-4B
QTM7-4B is a proof-of-concept math & code reasoning model built on Qwen/Qwen3-4B-Base.
It was finetuned for ~4 hours on a single A100 GPU using lightweight datasets focused on mathematical reasoning and structured problem solving.
The project demonstrates what can be achieved on a minimal compute budget (≈$20 total cost).
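A minimal usage sketch with Hugging Face Transformers. The repo id below comes from the Sources section; the prompt and generation settings are illustrative, not tuned recommendations.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Checkpoint repo id taken from the Sources section of this card.
model_id = "Ma7ee7/QTM7-4b-2hr-checkpoint"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "A train travels 120 km in 1.5 hours. What is its average speed in km/h?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding keeps the final numeric answer deterministic for quick checks.
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```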
---
## UPDATE: Observed Performance Shift
This model was explicitly trained using math and code datasets with the intent of achieving higher performance in structured reasoning compared to the base Qwen3-4B model. While quantitative GSM8K metrics show improved math ability, recent qualitative testing suggests an unexpected side effect:
**QTM7-4B exhibits significantly enhanced performance in creative writing, narrative generation, and descriptive tasks compared to the Qwen3-4B base model.**
The focused finetuning appears to have improved the model's handling of complex instructions and structure, and this has translated into a stronger ability to generate cohesive, evocative creative content.
---
## Model Details
- **Developed by:** Independent researcher (solo project)
- **Funding:** Self-funded (~$20 total compute cost)
- **Model type:** Decoder-only transformer for text generation
- **Language(s):** English
- **License:** Apache-2.0
- **Finetuned from:** [Qwen/Qwen3-4B-Base](https://huggingface.co/Qwen/Qwen3-4B-Base)
### Sources
- **Repository:** [Ma7ee7/QTM7-4b-2hr-checkpoint](https://huggingface.co/Ma7ee7/QTM7-4b-2hr-checkpoint)
---
## Uses
### Direct Use
- Research into math & code reasoning
- Proof-of-concept for low-budget finetuning of large language models
- **New Focus:** Evaluation of low-resource impact on creative writing and narrative coherence.
### Downstream Use
- Potential basis for math problem solvers or code reasoning assistants
- Experiments in lightweight alignment or evaluation pipelines
### Out-of-Scope
- Not suitable for safety-critical, legal, or medical applications
- Not RLHF-aligned; outputs may be unfiltered or ungrounded
---
## Bias, Risks, and Limitations
- Inherits biases from Qwen3-4B-Base
- Untested on broader NLP benchmarks (MMLU, ARC, etc.)
- Training was short (~2 hours net, ~4 GPU hours total), so coverage is shallow
- General conversational ability remains base-model level
**Recommendation:** Treat outputs as experimental. Do not deploy in production or decision-making contexts.
---
## Training Details
### Training Data
- **unsloth/OpenMathReasoning-mini** — math reasoning dataset
- **nvidia/OpenCodeReasoning** — code reasoning tasks
- No GSM8K contamination was found in either the training or post-training data.
### Procedure
- Mixed precision: **fp16**
- Optimizer: AdamW (standard defaults)
- Duration: ~4 hours on **1x NVIDIA A100**
- Checkpoint size: ~16 GB (fp16)
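A minimal training sketch matching the settings above (fp16, AdamW defaults). The dataset split, field name, sequence length, and batch sizes are assumptions for illustration; the actual run's hyperparameters are not published here.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "Qwen/Qwen3-4B-Base"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Split and field name are assumptions; adapt to the dataset's actual schema.
ds = load_dataset("unsloth/OpenMathReasoning-mini", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

ds = ds.map(tokenize, batched=True, remove_columns=ds.column_names)

args = TrainingArguments(
    output_dir="qtm7-4b",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,  # assumed effective batch size
    learning_rate=2e-5,              # assumed; AdamW defaults otherwise
    fp16=True,                       # mixed precision as noted above
    num_train_epochs=1,
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```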
---
## Evaluation
### Setup
- Compared against **Qwen/Qwen3-4B** (post-trained version)
- Dataset: **GSM8K test split** (subset of 300 “hard” problems)
- Metrics: Exact match on the final numeric answer (see the scoring sketch below)
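A sketch of the scoring rule: pull the last number out of a completion and compare it with the GSM8K gold answer, which follows a `####` marker in the dataset. The extraction regex is an assumption about how formatting edge cases were handled.

```python
import re

def final_number(text):
    """Return the last number in a string, with thousands separators stripped."""
    matches = re.findall(r"-?\d[\d,]*\.?\d*", text)
    return matches[-1].replace(",", "") if matches else None

def gsm8k_reference(answer_field):
    # GSM8K stores the gold answer after "####" in the answer field.
    return answer_field.split("####")[-1].strip().replace(",", "")

def exact_match(completion, answer_field):
    return final_number(completion) == gsm8k_reference(answer_field)

# Example: a completion ending in "... the answer is 42." matches gold "#### 42".
assert exact_match("So the answer is 42.", "She sold them all. #### 42")
```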
### Results
**Training Loss Curve**
Stable convergence toward ~0.63 by step 1750, even as difficulty increased.
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6466047a326128fd2c693cfa/EkC0a3csyiasVol9IcUMr.png)
---
**GSM8K Accuracy (Sampled)**
QTM7-4B* scored ~**80.7%** vs Qwen3-4B’s ~**28.0%**.
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6466047a326128fd2c693cfa/C_qw2qw2sqg4tkcCAqOwI.png)
---
**Head-to-Head Outcomes**
QTM7-4B* won most direct comparisons.
- **Only QTM7-4B\*** correct → 171
- **Both** correct → 71
- **Both** wrong → 45
- **Only Qwen** correct → 13
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6466047a326128fd2c693cfa/T0oA_n-aQqYuQLOxCGQ22.png)
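These four buckets follow directly from per-problem correctness flags for the two models; a small sketch (variable names are illustrative):

```python
from collections import Counter

def head_to_head(qtm_correct, qwen_correct):
    """Bucket each problem by which model(s) answered it correctly."""
    buckets = Counter()
    for a, b in zip(qtm_correct, qwen_correct):
        if a and b:
            buckets["both correct"] += 1
        elif a:
            buckets["only QTM7-4B correct"] += 1
        elif b:
            buckets["only Qwen correct"] += 1
        else:
            buckets["both wrong"] += 1
    return buckets
```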
---
**Outcome Breakdown by Model (GSM8K subset)**
Side-by-side percentages for correctness vs error types.
- **QTM7-4B\***: 80.7% correct, 7.3% mismatch, **12.0% truncated**
- **Qwen3-4B**: 28.0% correct, **72.0% mismatch**, 0% truncated
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6466047a326128fd2c693cfa/MnG3Dx6Ud3l4lpcTIRsGp.png)
---
\* **QTM7-4B = 2hr checkpoint**
---
## Environmental Impact
Estimated using [MLCO2 Impact Calculator](https://mlco2.github.io/impact#compute):
- **Hardware:** NVIDIA A100 (80GB)
- **GPU hours:** ~4
- **Cloud Provider:** Google Colab (us-central assumed)
- **Carbon emitted:** 1.2 kg CO2eq
*(About the same as driving ~5 km in a gasoline car.)*
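A back-of-envelope check of that figure. The power draw and grid intensity below are assumptions chosen to show the order of magnitude, not the calculator's exact inputs:

```python
# Rough plausibility check of the ~1.2 kg CO2eq estimate (assumed values).
gpu_hours = 4            # from the run above
avg_draw_kw = 0.4        # assumed average A100 board power (~400 W)
grid_kg_per_kwh = 0.75   # assumed grid carbon intensity

energy_kwh = gpu_hours * avg_draw_kw        # 1.6 kWh
print(energy_kwh * grid_kg_per_kwh)         # ≈ 1.2 kg CO2eq
```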
---
## Technical Specifications
- **Architecture:** Qwen3-4B transformer (4B params, decoder-only, rotary embeddings, SwiGLU, grouped query attention)
- **Objective:** Causal LM finetuning on reasoning tasks
- **Software:** PyTorch + Hugging Face Transformers + Datasets
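The architecture claims above can be spot-checked from the base model's config without downloading weights. Field names below are the standard Transformers config attributes; grouped-query attention shows up as fewer key/value heads than query heads.

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("Qwen/Qwen3-4B-Base")

# GQA: num_key_value_heads < num_attention_heads
print(cfg.num_attention_heads, cfg.num_key_value_heads)
print(cfg.hidden_size, cfg.num_hidden_layers)
print(cfg.hidden_act)  # SwiGLU MLPs report "silu" as the activation
```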
---
## Summary
QTM7-4B is a minimal-budget proof-of-concept showing that:
- **Small compute can still move the needle** on reasoning with focused datasets.
- Math reasoning gains were observed even with short finetunes.
- The model has not been benchmarked broadly, but it shows promise as a low-resource experiment.