---
license: apache-2.0
datasets:
- unsloth/OpenMathReasoning-mini
- nvidia/OpenCodeReasoning
language:
- en
base_model:
- Qwen/Qwen3-4B-Base
pipeline_tag: text-generation
---

# QTM7-4B

QTM7-4B is a proof-of-concept math & code reasoning model finetuned from Qwen/Qwen3-4B-Base. It was trained for ~4 hours on a single A100 GPU, using lightweight datasets focused on mathematical reasoning and structured problem solving. The project demonstrates what can be achieved on minimal compute and budget (≈$20 total cost).

---

## UPDATE: Observed Performance Shift

This model was explicitly trained on math and code datasets with the intent of achieving higher performance in structured reasoning than the base Qwen3-4B model. While quantitative GSM8K metrics show improved math ability, recent qualitative testing suggests an unexpected side effect:

**QTM7-4B exhibits significantly enhanced performance in creative writing, narrative generation, and descriptive tasks compared to the Qwen3-4B base model.**

The focused finetuning appears to have improved the model's handling of complex instructions and structure, which translates into more cohesive and evocative creative output.

---

## Model Details

- **Developed by:** Independent researcher (solo project)
- **Funding:** Self-funded (~$20 total compute cost)
- **Model type:** Decoder-only transformer for text generation
- **Language(s):** English
- **License:** Apache-2.0
- **Finetuned from:** [Qwen/Qwen3-4B-Base](https://huggingface.co/Qwen/Qwen3-4B-Base)

### Sources

- **Repository:** [Ma7ee7/QTM7-4b-2hr-checkpoint](https://huggingface.co/Ma7ee7/QTM7-4b-2hr-checkpoint)

---

## Uses

### Direct Use

- Research into math & code reasoning
- Proof-of-concept for low-budget finetuning of large language models
- **New focus:** Evaluating how low-resource finetuning affects creative writing and narrative coherence

### Downstream Use

- Potential basis for math problem solvers or code reasoning assistants
- Experiments in lightweight alignment or evaluation pipelines

### Out-of-Scope

- Not suitable for safety-critical, legal, or medical applications
- Not RLHF-aligned; outputs may be unfiltered or ungrounded

---

## Bias, Risks, and Limitations

- Inherits biases from Qwen3-4B-Base
- Untested on broader NLP benchmarks (MMLU, ARC, etc.)
- Training was short (~2 hours net, ~4 GPU hours total), so coverage is shallow
- General conversational ability remains at base-model level

**Recommendation:** Treat outputs as experimental. Do not deploy in production or decision-making contexts.

---

## Training Details

### Training Data

- **unsloth/OpenMathReasoning-mini** — math reasoning dataset
- **nvidia/OpenCodeReasoning** — code reasoning tasks
- No GSM8K contamination was found in either the training or post-training data.

### Procedure

- Mixed precision: **fp16**
- Optimizer: AdamW (standard defaults)
- Duration: ~4 hours on **1x NVIDIA A100**
- Checkpoint size: ~16 GB (fp16)

---

## Evaluation

### Setup

- Compared against **Qwen/Qwen3-4B** (post-trained version)
- Dataset: **GSM8K test split** (subset of 300 “hard” problems)
- Metrics: Exact match on the final numeric answer; a sketch of the scoring logic is shown below.
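The evaluation harness itself is not published with this card; the following is a minimal Python sketch of how the per-problem outcomes reported below (correct / mismatch / truncated) could be assigned, assuming a simple "last number in the completion" extraction rule. The function names and the truncation heuristic are illustrative, not the exact scoring code used.

```python
import re

# Matches integers and decimals, with optional thousands separators (e.g. "3,000", "-2.5").
NUM_RE = re.compile(r"-?\d+(?:,\d{3})*(?:\.\d+)?")


def extract_final_number(text):
    """Return the last number mentioned in a string, or None if there is none."""
    matches = NUM_RE.findall(text)
    return float(matches[-1].replace(",", "")) if matches else None


def score_completion(completion, gold_answer, hit_token_limit):
    """Classify one GSM8K completion as 'correct', 'mismatch', or 'truncated'."""
    pred = extract_final_number(completion)
    gold = extract_final_number(gold_answer)
    if pred is not None and gold is not None and pred == gold:
        return "correct"
    # Generations cut off by the length limit before reaching a final answer are
    # counted separately from ordinary wrong answers (assumed convention).
    if hit_token_limit:
        return "truncated"
    return "mismatch"
```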
### Results

**Training Loss Curve**

Stable convergence toward ~0.63 by step 1750, even as difficulty increased.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6466047a326128fd2c693cfa/EkC0a3csyiasVol9IcUMr.png)

---

**GSM8K Accuracy (Sampled)**

QTM7-4B* scored ~**80.7%** vs Qwen3-4B’s ~**28.0%**.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6466047a326128fd2c693cfa/C_qw2qw2sqg4tkcCAqOwI.png)

---

**Head-to-Head Outcomes**

QTM7-4B* won most direct comparisons.

- **Only QTM7-4B\*** correct → 171
- **Both** correct → 71
- **Both** wrong → 45
- **Only Qwen** correct → 13

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6466047a326128fd2c693cfa/T0oA_n-aQqYuQLOxCGQ22.png)

---

**Outcome Breakdown by Model (GSM8K subset)**

Side-by-side percentages for correctness vs error types.

- **QTM7-4B\***: 80.7% correct, 7.3% mismatch, **12.0% truncated**
- **Qwen3-4B**: 28.0% correct, **72.0% mismatch**, 0% truncated

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6466047a326128fd2c693cfa/MnG3Dx6Ud3l4lpcTIRsGp.png)

---

\* **QTM7-4B = 2hr checkpoint**

---

## Environmental Impact

Estimated using the [MLCO2 Impact Calculator](https://mlco2.github.io/impact#compute):

- **Hardware:** NVIDIA A100 (80GB)
- **GPU hours:** ~4
- **Cloud Provider:** Google Colab (us-central assumed)
- **Carbon emitted:** ≈ **1.2 kg CO2eq** *(about the same as driving ~5 km in a gasoline car)*

---

## Technical Specifications

- **Architecture:** Qwen3-4B transformer (4B params, decoder-only, rotary embeddings, SwiGLU, grouped query attention)
- **Objective:** Causal LM finetuning on reasoning tasks
- **Software:** PyTorch + Hugging Face Transformers + Datasets

---

## Summary

QTM7-4B is a minimal-budget proof-of-concept showing that:

- **Small compute can still move the needle** on reasoning with focused datasets.
- Math reasoning gains were observed even with short finetunes.
- The model is not benchmarked broadly, but shows promise as a low-resource experiment.
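---

## How to Get Started

A minimal generation sketch using Hugging Face Transformers. The repository id is the checkpoint listed under Sources; the prompt, dtype, and decoding settings are illustrative and may need adjusting for your hardware.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Ma7ee7/QTM7-4b-2hr-checkpoint"  # checkpoint listed under Sources

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # the card lists the checkpoint as fp16
    device_map="auto",
)

# Illustrative math word problem; the model was finetuned from a base model,
# so a plain completion-style prompt is used here.
prompt = (
    "A bakery sells 24 muffins in the morning and half as many in the afternoon. "
    "How many muffins does it sell in total? Think step by step."
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```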