|
|
--- |
|
|
license: apache-2.0 |
|
|
datasets: |
|
|
- unsloth/OpenMathReasoning-mini |
|
|
- nvidia/OpenCodeReasoning |
|
|
language: |
|
|
- en |
|
|
base_model: |
|
|
- Qwen/Qwen3-4B-Base |
|
|
pipeline_tag: text-generation |
|
|
--- |
|
|
# QTM7-4B |
|
|
|
|
|
QTM7-4B is a proof-of-concept math & code reasoning model, finetuned briefly from Qwen/Qwen3-4B-Base.
|
|
It was finetuned for ~4 hours on a single A100 GPU, using lightweight datasets focused on mathematical reasoning and structured problem solving. |
|
|
This project demonstrates what can be achieved on a minimal compute budget (≈$20 in total).
|
|
|
|
|
--- |
|
|
|
|
|
## UPDATE: Observed Performance Shift |
|
|
|
|
|
This model was explicitly trained using math and code datasets with the intent of achieving higher performance in structured reasoning compared to the base Qwen3-4B model. While quantitative GSM8K metrics show improved math ability, recent qualitative testing suggests an unexpected side effect: |
|
|
|
|
|
**QTM7-4B exhibits significantly enhanced performance in creative writing, narrative generation, and descriptive tasks compared to the Qwen3-4B base model.** |
|
|
|
|
|
The focused finetuning appears to have sharpened the model's handling of complex instructions and structure, which carries over into a stronger ability to generate cohesive and evocative creative content.
|
|
|
|
|
--- |
|
|
|
|
|
## Model Details |
|
|
|
|
|
- **Developed by:** Independent researcher (solo project) |
|
|
- **Funding:** Self-funded (~$20 total compute cost) |
|
|
- **Model type:** Decoder-only transformer for text generation |
|
|
- **Language(s):** English |
|
|
- **License:** Apache-2.0 |
|
|
- **Finetuned from:** [Qwen/Qwen3-4B-Base](https://huggingface.co/Qwen/Qwen3-4B-Base) |
|
|
|
|
|
### Sources |
|
|
- **Repository:** [Ma7ee7/QTM7-4b-2hr-checkpoint](https://huggingface.co/Ma7ee7/QTM7-4b-2hr-checkpoint) |
|
|
|
|
|
--- |
|
|
|
|
|
## Uses |
|
|
|
|
|
### Direct Use |
|
|
- Research into math & code reasoning |
|
|
- Proof-of-concept for low-budget finetuning on large language models |
|
|
- **New Focus:** Evaluation of how low-resource finetuning affects creative writing and narrative coherence.
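
For any of these uses, the checkpoint loads like a standard causal LM from the Hub. The snippet below is a minimal inference sketch using `transformers`; the prompt and generation settings are illustrative, not the ones used for the GSM8K evaluation.

```python
# Minimal inference sketch (prompt and generation settings are illustrative).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Ma7ee7/QTM7-4b-2hr-checkpoint"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "Solve step by step: a train travels 60 km in 45 minutes. What is its average speed in km/h?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```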
|
|
|
|
|
### Downstream Use |
|
|
- Potential basis for math problem solvers or code reasoning assistants |
|
|
- Experiments in lightweight alignment or evaluation pipelines |
|
|
|
|
|
### Out-of-Scope |
|
|
- Not suitable for safety-critical, legal, or medical applications |
|
|
- Not RLHF-aligned; outputs may be unfiltered or ungrounded |
|
|
|
|
|
--- |
|
|
|
|
|
## Bias, Risks, and Limitations |
|
|
- Inherits biases from Qwen3-4B-Base |
|
|
- Untested on broader NLP benchmarks (MMLU, ARC, etc.) |
|
|
- Training was short (~2 hours net, ~4 GPU hours total), so coverage is shallow |
|
|
- General conversational ability remains base-model level |
|
|
|
|
|
**Recommendation:** Treat outputs as experimental. Do not deploy in production or decision-making contexts. |
|
|
|
|
|
--- |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
- **unsloth/OpenMathReasoning-mini** — math reasoning dataset |
|
|
- **nvidia/OpenCodeReasoning** — code reasoning tasks |
|
|
- No GSM8K contamination was found in either the training or post-training data. |
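
The contamination check itself is not documented here; the sketch below shows one common way such a screen can be done (an 8-gram overlap test between training text and GSM8K test questions). It is purely an illustration, not necessarily the exact procedure used.

```python
# Illustrative n-gram overlap screen between a training corpus and the GSM8K
# test questions. Generic sketch, not necessarily the exact procedure behind
# the "no contamination" statement above.
from datasets import load_dataset

def ngrams(text, n=8):
    """Set of lowercase word n-grams in a string."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_count(train_texts, test_texts, n=8):
    """Number of test examples sharing at least one n-gram with the training text."""
    train_ngrams = set()
    for t in train_texts:
        train_ngrams |= ngrams(t, n)
    return sum(1 for t in test_texts if ngrams(t, n) & train_ngrams)

gsm8k_test = [row["question"] for row in load_dataset("gsm8k", "main", split="test")]
# `train_texts` would be the prompt/question fields of the finetuning datasets.
```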
|
|
|
|
|
### Procedure |
|
|
- Mixed precision: **fp16** |
|
|
- Optimizer: AdamW (standard defaults) |
|
|
- Duration: ~4 hours on **1x NVIDIA A100** |
|
|
- Checkpoint size: ~16 GB (fp16) |
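
The training script itself is not published. The sketch below is a representative setup consistent with the settings above (fp16, Trainer's default AdamW); the learning rate, batch sizes, and the placeholder dataset are assumptions, not the values actually used.

```python
# Representative causal-LM finetuning setup. Hyperparameters and the toy
# dataset are assumptions, not the exact configuration used for QTM7-4B.
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "Qwen/Qwen3-4B-Base"
tokenizer = AutoTokenizer.from_pretrained(base)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Placeholder corpus; the real run used the math/code reasoning datasets listed above.
train = Dataset.from_dict({"text": ["Question: 2 + 2 = ?\nAnswer: 4"]}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True,
)

args = TrainingArguments(
    output_dir="qtm7-4b",
    per_device_train_batch_size=2,   # assumed
    gradient_accumulation_steps=8,   # assumed
    learning_rate=2e-5,              # assumed; AdamW with standard defaults otherwise
    fp16=True,                       # matches the mixed-precision setting above
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```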
|
|
|
|
|
--- |
|
|
|
|
|
## Evaluation |
|
|
|
|
|
### Setup |
|
|
- Compared against **Qwen/Qwen3-4B** (post-trained version) |
|
|
- Dataset: **GSM8K test split** (subset of 300 “hard” problems) |
|
|
- Metrics: Exact match on final numeric answer |
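
"Exact match on the final numeric answer" means only the last number in each generation is compared against the GSM8K gold answer (which ends in `#### <number>`). The helper below is an illustrative implementation; the actual scoring script may differ in detail.

```python
# Illustrative exact-match scoring on the final numeric answer.
import re

def last_number(text):
    """Last integer/decimal in a string, with thousands separators stripped."""
    matches = re.findall(r"-?\d[\d,]*\.?\d*", text)
    return matches[-1].replace(",", "") if matches else None

def gsm8k_reference(answer):
    """GSM8K gold answers end with '#### <number>'."""
    return answer.split("####")[-1].strip().replace(",", "")

def exact_match(generation, gold_answer):
    pred = last_number(generation)
    return pred is not None and float(pred) == float(gsm8k_reference(gold_answer))

print(exact_match("... so the total is 42 apples.", "step 1 ...\n#### 42"))  # True
```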
|
|
|
|
|
### Results |
|
|
|
|
|
**Training Loss Curve** |
|
|
Stable convergence toward ~0.63 by step 1750, even as difficulty increased. |
|
|
|
|
|
 |
|
|
|
|
|
--- |
|
|
|
|
|
**GSM8K Accuracy (Sampled)** |
|
|
QTM7-4B* scored ~**80.7%** vs Qwen3-4B’s ~**28.0%**. |
|
|
|
|
|
 |
|
|
|
|
|
--- |
|
|
|
|
|
**Head-to-Head Outcomes** |
|
|
QTM7-4B* won most direct comparisons. |
|
|
- **Only QTM7-4B\*** correct → 171 |
|
|
- **Both** correct → 71 |
|
|
- **Both** wrong → 45 |
|
|
- **Only Qwen** correct → 13 |
|
|
|
|
|
 |
|
|
|
|
|
--- |
|
|
|
|
|
**Outcome Breakdown by Model (GSM8K subset)** |
|
|
Side-by-side percentages for correctness vs error types. |
|
|
|
|
|
- **QTM7-4B\***: 80.7% correct, 7.3% mismatch, **12.0% truncated** |
|
|
- **Qwen3-4B**: 28.0% correct, **72.0% mismatch**, 0% truncated |
|
|
|
|
|
 |
|
|
|
|
|
--- |
|
|
|
|
|
\* **QTM7-4B = 2hr checkpoint** |
|
|
|
|
|
--- |
|
|
|
|
|
## Environmental Impact |
|
|
|
|
|
Estimated using [MLCO2 Impact Calculator](https://mlco2.github.io/impact#compute): |
|
|
|
|
|
- **Hardware:** NVIDIA A100 (80GB) |
|
|
- **GPU hours:** ~4 |
|
|
- **Cloud Provider:** Google Colab (us-central assumed) |
|
|
- **Carbon emitted:** ≈ **1.2 kg CO2eq** |
|
|
|
|
|
*(About the same as driving ~5 km in a gasoline car.)* |
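
As a rough cross-check of that estimate, the arithmetic below uses an assumed A100 power draw and a range of grid carbon intensities; the ML CO2 Impact calculator's exact region factors may differ.

```python
# Back-of-the-envelope CO2 cross-check (all constants are assumptions).
gpu_hours = 4
gpu_power_kw = 0.40                        # A100 80GB, assumed near board TDP
energy_kwh = gpu_hours * gpu_power_kw      # 1.6 kWh

for intensity in (0.4, 0.75):              # kg CO2eq per kWh, illustrative range
    print(f"{intensity} kg/kWh -> {energy_kwh * intensity:.2f} kg CO2eq")
# Roughly 0.6-1.2 kg, in the same ballpark as the ~1.2 kg figure above.
```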
|
|
|
|
|
--- |
|
|
|
|
|
## Technical Specifications |
|
|
|
|
|
- **Architecture:** Qwen3-4B transformer (4B params, decoder-only, rotary embeddings, SwiGLU, grouped query attention) |
|
|
- **Objective:** Causal LM finetuning on reasoning tasks |
|
|
- **Software:** PyTorch + Hugging Face Transformers + Datasets |
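
These architectural details can be verified directly from the published config; the attributes below are the standard Transformers config fields.

```python
# Inspect the architecture straight from the base model's config.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("Qwen/Qwen3-4B-Base")
print(cfg.model_type)                                     # decoder-only Qwen3
print(cfg.hidden_size, cfg.num_hidden_layers)             # width and depth
print(cfg.num_attention_heads, cfg.num_key_value_heads)   # GQA: fewer KV heads than query heads
print(cfg.hidden_act)                                     # "silu", the activation inside the SwiGLU MLP
```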
|
|
|
|
|
--- |
|
|
|
|
|
## Summary |
|
|
|
|
|
QTM7-4B is a minimal-budget proof-of-concept showing that: |
|
|
- **Small compute can still move the needle** on reasoning with focused datasets. |
|
|
- Math reasoning gains were observed even with short finetunes. |
|
|
- The model is not benchmarked broadly, but shows promise as a low-resource experiment. |
|
|