|
|
--- |
|
|
license: apache-2.0 |
|
|
datasets: |
|
|
- unsloth/OpenMathReasoning-mini |
|
|
- nvidia/OpenCodeReasoning |
|
|
language: |
|
|
- en |
|
|
base_model: |
|
|
- Qwen/Qwen3-4B-Base |
|
|
pipeline_tag: text-generation |
|
|
--- |
|
|
# QTM7-4B |
|
|
|
|
|
QTM7-4B is a proof-of-concept math & code reasoning model, finetuned briefly from Qwen/Qwen3-4B-Base.
|
|
It was finetuned for ~4 hours on a single A100 GPU, using lightweight datasets focused on mathematical reasoning and structured problem solving. |
|
|
This project demonstrates what can be achieved on a minimal compute budget (≈$20 in total).
|
|
|
|
|
--- |
|
|
|
|
|
## UPDATE: Observed Performance Shift |
|
|
|
|
|
This model was explicitly trained using math and code datasets with the intent of achieving higher performance in structured reasoning compared to the base Qwen3-4B model. While quantitative GSM8K metrics show improved math ability, recent qualitative testing suggests an unexpected side effect: |
|
|
|
|
|
**QTM7-4B exhibits significantly enhanced performance in creative writing, narrative generation, and descriptive tasks compared to the Qwen3-4B base model.** |
|
|
|
|
|
The focused finetuning appears to have sharpened the model's handling of complex instructions and structure, which carries over into a stronger ability to generate cohesive and evocative creative content.
|
|
|
|
|
--- |
|
|
|
|
|
## Model Details |
|
|
|
|
|
- **Developed by:** Independent researcher (solo project) |
|
|
- **Funding:** Self-funded (~$20 total compute cost) |
|
|
- **Model type:** Decoder-only transformer for text generation |
|
|
- **Language(s):** English |
|
|
- **License:** Apache-2.0 |
|
|
- **Finetuned from:** [Qwen/Qwen3-4B-Base](https://huggingface.co/Qwen/Qwen3-4B-Base) |
|
|
|
|
|
### Sources |
|
|
- **Repository:** [Ma7ee7/QTM7-4b-2hr-checkpoint](https://huggingface.co/Ma7ee7/QTM7-4b-2hr-checkpoint) |
|
|
|
|
|
--- |
|
|
|
|
|
## Uses |
|
|
|
|
|
### Direct Use |
|
|
- Research into math & code reasoning |
|
|
- Proof-of-concept for low-budget finetuning on large language models |
|
|
- **New Focus:** Evaluation of how low-resource finetuning affects creative writing and narrative coherence.
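
For any of these uses, the checkpoint loads like a standard causal LM from the Hub. The snippet below is a minimal inference sketch using `transformers`; the prompt and generation settings are illustrative, not the ones used for the GSM8K evaluation.

```python
# Minimal inference sketch (prompt and generation settings are illustrative).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Ma7ee7/QTM7-4b-2hr-checkpoint"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "Solve step by step: a train travels 60 km in 45 minutes. What is its average speed in km/h?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```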
|
|
|
|
|
### Downstream Use |
|
|
- Potential basis for math problem solvers or code reasoning assistants |
|
|
- Experiments in lightweight alignment or evaluation pipelines |
|
|
|
|
|
### Out-of-Scope |
|
|
- Not suitable for safety-critical, legal, or medical applications |
|
|
- Not RLHF-aligned; outputs may be unfiltered or ungrounded |
|
|
|
|
|
--- |
|
|
|
|
|
## Bias, Risks, and Limitations |
|
|
- Inherits biases from Qwen3-4B-Base |
|
|
- Untested on broader NLP benchmarks (MMLU, ARC, etc.) |
|
|
- Training was short (~2 hours net, ~4 GPU hours total), so coverage is shallow |
|
|
- General conversational ability remains base-model level |
|
|
|
|
|
**Recommendation:** Treat outputs as experimental. Do not deploy in production or decision-making contexts. |
|
|
|
|
|
--- |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
- **unsloth/OpenMathReasoning-mini** — math reasoning dataset |
|
|
- **nvidia/OpenCodeReasoning** — code reasoning tasks |
|
|
- No GSM8K contamination was found in either the training or post-training data. |
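
The contamination check itself is not documented here; the sketch below shows one common way such a screen can be done (an 8-gram overlap test between training text and GSM8K test questions). It is purely an illustration, not necessarily the exact procedure used.

```python
# Illustrative n-gram overlap screen between a training corpus and the GSM8K
# test questions. Generic sketch, not necessarily the exact procedure behind
# the "no contamination" statement above.
from datasets import load_dataset

def ngrams(text, n=8):
    """Set of lowercase word n-grams in a string."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_count(train_texts, test_texts, n=8):
    """Number of test examples sharing at least one n-gram with the training text."""
    train_ngrams = set()
    for t in train_texts:
        train_ngrams |= ngrams(t, n)
    return sum(1 for t in test_texts if ngrams(t, n) & train_ngrams)

gsm8k_test = [row["question"] for row in load_dataset("gsm8k", "main", split="test")]
# `train_texts` would be the prompt/question fields of the finetuning datasets.
```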
|
|
|
|
|
### Procedure |
|
|
- Mixed precision: **fp16** |
|
|
- Optimizer: AdamW (standard defaults) |
|
|
- Duration: ~4 hours on **1x NVIDIA A100** |
|
|
- Checkpoint size: ~16 GB (fp16) |
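
The training script itself is not published. The sketch below is a representative setup consistent with the settings above (fp16, Trainer's default AdamW); the learning rate, batch sizes, and the placeholder dataset are assumptions, not the values actually used.

```python
# Representative causal-LM finetuning setup. Hyperparameters and the toy
# dataset are assumptions, not the exact configuration used for QTM7-4B.
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "Qwen/Qwen3-4B-Base"
tokenizer = AutoTokenizer.from_pretrained(base)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Placeholder corpus; the real run used the math/code reasoning datasets listed above.
train = Dataset.from_dict({"text": ["Question: 2 + 2 = ?\nAnswer: 4"]}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True,
)

args = TrainingArguments(
    output_dir="qtm7-4b",
    per_device_train_batch_size=2,   # assumed
    gradient_accumulation_steps=8,   # assumed
    learning_rate=2e-5,              # assumed; AdamW with standard defaults otherwise
    fp16=True,                       # matches the mixed-precision setting above
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```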
|
|
|
|
|
--- |
|
|
|
|
|
## Evaluation |
|
|
|
|
|
### Setup |
|
|
- Compared against **Qwen/Qwen3-4B** (post-trained version) |
|
|
- Dataset: **GSM8K test split** (subset of 300 “hard” problems) |
|
|
- Metrics: Exact match on final numeric answer |
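
"Exact match on the final numeric answer" means only the last number in each generation is compared against the GSM8K gold answer (which ends in `#### <number>`). The helper below is an illustrative implementation; the actual scoring script may differ in detail.

```python
# Illustrative exact-match scoring on the final numeric answer.
import re

def last_number(text):
    """Last integer/decimal in a string, with thousands separators stripped."""
    matches = re.findall(r"-?\d[\d,]*\.?\d*", text)
    return matches[-1].replace(",", "") if matches else None

def gsm8k_reference(answer):
    """GSM8K gold answers end with '#### <number>'."""
    return answer.split("####")[-1].strip().replace(",", "")

def exact_match(generation, gold_answer):
    pred = last_number(generation)
    return pred is not None and float(pred) == float(gsm8k_reference(gold_answer))

print(exact_match("... so the total is 42 apples.", "step 1 ...\n#### 42"))  # True
```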
|
|
|
|
|
### Results |
|
|
|
|
|
**Training Loss Curve** |
|
|
Stable convergence toward ~0.63 by step 1750, even as difficulty increased. |
|
|
|
|
|
 |
|
|
|
|
|
--- |
|
|
|
|
|
**GSM8K Accuracy (Sampled)** |
|
|
QTM7-4B* scored ~**80.7%** vs Qwen3-4B’s ~**28.0%**. |
|
|
|
|
|
 |
|
|
|
|
|
--- |
|
|
|
|
|
**Head-to-Head Outcomes** |
|
|
QTM7-4B* won most direct comparisons. |
|
|
- **Only QTM7-4B\*** correct → 171 |
|
|
- **Both** correct → 71 |
|
|
- **Both** wrong → 45 |
|
|
- **Only Qwen** correct → 13 |
|
|
|
|
|
 |
|
|
|
|
|
--- |
|
|
|
|
|
**Outcome Breakdown by Model (GSM8K subset)** |
|
|
Side-by-side percentages for correctness vs error types. |
|
|
|
|
|
- **QTM7-4B\***: 80.7% correct, 7.3% mismatch, **12.0% truncated** |
|
|
- **Qwen3-4B**: 28.0% correct, **72.0% mismatch**, 0% truncated |
|
|
|
|
|
 |
|
|
|
|
|
--- |
|
|
|
|
|
\* **QTM7-4B = 2hr checkpoint** |
|
|
|
|
|
--- |
|
|
|
|
|
## Environmental Impact |
|
|
|
|
|
Estimated using [MLCO2 Impact Calculator](https://mlco2.github.io/impact#compute): |
|
|
|
|
|
- **Hardware:** NVIDIA A100 (80GB) |
|
|
- **GPU hours:** ~4 |
|
|
- **Cloud Provider:** Google Colab (us-central assumed) |
|
|
- **Carbon emitted:** ≈ **1.2 kg CO2eq** |
|
|
|
|
|
*(About the same as driving ~5 km in a gasoline car.)* |
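
As a rough cross-check of that estimate, the arithmetic below uses an assumed A100 power draw and a range of grid carbon intensities; the ML CO2 Impact calculator's exact region factors may differ.

```python
# Back-of-the-envelope CO2 cross-check (all constants are assumptions).
gpu_hours = 4
gpu_power_kw = 0.40                        # A100 80GB, assumed near board TDP
energy_kwh = gpu_hours * gpu_power_kw      # 1.6 kWh

for intensity in (0.4, 0.75):              # kg CO2eq per kWh, illustrative range
    print(f"{intensity} kg/kWh -> {energy_kwh * intensity:.2f} kg CO2eq")
# Roughly 0.6-1.2 kg, in the same ballpark as the ~1.2 kg figure above.
```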
|
|
|
|
|
--- |
|
|
|
|
|
## Technical Specifications |
|
|
|
|
|
- **Architecture:** Qwen3-4B transformer (4B params, decoder-only, rotary embeddings, SwiGLU, grouped query attention) |
|
|
- **Objective:** Causal LM finetuning on reasoning tasks |
|
|
- **Software:** PyTorch + Hugging Face Transformers + Datasets |
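
These architectural details can be verified directly from the published config; the attributes below are the standard Transformers config fields.

```python
# Inspect the architecture straight from the base model's config.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("Qwen/Qwen3-4B-Base")
print(cfg.model_type)                                     # decoder-only Qwen3
print(cfg.hidden_size, cfg.num_hidden_layers)             # width and depth
print(cfg.num_attention_heads, cfg.num_key_value_heads)   # GQA: fewer KV heads than query heads
print(cfg.hidden_act)                                     # "silu", the activation inside the SwiGLU MLP
```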
|
|
|
|
|
--- |
|
|
|
|
|
## Summary |
|
|
|
|
|
QTM7-4B is a minimal-budget proof-of-concept showing that: |
|
|
- **Small compute can still move the needle** on reasoning with focused datasets. |
|
|
- Math reasoning gains were observed even with short finetunes. |
|
|
- The model is not benchmarked broadly, but shows promise as a low-resource experiment. |
|
|