Qwen3-0.6B-Base-CPT-Math

A continued-pretraining (CPT) adaptation of Qwen3-0.6B-Base, trained on mathematics-domain text to strengthen the model's mathematical knowledge and reasoning.

Model Details

Model Description

This model is Qwen3-0.6B-Base further trained with Continued Pretraining (CPT), using full parameter updates, on a curated mathematics pretraining dataset. Unlike instruction tuning, which uses Q&A pairs, continued pretraining exposes the model to raw mathematical text to deepen its grasp of mathematical concepts, notation, and problem-solving patterns.

Key characteristics:

  • Base Model: Qwen/Qwen3-0.6B-Base

  • Training Method: Full finetuning (100% parameter updates, no LoRA)

  • Domain: Mathematics

  • Context Length: Trained on packed sequences of 1024-2048 tokens

  • Optimization: Unsloth with Flash Attention 2

  • Developed by: Dayanand (based on Alibaba Qwen team's Qwen3-0.6B-Base)

  • Model type: Language Model (Decoder-only, Causal LM)

  • Language(s): English, with strong mathematical domain coverage

  • License: Qwen model's license (see Qwen/Qwen3-0.6B-Base)

  • Finetuned from model: Qwen/Qwen3-0.6B-Base

Model Sources

Uses

Direct Use

This model can be used for:

  • Mathematical text generation - Generate mathematical explanations, derivations, or proofs
  • Domain-specific language modeling - Continue text in mathematical contexts
  • Math problem analysis - Understand and analyze mathematical problems
  • Knowledge retrieval - Answer questions about mathematical concepts

Example usage:

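from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer. Prefix the ID with the Hub namespace
# (e.g. YOUR-USERNAME/Qwen3-0.6B-Base-CPT-Math) unless loading from a local directory.
model = AutoModelForCausalLM.from_pretrained("Qwen3-0.6B-Base-CPT-Math")
tokenizer = AutoTokenizer.from_pretrained("Qwen3-0.6B-Base-CPT-Math")

# This is a base model: give it text to continue rather than an instruction
inputs = tokenizer("Given a quadratic equation ax^2 + bx + c = 0", return_tensors="pt")
outputs = model.generate(**inputs, max_length=150)  # max_length counts prompt tokens too
print(tokenizer.decode(outputs[0], skip_special_tokens=True))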

Downstream Use

This model can be fine-tuned for:

  • Math Question Answering - Answer mathematical questions with detailed explanations
  • Mathematical Reasoning - Solve step-by-step math problems
  • Educational Content Generation - Create math tutorials and explanations
  • Mathematical Code Generation - Generate code for mathematical algorithms

Out-of-Scope Use

  • Non-English content generation - The model was trained primarily on English mathematical text
  • Real-time or safety-critical applications - Not suitable for safety-critical systems
  • General knowledge tasks outside mathematics - The model retains general language abilities but is optimized for the mathematical domain
  • Instruction following without further fine-tuning - This is a base model, not an instruction-tuned one
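
Because this is a base model, tasks are usually better framed as text to continue than as instructions. The sketch below builds a few-shot completion prompt. The helper name `build_few_shot_prompt`, the `Problem:`/`Solution:` layout, and the example problems are illustrative assumptions, not a format the model was trained on.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], question: str) -> str:
    """Format worked examples followed by an open slot for the model to continue.

    Hypothetical helper: the "Problem:/Solution:" layout is an assumption,
    not a template used during training.
    """
    parts = [f"Problem: {q}\nSolution: {a}" for q, a in examples]
    # End on an unfinished "Solution:" so the model continues with an answer.
    parts.append(f"Problem: {question}\nSolution:")
    return "\n\n".join(parts)


examples = [
    ("Solve for x: x + 3 = 7", "Subtract 3 from both sides: x = 4."),
    ("Solve for x: 3x = 12", "Divide both sides by 3: x = 4."),
]
prompt = build_few_shot_prompt(examples, "Solve for x: 2x + 5 = 13")
print(prompt)
```

The resulting string can be passed to the tokenizer in place of a bare question; whether it improves results for a given task would need to be checked empirically.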

Bias, Risks, and Limitations

Limitations

  1. Domain Specificity - Model performs best on mathematical content; general language performance may vary
  2. Model Size - 0.6B parameters means lower capability compared to larger models (7B+)
  3. Context Length - Training used sequences of 1024-2048 tokens, so quality on much longer documents is untested and may be limited
  4. Training Data Bias - Mathematical domain data may have specific biases and limitations
  5. Hallucination Risk - Like all language models, may generate plausible-sounding but incorrect mathematical statements

Risks

  • Mathematical Errors - May produce mathematically incorrect but grammatically plausible content
  • Computational Resource Requirements - While small, still requires GPU for efficient inference
  • Overconfidence - Model may express high confidence in incorrect mathematical statements

Recommendations

  1. Validation Required - Always validate mathematical outputs for correctness
  2. Human Review - Treat model outputs as assistance, not as an authoritative source
  3. Domain Expertise - Have domain experts review critical applications
  4. Testing - Thoroughly test on your specific use cases before deployment
  5. Prompt Engineering - Use clear, well-structured prompts for better results
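
As one way to act on the validation recommendation, a generated answer to a simple equation can be checked by substituting it back in. This is a minimal sketch for linear equations of the form a*x + b = c only. The helper names and the sample `generated` string are hypothetical, and taking the last number in the text as the answer is a simplifying assumption.

```python
import re
from fractions import Fraction
from typing import Optional


def extract_last_number(text: str) -> Optional[Fraction]:
    """Return the last integer, decimal, or simple fraction in text, or None."""
    matches = re.findall(r"-?\d+(?:/\d+|\.\d+)?", text)
    return Fraction(matches[-1]) if matches else None


def satisfies_linear_equation(a, b, c, x: Fraction) -> bool:
    """Check a*x + b == c using exact rational arithmetic."""
    return Fraction(a) * x + Fraction(b) == Fraction(c)


# Hypothetical model output for the prompt "Solve for x: 2x + 5 = 13"
generated = "Solve for x: 2x + 5 = 13. Subtract 5 to get 2x = 8, so x = 4"

candidate = extract_last_number(generated)
is_valid = candidate is not None and satisfies_linear_equation(2, 5, 13, candidate)
print(candidate, is_valid)
```

Substitution only confirms the final value. It says nothing about whether the intermediate steps are sound, so human review still applies.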

How to Get Started with the Model

Loading the Model

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model_id = "Qwen3-0.6B-Base-CPT-Math"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Generate text (do_sample=True is needed for temperature and top_p to take effect)
prompt = "The derivative of f(x) = x^3 + 2x^2 is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=100, do_sample=True, temperature=0.7, top_p=0.9)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)

With Unsloth (Faster Inference)

from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen3-0.6B-Base-CPT-Math",
    max_seq_length=1024,
    dtype=torch.bfloat16,
    load_in_4bit=True,
)

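# Switch Unsloth to its optimized inference mode, then use as normal
FastLanguageModel.for_inference(model)

prompt = "Solve for x: 2x + 5 = 13"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))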

Training Details

Training Data

  • Dataset: pritamdeb68/Math-Pretraining-Data
  • Split: train[:10000] (10,000 samples for this run)
  • Domain: Mathematics (problem sets, derivations, proofs, explanations)
  • Format: Raw text documents (continued pretraining format)

Data Preprocessing:

  • Tokenized using Qwen tokenizer
  • Packed into sequences of 1024-2048 tokens
  • No special instruction formatting (raw domain text)

Training Procedure

Preprocessing

  1. Tokenization - All documents tokenized with Qwen tokenizer
  2. Packing - Short documents concatenated to fill context window (1024+ tokens)
  3. Sequence Masking - Standard causal language modeling masking applied
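
The packing step can be illustrated with a small sketch: tokenized documents are joined into one stream, separated by an end-of-sequence token, and cut into fixed-size blocks. The function name `pack_sequences`, the toy token IDs, and the choice to drop the trailing partial block are assumptions for illustration. The packing used by TRL or Unsloth may differ in detail.

```python
from typing import Iterable, List


def pack_sequences(docs: Iterable[List[int]], block_size: int, eos_id: int) -> List[List[int]]:
    """Concatenate token-ID lists (EOS-separated) and split into full blocks."""
    buffer: List[int] = []
    blocks: List[List[int]] = []
    for ids in docs:
        buffer.extend(ids)
        buffer.append(eos_id)  # mark the document boundary
        while len(buffer) >= block_size:
            blocks.append(buffer[:block_size])
            buffer = buffer[block_size:]
    # Assumption: a trailing partial block is dropped rather than padded.
    return blocks


# Toy example: three short "documents" packed into blocks of 4 tokens
toy_docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed = pack_sequences(toy_docs, block_size=4, eos_id=0)
print(packed)  # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
```

With real data, `block_size` would be the maximum sequence length and `eos_id` the tokenizer's end-of-sequence ID.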

Training Hyperparameters

  • Training regime: bf16 mixed precision (bfloat16 compute; optimizer states are held in 8 bits, see Optimizer below)
  • Learning rate: 2e-5 (lower than typical LoRA due to full finetuning)
  • Warmup steps: 100
  • Per-device batch size: 4
  • Gradient accumulation steps: 4
  • Effective batch size: 16 (4 × 4)
  • Number of epochs: 1
  • Optimizer: AdamW 8-bit (memory efficient)
  • Weight decay: 0.01
  • Max sequence length: 1024
  • Logging steps: 20
  • Packing enabled: True (critical for CPT efficiency)
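
The hyperparameters above could be collected into a configuration along these lines. This is a sketch, not the author's training script. The keys in `training_args` follow Hugging Face `TrainingArguments` field names, while the keys in `sft_extras` are placeholder names, since the arguments for sequence length and packing depend on the TRL version.

```python
# Keys mirror Hugging Face TrainingArguments fields; values come from the list above.
training_args = {
    "learning_rate": 2e-5,
    "warmup_steps": 100,
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 4,
    "num_train_epochs": 1,
    "optim": "adamw_8bit",
    "weight_decay": 0.01,
    "logging_steps": 20,
    "bf16": True,
}

# Assumption: placeholder key names; check your TRL version for the exact ones.
sft_extras = {"max_seq_length": 1024, "packing": True}


def effective_batch_size(args: dict, num_devices: int = 1) -> int:
    """Sequences consumed per optimizer step."""
    return (
        args["per_device_train_batch_size"]
        * args["gradient_accumulation_steps"]
        * num_devices
    )


print(effective_batch_size(training_args))  # 16 on a single GPU
```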

Optimization Details

  • Unsloth Optimization: Flash Attention 2 enabled
  • Compute Capability Required: 8.0+ (A100, A10G, RTX 3090/4090, H100, etc.)
  • Memory Optimization: 8-bit AdamW for reduced optimizer state memory
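
Before training, it can help to confirm that the local GPU meets the compute capability listed above. A minimal sketch, assuming PyTorch is installed for the live check; the helper name `meets_min_capability` is hypothetical, and the threshold is taken from the list above.

```python
from typing import Tuple

MIN_CAPABILITY = (8, 0)  # from the "Compute Capability Required" line above


def meets_min_capability(capability: Tuple[int, int]) -> bool:
    """Compare a (major, minor) compute capability against the minimum."""
    return tuple(capability) >= MIN_CAPABILITY


try:
    import torch

    if torch.cuda.is_available():
        cap = torch.cuda.get_device_capability()  # e.g. (8, 6) on an A10G
        print(cap, meets_min_capability(cap))
    else:
        print("No CUDA device visible")
except ImportError:
    print("PyTorch is not installed")
```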

Speeds, Sizes, Times

  • Training Time: ~30-45 minutes on A10G GPU
  • Training Tokens: ~10M tokens
  • Model Size: ~1.2 GB (bf16 weights)
  • Peak VRAM: ~18-20 GB (on a 24 GB A10G)
  • Steps Completed: 312 total training steps
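
The figures above can be sanity-checked with simple arithmetic. The sketch below assumes a single GPU, an effective batch size of 16, and fully packed sequences, so the token counts are upper bounds. It also assumes 2 bytes per parameter for bf16 weights.

```python
STEPS = 312               # total training steps reported above
EFFECTIVE_BATCH = 16      # 4 per device x 4 accumulation steps, one GPU assumed
PARAMS = 0.6e9            # nominal parameter count
BYTES_PER_PARAM_BF16 = 2  # bfloat16 stores each weight in 2 bytes


def max_tokens_seen(steps: int, batch: int, seq_len: int) -> int:
    """Upper bound on tokens processed if every sequence is fully packed."""
    return steps * batch * seq_len


tokens_1024 = max_tokens_seen(STEPS, EFFECTIVE_BATCH, 1024)
tokens_2048 = max_tokens_seen(STEPS, EFFECTIVE_BATCH, 2048)
weights_gb = PARAMS * BYTES_PER_PARAM_BF16 / 1e9

print(f"{tokens_1024:,}")       # 5,111,808
print(f"{tokens_2048:,}")       # 10,223,616
print(f"{weights_gb:.1f} GB")   # 1.2 GB
```

Under these assumptions, the ~10M token figure corresponds to 2048-token sequences, while 1024-token sequences would give roughly 5M.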

Evaluation

Testing Data, Factors & Metrics

Testing Data

  • Evaluation conducted on held-out samples from Math-Pretraining-Data
  • Manual evaluation of mathematical accuracy and reasoning quality

Metrics

  • Training Loss: Final training loss ~2.34 after 1 epoch
  • Perplexity: Computed as the exponential of the validation loss
  • Manual Evaluation: Spot-check of generated mathematical content for:
    • Syntactic correctness
    • Mathematical accuracy
    • Coherence and relevance
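
Perplexity follows from the mean per-token cross-entropy loss (in nats) by exponentiation. The sketch below applies that to the reported training loss purely as an illustration. The card does not state the validation loss, so the resulting number is not a reported result.

```python
import math


def perplexity(mean_cross_entropy: float) -> float:
    """Perplexity is exp of the mean per-token cross-entropy (natural log base)."""
    return math.exp(mean_cross_entropy)


# Illustration only: 2.34 is the reported final *training* loss.
ppl = perplexity(2.34)
print(f"{ppl:.2f}")  # 10.38
```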

Results

Manual spot checks after continued pretraining suggest:

  • Effective domain knowledge transfer on mathematics
  • Improved mathematical terminology usage
  • Better mathematical problem structure understanding

Note: Comprehensive benchmark results are pending a formal evaluation run.

Model Examination

Interpretability Insights

  • Model successfully learned mathematical domain patterns through raw text exposure
  • Context window effectively used for multi-step mathematical reasoning
  • Maintains base model's general language capabilities while enhancing mathematical knowledge

Environmental Impact

Carbon emissions estimate:

  • Hardware Type: NVIDIA A10G Tensor GPU
  • Hours used: ~0.75 hours
  • Cloud Provider: Hugging Face Endpoints
  • Compute Region: US-based datacenter
  • Carbon Emitted: ~0.12 kg CO2eq (estimated using ML Impact calculator)

Training a 0.6B model is relatively efficient compared to larger models (7B+).

Technical Specifications

Model Architecture

  • Architecture: Transformer decoder-only (causal language model)
  • Parameters: 600M (0.6B)
  • Attention: Grouped-query attention (GQA), a multi-head self-attention variant, with causal masking
  • Activation: SiLU (Swish)
  • Positional Embeddings: Rotary Position Embeddings (RoPE)

Compute Infrastructure

Hardware

  • GPU: NVIDIA A10G (24GB VRAM)
  • Compute Capability: 8.6
  • CPU: AMD EPYC processor
  • Memory: 100+ GB system RAM

Software

  • PyTorch: 2.1+
  • Transformers: 4.40+
  • Unsloth: Latest version with Flash Attention 2
  • TRL: Hugging Face TRL library for SFTTrainer
  • Python: 3.12+

Citation

If you use this model, please cite:

BibTeX:

@misc{qwen3_0.6b_cpt_math,
  author = {Dayanand},
  title = {Qwen3-0.6B-Base-CPT-Math: Continued Pretraining for Mathematical Domain Adaptation},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/YOUR-USERNAME/Qwen3-0.6B-Base-CPT-Math}}
}

APA:

Dayanand. (2026). Qwen3-0.6B-Base-CPT-Math: Continued pretraining for mathematical domain adaptation. Hugging Face. https://huggingface.co/YOUR-USERNAME/Qwen3-0.6B-Base-CPT-Math

Also cite the base model:

Glossary

  • CPT (Continued Pretraining): Further pretraining of a base model on domain-specific data
  • Full Finetuning: Training all model parameters (vs. LoRA, which trains only small adapter matrices)
  • Flash Attention: Memory-efficient attention implementation enabling longer contexts
  • Packing: Concatenating multiple short documents into longer sequences for training efficiency
  • BF16: bfloat16 (Brain Floating Point), a 16-bit format that keeps the exponent range of FP32 with a shorter mantissa; supported natively on recent GPUs
  • Causal LM: Language model that predicts next token based on previous tokens
  • Perplexity: Exponential of the average per-token cross-entropy loss, a measure of model uncertainty; lower is better

More Information

For detailed implementation and reproducibility:

  • See GitHub Repository
  • Training script: main.py
  • Setup guide: README.md
  • Original research: Refer to Continued Pretraining literature

Model Card Authors

  • Card Author: Dayanand
  • Model Developer: Dayanand
  • Based on: Qwen Team (Alibaba Qwen3-0.6B-Base)

Model Card Contact

For questions or issues:
