# CogniTune-Qwen2.5-3B
A domain-specialized AI/ML tutor model fine-tuned from Qwen2.5-3B-Instruct using LoRA on Apple Silicon (M5 Pro, 24GB unified memory) via MLX.
## What It Does
Standard LLMs respond to AI/ML questions like encyclopedias: dense, exhaustive, impersonal. CogniTune responds like a tutor: leading with a concrete analogy, explicitly correcting common misconceptions, and compressing the concept into a memorable one-liner.
**Base model response to "What is overfitting?"**

> Overfitting is a phenomenon in machine learning where a model learns the detail and noise in the training data to an extent that it negatively impacts performance on new, unseen data. Here are some key points... *(numbered Wikipedia-style bullets follow)*
**CogniTune response to "What is overfitting?"**

> Overfitting occurs when a model learns the training data too well, including its noise and random fluctuations, and fails to generalize to new, unseen data. Think of it like a student who memorizes every practice exam answer verbatim instead of understanding the underlying concepts...
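As an aside, the memorization behaviour both responses describe is easy to reproduce numerically. The sketch below is illustrative only (it is not part of the model or its training code): it fits low- and high-degree polynomials to the same noisy sample with NumPy, and the high-degree fit drives training error down while test error against the clean underlying function typically goes up.

```python
import numpy as np

# Toy data: noisy samples of a smooth function (an assumed example dataset)
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 15)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.3, x_train.size)
x_test = np.linspace(0.0, 1.0, 50)
y_test = np.sin(2 * np.pi * x_test)  # clean targets: what generalization is measured against

def fit_mse(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    poly = np.poly1d(np.polyfit(x_train, y_train, degree))
    return (np.mean((poly(x_train) - y_train) ** 2),
            np.mean((poly(x_test) - y_test) ** 2))

train_lo, test_lo = fit_mse(3)    # reasonable capacity
train_hi, test_hi = fit_mse(12)   # enough capacity to memorize the noise

print(f"degree 3:  train={train_lo:.4f}  test={test_lo:.4f}")
print(f"degree 12: train={train_hi:.4f}  test={test_hi:.4f}")
# The degree-12 fit always achieves lower training error; its test error is typically worse.
```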
## Usage
```shell
pip install mlx-lm

mlx_lm.generate \
  --model Pickamon/CogniTune-Qwen2.5-3B \
  --prompt "What is gradient boosting?" \
  --max-tokens 400
```
## Training Details
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen2.5-3B-Instruct |
| Method | LoRA |
| Framework | MLX (Apple Silicon native) |
| Hardware | Apple M5 Pro, 24GB unified memory |
| LoRA layers | 8 |
| LoRA rank | 8 |
| Learning rate | 5e-5 |
| Batch size | 4 |
| Optimal checkpoint | 100 iterations |
| Dataset size | ~460 examples |
| Training time | ~4 minutes per 100 iterations |
| Peak memory | ~10GB |
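A command along these lines reproduces the setup in the table. This is a hedged sketch: the flag names follow the `mlx_lm.lora` CLI conventions, `./data` and `./adapters` are placeholder paths, and a LoRA rank of 8 matches the MLX default (a non-default rank would need a YAML config instead).

```shell
# Sketch of the fine-tuning run (flag names assumed from the mlx_lm.lora CLI;
# ./data is a placeholder directory containing train.jsonl and valid.jsonl)
mlx_lm.lora \
  --model Qwen/Qwen2.5-3B-Instruct \
  --train \
  --data ./data \
  --num-layers 8 \
  --batch-size 4 \
  --learning-rate 5e-5 \
  --iters 100 \
  --adapter-path ./adapters
```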
## Dataset
~460 hand-crafted AI/ML Q&A pairs covering topics including:
- Neural network fundamentals (backprop, activations, normalization)
- Training dynamics (optimizers, learning rate scheduling, regularization)
- Architectures (transformers, CNNs, RNNs, LSTMs)
- Modern LLM concepts (attention, LoRA, RLHF, RAG)
- Classical ML (SVMs, decision trees, ensemble methods)
- Evaluation metrics and experimental methodology
Two format styles were used deliberately:
- Structured 4-part format: explanation → analogy → misconception → one-liner
- Varied formats: numbered steps, comparison tables, debug-style walkthroughs, decision guides, code-grounded explanations
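A single example in the structured 4-part style might look like the JSONL record below. This is a hypothetical record (the dataset itself is not published), and the `prompt`/`completion` schema is one of the JSONL formats `mlx_lm.lora` accepts; chat-style `messages` records also work.

```python
import json

# Hypothetical training record in the structured 4-part style:
# explanation -> analogy -> misconception -> one-liner.
record = {
    "prompt": "What is dropout?",
    "completion": (
        "Dropout randomly zeroes a fraction of activations during training, "
        "forcing the network not to rely on any single unit. "
        "Think of it like a sports team practicing with random players benched, "
        "so no play depends on one star. "
        "Common misconception: dropout is applied at inference time; it is not, "
        "activations are rescaled instead. "
        "One-liner: dropout trains an implicit ensemble by randomly silencing units."
    ),
}

line = json.dumps(record)  # one JSON object per line in train.jsonl
print(line[:60] + "...")
```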
## Key Experimental Findings
- Optimal early stopping at ~100 iterations regardless of dataset size
- Varied format data reduced best validation loss by 11% over uniform templates
- Style transfer confirmed: the fine-tuned model leads with analogies where the base model produces encyclopedic bullets
- Factual accuracy is orthogonal to style fine-tuning: the adapter shapes presentation without correcting base-model knowledge
- Out-of-distribution topics produce shorter responses with higher hallucination risk
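The first finding, picking the checkpoint with the lowest validation loss rather than training to a fixed budget, reduces to a one-line selection over the logged (iteration, validation loss) pairs. A minimal sketch with made-up loss values (not the real run's logs):

```python
# Checkpoint selection by validation loss (illustrative numbers only)
val_log = [(50, 1.42), (100, 1.18), (150, 1.21), (200, 1.30), (250, 1.37)]

best_iter, best_loss = min(val_log, key=lambda pair: pair[1])
print(f"best checkpoint: iteration {best_iter} (val loss {best_loss})")
# -> best checkpoint: iteration 100 (val loss 1.18)
```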
## Limitations

- Hallucinations persist on topics requiring precise factual recall: this model teaches style, not facts
- Out-of-distribution topics (outside the AI/ML domain) revert toward base model behavior
- Responses on unseen topics are shorter and less structured than responses on training-adjacent topics
- Not suitable for high-stakes factual lookup; use RAG for that
## Evaluation
Qualitative comparison against base Qwen2.5-3B-Instruct on identical prompts:
| Prompt | Base Model | CogniTune |
|---|---|---|
| "What is overfitting?" | Numbered bullets, encyclopedic, cut off at token limit | Analogy-led, complete, self-contained |
| "What is the vanishing gradient problem?" | Textbook definition | Mechanism explanation + one-liner |
| "What is federated learning?" | Dense paragraph | Analogy + concise explanation |
## Environmental Impact

- Hardware: Apple M5 Pro (no discrete GPU; Neural Engine + GPU cores)
- Training time: ~45 minutes total across all experiments
- Cloud provider: None; fully local training
- Carbon footprint: Minimal (Apple Silicon is significantly more energy-efficient than GPU-cluster training)
## Author

Irtiza Saleem, MSc Artificial Intelligence and Computer Science, University of Birmingham Dubai
GitHub: Pickamon
LinkedIn: irtiza-saleem