Model Summary
Base Model
The foundation of this system is the Qwen3-8B architecture, which serves as the base for all subsequent enhancements. With 8 billion parameters, the model has substantial capacity for complex reasoning and problem solving, and it operates in BF16 precision. The base model demonstrates particularly strong mathematical reasoning and problem-solving abilities, making it an ideal foundation for further specialisation in mathematical domains.
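As a minimal sketch, the base model can be loaded in BF16 with the Hugging Face Transformers library, assuming the public checkpoint id `Qwen/Qwen3-8B` (the exact loading code used here is an assumption):

```python
# Load the Qwen3-8B base model and tokenizer in BF16 precision.
# Assumes the public Hugging Face checkpoint "Qwen/Qwen3-8B".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the precision used in this project
)
```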
Fine-tuning Dataset
The enhancement of the base model has been achieved through careful fine-tuning using a specialised dataset known as NuminaMath-TIR. This dataset, consisting of 70k data points, is specifically designed for mathematical problem-solving with tool-integrated reasoning, providing a comprehensive collection of mathematical problems accompanied by detailed solution chains.
The dataset can be found here: NuminaMath-TIR
Training Setup
The training methodology employs LoRA, which stands for Low-Rank Adaptation, a sophisticated technique that enables efficient fine-tuning of large language models. The LoRA configuration uses a rank value of 64, which determines the dimensionality of the low-rank matrices used in the adaptation process. The LoRA Alpha parameter is set to 128, controlling the scaling of the adaptation weights relative to the original model parameters.
The target modules for LoRA adaptation include several critical components of the transformer architecture. The query projection (q_proj), key projection (k_proj), value projection (v_proj), and output projection (o_proj) modules handle the attention mechanism computations. Additionally, the gating projection (gate_proj), up projection (up_proj), and down projection (down_proj) modules manage the feed-forward network operations within each transformer layer.
This configuration results in approximately 1.5% of the total model parameters being trainable during the fine-tuning process, making the training highly efficient whilst maintaining the model's core capabilities. A dropout rate of 0.05 is applied to prevent overfitting and ensure robust generalisation to new mathematical problems.
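The LoRA configuration described above can be sketched with the Hugging Face PEFT library; the surrounding script details are assumptions, but the rank, alpha, dropout, and target modules follow the values stated in this section:

```python
# LoRA configuration matching the values described above (PEFT library).
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,               # rank of the low-rank adaptation matrices
    lora_alpha=128,     # scaling of the adaptation weights
    lora_dropout=0.05,  # dropout applied to the LoRA layers
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # feed-forward projections
    ],
)
```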
The training notebook can be found here: qwen3-8b-finetune-numinamath. (Note that "deepseek_lora_checkpoints" is an unchanged placeholder name inherited from the training notebook this was forked from, which we had used to fine-tune a DeepSeek distill Qwen 7B model.)
Training
During training, both the training and validation loss decrease consistently over approximately 9000 steps (2 epochs), reaching a validation loss of around 0.35, before signs of overfitting emerge. We take the checkpoint at the 9000th step as the final weights for merging.
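The merge step can be sketched with PEFT's `merge_and_unload`, which folds the LoRA deltas into the base weights; the checkpoint paths below are placeholders, not the actual artefact names:

```python
# Merge the step-9000 LoRA checkpoint back into the base model (sketch).
# Paths are placeholders; the real checkpoint location will differ.
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, "path/to/checkpoint-9000")
merged = model.merge_and_unload()  # fold LoRA updates into the base weights
merged.save_pretrained("qwen3-8b-numinamath-merged", safe_serialization=True)
```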
Hyperparameters
The training process utilises a carefully calibrated set of hyperparameters to optimise learning efficiency and model performance. The learning rate is set to 2e-4 and follows a cosine scheduler, which gradually reduces the learning rate throughout training to ensure stable convergence. The batch size is configured as 1 per device, but this is enhanced through gradient accumulation techniques.
Gradient accumulation is performed over 16 steps, creating an effective batch size of 16, which provides stable gradient estimates whilst managing memory constraints. The maximum sequence length is limited to 2048 tokens, ensuring that the model can handle substantial mathematical problems whilst maintaining computational efficiency. The training process spans 3 epochs, providing sufficient exposure to the training data without excessive overfitting.
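The step counts above can be sanity-checked with a little arithmetic, assuming the full ~70k NuminaMath-TIR examples were used for training (the exact train/validation split is an assumption):

```python
# Sanity check: effective batch size and optimiser steps per epoch,
# assuming ~70k training examples (the exact split is an assumption).
dataset_size = 70_000
per_device_batch = 1
grad_accum_steps = 16

effective_batch = per_device_batch * grad_accum_steps  # 16
steps_per_epoch = dataset_size // effective_batch      # 4375
two_epoch_steps = 2 * steps_per_epoch                  # 8750

print(effective_batch, steps_per_epoch, two_epoch_steps)
```

Two epochs come to roughly 8750 optimiser steps, consistent with the ~9000 steps at which the best checkpoint was taken.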
The optimisation process employs AdamW in 8-bit precision, which reduces memory requirements whilst maintaining training stability. Weight decay is set to 0.01 to regularise the model and prevent overfitting. The warmup ratio of 0.03 ensures that the learning rate gradually increases at the beginning of training, promoting stable initial learning. Gradient clipping is applied with a maximum norm of 1.0 to prevent gradient explosion and maintain training stability.
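Taken together, the hyperparameters in this section can be expressed as a Transformers `TrainingArguments` sketch; the output directory and logging settings are assumptions, but the numerical values follow the text:

```python
# Hyperparameters from this section expressed as TrainingArguments (sketch).
# Output directory and unspecified settings are assumptions.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qwen3-8b-numinamath",   # placeholder path
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,     # effective batch size of 16
    num_train_epochs=3,
    warmup_ratio=0.03,
    weight_decay=0.01,
    max_grad_norm=1.0,                  # gradient clipping
    optim="adamw_8bit",                 # 8-bit AdamW via bitsandbytes
    bf16=True,                          # mixed-precision training
    gradient_checkpointing=True,
)
```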
Model Capabilities
Strengths
The enhanced model demonstrates significantly improved mathematical reasoning capabilities, enabling it to tackle complex mathematical problems that require sophisticated analytical thinking. The chain-of-thought reasoning has been substantially enhanced, allowing the model to generate clear, step-by-step solutions that follow logical mathematical progression.
Problem understanding represents another key strength, with the model showing improved comprehension of mathematical problem statements, including the ability to parse complex word problems and identify the underlying mathematical concepts. Solution verification capabilities enable the model to explain and verify mathematical solutions, providing confidence in the accuracy of its outputs.
The model excels at handling multi-step problems that require sequential reasoning and the integration of multiple mathematical concepts. This capability is particularly valuable for complex mathematical scenarios that mirror real-world problem-solving requirements.
Technical Specifications
Memory Requirements
The technical implementation requires careful consideration of computational resources. During training, approximately 24GB of VRAM is required, though this can be reduced through various optimisation techniques. For inference operations, the model requires 3-4GB of VRAM when operating in BF16 precision, making it accessible for deployment on modern GPU hardware. The merged model size is approximately 3GB when stored in safetensors format, ensuring efficient storage and loading.
Performance Optimisations
Several advanced optimisation techniques have been implemented to enhance performance and efficiency. Flash Attention 2 provides efficient attention computation, reducing memory usage and computational overhead during both training and inference. Gradient checkpointing enables memory-efficient training by trading computational time for memory usage, allowing larger models to be trained on limited hardware.
The 8-bit Adam optimiser reduces memory requirements during training whilst maintaining optimisation effectiveness. Mixed precision training using BF16 format balances numerical precision with computational efficiency. Finally, the LoRA technique itself represents a fundamental optimisation, enabling parameter-efficient fine-tuning that achieves excellent results whilst requiring minimal additional parameters and computational resources.
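A minimal sketch of enabling these optimisations at load time, assuming the `flash-attn` package and a supported GPU are available (the checkpoint id is the assumed base model, not the merged artefact):

```python
# Enable BF16 weights and Flash Attention 2 at load time (sketch).
# Requires the flash-attn package and a compatible GPU.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
# Trade compute for memory during training; not needed for inference.
model.gradient_checkpointing_enable()
```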