File size: 8,416 Bytes

593865f
 
 
 
73a8f60
 
 
 
 
 
 
 
 
 
e89a277
73a8f60
da8557d
 
aa29e57
e89a277
 
 
 
 
 
 
ea6cfbf
 
 
 
 
 
 
 
593865f
 
e89a277
da8557d
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
73a8f60
da8557d
73a8f60
da8557d
73a8f60
 
 
da8557d
 
73a8f60
da8557d
73a8f60
 
 
 
 
 
da8557d
73a8f60
da8557d
73a8f60
 
 
 
da8557d
 
73a8f60
 
 
 
da8557d
 
73a8f60
 
 
da8557d
73a8f60
da8557d
73a8f60
 
da8557d
73a8f60
 
 
 
a6f19ec
 
 
fd74103
a6f19ec
 
 
 
 
 
da8557d

---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- unsloth
- qwen
- qwen2.5
- math
- reasoning
- alpaca
- pytorch
- custom-finetune
- lora-merged
base_model: unsloth/Qwen2.5-Math-1.5B
datasets:
- Xerv-AI/GRAD
- yahma/alpaca-cleaned
inference:
  parameters:
    repetition_penalty: 1.15
    max_new_tokens: 256
    temperature: 0.5
  examples:
    - text: "### Instruction:\nProvide a step-by-step logical proof finding the eigenvalues of the matrix [[2, 1], [1, 2]].\n### Response:\n"

widget:
  - example_title: Fibonacci (Python)
    messages:
    - role: system
      content: You are a chatbot who can help code!
    - role: user
      content: Write me a function to calculate the first 10 digits of the fibonacci sequence in Python and print it out to the CLI.
---


## 🌌 Xerv-AI/Ada: The Multi-Modal Mathematical Generalist SLM
**Ada** is an ultra-lightweight, high-speed, and highly optimized reasoning Small Language Model (SLM) derived from the powerful **Qwen2.5-Math-1.5B** architecture. Engineered specifically to bridge the gap between hyper-specialized graduate-level mathematical proofs and standard conversational utility, Ada solves the notorious "catastrophic forgetting" problem often found in math-heavy fine-tunes.
Whether you need a step-by-step calculus breakdown, a topological proof in LaTeX, or just a simple conversational assistant for daily tasks, Ada delivers state-of-the-art performance for a 1.5 Billion parameter model.

### 🚀 Model Overview 
Standard math-specific LLMs frequently suffer from domain overfitting. When prompted with basic conversational queries, they either hallucinate lengthy pseudo-proofs or fail entirely to understand the user's intent. **Xerv-AI/Ada** was meticulously engineered to resolve this by utilizing a carefully balanced, dual-distribution training dataset, allowing it to act as both a rigorous STEM assistant and a general-purpose chat model.

| Specification | Details |
| :--- | :--- |
| **Model Name** | Xerv-AI/Ada |
| **Base Architecture** | unsloth/Qwen2.5-Math-1.5B |
| **Parameter Count** | 1.5 Billion |
| **Primary Capabilities** | Graduate-level STEM reasoning, logical deduction, and mathematical proofs. |
| **Secondary Capabilities** | General conversational instruction-following, roleplay, and basic coding. |
| **Training Framework** | QLoRA via Unsloth (Triton kernels). |
| **Precision** | Merged 16-bit (Fine-tuned in 4-bit). |
| **License** | Apache-2.0 | <br> ### 🔬 Core Capabilities & Strengths <br> * **Balanced Generalization:** Ada seamlessly transitions between casual conversation and intense analytical problem-solving without format-forced hallucinations. <br> * **Advanced STEM Reasoning:** Fully optimized to generate detailed, multi-step logical proofs in advanced algebra, calculus, topology, and physics. <br> * **Hardware Optimized for Edge Deployment:** Designed to run at maximum inference throughput on low-VRAM consumer hardware (such as a single 16GB NVIDIA T4 GPU, Mac M-series chips, or edge devices) using 4-bit quantization. <br> * **Impeccable Formatting:** Native understanding of structural formatting, easily outputting highly readable markdown and structured logic steps. <br> ### 🏗 Architecture & Training Methodology <br> Ada was trained using Supervised Fine-Tuning (SFT) targeting the attention mechanisms of the base model. Utilizing **Unsloth** on a standard Google Colab NVIDIA T4 GPU, the training leveraged Low-Rank Adaptation (LoRA) to maximize efficiency before being merged into a standalone 16-bit Hugging Face model. <br> * **Target Modules:** q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj <br> * **LoRA Rank (r):** 16 <br> * **LoRA Alpha:** 16 <br> * **Optimizer:** adamw_8bit <br> * **Learning Rate:** 2e-4 <br> * **Effective Batch Size:** 8 (Batch size 2 with 4 Gradient Accumulation steps) <br> ### 📚 The Dataset: Dual-Distribution Blending <br> To achieve generalization and prevent catastrophic forgetting, Ada was fine-tuned on a strict 50/50 blend of two distinct datasets, batched and streamed via high-throughput Parquet files:
| Dataset | Sample Size | Description & Purpose |
| :--- | :--- | :--- |
| **Xerv-AI/GRAD** | ~1.93k rows | A proprietary synthetic dataset containing exceptionally long (average 8,000 characters) graduate and research-level mathematical proofs. This instills deep reasoning and strict formatting. |
| **yahma/alpaca-cleaned** | ~2.00k rows | A refined subset of the standard Alpaca dataset. This teaches the model conversational flow, roleplay, basic Q&A, and crucially, *when not to use complex math*. |

### 💻 Usage & Python Inference Guide
The model is highly responsive to the standard **Alpaca Instruction/Response template**.
**Important Inference Note:** For best results, use a repetition_penalty of roughly **1.15**. This acts as a crucial guardrail to prevent the model from infinitely looping through mathematical steps on overly simple arithmetic queries.
**1. Installation Requirements**
```bash
pip install unsloth transformers accelerate torch
```
**2. Fast Inference Script**
```python
from unsloth import FastLanguageModel
import torch
# Configuration
repo_name = "Xerv-AI/Ada"
max_seq_length = 2048
# Load the model and tokenizer (4-bit recommended for low-VRAM)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = repo_name,
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True, 
)
# Enable optimized inference mode
FastLanguageModel.for_inference(model)
# Define the universal prompt template
universal_prompt = """### Instruction:
{}
### Response:
{}"""
# Prepare your query
query = "Provide a step-by-step logical proof finding the eigenvalues of the matrix [[2, 1], [1, 2]]."
inputs = tokenizer(
    [universal_prompt.format(query, "")],
    return_tensors = "pt"
).to("cuda")
print("Generating analytical response...")
# Generate the output
outputs = model.generate(
    **inputs,
    max_new_tokens = 1024,
    max_length = None,           
    use_cache = True,
    repetition_penalty = 1.15,   # Critical: prevents generation loops
    pad_token_id = tokenizer.eos_token_id
)
# Decode and print the result
response = tokenizer.batch_decode(outputs, skip_special_tokens = True)[0]
print(f"\n{'='*50}\nOutput:\n{'='*50}")
print(response.split("### Response:\n")[-1])
```

### Performance Summary

| Dataset | Accuracy |
| :--- | :--- |
| **GSM8K** | **40.00%** |
| **MATH** |**60.00%** |
| **MATH-Hard** |**50.00%** |
| **GRAD** |**40.00%** |

### 🛡️ Safety & Alignment Guardrails
Despite being fine-tuned on raw mathematical logic and conversational instruction data, Ada successfully retains its foundational safety alignments. Because only 1% to 2% of the parameters were actively updated via LoRA (and subsequently merged), the original base Qwen2.5 weights responsible for safety remain fully intact.
 * **Content Moderation:** The model actively refuses to generate explicit, illegal, or harmful content, relying on the RLHF and DPO safety guardrails instilled during Alibaba's original pre-training phase.
### ⚠️ Limitations & Known Biases
While Ada punches well above its 1.5B weight class, it is important to acknowledge the limitations inherent to Small Language Models:
 * **Arithmetic Hallucinations:** Ada is exceptionally capable at symbolic logic, structural breakdowns, and mathematical theory. However, like many SLMs, it can occasionally suffer from minor arithmetic errors (e.g., basic addition/subtraction mistakes) deep within multi-page proofs. Always verify raw calculations.
 * **Language Constraint:** The model is optimized exclusively for **English** text and standard mathematical notation.
 * **Prompt Sensitivity:** Ada performs at its absolute peak when math queries explicitly ask for a "proof," "step-by-step breakdown," or "logical analysis" within the instruction block.
 * **World Knowledge:** It lacks the broad, encyclopedic trivia knowledge found in massive 70B+ parameter models.
### 🤝 Acknowledgements
 * **Alibaba Cloud:** For the phenomenal, state-of-the-art base Qwen2.5-Math architecture.
 * **Unsloth AI:** For the Triton-optimized training kernels that made compiling and fine-tuning this model possible and highly efficient on consumer hardware.
 * **Xerv-AI:** For the curation of the GRAD synthetic dataset powering the advanced reasoning capabilities.