Ada / README.md

Update README.md

ea6cfbf verified about 1 month ago

8.42 kB

	---
	license: apache-2.0
	language:
	- en
	pipeline_tag: text-generation
	tags:
	- unsloth
	- qwen
	- qwen2.5
	- math
	- reasoning
	- alpaca
	- pytorch
	- custom-finetune
	- lora-merged
	base_model: unsloth/Qwen2.5-Math-1.5B
	datasets:
	- Xerv-AI/GRAD
	- yahma/alpaca-cleaned
	inference:
	parameters:
	repetition_penalty: 1.15
	max_new_tokens: 256
	temperature: 0.5
	examples:
	- text: "### Instruction:\nProvide a step-by-step logical proof finding the eigenvalues of the matrix [[2, 1], [1, 2]].\n### Response:\n"

	widget:
	- example_title: Fibonacci (Python)
	messages:
	- role: system
	content: You are a chatbot who can help code!
	- role: user
	content: Write me a function to calculate the first 10 digits of the fibonacci sequence in Python and print it out to the CLI.
	---


	## 🌌 Xerv-AI/Ada: The Multi-Modal Mathematical Generalist SLM
	Ada is an ultra-lightweight, high-speed, and highly optimized reasoning Small Language Model (SLM) derived from the powerful Qwen2.5-Math-1.5B architecture. Engineered specifically to bridge the gap between hyper-specialized graduate-level mathematical proofs and standard conversational utility, Ada solves the notorious "catastrophic forgetting" problem often found in math-heavy fine-tunes.
	Whether you need a step-by-step calculus breakdown, a topological proof in LaTeX, or just a simple conversational assistant for daily tasks, Ada delivers state-of-the-art performance for a 1.5 Billion parameter model.

	### 🚀 Model Overview
	Standard math-specific LLMs frequently suffer from domain overfitting. When prompted with basic conversational queries, they either hallucinate lengthy pseudo-proofs or fail entirely to understand the user's intent. Xerv-AI/Ada was meticulously engineered to resolve this by utilizing a carefully balanced, dual-distribution training dataset, allowing it to act as both a rigorous STEM assistant and a general-purpose chat model.

	\| Specification \| Details \|
	\| :--- \| :--- \|
	\| Model Name \| Xerv-AI/Ada \|
	\| Base Architecture \| unsloth/Qwen2.5-Math-1.5B \|
	\| Parameter Count \| 1.5 Billion \|
	\| Primary Capabilities \| Graduate-level STEM reasoning, logical deduction, and mathematical proofs. \|
	\| Secondary Capabilities \| General conversational instruction-following, roleplay, and basic coding. \|
	\| Training Framework \| QLoRA via Unsloth (Triton kernels). \|
	\| Precision \| Merged 16-bit (Fine-tuned in 4-bit). \|
	\| License \| Apache-2.0 \| <br> ### 🔬 Core Capabilities & Strengths <br> * Balanced Generalization: Ada seamlessly transitions between casual conversation and intense analytical problem-solving without format-forced hallucinations. <br> * Advanced STEM Reasoning: Fully optimized to generate detailed, multi-step logical proofs in advanced algebra, calculus, topology, and physics. <br> * Hardware Optimized for Edge Deployment: Designed to run at maximum inference throughput on low-VRAM consumer hardware (such as a single 16GB NVIDIA T4 GPU, Mac M-series chips, or edge devices) using 4-bit quantization. <br> * Impeccable Formatting: Native understanding of structural formatting, easily outputting highly readable markdown and structured logic steps. <br> ### 🏗 Architecture & Training Methodology <br> Ada was trained using Supervised Fine-Tuning (SFT) targeting the attention mechanisms of the base model. Utilizing Unsloth on a standard Google Colab NVIDIA T4 GPU, the training leveraged Low-Rank Adaptation (LoRA) to maximize efficiency before being merged into a standalone 16-bit Hugging Face model. <br> * Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj <br> * LoRA Rank (r): 16 <br> * LoRA Alpha: 16 <br> * Optimizer: adamw_8bit <br> * Learning Rate: 2e-4 <br> * Effective Batch Size: 8 (Batch size 2 with 4 Gradient Accumulation steps) <br> ### 📚 The Dataset: Dual-Distribution Blending <br> To achieve generalization and prevent catastrophic forgetting, Ada was fine-tuned on a strict 50/50 blend of two distinct datasets, batched and streamed via high-throughput Parquet files:
	\| Dataset \| Sample Size \| Description & Purpose \|
	\| :--- \| :--- \| :--- \|
	\| Xerv-AI/GRAD \| ~1.93k rows \| A proprietary synthetic dataset containing exceptionally long (average 8,000 characters) graduate and research-level mathematical proofs. This instills deep reasoning and strict formatting. \|
	\| yahma/alpaca-cleaned \| ~2.00k rows \| A refined subset of the standard Alpaca dataset. This teaches the model conversational flow, roleplay, basic Q&A, and crucially, when not to use complex math. \|

	### 💻 Usage & Python Inference Guide
	The model is highly responsive to the standard Alpaca Instruction/Response template.
	Important Inference Note: For best results, use a repetition_penalty of roughly 1.15. This acts as a crucial guardrail to prevent the model from infinitely looping through mathematical steps on overly simple arithmetic queries.
	1. Installation Requirements
	```bash
	pip install unsloth transformers accelerate torch
	```
	2. Fast Inference Script
	```python
	from unsloth import FastLanguageModel
	import torch
	# Configuration
	repo_name = "Xerv-AI/Ada"
	max_seq_length = 2048
	# Load the model and tokenizer (4-bit recommended for low-VRAM)
	model, tokenizer = FastLanguageModel.from_pretrained(
	model_name = repo_name,
	max_seq_length = max_seq_length,
	dtype = None,
	load_in_4bit = True,
	)
	# Enable optimized inference mode
	FastLanguageModel.for_inference(model)
	# Define the universal prompt template
	universal_prompt = """### Instruction:
	{}
	### Response:
	{}"""
	# Prepare your query
	query = "Provide a step-by-step logical proof finding the eigenvalues of the matrix [[2, 1], [1, 2]]."
	inputs = tokenizer(
	[universal_prompt.format(query, "")],
	return_tensors = "pt"
	).to("cuda")
	print("Generating analytical response...")
	# Generate the output
	outputs = model.generate(
	**inputs,
	max_new_tokens = 1024,
	max_length = None,
	use_cache = True,
	repetition_penalty = 1.15, # Critical: prevents generation loops
	pad_token_id = tokenizer.eos_token_id
	)
	# Decode and print the result
	response = tokenizer.batch_decode(outputs, skip_special_tokens = True)[0]
	print(f"\n{'='50}\nOutput:\n{'='50}")
	print(response.split("### Response:\n")[-1])
	```

	### Performance Summary

	\| Dataset \| Accuracy \|
	\| :--- \| :--- \|
	\| GSM8K \| 40.00% \|
	\| MATH \|60.00% \|
	\| MATH-Hard \|50.00% \|
	\| GRAD \|40.00% \|

	### 🛡️ Safety & Alignment Guardrails
	Despite being fine-tuned on raw mathematical logic and conversational instruction data, Ada successfully retains its foundational safety alignments. Because only 1% to 2% of the parameters were actively updated via LoRA (and subsequently merged), the original base Qwen2.5 weights responsible for safety remain fully intact.
	* Content Moderation: The model actively refuses to generate explicit, illegal, or harmful content, relying on the RLHF and DPO safety guardrails instilled during Alibaba's original pre-training phase.
	### ⚠️ Limitations & Known Biases
	While Ada punches well above its 1.5B weight class, it is important to acknowledge the limitations inherent to Small Language Models:
	* Arithmetic Hallucinations: Ada is exceptionally capable at symbolic logic, structural breakdowns, and mathematical theory. However, like many SLMs, it can occasionally suffer from minor arithmetic errors (e.g., basic addition/subtraction mistakes) deep within multi-page proofs. Always verify raw calculations.
	* Language Constraint: The model is optimized exclusively for English text and standard mathematical notation.
	* Prompt Sensitivity: Ada performs at its absolute peak when math queries explicitly ask for a "proof," "step-by-step breakdown," or "logical analysis" within the instruction block.
	* World Knowledge: It lacks the broad, encyclopedic trivia knowledge found in massive 70B+ parameter models.
	### 🤝 Acknowledgements
	* Alibaba Cloud: For the phenomenal, state-of-the-art base Qwen2.5-Math architecture.
	* Unsloth AI: For the Triton-optimized training kernels that made compiling and fine-tuning this model possible and highly efficient on consumer hardware.
	* Xerv-AI: For the curation of the GRAD synthetic dataset powering the advanced reasoning capabilities.