---
library_name: transformers
tags:
- unsloth
- llama
- llama-3.2
- text-generation
- reasoning
- chain-of-thought
- lora
base_model: unsloth/Llama-3.2-3B-Instruct
datasets:
- ServiceNow-AI/R1-Distill-SFT
license: llama3.2
language:
- en
---
# Model Card for FinetunedLAMAtoR1-001-3B
## Technical Specifications
### Model Architecture and Objective
- **Base Model:** Llama-3.2-3B-Instruct
- **Architecture:** Causal Decoder-Only Transformer
- **Hidden Size:** 3072
- **Layers:** 28
- **Heads:** 24
- **Parameters:** ~3.21B (loaded in 4-bit quantization)
- **Precision:** float16 compute during LoRA training and inference
### Compute Infrastructure
- **Hardware:** Tesla T4 GPU (Google Colab)
- **VRAM Usage:** ~2.24 GB (Model) + Training Overhead
- **Quantization:** 4-bit (QLoRA) via `bitsandbytes`
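As a rough sanity check on the footprint (a back-of-envelope sketch, not the exact loader math): the 4-bit weights alone account for about 1.6 GB, and the remainder of the reported ~2.24 GB comes from layers kept in 16-bit (e.g. embeddings and norms) plus quantization constants.

```python
# Back-of-envelope VRAM estimate for the 4-bit weights alone
# (ignores 16-bit embeddings/norms and quantization constants).
n_params = 3.21e9
bits_per_weight = 4
weight_gb = n_params * bits_per_weight / 8 / 1e9  # roughly 1.6 GB
print(f"~{weight_gb:.1f} GB of the ~2.24 GB reported")
```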
### Model Weights
- **Type:** LoRA Adapter (PEFT)
- **Adapter File Size:** ~92 MB
- **Total Saved Size:** ~108 MB
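The adapter file size implies roughly 46M trainable parameters if the LoRA weights are stored in float16 (2 bytes each), i.e. about 1.4% of the base model. A sketch of that arithmetic:

```python
# Trainable-parameter count implied by the adapter file size,
# assuming float16 storage (2 bytes per parameter).
adapter_bytes = 92e6
lora_params = adapter_bytes / 2      # ~46M parameters
fraction = lora_params / 3.21e9      # ~1.4% of the base model
print(f"~{lora_params/1e6:.0f}M params, ~{fraction:.1%} of base")
```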
### Model Description
This model is a fine-tuned version of **[unsloth/Llama-3.2-3B-Instruct](https://huggingface.co/unsloth/Llama-3.2-3B-Instruct)** designed to mimic reflective, human-like stream-of-consciousness reasoning. It was trained using **[Unsloth](https://github.com/unslothai/unsloth)** on the **[ServiceNow-AI/R1-Distill-SFT](https://huggingface.co/datasets/ServiceNow-AI/R1-Distill-SFT)** dataset.
The model utilizes a specific system prompt to trigger a "thinking" process (Chain of Thought) before providing the final answer, aiming to replicate the reasoning capabilities seen in models like DeepSeek-R1.
- **Developed by:** Muhammad Shaheer Khan
- **Model type:** Causal Language Model (LoRA Fine-tune)
- **Language(s) (NLP):** English
- **License:** Llama 3.2 Community License
- **Finetuned from model:** unsloth/Llama-3.2-3B-Instruct
## Uses
### Direct Use
The model is intended for reasoning tasks where explainability and step-by-step logic are required: math problems, logic puzzles, and complex queries that benefit from iterative thought.
**System Prompt:**
To activate the reasoning capabilities, you must use the following system prompt:
> "You are a reflective assistant engaging in thorough, iterative reasoning, mimicking human stream-of-consciousness thinking. Your approach emphasizes exploration, self-doubt, and continuous refinement before coming up with an answer."
## How to Get Started with the Model
You can use the model with the `unsloth` library for 2x faster inference, or standard Hugging Face `transformers`.
### Using Unsloth (Recommended)
```python
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B",
    max_seq_length = 2048,
    dtype = None,        # auto-detect (float16 on a T4)
    load_in_4bit = True,
)

# Enable native 2x faster inference
FastLanguageModel.for_inference(model)

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)

# The reasoning prompt the adapter was trained with; the question
# goes inside the <problem> tags.
sys_prompt = """You are a reflective assistant engaging in thorough, iterative reasoning, mimicking human stream-of-consciousness thinking. Your approach emphasizes exploration, self-doubt, and continuous refinement before coming up with an answer.
<problem>
{}
</problem>
"""

message = sys_prompt.format("If a dozen eggs cost $60, how much does one egg cost?")
messages = [{"role": "user", "content": message}]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(
    input_ids = inputs,
    max_new_tokens = 1024,
    use_cache = True,
    temperature = 1.5,
    min_p = 0.1,
)
print(tokenizer.batch_decode(outputs)[0])
```
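### Using Transformers

Alternatively, here is a minimal sketch using plain `transformers` (with `peft` installed, so the LoRA adapter repo loads on top of the base model automatically). The loading arguments are assumptions; adjust them for your `transformers` version and hardware.

```python
MODEL_ID = "Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B"

# The reasoning prompt template the adapter was trained with.
SYS_PROMPT = """You are a reflective assistant engaging in thorough, iterative reasoning, mimicking human stream-of-consciousness thinking. Your approach emphasizes exploration, self-doubt, and continuous refinement before coming up with an answer.
<problem>
{}
</problem>
"""

def build_messages(question: str) -> list:
    """Wrap a question in the trained prompt template."""
    return [{"role": "user", "content": SYS_PROMPT.format(question)}]

def main():
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # With `peft` installed, transformers resolves the adapter repo and
    # loads the LoRA weights onto the base model automatically.
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.float16,
        device_map="auto",
    )
    inputs = tokenizer.apply_chat_template(
        build_messages("If a dozen eggs cost $60, how much does one egg cost?"),
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    outputs = model.generate(inputs, max_new_tokens=1024)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# main()  # uncomment to run (requires a GPU and downloads the model)
```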