Instructions for using Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B with libraries, inference providers, notebooks, and local apps. Follow the sections below to get started.
- Libraries
- Transformers
How to use Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B", dtype="auto")
```
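Note that AutoModel loads only the bare backbone without a language-modeling head. For generation through the low-level API, a sketch along the following lines should work (assumptions: a GPU is available, accelerate is installed for device_map, and peft is installed so transformers can resolve this LoRA adapter repo against its base model):

```python
# Minimal generation sketch using the task-specific AutoModelForCausalLM
# instead of the bare AutoModel backbone.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Who are you?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```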
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B with vLLM:
Install from pip and serve the model
```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
```
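Because the server exposes an OpenAI-compatible API, it can also be called from Python with the openai client (a sketch assuming the server started above is listening on localhost:8000; the api_key value is a placeholder, since vLLM does not validate it by default):

```python
# pip install openai
from openai import OpenAI

# Point the client at the local vLLM server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```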
Use Docker

```bash
docker model run hf.co/Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B
```
- SGLang
How to use Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B with SGLang:
Install from pip and serve the model
```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
```

Use Docker images
```bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
```
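The SGLang server speaks the same OpenAI-compatible protocol, so the curl call above translates directly to Python (a sketch using requests, assuming the server is listening on localhost:30000):

```python
import requests

# POST a chat completion to the local SGLang server.
resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B",
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
    },
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```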
- Unsloth Studio
How to use Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
```bash
curl -fsSL https://unsloth.ai/install.sh | sh

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B to start chatting
```
Install Unsloth Studio (Windows)
```powershell
irm https://unsloth.ai/install.ps1 | iex

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B to start chatting
```
Using Hugging Face Spaces for Unsloth
No setup is required: open https://huggingface.co/spaces/unsloth/studio in your browser and search for Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B to start chatting.
Load model with FastModel
```bash
pip install unsloth
```

```python
from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name="Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B",
    max_seq_length=2048,
)
```
- Docker Model Runner
How to use Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B with Docker Model Runner:
```bash
docker model run hf.co/Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B
```
Model Card for FinetunedLAMAtoR1-001-3B
Model Details
Technical Specifications
Model Architecture and Objective
- Base Model: Llama-3.2-3B-Instruct
- Architecture: Causal Decoder-Only Transformer
- Hidden Size: 3072
- Layers: 28
- Heads: 24
- Parameters: ~3.21B (loaded in 4-bit quantization; see the sanity check after this list)
- Precision: float16 (used for LoRA training and inference)
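As a sanity check, the parameter count can be roughly reconstructed from the architecture above (a sketch; the KV-head count, MLP intermediate size, and vocabulary size are assumptions taken from the standard Llama-3.2-3B configuration, not stated in this card):

```python
# Back-of-the-envelope parameter count for the architecture listed above.
# Assumed from the standard Llama-3.2-3B config: 8 KV heads (GQA),
# intermediate size 8192, vocab size 128256, tied embeddings, no biases.
hidden, layers, heads, kv_heads = 3072, 28, 24, 8
head_dim = hidden // heads                   # 128
intermediate, vocab = 8192, 128256

attn = 2 * hidden * hidden + 2 * hidden * (kv_heads * head_dim)  # q, o, k, v projections
mlp = 3 * hidden * intermediate              # gate, up, down projections
embed = vocab * hidden                       # tied with the LM head, counted once

total = layers * (attn + mlp) + embed        # norm weights omitted (negligible)
print(f"~{total / 1e9:.2f}B parameters")     # ~3.21B
```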
Compute Infrastructure
- Hardware: Tesla T4 GPU (Google Colab)
- VRAM Usage: ~2.24 GB (model) + training overhead (see the estimate after the Model Weights list)
- Quantization: 4-bit (QLoRA) via bitsandbytes
Model Weights
- Type: LoRA Adapter (PEFT)
- Adapter File Size: ~92 MB
- Total Saved Size: ~108 MB
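These size figures can be cross-checked with similar arithmetic (a sketch under stated assumptions: 4-bit weights for the quantized linear layers, embeddings kept in fp16 as bitsandbytes typically leaves them unquantized, and the ~92 MB adapter loaded alongside):

```python
# Rough memory estimate for the 4-bit quantized model plus LoRA adapter.
total_params = 3.21e9
embed_params = 128256 * 3072                 # assumed kept in fp16

quantized_bytes = (total_params - embed_params) * 0.5   # 4 bits per weight
fp16_bytes = embed_params * 2
adapter_bytes = 92e6

estimate_gb = (quantized_bytes + fp16_bytes + adapter_bytes) / 1e9
print(f"~{estimate_gb:.2f} GB")              # ~2.29 GB, near the ~2.24 GB reported
```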
Model Description
This model is a fine-tuned version of unsloth/Llama-3.2-3B-Instruct designed to mimic reflective, human-like stream-of-consciousness reasoning. It was trained using Unsloth on the ServiceNow-AI/R1-Distill-SFT dataset.
The model utilizes a specific system prompt to trigger a "thinking" process (Chain of Thought) before providing the final answer, aiming to replicate the reasoning capabilities seen in models like DeepSeek-R1.
- Developed by: Muhammad Shaheer Khan
- Model type: Causal Language Model (LoRA Fine-tune)
- Language(s) (NLP): English
- License: Llama 3.2 Community License
- Finetuned from model: unsloth/Llama-3.2-3B-Instruct
Uses
Direct Use
The model is intended for reasoning tasks where explainability and step-by-step logic are required. It excels at math problems, logic puzzles, and complex queries requiring iterative thought.
System Prompt: To activate the reasoning capabilities, you must use the following system prompt:
"You are a reflective assistant engaging in thorough, iterative reasoning, mimicking human stream-of-consciousness thinking. Your approach emphasizes exploration, self-doubt, and continuous refinement before coming up with an answer."
How to Get Started with the Model
You can use the model with the unsloth library for 2x faster inference, or with standard Hugging Face transformers (a sketch follows the Unsloth example below).
Using Unsloth (Recommended)
```python
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)

# Enable native 2x faster inference
FastLanguageModel.for_inference(model)

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)

sys_prompt = """You are a reflective assistant engaging in thorough, iterative reasoning, mimicking human stream-of-consciousness thinking. Your approach emphasizes exploration, self-doubt, and continuous refinement before coming up with an answer.
<problem>
{}
</problem>
"""

message = sys_prompt.format("If a dozen eggs cost $60, how much does one egg cost?")
messages = [{"role": "user", "content": message}]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(
    input_ids = inputs,
    max_new_tokens = 1024,
    use_cache = True,
    temperature = 1.5,
    min_p = 0.1,
)
print(tokenizer.batch_decode(outputs))
```
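If Unsloth is unavailable, a minimal sketch with standard transformers and bitsandbytes 4-bit loading should behave similarly (assumptions: bitsandbytes and peft are installed so the LoRA adapter repo resolves against its base model, and the repo's tokenizer already ships a suitable chat template):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)

# Same reasoning prompt as the Unsloth example above.
sys_prompt = """You are a reflective assistant engaging in thorough, iterative reasoning, mimicking human stream-of-consciousness thinking. Your approach emphasizes exploration, self-doubt, and continuous refinement before coming up with an answer.
<problem>
{}
</problem>
"""
messages = [{"role": "user", "content": sys_prompt.format("If a dozen eggs cost $60, how much does one egg cost?")}]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    inputs, max_new_tokens=1024, do_sample=True, temperature=1.5, min_p=0.1
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```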
Model tree for Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B
- Base model: meta-llama/Llama-3.2-3B-Instruct