Instructions for using Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B with libraries, inference providers, notebooks, and local apps. Follow the sections below to get started.
- Libraries
- Transformers
How to use Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B", dtype="auto")
```
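Note that AutoModel loads only the bare backbone without a language-modeling head. For generation through the low-level API, a sketch along the following lines should work (assumptions: a GPU is available, accelerate is installed for device_map, and peft is installed so transformers can resolve this LoRA adapter repo against its base model):

```python
# Minimal generation sketch using the task-specific AutoModelForCausalLM
# instead of the bare AutoModel backbone.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Who are you?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```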
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B with vLLM:
Install from pip and serve the model
```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
```
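Because the server exposes an OpenAI-compatible API, it can also be called from Python with the openai client (a sketch assuming the server started above is listening on localhost:8000; the api_key value is a placeholder, since vLLM does not validate it by default):

```python
# pip install openai
from openai import OpenAI

# Point the client at the local vLLM server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```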
Use Docker

```bash
docker model run hf.co/Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B
```
- SGLang
How to use Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B with SGLang:
Install from pip and serve the model
```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
```

Use Docker images
```bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
```
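The SGLang server speaks the same OpenAI-compatible protocol, so the curl call above translates directly to Python (a sketch using requests, assuming the server is listening on localhost:30000):

```python
import requests

# POST a chat completion to the local SGLang server.
resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B",
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
    },
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```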
- Unsloth Studio
How to use Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
```bash
curl -fsSL https://unsloth.ai/install.sh | sh

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B to start chatting
```
Install Unsloth Studio (Windows)
```powershell
irm https://unsloth.ai/install.ps1 | iex

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B to start chatting
```
Using Hugging Face Spaces for Unsloth
No setup is required: open https://huggingface.co/spaces/unsloth/studio in your browser and search for Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B to start chatting.
Load model with FastModel
```bash
pip install unsloth
```

```python
from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name="Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B",
    max_seq_length=2048,
)
```
- Docker Model Runner
How to use Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B with Docker Model Runner:
```bash
docker model run hf.co/Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B
```
Model Card for FinetunedLAMAtoR1-001-3B
Model Details
Technical Specifications
Model Architecture and Objective
- Base Model: Llama-3.2-3B-Instruct
- Architecture: Causal Decoder-Only Transformer
- Hidden Size: 3072
- Layers: 28
- Heads: 24
- Parameters: ~3.21B (loaded in 4-bit quantization; see the sanity check after this list)
- Precision: float16 (used for LoRA training and inference)
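As a sanity check, the parameter count can be roughly reconstructed from the architecture above (a sketch; the KV-head count, MLP intermediate size, and vocabulary size are assumptions taken from the standard Llama-3.2-3B configuration, not stated in this card):

```python
# Back-of-the-envelope parameter count for the architecture listed above.
# Assumed from the standard Llama-3.2-3B config: 8 KV heads (GQA),
# intermediate size 8192, vocab size 128256, tied embeddings, no biases.
hidden, layers, heads, kv_heads = 3072, 28, 24, 8
head_dim = hidden // heads                   # 128
intermediate, vocab = 8192, 128256

attn = 2 * hidden * hidden + 2 * hidden * (kv_heads * head_dim)  # q, o, k, v projections
mlp = 3 * hidden * intermediate              # gate, up, down projections
embed = vocab * hidden                       # tied with the LM head, counted once

total = layers * (attn + mlp) + embed        # norm weights omitted (negligible)
print(f"~{total / 1e9:.2f}B parameters")     # ~3.21B
```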
Compute Infrastructure
- Hardware: Tesla T4 GPU (Google Colab)
- VRAM Usage: ~2.24 GB (model) + training overhead (see the estimate after the Model Weights list)
- Quantization: 4-bit (QLoRA) via bitsandbytes
Model Weights
- Type: LoRA Adapter (PEFT)
- Adapter File Size: ~92 MB
- Total Saved Size: ~108 MB
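These size figures can be cross-checked with similar arithmetic (a sketch under stated assumptions: 4-bit weights for the quantized linear layers, embeddings kept in fp16 as bitsandbytes typically leaves them unquantized, and the ~92 MB adapter loaded alongside):

```python
# Rough memory estimate for the 4-bit quantized model plus LoRA adapter.
total_params = 3.21e9
embed_params = 128256 * 3072                 # assumed kept in fp16

quantized_bytes = (total_params - embed_params) * 0.5   # 4 bits per weight
fp16_bytes = embed_params * 2
adapter_bytes = 92e6

estimate_gb = (quantized_bytes + fp16_bytes + adapter_bytes) / 1e9
print(f"~{estimate_gb:.2f} GB")              # ~2.29 GB, near the ~2.24 GB reported
```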
Model Description
This model is a fine-tuned version of unsloth/Llama-3.2-3B-Instruct designed to mimic reflective, human-like stream-of-consciousness reasoning. It was trained using Unsloth on the ServiceNow-AI/R1-Distill-SFT dataset.
The model utilizes a specific system prompt to trigger a "thinking" process (Chain of Thought) before providing the final answer, aiming to replicate the reasoning capabilities seen in models like DeepSeek-R1.
- Developed by: Muhammad Shaheer Khan
- Model type: Causal Language Model (LoRA Fine-tune)
- Language(s) (NLP): English
- License: Llama 3.2 Community License
- Finetuned from model: unsloth/Llama-3.2-3B-Instruct
Uses
Direct Use
The model is intended for reasoning tasks where explainability and step-by-step logic are required. It excels at math problems, logic puzzles, and complex queries requiring iterative thought.
System Prompt: To activate the reasoning capabilities, you must use the following system prompt:
"You are a reflective assistant engaging in thorough, iterative reasoning, mimicking human stream-of-consciousness thinking. Your approach emphasizes exploration, self-doubt, and continuous refinement before coming up with an answer."
How to Get Started with the Model
You can use the model with the unsloth library for 2x faster inference, or with standard Hugging Face transformers (a sketch follows the Unsloth example below).
Using Unsloth (Recommended)
```python
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)

# Enable native 2x faster inference
FastLanguageModel.for_inference(model)

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)

sys_prompt = """You are a reflective assistant engaging in thorough, iterative reasoning, mimicking human stream-of-consciousness thinking. Your approach emphasizes exploration, self-doubt, and continuous refinement before coming up with an answer.
<problem>
{}
</problem>
"""

message = sys_prompt.format("If a dozen eggs cost $60, how much does one egg cost?")
messages = [{"role": "user", "content": message}]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(
    input_ids = inputs,
    max_new_tokens = 1024,
    use_cache = True,
    temperature = 1.5,
    min_p = 0.1,
)
print(tokenizer.batch_decode(outputs))
```
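If Unsloth is unavailable, a minimal sketch with standard transformers and bitsandbytes 4-bit loading should behave similarly (assumptions: bitsandbytes and peft are installed so the LoRA adapter repo resolves against its base model, and the repo's tokenizer already ships a suitable chat template):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)

# Same reasoning prompt as the Unsloth example above.
sys_prompt = """You are a reflective assistant engaging in thorough, iterative reasoning, mimicking human stream-of-consciousness thinking. Your approach emphasizes exploration, self-doubt, and continuous refinement before coming up with an answer.
<problem>
{}
</problem>
"""
messages = [{"role": "user", "content": sys_prompt.format("If a dozen eggs cost $60, how much does one egg cost?")}]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    inputs, max_new_tokens=1024, do_sample=True, temperature=1.5, min_p=0.1
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```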
Model tree for Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B
- Base model: meta-llama/Llama-3.2-3B-Instruct