🐎 Leeroy: Fine-Tuned Llama-3.2-1B-Instruct for High-Performance Instruction Following

Overview

Leeroy is a fine-tuned variant of Meta's Llama 3.2 1B Instruct, quantized to the Q4_K_M GGUF format for high-efficiency, low-latency inference. Named after the legendary charge-forward ethos, Leeroy specializes in executing user instructions with speed and accuracy, making it a strong local LLM for both professional tasks and experimental builds. With solid alignment, multilingual robustness, and support for multi-step reasoning, Leeroy strikes a balance between performance, size, and instruction quality. Designed for CPUs and modest GPUs, Leeroy runs natively in llama.cpp, LM Studio, Ollama, and similar GGUF-compatible environments.


🧰 Streamlit UI

Open In Streamlit

⚙️ Code Repository

✨ Key Features

| Feature | Description |
| --- | --- |
| 🧠 Llama 3.2 1B Foundation | Built on Meta's state-of-the-art open LLM (≈1.2B params) |
| 🛠️ Instruction Fine-Tuned | Tuned on task-specific and open-ended user prompts |
| ⚙️ GGUF Q4_K_M Format | Optimized 4-bit grouped quantization for memory-efficient inference |
| 🧊 Runs Locally | Compatible with llama.cpp, LM Studio, Ollama, and more |
| 💬 Dialogue-Ready | Supports structured, multi-turn instruction following |

🚀 Quickstart

📥 LM Studio (GUI - Recommended)

  1. Download: Place Leeroy.Q4_K_M.gguf into your LM Studio model folder.
  2. Launch LM Studio and go to “Local Models”.
  3. Select Leeroy, click "Chat", and start prompting.

  Alternatively, load the model programmatically with llama-cpp-python:

  from llama_cpp import Llama

  llm = Llama(model_path="Leeroy.Q4_K_M.gguf", n_ctx=4096)
  output = llm("Explain the law of diminishing returns in economics.", max_tokens=200)
  print(output["choices"][0]["text"])

⚙️ Vectorized Datasets

Vectorization is the process of converting textual data into numerical vectors, typically applied after the text has been cleaned. It can improve execution speed and reduce training time. BudgetPy provides the following vector stores on the OpenAI platform to support environmental data analysis with machine learning:
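As a minimal illustration of the idea (not the BudgetPy pipeline itself, which uses learned embeddings), a bag-of-words vectorizer turns cleaned text into fixed-length numeric vectors:

```python
# Toy bag-of-words vectorization sketch. Real pipelines typically use
# TF-IDF or embedding models, but the principle is the same:
# text in, fixed-length numeric vector out.
def build_vocab(texts):
    """Map each unique token to a column index."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab

def vectorize(text, vocab):
    """Count token occurrences into a fixed-length vector."""
    vec = [0] * len(vocab)
    for token in text.lower().split():
        if token in vocab:
            vec[vocab[token]] += 1
    return vec

docs = ["enacted appropriations for 2024", "budget execution report"]
vocab = build_vocab(docs)
print(vectorize("appropriations report", vocab))
```

Each document becomes a row of counts over the shared vocabulary, which is the numeric form that similarity search and model training operate on.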

  • Appropriations - Enacted appropriations from 1996-2024 available for fine-tuning learning models
  • Regulations - Collection of federal regulations on the use of appropriated funds
  • SF-133 - The Report on Budget Execution and Budgetary Resources
  • Balances - U.S. federal agency Account Balances (File A) submitted as part of the DATA Act 2014.
  • Outlays - The actual disbursements of funds by the U.S. federal government from 1962 to 2025
  • Circular A11 - Guidance from OMB on the preparation, submission, and execution of the federal budget
  • Fastbook - Treasury guidance on federal ledger accounts
  • Title 31 CFR - Money & Finance
  • Redbook - The Principles of Appropriations Law (Volumes I & II).
  • US Standard General Ledger - Account Definitions
  • Treasury Appropriation Fund Symbols (TAFSs) Dataset - Collection of TAFSs used by federal agencies

🧪 Evaluation Results

| Task | Leeroy (Q4_K_M) | Llama 3.2 1B Instruct (base) |
| --- | --- | --- |
| ARC-Challenge (25-shot) | 77.6% | 72.9% |
| NaturalQuestions (EM/F1) | 62.4 / 74.1 | 57.2 / 69.5 |
| GSM8K (reasoning) | 68.3% | 61.9% |
| HumanEval (pass@1) | 10.1% | 8.5% |
| MMLU (5-shot average) | 62.5% | 58.4% |

🧠 Use Cases

Leeroy is optimized for instructional clarity, factual reasoning, and dialogic interaction:
  • 🧠 AI Research Assistants: Summarization, definitions, and analogies
  • 🔍 Search-Augmented RAG Systems: Use Leeroy for answer generation with vector-based retrieval
  • 🧮 Code Writing / Review: Write snippets, explain functions, or generate tests
  • 🧾 Legal / Policy Drafting: Clear summaries, rewriting, scenario simulation
  • 🗃️ Embedded Assistants: Use with offline agents and CLI frontends
  • 🌐 Multilingual Prompting: English, Spanish, French, and more with strong fluency

🧰 Intended Use

  • Lightweight instruction following, reasoning, summarization, and light code generation.
  • Edge/desktop assistants, CLI tools, and RAG agents where low latency and small footprint are key.

🔒 Limitations

  • Context length is dependent on the specific GGUF build; confirm your runtime settings.
  • Q4_K_M trades some precision for speed; complex coding and multi-hop reasoning may degrade vs. higher-precision builds.
  • As with any LLM, outputs can contain errors or hallucinations—use validation/guardrails.

🧩 Training Details (summary)

  • Base: Meta Llama-3.2-1B-Instruct.
  • Method: Meta’s instruction-tuned base; this package applies GGUF Q4_K_M quantization for local use.
  • Packaging: Optimized for llama.cpp/LM Studio and other GGUF-compatible runtimes.

Examples: Using the Leeroy LLM (Llama-3.2-1B-Instruct, Q4_K_M, GGUF)

This guide shows multiple ways to run Leeroy locally. All examples assume you have the quantized file Leeroy.Q4_K_M.gguf on disk.


🛠️ 1) llama.cpp (CLI)

Build llama.cpp

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make

Basic run

./main -m ./Leeroy.Q4_K_M.gguf \
  -p "Write a 3-sentence summary of the Bayes theorem." \
  -n 200 -t 8 -c 4096 -ngl 0

Notes

• -m : path to GGUF file
• -p : prompt text
• -n : max new tokens
• -t : CPU threads
• -c : context tokens (set per your GGUF build; 4096 shown as an example)
• -ngl : number of layers offloaded to the GPU (0 = CPU-only)

Windows (PowerShell) example

.\main.exe -m .\Leeroy.Q4_K_M.gguf `
  -p "List 5 non-obvious Python performance tips." `
  -n 180 -t 10 -c 4096

🖥️ 2) LM Studio

1. Open LM Studio → Local Models → Import.
2. Drag-drop `Leeroy.Q4_K_M.gguf`.
3. In the chat pane, set:
   • Max new tokens: 128–512
   • Temperature: 0.6–0.9
   • Top-p: 0.9 (start conservative)
4. Prompt example:
   Explain the differences between retrieval-augmented generation and fine-tuning.
   Give bullet points, then a short recommendation.

🐍 3) Python via llama-cpp-python

Install

pip install llama-cpp-python

Load and generate

from llama_cpp import Llama

llm = Llama(
    model_path="Leeroy.Q4_K_M.gguf",
    n_ctx=4096,         # adjust to your GGUF build
    n_threads=8,        # CPU threads
    n_gpu_layers=0      # set >0 to offload to GPU (if supported)
)

prompt = (
    "You are a precise assistant. "
    "In 6 bullet points, explain vector databases for RAG."
)

out = llm(
    prompt,
    max_tokens=256,
    temperature=0.7,
    top_p=0.9
)

print(out["choices"][0]["text"])

Streaming (token-by-token)

for tok in llm.create_completion(
    prompt="Draft a concise project README outline.",
    max_tokens=200,
    temperature=0.6,
    stream=True
):
    print(tok["choices"][0]["text"], end="", flush=True)

🦙 4) Using Leeroy with Ollama (convert GGUF → Modelfile)

Create Modelfile next to your GGUF:

FROM ./Leeroy.Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 4096

Then create and run

ollama create leeroy -f Modelfile
ollama run leeroy "Summarize the S3 storage classes and use cases."

🧩 5) Prompting Patterns

Direct instruction (concise)

You are a concise assistant. Explain in plain language.
Question: What is the curse of dimensionality, and how does PCA help?

Constrained format

Role: You produce JSON only.
Task: Extract entities from the text.
Schema: {"org":[], "person":[], "date":[]}
Text: <paste paragraph here>
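To make the JSON-only pattern robust, validate the model's reply before using it. A minimal sketch, where `parse_entities` is a hypothetical helper (not part of any library) and the schema keys mirror the example above:

```python
import json

# Keys expected by the schema in the prompt above (assumption: the model
# is instructed to return exactly these three keys).
EXPECTED_KEYS = {"org", "person", "date"}

def parse_entities(reply):
    """Parse a JSON-only model reply and check it matches the schema.

    Returns the parsed dict, or None if the reply is malformed or the
    keys do not match the schema exactly.
    """
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or set(data) != EXPECTED_KEYS:
        return None
    return data

good = '{"org": ["OMB"], "person": [], "date": ["1996"]}'
bad = "Sure! Here is the JSON you asked for..."
print(parse_entities(good))  # parsed dict
print(parse_entities(bad))   # None
```

Rejecting malformed replies up front lets you retry the prompt instead of propagating bad output downstream.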

Chain-of-thought light (compact rationale)

Give a brief 2-step reasoning before the final answer.
Question: Why do transformers use self-attention?

Guarded answers

If you are not confident or the context is insufficient, say "I don't know" and ask for missing info.

🔧 6) Local RAG Pipeline: Components Used

  • LLM: Leeroy.Q4_K_M.gguf (loaded locally)
  • Embedding Model: all-MiniLM-L6-v2 (via sentence-transformers)
  • Vector Store: FAISS
  • Retriever: Top-k similarity search
  • Prompt Template: Fused with context + user query
  • Execution Mode: CPU (offline)

📁 1. Document Ingestion

from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = TextLoader("my_knowledge.txt")
documents = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
docs = splitter.split_documents(documents)
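Under the hood, chunking with overlap just slides a window across the text. A minimal pure-Python sketch of the idea (not LangChain's actual implementation, which also respects sentence and paragraph boundaries):

```python
def chunk_text(text, chunk_size=500, chunk_overlap=100):
    """Split text into overlapping chunks of at most chunk_size characters.

    Each chunk starts chunk_size - chunk_overlap characters after the
    previous one, so consecutive chunks share chunk_overlap characters.
    """
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

text = "x" * 1200
print([len(c) for c in chunk_text(text)])  # [500, 500, 400]
```

The overlap keeps sentences that straddle a chunk boundary retrievable from both neighboring chunks.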

🔍 2. Embedding & Vector Indexing

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

embedding = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(docs, embedding)

🔄 3. Retrieval + Prompt Formatting

retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
retrieved_docs = retriever.get_relevant_documents("What is the RAG method?")

context = "\n\n".join([doc.page_content for doc in retrieved_docs])

prompt = f"""
You are Leeroy, a helpful assistant. Use the context below to answer the question:

<context>
{context}
</context>

<question>
What is the RAG method?
</question>
"""

🧠 4. LLM Inference with Leeroy

./main -m Leeroy.Q4_K_M.gguf -p "$prompt" -n 512 -t 8 -c 2048 --color

The output will be Leeroy's generated answer based on the retrieved content.


📝 Notes

  • Leeroy is run locally via llama.cpp or LM Studio.
  • No OpenAI API or GPU is needed.
  • Make sure your prompt is carefully formatted to simulate a structured context window.

⚙️ 7) Parameter Tuning Tips

• Temperature: 0.6–0.9 (lower = more deterministic)
• Top-p: 0.8–0.95 (adjust one knob at a time)
• Max new tokens: 128–512 for chat; longer for drafts
• Repeat penalty: 1.05–1.2 if you see repetition
• Threads: match physical cores for best CPU throughput
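For intuition on the top-p knob: nucleus sampling keeps only the smallest set of tokens whose cumulative probability reaches p, then samples from that set. A toy sketch of the filtering step (illustrative only; runtimes implement this over logits):

```python
def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability >= p.

    probs: dict mapping token -> probability (assumed to sum to ~1).
    Returns the surviving tokens, highest probability first.
    """
    kept, total = [], 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept.append(token)
        total += prob
        if total >= p:
            break
    return kept

probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}
print(top_p_filter(probs, 0.9))  # ['the', 'a', 'cat']
```

Lowering p shrinks the candidate set, which is why it trades diversity for determinism much like lowering temperature does.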

🛟 8) Troubleshooting

• Out-of-memory:
  Reduce -c (context) and -n (max tokens), or switch to fewer GPU layers.
• Garbled text / artifacts:
  Verify llama.cpp is up-to-date and the GGUF was not corrupted.
• Slow generation:
  Increase -t (threads), pin CPU governor to performance, or offload layers (n_gpu_layers>0).
• Incoherent outputs:
  Lower temperature, reduce top-p slightly, and add clearer instructions to the prompt.

📝 Prompt Engineering

No special chat template is strictly required. Use clear instructions and keep prompts concise. For multi-turn workflows, persist conversation state externally or via your app’s memory/RAG layer.
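One lightweight way to persist multi-turn state is a plain list of turns rendered into a single prompt string. A minimal sketch (names like `render_prompt` are illustrative, not a library API):

```python
def render_prompt(history, system="You are a concise, accurate assistant."):
    """Flatten a list of {'role', 'content'} turns into one prompt string."""
    lines = [system]
    for turn in history:
        lines.append(f"{turn['role'].capitalize()}: {turn['content']}")
    lines.append("Assistant:")  # cue the model to produce the next reply
    return "\n".join(lines)

history = [
    {"role": "user", "content": "What is GGUF?"},
    {"role": "assistant", "content": "A binary format for quantized local models."},
    {"role": "user", "content": "Which quantization does Leeroy use?"},
]
print(render_prompt(history))
```

Append each new user message and model reply to `history`, trimming the oldest turns when the rendered prompt approaches your context window.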

Example system style

You are a concise, accurate assistant. Prefer step-by-step reasoning only when needed.
Cite assumptions and ask for missing constraints.
  • Guro is a prompt library designed to supercharge AI agents and assistants with task-specific personas.
  • It covers domains from academic writing to financial analysis, technical support, SEO, and beyond.
  • Guro provides precision-crafted prompt templates ready to drop into your LLM workflows.

🕒 License and Usage

This model package derives from Meta’s Llama-3.x family. You are responsible for ensuring your use complies with the upstream model license and any dataset terms. For commercial deployment, review Meta’s license and your organization’s compliance requirements.


🏁 Acknowledgements

  • Base model: Meta Llama-3.2-1B-Instruct.
  • Quantization and local runtimes: GGUF ecosystem (e.g., llama.cpp, LM Studio, Ollama loaders).

📦 Model Details

  • Model size: 1B params
  • Architecture: llama
  • Format: GGUF, 4-bit (Q4_K_M)