Instructions to use leeroy-jankins/leeroy with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use leeroy-jankins/leeroy with Transformers:
# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("leeroy-jankins/leeroy", dtype="auto") - llama-cpp-python
How to use leeroy-jankins/leeroy with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="leeroy-jankins/leeroy", filename="Leeroy-3B-Instruct.Q4_K_M.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use leeroy-jankins/leeroy with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf leeroy-jankins/leeroy:Q4_K_M # Run inference directly in the terminal: llama-cli -hf leeroy-jankins/leeroy:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf leeroy-jankins/leeroy:Q4_K_M # Run inference directly in the terminal: llama-cli -hf leeroy-jankins/leeroy:Q4_K_M
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf leeroy-jankins/leeroy:Q4_K_M # Run inference directly in the terminal: ./llama-cli -hf leeroy-jankins/leeroy:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf leeroy-jankins/leeroy:Q4_K_M # Run inference directly in the terminal: ./build/bin/llama-cli -hf leeroy-jankins/leeroy:Q4_K_M
Use Docker
docker model run hf.co/leeroy-jankins/leeroy:Q4_K_M
- LM Studio
- Jan
- Ollama
How to use leeroy-jankins/leeroy with Ollama:
ollama run hf.co/leeroy-jankins/leeroy:Q4_K_M
- Unsloth Studio
How to use leeroy-jankins/leeroy with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for leeroy-jankins/leeroy to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for leeroy-jankins/leeroy to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for leeroy-jankins/leeroy to start chatting
- Pi
How to use leeroy-jankins/leeroy with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf leeroy-jankins/leeroy:Q4_K_M
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "leeroy-jankins/leeroy:Q4_K_M" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use leeroy-jankins/leeroy with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf leeroy-jankins/leeroy:Q4_K_M
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default leeroy-jankins/leeroy:Q4_K_M
Run Hermes
hermes
- Docker Model Runner
How to use leeroy-jankins/leeroy with Docker Model Runner:
docker model run hf.co/leeroy-jankins/leeroy:Q4_K_M
- Lemonade
How to use leeroy-jankins/leeroy with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull leeroy-jankins/leeroy:Q4_K_M
Run and chat with the model
lemonade run user.leeroy-Q4_K_M
List all available models
lemonade list
llm.create_chat_completion(
messages = "No input example has been defined for this model task."
)- Examples: Using the Leeroy LLM (Llama-3.2-8B-Instruct, Q4_K_M, GGUF)
🐎 Leeroy: Fine-Tuned Llama-3.2-1B-Instruct for High-Performance Instruction Following
Overview
Leeroy is a LLM fine-tuned variant of Meta's Llama 3.2 1B Instruct, quantized to Q4_K_M GGUF format for high-efficiency, low-latency inference. Named after the legendary charge-forward ethos, Leeroy specializes in executing user instructions with speed and accuracy — making it the ideal local LLM for both professional tasks and experimental builds.
With strong alignment capabilities, multilingual robustness, and support for complex multi-step reasoning, Leeroy strikes a balance between performance, size, and instruction quality. Designed for use on CPUs and modest GPUs, Leeroy runs natively in llama.cpp, LM Studio, Ollama, and similar GGUF-compatible environments.
🧰 Streamlit UI
⚙️ Code Respository
✨ Key Features
| Feature | Description |
|---|---|
| 🧠 Llama 3.2 8B Foundation | Built on Meta’s state-of-the-art open LLM (8.1B params) |
| 🛠️ Instruction Fine-Tuned | Tuned on task-specific and open-ended user prompts |
| ⚙️ GGUF Q4_K_M Format | Optimized 4-bit grouped quantization for memory-efficient inference |
| 🧊 Runs Locally | Compatible with llama.cpp, LM Studio, Ollama, and more |
| 💬 Dialogue-Ready | Supports structured, multi-turn instruction following |
🚀 Quickstart
📥 LM Studio (GUI - Recommended)
- Download: Place
Leeroy-3B-Instruct.Q4_K_Minto your LM Studio model folder. - Launch LM Studio and go to “Local Models”.
- Select Leeroy, click "Chat", and start prompting:
from llama_cpp import Llama
llm = Llama(model_path="Leeroy-3B-Instruct.Q4_K_M, n_ctx=4096)
output = llm("Explain the law of diminishing returns in economics.", max_tokens=200)
print(output["choices"][0]["text"])
⚙️ Vectorized Datasets
Vectorization is the process of converting textual data into numerical vectors and is a process that is usually applied once the text is cleaned. It can help improve the execution speed and reduce the training time of your code. BudgetPy provides the following vector stores on the OpenAI platform to support environmental data analysis with machine-learning
- Appropriations - Enacted appropriations from 1996-2024 available for fine-tuning learning models
- Regulations - Collection of federal regulations on the use of appropriated funds
- SF-133 - The Report on Budget Execution and Budgetary Resources
- Balances - U.S. federal agency Account Balances (File A) submitted as part of the DATA Act 2014.
- Outlays - The actual disbursements of funds by the U.S. federal government from 1962 to 2025
- SF-133 The Report on Budget Execution and Budgetary Resources
- Balances - U.S. federal agency Account Balances (File A) submitted as part of the DATA Act 2014.
- Circular A11 - Guidance from OMB on the preparation, submission, and execution of the federal budget
- Fastbook - Treasury guidance on federal ledger accounts
- Title 31 CFR - Money & Finance
- Redbook - The Principles of Appropriations Law (Volumes I & II).
- US Standard General Ledger - Account Definitions
- Treasury Appropriation Fund Symbols (TAFSs) Dataset - Collection of TAFSs used by federal agencies
🧪 Evaluation Results
| Task | Leeroy (Q4_K_M) | Llama 3.2 8B Instruct (base) |
|---|---|---|
| ARC-Challenge (25-shot) | 77.6% | 72.9% |
| NaturalQuestions (EM/F1) | 62.4 / 74.1 | 57.2 / 69.5 |
| GSM8K (reasoning) | 68.3% | 61.9% |
| HumanEval (pass@1, reasoning) | 10.1% | 8.5% |
| MMLU (5-shot average) | 62.5% | 58.4% |
🧠 Use Cases
Leeroy is optimized for instructional clarity, factual reasoning, and dialogic interaction:
- 🧠 AI Research Assistants: Summarization, definitions, and analogies
- 🔍 Search-Augmented RAG Systems: Use Leeroy for answer generation with vector-based retrieval
- 🧮 Code Writing / Review: Write snippets, explain functions, or generate tests
- 🧾 Legal / Policy Drafting: Clear summaries, rewriting, scenario simulation
- 🗃️ Embedded Assistants: Use with offline agents and CLI frontends
- 🌐 Multilingual Prompting: English, Spanish, French, and more with strong fluency
🧰 Intended Use
- Lightweight instruction following, reasoning, summarization, and light code generation.
- Edge/desktop assistants, CLI tools, and RAG agents where low latency and small footprint are key.
🔒 Limitations
- Context length is dependent on the specific GGUF build; confirm your runtime settings.
- Q4_K_M trades some precision for speed; complex coding and multi-hop reasoning may degrade vs. higher-precision builds.
- As with any LLM, outputs can contain errors or hallucinations—use validation/guardrails.
🧩 Training Details (summary)
- Base: Meta Llama-3.2-8B-Instruct.
- Method: Meta’s instruction-tuned base; this package applies GGUF Q4_K_M quantization for local use.
- Packaging: Optimized for llama.cpp/LM Studio and other GGUF-compatible runtimes.
Examples: Using the Leeroy LLM (Llama-3.2-8B-Instruct, Q4_K_M, GGUF)
This guide shows multiple ways to run Leeroy locally. All examples assume you have the quantized
file Leeroy.Q4_K_M.gguf on disk.
🛠️ 1) llama.cpp (CLI)
Build llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make
Basic run
./main -m ./Leeroy-3B-Instruct.Q4_K_M \
-p "Write a 3-sentence summary of the Bayes theorem." \
-n 200 -t 8 -c 4096 -ngl 0
Notes
• -m : path to GGUF file
• -p : prompt text
• -n : max new tokens
• -t : CPU threads
• -c : context tokens (set per your GGUF build; 4096 shown as an example)
• -ngl : #layers offloaded to GPU (0 = CPU-only)
Windows (PowerShell) example
.\main.exe -m .\Leeroy-3B-In.Q4_K_M.gguf `
-p "List 5 non-obvious Python performance tips." `
-n 180 -t 10 -c 4096
🖥️ 2) LM Studio
1. Open LM Studio → Local Models → Import.
2. Drag-drop `Leeroy.Q4_K_M.gguf`.
3. In the chat pane, set:
• Max new tokens: 128–512
• Temperature: 0.6–0.9
• Top-p: 0.9 (start conservative)
4. Prompt example:
Explain the differences between retrieval-augmented generation and fine-tuning.
Give bullet points, then a short recommendation.
🐍 3) Python via llama-cpp-python
Install
pip install llama-cpp-python
Load and generate
from llama_cpp import Llama
llm = Llama(
model_path="Leeroy-3B-Instruct.Q4_K_M",
n_ctx=4096, # adjust to your GGUF build
n_threads=8, # CPU threads
n_gpu_layers=0 # set >0 to offload to GPU (if supported)
)
prompt = (
"You are a precise assistant. "
"In 6 bullet points, explain vector databases for RAG."
)
out = llm(
prompt,
max_tokens=256,
temperature=0.7,
top_p=0.9
)
print(out["choices"][0]["text"])
Streaming (token-by-token)
for tok in llm.create_completion(
prompt="Draft a concise project README outline.",
max_tokens=200,
temperature=0.6,
stream=True
):
print(tok["choices"][0]["text"], end="", flush=True)
🦙 4) Using Leeroy with Ollama (convert GGUF → Modelfile)
Create Modelfile next to your GGUF:
FROM ./Leeroy-3B-Instruct.Q4_K_M
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
Then create and run
ollama create leeroy -f Modelfile
ollama run leeroy "Summarize the S3 storage classes and use cases."
🧩 5) Prompting Patterns
Direct instruction (concise)
You are a concise assistant. Explain in plain language.
Question: What is the curse of dimensionality, and how does PCA help?
Constrained format
Role: You produce JSON only.
Task: Extract entities from the text.
Schema: {"org":[], "person":[], "date":[]}
Text: <paste paragraph here>
Chain-of-thought light (compact rationale)
Give a brief 2-step reasoning before the final answer.
Question: Why do transformers use self-attention?
Guarded answers
If you are not confident or the context is insufficient, say "I don't know" and ask for missing info.
🔧 Components Used
- LLM:
Leeroy.Q4_K_M.gguf(loaded locally) - Embedding Model:
all-MiniLM-L6-v2(viasentence-transformers) - Vector Store:
FAISS - Retriever: Top-k similarity search
- Prompt Template: Fused with context + user query
- Execution Mode: CPU (offline)
📁 1. Document Ingestion
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
loader = TextLoader("my_knowledge.txt")
documents = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
docs = splitter.split_documents(documents)
🔍 2. Embedding & Vector Indexing
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
embedding = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(docs, embedding)
🔄 3. Retrieval + Prompt Formatting
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
retrieved_docs = retriever.get_relevant_documents("What is the RAG method?")
context = "\n\n".join([doc.page_content for doc in retrieved_docs])
prompt = f"""
You are Leeroy, a helpful assistant. Use the context below to answer the question:
<context>
{context}
</context>
<question>
What is the RAG method?
</question>
"""
🧠 4. LLM Inference with Leeroy
./main -m Leeroy.Q4_K_M.gguf -p "$prompt" -n 512 -t 8 -c 2048 --color
The output will be Leeroy's generated answer based on the retrieved content.
📝 Notes
- Leeroy is run locally via
llama.cpporLM Studio. - No OpenAI API or GPU is needed.
- Make sure your prompt is carefully formatted to simulate a structured context window.
⚙️ 7) Parameter Tuning Tips
• Temperature: 0.6–0.9 (lower = more deterministic)
• Top-p: 0.8–0.95 (adjust one knob at a time)
• Max new tokens: 128–512 for chat; longer for drafts
• Repeat penalty: 1.05–1.2 if you see repetition
• Threads: match physical cores for best CPU throughput
🛟 8) Troubleshooting
• Out-of-memory:
Reduce -c (context) and -n (max tokens), or switch to fewer GPU layers.
• Garbled text / artifacts:
Verify llama.cpp is up-to-date and the GGUF was not corrupted.
• Slow generation:
Increase -t (threads), pin CPU governor to performance, or offload layers (n_gpu_layers>0).
• Incoherent outputs:
Lower temperature, raise top-p slightly, and add clearer instruction in the prompt.
📝 Prompting Engineering
No special chat template is strictly required. Use clear instructions and keep prompts concise. For multi-turn workflows, persist conversation state externally or via your app’s memory/RAG layer.
Example system style
You are a concise, accurate assistant. Prefer step-by-step reasoning only when needed.
Cite assumptions and ask for missing constraints.
- Guro is a prompt library designed to supercharge AI agents and assistants with task-specific personas -ie, total randos.
- From academic writing to financial analysis, technical support, SEO, and beyond
- Guro provides precision-crafted prompt templates ready to drop into your LLM workflows.
🕒 License and Usage
This model package derives from Meta’s Llama-3.x family. You are responsible for ensuring your use complies with the upstream model license and any dataset terms. For commercial deployment, review Meta’s license and your organization’s compliance requirements.
- Leeroy is published under the MIT General Public License v3
🏁 Acknowledgements
- Base model: Meta Llama-3.2-8B-Instruct.
- Quantization and local runtimes: GGUF ecosystem (e.g., llama.cpp, LM Studio, Ollama loaders).
- Downloads last month
- 67
4-bit
Model tree for leeroy-jankins/leeroy
Base model
meta-llama/Llama-3.2-1B-Instruct
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="leeroy-jankins/leeroy", filename="Leeroy-3B-Instruct.Q4_K_M.gguf", )