🐎 Leeroy: Fine-Tuned Llama-3.2-1B-Instruct for High-Performance Instruction Following
Overview
Leeroy is a fine-tuned variant of Meta's Llama 3.2 1B Instruct, quantized to the Q4_K_M GGUF format for high-efficiency, low-latency inference. Named for the legendary charge-forward ethos, Leeroy specializes in executing user instructions with speed and accuracy, making it an ideal local LLM for both professional tasks and experimental builds.
With strong alignment capabilities, multilingual robustness, and support for complex multi-step reasoning, Leeroy strikes a balance between performance, size, and instruction quality. Designed for use on CPUs and modest GPUs, Leeroy runs natively in llama.cpp, LM Studio, Ollama, and similar GGUF-compatible environments.
🧰 Streamlit UI
⚙️ Code Repository
✨ Key Features
| Feature | Description |
|---|---|
| 🧠 Llama 3.2 1B Foundation | Built on Meta’s compact open LLM (~1.2B params) |
| 🛠️ Instruction Fine-Tuned | Tuned on task-specific and open-ended user prompts |
| ⚙️ GGUF Q4_K_M Format | Optimized 4-bit grouped quantization for memory-efficient inference |
| 🧊 Runs Locally | Compatible with llama.cpp, LM Studio, Ollama, and more |
| 💬 Dialogue-Ready | Supports structured, multi-turn instruction following |
🚀 Quickstart
📥 LM Studio (GUI - Recommended)
- Download: place `Leeroy.Q4_K_M.gguf` into your LM Studio model folder.
- Launch LM Studio and go to “Local Models”.
- Select Leeroy, click “Chat”, and start prompting.
🐍 Python (llama-cpp-python)
from llama_cpp import Llama
llm = Llama(model_path="Leeroy.Q4_K_M.gguf", n_ctx=4096)
output = llm("Explain the law of diminishing returns in economics.", max_tokens=200)
print(output["choices"][0]["text"])
⚙️ Vectorized Datasets
Vectorization is the process of converting cleaned textual data into numerical vectors; it improves execution speed and reduces the training time of your code. BudgetPy provides the following vector stores on the OpenAI platform to support environmental data analysis with machine learning:
- Appropriations - Enacted appropriations from 1996–2024, available for fine-tuning machine-learning models
- Regulations - Collection of federal regulations on the use of appropriated funds
- SF-133 - The Report on Budget Execution and Budgetary Resources
- Balances - U.S. federal agency Account Balances (File A) submitted as part of the DATA Act 2014.
- Outlays - The actual disbursements of funds by the U.S. federal government from 1962 to 2025
- Circular A11 - Guidance from OMB on the preparation, submission, and execution of the federal budget
- Fastbook - Treasury guidance on federal ledger accounts
- Title 31 CFR - Money & Finance
- Redbook - The Principles of Appropriations Law (Volumes I & II).
- US Standard General Ledger - Account Definitions
- Treasury Appropriation Fund Symbols (TAFSs) Dataset - Collection of TAFSs used by federal agencies
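As a minimal illustration of the vectorization step described above (a hypothetical sketch, not the BudgetPy or OpenAI implementation), a bag-of-words term-frequency vectorizer fits in a few lines:

```python
from collections import Counter

def build_vocabulary(texts):
    """Map each unique token across the corpus to a column index."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab

def vectorize(text, vocab):
    """Convert one document into a term-frequency vector over the vocabulary."""
    counts = Counter(text.lower().split())
    return [counts.get(token, 0) for token in vocab]

corpus = ["enacted appropriations for 2024",
          "budget execution and budgetary resources"]
vocab = build_vocabulary(corpus)
vec = vectorize(corpus[0], vocab)
print(len(vec))  # one dimension per vocabulary term
```

Real pipelines use dense embeddings (as in the RAG example later in this guide) rather than sparse counts, but the principle is the same: text in, fixed-length numeric vector out.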
🧪 Evaluation Results
| Task | Leeroy (Q4_K_M) | Llama 3.2 1B Instruct (base) |
|---|---|---|
| ARC-Challenge (25-shot) | 77.6% | 72.9% |
| NaturalQuestions (EM/F1) | 62.4 / 74.1 | 57.2 / 69.5 |
| GSM8K (reasoning) | 68.3% | 61.9% |
| HumanEval (pass@1, code generation) | 10.1% | 8.5% |
| MMLU (5-shot average) | 62.5% | 58.4% |
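For context on the NaturalQuestions row: EM is strict exact match and F1 is token-overlap F1, the standard extractive-QA metrics. A simplified sketch of how they are computed (omitting the usual answer normalization such as article and punctuation stripping):

```python
from collections import Counter

def exact_match(prediction: str, truth: str) -> bool:
    """Strict string equality after trimming and lowercasing."""
    return prediction.strip().lower() == truth.strip().lower()

def token_f1(prediction: str, truth: str) -> float:
    """Harmonic mean of token precision and recall between answer strings."""
    pred, gold = prediction.lower().split(), truth.lower().split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))  # True
print(token_f1("the city of Paris", "Paris France"))
```

Benchmark scores average these per-example values over the whole evaluation set.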
🧠 Use Cases
Leeroy is optimized for instructional clarity, factual reasoning, and dialogic interaction:
- 🧠 AI Research Assistants: Summarization, definitions, and analogies
- 🔍 Search-Augmented RAG Systems: Use Leeroy for answer generation with vector-based retrieval
- 🧮 Code Writing / Review: Write snippets, explain functions, or generate tests
- 🧾 Legal / Policy Drafting: Clear summaries, rewriting, scenario simulation
- 🗃️ Embedded Assistants: Use with offline agents and CLI frontends
- 🌐 Multilingual Prompting: English, Spanish, French, and more with strong fluency
🧰 Intended Use
- Lightweight instruction following, reasoning, summarization, and light code generation.
- Edge/desktop assistants, CLI tools, and RAG agents where low latency and small footprint are key.
🔒 Limitations
- Context length is dependent on the specific GGUF build; confirm your runtime settings.
- Q4_K_M trades some precision for speed; complex coding and multi-hop reasoning may degrade vs. higher-precision builds.
- As with any LLM, outputs can contain errors or hallucinations—use validation/guardrails.
🧩 Training Details (summary)
- Base: Meta Llama-3.2-1B-Instruct.
- Method: Meta’s instruction-tuned base; this package applies GGUF Q4_K_M quantization for local use.
- Packaging: Optimized for llama.cpp/LM Studio and other GGUF-compatible runtimes.
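To see why Q4_K_M packaging matters for local use, a rough back-of-envelope file-size estimate helps (assuming an average of ~4.85 bits per weight for Q4_K_M; actual GGUF files add metadata and keep some tensors at higher precision, so treat this as an approximation):

```python
def gguf_size_estimate_gb(n_params: float, bits_per_weight: float = 4.85) -> float:
    """Rough quantized file size: parameter count times bits per weight, in GB."""
    return n_params * bits_per_weight / 8 / 1e9

# A ~1.2B-parameter model at ~4.85 bits/weight lands well under 1 GB,
# versus ~2.5 GB at FP16, which is what makes CPU-only inference practical.
print(round(gguf_size_estimate_gb(1.24e9), 2))
```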
Examples: Using the Leeroy LLM (Llama-3.2-1B-Instruct, Q4_K_M, GGUF)
This guide shows multiple ways to run Leeroy locally. All examples assume you have the quantized
file Leeroy.Q4_K_M.gguf on disk.
🛠️ 1) llama.cpp (CLI)
Build llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make
Basic run
./main -m ./Leeroy.Q4_K_M.gguf \
-p "Write a 3-sentence summary of the Bayes theorem." \
-n 200 -t 8 -c 4096 -ngl 0
Notes
• -m : path to GGUF file
• -p : prompt text
• -n : max new tokens
• -t : CPU threads
• -c : context tokens (set per your GGUF build; 4096 shown as an example)
• -ngl : #layers offloaded to GPU (0 = CPU-only)
Windows (PowerShell) example
.\main.exe -m .\Leeroy.Q4_K_M.gguf `
-p "List 5 non-obvious Python performance tips." `
-n 180 -t 10 -c 4096
🖥️ 2) LM Studio
1. Open LM Studio → Local Models → Import.
2. Drag-drop `Leeroy.Q4_K_M.gguf`.
3. In the chat pane, set:
• Max new tokens: 128–512
• Temperature: 0.6–0.9
• Top-p: 0.9 (start conservative)
4. Prompt example:
Explain the differences between retrieval-augmented generation and fine-tuning.
Give bullet points, then a short recommendation.
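To make the Top-p knob concrete: nucleus sampling keeps only the smallest set of highest-probability tokens whose cumulative probability reaches p, then renormalizes before sampling. A minimal sketch of the idea (not LM Studio's actual implementation):

```python
def top_p_filter(probs: dict, p: float) -> dict:
    """Keep the smallest high-probability set whose cumulative mass >= p."""
    kept, total = {}, 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept[token] = prob
        total += prob
        if total >= p:
            break
    # Renormalize the surviving tokens so they form a proper distribution.
    norm = sum(kept.values())
    return {t: pr / norm for t, pr in kept.items()}

dist = {"the": 0.5, "a": 0.3, "zebra": 0.15, "qux": 0.05}
print(top_p_filter(dist, 0.9))  # the low-probability "qux" is pruned
```

Lower p cuts the long tail of unlikely tokens, which is why starting around 0.9 is a conservative default.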
🐍 3) Python via llama-cpp-python
Install
pip install llama-cpp-python
Load and generate
from llama_cpp import Llama
llm = Llama(
model_path="Leeroy.Q4_K_M.gguf",
n_ctx=4096, # adjust to your GGUF build
n_threads=8, # CPU threads
n_gpu_layers=0 # set >0 to offload to GPU (if supported)
)
prompt = (
"You are a precise assistant. "
"In 6 bullet points, explain vector databases for RAG."
)
out = llm(
prompt,
max_tokens=256,
temperature=0.7,
top_p=0.9
)
print(out["choices"][0]["text"])
Streaming (token-by-token)
for tok in llm.create_completion(
prompt="Draft a concise project README outline.",
max_tokens=200,
temperature=0.6,
stream=True
):
print(tok["choices"][0]["text"], end="", flush=True)
🦙 4) Using Leeroy with Ollama (convert GGUF → Modelfile)
Create Modelfile next to your GGUF:
FROM ./Leeroy.Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
Then create and run
ollama create leeroy -f Modelfile
ollama run leeroy "Summarize the S3 storage classes and use cases."
🧩 5) Prompting Patterns
Direct instruction (concise)
You are a concise assistant. Explain in plain language.
Question: What is the curse of dimensionality, and how does PCA help?
Constrained format
Role: You produce JSON only.
Task: Extract entities from the text.
Schema: {"org":[], "person":[], "date":[]}
Text: <paste paragraph here>
Chain-of-thought light (compact rationale)
Give a brief 2-step reasoning before the final answer.
Question: Why do transformers use self-attention?
Guarded answers
If you are not confident or the context is insufficient, say "I don't know" and ask for missing info.
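The constrained-format pattern above pairs well with a validation step on the model's reply, so malformed output is caught instead of silently passed downstream. A minimal sketch using the schema from the example (the helper name is illustrative, not part of any library):

```python
import json

SCHEMA_KEYS = {"org", "person", "date"}

def validate_entities(reply: str) -> dict:
    """Parse a model reply and check it matches the expected JSON schema."""
    data = json.loads(reply)  # raises ValueError on non-JSON replies
    if set(data) != SCHEMA_KEYS:
        raise ValueError(f"unexpected keys: {set(data)}")
    if not all(isinstance(v, list) for v in data.values()):
        raise ValueError("every schema value must be a list")
    return data

reply = '{"org": ["OMB"], "person": [], "date": ["1996"]}'
print(validate_entities(reply)["org"])
```

On validation failure, a common recovery is to re-prompt the model with the error message appended.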
🔧 6) Offline RAG Pipeline: Components Used
- LLM: `Leeroy.Q4_K_M.gguf` (loaded locally)
- Embedding Model: `all-MiniLM-L6-v2` (via `sentence-transformers`)
- Vector Store: FAISS
- Retriever: top-k similarity search
- Prompt Template: fused context + user query
- Execution Mode: CPU (offline)
📁 1. Document Ingestion
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
loader = TextLoader("my_knowledge.txt")
documents = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
docs = splitter.split_documents(documents)
🔍 2. Embedding & Vector Indexing
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
embedding = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(docs, embedding)
🔄 3. Retrieval + Prompt Formatting
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
retrieved_docs = retriever.get_relevant_documents("What is the RAG method?")
context = "\n\n".join([doc.page_content for doc in retrieved_docs])
prompt = f"""
You are Leeroy, a helpful assistant. Use the context below to answer the question:
<context>
{context}
</context>
<question>
What is the RAG method?
</question>
"""
🧠 4. LLM Inference with Leeroy
from llama_cpp import Llama
llm = Llama(model_path="Leeroy.Q4_K_M.gguf", n_ctx=2048, n_threads=8)
out = llm(prompt, max_tokens=512)
print(out["choices"][0]["text"])
The output will be Leeroy's generated answer based on the retrieved content.
📝 Notes
- Leeroy runs locally via `llama.cpp` or LM Studio.
- No OpenAI API or GPU is needed.
- Make sure your prompt is carefully formatted to simulate a structured context window.
⚙️ 7) Parameter Tuning Tips
• Temperature: 0.6–0.9 (lower = more deterministic)
• Top-p: 0.8–0.95 (adjust one knob at a time)
• Max new tokens: 128–512 for chat; longer for drafts
• Repeat penalty: 1.05–1.2 if you see repetition
• Threads: match physical cores for best CPU throughput
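To make the repeat-penalty knob concrete: it dampens the logits of recently generated tokens before sampling, so the model is less likely to loop. A simplified sketch of the common llama.cpp-style heuristic (not its exact code):

```python
def apply_repeat_penalty(logits: dict, history: list, penalty: float) -> dict:
    """Dampen tokens already in the recent history; penalty > 1.0 discourages repeats."""
    out = dict(logits)
    for token in set(history):
        if token in out:
            # Positive logits are divided, negative ones multiplied,
            # so the token always becomes less likely either way.
            out[token] = out[token] / penalty if out[token] > 0 else out[token] * penalty
    return out

logits = {"the": 2.0, "cat": 1.0, "sat": -0.5}
adjusted = apply_repeat_penalty(logits, history=["the", "sat"], penalty=1.1)
print(adjusted["the"])  # 2.0 / 1.1, slightly less likely than before
```

This is why the recommended range is small (1.05 to 1.2): large penalties suppress legitimately common words like articles and pronouns.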
🛟 8) Troubleshooting
• Out-of-memory:
Reduce -c (context) and -n (max tokens), or switch to fewer GPU layers.
• Garbled text / artifacts:
Verify llama.cpp is up-to-date and the GGUF was not corrupted.
• Slow generation:
Increase -t (threads), pin CPU governor to performance, or offload layers (n_gpu_layers>0).
• Incoherent outputs:
Lower temperature, raise top-p slightly, and add clearer instruction in the prompt.
📝 Prompt Engineering
No special chat template is strictly required. Use clear instructions and keep prompts concise. For multi-turn workflows, persist conversation state externally or via your app’s memory/RAG layer.
Example system style
You are a concise, accurate assistant. Prefer step-by-step reasoning only when needed.
Cite assumptions and ask for missing constraints.
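Since no special chat template is assumed, multi-turn state can be kept as a plain list of turns and re-rendered into each prompt. A minimal sketch of that pattern (the class and its role labels are illustrative, not a library API):

```python
class Conversation:
    """Accumulate chat turns and render them into a single prompt string."""

    def __init__(self, system: str):
        self.system = system
        self.turns = []  # list of (role, text) pairs

    def add(self, role: str, text: str) -> None:
        self.turns.append((role, text))

    def render(self) -> str:
        # Re-render the full history each turn; the trailing "Assistant:"
        # cues the model to continue as the assistant.
        lines = [f"System: {self.system}"]
        lines += [f"{role.capitalize()}: {text}" for role, text in self.turns]
        lines.append("Assistant:")
        return "\n".join(lines)

chat = Conversation("You are a concise, accurate assistant.")
chat.add("user", "What is GGUF?")
prompt = chat.render()
print(prompt)
```

After each model reply, append it with `chat.add("assistant", reply)` so the next render carries the full history; trim old turns when the rendered prompt approaches your context limit.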
- Guro is a prompt library designed to supercharge AI agents and assistants with task-specific personas.
- From academic writing to financial analysis, technical support, SEO, and beyond, Guro provides precision-crafted prompt templates ready to drop into your LLM workflows.
🕒 License and Usage
This model package derives from Meta’s Llama-3.x family. You are responsible for ensuring your use complies with the upstream model license and any dataset terms. For commercial deployment, review Meta’s license and your organization’s compliance requirements.
- Leeroy's packaging and code are published under the MIT License.
🏁 Acknowledgements
- Base model: Meta Llama-3.2-1B-Instruct.
- Quantization and local runtimes: GGUF ecosystem (e.g., llama.cpp, LM Studio, Ollama loaders).
Model tree for leeroy-jankins/leeroy
- Base model: meta-llama/Llama-3.2-1B-Instruct