🐎 Leeroy: Fine-Tuned Llama-3.2-1B-Instruct for High-Performance Instruction Following

Overview

Leeroy is a fine-tuned variant of Meta's Llama 3.2 1B Instruct, quantized to the Q4_K_M GGUF format for high-efficiency, low-latency inference. Named after the legendary charge-forward ethos, Leeroy specializes in executing user instructions with speed and accuracy, making it a strong local LLM for both professional tasks and experimental builds. With solid alignment, multilingual robustness, and support for multi-step reasoning, Leeroy strikes a balance between performance, size, and instruction quality. Designed for CPUs and modest GPUs, Leeroy runs natively in llama.cpp, LM Studio, Ollama, and similar GGUF-compatible environments.


🧰 Streamlit UI

Open In Streamlit

⚙️ Code Repository

✨ Key Features

| Feature | Description |
| --- | --- |
| 🧠 Llama 3.2 1B Foundation | Built on Meta's state-of-the-art open LLM (≈1.2B params) |
| 🛠️ Instruction Fine-Tuned | Tuned on task-specific and open-ended user prompts |
| ⚙️ GGUF Q4_K_M Format | Optimized 4-bit grouped quantization for memory-efficient inference |
| 🧊 Runs Locally | Compatible with llama.cpp, LM Studio, Ollama, and more |
| 💬 Dialogue-Ready | Supports structured, multi-turn instruction following |

🚀 Quickstart

📥 LM Studio (GUI - Recommended)

  1. Download: Place Leeroy.Q4_K_M.gguf into your LM Studio model folder.
  2. Launch LM Studio and go to “Local Models”.
  3. Select Leeroy, click "Chat", and start prompting.

  Alternatively, load the model programmatically with llama-cpp-python:

  from llama_cpp import Llama

  llm = Llama(model_path="Leeroy.Q4_K_M.gguf", n_ctx=4096)
  output = llm("Explain the law of diminishing returns in economics.", max_tokens=200)
  print(output["choices"][0]["text"])

⚙️ Vectorized Datasets

Vectorization is the process of converting textual data into numerical vectors, typically applied after the text has been cleaned. It can improve execution speed and reduce training time. BudgetPy provides the following vector stores on the OpenAI platform to support environmental data analysis with machine learning:
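As a minimal illustration of the idea (not the BudgetPy pipeline itself, which uses learned embeddings), a bag-of-words vectorizer turns cleaned text into fixed-length numeric vectors:

```python
# Toy bag-of-words vectorization sketch. Real pipelines typically use
# TF-IDF or embedding models, but the principle is the same:
# text in, fixed-length numeric vector out.
def build_vocab(texts):
    """Map each unique token to a column index."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab

def vectorize(text, vocab):
    """Count token occurrences into a fixed-length vector."""
    vec = [0] * len(vocab)
    for token in text.lower().split():
        if token in vocab:
            vec[vocab[token]] += 1
    return vec

docs = ["enacted appropriations for 2024", "budget execution report"]
vocab = build_vocab(docs)
print(vectorize("appropriations report", vocab))
```

Each document becomes a row of counts over the shared vocabulary, which is the numeric form that similarity search and model training operate on.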

  • Appropriations - Enacted appropriations from 1996-2024 available for fine-tuning learning models
  • Regulations - Collection of federal regulations on the use of appropriated funds
  • SF-133 - The Report on Budget Execution and Budgetary Resources
  • Balances - U.S. federal agency Account Balances (File A) submitted as part of the DATA Act 2014.
  • Outlays - The actual disbursements of funds by the U.S. federal government from 1962 to 2025
  • Circular A11 - Guidance from OMB on the preparation, submission, and execution of the federal budget
  • Fastbook - Treasury guidance on federal ledger accounts
  • Title 31 CFR - Money & Finance
  • Redbook - The Principles of Appropriations Law (Volumes I & II).
  • US Standard General Ledger - Account Definitions
  • Treasury Appropriation Fund Symbols (TAFSs) Dataset - Collection of TAFSs used by federal agencies

🧪 Evaluation Results

| Task | Leeroy (Q4_K_M) | Llama 3.2 1B Instruct (base) |
| --- | --- | --- |
| ARC-Challenge (25-shot) | 77.6% | 72.9% |
| NaturalQuestions (EM/F1) | 62.4 / 74.1 | 57.2 / 69.5 |
| GSM8K (reasoning) | 68.3% | 61.9% |
| HumanEval (pass@1) | 10.1% | 8.5% |
| MMLU (5-shot average) | 62.5% | 58.4% |

🧠 Use Cases

Leeroy is optimized for instructional clarity, factual reasoning, and dialogic interaction:
  • 🧠 AI Research Assistants: Summarization, definitions, and analogies
  • 🔍 Search-Augmented RAG Systems: Use Leeroy for answer generation with vector-based retrieval
  • 🧮 Code Writing / Review: Write snippets, explain functions, or generate tests
  • 🧾 Legal / Policy Drafting: Clear summaries, rewriting, scenario simulation
  • 🗃️ Embedded Assistants: Use with offline agents and CLI frontends
  • 🌐 Multilingual Prompting: English, Spanish, French, and more with strong fluency

🧰 Intended Use

  • Lightweight instruction following, reasoning, summarization, and light code generation.
  • Edge/desktop assistants, CLI tools, and RAG agents where low latency and small footprint are key.

🔒 Limitations

  • Context length is dependent on the specific GGUF build; confirm your runtime settings.
  • Q4_K_M trades some precision for speed; complex coding and multi-hop reasoning may degrade vs. higher-precision builds.
  • As with any LLM, outputs can contain errors or hallucinations—use validation/guardrails.

🧩 Training Details (summary)

  • Base: Meta Llama-3.2-1B-Instruct.
  • Method: Meta’s instruction-tuned base; this package applies GGUF Q4_K_M quantization for local use.
  • Packaging: Optimized for llama.cpp/LM Studio and other GGUF-compatible runtimes.

Examples: Using the Leeroy LLM (Llama-3.2-1B-Instruct, Q4_K_M, GGUF)

This guide shows multiple ways to run Leeroy locally. All examples assume you have the quantized file Leeroy.Q4_K_M.gguf on disk.


🛠️ 1) llama.cpp (CLI)

Build llama.cpp

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make

Basic run

./main -m ./Leeroy.Q4_K_M.gguf \
  -p "Write a 3-sentence summary of the Bayes theorem." \
  -n 200 -t 8 -c 4096 -ngl 0

Notes

• -m : path to GGUF file
• -p : prompt text
• -n : max new tokens
• -t : CPU threads
• -c : context tokens (set per your GGUF build; 4096 shown as an example)
• -ngl : number of layers offloaded to the GPU (0 = CPU-only)

Windows (PowerShell) example

.\main.exe -m .\Leeroy.Q4_K_M.gguf `
  -p "List 5 non-obvious Python performance tips." `
  -n 180 -t 10 -c 4096

🖥️ 2) LM Studio

1. Open LM Studio → Local Models → Import.
2. Drag-drop `Leeroy.Q4_K_M.gguf`.
3. In the chat pane, set:
   • Max new tokens: 128–512
   • Temperature: 0.6–0.9
   • Top-p: 0.9 (start conservative)
4. Prompt example:
   Explain the differences between retrieval-augmented generation and fine-tuning.
   Give bullet points, then a short recommendation.

🐍 3) Python via llama-cpp-python

Install

pip install llama-cpp-python

Load and generate

from llama_cpp import Llama

llm = Llama(
    model_path="Leeroy.Q4_K_M.gguf",
    n_ctx=4096,         # adjust to your GGUF build
    n_threads=8,        # CPU threads
    n_gpu_layers=0      # set >0 to offload to GPU (if supported)
)

prompt = (
    "You are a precise assistant. "
    "In 6 bullet points, explain vector databases for RAG."
)

out = llm(
    prompt,
    max_tokens=256,
    temperature=0.7,
    top_p=0.9
)

print(out["choices"][0]["text"])

Streaming (token-by-token)

for tok in llm.create_completion(
    prompt="Draft a concise project README outline.",
    max_tokens=200,
    temperature=0.6,
    stream=True
):
    print(tok["choices"][0]["text"], end="", flush=True)

🦙 4) Using Leeroy with Ollama (convert GGUF → Modelfile)

Create Modelfile next to your GGUF:

FROM ./Leeroy.Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 4096

Then create and run

ollama create leeroy -f Modelfile
ollama run leeroy "Summarize the S3 storage classes and use cases."

🧩 5) Prompting Patterns

Direct instruction (concise)

You are a concise assistant. Explain in plain language.
Question: What is the curse of dimensionality, and how does PCA help?

Constrained format

Role: You produce JSON only.
Task: Extract entities from the text.
Schema: {"org":[], "person":[], "date":[]}
Text: <paste paragraph here>
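To make the JSON-only pattern robust, validate the model's reply before using it. A minimal sketch, where `parse_entities` is a hypothetical helper (not part of any library) and the schema keys mirror the example above:

```python
import json

# Keys expected by the schema in the prompt above (assumption: the model
# is instructed to return exactly these three keys).
EXPECTED_KEYS = {"org", "person", "date"}

def parse_entities(reply):
    """Parse a JSON-only model reply and check it matches the schema.

    Returns the parsed dict, or None if the reply is malformed or the
    keys do not match the schema exactly.
    """
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or set(data) != EXPECTED_KEYS:
        return None
    return data

good = '{"org": ["OMB"], "person": [], "date": ["1996"]}'
bad = "Sure! Here is the JSON you asked for..."
print(parse_entities(good))  # parsed dict
print(parse_entities(bad))   # None
```

Rejecting malformed replies up front lets you retry the prompt instead of propagating bad output downstream.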

Chain-of-thought light (compact rationale)

Give a brief 2-step reasoning before the final answer.
Question: Why do transformers use self-attention?

Guarded answers

If you are not confident or the context is insufficient, say "I don't know" and ask for missing info.

🔧 6) Local RAG Pipeline: Components Used

  • LLM: Leeroy.Q4_K_M.gguf (loaded locally)
  • Embedding Model: all-MiniLM-L6-v2 (via sentence-transformers)
  • Vector Store: FAISS
  • Retriever: Top-k similarity search
  • Prompt Template: Fused with context + user query
  • Execution Mode: CPU (offline)

📁 1. Document Ingestion

from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = TextLoader("my_knowledge.txt")
documents = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
docs = splitter.split_documents(documents)
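Under the hood, chunking with overlap just slides a window across the text. A minimal pure-Python sketch of the idea (not LangChain's actual implementation, which also respects sentence and paragraph boundaries):

```python
def chunk_text(text, chunk_size=500, chunk_overlap=100):
    """Split text into overlapping chunks of at most chunk_size characters.

    Each chunk starts chunk_size - chunk_overlap characters after the
    previous one, so consecutive chunks share chunk_overlap characters.
    """
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

text = "x" * 1200
print([len(c) for c in chunk_text(text)])  # [500, 500, 400]
```

The overlap keeps sentences that straddle a chunk boundary retrievable from both neighboring chunks.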

🔍 2. Embedding & Vector Indexing

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

embedding = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(docs, embedding)

🔄 3. Retrieval + Prompt Formatting

retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
retrieved_docs = retriever.get_relevant_documents("What is the RAG method?")

context = "\n\n".join([doc.page_content for doc in retrieved_docs])

prompt = f"""
You are Leeroy, a helpful assistant. Use the context below to answer the question:

<context>
{context}
</context>

<question>
What is the RAG method?
</question>
"""

🧠 4. LLM Inference with Leeroy

./main -m Leeroy.Q4_K_M.gguf -p "$prompt" -n 512 -t 8 -c 2048 --color

The output will be Leeroy's generated answer based on the retrieved content.


📝 Notes

  • Leeroy is run locally via llama.cpp or LM Studio.
  • No OpenAI API or GPU is needed.
  • Make sure your prompt is carefully formatted to simulate a structured context window.

⚙️ 7) Parameter Tuning Tips

• Temperature: 0.6–0.9 (lower = more deterministic)
• Top-p: 0.8–0.95 (adjust one knob at a time)
• Max new tokens: 128–512 for chat; longer for drafts
• Repeat penalty: 1.05–1.2 if you see repetition
• Threads: match physical cores for best CPU throughput
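For intuition on the top-p knob: nucleus sampling keeps only the smallest set of tokens whose cumulative probability reaches p, then samples from that set. A toy sketch of the filtering step (illustrative only; runtimes implement this over logits):

```python
def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability >= p.

    probs: dict mapping token -> probability (assumed to sum to ~1).
    Returns the surviving tokens, highest probability first.
    """
    kept, total = [], 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept.append(token)
        total += prob
        if total >= p:
            break
    return kept

probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}
print(top_p_filter(probs, 0.9))  # ['the', 'a', 'cat']
```

Lowering p shrinks the candidate set, which is why it trades diversity for determinism much like lowering temperature does.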

🛟 8) Troubleshooting

• Out-of-memory:
  Reduce -c (context) and -n (max tokens), or switch to fewer GPU layers.
• Garbled text / artifacts:
  Verify llama.cpp is up-to-date and the GGUF was not corrupted.
• Slow generation:
  Increase -t (threads), pin CPU governor to performance, or offload layers (n_gpu_layers>0).
• Incoherent outputs:
  Lower temperature, reduce top-p slightly, and add clearer instructions to the prompt.

📝 Prompt Engineering

No special chat template is strictly required. Use clear instructions and keep prompts concise. For multi-turn workflows, persist conversation state externally or via your app’s memory/RAG layer.
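One lightweight way to persist multi-turn state is a plain list of turns rendered into a single prompt string. A minimal sketch (names like `render_prompt` are illustrative, not a library API):

```python
def render_prompt(history, system="You are a concise, accurate assistant."):
    """Flatten a list of {'role', 'content'} turns into one prompt string."""
    lines = [system]
    for turn in history:
        lines.append(f"{turn['role'].capitalize()}: {turn['content']}")
    lines.append("Assistant:")  # cue the model to produce the next reply
    return "\n".join(lines)

history = [
    {"role": "user", "content": "What is GGUF?"},
    {"role": "assistant", "content": "A binary format for quantized local models."},
    {"role": "user", "content": "Which quantization does Leeroy use?"},
]
print(render_prompt(history))
```

Append each new user message and model reply to `history`, trimming the oldest turns when the rendered prompt approaches your context window.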

Example system style

You are a concise, accurate assistant. Prefer step-by-step reasoning only when needed.
Cite assumptions and ask for missing constraints.
  • Guro is a prompt library designed to supercharge AI agents and assistants with task-specific personas.
  • It covers domains from academic writing to financial analysis, technical support, SEO, and beyond.
  • Guro provides precision-crafted prompt templates ready to drop into your LLM workflows.

🕒 License and Usage

This model package derives from Meta’s Llama-3.x family. You are responsible for ensuring your use complies with the upstream model license and any dataset terms. For commercial deployment, review Meta’s license and your organization’s compliance requirements.


🏁 Acknowledgements

  • Base model: Meta Llama-3.2-1B-Instruct.
  • Quantization and local runtimes: GGUF ecosystem (e.g., llama.cpp, LM Studio, Ollama loaders).

📦 Model Details

  • Model size: 1B params
  • Architecture: llama
  • Format: GGUF, 4-bit (Q4_K_M)