Instructions to use leeroy-jankins/bubba with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use leeroy-jankins/bubba with llama-cpp-python:
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="leeroy-jankins/bubba",
    filename="bubba-20b-Q4_K_M.gguf",
)
# messages must be a list of role/content dicts, not a bare string
llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "In 5 bullet points, explain retrieval-augmented generation and when to use it."}
    ]
)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use leeroy-jankins/bubba with llama.cpp:
Install from brew
brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf leeroy-jankins/bubba:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf leeroy-jankins/bubba:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf leeroy-jankins/bubba:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf leeroy-jankins/bubba:Q4_K_M
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf leeroy-jankins/bubba:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf leeroy-jankins/bubba:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf leeroy-jankins/bubba:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf leeroy-jankins/bubba:Q4_K_M
Use Docker
docker model run hf.co/leeroy-jankins/bubba:Q4_K_M
- LM Studio
- Jan
- Ollama
How to use leeroy-jankins/bubba with Ollama:
ollama run hf.co/leeroy-jankins/bubba:Q4_K_M
- Unsloth Studio
How to use leeroy-jankins/bubba with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for leeroy-jankins/bubba to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for leeroy-jankins/bubba to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for leeroy-jankins/bubba to start chatting
- Pi
How to use leeroy-jankins/bubba with Pi:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf leeroy-jankins/bubba:Q4_K_M
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "leeroy-jankins/bubba:Q4_K_M" }
      ]
    }
  }
}
Run Pi
# Start Pi in your project directory:
pi
- Hermes Agent
How to use leeroy-jankins/bubba with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf leeroy-jankins/bubba:Q4_K_M
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default leeroy-jankins/bubba:Q4_K_M
Run Hermes
hermes
- Docker Model Runner
How to use leeroy-jankins/bubba with Docker Model Runner:
docker model run hf.co/leeroy-jankins/bubba:Q4_K_M
- Lemonade
How to use leeroy-jankins/bubba with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull leeroy-jankins/bubba:Q4_K_M
Run and chat with the model
lemonade run user.bubba-Q4_K_M
List all available models
lemonade list
- Examples: Using the Bubba LLM (Fine-tuned from gpt-oss-20b)
- Python (Transformers): Full Weights
- Python (PEFT): Adapters on Top of the Base
- 4-bit (bitsandbytes): Memory-Efficient Loading
- Serve with vLLM (OpenAI-compatible API)
- Serve with Text Generation Inference (TGI)
- Prompt Patterns
- Basic RAG
- Parameter Tips
- Troubleshooting
- Minimal Batch Inference Example
- Inference Tips
- WebUI
- KoboldCPP
- LM Studio
- Prompting
- Performance & Memory Guidance (Rules of Thumb)
- Files
- GGUF Format
- Safety, Bias & Responsible Use
- License and Usage
- Attribution
- FAQ
- Changelog
Bubba is a fine-tuned LLM based on OpenAI's gpt-oss-20b. This release packages the fine-tuned weights (or adapters) for practical, low-latency instruction following, summarization, reasoning, and light code generation. It is intended for local or self-hosted environments and RAG (Retrieval-Augmented Generation) stacks that require predictable, fast outputs.
A quantized, fine-tuned GGUF build based on OpenAI's gpt-oss-20b
Format: GGUF (for llama.cpp and compatible runtimes)
Quantization: Q4_K_XL (4-bit, K-grouped, extra-low loss)
File: bubba-20b-Q4_K_XL.gguf
Streamlit UI
Code Repository
Overview
- This repo provides a 4-bit K-quantized .gguf for fast local inference of a 20B-parameter model derived from OpenAI's gpt-oss-20b (as reported by the uploader).
- Use cases: general chat/instruction following, coding help, knowledge Q&A (see Intended Use & Limitations).
- Works with: llama.cpp, llama-cpp-python, KoboldCPP, Text Generation WebUI, LM Studio, and other GGUF-compatible backends.
- Hardware guidance (rule of thumb): ~12–16 GB VRAM/RAM for comfortable batch-1 inference with Q4_K_XL; CPU-only works too (expect lower tokens/s).
Key Features
- Instruction-tuned derivative of gpt-oss-20b for concise, helpful responses.
- Optimized defaults for short to medium prompts; strong compatibility with RAG pipelines.
- Flexible distribution: full finetuned weights or lightweight LoRA/QLoRA adapters.
- Compatible with popular runtimes and libraries (Transformers, PEFT, vLLM, Text Generation Inference).
Provenance & license: This quant is produced from a base model claimed to be OpenAI's gpt-oss-20b. Please review and comply with the original model's license/terms. The GGUF quantization inherits those terms. See the License section.
Vectorized Datasets
Vectorization converts text into numerical vectors, usually after the text has been cleaned. Working with vectors can speed up execution and reduce training time. BudgetPy provides the following vector stores on the OpenAI platform to support environmental data analysis with machine learning:
- Appropriations - Enacted appropriations from 1996-2024 available for fine-tuning learning models
- Regulations - Collection of federal regulations on the use of appropriated funds
- SF-133 - The Report on Budget Execution and Budgetary Resources
- Balances - U.S. federal agency Account Balances (File A) submitted as part of the DATA Act 2014.
- Outlays - The actual disbursements of funds by the U.S. federal government from 1962 to 2025
- Circular A11 - Guidance from OMB on the preparation, submission, and execution of the federal budget
- Fastbook - Treasury guidance on federal ledger accounts
- Title 31 CFR - Money & Finance
- Redbook - The Principles of Appropriations Law (Volumes I & II).
- US Standard General Ledger - Account Definitions
- Treasury Appropriation Fund Symbols (TAFSs) Dataset - Collection of TAFSs used by federal agencies
Technical Specifications
| Property | Value / Guidance |
|---|---|
| Base model | gpt-oss-20b (decoder-only Transformer) |
| Parameters | ~20B (as per upstream) |
| Tokenizer | Use the upstream tokenizer associated with gpt-oss-20b |
| Context window | Determined by the upstream base; set accordingly in your runtime |
| Fine-tuning | Supervised Fine-Tuning (SFT); optional preference optimization (DPO/ORPO) |
| Precision | FP16/BF16 recommended; 4-bit (bnb) for single-GPU experimentation |
| Intended runtimes | Hugging Face Transformers, PEFT, vLLM, TGI (Text Generation Inference) |
Note: Please adjust any specifics (context length, tokenizer name) to match the exact upstream build you use for gpt-oss-20b.
Files
| File / Folder | Description |
|---|---|
| README.md | This model card |
| config.json / tokenizer files | Configuration and tokenizer artifacts (from upstream) |
| pytorch_model.safetensors | Full fine-tuned weights (if released as full model) |
| adapter_model.safetensors | LoRA/QLoRA adapters only (if released as adapters) |
| training_args.json (optional) | Minimal training configuration for reproducibility |
Only one of "full weights" or "adapters" may be included depending on how you distribute Bubba.
Intended Use & Limitations
Intended Use
- Instruction following, general dialogue
- Code assistance (reasoning, boilerplate, refactoring)
- Knowledge/Q&A within the model's training cutoff
Out-of-Scope / Known Limitations
- Factuality: may produce inaccurate or outdated info
- Safety: can emit biased or unsafe text; apply your own filters/guardrails
- High-stakes decisions: not for medical, legal, financial, or safety-critical use
Quick Start
Examples: Using the Bubba LLM (Fine-tuned from gpt-oss-20b)
This guide shows several ways to run Bubba locally or on a server. Examples cover full weights, LoRA/QLoRA adapters, vLLM, and Text Generation Inference (TGI), plus prompt patterns and RAG.
Python (Transformers): Full Weights
Install
pip install "transformers>=4.44.0" accelerate torch --upgrade
Load and generate
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "your-namespace/Bubba-gpt-oss-20b-finetuned"
tok = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
prompt = "In 5 bullet points, explain retrieval-augmented generation and when to use it."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,  # sampling must be enabled for temperature/top_p to take effect
    temperature=0.7,
    top_p=0.9,
)
print(tok.decode(out[0], skip_special_tokens=True))
Notes
• device_map="auto" will place weights across available GPUs/CPU.
• Prefer BF16 if supported; otherwise FP16. For VRAM-constrained experiments, see 4-bit below.
Python (PEFT): Adapters on Top of the Base
Install
pip install "transformers>=4.44.0" peft accelerate torch --upgrade
Load base + LoRA/QLoRA adapters
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
base_name = "openai/gpt-oss-20b" # replace with the exact upstream base you use
lora_name = "your-namespace/Bubba-gpt-oss-20b-finetuned"
tok = AutoTokenizer.from_pretrained(base_name, use_fast=True)
base = AutoModelForCausalLM.from_pretrained(
    base_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Attach the LoRA/QLoRA adapters to the base model
model = PeftModel.from_pretrained(base, lora_name)
prompt = "Draft a JSON spec with keys: goal, steps[], risks[], success_metric."
inputs = tok(prompt, return_tensors="pt").to(model.device)
# do_sample=True is needed for temperature/top_p to take effect
out = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.6, top_p=0.9)
print(tok.decode(out[0], skip_special_tokens=True))
4-bit (bitsandbytes): Memory-Efficient Loading
Install
pip install "transformers>=4.44.0" accelerate bitsandbytes --upgrade
Load with 4-bit quantization
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_name = "your-namespace/Bubba-gpt-oss-20b-finetuned"
tok = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb,
    device_map="auto",
)
prompt = "Explain beam search vs. nucleus sampling in three short bullets."
# Send inputs to wherever device_map placed the model
inputs = tok(prompt, return_tensors="pt").to(model.device)
# do_sample=True is needed for temperature/top_p to take effect
out = model.generate(**inputs, max_new_tokens=160, do_sample=True, temperature=0.7, top_p=0.9)
print(tok.decode(out[0], skip_special_tokens=True))
Serve with vLLM (OpenAI-compatible API)
Install and launch (example)
pip install vllm
python -m vllm.entrypoints.openai.api_server \
--model your-namespace/Bubba-gpt-oss-20b-finetuned \
--dtype bfloat16 --max-model-len 8192 \
--port 8000
Call the endpoint (Python)
import json
import requests

url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}
data = {
    "model": "your-namespace/Bubba-gpt-oss-20b-finetuned",
    "messages": [
        {"role": "system", "content": "You are concise and factual."},
        {"role": "user", "content": "Give a 4-step checklist for evaluating a RAG pipeline."},
    ],
    "temperature": 0.7,
    "max_tokens": 256,
    "stream": True,
}
# Stream the response and print each server-sent chunk as it arrives
with requests.post(url, headers=headers, data=json.dumps(data), stream=True) as r:
    for line in r.iter_lines():
        if line and line.startswith(b"data: "):
            chunk = line[len(b"data: "):].decode("utf-8")
            if chunk == "[DONE]":
                break
            print(chunk, flush=True)
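The loop above prints each raw JSON chunk. To keep only the generated text, each chunk can be parsed. The sketch below assumes the server emits OpenAI-style streaming chunks with the text under `choices[0].delta.content`; the helper name `extract_delta` and the sample lines are illustrative, not part of any library.

```python
import json
from typing import Optional


def extract_delta(line: bytes) -> Optional[str]:
    """Return the text fragment carried by one SSE line, or None.

    Assumes OpenAI-style chunks: 'data: {json}' lines whose payload holds
    choices[0].delta.content, with 'data: [DONE]' marking the end.
    """
    prefix = b"data: "
    if not line.startswith(prefix):
        return None  # blank or keep-alive line
    payload = line[len(prefix):].decode("utf-8").strip()
    if payload == "[DONE]":
        return None  # end-of-stream sentinel
    choices = json.loads(payload).get("choices") or []
    if not choices:
        return None
    return choices[0].get("delta", {}).get("content")


# Illustrative lines shaped like a streamed chat completion.
sample = [
    b'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    b'data: {"choices": [{"delta": {"content": "Check "}}]}',
    b'data: {"choices": [{"delta": {"content": "recall."}}]}',
    b"data: [DONE]",
]
text = "".join(piece for piece in map(extract_delta, sample) if piece)
print(text)  # prints: Check recall.
```

In the streaming loop, `extract_delta(line)` would replace the manual prefix handling, and non-empty results can be printed with `end=""`.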
Serve with Text Generation Inference (TGI)
Run the server (Docker)
docker run --gpus all --shm-size 1g -p 8080:80 \
-e MODEL_ID=your-namespace/Bubba-gpt-oss-20b-finetuned \
ghcr.io/huggingface/text-generation-inference:latest
Call the server (HTTP)
curl http://localhost:8080/generate \
-X POST -d '{
"inputs": "Summarize pros/cons of hybrid search (BM25 + embeddings).",
"parameters": {"max_new_tokens": 200, "temperature": 0.7, "top_p": 0.9}
}' \
-H "Content-Type: application/json"
Prompt Patterns
Direct instruction (concise)
You are a precise assistant. In 6 bullets, explain evaluation metrics for retrieval (Recall@k,
MRR, nDCG). Keep each bullet under 20 words.
Constrained JSON output
System: Output only valid JSON. No prose.
User: Produce {"goal":"", "steps":[""], "risks":[""], "metrics":[""]} for testing a QA bot.
Guarded answer
If the answer isn't derivable from the context, say "I don't know" and ask for the missing info.
Few-shot structure
Example:
Q: Map 3 tasks to suitable embedding dimensions.
A: 256: short titles; 768: support FAQs; 1024: multi-paragraph knowledge base.
Basic RAG
# 1) Retrieve
chunks = retriever.search("compare vector DBs for legal discovery", k=5)
# 2) Build prompt
context = "\n".join([f"• {c.text} [{c.source}]" for c in chunks])
prompt = f"""
You are a helpful assistant. Use only the context to answer.
Context:
{context}
Question:
What selection criteria should teams use when picking a vector DB for scale and cost?
"""
# 3) Generate (Transformers / vLLM / TGI)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.6, top_p=0.9)
print(tok.decode(out[0], skip_special_tokens=True))
1. Document Ingestion
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
loader = TextLoader("docs/corpus.txt")
documents = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=150)
docs = splitter.split_documents(documents)
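To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-width chunker. It is only an illustration of the overlap idea: `RecursiveCharacterTextSplitter` prefers to split on separators such as paragraphs and sentences, which this sketch does not do. The function name `chunk_text` is hypothetical.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into windows of chunk_size characters that share
    chunk_overlap characters with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


pieces = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(pieces)  # prints: ['abcd', 'defg', 'ghij']
```

With the settings used above (700 characters, 150 overlap), each new chunk starts 550 characters after the previous one.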
2. Embedding & Vector Indexing
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
embedding = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(docs, embedding)
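Conceptually, a vector index ranks stored embeddings by their similarity to a query embedding. The brute-force sketch below shows that idea with cosine similarity in plain Python. It is not how FAISS is implemented (FAISS uses optimized index structures), and the two-dimensional vectors are made-up toy values.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float], vectors: Sequence[Sequence[float]], k: int) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k vectors most similar to query."""
    scored = [(i, cosine(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


toy_index = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy embeddings
hits = top_k([1.0, 0.1], toy_index, k=2)
print([i for i, _ in hits])  # prints: [0, 2]
```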
3. Retrieval + Prompt Formatting
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
retrieved_docs = retriever.get_relevant_documents("What role does Bubba play in improving document QA?")
context = "\n\n".join([doc.page_content for doc in retrieved_docs])
prompt = f"""
You are Bubba, a reasoning-heavy assistant. Use only the context below to answer:
<context>
{context}
</context>
<question>
What role does Bubba play in improving document QA?
</question>
"""
4. LLM Inference with Bubba
./main -m Bubba.Q4_K_M.gguf -p "$prompt" -n 768 -t 16 -c 4096 --color
Bubba's output will include a context-aware, citation-grounded response backed by the retrieved input.
Notes
- Bubba (20B parameter model) may require more memory than smaller models like Bro or Leeroy.
- Use a higher -c value (context size) to accommodate longer prompts with more chunks.
- GPU acceleration is recommended for smooth generation if your hardware supports it.
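The note about `-c` comes down to budget arithmetic: the prompt tokens plus the tokens to generate must fit in the context window. The sketch below keeps as many retrieved chunks as fit. It assumes a rough heuristic of about 4 characters per token and a fixed allowance for the instruction text, both of which are assumptions; use the real tokenizer for accurate counts. The values 4096 and 768 mirror the `-c` and `-n` flags in the command above.

```python
from typing import List


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate (assumed heuristic, not a tokenizer)."""
    return int(len(text) / chars_per_token) + 1


def select_chunks(chunks: List[str], n_ctx: int, max_new_tokens: int,
                  overhead_tokens: int = 64) -> List[str]:
    """Keep leading chunks while prompt + generation still fit in n_ctx."""
    budget = n_ctx - max_new_tokens - overhead_tokens
    kept: List[str] = []
    used = 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget:
            break
        kept.append(chunk)
        used += cost
    return kept


chunks = ["x" * 2800] * 6  # six chunks of roughly 700 estimated tokens each
kept = select_chunks(chunks, n_ctx=4096, max_new_tokens=768)
print(len(kept))  # prints: 4
```

Raising `n_ctx` to 8192 in this toy setup lets all six chunks through, which is the effect of passing a larger `-c`, as long as the base model supports that context length.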
Parameter Tips
• Temperature: 0.6–0.9 (lower = more deterministic)
• Top-p: 0.8–0.95 (tune one knob at a time)
• Max new tokens: 128–384 for chat; longer for drafting
• Repetition penalty: 1.05–1.2 if loops appear
• Batch size: use padding_side="left" and dynamic padding for throughput
• Context length: set to your runtime's max; compress context via selective retrieval
Troubleshooting
• CUDA OOM: Lower max_new_tokens; enable 4-bit; shard across GPUs; reduce context length.
• Slow throughput: Use vLLM/TGI with tensor or pipeline-parallel (PP) sharding; enable paged attention; pin to BF16.
• Messy JSON: Use a JSON-only system prompt; set temperature ≤0.6; add a JSON schema in the prompt.
• Domain shift: Consider small adapter tuning on your domain data; add retrieval grounding.
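For the "Messy JSON" case, one complementary safeguard is to validate the model output before using it and retry when validation fails. The sketch below checks the keys from the constrained-JSON prompt pattern earlier in this guide (goal, steps, risks, metrics). The helper name `parse_spec` and the sample strings are hypothetical.

```python
import json
from typing import Any, Dict, Optional

# Expected keys and types, taken from the constrained-JSON prompt pattern.
REQUIRED = {"goal": str, "steps": list, "risks": list, "metrics": list}


def parse_spec(raw: str) -> Optional[Dict[str, Any]]:
    """Return the parsed object if raw contains valid JSON with the
    expected keys and types, else None."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        obj = json.loads(raw[start:end + 1])  # tolerate prose around the JSON
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    for key, expected_type in REQUIRED.items():
        if not isinstance(obj.get(key), expected_type):
            return None
    return obj


good = 'Sure! {"goal": "test bot", "steps": ["a"], "risks": [], "metrics": ["f1"]}'
bad = '{"goal": "test bot", "steps": "a"}'
print(parse_spec(good) is not None, parse_spec(bad) is None)  # prints: True True
```

A caller can loop a few times, regenerating at a lower temperature whenever `parse_spec` returns None.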
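Minimal Batch Inference Example
prompts = [
    "List 5 key features of FAISS.",
    "Why would I choose pgvector over Milvus?",
]
# Decoder-only models should be left-padded for batched generation
tok.padding_side = "left"
if tok.pad_token is None:
    tok.pad_token = tok.eos_token  # fall back to EOS when no pad token is defined
inputs = tok(prompts, return_tensors="pt", padding=True).to(model.device)
out = model.generate(**inputs, max_new_tokens=160, do_sample=True, temperature=0.7, top_p=0.9)
for i, seq in enumerate(out):
    print(f"--- Prompt {i+1} ---")
    print(tok.decode(seq, skip_special_tokens=True))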
Inference Tips
- Prefer BF16 if available; otherwise FP16. For limited VRAM, try 4-bit (bitsandbytes) to explore.
- Start with max_new_tokens between 128 and 384 and temperature 0.6–0.9; tune top_p for stability.
- For RAG, constrain prompt length and adopt strict chunking/citation formatting for better grounding.
WebUI
- Place the GGUF in text-generation-webui/models/bubba-20b-Q4_K_XL/
- Launch with the llama.cpp loader (or llama-cpp-python backend)
- Select the model in the UI, adjust context length, GPU layers, and sampling
KoboldCPP
./koboldcpp \
-m bubba-20b-Q4_K_XL.gguf \
--contextsize 4096 \
--gpulayers 35 \
--usecublas
LM Studio
- Open LM Studio → Models → Local models → Add local model and select the .gguf.
- In Chat, pick the model, set Context length (≤ base model max), and adjust GPU Layers.
- For API use, enable Local Server and target the exposed endpoint with OpenAI-compatible clients.
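As a rough illustration of targeting an OpenAI-compatible endpoint, the stdlib-only sketch below builds a chat-completions request without sending it. The base URL `http://localhost:1234/v1` and the model id `bubba-20b-Q4_K_XL` are assumptions; use the address and model name your local server actually reports. The helper `build_chat_request` is hypothetical.

```python
import json
import urllib.request
from typing import Dict, List


def build_chat_request(base_url: str, model: str,
                       messages: List[Dict[str, str]],
                       temperature: float = 0.7) -> urllib.request.Request:
    """Build (but do not send) a POST to the /chat/completions route."""
    body = json.dumps(
        {"model": model, "messages": messages, "temperature": temperature}
    ).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request(
    "http://localhost:1234/v1",  # assumed local server address
    "bubba-20b-Q4_K_XL",         # assumed model id as listed by the server
    [{"role": "user", "content": "Say hello in one sentence."}],
)
print(req.full_url)  # prints: http://localhost:1234/v1/chat/completions

# To send it once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```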
Prompting
This build is instruction-tuned (downstream behavior depends on your base). Common prompt patterns work:
Simple instruction
Write a concise summary of the benefits of grouped 4-bit quantization.
ChatML-like
<|system|>
You are a helpful, concise assistant.
<|user|>
Compare Q4_K_XL vs Q5_K_M in terms of quality and RAM.
<|assistant|>
Code task
Task: Write a Python function that computes perplexity given log-likelihoods.
Constraints: Include docstrings and type hints.
Tip: Keep prompts explicit and structured (roles, constraints, examples).
Suggested starting points: temperature 0.2–0.8, top_p 0.8–0.95, repeat_penalty 1.05–1.15.
- No special chat template is strictly required. Use clear instructions and keep prompts concise. For multi-turn workflows, persist conversation state externally or via your app's memory/RAG layer.
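One way to persist conversation state externally is a small history object that keeps the system prompt plus the most recent turns. This is a minimal sketch: the class name `Conversation` is hypothetical, and trimming by message count is a simplification (a token-based budget would be more precise).

```python
from typing import Dict, List


class Conversation:
    """Holds a system prompt and a sliding window of recent messages."""

    def __init__(self, system: str, max_messages: int = 6) -> None:
        if max_messages < 1:
            raise ValueError("max_messages must be at least 1")
        self.system = {"role": "system", "content": system}
        self.max_messages = max_messages
        self.history: List[Dict[str, str]] = []

    def add(self, role: str, content: str) -> None:
        self.history.append({"role": role, "content": content})
        # Drop the oldest messages once the window is exceeded.
        self.history = self.history[-self.max_messages:]

    def messages(self) -> List[Dict[str, str]]:
        """Messages to send on the next request, system prompt first."""
        return [self.system] + self.history


chat = Conversation("You are a concise, accurate assistant.", max_messages=2)
chat.add("user", "What is GGUF?")
chat.add("assistant", "A file format for llama.cpp models.")
chat.add("user", "Does quantization change the tokenizer?")
print(len(chat.messages()))  # prints: 3
```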
Example system style
You are a concise, accurate assistant. Prefer step-by-step reasoning only when needed.
Cite assumptions and ask for missing constraints.
- Guro is a prompt library designed to supercharge AI agents and assistants with task-specific personas (i.e., total randos). From academic writing to financial analysis, technical support, SEO, and beyond, Guro provides precision-crafted prompt templates ready to drop into your LLM workflows.
Performance & Memory Guidance (Rules of Thumb)
- RAM/VRAM for Q4_K_XL (20B): ~12–16 GB for batch-1 inference (varies by backend and offloading).
- Throughput: Highly dependent on CPU/GPU, backend, context length, and GPU offload. Start with -ngl as high as your VRAM allows, then tune threads/batch sizes.
- Context window: Do not exceed the base model's maximum (quantization does not increase it).
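The memory rule of thumb can be sanity-checked with simple arithmetic: weight memory is roughly the parameter count times the bits per weight, divided by 8. The sketch below uses 4.5 bits per weight as an assumed, illustrative average for a 4-bit K-quant; the real figure depends on the quantization mix in the file.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# 20B parameters at an assumed ~4.5 bits per weight.
weights_gb = weight_memory_gb(20e9, 4.5)
print(f"{weights_gb:.2f} GB")  # prints: 11.25 GB
```

The KV cache and runtime buffers come on top of the weights and grow with context length, which is in line with the ~12–16 GB range quoted above.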
Files
- bubba-20b-Q4_K_XL.gguf: 4-bit K-quantized weights (XL variant)
- tokenizer.*: packed inside GGUF (no separate files needed)
Integrity: Verify your download against a checksum (e.g., SHA256) if the host/mirror provides one.
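A minimal sketch of the integrity check using Python's `hashlib`, reading the file in blocks so a multi-gigabyte GGUF does not need to fit in memory. The helper names `sha256_of` and `verify` are hypothetical, and the demo runs on a small temporary file rather than the model.

```python
import hashlib
import tempfile


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with the checksum published by the host."""
    return sha256_of(path) == expected_hex.strip().lower()


# Demo on a small temporary file; for the model, pass the path to the
# .gguf and the checksum listed by the host or mirror.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
print(verify(tmp.name, hashlib.sha256(b"hello").hexdigest()))  # prints: True
```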
GGUF Format
- Start from the base gpt-oss-20b weights (FP16/BF16).
- Convert to GGUF with llama.cpp's convert tooling (or equivalent for the base arch).
- Quantize with llama.cpp quantize to Q4_K_XL.
- Sanity-check perplexity/behavior, package with metadata.
Exact scripts/commits may vary by environment; please share your pipeline for full reproducibility if you fork this card.
Safety, Bias & Responsible Use
Large language models can generate plausible but incorrect or harmful content and may reflect societal biases. If you deploy this model:
- Add moderation/guardrails and domain-specific filters.
- Provide user disclaimers and feedback channels.
- Keep human-in-the-loop for consequential outputs.
License and Usage
This model package derives from gpt-oss-20b, so you are responsible for ensuring your use complies with the upstream model license and any dataset terms. For commercial deployment, review OpenAI's license and your organization's compliance requirements.
- Bubba is published under the MIT General Public License v3
Attribution
If this quant helped you, consider citing it like this:
bubba-20b-Q4_K_XL.gguf (2025).
Quantized GGUF build derived from OpenAI's gpt-oss-20b.
Retrieved from the Hugging Face Hub.
FAQ
Does quantization change the context window or tokenizer?
No. Those are inherited from the base model; quantization only changes weight representation.
Why am I hitting out-of-memory?
Lower -ngl (fewer GPU layers), reduce context (-c), or switch to a smaller quant (e.g., Q3_K).
Ensure no other large models occupy VRAM.
Best sampler settings?
Start with temp 0.7, top_p 0.9, repeat_penalty 1.1.
Lower temperature for coding/planning; raise for creative writing.
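The sampler guidance above can be captured as a small lookup. The baseline values come from the FAQ answer (temperature 0.7, top_p 0.9, repeat_penalty 1.1); the per-task temperatures are illustrative assumptions, not tuned recommendations, and `sampler_settings` is a hypothetical helper.

```python
from typing import Dict

# Baseline from the FAQ above.
BASELINE: Dict[str, float] = {"temperature": 0.7, "top_p": 0.9, "repeat_penalty": 1.1}

# Assumed per-task temperatures: lower for coding/planning, higher for creative.
TEMPERATURE_BY_TASK: Dict[str, float] = {"coding": 0.3, "planning": 0.3, "creative": 0.9}


def sampler_settings(task: str = "chat") -> Dict[str, float]:
    """Return sampling parameters for a task, defaulting to the baseline."""
    settings = dict(BASELINE)
    settings["temperature"] = TEMPERATURE_BY_TASK.get(task, BASELINE["temperature"])
    return settings


print(sampler_settings("coding"))
# prints: {'temperature': 0.3, 'top_p': 0.9, 'repeat_penalty': 1.1}
```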
Changelog
- v1.0: Initial release of bubba-20b-Q4_K_XL.gguf.