
phi3-kubernetes

Phi-3-mini fine-tuned on 300 Kubernetes Q&A samples from Stack Overflow using QLoRA. Built as part of an end-to-end Local Research Assistant pipeline. See the project repo for full context, including dataset preparation, evaluation results, inference server, and CI/CD setup.

Files

| File | Size | Purpose |
| --- | --- | --- |
| phi3_kubernetes_lora.zip | 106 MB | LoRA adapter weights (rank=16, alpha=32) |
| phi3-kubernetes-q4_k_m.gguf | 2.16 GB | Merged GGUF, 4-bit quant; recommended for deployment |
| phi3-kubernetes-q8_0.gguf | not hosted | Benchmarked locally; regenerable with llama-quantize after merging the LoRA adapter into the base model |

The Q8_0 quantization was benchmarked during development (see the Evaluation section below) but is not hosted here, to keep the repo small. Q4_K_M is the recommended variant for deployment: it delivers ~3.7× the throughput of Q8_0 within the same VRAM ceiling.

Training

| Hyperparameter | Value |
| --- | --- |
| Base model | microsoft/Phi-3-mini-4k-instruct |
| Fine-tune method | QLoRA (Unsloth) |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| Learning rate | 2e-4 (cosine schedule) |
| Batch size | 8 (effective 16 via grad accumulation) |
| Epochs | 3 |
| Hardware | Colab T4 (16 GB VRAM) |
| Dataset | mcipriano/stackoverflow-kubernetes-questions (300 samples filtered) |
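
To make the table concrete, here is a back-of-the-envelope sketch of what these settings imply for the optimizer schedule. It is not the project's training script. It assumes a single GPU, that all 300 samples are used for training, that the final partial batch is kept, and that the cosine schedule has no warmup; none of these details are stated above.

```python
import math

# Values taken from the hyperparameter table above.
NUM_SAMPLES = 300
PER_DEVICE_BATCH = 8
EFFECTIVE_BATCH = 16
EPOCHS = 3
PEAK_LR = 2e-4

# Assumption: one GPU, so accumulation steps = effective batch / per-device batch.
grad_accum_steps = EFFECTIVE_BATCH // PER_DEVICE_BATCH

# Assumption: all 300 samples are trained on and the last partial batch is kept.
steps_per_epoch = math.ceil(NUM_SAMPLES / EFFECTIVE_BATCH)
total_steps = steps_per_epoch * EPOCHS


def cosine_lr(step: int, total: int = total_steps, peak: float = PEAK_LR) -> float:
    """Cosine decay from `peak` at step 0 down to 0 at `total` (no warmup assumed)."""
    return 0.5 * peak * (1.0 + math.cos(math.pi * step / total))


print(f"gradient accumulation steps: {grad_accum_steps}")
print(f"optimizer steps per epoch:   {steps_per_epoch}")
print(f"total optimizer steps:       {total_steps}")
for step in (0, total_steps // 2, total_steps):
    print(f"lr at step {step:>2}: {cosine_lr(step):.2e}")
```

Under these assumptions the run is short (a few dozen optimizer steps), which is consistent with fitting the whole fine-tune on a single Colab T4 session.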

Evaluation

Tested on 30 held-out K8s Q&A samples against the base phi3:mini model:

| Metric | Base phi3:mini | Fine-tuned phi3-kubernetes | Δ |
| --- | --- | --- | --- |
| ROUGE-L | 0.1382 | 0.1622 | +17.4% |
| Avg latency | 18,104 ms | 9,272 ms | 2× faster |
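
For readers unfamiliar with the metric, the sketch below shows a simplified ROUGE-L F1 (longest common subsequence over lowercased, whitespace-split tokens) and recomputes the Δ column from the reported numbers. It is illustrative only: the scorer and tokenization actually used for the evaluation are not specified here, and the two example sentences are made up.

```python
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-L F1; assumes lowercased whitespace tokenization."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference/candidate pair, not taken from the evaluation set.
example_score = rouge_l_f1(
    "a pod is the smallest deployable unit",
    "a pod is a deployable unit",
)

# Recompute the reported deltas from the table values.
rouge_gain_pct = (0.1622 - 0.1382) / 0.1382 * 100
latency_speedup = 18104 / 9272

print(f"example ROUGE-L F1: {example_score:.4f}")
print(f"ROUGE-L relative gain: {rouge_gain_pct:.1f}%")
print(f"latency speedup: {latency_speedup:.2f}x")
```

Both absolute ROUGE-L scores are low, which is typical when free-form answers are compared against Stack Overflow text; the relative gain is the more informative figure.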

Quantization benchmark (RTX 3050 Laptop, 4 GB VRAM)

| Quant | TTFT (ms) | Throughput (tok/s) | Peak VRAM |
| --- | --- | --- | --- |
| Q4_K_M | 2,692 | 63.6 | 3.4 GB |
| Q8_0 | 2,798 | 17.3 | 3.6 GB |

Q4_K_M delivers ~3.7× the throughput of Q8_0 (63.6 vs 17.3 tok/s) within the same VRAM ceiling, so it was chosen as the deployment default.
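
To see what the throughput gap means for a full response, the sketch below estimates end-to-end time as TTFT plus tokens divided by throughput. This is a rough model that assumes decode speed stays constant over the response; the 256-token answer length is an arbitrary illustration, not a measured value.

```python
# Benchmark figures from the table above: (TTFT in ms, throughput in tok/s).
BENCHMARKS = {
    "Q4_K_M": (2692, 63.6),
    "Q8_0": (2798, 17.3),
}


def estimated_seconds(quant: str, num_tokens: int) -> float:
    """Rough response time: time to first token plus steady-state decoding."""
    ttft_ms, tok_per_s = BENCHMARKS[quant]
    return ttft_ms / 1000 + num_tokens / tok_per_s


throughput_ratio = BENCHMARKS["Q4_K_M"][1] / BENCHMARKS["Q8_0"][1]
print(f"throughput ratio: {throughput_ratio:.2f}x")

ANSWER_TOKENS = 256  # hypothetical answer length
for quant in BENCHMARKS:
    print(f"{quant}: ~{estimated_seconds(quant, ANSWER_TOKENS):.1f} s for {ANSWER_TOKENS} tokens")
```

Because TTFT is nearly identical for the two quantizations, the difference in response time comes almost entirely from decode throughput and grows with answer length.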

Usage with Ollama

ollama run hf.co/shlbnrj/phi3-kubernetes:Q4_K_M

Or manually:

wget https://huggingface.co/shlbnrj/phi3-kubernetes/resolve/main/phi3-kubernetes-q4_k_m.gguf

cat > Modelfile <<'EOF'
FROM ./phi3-kubernetes-q4_k_m.gguf
PARAMETER temperature 0.4
PARAMETER num_ctx 4096
EOF

ollama create phi3-kubernetes -f Modelfile
ollama run phi3-kubernetes "What is a Pod?"
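
Once the model has been created, it can also be queried programmatically. The following is a minimal sketch using only the Python standard library against Ollama's `/api/generate` endpoint. It assumes Ollama is listening on its default local port (11434) and that the model was registered under the name `phi3-kubernetes` as above; `build_payload` and `ask` are illustrative helper names, not part of this project.

```python
import json
import urllib.request

# Assumption: Ollama is serving on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(prompt: str, model: str = "phi3-kubernetes") -> dict:
    """Build a non-streaming request body for Ollama's /api/generate endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        # Mirrors the Modelfile parameters; optional if the Modelfile already sets them.
        "options": {"temperature": 0.4, "num_ctx": 4096},
    }


def ask(prompt: str, timeout: float = 120.0) -> str:
    """Send one prompt to the local Ollama server and return the generated text."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["response"]
```

With the server running, `print(ask("What is a Pod?"))` should print the model's answer. The generous timeout is deliberate, since the reported average latency is around nine seconds per response.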

License

MIT. The base model, microsoft/Phi-3-mini-4k-instruct, is also distributed under the MIT license; refer to its model card for the authoritative terms.

Limitations

This is a small (3.8B parameter) model. It works well for direct K8s knowledge questions but is unreliable for multi-step tool-use scenarios (e.g., chaining a search with a Python computation). See the project's NOTES.md for detailed failure-mode analysis.
