Solar FAQ: GGUF Q4_K_M (4.6 GB)

Llama-3.1-8B-Instruct fine-tuned with LoRA on a solar energy FAQ dataset and quantized to Q4_K_M GGUF. It runs on macOS, Windows, and Linux, on CPU or GPU; CUDA is not required.

Format: GGUF Q4_K_M (safe: no pickle, no .bin)
Size: 4.6 GB (original: 16 GB float16)
Platforms: Mac / Windows / Linux / CPU / GPU
Tools: llama-cpp-python · Ollama · LM Studio · Jan · GPT4All

Install & Run

Option 1: Python API (llama-cpp-python)

Step 1: Install llama-cpp-python (choose one):

# Mac Apple Silicon: Metal GPU acceleration (fast)
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python

# NVIDIA GPU on Linux/Windows: CUDA 12.4 (fast)
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124

# CPU only: any platform, no GPU needed (slower, ~5 tok/s)
pip install llama-cpp-python

Step 2: Run the model:

from llama_cpp import Llama

# Auto-downloads the GGUF from Hugging Face on first run (~4.6 GB).
# from_pretrained needs the huggingface-hub package: pip install huggingface-hub
llm = Llama.from_pretrained(
    repo_id="ankur1423/solar-faq-gguf",
    filename="*.gguf",
    n_ctx=2048,          # context window
    n_gpu_layers=-1,     # -1 = all layers on GPU; set 0 for CPU-only
    verbose=False,
)

# Single question
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a knowledgeable assistant for a solar energy company. Answer questions accurately about solar products, manufacturing, and company operations."},
        {"role": "user",   "content": "What is a BOM?"},
    ],
    max_tokens=512,
    temperature=0.1,
    top_p=0.9,
)
print(response["choices"][0]["message"]["content"])
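
For longer answers it can be nicer to print text as it is generated. The following is a minimal sketch, assuming `create_chat_completion` is called with `stream=True`, in which case it yields OpenAI-style chunks whose text sits under `choices[0]["delta"]["content"]`. The helper name `stream_text` is hypothetical, and the hand-written chunks stand in for a real model so the snippet runs without downloading anything.

```python
from typing import Iterable, Iterator


def stream_text(chunks: Iterable[dict]) -> Iterator[str]:
    """Yield only the text pieces from a stream of chat-completion chunks."""
    for chunk in chunks:
        delta = chunk["choices"][0].get("delta", {})
        piece = delta.get("content")
        if piece:  # role-only and final chunks carry no text
            yield piece


# Stand-in for: llm.create_chat_completion(messages=..., stream=True)
fake_chunks = [
    {"choices": [{"delta": {"role": "assistant"}}]},
    {"choices": [{"delta": {"content": "A BOM is "}}]},
    {"choices": [{"delta": {"content": "a bill of materials."}}]},
    {"choices": [{"delta": {}, "finish_reason": "stop"}]},
]

answer = "".join(stream_text(fake_chunks))
print(answer)
```

With a loaded model, the same helper could wrap the real stream and print each piece with `print(piece, end="", flush=True)`.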

Multi-turn conversation:

from llama_cpp import Llama

SYSTEM = "You are a knowledgeable assistant for a solar energy company."

llm = Llama.from_pretrained(
    repo_id="ankur1423/solar-faq-gguf",
    filename="*.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,
    verbose=False,
)

history = [{"role": "system", "content": SYSTEM}]

while True:
    user = input("You: ").strip()
    if not user or user.lower() in {"exit", "quit"}:
        break
    history.append({"role": "user", "content": user})
    resp = llm.create_chat_completion(history, max_tokens=512, temperature=0.1)
    answer = resp["choices"][0]["message"]["content"].strip()
    print(f"Assistant: {answer}\n")
    history.append({"role": "assistant", "content": answer})
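
The loop above appends to `history` indefinitely, so a long session can eventually exceed `n_ctx`. One possible mitigation is to drop the oldest turns before each call. This is a sketch under stated assumptions: `trim_history` and `rough_token_count` are hypothetical helpers, and the 4-characters-per-token estimate is a crude stand-in for a real tokenizer.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def rough_token_count(text: str) -> int:
    """Crude estimate (about 4 characters per token); an assumption, not a tokenizer."""
    return max(1, len(text) // 4)


def trim_history(
    history: List[Message],
    budget: int,
    count: Callable[[str], int] = rough_token_count,
) -> List[Message]:
    """Keep the system message plus the newest turns that fit within `budget` tokens."""
    system, turns = history[0], history[1:]
    used = count(system["content"])
    kept: List[Message] = []
    for msg in reversed(turns):  # walk from newest to oldest
        cost = count(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return [system] + kept[::-1]


history = [
    {"role": "system", "content": "You are a knowledgeable assistant for a solar energy company."},
    {"role": "user", "content": "x" * 400},
    {"role": "assistant", "content": "y" * 400},
    {"role": "user", "content": "What is a BOM?"},
]
trimmed = trim_history(history, budget=100)
print([m["role"] for m in trimmed])
```

In the chat loop, a reasonable budget would be `n_ctx` minus `max_tokens`, and `count` could be replaced by a function that measures `len(llm.tokenize(text.encode("utf-8")))` for an exact figure.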

Option 2: Ollama (no Python, no code)

# Install Ollama: https://ollama.com
ollama run hf.co/ankur1423/solar-faq-gguf

A single command downloads the model and starts an interactive session.
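
Once the model has been pulled, Ollama can also be queried from a script through its local REST API. The sketch below assumes the default server address `http://localhost:11434` and the `/api/chat` endpoint; the helper names `build_payload` and `ask` are hypothetical, and the options mirror the generation parameters recommended in this card.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local address (assumed)
MODEL = "hf.co/ankur1423/solar-faq-gguf"


def build_payload(question: str) -> dict:
    """Build a non-streaming chat request using the recommended sampling settings."""
    return {
        "model": MODEL,
        "stream": False,
        "messages": [
            {"role": "system", "content": "You are a knowledgeable assistant for a solar energy company."},
            {"role": "user", "content": question},
        ],
        "options": {"temperature": 0.1, "top_p": 0.9, "num_predict": 512, "num_ctx": 2048},
    }


def ask(question: str) -> str:
    """Send the request to a running Ollama server and return the answer text."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(question)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["message"]["content"]


payload = build_payload("What is a BOM?")
print(json.dumps(payload, indent=2))
# ask("What is a BOM?") requires the Ollama server to be running locally.
```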


Option 3: LM Studio (GUI, Windows/Mac/Linux)

  1. Download LM Studio
  2. Search ankur1423/solar-faq-gguf in the model browser
  3. Download → Chat

Option 4: Jan App (GUI, offline)

  1. Download Jan
  2. Go to Hub → search ankur1423/solar-faq-gguf
  3. Download → Chat

Option 5: llama.cpp CLI (raw, fastest)

# macOS (homebrew):
brew install llama.cpp

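# Linux (where the distribution packages llama.cpp; otherwise use a release binary or build from source):
sudo apt install llama.cpp   # Ubuntu 24.04+

# Then download the GGUF file and run: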
llama-cli \
  -m solar-faq-Q4_K_M.gguf \
  --system-prompt "You are a knowledgeable assistant for a solar energy company." \
  -i --color -c 2048

Platform Support Matrix

Platform | Backend | RAM needed | Speed
Mac M1/M2/M3/M4 | Metal GPU | 6 GB | Fast
NVIDIA GPU (Linux/Windows) | CUDA | 6 GB VRAM | Fast
CPU (Mac / Windows / Linux) | llama.cpp CPU | 6 GB RAM | ~5 tok/s
Google Colab (free tier) | CPU or T4 GPU | 6 GB | OK
Ollama (any OS) | auto-detect GPU/CPU | 6 GB | Fast / OK
LM Studio / Jan / GPT4All | auto-detect | 6 GB | Fast / OK

Minimum: 6 GB RAM/VRAM. Works on most modern laptops with no GPU.
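
The 6 GB figure can be sanity-checked with a back-of-the-envelope estimate: the model file plus the KV cache. The sketch below assumes the published Llama-3.1-8B architecture (32 layers, 8 key/value heads, head dimension 128) and an fp16 cache. It ignores compute buffers and other runtime overhead, so it is a lower bound, not a measurement.

```python
# Rough memory estimate: quantized weights + KV cache.
# Architecture values are assumed from the published Llama-3.1-8B config.
N_LAYERS = 32
N_KV_HEADS = 8       # grouped-query attention
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # fp16 cache entries (assumed)

MODEL_FILE_GB = 4.6  # from this model card


def kv_cache_gib(n_ctx: int) -> float:
    """KV-cache size in GiB: keys and values for every layer and position."""
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * n_ctx / 1024**3


for n_ctx in (2048, 4096):
    cache = kv_cache_gib(n_ctx)
    # The GB vs GiB difference is ignored in this rough total.
    print(f"n_ctx={n_ctx}: cache ~{cache:.2f} GiB, weights + cache ~{MODEL_FILE_GB + cache:.2f} GB")
```

Under these assumptions the cache adds a quarter to half a gigabyte, which leaves headroom below 6 GB for runtime buffers.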


Generation Parameters (recommended)

Parameter | Value | Notes
temperature | 0.1 | Low values give factual, consistent answers
top_p | 0.9 | Nucleus sampling
max_tokens | 256–512 | FAQ answers are concise
n_ctx | 2048 | Context window (increase to 4096 for long conversations)

For more creative or varied responses, raise temperature to 0.5–0.7.
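
To see why a low temperature gives consistent answers, the toy example below applies temperature scaling and nucleus (top_p) filtering to four made-up token scores. It illustrates the idea only and is not llama.cpp's sampler implementation; the function names and logits are hypothetical.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus(probs: List[float], top_p: float) -> List[int]:
    """Return indices of the smallest set of tokens whose probability mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return kept


logits = [2.0, 1.5, 0.5, 0.0]  # made-up scores for four candidate tokens
for t in (0.1, 0.7):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs], nucleus(probs, 0.9))
```

At temperature 0.1 nearly all probability lands on the top-scoring token, so only one candidate survives the top_p cut; at 0.7 the distribution is flatter and several candidates remain in play.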


Prompt Format (Llama-3 chat template)

This model uses the Llama-3 chat template. The prompt format is:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a knowledgeable assistant for a solar energy company.<|eot_id|>
<|start_header_id|>user<|end_header_id|>

What is a BOM?<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>

llama-cpp-python's create_chat_completion() handles this automatically.
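
For tools that take a raw prompt string instead of a message list, the layout above can be assembled by hand. This is a sketch, assuming turns are concatenated directly with no newline after `<|eot_id|>` (the line breaks between turns in the listing are assumed to be for readability); `build_llama3_prompt` is a hypothetical helper name.

```python
from typing import Dict, List


def build_llama3_prompt(messages: List[Dict[str, str]]) -> str:
    """Render chat messages in the Llama-3 layout and open an assistant turn."""
    parts = ["<|begin_of_text|>"]
    for msg in messages:
        parts.append(f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n")
        parts.append(f"{msg['content'].strip()}<|eot_id|>")
    # Leave the assistant header open so the model writes the reply
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = build_llama3_prompt([
    {"role": "system", "content": "You are a knowledgeable assistant for a solar energy company."},
    {"role": "user", "content": "What is a BOM?"},
])
print(prompt)
```

Some runtimes prepend `<|begin_of_text|>` automatically when tokenizing; in that case the leading token should probably be dropped to avoid a duplicate.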


Training Details

Base model: meta-llama/Meta-Llama-3.1-8B-Instruct
Fine-tuning method: LoRA (rank 8, 8 layers)
Dataset: ~62 solar energy FAQ Q&A pairs
Training iterations: 300
Learning rate: 1e-4 (cosine decay → 1e-5)
Batch size: 2
Max sequence length: 1024 tokens
Framework: MLX-LM 0.31+ on Apple Silicon
Quantization: GGUF Q4_K_M via llama.cpp
Size reduction: 16 GB float16 → 4.6 GB (−71%)
Training hardware: MacBook M4, 16 GB unified memory
Training time: ~20 minutes

What is GGUF Q4_K_M?

GGUF (GPT-Generated Unified Format) is a safe, portable model format used by llama.cpp.

Q4_K_M = 4-bit quantization, K-quant method, Medium size/quality tradeoff:

  • Most weights stored in 4 bits (vs 16 bits in float16)
  • Quality loss: minimal (~0.1–0.5% perplexity increase vs float16)
  • Speed: faster than float16 on CPU, because less data has to be read from memory for each token
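
The sizes quoted in this card give a quick check on what "most weights in 4 bits" means in practice. The arithmetic below uses only the two reported file sizes and assumes both are in the same unit; the variable names are illustrative.

```python
# Sizes reported in this model card (same unit assumed for both)
FLOAT16_SIZE = 16.0
Q4_K_M_SIZE = 4.6

ratio = Q4_K_M_SIZE / FLOAT16_SIZE
reduction_pct = (1 - ratio) * 100  # share of the original size saved
effective_bits = 16 * ratio        # average bits per weight in the quantized file

print(f"size reduction: {reduction_pct:.0f}%")
print(f"effective bits per weight: {effective_bits:.1f}")
```

The average comes out slightly above 4 bits, which is consistent with some tensors being kept at higher precision and with the per-block scaling data that K-quant formats store alongside the weights.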

No pickle tensors and no arbitrary code execution; the HF security scanner marks this file as safe.


Limitations

  • Domain-specific (solar FAQ): best for solar energy questions; falls back to base Llama-3.1 behavior outside the training domain
  • English only
  • Small dataset (~62 pairs): may not generalize to all solar topics
  • Fine-tuned on a Q4_K_M base, so additional quantization artifacts are possible

License

This model is derived from Meta Llama 3.1, which is licensed under the Llama 3.1 Community License. Use is subject to Meta's Acceptable Use Policy.
