# My Optimal Model - GGUF

CPU-optimized quantized version for 10x faster inference on free hardware!
## Quick Start

```python
from llama_cpp import Llama
from huggingface_hub import hf_hub_download

# Download the quantized model file from the Hub
model_path = hf_hub_download(
    repo_id="yamraj047/my_optimal_model-GGUF",
    filename="my-optimal-model-Q4_K_M.gguf"
)

# Load the model (raise n_ctx toward 32768 if you need the full context window)
llm = Llama(model_path=model_path, n_ctx=2048, n_threads=4)

# Generate text
response = llm("Your prompt here", max_tokens=300)
print(response['choices'][0]['text'])
```
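For multi-turn use, the raw completion call above needs the conversation flattened into a single prompt. A minimal sketch of that flattening; the `User:`/`Assistant:` template here is an assumption for illustration, so match it to whatever chat format this model was actually fine-tuned with:

```python
def build_prompt(history, message):
    """Flatten (user, assistant) turns plus the new message into one prompt string.

    NOTE: the "User:/Assistant:" template is illustrative, not this model's
    documented chat format -- substitute the real template if it differs.
    """
    lines = []
    for user_turn, assistant_turn in history:
        lines.append(f"User: {user_turn}")
        lines.append(f"Assistant: {assistant_turn}")
    lines.append(f"User: {message}")
    lines.append("Assistant:")  # trailing cue so the model continues the reply
    return "\n".join(lines)

prompt = build_prompt([("Hi", "Hello! How can I help?")], "What is GGUF?")
print(prompt)
```

The resulting string would then be passed to `llm(prompt, max_tokens=300)` in place of the bare message.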
## Specifications

- Size: ~4.07 GB (vs 14.5 GB original FP16)
- Quantization: Q4_K_M (mixed precision)
- Quality: ~98% of original FP16
- Speed: 2-4 min on free CPU vs 20+ min on GPU
- Context: 32K tokens supported
- Hardware: CPU only - no GPU needed
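The size figures above are consistent with back-of-the-envelope quantization math, assuming the FP16 file is dominated by weights at 2 bytes each and that Q4_K_M averages roughly 4.5 bits per weight (an approximate figure for its mixed 4/6-bit blocks, not something stated in this card):

```python
fp16_gb = 14.5                          # original FP16 file size from the table
params_b = fp16_gb * 1e9 / 2            # ~7.25e9 weights at 2 bytes each
q4km_bits = 4.5                         # assumed average bits/weight for Q4_K_M
q4km_gb = params_b * q4km_bits / 8 / 1e9

print(round(q4km_gb, 2))  # ~4.08, close to the ~4.07 GB file above
```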
## Use with Gradio

```python
from llama_cpp import Llama
from huggingface_hub import hf_hub_download
import gradio as gr

model_path = hf_hub_download(
    repo_id="yamraj047/my_optimal_model-GGUF",
    filename="my-optimal-model-Q4_K_M.gguf"
)
llm = Llama(model_path=model_path, n_ctx=2048, n_threads=4)

def chat(message, history):
    response = llm(message, max_tokens=400, temperature=0.7)
    return response['choices'][0]['text'].strip()

demo = gr.ChatInterface(
    fn=chat,
    title="My Optimal Model Assistant"
)
demo.launch()
```
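`gr.ChatInterface` also accepts a generator function, and llama-cpp-python can stream partial completions when called with `stream=True` (each chunk has the same `choices[0]['text']` shape as a full response). A sketch of the accumulation pattern, with a stubbed chunk stream standing in for the model so the snippet is self-contained:

```python
def stream_reply(chunks):
    """Accumulate streamed completion chunks into growing partial replies,
    the shape gr.ChatInterface expects from a generator fn."""
    partial = ""
    for chunk in chunks:
        partial += chunk['choices'][0]['text']
        yield partial

# With llama-cpp-python the chunks would come from
# llm(message, max_tokens=400, temperature=0.7, stream=True);
# here a stubbed stream stands in:
fake_stream = [{'choices': [{'text': t}]} for t in ["GGUF ", "is ", "a format."]]
print(list(stream_reply(fake_stream))[-1])  # "GGUF is a format."
```

Wiring this in amounts to yielding from `chat` instead of returning, so tokens appear in the UI as they are generated rather than after a 2-4 minute wait.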
## Model Versions

This is the GGUF-quantized version of the merged FP16 model.
| Version | Size | Quality | Use Case |
|---|---|---|---|
| Original FP16 | 14.5 GB | 100% | GPU inference |
| GGUF Q4_K_M | 4.07 GB | 98% | CPU inference |
## License

Apache 2.0
Base model: yamraj047/my_optimal_model-merged-fp16