---
title: Multi-GGUF LLM Inference
emoji: 🧠
colorFrom: pink
colorTo: purple
sdk: streamlit
sdk_version: 1.44.1
app_file: app.py
pinned: false
license: apache-2.0
short_description: Run GGUF models (Qwen2.5, Gemma-3, Phi-4) with llama.cpp
---

This Streamlit app provides chat-based inference on a selection of GGUF models via llama-cpp-python, the Python bindings for llama.cpp.

🔄 Supported Models:

  • Qwen/Qwen2.5-7B-Instruct-GGUF → qwen2.5-7b-instruct-q2_k.gguf
  • unsloth/gemma-3-4b-it-GGUF → gemma-3-4b-it-Q4_K_M.gguf
  • unsloth/Phi-4-mini-instruct-GGUF → Phi-4-mini-instruct-Q4_K_M.gguf
  • MaziyarPanahi/Meta-Llama-3.1-8B-Instruct-GGUF → Meta-Llama-3.1-8B-Instruct.Q2_K.gguf
  • unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF → DeepSeek-R1-Distill-Llama-8B-Q2_K.gguf
  • MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF → Mistral-7B-Instruct-v0.3.IQ3_XS.gguf
  • Qwen/Qwen2.5-Coder-7B-Instruct-GGUF → qwen2.5-coder-7b-instruct-q2_k.gguf
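
The repo-to-filename mapping above can be expressed as a plain dict, with a small helper that fetches the selected file on demand. This is a minimal sketch, not the app's actual code: the `MODELS` dict and `download_model` helper are illustrative names, and `huggingface_hub` is assumed to be an installed dependency.

```python
# Repo ID → quantized GGUF filename, exactly as listed above.
MODELS = {
    "Qwen/Qwen2.5-7B-Instruct-GGUF": "qwen2.5-7b-instruct-q2_k.gguf",
    "unsloth/gemma-3-4b-it-GGUF": "gemma-3-4b-it-Q4_K_M.gguf",
    "unsloth/Phi-4-mini-instruct-GGUF": "Phi-4-mini-instruct-Q4_K_M.gguf",
    "MaziyarPanahi/Meta-Llama-3.1-8B-Instruct-GGUF": "Meta-Llama-3.1-8B-Instruct.Q2_K.gguf",
    "unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF": "DeepSeek-R1-Distill-Llama-8B-Q2_K.gguf",
    "MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF": "Mistral-7B-Instruct-v0.3.IQ3_XS.gguf",
    "Qwen/Qwen2.5-Coder-7B-Instruct-GGUF": "qwen2.5-coder-7b-instruct-q2_k.gguf",
}

def download_model(repo_id: str) -> str:
    """Download the single GGUF file for `repo_id` (cached after the
    first call) and return its local path."""
    from huggingface_hub import hf_hub_download  # assumed dependency
    return hf_hub_download(repo_id=repo_id, filename=MODELS[repo_id])
```

The returned path can then be passed straight to `llama_cpp.Llama(model_path=...)`.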

โš™๏ธ Features:

  • Model selection in the sidebar
  • Customizable system prompt and generation parameters
  • Chat-style UI with streaming responses

🧠 Memory-Safe Design (for HuggingFace Spaces):

  • Loads only one model at a time to prevent memory bloat
  • Utilizes manual unloading and gc.collect() to free memory when switching models
  • Adjusts n_ctx context length to operate within a 16 GB RAM limit
  • Automatically downloads models as needed
  • Limits history to the last 8 user-assistant turns to prevent context overflow
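The unloading and history-trimming points above can be sketched as two small helpers. This is a hedged illustration, not the app's actual source: the function names are hypothetical, and it assumes the chat history is a flat list of `{"role", "content"}` dicts where each turn contributes one user and one assistant message.

```python
import gc

MAX_TURNS = 8  # keep only the last 8 user/assistant exchanges

def trim_history(messages, max_turns=MAX_TURNS):
    """Return only the most recent `max_turns` exchanges (two
    messages per turn), so the prompt never overflows n_ctx."""
    return messages[-2 * max_turns:]

def unload_model(llm):
    """Drop the only reference to the current llama_cpp.Llama
    instance and force a collection so its weights are freed
    before the next model is loaded."""
    del llm
    gc.collect()
```

Keeping exactly one model resident and collecting eagerly between switches is what lets several 7B-class quantized models share a 16 GB free-tier Space.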

Ideal for deploying multiple GGUF chat models on free-tier HuggingFace Spaces!

Refer to the configuration guide at https://huggingface.co/docs/hub/spaces-config-reference