# llama-cpp-python Prebuilt Wheels for HuggingFace Spaces (Free CPU)
Prebuilt `llama-cpp-python` wheels optimized for HuggingFace Spaces free tier (16GB RAM, 2 vCPU, CPU-only).
## Purpose
These wheels include the latest llama.cpp backend with support for newer model architectures:
- **LFM2 MoE** architecture (32 experts) for LFM2-8B-A1B
- Latest IQ4_XS quantization support
- OpenBLAS CPU acceleration
## Available Wheels
| Wheel File | Python | Platform | llama.cpp | Features |
|------------|--------|----------|-----------|----------|
| `llama_cpp_python-0.3.22-cp310-cp310-linux_x86_64.whl` | 3.10 | Linux x86_64 | Latest (Jan 2026) | LFM2 MoE, IQ4_XS, OpenBLAS |
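Before pinning this wheel in a build, you can sanity-check that its filename tags match your interpreter. A minimal sketch, parsing the filename from the table above (PEP 427 wheel names are `name-version-pytag-abitag-platformtag.whl`):

```python
import sys

# Wheel filename from the table above
wheel = "llama_cpp_python-0.3.22-cp310-cp310-linux_x86_64.whl"

# Split into the five PEP 427 components
name, version, py_tag, abi_tag, plat_tag = wheel[: -len(".whl")].split("-", 4)

# Tag this interpreter would need, e.g. "cp310" for CPython 3.10
interp_tag = f"cp{sys.version_info.major}{sys.version_info.minor}"

print(py_tag == interp_tag)  # True only on Python 3.10
```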
## Usage
### Setting Up HuggingFace Spaces with Python 3.10
These wheels are built for **Python 3.10**. To use them in HuggingFace Spaces:
**Step 1: Switch to Docker**
1. Go to your Space settings
2. Change "Space SDK" from **Gradio** to **Docker**
3. This enables custom Dockerfile support
**Step 2: Create a Dockerfile with Python 3.10**
Your Dockerfile should start with `python:3.10-slim` as the base image:
```dockerfile
# Use Python 3.10 explicitly (required for these wheels)
FROM python:3.10-slim
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
        gcc g++ make cmake git libopenblas-dev \
    && rm -rf /var/lib/apt/lists/*
# Install llama-cpp-python from prebuilt wheel
RUN pip install --no-cache-dir \
https://huggingface.co/Luigi/llama-cpp-python-wheels-hf-spaces-free-cpu/resolve/main/llama_cpp_python-0.3.22-cp310-cp310-linux_x86_64.whl
# Install other dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY . .
# Set environment variables
ENV PYTHONUNBUFFERED=1
ENV GRADIO_SERVER_NAME=0.0.0.0
# Expose Gradio port
EXPOSE 7860
# Run the app
CMD ["python", "app.py"]
```
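The Dockerfile above expects an `app.py`. Here is a minimal sketch of one, assuming a hypothetical GGUF file named `model.gguf` bundled next to the app and `gradio` listed in `requirements.txt` (imports are deferred so the module can be loaded without the model present):

```python
# app.py - minimal Gradio front end over a local GGUF model (illustrative names)
def build_demo():
    import gradio as gr            # from requirements.txt
    from llama_cpp import Llama    # provided by the prebuilt wheel

    llm = Llama(
        model_path="model.gguf",   # hypothetical: ship your GGUF with the app
        n_ctx=4096,
        n_threads=2,               # the free tier exposes 2 vCPUs
    )

    def complete(prompt: str) -> str:
        out = llm(prompt, max_tokens=256)
        return out["choices"][0]["text"]

    return gr.Interface(fn=complete, inputs="text", outputs="text")

if __name__ == "__main__":
    # GRADIO_SERVER_NAME=0.0.0.0 in the Dockerfile makes this reachable on port 7860
    build_demo().launch()
```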
**Complete Example:** See "Dockerfile (Recommended)" below for a minimal working setup.
### Why Docker SDK?
When you use a custom Dockerfile:
- ✅ Explicit Python version control (`FROM python:3.10-slim`)
- ✅ Full control over system dependencies
- ✅ Can use prebuilt wheels for faster builds
- ✅ No need for `runtime.txt` (Dockerfile takes precedence)
### Dockerfile (Recommended)
```dockerfile
FROM python:3.10-slim
# Install system dependencies for OpenBLAS
RUN apt-get update && apt-get install -y \
        gcc g++ make cmake git libopenblas-dev \
    && rm -rf /var/lib/apt/lists/*
# Install llama-cpp-python from prebuilt wheel (fast)
RUN pip install --no-cache-dir \
https://huggingface.co/Luigi/llama-cpp-python-wheels-hf-spaces-free-cpu/resolve/main/llama_cpp_python-0.3.22-cp310-cp310-linux_x86_64.whl
```
### With Fallback to Source Build
```dockerfile
# Try prebuilt wheel first, fall back to source build if unavailable
RUN if pip install --no-cache-dir https://huggingface.co/Luigi/llama-cpp-python-wheels-hf-spaces-free-cpu/resolve/main/llama_cpp_python-0.3.22-cp310-cp310-linux_x86_64.whl; then \
echo "✅ Using prebuilt wheel"; \
else \
echo "⚠️ Building from source"; \
CMAKE_ARGS="-DGGML_OPENBLAS=ON -DGGML_NATIVE=OFF" FORCE_CMAKE=1 \
pip install --no-cache-dir git+https://github.com/JamePeng/llama-cpp-python.git@5a0391e8; \
fi
```
## Why This Fork?
These wheels are built from the **JamePeng/llama-cpp-python** fork (v0.3.22) instead of the official abetlen/llama-cpp-python:
| Repository | Latest Version | llama.cpp | LFM2 MoE Support |
|------------|---------------|-----------|-----------------|
| JamePeng fork | v0.3.22 (Jan 2026) | Latest | ✅ Yes |
| Official (abetlen) | v0.3.16 (Aug 2025) | Outdated | ❌ No |
**Key Difference:** LFM2-8B-A1B requires a llama.cpp backend with LFM2 MoE architecture support (added Oct 2025). The official llama-cpp-python hasn't been updated since August 2025.
## Build Configuration
```bash
CMAKE_ARGS="-DGGML_OPENBLAS=ON -DGGML_NATIVE=OFF" \
FORCE_CMAKE=1 \
pip wheel --no-deps git+https://github.com/JamePeng/llama-cpp-python.git@5a0391e8
```
## Supported Models
These wheels enable the following IQ4_XS quantized models:
- **LFM2-8B-A1B** (LiquidAI) - 8.3B params, 1.5B active, MoE with 32 experts
- **Granite-4.0-h-micro** (IBM) - Ultra-fast inference
- **Granite-4.0-h-tiny** (IBM) - Balanced speed/quality
- All standard llama.cpp models (Llama, Gemma, Qwen, etc.)
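As a sketch of how one of these models might be loaded once the wheel is installed (the model filename is hypothetical; `n_ctx`, `n_threads`, and `n_gpu_layers` are standard llama-cpp-python `Llama` parameters, sized here for the free CPU tier):

```python
def load_model(model_path: str):
    """Load an IQ4_XS GGUF with settings sized for the free CPU tier."""
    from llama_cpp import Llama  # import deferred until the wheel is installed

    return Llama(
        model_path=model_path,  # e.g. a downloaded LFM2-8B-A1B IQ4_XS file
        n_ctx=8192,             # upper bound that still fits in 16 GB RAM
        n_threads=2,            # match the free tier's 2 vCPUs
        n_gpu_layers=0,         # CPU-only wheels: keep all layers on CPU
    )

if __name__ == "__main__":
    llm = load_model("LFM2-8B-A1B-IQ4_XS.gguf")  # hypothetical filename
    print(llm("Hello", max_tokens=16)["choices"][0]["text"])
```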
## Performance
- **Build time savings:** ~4 minutes → ~3 seconds (≈99% faster)
- **Memory footprint:** Fits in 16GB RAM with context up to 8192 tokens
- **CPU acceleration:** OpenBLAS optimized for x86_64
## Limitations
- **CPU-only:** No GPU/CUDA support (optimized for HF Spaces free tier)
- **Platform:** Linux x86_64 only
- **Python:** 3.10 only (matches HF Spaces default)
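A small startup guard that mirrors these limitations can fail fast with a clear message instead of an opaque import error. A sketch (the checks simply restate the constraints above):

```python
import platform
import sys

def wheel_compatible() -> bool:
    """True if this interpreter matches the wheels' cp310/linux/x86_64 tags."""
    return (
        sys.version_info[:2] == (3, 10)
        and platform.system() == "Linux"
        and platform.machine() == "x86_64"
    )

if not wheel_compatible():
    print("Warning: these prebuilt wheels target CPython 3.10 on Linux x86_64.")
```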
## License
These wheels include code from:
- [llama-cpp-python](https://github.com/JamePeng/llama-cpp-python) (MIT license)
- [llama.cpp](https://github.com/ggerganov/llama.cpp) (MIT license)
See upstream repositories for full license information.
## Maintenance
Built from: https://github.com/JamePeng/llama-cpp-python/tree/5a0391e8
To rebuild: See `build_wheel.sh` in the main project repository.
## Related
- Main project: [gemma-book-summarizer](https://huggingface.co/spaces/Luigi/gemma-book-summarizer)
- JamePeng fork: https://github.com/JamePeng/llama-cpp-python
- Original project: https://github.com/abetlen/llama-cpp-python