File size: 5,308 Bytes
4e1369a
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
# llama-cpp-python Prebuilt Wheels for HuggingFace Spaces (Free CPU)

Prebuilt `llama-cpp-python` wheels optimized for HuggingFace Spaces free tier (16GB RAM, 2 vCPU, CPU-only).

## Purpose

These wheels include the latest llama.cpp backend with support for newer model architectures:
- **LFM2 MoE** architecture (32 experts) for LFM2-8B-A1B
- Latest IQ4_XS quantization support
- OpenBLAS CPU acceleration

## Available Wheels

| Wheel File | Python | Platform | llama.cpp | Features |
|------------|--------|----------|-----------|----------|
| `llama_cpp_python-0.3.22-cp310-cp310-linux_x86_64.whl` | 3.10 | Linux x86_64 | Latest (Jan 2026) | LFM2 MoE, IQ4_XS, OpenBLAS |

## Usage

### Setting Up HuggingFace Spaces with Python 3.10

These wheels are built for **Python 3.10**. To use them in HuggingFace Spaces:

**Step 1: Switch to Docker**
1. Go to your Space settings
2. Change "Space SDK" from **Gradio** to **Docker**
3. This enables custom Dockerfile support

**Step 2: Create a Dockerfile with Python 3.10**

Your Dockerfile should start with `python:3.10-slim` as the base image:

```dockerfile
# Use Python 3.10 explicitly (required for these wheels)
FROM python:3.10-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    gcc g++ make cmake git libopenblas-dev \
    && rm -rf /var/lib/apt/lists/*

# Install llama-cpp-python from prebuilt wheel
RUN pip install --no-cache-dir \
    https://huggingface.co/Luigi/llama-cpp-python-wheels-hf-spaces-free-cpu/resolve/main/llama_cpp_python-0.3.22-cp310-cp310-linux_x86_64.whl

# Install other dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Set environment variables
ENV PYTHONUNBUFFERED=1
ENV GRADIO_SERVER_NAME=0.0.0.0

# Expose Gradio port
EXPOSE 7860

# Run the app
CMD ["python", "app.py"]
```

**Complete Example:** See the template below for a production-ready setup.

### Why Docker SDK?

When you use a custom Dockerfile:
- ✅ Explicit Python version control (`FROM python:3.10-slim`)
- ✅ Full control over system dependencies
- ✅ Can use prebuilt wheels for faster builds
- ✅ No need for `runtime.txt` (Dockerfile takes precedence)

### Dockerfile (Recommended)

```dockerfile
FROM python:3.10-slim

# Install system dependencies for OpenBLAS
RUN apt-get update && apt-get install -y \
    gcc g++ make cmake git libopenblas-dev \
    && rm -rf /var/lib/apt/lists/*

# Install llama-cpp-python from prebuilt wheel (fast)
RUN pip install --no-cache-dir \
    https://huggingface.co/Luigi/llama-cpp-python-wheels-hf-spaces-free-cpu/resolve/main/llama_cpp_python-0.3.22-cp310-cp310-linux_x86_64.whl
```

### With Fallback to Source Build

```dockerfile
# Try prebuilt wheel first, fall back to source build if unavailable
RUN if pip install --no-cache-dir https://huggingface.co/Luigi/llama-cpp-python-wheels-hf-spaces-free-cpu/resolve/main/llama_cpp_python-0.3.22-cp310-cp310-linux_x86_64.whl; then \
    echo "✅ Using prebuilt wheel"; \
    else \
    echo "⚠️  Building from source"; \
    pip install --no-cache-dir git+https://github.com/JamePeng/llama-cpp-python.git@5a0391e8; \
    fi
```

## Why This Fork?

These wheels are built from the **JamePeng/llama-cpp-python** fork (v0.3.22) instead of the official abetlen/llama-cpp-python:

| Repository | Latest Version | llama.cpp | LFM2 MoE Support |
|------------|---------------|-----------|-----------------|
| JamePeng fork | v0.3.22 (Jan 2026) | Latest | ✅ Yes |
| Official (abetlen) | v0.3.16 (Aug 2025) | Outdated | ❌ No |

**Key Difference:** LFM2-8B-A1B requires llama.cpp backend with LFM2 MoE architecture support (added Oct 2025). The official llama-cpp-python hasn't been updated since August 2025.

## Build Configuration

```bash
CMAKE_ARGS="-DGGML_OPENBLAS=ON -DGGML_NATIVE=OFF"
FORCE_CMAKE=1
pip wheel --no-deps git+https://github.com/JamePeng/llama-cpp-python.git@5a0391e8
```

## Supported Models

These wheels enable the following IQ4_XS quantized models:

- **LFM2-8B-A1B** (LiquidAI) - 8.3B params, 1.5B active, MoE with 32 experts
- **Granite-4.0-h-micro** (IBM) - Ultra-fast inference
- **Granite-4.0-h-tiny** (IBM) - Balanced speed/quality
- All standard llama.cpp models (Llama, Gemma, Qwen, etc.)

## Performance

- **Build time savings:** ~4 minutes → 3 seconds (98% faster)
- **Memory footprint:** Fits in 16GB RAM with context up to 8192 tokens
- **CPU acceleration:** OpenBLAS optimized for x86_64

## Limitations

- **CPU-only:** No GPU/CUDA support (optimized for HF Spaces free tier)
- **Platform:** Linux x86_64 only
- **Python:** 3.10 only (matches HF Spaces default)

## License

These wheels include code from:
- [llama-cpp-python](https://github.com/JamePeng/llama-cpp-python) (MIT license)
- [llama.cpp](https://github.com/ggerganov/llama.cpp) (MIT license)

See upstream repositories for full license information.

## Maintenance

Built from: https://github.com/JamePeng/llama-cpp-python/tree/5a0391e8

To rebuild: See `build_wheel.sh` in the main project repository.

## Related

- Main project: [gemma-book-summarizer](https://huggingface.co/spaces/Luigi/gemma-book-summarizer)
- JamePeng fork: https://github.com/JamePeng/llama-cpp-python
- Original project: https://github.com/abetlen/llama-cpp-python