# llama-cpp-python Prebuilt Wheels for HuggingFace Spaces (Free CPU)
Prebuilt `llama-cpp-python` wheels optimized for the HuggingFace Spaces free tier (16GB RAM, 2 vCPU, CPU-only).
## Purpose
These wheels include the latest llama.cpp backend with support for newer model architectures:
- **LFM2 MoE** architecture (32 experts) for LFM2-8B-A1B
- Latest IQ4_XS quantization support
- OpenBLAS CPU acceleration
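After installing one of these wheels, a quick version check confirms the newer build (rather than an older PyPI release) is the one that is active; a minimal sketch:

```python
# Confirm which llama-cpp-python build is installed (these wheels report 0.3.22)
import llama_cpp

print(llama_cpp.__version__)
```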
## Available Wheels
| Wheel File | Python | Platform | llama.cpp | Features |
|------------|--------|----------|-----------|----------|
| `llama_cpp_python-0.3.22-cp310-cp310-linux_x86_64.whl` | 3.10 | Linux x86_64 | Latest (Jan 2026) | LFM2 MoE, IQ4_XS, OpenBLAS |
## Usage
### Setting Up HuggingFace Spaces with Python 3.10
These wheels are built for **Python 3.10**. To use them in HuggingFace Spaces:
**Step 1: Switch to Docker**
1. Go to your Space settings
2. Change "Space SDK" from **Gradio** to **Docker**
3. This enables custom Dockerfile support
**Step 2: Create a Dockerfile with Python 3.10**
Your Dockerfile should start with `python:3.10-slim` as the base image:
```dockerfile
# Use Python 3.10 explicitly (required for these wheels)
FROM python:3.10-slim
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
    gcc g++ make cmake git libopenblas-dev \
    && rm -rf /var/lib/apt/lists/*
# Install llama-cpp-python from prebuilt wheel
RUN pip install --no-cache-dir \
    https://huggingface.co/Luigi/llama-cpp-python-wheels-hf-spaces-free-cpu/resolve/main/llama_cpp_python-0.3.22-cp310-cp310-linux_x86_64.whl
# Install other dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY . .
# Set environment variables
ENV PYTHONUNBUFFERED=1
ENV GRADIO_SERVER_NAME=0.0.0.0
# Expose Gradio port
EXPOSE 7860
# Run the app
CMD ["python", "app.py"]
```
**Complete Example:** The Dockerfile above is a production-ready template; the sections below isolate the key install steps.
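For reference, a minimal `app.py` compatible with this Dockerfile might look like the sketch below; the model filename, context size, and Gradio chat UI are illustrative assumptions rather than part of these wheels:

```python
# app.py - minimal sketch of a Space entry point (model path is a placeholder)
import gradio as gr
from llama_cpp import Llama

# Hypothetical local GGUF; 2 threads matches the free tier's 2 vCPUs
llm = Llama(model_path="model.gguf", n_ctx=4096, n_threads=2)

def respond(message, history):
    result = llm.create_chat_completion(
        messages=[{"role": "user", "content": message}]
    )
    return result["choices"][0]["message"]["content"]

# 0.0.0.0:7860 matches the ENV and EXPOSE lines in the Dockerfile above
gr.ChatInterface(respond).launch(server_name="0.0.0.0", server_port=7860)
```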
### Why Docker SDK?
When you use a custom Dockerfile:
- ✅ Explicit Python version control (`FROM python:3.10-slim`)
- ✅ Full control over system dependencies
- ✅ Can use prebuilt wheels for faster builds
- ✅ No need for `runtime.txt` (Dockerfile takes precedence)
### Dockerfile (Recommended)
```dockerfile
FROM python:3.10-slim
# Install system dependencies for OpenBLAS
RUN apt-get update && apt-get install -y \
    gcc g++ make cmake git libopenblas-dev \
    && rm -rf /var/lib/apt/lists/*
# Install llama-cpp-python from prebuilt wheel (fast)
RUN pip install --no-cache-dir \
    https://huggingface.co/Luigi/llama-cpp-python-wheels-hf-spaces-free-cpu/resolve/main/llama_cpp_python-0.3.22-cp310-cp310-linux_x86_64.whl
```
### With Fallback to Source Build
```dockerfile
# Try the prebuilt wheel first; fall back to a source build if it is unavailable
RUN if pip install --no-cache-dir https://huggingface.co/Luigi/llama-cpp-python-wheels-hf-spaces-free-cpu/resolve/main/llama_cpp_python-0.3.22-cp310-cp310-linux_x86_64.whl; then \
        echo "✅ Using prebuilt wheel"; \
    else \
        echo "⚠️ Building from source (~4 minutes)"; \
        CMAKE_ARGS="-DGGML_OPENBLAS=ON -DGGML_NATIVE=OFF" FORCE_CMAKE=1 \
        pip install --no-cache-dir git+https://github.com/JamePeng/llama-cpp-python.git@5a0391e8; \
    fi
```
## Why This Fork?
These wheels are built from the **JamePeng/llama-cpp-python** fork (v0.3.22) instead of the official abetlen/llama-cpp-python:
| Repository | Latest Version | llama.cpp | LFM2 MoE Support |
|------------|---------------|-----------|-----------------|
| JamePeng fork | v0.3.22 (Jan 2026) | Latest | ✅ Yes |
| Official (abetlen) | v0.3.16 (Aug 2025) | Outdated | ❌ No |
**Key Difference:** LFM2-8B-A1B requires a llama.cpp backend with LFM2 MoE architecture support (added Oct 2025). The official llama-cpp-python hasn't been updated since August 2025 and therefore lacks it.
## Build Configuration
```bash
# OpenBLAS on; native CPU tuning off so the wheel is portable across x86_64 hosts
CMAKE_ARGS="-DGGML_OPENBLAS=ON -DGGML_NATIVE=OFF" \
FORCE_CMAKE=1 \
pip wheel --no-deps git+https://github.com/JamePeng/llama-cpp-python.git@5a0391e8
```
## Supported Models
These wheels enable the following IQ4_XS-quantized models (a loading sketch follows the list):
- **LFM2-8B-A1B** (LiquidAI) - 8.3B params, 1.5B active, MoE with 32 experts
- **Granite-4.0-h-micro** (IBM) - Ultra-fast inference
- **Granite-4.0-h-tiny** (IBM) - Balanced speed/quality
- All standard llama.cpp models (Llama, Gemma, Qwen, etc.)
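As an illustration, loading one of these models directly from the Hub might look like the sketch below; `Llama.from_pretrained` requires `huggingface_hub` to be installed, and the repo id and filename pattern are assumptions to verify against the actual model card:

```python
# Sketch: fetch and load an IQ4_XS GGUF from the Hub (repo id/filename assumed)
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="LiquidAI/LFM2-8B-A1B-GGUF",  # hypothetical GGUF repo id
    filename="*IQ4_XS*",                  # fnmatch pattern for the IQ4_XS file
    n_ctx=8192,    # upper bound that still fits 16GB RAM (see Performance)
    n_threads=2,   # match the free tier's 2 vCPUs
)
out = llm.create_chat_completion(messages=[{"role": "user", "content": "Hello!"}])
print(out["choices"][0]["message"]["content"])
```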
## Performance
- **Build time:** ~4 minutes (source build) → ~3 seconds (prebuilt wheel), roughly 98% faster
- **Memory footprint:** Fits in 16GB RAM with context up to 8192 tokens
- **CPU acceleration:** OpenBLAS optimized for x86_64
## Limitations
- **CPU-only:** No GPU/CUDA support (optimized for HF Spaces free tier)
- **Platform:** Linux x86_64 only
- **Python:** 3.10 only (matches HF Spaces default)
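Before installing, a quick standard-library check along these lines can confirm a host matches the wheel's tags (a minimal sketch):

```python
# Sketch: verify the host matches the cp310 / linux_x86_64 wheel tags
import platform
import sys

assert sys.version_info[:2] == (3, 10), "these wheels require Python 3.10"
assert platform.system() == "Linux", "Linux only"
assert platform.machine() == "x86_64", "x86_64 only"
print("Host is compatible with the prebuilt wheel")
```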
## License
These wheels include code from:
- [llama-cpp-python](https://github.com/JamePeng/llama-cpp-python) (MIT license)
- [llama.cpp](https://github.com/ggerganov/llama.cpp) (MIT license)
See upstream repositories for full license information.
## Maintenance
Built from: https://github.com/JamePeng/llama-cpp-python/tree/5a0391e8
To rebuild: See `build_wheel.sh` in the main project repository.
## Related
- Main project: [gemma-book-summarizer](https://huggingface.co/spaces/Luigi/gemma-book-summarizer)
- JamePeng fork: https://github.com/JamePeng/llama-cpp-python
- Original project: https://github.com/abetlen/llama-cpp-python