gpu-space-cpu / SKILL.md
---
name: gpu-space-cpu
description: >-
  Convert GPU HuggingFace Spaces to CPU-only for free tier deployment. Use when
  user says "make this run on CPU", "convert to CPU Space", "remove GPU
  dependencies", "deploy without GPU", "free HuggingFace Space", or "port CUDA
  app to CPU". Also use when analyzing repos with bitsandbytes, flash-attn, or
  CUDA dependencies for CPU compatibility.
---

GPU → CPU Space Demo Conversion

  • Free-tier CPU Space = 2 CPU cores (vCPUs) + 16 GB RAM + 50 GB non-persistent disk
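Since the free tier exposes only 2 vCPUs, it helps to pin thread counts to match at the top of app.py (a minimal sketch; setting `OMP_NUM_THREADS` before importing numeric libraries is the safe order, since OpenMP reads it at import time):

```python
import os

# Must be set before torch/onnxruntime are imported to affect OpenMP.
os.environ.setdefault("OMP_NUM_THREADS", "2")

import torch

# Match intra-op parallelism to the 2 vCPUs of the free tier.
torch.set_num_threads(2)
```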

Workflow

  1. Grep for GPU deps: `@spaces.GPU|bitsandbytes|flash-attn|triton|xformers|auto-gptq|exllama|apex|\.cuda\(|device.*cuda`
  2. Remove GPU packages and rewrite code: cuda → cpu, float16 → float32 (FP16 is slow on CPU), remove `.half()`, `.cuda()`, `device_map="auto"`
  3. Create: app.py (all logic in one file; prefer recent built-in Gradio v6+ components), README.md, requirements.txt
  4. Test locally: `pip install -r requirements.txt && python app.py`; also test the API by enabling MCP (https://www.gradio.app/guides/building-mcp-server-with-gradio)
  5. Deploy only after the local test passes
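Step 1 can be sketched as a small scan over the repo's source (a minimal sketch; `GPU_PATTERN` and `scan_source` are illustrative names, not part of any library):

```python
import re

# Regex covering the GPU dependencies and call sites listed in step 1.
GPU_PATTERN = re.compile(
    r"@spaces\.GPU|bitsandbytes|flash-attn|triton|xformers|"
    r"auto-gptq|exllama|apex|\.cuda\(|device.*cuda"
)

def scan_source(text: str) -> list[str]:
    """Return the lines that reference GPU-only dependencies or APIs."""
    return [line for line in text.splitlines() if GPU_PATTERN.search(line)]

sample = 'model = model.cuda()\nimport gradio as gr\ndevice = "cuda:0"\n'
hits = scan_source(sample)
```

Running the pattern over every `.py` and `requirements.txt` in the repo gives the full list of lines to rewrite in step 2.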

requirements.txt

```text
--extra-index-url https://download.pytorch.org/whl/cpu
torch
```

Never pin transitive deps. No packages.txt (ffmpeg/git/cmake pre-installed).

README.md

```yaml
---
title: Name
emoji: X
sdk: gradio
sdk_version: 6.3.0
app_file: app.py
---
```

Quantization (model too large)

  1. ONNX Runtime INT8 via Optimum, with OMP_NUM_THREADS=2 (preferred: onnxruntime is fast on CPU)
  2. TorchAO
  3. `torch.quantization.quantize_dynamic`

  Note: ONNX FP32 is faster than FP16 on CPU.
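Option 3 is the simplest fallback since it needs no export step or calibration data. A minimal sketch on a toy model (the `nn.Sequential` stands in for the real model):

```python
import torch
import torch.nn as nn

# Toy stand-in for a large model; only nn.Linear layers get quantized.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8))
model.eval()

# Dynamic INT8 quantization: weights are stored as int8, activations are
# quantized on the fly at inference time. CPU-only, no calibration needed.
qmodel = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

with torch.inference_mode():
    out = qmodel(torch.randn(1, 64))
```

Weight memory drops roughly 4x versus FP32; activations stay in floating point, so quality loss is usually small for linear-heavy models.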

Stop

  • Model >12GB even with INT8 → needs GPU
  • Unacceptable INT8 quality loss → needs GPU
  • 3 failed approaches → tell the user honestly
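The first stop condition can be checked up front from the parameter count, before downloading anything (a minimal sketch; `fits_free_tier` is an illustrative helper, and the 1-byte-per-parameter estimate ignores embedding/layer-norm overhead):

```python
def int8_size_gb(n_params: int) -> float:
    """Approximate INT8 model size: ~1 byte per parameter."""
    return n_params / 1e9

def fits_free_tier(n_params: int, limit_gb: float = 12.0) -> bool:
    """Stop rule from above: INT8 weights over ~12 GB need a GPU, since
    the 16 GB free-tier RAM must also hold activations and the runtime."""
    return int8_size_gb(n_params) <= limit_gb

# e.g. a 7B-parameter model is ~7 GB in INT8 (fits); 13B is ~13 GB (stop).
```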