# Local Model Setup Guide for HuggingClaw

This guide explains how to run small language models (≤1B parameters) locally on HuggingFace Spaces using Ollama.
## Why Local Models?

- Free: No API costs; runs on the HF Spaces free tier
- Private: All inference happens inside your container
- Fast: 0.6B models achieve 20-50 tokens/second on CPU
- Always Available: No rate limits or downtime
## Supported Models

| Model | Size | Speed (CPU) | RAM | Recommended |
|---|---|---|---|---|
| NeuralNexusLab/HacKing | 0.6B | 20-50 t/s | 500MB | ✅ Best |
| TinyLlama-1.1B | 1.1B | 10-20 t/s | 1GB | ✅ Good |
| Qwen-1.5B | 1.5B | 8-15 t/s | 1.5GB | ⚠️ OK |
| Phi-2 | 2.7B | 3-8 t/s | 2GB | ⚠️ Slower |
## Quick Start

### Step 1: Set Environment Variables

In your HuggingFace Space Settings → Repository secrets, add:

```
LOCAL_MODEL_ENABLED=true
LOCAL_MODEL_NAME=neuralnexuslab/hacking
LOCAL_MODEL_ID=neuralnexuslab/hacking
LOCAL_MODEL_NAME_DISPLAY=NeuralNexus HacKing 0.6B
```
### Step 2: Deploy

Push your changes or redeploy the Space. On startup:

- The Ollama server starts on port 11434
- The model is pulled from the Ollama library (~30 seconds)
- OpenClaw configures the local provider
- The model appears in the Control UI
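The startup sequence above can be sketched as shell commands. This is a rough sketch only, not the actual `entrypoint.sh`; the polling loop and the `/api/version` endpoint are standard Ollama, but the exact ordering in your entrypoint may differ:

```shell
# Sketch of the startup sequence (illustrative; see your entrypoint.sh)
ollama serve &                          # start the Ollama server on :11434

# wait until the API answers before pulling the model
until curl -sf http://localhost:11434/api/version >/dev/null; do
  sleep 1
done

ollama pull "$LOCAL_MODEL_NAME"         # fetch the model (~30s for 0.6B)
```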
### Step 3: Use

- Open your Space URL
- Enter the gateway token (default: `huggingclaw`)
- Select "NeuralNexus HacKing 0.6B" from the model dropdown
- Start chatting!
## Advanced Configuration

### Custom Model from HuggingFace

For models not in the Ollama library:

```
# Set in HF Spaces secrets
LOCAL_MODEL_NAME=hf.co/NeuralNexusLab/HacKing
LOCAL_MODEL_ID=neuralnexuslab/hacking
```
### Using a Custom Modelfile

- Create a `Modelfile` (see `scripts/Modelfile.HacKing`)
- Add it to your project
- In `entrypoint.sh`, add after Ollama starts:

```
if [ -f /home/node/scripts/Modelfile.HacKing ]; then
  ollama create neuralnexuslab/hacking -f /home/node/scripts/Modelfile.HacKing
fi
```
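For reference, a minimal Modelfile might look like the sketch below. `scripts/Modelfile.HacKing` is the authoritative version; the `FROM` source and parameter values here are assumptions, using standard Ollama Modelfile directives:

```
# Hypothetical Modelfile sketch — adjust FROM to your actual model source
FROM hf.co/NeuralNexusLab/HacKing
PARAMETER num_ctx 2048
PARAMETER temperature 0.7
```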
## Performance Tuning

```
# Number of parallel requests
OLLAMA_NUM_PARALLEL=2

# Keep the model loaded (-1 = forever)
OLLAMA_KEEP_ALIVE=-1

# Context window size is set in the Modelfile:
# PARAMETER num_ctx 2048
```
## Troubleshooting

### Model Not Appearing

- Check logs: `docker logs <container>`
- Look for: `[SYNC] Set local model provider`
- Verify `LOCAL_MODEL_ENABLED=true`
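To confirm Ollama actually registered the model, you can query its API from inside the container. These are standard Ollama endpoints, assuming the default port from this guide:

```shell
# List the models Ollama currently knows about
curl -s http://localhost:11434/api/tags

# Confirm the flag is actually set inside the container
printenv LOCAL_MODEL_ENABLED
```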
### Slow Inference

- Use smaller models (≤1B)
- Set `OLLAMA_NUM_PARALLEL=1`
- Decrease `num_ctx` in the Modelfile
### Out of Memory

- HF Spaces provides 16GB RAM, which should be plenty for a 0.6B model
- Check other processes: `docker stats`
- Reduce the model size or use stronger quantization
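As a quick sanity check from inside the container itself, the kernel's `/proc/meminfo` reports how much RAM is actually free:

```shell
# Print available memory in GB (MemAvailable is reported in kB)
awk '/MemAvailable/ {printf "Available RAM: %.1f GB\n", $2/1048576}' /proc/meminfo
```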
### Model Pull Fails

- Check internet connectivity
- Try the alternative form: `LOCAL_MODEL_NAME=hf.co/username/model`
- Use a pre-quantized GGUF format
## Architecture

```
┌──────────────────────────────────────────────┐
│          HuggingFace Spaces Container        │
│                                              │
│  ┌──────────────┐    ┌────────────────────┐  │
│  │   Ollama     │    │     OpenClaw       │  │
│  │   :11434     │───►│   Gateway :7860    │  │
│  │   HacKing    │    │   - WhatsApp       │  │
│  │   0.6B       │    │   - Telegram       │  │
│  └──────────────┘    └────────────────────┘  │
│                                              │
│  /home/node/.ollama/models (persisted)       │
└──────────────────────────────────────────────┘
```
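Given this layout, you can also bypass the gateway and talk to the Ollama backend directly. The `/api/generate` endpoint is standard Ollama; the model name assumes the default setup from this guide:

```shell
# Direct request to the Ollama backend (requires the server to be running)
curl -s http://localhost:11434/api/generate -d '{
  "model": "neuralnexuslab/hacking",
  "prompt": "Say hello in one sentence.",
  "stream": false
}'
```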
## Cost Comparison

| Setup | Cost/Month | Speed | Privacy |
|---|---|---|---|
| Local (HF Free) | $0 | 20-50 t/s | ✅ Full |
| OpenRouter Free | $0 | 10-30 t/s | ⚠️ Shared |
| HF Inference Endpoint | ~$400 | 50-100 t/s | ✅ Full |
| Self-hosted GPU | ~$50+ | 100+ t/s | ✅ Full |
## Best Practices

- Start Small: Begin with 0.6B models; upgrade if needed
- Monitor RAM: Keep usage under 8GB for stability
- Use Quantization: GGUF Q4_K_M offers the best speed/quality trade-off
- Persist Models: Store them in `/home/node/.ollama/models`
- Set Defaults: Use the `LOCAL_MODEL_*` variables for auto-selection
## Example: WhatsApp Bot with Local AI

```
# HF Spaces secrets
LOCAL_MODEL_ENABLED=true
LOCAL_MODEL_NAME=neuralnexuslab/hacking
HF_TOKEN=hf_xxxxx
AUTO_CREATE_DATASET=true

# WhatsApp credentials (set in Control UI)
WHATSAPP_PHONE=+1234567890
WHATSAPP_CODE=ABC123
```

Result: a free, always-on WhatsApp AI bot!
## Next Steps

- Test with the default 0.6B model
- Experiment with different models
- Customize the Modelfile for your use case
- Share your setup with the community!
## Support

- Issues: https://github.com/openclaw/openclaw/issues
- Ollama Docs: https://ollama.ai/docs
- HF Spaces: https://huggingface.co/docs/hub/spaces