# Local Model Setup Guide for HuggingClaw
This guide explains how to run small language models (≤1B parameters) locally on HuggingFace Spaces using Ollama.
## Why Local Models?
- **Free**: No API costs - runs on HF Spaces free tier
- **Private**: All inference happens inside your container
- **Fast**: 0.6B models achieve 20-50 tokens/second on CPU
- **Always Available**: No rate limits or downtime
## Supported Models
| Model | Size | Speed (CPU) | RAM | Recommended |
|-------|------|-------------|-----|-------------|
| NeuralNexusLab/HacKing | 0.6B | 20-50 t/s | 500MB | ✅ Best |
| TinyLlama-1.1B | 1.1B | 10-20 t/s | 1GB | ✅ Good |
| Qwen-1.5B | 1.5B | 8-15 t/s | 1.5GB | ⚠️ OK |
| Phi-2 | 2.7B | 3-8 t/s | 2GB | ⚠️ Slower |
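Any of the models above can also be pulled manually from a shell inside the container, which is handy for comparing them. A minimal sketch (the library tag below is an assumption; check the Ollama library for current names):

```shell
# Pull an alternative small model from the Ollama library.
# The tag is an assumption -- verify it in the Ollama library first.
MODEL_TAG="tinyllama:1.1b"
ollama pull "$MODEL_TAG" || echo "pull failed -- is the Ollama server running?"
```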
## Quick Start
### Step 1: Set Environment Variables
In your HuggingFace Space **Settings → Repository secrets**, add:
```bash
LOCAL_MODEL_ENABLED=true
LOCAL_MODEL_NAME=neuralnexuslab/hacking
LOCAL_MODEL_ID=neuralnexuslab/hacking
LOCAL_MODEL_NAME_DISPLAY=NeuralNexus HacKing 0.6B
```
### Step 2: Deploy
Push your changes or redeploy the Space. On startup:
1. Ollama server starts on port 11434
2. The model is pulled from the Ollama library (~30 seconds)
3. OpenClaw configures the local provider
4. Model appears in Control UI
### Step 3: Use
1. Open your Space URL
2. Enter gateway token (default: `huggingclaw`)
3. Select "NeuralNexus HacKing 0.6B" from model dropdown
4. Start chatting!
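If the model does not show up in the dropdown, you can verify the server directly from a shell inside the container. A quick sketch, assuming Ollama is listening on its default port:

```shell
# List models currently registered with the local Ollama server
OLLAMA_URL="http://localhost:11434"
curl -s "$OLLAMA_URL/api/tags" || echo "Ollama not reachable at $OLLAMA_URL"
```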
## Advanced Configuration
### Custom Model from HuggingFace
For models not in the Ollama library, pull them directly from HuggingFace:
```bash
# Set in HF Spaces secrets
LOCAL_MODEL_NAME=hf.co/NeuralNexusLab/HacKing
LOCAL_MODEL_ID=neuralnexuslab/hacking
```
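With this configuration the model is pulled straight from HuggingFace; the repository must host a GGUF file. The same pull can be exercised manually, a sketch:

```shell
# Pull a GGUF model directly from HuggingFace via Ollama's hf.co prefix
HF_MODEL="hf.co/NeuralNexusLab/HacKing"
ollama pull "$HF_MODEL" || echo "pull failed -- check that the repo hosts a GGUF file"
ollama list || true   # confirm the model is registered locally
```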
### Using Custom Modelfile
1. Create `Modelfile` (see `scripts/Modelfile.HacKing`)
2. Add to your project
3. In `entrypoint.sh`, add after Ollama start:
```bash
if [ -f /home/node/scripts/Modelfile.HacKing ]; then
  ollama create neuralnexuslab/hacking -f /home/node/scripts/Modelfile.HacKing
fi
```
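A minimal `Modelfile` might look like the following sketch (the `FROM` source and parameter values are illustrative assumptions; adapt them to your model):

```
# Modelfile sketch -- base model and parameters are illustrative
FROM hf.co/NeuralNexusLab/HacKing
PARAMETER num_ctx 2048
PARAMETER temperature 0.7
SYSTEM "You are a helpful assistant running on HuggingFace Spaces."
```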
### Performance Tuning
```bash
# Number of parallel requests
OLLAMA_NUM_PARALLEL=2
# Keep model loaded (-1 = forever)
OLLAMA_KEEP_ALIVE=-1
# Context window size
# Set in Modelfile: PARAMETER num_ctx 2048
```
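The first two tunables are environment variables read by the Ollama server, so they must be exported before `ollama serve` starts. A sketch of the relevant lines in `entrypoint.sh` (values are illustrative):

```shell
# Export server tunables before starting Ollama (values are illustrative)
export OLLAMA_NUM_PARALLEL=1   # serialize requests on a small CPU
export OLLAMA_KEEP_ALIVE=-1    # keep the model resident in RAM
ollama serve &                 # start the Ollama server in the background
```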
## Troubleshooting
### Model Not Appearing
1. Check logs: `docker logs <container>`
2. Look for: `[SYNC] Set local model provider`
3. Verify `LOCAL_MODEL_ENABLED=true`
### Slow Inference
1. Use smaller models (≤1B)
2. Reduce `OLLAMA_NUM_PARALLEL=1`
3. Decrease `num_ctx` in Modelfile
### Out of Memory
1. HF Spaces provides 16GB of RAM, which should be ample for a 0.6B model
2. Check other processes: `docker stats`
3. Use a smaller model or a more aggressive quantization
### Model Pull Fails
1. Check internet connectivity
2. Try alternative: `LOCAL_MODEL_NAME=hf.co/username/model`
3. Use pre-quantized GGUF format
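When a repo ships multiple GGUF quantizations, a tag suffix can select one explicitly. A sketch (`username/model` is a placeholder, and `Q4_K_M` availability depends on the repo):

```shell
# Select a specific GGUF quantization via a tag suffix
QUANT="Q4_K_M"
ollama pull "hf.co/username/model:$QUANT" || echo "pull failed"
```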
## Architecture
```
┌────────────────────────────────────────────┐
│        HuggingFace Spaces Container        │
│                                            │
│  ┌──────────────┐    ┌──────────────────┐  │
│  │    Ollama    │    │     OpenClaw     │  │
│  │    :11434    │───►│  Gateway :7860   │  │
│  │   HacKing    │    │  - WhatsApp      │  │
│  │     0.6B     │    │  - Telegram      │  │
│  └──────────────┘    └──────────────────┘  │
│                                            │
│  /home/node/.ollama/models (persisted)     │
└────────────────────────────────────────────┘
```
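Internally, the gateway reaches Ollama over HTTP. Ollama also exposes an OpenAI-compatible endpoint, so the wiring can be exercised directly; a sketch assuming the default model name from this guide:

```shell
# Send a chat request straight to Ollama's OpenAI-compatible endpoint
MODEL="neuralnexuslab/hacking"
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"$MODEL\", \"messages\": [{\"role\": \"user\", \"content\": \"Hello\"}]}" \
  || echo "Ollama not reachable"
```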
## Cost Comparison
| Setup | Cost/Month | Speed | Privacy |
|-------|-----------|-------|---------|
| Local (HF Free) | $0 | 20-50 t/s | ✅ Full |
| OpenRouter Free | $0 | 10-30 t/s | ⚠️ Shared |
| HF Inference Endpoint | ~$400 | 50-100 t/s | ✅ Full |
| Self-hosted GPU | ~$50+ | 100+ t/s | ✅ Full |
## Best Practices
1. **Start Small**: Begin with 0.6B models, upgrade if needed
2. **Monitor RAM**: Keep usage under 8GB for stability
3. **Use Quantization**: GGUF Q4_K_M offers the best speed/quality trade-off
4. **Persist Models**: Store in `/home/node/.ollama/models`
5. **Set Defaults**: Use `LOCAL_MODEL_*` for auto-selection
## Example: WhatsApp Bot with Local AI
```bash
# HF Spaces secrets
LOCAL_MODEL_ENABLED=true
LOCAL_MODEL_NAME=neuralnexuslab/hacking
HF_TOKEN=hf_xxxxx
AUTO_CREATE_DATASET=true
# WhatsApp credentials (set in Control UI)
WHATSAPP_PHONE=+1234567890
WHATSAPP_CODE=ABC123
```
Result: Free, always-on WhatsApp AI bot!
## Next Steps
1. Test with default 0.6B model
2. Experiment with different models
3. Customize Modelfile for your use case
4. Share your setup with the community!
## Support
- Issues: https://github.com/openclaw/openclaw/issues
- Ollama Docs: https://ollama.ai/docs
- HF Spaces: https://huggingface.co/docs/hub/spaces