	---
	title: fe
	emoji: 🦞
	colorFrom: blue
	colorTo: indigo
	sdk: docker
	sdk_version: 29.0.4
	python_version: 3.14.4
	app_port: 7860
	app_file: main.py
	pinned: false
	---

	# OpenClaw on Hugging Face Space (Docker)

	> Languages: [English](./README.md) · [简体中文](./README_zh.md)
	> Deployment Guide: [DEPLOY_GUIDE.md](./DEPLOY_GUIDE.md) | [中文部署指南](./DEPLOY_GUIDE_zh.md)

	This setup is designed to provide the following:

	- Build the OpenClaw container on top of `ubuntu:24.04`
	- Serve the OpenClaw dashboard directly on port `7860` (default Space access port)
	- Use third-party OpenAI-compatible `base_url + api_key` by default (injected via environment variables)
	- Store OpenClaw config/workspace under `/root/.openclaw`
	- Restore state automatically from a Hugging Face Dataset on startup
	- Run scheduled backups of OpenClaw data to a Hugging Face Dataset via `cron` (as `root` user)
	- Incremental backup + dynamic strategy + AES-256-CBC encryption + large file splitting
	- Backup watchdog (auto-triggers backup when cron fails)
	- SSH service with auto-healing watchdog + host key generation
	- CCMR (Claude Code Model Router) with support for 10 platform API keys
	- Multi-dataset restore (restore from a different dataset)
	- Preinstall `python3`, `uv`, `vim`, `neovim`, `chromium` (via Chrome for Testing archive), `gh`, `hf`, `opencode`, `codex`, `claude` (Claude Code CLI), `@larksuite/cli` (with `npx skills add larksuite/cli -y -g`), and `sshx` in the image for interactive terminal use

	## Repository Layout

	- `Dockerfile`: Runtime image for the Space
	- `scripts/openclaw-entrypoint.sh`: Main startup flow (restore, config generation, cron setup, gateway start)
	- `scripts/hf-entrypoint.sh`: HF Spaces container entrypoint (PID 1, manages supervisord + SSH + PM2 + BT Panel)
	- `scripts/supervisord.conf`: Supervisord config, manages cron, backup-watchdog, openclaw-gateway, ccmr-gateway
	- `openclaw_hf/backup.py`: Backup/restore implementation (full/incremental, encryption, split, dynamic strategy, resume)
	- `scripts/openclaw-backup-cron.sh`: Cron entrypoint for backup jobs
	- `scripts/openclaw-backup-watchdog.sh`: Backup watchdog, auto-triggers backup when overdue
	- `scripts/openclaw-backup-health.sh`: Backup health check & auto-repair
	- `scripts/openclaw-restore.sh`: Startup restore entrypoint
	- `scripts/openclaw-gateway-ctl`: Gateway process management (start/stop/restart/reload)
	- `scripts/openclaw-env-sync.sh`: Sync environment variables from HF API
	- `scripts/update-env-from-secrets.sh`: Fetch latest env vars from HF API
	- `scripts/bt_install_panel_custom.sh`: BT Panel installation script
	- `scripts/bootstrap-hf.sh`: Interactive bootstrap for Space/Dataset creation, upload, and Space variables/secrets setup (macOS/Linux)
	- `scripts/bootstrap-hf.ps1`: Interactive bootstrap for Space/Dataset creation, upload, and Space variables/secrets setup (Windows PowerShell)
	- `scripts/rebuild-space.sh`: Force push latest code to Space and trigger rebuild
	- `scripts/delete-backups.sh`: Batch cleanup old backups from Dataset
	- `scripts/delete-hf.py`: HF resource deletion tool (Space/Dataset/files/storage)
	- `scripts/find-largest-backup.py`: Find best backup in Dataset
	- `scripts/ssh_service_watchdog.sh`: SSH service watchdog (process monitor + auto-recovery)
	- `scripts/check_ssh_health.sh`: SSH health check (used by Docker HEALTHCHECK)
	- `scripts/ssh-agent-autostart.sh`: SSH agent auto-start and key loading
	- `scripts/optimize_ssh.sh`: SSH configuration optimization
	- `scripts/save-env.sh`: Save environment to `/etc/profile.d`
	- `scripts/hf-storage.sh` / `scripts/hf-storage.py`: HuggingFace storage utilities
	- `scripts/ccmr-setup.sh`: CCMR configuration generation
	- `scripts/ccmr-wrapper.sh`: CCMR Supervisor wrapper (hot-reload + crash recovery)
	- `scripts/server.js`: PID 1 keep-alive HTTP server
	- `pm2/ecosystem.config.js`: PM2 configuration (optional extension)
	- `tests/test_backup.py`: Unit tests for the backup module
	- `tests/test_entrypoint_config.py`: Unit tests for gateway config generation behavior

	## Required Variables (Space Settings)

	In your Hugging Face Space (`Settings -> Variables and secrets`), configure at least:

	- Variable: `OPENCLAW_BACKUP_DATASET_REPO`: Backup target Dataset in `username/dataset-name` format
	- Secret: `HF_TOKEN`: Used to write backups to the Dataset (must have write permission to that Dataset)
	- Secret: `OPENCLAW_GATEWAY_TOKEN`: Gateway token (recommended; if omitted, the deployment workflow generates a random 32-character value)
	- Secret: `OPENCLAW_GATEWAY_PASSWORD`: Gateway password (optional; if omitted, the deployment workflow generates a random 16-character value)

	When using `./scripts/bootstrap-hf.sh` (macOS/Linux) or `./scripts/bootstrap-hf.ps1` (Windows PowerShell), these values are configured automatically on the target Space.
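
If you set these secrets by hand rather than through the bootstrap script, random values of the lengths mentioned above can be produced with Python's `secrets` module. This is a minimal sketch, not the bootstrap script's own logic: the function name `generate_credential` and the alphanumeric alphabet are assumptions made for illustration.

```python
import secrets
import string

# Assumption: letters and digits only; the bootstrap scripts may use a
# different alphabet or generator.
ALPHABET = string.ascii_letters + string.digits


def generate_credential(length: int) -> str:
    """Return a cryptographically random alphanumeric string."""
    if length <= 0:
        raise ValueError("length must be positive")
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


if __name__ == "__main__":
    # Lengths follow the README: 32 characters for the token, 16 for the password.
    print("OPENCLAW_GATEWAY_TOKEN =", generate_credential(32))
    print("OPENCLAW_GATEWAY_PASSWORD =", generate_credential(16))
```

Paste the printed values into `Settings -> Variables and secrets` as secrets, not as plain variables.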

	## Optional LLM Variables (All-Or-None)

	Set all of these together only when you want OpenClaw to preconfigure a custom third-party model:

	- Variable: `OPENCLAW_LLM_BASE_URL`: Third-party base URL (for example OpenAI-compatible `/v1`)
	- Variable: `OPENCLAW_LLM_MODEL`: Third-party model ID
	- Secret: `OPENCLAW_LLM_API_KEY`: Third-party API key

	If any of the three is missing, the entrypoint skips custom model generation.
	In that case, you can still configure a model from inside the container (for example via `sshx`).
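
The all-or-none rule can be pictured as a small check over the environment. The sketch below is illustrative only and is not taken from `scripts/openclaw-entrypoint.sh`; the helper name `resolve_llm_settings` is hypothetical, and treating an empty value as missing is an assumption.

```python
import os
from typing import Mapping, Optional

LLM_VARS = (
    "OPENCLAW_LLM_BASE_URL",
    "OPENCLAW_LLM_MODEL",
    "OPENCLAW_LLM_API_KEY",
)


def resolve_llm_settings(env: Mapping[str, str]) -> Optional[dict]:
    """Return the three LLM settings, or None when any of them is absent.

    Assumption: an empty or whitespace-only value counts as missing.
    """
    values = {name: env.get(name, "").strip() for name in LLM_VARS}
    if not all(values.values()):
        return None  # all-or-none: skip custom model generation
    return values


if __name__ == "__main__":
    settings = resolve_llm_settings(os.environ)
    print("custom model configured" if settings else "custom model skipped")
```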

	## Common Optional Variables

	| Variable | Default | Description |
	|----------|---------|-------------|
	| `OPENCLAW_VERSION` | `latest` | OpenClaw version for Docker install |
	| `OPENCLAW_GATEWAY_PORT` | `18789` | Gateway listen port |
	| `OPENCLAW_GATEWAY_BIND` | `lan` | Gateway bind mode (`lan`/`local`) |
	| `OPENCLAW_STATE_DIR` | `/root/.openclaw` | OpenClaw state directory |
	| `OPENCLAW_USER` | `root` | Runtime user for gateway and cron |
	| `OPENCLAW_GROUP` | `root` | Runtime group |
	| `OPENCLAW_CONFIG_PATH` | `/root/.openclaw/openclaw.json` | Gateway config path |
	| `OPENCLAW_WORKSPACE_DIR` | `/root/.openclaw/workspace` | Workspace directory |
	| `OPENCLAW_BACKUP_CRON` | `*/10 * * * *` | Backup cron expression |
	| `OPENCLAW_BACKUP_SOURCE_DIR` | `/root/.openclaw` | Backup/restore base directory |
	| `OPENCLAW_BACKUP_ROOT_*_DIR` | Various | Extra backup dirs (config, codex, claude, agents, ssh, env, npm, lark-cli) |
	| `OPENCLAW_BACKUP_PATH_PREFIX` | `backups` | Backup path prefix |
	| `OPENCLAW_BACKUP_KEEP_COUNT` | `24` | Number of backups to keep |
	| `OPENCLAW_BACKUP_ENCRYPTION_ENABLED` | `false` | Enable AES-256-CBC encryption |
	| `OPENCLAW_BACKUP_SPLIT_SIZE` | `500M` | Large file split volume size |
	| `OPENCLAW_INCREMENTAL_BACKUP` | `true` | Enable incremental backup |
	| `OPENCLAW_DYNAMIC_BACKUP` | `true` | Enable dynamic backup strategy |
	| `OPENCLAW_FULL_BACKUP_INTERVAL_HOURS` | `1` | Force full backup interval |
	| `OPENCLAW_MAX_INCREMENTAL_BACKUPS` | `15` | Max incremental backups before full |
	| `OPENCLAW_RESTORE_TIMEOUT` | `5400` | Restore timeout (seconds, 90 min) |
	| `WATCHDOG_INTERVAL` | `600` | Backup watchdog check interval (s) |
	| `MAX_BACKUP_AGE_MINUTES` | `30` | Max backup age (minutes) |
	| `FORCE_BACKUP_INTERVAL` | `14400` | Force backup interval (seconds) |
	| `OPENCLAW_SSHX_AUTO_START` | `false` | Auto-start `sshx` on boot |
	| `OPENCLAW_GATEWAY_AUTH_MODE` | `token` | Auth mode (`token`/`password`) |
	| `ROOT_PASSWORD` | `lauer3912` | SSH root password |
	| `CCMR_ENABLED` | `false` | Enable Claude Code Model Router |
	| `CCMR_PORT` | `8080` | CCMR gateway port |
	## Quick Deployment

	Run the interactive bootstrap script from repo root:

	```bash
	./scripts/bootstrap-hf.sh
	```

	```powershell
	powershell -ExecutionPolicy ByPass -File .\scripts\bootstrap-hf.ps1
	```

	`bootstrap-hf.sh` / `bootstrap-hf.ps1` will:

	- Check/install `hf` CLI:
	  - macOS/Linux: `curl -LsSf https://hf.co/cli/install.sh | bash`
	  - Windows PowerShell: `powershell -ExecutionPolicy ByPass -c "irm https://hf.co/cli/install.ps1 | iex"`
	- Resolve HF auth first (before all other variables):
	  - if `hf auth whoami` is not logged in: prompt `HF_TOKEN` and run `hf auth login --token <HF_TOKEN>`
	  - if already logged in: ask whether to use current user
	    - choose `yes`: continue
	    - choose `no`: backup current token, prompt new `HF_TOKEN`, run `hf auth login --token <HF_TOKEN>`, and restore the previous token at the end
	- Ask for `space_name`, `dataset_name`, `OPENCLAW_VERSION`, gateway token/password, and optional LLM settings
	- Default `OPENCLAW_VERSION` to latest detected from npm registry (`openclaw`), fallback `latest` when detection fails
	- Auto-generate `OPENCLAW_GATEWAY_TOKEN` (32 chars) and `OPENCLAW_GATEWAY_PASSWORD` (16 chars) if left empty
	- Create private Space + Dataset and upload this repository
	- Configure Space `Variables and secrets` automatically, including:
	  - `OPENCLAW_BACKUP_DATASET_REPO`
	  - `OPENCLAW_VERSION`
	  - `HF_TOKEN`
	  - `OPENCLAW_GATEWAY_TOKEN`
	  - `OPENCLAW_GATEWAY_PASSWORD`
	  - `OPENCLAW_GATEWAY_CONTROLUI_ALLOW_INSECURE_AUTH=false`
	  - `OPENCLAW_GATEWAY_CONTROLUI_DANGEROUSLY_DISABLE_DEVICE_AUTH=false`
	- Optionally configure LLM triplet and set `OPENCLAW_SSHX_AUTO_START` from prompt choice (`true`/`false`)
	- Print planned deployment settings and require a final confirmation before creating/updating Space/Dataset resources
	- Print Hugging Face Space page URL, app URL, and `/healthz`

	If gateway token/password were auto-generated, the script prints them at the end.

	## Agent Hand-off Prompt

	Copy and send to your agent:

	```
	Please deploy OpenClaw to Hugging Face by strictly following the deployment skill in https://github.com/tenfyzhong/openclaw-hf/blob/main/SKILL.md
	```

	## Hugging Face Keep-Alive

	How to keep a Space available depends on hardware tier:

	- Free `cpu-basic`: the Space sleeps after inactivity (currently around 48h). It cannot be configured to run forever on free hardware.
	- Paid hardware: the Space runs continuously by default. In `Settings -> Hardware`, set `Sleep time` to `Never` (or use API with `sleep_time=-1`) for true 24/7 availability.
	- Cost-saving mode on paid hardware: set a custom `Sleep time` (for example `3600` seconds) so it auto-sleeps and auto-wakes on the next visit.

	Space URL composition:

	- Space repo ID format: `<owner>/<space_name>` (example: `tenfyzhong/openclaw-hf`)
	- Public runtime host format: `https://<owner>-<space_name>.hf.space`
	- OpenClaw health check URL: `https://<owner>-<space_name>.hf.space/healthz`
	- Inside the Space runtime, Hugging Face also provides `SPACE_HOST`, so health URL can be built as `https://${SPACE_HOST}/healthz`.

	Example:

	```bash
	OPENCLAW_HF_SPACE_ID="tenfyzhong/openclaw-hf"
	SPACE_HOST="${OPENCLAW_HF_SPACE_ID/\//-}.hf.space"
	HEALTH_URL="https://${SPACE_HOST}/healthz"
	echo "$HEALTH_URL"
	```

	Keep-alive by periodic health checks:

	```bash
	*/12 * * * * HF_TOKEN=hf_xxx /path/to/repo/scripts/check-space-health.sh tenfyzhong/openclaw-hf >/dev/null || true
	```

	Notes:

	- For private Spaces, unauthenticated calls to `https://<owner>-<space_name>.hf.space/healthz` return a Hub 404 page. This is expected access control behavior.
	- For private Spaces, include `Authorization: Bearer <HF_TOKEN>` (the helper script above does this automatically via `HF_TOKEN` or `HUGGINGFACE_HUB_TOKEN`).
	- This ping strategy is a practical workaround for reducing idle sleep on free hardware, but it is not a guaranteed always-on method.
	- If you need strict 24/7 uptime, use paid hardware and set sleep time to `Never`.
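
For readers who prefer Python over the shell helper, the URL composition and the bearer header described above can be sketched with the standard library. This is an illustrative sketch, not the contents of `check-space-health.sh`; the function names `build_health_request` and `is_healthy` are hypothetical, and the host is derived by replacing `/` with `-`, as in the shell example.

```python
import urllib.request
from typing import Optional


def build_health_request(space_id: str, token: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request for https://<owner>-<space_name>.hf.space/healthz."""
    owner, _, name = space_id.partition("/")
    if not owner or not name:
        raise ValueError("space_id must look like '<owner>/<space_name>'")
    request = urllib.request.Request(f"https://{owner}-{name}.hf.space/healthz", method="GET")
    if token:
        # Private Spaces need the token; public Spaces work without it.
        request.add_header("Authorization", f"Bearer {token}")
    return request


def is_healthy(space_id: str, token: Optional[str] = None, timeout: float = 10.0) -> bool:
    """Return True when the health endpoint answers with a 2xx status."""
    try:
        with urllib.request.urlopen(build_health_request(space_id, token), timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # Covers URLError, HTTPError (for example the 404 on private Spaces) and timeouts.
        return False
```

Calling `is_healthy("tenfyzhong/openclaw-hf", token)` from a scheduled job has the same effect as the cron line above: the request counts as activity, and the return value tells you whether the gateway answered.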

	References:

	- <https://huggingface.co/docs/hub/spaces-gpus#sleep-time>
	- <https://huggingface.co/docs/huggingface_hub/package_reference/space_runtime>
	- <https://huggingface.co/docs/hub/spaces-overview>

	Programmatic options (owner token required):

	```python
	from huggingface_hub import HfApi

	api = HfApi(token="hf_xxx")
	repo_id = "your-username/your-space"

	# Keep running (paid hardware)
	api.set_space_sleep_time(repo_id=repo_id, sleep_time=-1)

	# Or sleep after 1 hour of inactivity
	api.set_space_sleep_time(repo_id=repo_id, sleep_time=3600)

	# Manual control
	api.pause_space(repo_id=repo_id)
	api.restart_space(repo_id=repo_id)
	```

	For this project, if you need stable dashboard access without cold starts, use paid hardware and set sleep time to `Never`.

	## SSH Service

	The container protects the SSH service with several layers so that it stays available:

	- Auto-start: Entrypoint generates host keys, cleans stale PID files, starts sshd
	- SSH Watchdog (`ssh_service_watchdog.sh`): Monitors sshd every 30s, auto-recovers on failure
	- Multi-level repair: Config corruption → backup config → minimal config → auto-reinstall openssh-server
	- Exponential backoff: Gradually increases wait time on consecutive failures
	- Health check (`check_ssh_health.sh`): Used by Docker HEALTHCHECK
	- SSH Agent auto-load: Auto-starts ssh-agent and loads keys from `/root/.ssh/`
	- Root password: Set via `ROOT_PASSWORD` environment variable
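
The watchdog behaviour described above (a fixed check interval plus growing waits after repeated failures) can be sketched as follows. This is a simplified model, not a translation of `ssh_service_watchdog.sh`: the doubling factor, the 600-second cap, and the function names are assumptions, and the health check and recovery action are passed in as plain callables.

```python
from typing import Callable


def backoff_delay(consecutive_failures: int, base: float = 30.0, cap: float = 600.0) -> float:
    """Wait time in seconds: base interval, doubled per failure, capped."""
    if consecutive_failures < 0:
        raise ValueError("consecutive_failures must be non-negative")
    # Clamp the exponent so very long failure streaks cannot overflow.
    return min(cap, base * (2 ** min(consecutive_failures, 32)))


def run_watchdog(
    is_alive: Callable[[], bool],
    recover: Callable[[], None],
    sleep: Callable[[float], None],
    max_checks: int,
) -> int:
    """Run max_checks iterations and return the final failure streak."""
    failures = 0
    for _ in range(max_checks):
        if is_alive():
            failures = 0  # healthy again: reset the streak
        else:
            recover()
            failures += 1
        sleep(backoff_delay(failures))
    return failures
```

In a real deployment `is_alive` might probe the sshd process and `sleep` would be `time.sleep`; injecting them keeps the loop easy to test.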

	## CCMR (Claude Code Model Router)

	CCMR gateway is integrated and managed by Supervisord with hot-reload support:

	- Auto-config: Set `CCMR_*_API_KEY` env vars to enable
	- 10 API Key slots: DeepSeek, Qwen, Kimi, GLM, MiniMax (CN/Global), MiMo (SGP/CN/AMS/PAYG)
	- File hot-reload: Edit `/root/.env.d/ccmr.env` and changes apply immediately without restart
	- Crash recovery: Supervisord auto-restarts CCMR process
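
To illustrate how keys in `/root/.env.d/ccmr.env` might map to enabled slots, the sketch below parses `KEY=VALUE` text and lists the `CCMR_*_API_KEY` entries that have a value. The file format, the helper names, and the concrete variable names used in the usage example (such as `CCMR_DEEPSEEK_API_KEY`) are assumptions; check `scripts/ccmr-setup.sh` for the names the project actually reads.

```python
import re

_KEY_PATTERN = re.compile(r"^CCMR_([A-Z0-9_]+)_API_KEY$")


def parse_env_text(text: str) -> dict:
    """Parse simple KEY=VALUE lines, ignoring blanks and '#' comments."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip("'\"")
    return env


def configured_slots(env: dict) -> list:
    """Return the sorted slot names whose API key is non-empty."""
    slots = []
    for key, value in env.items():
        match = _KEY_PATTERN.match(key)
        if match and value:
            slots.append(match.group(1))
    return sorted(slots)


if __name__ == "__main__":
    sample = "CCMR_DEEPSEEK_API_KEY=sk-demo\nCCMR_PORT=8080\n"
    print(configured_slots(parse_env_text(sample)))
```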

	## Backup/Restore Flow

	### Restore

	Automatic restore on startup (always runs on container restart/rebuild):

	- `openclaw-state` -> `OPENCLAW_BACKUP_SOURCE_DIR` (default `/root/.openclaw`)
	- `root-config` -> `OPENCLAW_BACKUP_ROOT_CONFIG_DIR` (default `/root/.config`)
	- `root-codex` -> `OPENCLAW_BACKUP_ROOT_CODEX_DIR` (default `/root/.codex`)
	- `root-claude` -> `OPENCLAW_BACKUP_ROOT_CLAUDE_DIR` (default `/root/.claude`)
	- `root-agents` -> `OPENCLAW_BACKUP_ROOT_AGENTS_DIR` (default `/root/.agents`)
	- `root-ssh` -> `OPENCLAW_BACKUP_ROOT_SSH_DIR` (default `/root/.ssh`)
	- `root-env` -> `OPENCLAW_BACKUP_ROOT_ENV_DIR` (default `/root/.env.d`)
	- `root-npm` -> `OPENCLAW_BACKUP_ROOT_NPM_DIR` (default `/root/.npm`)
	- `root-lark-cli` -> `OPENCLAW_BACKUP_ROOT_LARK_CLI_DIR` (default `/root/.lark-cli`)

	Multi-dataset restore: set `OPENCLAW_RESTORE_DATASET_REPO` to restore from a different dataset.
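
The component-to-directory mapping above is easy to express as a lookup table. The sketch below resolves each restore target from the environment and picks the dataset to restore from. It mirrors the defaults listed in this README but is not the code in `openclaw_hf/backup.py`; the function names are hypothetical, and falling back to `OPENCLAW_BACKUP_DATASET_REPO` when `OPENCLAW_RESTORE_DATASET_REPO` is unset is an assumption.

```python
import os
from typing import Mapping, Optional

# component name -> (environment variable, default directory)
RESTORE_TARGETS = {
    "openclaw-state": ("OPENCLAW_BACKUP_SOURCE_DIR", "/root/.openclaw"),
    "root-config": ("OPENCLAW_BACKUP_ROOT_CONFIG_DIR", "/root/.config"),
    "root-codex": ("OPENCLAW_BACKUP_ROOT_CODEX_DIR", "/root/.codex"),
    "root-claude": ("OPENCLAW_BACKUP_ROOT_CLAUDE_DIR", "/root/.claude"),
    "root-agents": ("OPENCLAW_BACKUP_ROOT_AGENTS_DIR", "/root/.agents"),
    "root-ssh": ("OPENCLAW_BACKUP_ROOT_SSH_DIR", "/root/.ssh"),
    "root-env": ("OPENCLAW_BACKUP_ROOT_ENV_DIR", "/root/.env.d"),
    "root-npm": ("OPENCLAW_BACKUP_ROOT_NPM_DIR", "/root/.npm"),
    "root-lark-cli": ("OPENCLAW_BACKUP_ROOT_LARK_CLI_DIR", "/root/.lark-cli"),
}


def resolve_restore_dirs(env: Mapping[str, str]) -> dict:
    """Map each backup component to its target directory."""
    return {
        component: env.get(var) or default
        for component, (var, default) in RESTORE_TARGETS.items()
    }


def restore_dataset(env: Mapping[str, str]) -> Optional[str]:
    """Dataset to restore from (assumed fallback: the backup dataset)."""
    return env.get("OPENCLAW_RESTORE_DATASET_REPO") or env.get("OPENCLAW_BACKUP_DATASET_REPO")


if __name__ == "__main__":
    for component, path in resolve_restore_dirs(os.environ).items():
        print(f"{component} -> {path}")
```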

	### Backup

	- Scheduled backup: Runs based on `OPENCLAW_BACKUP_CRON` (default every 10 min)
	- Incremental backup (default on): Only backs up changed files after a full backup
	- Dynamic strategy (default on): Auto-adjusts compression and splitting based on file size and change rate
	- AES-256-CBC encryption: Optional, allows secure storage on public datasets
	- Large file splitting: Default 500MB per volume, avoids upload failures
	- Resume support: Creates checkpoint files during upload, allows resume on interruption
	- Shutdown backup: Final backup before container exit on stop signal
	- Retention: Keeps newest `OPENCLAW_BACKUP_KEEP_COUNT` (default 24) archives, auto-deletes older ones
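
Two of the rules above, choosing between a full and an incremental backup and pruning old archives, can be sketched in a few lines. The defaults come from the variables table (`OPENCLAW_FULL_BACKUP_INTERVAL_HOURS=1`, `OPENCLAW_MAX_INCREMENTAL_BACKUPS=15`, `OPENCLAW_BACKUP_KEEP_COUNT=24`), but the decision order, the function names, and the `(name, timestamp)` archive representation are assumptions rather than the behaviour of `backup.py`.

```python
from typing import List, Optional, Tuple


def choose_backup_kind(
    seconds_since_full: Optional[float],
    incrementals_since_full: int,
    full_interval_hours: float = 1.0,
    max_incrementals: int = 15,
) -> str:
    """Return 'full' or 'incremental' for the next backup run."""
    if seconds_since_full is None:
        return "full"  # no full backup exists yet
    if seconds_since_full >= full_interval_hours * 3600:
        return "full"  # the full-backup interval has elapsed
    if incrementals_since_full >= max_incrementals:
        return "full"  # the incremental chain is long enough
    return "incremental"


def select_for_deletion(archives: List[Tuple[str, float]], keep_count: int = 24) -> List[str]:
    """Return archive names beyond the newest keep_count, newest first."""
    if keep_count < 0:
        raise ValueError("keep_count must be non-negative")
    newest_first = sorted(archives, key=lambda item: item[1], reverse=True)
    return [name for name, _ in newest_first[keep_count:]]
```

A production implementation would also need to keep the full backup that surviving incrementals depend on; this sketch prunes purely by age.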

	### Backup Watchdog

	`openclaw-backup-watchdog.sh` acts as the last line of defense:

	- Auto-triggers a backup when the newest backup is older than `MAX_BACKUP_AGE_MINUTES` (default 30 min)
	- Forces a backup every `FORCE_BACKUP_INTERVAL` seconds (default 14400, i.e. 4 hours)
	- File lock prevents concurrent execution
	- Automatic backoff on consecutive failures
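
The watchdog rules above reduce to a time comparison plus an exclusive lock. The sketch below is an assumption-laden model rather than a port of `openclaw-backup-watchdog.sh`: the names `watchdog_reason` and `single_instance` are hypothetical, the "overdue" check is given priority over the forced one by choice, and the lock uses `fcntl.flock`, which is available on Linux and macOS only.

```python
import contextlib
import fcntl
import os
from typing import Iterator, Optional


def watchdog_reason(
    now: float,
    last_backup_at: Optional[float],
    last_forced_at: float,
    max_age_minutes: int = 30,
    force_interval: int = 14400,
) -> Optional[str]:
    """Return why a backup should run now, or None when nothing is due."""
    if last_backup_at is None or now - last_backup_at > max_age_minutes * 60:
        return "overdue"
    if now - last_forced_at >= force_interval:
        return "forced"
    return None


@contextlib.contextmanager
def single_instance(lock_path: str) -> Iterator[bool]:
    """Yield True when this caller holds the lock, False when it is taken."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            yield False  # another watchdog run is in progress
            return
        try:
            yield True
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

A caller would wrap the backup in `with single_instance(path) as acquired:` and skip the run when `acquired` is false.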

	## Use sshx Inside the Container

	`sshx` is preinstalled in the image.

	1. Auto-start `sshx` in background via environment variables:

	```bash
	OPENCLAW_SSHX_AUTO_START=true
	```

	When enabled, entrypoint starts `sshx` in background and sends `sshx` output directly to container stdout/stderr logs (no file logging).

	2. Manual start inside container:

	```bash
	sshx
	```

	3. Let OpenClaw start the `sshx` process itself (run in an OpenClaw terminal/tool):

	```bash
	nohup sshx >/proc/1/fd/1 2>/proc/1/fd/2 &
	```

	4. After use, stop the `sshx` process promptly:

	```bash
	pgrep -fa sshx
	pkill -TERM -f '(^|/)sshx($| )'
	```

	## Local Test

	```bash
	python3 -m unittest discover -s tests -p 'test_*.py'
	```

	Pull Requests to `main` run GitHub Actions CI automatically (`.github/workflows/pr-ci.yml`):
	- Unit tests: `python3 -m unittest discover -s tests -p 'test_*.py'`
	- Docker image build: `docker build` (via Buildx) with `OPENCLAW_VERSION=latest`

	## License

	MIT. See `LICENSE`.