# Hugging Face Deployment
Primary target: Hugging Face `Docker Space` on upgraded GPU hardware.
## What Gets Deployed
- FastAPI backend
- static web frontend
- Hugging Face OAuth routes
- ephemeral SQLite-backed session queue
## Required Space Settings
- SDK: `Docker`
- Port: `7860` (`app_port` in README metadata)
- OAuth: enabled via README metadata (`hf_oauth: true`)
- Hardware: upgraded GPU recommended
## Recommended Runtime Variables
Core:
- `MODEL_NAME=sapientinc/HRM-Text-1B`
- `DEVICE_PREFERENCE=auto`
- `DTYPE_PREFERENCE=auto`
- `ATTN_IMPLEMENTATION=eager`
- `LOW_CPU_MEM_USAGE=true`
- `TRUST_REMOTE_CODE=true`
- `PRELOAD_MODEL=true`
Traffic limits:
- `MAX_TRACE_TOKENS=256`
- `MAX_SENTENCES=16`
- `JOB_WORKERS=1`
- `MAX_QUEUED_JOBS=8`
- `MAX_ACTIVE_JOBS_PER_USER=2`
- `REQUIRE_AUTH=true`
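
As a rough illustration of how a backend might consume these variables, here is a minimal sketch that reads a subset of them into a typed settings object. The `Settings` class, `load_settings` function, and helper names are hypothetical and not taken from this repo; the fallback values simply mirror the recommendations listed above.

```python
import os
from dataclasses import dataclass
from typing import Mapping

_TRUTHY = {"1", "true", "yes", "on"}


def _env_bool(env: Mapping[str, str], name: str, default: bool) -> bool:
    """Parse a boolean flag; an unset variable falls back to the default."""
    raw = env.get(name)
    return default if raw is None else raw.strip().lower() in _TRUTHY


def _env_int(env: Mapping[str, str], name: str, default: int) -> int:
    """Parse an integer; an unset or blank variable falls back to the default."""
    raw = env.get(name)
    return default if raw is None or not raw.strip() else int(raw)


@dataclass(frozen=True)
class Settings:
    # Hypothetical container; field names are illustrative only.
    model_name: str
    attn_implementation: str
    preload_model: bool
    max_trace_tokens: int
    max_sentences: int
    job_workers: int
    max_queued_jobs: int
    max_active_jobs_per_user: int
    require_auth: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build settings from an environment mapping, using the recommended values as defaults."""
    return Settings(
        model_name=env.get("MODEL_NAME", "sapientinc/HRM-Text-1B"),
        attn_implementation=env.get("ATTN_IMPLEMENTATION", "eager"),
        preload_model=_env_bool(env, "PRELOAD_MODEL", True),
        max_trace_tokens=_env_int(env, "MAX_TRACE_TOKENS", 256),
        max_sentences=_env_int(env, "MAX_SENTENCES", 16),
        job_workers=_env_int(env, "JOB_WORKERS", 1),
        max_queued_jobs=_env_int(env, "MAX_QUEUED_JOBS", 8),
        max_active_jobs_per_user=_env_int(env, "MAX_ACTIVE_JOBS_PER_USER", 2),
        require_auth=_env_bool(env, "REQUIRE_AUTH", True),
    )
```

Passing the environment as a mapping keeps the loader easy to exercise in tests, for example `load_settings({"MAX_TRACE_TOKENS": "128"})`.
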
## Deploy Flow
1. Create new Hugging Face Space with `Docker` SDK.
2. Push repo contents.
3. Set runtime variables in Space settings.
4. Upgrade Space hardware to a GPU tier.
5. Wait for build.
6. Verify:
   - `GET /healthz`
   - sign-in works
   - one short analysis completes
   - JSON / CSV export works
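
The first verification step can be scripted. The sketch below is one possible smoke check using only the standard library; it assumes `/healthz` answers with HTTP 200 when the app is healthy, which this document does not state explicitly. The `healthz_ok` name and the injectable `opener` parameter are illustrative.

```python
import urllib.request


def healthz_ok(base_url: str, timeout: float = 10.0, opener=urllib.request.urlopen) -> bool:
    """Return True if GET <base_url>/healthz responds with HTTP 200.

    Assumption: a 200 status means healthy. Network errors, timeouts, and
    HTTP error statuses (all OSError subclasses in urllib) count as unhealthy.
    """
    url = base_url.rstrip("/") + "/healthz"
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False
```

Call it with the public URL of the Space once the build finishes. The remaining checks (sign-in, a short analysis, export) involve OAuth and are easier to verify by hand.
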
## Operational Notes
- Local disk is ephemeral. Session history disappears on restart.
- OAuth helper is mocked when running locally and uses real Hugging Face OAuth inside a Space.
- Keep public defaults conservative. Long traces can OOM small GPUs.
- If queue pressure grows, lower token caps before increasing worker count.
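
To make the queue-pressure note concrete, here is a hypothetical admission gate showing how `MAX_QUEUED_JOBS` and `MAX_ACTIVE_JOBS_PER_USER` could interact. It is a simplified sketch, not the repo's actual queue: the `AdmissionGate` class is invented for illustration, and it treats a job as occupying a slot from submission until completion.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Tuple


@dataclass
class AdmissionGate:
    # Defaults mirror the recommended traffic limits.
    max_queued_jobs: int = 8
    max_active_jobs_per_user: int = 2
    in_flight: int = 0
    per_user: Counter = field(default_factory=Counter)

    def try_admit(self, user_id: str) -> Tuple[bool, str]:
        """Accept a job only if both the global and the per-user caps allow it."""
        if self.in_flight >= self.max_queued_jobs:
            return False, "queue full"
        if self.per_user[user_id] >= self.max_active_jobs_per_user:
            return False, "per-user limit reached"
        self.in_flight += 1
        self.per_user[user_id] += 1
        return True, "accepted"

    def release(self, user_id: str) -> None:
        """Free a slot when one of the user's jobs finishes or is cancelled."""
        if self.per_user[user_id] > 0:
            self.per_user[user_id] -= 1
            self.in_flight -= 1
```

With a gate like this, rejections show up as "queue full" long before memory becomes a problem, which is one reason to shorten individual jobs (lower token caps) before adding workers that compete for the same GPU.
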
## Common Failure Modes
- `attn_implementation` not eager:
  - attribution disabled for model
- unsupported model layout:
  - generation may work, attribution fails early with clear error
- OOM:
  - reduce `MAX_TRACE_TOKENS`, `MAX_SENTENCES`, or choose larger GPU
- cold start slow:
  - keep `PRELOAD_MODEL=true`
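
Several of these failure modes can be flagged from configuration alone before the model loads. The following is a hypothetical preflight sketch; `preflight_warnings` is not part of this repo, and it assumes that unset variables behave like the recommended values and that 256 tokens is the conservative ceiling.

```python
from typing import List, Mapping

_TRUTHY = {"1", "true", "yes", "on"}
_RECOMMENDED_MAX_TRACE_TOKENS = 256  # assumption: taken from the recommended value


def preflight_warnings(env: Mapping[str, str]) -> List[str]:
    """Return human-readable warnings for settings likely to trigger a known failure mode."""
    warnings: List[str] = []

    attn = env.get("ATTN_IMPLEMENTATION", "eager").strip().lower()
    if attn != "eager":
        warnings.append(
            f"ATTN_IMPLEMENTATION={attn}: attribution will be disabled for the model"
        )

    preload = env.get("PRELOAD_MODEL", "true").strip().lower()
    if preload not in _TRUTHY:
        warnings.append("PRELOAD_MODEL is off: expect a slow cold start")

    tokens = int(env.get("MAX_TRACE_TOKENS", str(_RECOMMENDED_MAX_TRACE_TOKENS)))
    if tokens > _RECOMMENDED_MAX_TRACE_TOKENS:
        warnings.append(
            f"MAX_TRACE_TOKENS={tokens}: long traces can OOM small GPUs"
        )

    return warnings
```

Logging these at startup, for example with `preflight_warnings(os.environ)`, surfaces misconfiguration in the Space build logs rather than in a failed user request.
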