	# Hugging Face Deployment

	Primary target: a Hugging Face Space using the `Docker` SDK, running on upgraded GPU hardware.

	## What Gets Deployed

	- FastAPI backend
	- static web frontend
	- Hugging Face OAuth routes
	- ephemeral SQLite-backed session queue

	## Required Space Settings

	- SDK: `Docker`
	- Port: `7860`
	- OAuth: enabled via README metadata
	- Hardware: upgraded GPU recommended
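
The settings above live in the YAML front matter of the Space's `README.md`. The sketch below checks a README for them before pushing. The key names `sdk`, `app_port`, and `hf_oauth` come from the Hugging Face Spaces configuration reference, not from this repo, and the `title` value and helper names are illustrative assumptions.

```python
# Assumed front-matter keys for a Docker Space with OAuth enabled.
REQUIRED_METADATA = {"sdk": "docker", "app_port": "7860", "hf_oauth": "true"}


def read_front_matter(readme_text):
    """Parse flat `key: value` pairs between the leading '---' fences."""
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    metadata = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        # Skip nested or list entries; only top-level scalars matter here.
        if sep and not line.startswith((" ", "\t", "-")):
            metadata[key.strip()] = value.strip()
    return metadata


def missing_settings(readme_text):
    """Return the required keys whose values are absent or different."""
    found = read_front_matter(readme_text)
    return {
        key: expected
        for key, expected in REQUIRED_METADATA.items()
        if found.get(key, "").lower() != expected
    }


# Hypothetical README header for illustration.
EXAMPLE_README = """---
title: cot-anc
sdk: docker
app_port: 7860
hf_oauth: true
---
"""
```

Calling `missing_settings(EXAMPLE_README)` returns an empty dict when all three settings are present. Hardware tier is chosen in the Space settings UI, so it is not covered by this check.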

	## Recommended Runtime Variables

	Core:

	- `MODEL_NAME=sapientinc/HRM-Text-1B`
	- `DEVICE_PREFERENCE=auto`
	- `DTYPE_PREFERENCE=auto`
	- `ATTN_IMPLEMENTATION=eager`
	- `LOW_CPU_MEM_USAGE=true`
	- `TRUST_REMOTE_CODE=true`
	- `PRELOAD_MODEL=true`

	Traffic limits:

	- `MAX_TRACE_TOKENS=256`
	- `MAX_SENTENCES=16`
	- `JOB_WORKERS=1`
	- `MAX_QUEUED_JOBS=8`
	- `MAX_ACTIVE_JOBS_PER_USER=2`
	- `REQUIRE_AUTH=true`
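
As a minimal sketch of how a backend might read these variables, the code below loads a subset of them from the environment and falls back to the recommended values. The `Settings` class, `env_bool`, and `load_settings` are hypothetical names; the real application may parse its configuration differently.

```python
import os
from dataclasses import dataclass


def env_bool(env, name, default):
    """Interpret common truthy spellings; use the default when unset."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    model_name: str
    attn_implementation: str
    preload_model: bool
    max_trace_tokens: int
    max_sentences: int
    job_workers: int
    max_queued_jobs: int
    max_active_jobs_per_user: int
    require_auth: bool


def load_settings(env=None):
    """Build Settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return Settings(
        model_name=env.get("MODEL_NAME", "sapientinc/HRM-Text-1B"),
        attn_implementation=env.get("ATTN_IMPLEMENTATION", "eager"),
        preload_model=env_bool(env, "PRELOAD_MODEL", True),
        max_trace_tokens=int(env.get("MAX_TRACE_TOKENS", "256")),
        max_sentences=int(env.get("MAX_SENTENCES", "16")),
        job_workers=int(env.get("JOB_WORKERS", "1")),
        max_queued_jobs=int(env.get("MAX_QUEUED_JOBS", "8")),
        max_active_jobs_per_user=int(env.get("MAX_ACTIVE_JOBS_PER_USER", "2")),
        require_auth=env_bool(env, "REQUIRE_AUTH", True),
    )
```

Passing an explicit mapping, for example `load_settings({"MAX_TRACE_TOKENS": "128"})`, makes it easy to try tighter limits locally without touching the Space settings.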

	## Deploy Flow

	1. Create new Hugging Face Space with `Docker` SDK.
	2. Push repo contents.
	3. Set runtime variables in Space settings.
	4. Upgrade hardware.
	5. Wait for build.
	6. Verify:
	   - `GET /healthz`
	   - sign-in works
	   - one short analysis completes
	   - JSON / CSV export works
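
The first verification step can be scripted. This is a sketch that assumes `/healthz` answers with HTTP 200 when the app is ready; the response body format is not documented here, so it is ignored. The function name and the `opener` parameter (useful for testing without network access) are illustrative.

```python
import urllib.error
import urllib.request


def check_health(base_url, opener=urllib.request.urlopen, timeout=10):
    """Return True if GET <base_url>/healthz responds with HTTP 200."""
    url = base_url.rstrip("/") + "/healthz"
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.URLError:
        # Covers connection failures and non-2xx responses (HTTPError).
        return False


# Example call; replace the placeholder with the Space's public URL:
# check_health("https://<your-space-subdomain>.hf.space")
```

Sign-in, the short analysis, and the export checks still need a manual pass in the browser, since they depend on an authenticated session.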

	## Operational Notes

	- Local disk is ephemeral. Session history disappears on restart.
	- The OAuth helper is mocked for local runs; inside a Space it uses real Hugging Face OAuth.
	- Keep public defaults conservative. Long traces can OOM small GPUs.
	- If queue pressure grows, lower token caps before increasing worker count.
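
To illustrate how the queue limits bound load, here is a sketch of an admission check. It assumes `MAX_QUEUED_JOBS` caps all unfinished jobs and `MAX_ACTIVE_JOBS_PER_USER` caps unfinished jobs per user; the class and exception names are hypothetical, and the real queue is SQLite-backed rather than in memory.

```python
from collections import Counter


class QueueFull(Exception):
    """Raised when the global job cap is reached."""


class UserLimitReached(Exception):
    """Raised when one user already holds their maximum number of jobs."""


class JobAdmission:
    def __init__(self, max_queued_jobs=8, max_active_jobs_per_user=2):
        self.max_queued_jobs = max_queued_jobs
        self.max_active_jobs_per_user = max_active_jobs_per_user
        self._jobs_by_user = Counter()

    def admit(self, user_id):
        """Register a new job for user_id or raise if a cap is hit."""
        if self._jobs_by_user[user_id] >= self.max_active_jobs_per_user:
            raise UserLimitReached(user_id)
        if sum(self._jobs_by_user.values()) >= self.max_queued_jobs:
            raise QueueFull()
        self._jobs_by_user[user_id] += 1

    def release(self, user_id):
        """Mark one of the user's jobs as finished."""
        if self._jobs_by_user[user_id] > 0:
            self._jobs_by_user[user_id] -= 1
```

With `JOB_WORKERS=1`, the worst-case wait for a newly admitted job is roughly the number of jobs ahead of it times the duration of one job, which is why lowering `MAX_TRACE_TOKENS` shortens the queue without raising GPU memory pressure the way extra workers would.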

	## Common Failure Modes

	- `attn_implementation` not eager:
	  - attribution disabled for model
	- unsupported model layout:
	  - generation may work, attribution fails early with clear error
	- OOM:
	  - reduce `MAX_TRACE_TOKENS`, `MAX_SENTENCES`, or choose larger GPU
	- cold start slow:
	  - keep `PRELOAD_MODEL=true`
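
The first failure mode can be caught before the model loads. The sketch below builds keyword arguments for Hugging Face Transformers' `from_pretrained` and reports whether attribution would be available. It assumes attribution depends on attention weights, which Transformers generally returns only with the `eager` attention implementation. The helper names are hypothetical, and whether the default model loads through `AutoModelForCausalLM` is also an assumption.

```python
def build_load_kwargs(
    attn_implementation="eager",
    dtype_preference="auto",
    low_cpu_mem_usage=True,
    trust_remote_code=True,
):
    """Map the runtime variables onto from_pretrained keyword arguments."""
    return {
        "attn_implementation": attn_implementation,
        # "torch_dtype" is the long-standing keyword; check the installed
        # Transformers version, since newer releases may prefer "dtype".
        "torch_dtype": dtype_preference,
        "low_cpu_mem_usage": low_cpu_mem_usage,
        "trust_remote_code": trust_remote_code,
    }


def attribution_supported(load_kwargs):
    """Attribution is assumed to require the eager attention path."""
    return load_kwargs.get("attn_implementation") == "eager"


# Possible usage, assuming the model loads via the Auto classes:
# from transformers import AutoModelForCausalLM
# kwargs = build_load_kwargs()
# model = AutoModelForCausalLM.from_pretrained("sapientinc/HRM-Text-1B", **kwargs)
```

Checking `attribution_supported(kwargs)` at startup lets the app surface a clear message instead of failing on the first analysis request.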