Spaces:

BART-ender
/

cot-anc

Sleeping

File size: 1,654 Bytes

# Hugging Face Deployment

Primary target: Hugging Face `Docker Space` on upgraded GPU hardware.

## What Gets Deployed

- FastAPI backend
- static web frontend
- Hugging Face OAuth routes
- ephemeral SQLite-backed session queue

## Required Space Settings

- SDK: `Docker`
- Port: `7860`
- OAuth: enabled via README metadata
- Hardware: upgraded GPU recommended

## Recommended Runtime Variables

Core:

- `MODEL_NAME=sapientinc/HRM-Text-1B`
- `DEVICE_PREFERENCE=auto`
- `DTYPE_PREFERENCE=auto`
- `ATTN_IMPLEMENTATION=eager`
- `LOW_CPU_MEM_USAGE=true`
- `TRUST_REMOTE_CODE=true`
- `PRELOAD_MODEL=true`

Traffic limits:

- `MAX_TRACE_TOKENS=256`
- `MAX_SENTENCES=16`
- `JOB_WORKERS=1`
- `MAX_QUEUED_JOBS=8`
- `MAX_ACTIVE_JOBS_PER_USER=2`
- `REQUIRE_AUTH=true`

## Deploy Flow

1. Create new Hugging Face Space with `Docker` SDK.
2. Push repo contents.
3. Set runtime variables in Space settings.
4. Upgrade hardware.
5. Wait for build.
6. Verify:
   - `GET /healthz`
   - sign-in works
   - one short analysis completes
   - JSON / CSV export works

## Operational Notes

- Local disk is ephemeral. Session history disappears on restart.
- OAuth helper is mocked locally but real inside Space.
- Keep public defaults conservative. Long traces can OOM small GPUs.
- If queue pressure grows, lower token caps before increasing worker count.

## Common Failure Modes

- `attn_implementation` not eager:
  - attribution disabled for model
- unsupported model layout:
  - generation may work, attribution fails early with clear error
- OOM:
  - reduce `MAX_TRACE_TOKENS`, `MAX_SENTENCES`, or choose larger GPU
- cold start slow:
  - keep `PRELOAD_MODEL=true`