---
title: Touchdown Compression Classifier
emoji: 🚀
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
---
# Touchdown Compression Classifier
Free CPU Hugging Face Space scaffold for the managed prompt compression API.
Phase 1 serves deterministic deletion-only compression with receipts. The
planned classifier backbone is `microsoft/deberta-v3-small`; the API reports
classifier status honestly until a trained KEEP/DROP head or ONNX export is
mounted.
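
To make "deletion-only compression with receipts" concrete, here is a minimal sketch rather than the Space's actual implementation. It assumes candidate DROP spans and protected spans arrive as character offsets; `Span`, `compress_by_deletion`, and the receipt keys are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset
    end: int    # exclusive character offset


def overlaps(a: Span, b: Span) -> bool:
    return a.start < b.end and b.start < a.end


def is_subsequence(candidate: str, original: str) -> bool:
    # True when candidate can be produced from original by deleting characters only.
    remaining = iter(original)
    return all(ch in remaining for ch in candidate)


def compress_by_deletion(text, drop_spans, protected_spans):
    """Delete candidate spans unless they are invalid or touch a protected span."""
    applied, blocked, kept_parts = [], [], []
    cursor = 0
    for span in sorted(drop_spans, key=lambda s: (s.start, s.end)):
        invalid = span.start < cursor or span.end > len(text) or span.start >= span.end
        if invalid or any(overlaps(span, p) for p in protected_spans):
            blocked.append(span)  # never deleted; only counted in the receipt
            continue
        kept_parts.append(text[cursor:span.start])
        cursor = span.end
        applied.append(span)
    kept_parts.append(text[cursor:])
    output = "".join(kept_parts)
    receipt = {  # hypothetical field names, not the API's schema
        "removed_span_count": len(applied),
        "removed_char_count": sum(s.end - s.start for s in applied),
        "blocked_span_count": len(blocked),
        "deletion_only": is_subsequence(output, text),
    }
    return output, receipt
```

Because the output is assembled only from slices of the original text, the `deletion_only` check should always hold; it is kept in the receipt as an independent safety assertion.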
Endpoints:
- `GET /health`
- `POST /v1/compress`
- `POST /v1/classify`
Live Space:
- `https://wchen22-touchdown-compression-classifier.hf.space`
- Verified 2026-06-11 with HF CLI: runtime stage `RUNNING`, hardware
`cpu-basic`, domain `READY`, repo/runtime SHA
`0dfe65a6c82c9e7fa37d2c4a32c8eda3ed4e96d7`.
- The deployed scaffold supports chunked ONNX artifact inference for long
prompts. Use `hf spaces info wchen22/touchdown-compression-classifier --format
json` for the current repo/runtime SHA.
- Live smoke:
`python3 scripts/smoke_compression_api.py --base-url https://wchen22-touchdown-compression-classifier.hf.space --include-classify --include-batch --include-messages --include-gzip`
validates `/health`, `/v1/classify`, single `/v1/compress`, managed
`inputs[]` batches, managed `messages[]`, and gzipped JSON request/response
transport.
- Real-corpus API benchmark:
`python3 scripts/benchmark_compression_api.py --base-url https://wchen22-touchdown-compression-classifier.hf.space --input-jsonl benchmarks/prompts/real/kv_stress_seed.jsonl --limit 4 --tokenizer-model Qwen/Qwen2.5-7B-Instruct --require-exact-tokens`.
This calls hosted `/v1/compress` over real prompt rows and fails the run if
receipts return estimated token counts. Use this before claiming real-token
savings.
- Full deployment receipt:
`python3 scripts/verify_compression_space.py --expected-sha <sha> --out reports/generated/compression_space/hf_space_verification.json`
validates HF runtime metadata, repo/runtime SHA agreement, API smoke, and
remote/local Space file parity.
- Fresh local receipts are written under
`reports/generated/compression_space/`; run the full verifier with the
current Space SHA to check runtime, API smoke, and remote/local file parity.
Current live receipt:
`reports/generated/compression_space/hf_space_verification_2026-06-11-managed-messages.json`.
- Latest live result: `/v1/compress` saved 27/102 estimated tokens;
managed `inputs[]` returned `input_count=2`, `succeeded=2`, `failed=0`;
managed `messages[]` returned `message_count=2` with system-role protection;
gzip transport returned `response_content_encoding=gzip`; and `/v1/classify`
returned KEEP-only DeBERTa tokenizer labels. Receipts include
removed-span/char totals, classifier DROP block reasons, tool-schema
preservation counts when `tools` or `tool_schemas` are supplied, and
`/health` idempotency TTL reporting.
Matching `Idempotency-Key` retries replay the first in-memory response;
payload conflicts return HTTP 409. This is per-process memory on the Space,
not a durable distributed store.
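
The replay and conflict behaviour can be sketched as a small per-process cache. This is an illustrative sketch, not the Space's code: the `IdempotencyCache` class, the SHA-256 payload fingerprint, the 300-second default TTL, and the error body are all assumptions.

```python
import hashlib
import json
import time


class IdempotencyCache:
    """Per-process replay cache; contents are lost when the process restarts."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self._ttl = ttl_seconds  # assumed default, not the deployed value
        self._clock = clock
        self._entries = {}  # key -> (payload fingerprint, expires_at, response)

    @staticmethod
    def _fingerprint(payload):
        canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def handle(self, key, payload, compute):
        """Return (status, body): replay, conflict (409), or a fresh result."""
        now = self._clock()
        fingerprint = self._fingerprint(payload)
        entry = self._entries.get(key)
        if entry is not None and entry[1] <= now:
            del self._entries[key]  # expired entries behave like unseen keys
            entry = None
        if entry is not None:
            if entry[0] != fingerprint:
                return 409, {"error": "idempotency key reused with a different payload"}
            return 200, entry[2]  # replay the first response without recomputing
        response = compute(payload)
        self._entries[key] = (fingerprint, now + self._ttl, response)
        return 200, response
```

A retry with the same key and payload returns the stored response without calling `compute` again, which is what makes client retries safe within a single process.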
The HTTP surface accepts `Content-Encoding: gzip` request bodies and returns
gzip responses when the request sends `Accept-Encoding: gzip` or is itself
gzipped. If an ingress strips the standard `Content-Encoding` header, also send
`X-Touchdown-Content-Encoding: gzip`.
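
A client-side sketch of the gzip transport, using only the Python standard library. The header names come from this README; the helper names `build_gzip_request` and `decode_response_body` are hypothetical, and the single-field `{"input": ...}` payload in the usage note is the simplest request shape described here, not a full schema.

```python
import gzip
import json
import urllib.request


def build_gzip_request(base_url, path, payload, idempotency_key=None):
    """Build (but do not send) a POST request with a gzipped JSON body."""
    body = gzip.compress(json.dumps(payload).encode("utf-8"))
    headers = {
        "Content-Type": "application/json",
        "Content-Encoding": "gzip",
        # Fallback for an ingress that strips Content-Encoding.
        "X-Touchdown-Content-Encoding": "gzip",
        "Accept-Encoding": "gzip",
    }
    if idempotency_key is not None:
        headers["Idempotency-Key"] = idempotency_key
    url = base_url.rstrip("/") + path
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def decode_response_body(raw, content_encoding):
    """Decode a JSON response body, gunzipping it first when needed."""
    if (content_encoding or "").lower() == "gzip":
        raw = gzip.decompress(raw)
    return json.loads(raw.decode("utf-8"))
```

To send the request, pass the result of `build_gzip_request(base_url, "/v1/compress", {"input": "..."})` to `urllib.request.urlopen`, then call `decode_response_body(resp.read(), resp.headers.get("Content-Encoding"))` on the response.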
- `/v1/classify` is tokenizer/fallback KEEP-only until a trained KEEP/DROP head
is mounted. `/v1/compress` is rules-first deletion-only compression with
safety receipts. The Space app supports both single `input` requests and
managed `inputs[]` batches with per-item receipts and partial-error rows.
`/v1/compress` now accepts `tokenizer_model`; when the tokenizer loads,
receipts report `token_count_exact=true`, `token_count_method=tokenizer`, and
the requested model. If it cannot load, receipts remain estimated and the
benchmark `--require-exact-tokens` gate fails.
- Mount `classifier_manifest.json`, tokenizer files, and optional `model.onnx`;
set `TOUCHDOWN_CLASSIFIER_ARTIFACT_DIR` to let the Space use artifact DROP
labels through ONNX Runtime or the manifest fallback. ONNX labels are
evaluated in chunked windows using manifest `max_length` and `stride`; mounted
ONNX labels expose `keep_score`, `drop_score`, and `drop_score_threshold`.
DROP spans still pass through protected-span and deletion-only safety gates.
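
The chunked-window evaluation can be sketched as follows. This is a sketch under stated assumptions, not the deployed logic: it treats manifest `stride` as the number of tokens shared by consecutive windows (the Hugging Face tokenizer convention), and it merges overlapping windows by taking the minimum drop score so a token is dropped only when every window agrees. `window_ranges`, `label_tokens`, and the `score_window` callback are hypothetical names.

```python
def window_ranges(token_count, max_length, stride):
    """Return (start, end) token ranges that cover the whole sequence."""
    if max_length <= 0 or not 0 <= stride < max_length:
        raise ValueError("need max_length > 0 and 0 <= stride < max_length")
    if token_count <= 0:
        return []
    step = max_length - stride  # assumption: stride is the overlap size
    ranges, start = [], 0
    while True:
        end = min(start + max_length, token_count)
        ranges.append((start, end))
        if end >= token_count:
            return ranges
        start += step


def label_tokens(token_count, max_length, stride, score_window, drop_score_threshold):
    """Label each token KEEP or DROP from per-window drop scores."""
    merged = [None] * token_count
    for start, end in window_ranges(token_count, max_length, stride):
        scores = score_window(start, end)  # stand-in for one model call per window
        if len(scores) != end - start:
            raise ValueError("score_window must return one score per token")
        for offset, score in enumerate(scores):
            i = start + offset
            # Conservative merge: keep the lowest drop score seen for a token.
            merged[i] = score if merged[i] is None else min(merged[i], score)
    return ["DROP" if s >= drop_score_threshold else "KEEP" for s in merged]
```

The labels produced here are only candidates; as noted above, DROP spans still have to pass the protected-span and deletion-only gates before anything is removed.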
Deploy:
```bash
hf auth login
./deploy.sh <namespace>/touchdown-compression-classifier
```
Free CPU Spaces are enough for this scaffold; production traffic should move to
paid or owned infrastructure after validation.