---
title: Touchdown Compression Classifier
emoji: 🚀
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
---

# Touchdown Compression Classifier

Free CPU Hugging Face Space scaffold for the managed prompt compression API.

Phase 1 serves deterministic deletion-only compression with receipts. The
planned classifier backbone is `microsoft/deberta-v3-small`; the API reports
classifier status honestly until a trained KEEP/DROP head or ONNX export is
mounted.

Endpoints:

- `GET /health`
- `POST /v1/compress`
- `POST /v1/classify`
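
As a minimal sketch of calling these endpoints, the snippet below assembles (but does not send) a gzipped `POST /v1/compress` request. The `input` and `tokenizer_model` fields and the header names come from this README. The helper name `build_compress_request` and the assumption that the JSON body needs no other fields are illustrative only.

```python
import gzip
import json
import urllib.request

BASE_URL = "https://wchen22-touchdown-compression-classifier.hf.space"


def build_compress_request(text, idempotency_key, tokenizer_model=None):
    """Build (without sending) a gzipped POST /v1/compress request.

    Hypothetical helper: the exact request schema is an assumption based on
    the field names mentioned in this README.
    """
    payload = {"input": text}
    if tokenizer_model is not None:
        payload["tokenizer_model"] = tokenizer_model

    # Compress the JSON body; the server is described as accepting gzip bodies.
    body = gzip.compress(json.dumps(payload).encode("utf-8"))
    headers = {
        "Content-Type": "application/json",
        "Content-Encoding": "gzip",
        # Fallback header for ingresses that strip Content-Encoding.
        "X-Touchdown-Content-Encoding": "gzip",
        "Accept-Encoding": "gzip",
        # Retries with the same key are described as replaying the first response.
        "Idempotency-Key": idempotency_key,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/compress", data=body, headers=headers, method="POST"
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Because the request advertises `Accept-Encoding: gzip`, the caller would then need to decompress the response body itself.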

Live Space:

- `https://wchen22-touchdown-compression-classifier.hf.space`
- Verified 2026-06-11 with HF CLI: runtime stage `RUNNING`, hardware
  `cpu-basic`, domain `READY`, repo/runtime SHA
  `0dfe65a6c82c9e7fa37d2c4a32c8eda3ed4e96d7`.
- The deployed scaffold supports chunked ONNX artifact inference for long
  prompts. For the current repo/runtime SHA, run
  `hf spaces info wchen22/touchdown-compression-classifier --format json`.
- Live smoke:
  `python3 scripts/smoke_compression_api.py --base-url https://wchen22-touchdown-compression-classifier.hf.space --include-classify --include-batch --include-messages --include-gzip`
  validates `/health`, `/v1/classify`, single `/v1/compress`, managed
  `inputs[]` batches, managed `messages[]`, and gzipped JSON request/response
  transport.
- Real-corpus API benchmark:
  `python3 scripts/benchmark_compression_api.py --base-url https://wchen22-touchdown-compression-classifier.hf.space --input-jsonl benchmarks/prompts/real/kv_stress_seed.jsonl --limit 4 --tokenizer-model Qwen/Qwen2.5-7B-Instruct --require-exact-tokens`.
  This calls hosted `/v1/compress` over real prompt rows and fails the run if
  receipts return estimated token counts. Use this before claiming real-token
  savings.
- Full deployment receipt:
  `python3 scripts/verify_compression_space.py --expected-sha <sha> --out reports/generated/compression_space/hf_space_verification.json`
  validates HF runtime metadata, repo/runtime SHA agreement, API smoke, and
  remote/local Space file parity.
- Fresh local receipts are written under
  `reports/generated/compression_space/`; run the full verifier with the
  current Space SHA to check runtime, API smoke, and remote/local file parity.
  Current live receipt:
  `reports/generated/compression_space/hf_space_verification_2026-06-11-managed-messages.json`.
- Latest live result: `/v1/compress` saved 27/102 estimated tokens; managed
  `inputs[]` returned `input_count=2`, `succeeded=2`, `failed=0`; managed
  `messages[]` returned `message_count=2` with system-role protection; gzip
  transport returned `response_content_encoding=gzip`; and `/v1/classify`
  returned KEEP-only DeBERTa tokenizer labels. Receipts include
  removed-span/char totals, classifier DROP block reasons, tool-schema
  preservation counts when `tools` or `tool_schemas` are supplied, and
  `/health` idempotency TTL reporting.
  Matching `Idempotency-Key` retries replay the first in-memory response;
  payload conflicts return HTTP 409. This is per-process memory on the Space,
  not a durable distributed store.
  The HTTP surface accepts `Content-Encoding: gzip` request bodies and returns
  gzip responses when the request sends `Accept-Encoding: gzip` or is itself
  gzipped. If an ingress strips the standard content-encoding header, also send
  `X-Touchdown-Content-Encoding: gzip`.
- `/v1/classify` is tokenizer/fallback KEEP-only until a trained KEEP/DROP head
  is mounted. `/v1/compress` is rules-first deletion-only compression with
  safety receipts. The Space app supports both single `input` requests and
  managed `inputs[]` batches with per-item receipts and partial-error rows.
  `/v1/compress` now accepts `tokenizer_model`; when the tokenizer loads,
  receipts report `token_count_exact=true`, `token_count_method=tokenizer`, and
  the requested model. If it cannot load, receipts remain estimated and the
  benchmark `--require-exact-tokens` gate fails.
- Mount `classifier_manifest.json`, tokenizer files, and optional `model.onnx`;
  set `TOUCHDOWN_CLASSIFIER_ARTIFACT_DIR` to let the Space use artifact DROP
  labels through ONNX Runtime or the manifest fallback. ONNX labels are
  evaluated in chunked windows using manifest `max_length` and `stride`; mounted
  ONNX labels expose `keep_score`, `drop_score`, and `drop_score_threshold`.
  DROP spans still pass through protected-span and deletion-only safety gates.
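
The chunked-window evaluation mentioned above can be illustrated with a small sketch. It assumes `stride` means the number of tokens shared by consecutive windows (the Hugging Face tokenizer convention), so each window starts `max_length - stride` tokens after the previous one. The Space's actual windowing may differ, and `chunk_windows` is a hypothetical name.

```python
def chunk_windows(num_tokens, max_length, stride):
    """Return (start, end) token windows that cover [0, num_tokens).

    Assumption: `stride` is the overlap between consecutive windows, not the
    step size. This is a sketch, not the Space's implementation.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")

    step = max_length - stride
    windows = []
    start = 0
    while start < num_tokens:
        end = min(start + max_length, num_tokens)
        windows.append((start, end))
        if end == num_tokens:
            # The final window reached the end of the prompt.
            break
        start += step
    return windows


if __name__ == "__main__":
    # A 10-token prompt with 4-token windows that overlap by 1 token.
    print(chunk_windows(10, 4, 1))
```

Each window would be scored separately by the ONNX model. How per-token scores from overlapping windows are merged is not specified here and is left out of the sketch.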

Deploy:

```bash
hf auth login
./deploy.sh <namespace>/touchdown-compression-classifier
```

Free CPU Spaces are enough for this scaffold; production traffic should move to
paid or owned infrastructure after validation.