title: Touchdown Compression Classifier
emoji: 🚀
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860

Touchdown Compression Classifier

Free CPU Hugging Face Space scaffold for the managed prompt compression API.

Phase 1 serves deterministic deletion-only compression with receipts. The planned classifier backbone is microsoft/deberta-v3-small; the API reports classifier status honestly until a trained KEEP/DROP head or ONNX export is mounted.

Endpoints:

  • GET /health
  • POST /v1/compress
  • POST /v1/classify

Live Space:

  • https://wchen22-touchdown-compression-classifier.hf.space
  • Verified 2026-06-11 with HF CLI: runtime stage RUNNING, hardware cpu-basic, domain READY, repo/runtime SHA 0dfe65a6c82c9e7fa37d2c4a32c8eda3ed4e96d7.
  • The deployed scaffold supports chunked ONNX artifact inference for long prompts. Use hf spaces info wchen22/touchdown-compression-classifier --format json for the current repo/runtime SHA.
  • Live smoke: python3 scripts/smoke_compression_api.py --base-url https://wchen22-touchdown-compression-classifier.hf.space --include-classify --include-batch --include-messages --include-gzip validates /health, /v1/classify, single /v1/compress, managed inputs[] batches, managed messages[], and gzipped JSON request/response transport.
  • Real-corpus API benchmark: python3 scripts/benchmark_compression_api.py --base-url https://wchen22-touchdown-compression-classifier.hf.space --input-jsonl benchmarks/prompts/real/kv_stress_seed.jsonl --limit 4 --tokenizer-model Qwen/Qwen2.5-7B-Instruct --require-exact-tokens. This calls hosted /v1/compress over real prompt rows and fails the run if receipts return estimated token counts. Use this before claiming real-token savings.
  • Full deployment receipt: python3 scripts/verify_compression_space.py --expected-sha <sha> --out reports/generated/compression_space/hf_space_verification.json validates HF runtime metadata, repo/runtime SHA agreement, API smoke, and remote/local Space file parity.
  • Fresh local receipts are written under reports/generated/compression_space/; run the full verifier with the current Space SHA to check runtime, API smoke, and remote/local file parity. Current live receipt: reports/generated/compression_space/hf_space_verification_2026-06-11-managed-messages.json.
  • Latest live result: /v1/compress saved 27/102 estimated tokens; managed inputs[] returned input_count=2, succeeded=2, failed=0; managed messages[] returned message_count=2 with system-role protection; gzip transport returned response_content_encoding=gzip; and /v1/classify returned KEEP-only DeBERTa tokenizer labels. Receipts include removed-span/char totals, classifier DROP block reasons, tool-schema preservation counts when tools or tool_schemas are supplied, and /health idempotency TTL reporting. A retry with a matching Idempotency-Key replays the first response from memory; a payload conflict returns HTTP 409. The idempotency store is per-process memory on the Space, not a durable distributed store. The HTTP surface accepts request bodies sent with Content-Encoding: gzip and returns gzip responses when the request carries Accept-Encoding: gzip or is itself gzipped. If an ingress strips the standard Content-Encoding header, also send X-Touchdown-Content-Encoding: gzip.
  • /v1/classify is tokenizer/fallback KEEP-only until a trained KEEP/DROP head is mounted. /v1/compress is rules-first deletion-only compression with safety receipts. The Space app supports both single input requests and managed inputs[] batches with per-item receipts and partial-error rows. /v1/compress now accepts tokenizer_model; when the tokenizer loads, receipts report token_count_exact=true, token_count_method=tokenizer, and the requested model. If it cannot load, receipts remain estimated and the benchmark --require-exact-tokens gate fails.
  • Mount classifier_manifest.json, tokenizer files, and optional model.onnx; set TOUCHDOWN_CLASSIFIER_ARTIFACT_DIR to let the Space use artifact DROP labels through ONNX Runtime or the manifest fallback. ONNX labels are evaluated in chunked windows using manifest max_length and stride; mounted ONNX labels expose keep_score, drop_score, and drop_score_threshold. DROP spans still pass through protected-span and deletion-only safety gates.
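
The chunked-window evaluation mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the Space's implementation: stride is read in the Hugging Face tokenizer sense (the number of tokens shared by consecutive windows), overlapping windows are merged by taking the lowest drop score per token so that overlaps lean toward KEEP, and a token is labelled DROP when its merged score is at or above drop_score_threshold. The function names chunk_windows and label_tokens are hypothetical.

```python
def chunk_windows(num_tokens, max_length, stride):
    """Return (start, end) token ranges that cover num_tokens.

    Assumption: stride is the overlap between consecutive windows,
    so each window starts max_length - stride tokens after the last.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    if num_tokens <= 0:
        return []
    step = max_length - stride
    windows = []
    start = 0
    while True:
        end = min(start + max_length, num_tokens)
        windows.append((start, end))
        if end >= num_tokens:
            break
        start += step
    return windows


def label_tokens(windows, drop_scores_per_window, num_tokens, drop_score_threshold):
    """Merge per-window drop scores into one KEEP/DROP label per token.

    Assumption: where windows overlap, the lowest drop score wins,
    so a token is dropped only if every window covering it agrees.
    """
    merged = [None] * num_tokens
    for (start, end), scores in zip(windows, drop_scores_per_window):
        for offset, score in enumerate(scores[: end - start]):
            index = start + offset
            if merged[index] is None or score < merged[index]:
                merged[index] = score
    return [
        "DROP" if score is not None and score >= drop_score_threshold else "KEEP"
        for score in merged
    ]
```

Labels produced this way are only candidates: as noted above, DROP spans still have to pass the protected-span and deletion-only safety gates before anything is removed.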

Deploy:

hf auth login
./deploy.sh <namespace>/touchdown-compression-classifier

Free CPU Spaces are enough for this scaffold; production traffic should move to paid or owned infrastructure after validation.