Space: wchen22/touchdown-compression-classifier

wchen22 committed 23 days ago

Commit 1212e7d · verified · 1 parent: 0dfe65a

Add exact tokenizer accounting to compression API

Files changed (2)

README.md +11 -2
app.py +43 -2

README.md CHANGED

@@ -27,7 +27,7 @@ Live Space:
 - `https://wchen22-touchdown-compression-classifier.hf.space`
 - Verified 2026-06-11 with HF CLI: runtime stage `RUNNING`, hardware
   `cpu-basic`, domain `READY`, repo/runtime SHA
-  `b402ba63bf08ce65bd30da071256555382be4fe0`.
 - The deployed scaffold supports chunked ONNX artifact inference for long
   prompts. Use `hf spaces info wchen22/touchdown-compression-classifier --format
   json` for the current repo/runtime SHA.
@@ -36,6 +36,11 @@ Live Space:
   validates `/health`, `/v1/classify`, single `/v1/compress`, and managed
   `inputs[]` batch, managed `messages[]`, plus gzipped JSON request/response
   transport.
 - Full deployment receipt:
   `python3 scripts/verify_compression_space.py --expected-sha <sha> --out reports/generated/compression_space/hf_space_verification.json`
   validates HF runtime metadata, repo/runtime SHA agreement, API smoke, and
@@ -44,7 +49,7 @@ Live Space:
   `reports/generated/compression_space/`; run the full verifier with the
   current Space SHA to check runtime, API smoke, and remote/local file parity.
   Current live receipt:
-  `reports/generated/compression_space/hf_space_verification_2026-06-11-idempotency-replay-health.json`.
 - Latest live result: `/v1/compress` saved 27/102 estimated tokens;
   managed `inputs[]` returned `input_count=2`, `succeeded=2`, `failed=0`,
   managed `messages[]` returned `message_count=2` with system-role protection,
@@ -64,6 +69,10 @@ Live Space:
   is mounted. `/v1/compress` is rules-first deletion-only compression with
   safety receipts. The Space app supports both single `input` requests and
   managed `inputs[]` batches with per-item receipts and partial-error rows.
 - Mount `classifier_manifest.json`, tokenizer files, and optional `model.onnx`;
   set `TOUCHDOWN_CLASSIFIER_ARTIFACT_DIR` to let the Space use artifact DROP
   labels through ONNX Runtime or the manifest fallback. ONNX labels are

 - `https://wchen22-touchdown-compression-classifier.hf.space`
 - Verified 2026-06-11 with HF CLI: runtime stage `RUNNING`, hardware
   `cpu-basic`, domain `READY`, repo/runtime SHA
+  `0dfe65a6c82c9e7fa37d2c4a32c8eda3ed4e96d7`.
 - The deployed scaffold supports chunked ONNX artifact inference for long
   prompts. Use `hf spaces info wchen22/touchdown-compression-classifier --format
   json` for the current repo/runtime SHA.
   validates `/health`, `/v1/classify`, single `/v1/compress`, and managed
   `inputs[]` batch, managed `messages[]`, plus gzipped JSON request/response
   transport.
+- Real-corpus API benchmark:
+  `python3 scripts/benchmark_compression_api.py --base-url https://wchen22-touchdown-compression-classifier.hf.space --input-jsonl benchmarks/prompts/real/kv_stress_seed.jsonl --limit 4 --tokenizer-model Qwen/Qwen2.5-7B-Instruct --require-exact-tokens`.
+  This calls hosted `/v1/compress` over real prompt rows and fails the run if
+  receipts return estimated token counts. Use this before claiming real-token
+  savings.
 - Full deployment receipt:
   `python3 scripts/verify_compression_space.py --expected-sha <sha> --out reports/generated/compression_space/hf_space_verification.json`
   validates HF runtime metadata, repo/runtime SHA agreement, API smoke, and
   `reports/generated/compression_space/`; run the full verifier with the
   current Space SHA to check runtime, API smoke, and remote/local file parity.
   Current live receipt:
+  `reports/generated/compression_space/hf_space_verification_2026-06-11-managed-messages.json`.
 - Latest live result: `/v1/compress` saved 27/102 estimated tokens;
   managed `inputs[]` returned `input_count=2`, `succeeded=2`, `failed=0`,
   managed `messages[]` returned `message_count=2` with system-role protection,
   is mounted. `/v1/compress` is rules-first deletion-only compression with
   safety receipts. The Space app supports both single `input` requests and
   managed `inputs[]` batches with per-item receipts and partial-error rows.
+  `/v1/compress` now accepts `tokenizer_model`; when the tokenizer loads,
+  receipts report `token_count_exact=true`, `token_count_method=tokenizer`, and
+  the requested model. If it cannot load, receipts remain estimated and the
+  benchmark `--require-exact-tokens` gate fails.
 - Mount `classifier_manifest.json`, tokenizer files, and optional `model.onnx`;
   set `TOUCHDOWN_CLASSIFIER_ARTIFACT_DIR` to let the Space use artifact DROP
   labels through ONNX Runtime or the manifest fallback. ONNX labels are
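
The exact-token gate described in the README hunk above can be sketched on the client side. This is a minimal illustration, not the repository's benchmark script: the helper names `build_compress_payload` and `require_exact_receipt` are hypothetical, and it assumes the receipt is a flat dict carrying the three fields named in the README (`token_count_exact`, `token_count_method`, `tokenizer_model`). Where the receipt sits inside the hosted response is not shown in this diff, so the HTTP call is left out.

```python
from __future__ import annotations

from typing import Any


def build_compress_payload(text: str, tokenizer_model: str | None = None) -> dict[str, Any]:
    """Build a single-`input` request body for /v1/compress.

    `input` and `tokenizer_model` are the field names used in the README;
    every other request field is omitted here.
    """
    payload: dict[str, Any] = {"input": text}
    if tokenizer_model is not None:
        payload["tokenizer_model"] = tokenizer_model
    return payload


def require_exact_receipt(receipt: dict[str, Any], tokenizer_model: str) -> None:
    """Raise ValueError unless the receipt reports exact counts for this tokenizer.

    Hypothetical helper: mirrors the intent of `--require-exact-tokens`,
    not its actual implementation.
    """
    if receipt.get("token_count_exact") is not True:
        raise ValueError("receipt reports estimated token counts")
    if receipt.get("token_count_method") != "tokenizer":
        raise ValueError("receipt was not produced by a tokenizer")
    if receipt.get("tokenizer_model") != tokenizer_model:
        raise ValueError("receipt names a different tokenizer model")
```

A caller would build the payload with `build_compress_payload(prompt, "Qwen/Qwen2.5-7B-Instruct")`, POST it to `/v1/compress`, and pass the returned receipt to `require_exact_receipt` before reporting real-token savings.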

app.py CHANGED

@@ -380,6 +380,13 @@ def _get_tokenizer():
     return AutoTokenizer.from_pretrained(CLASSIFIER_MODEL)
 @lru_cache(maxsize=1)
 def _classifier_manifest() -> dict[str, Any] | None:
     if not CLASSIFIER_ARTIFACT_DIR:
@@ -772,6 +779,29 @@ def _tool_schema_missing_groups(
     return missing
 def _protected_spans(
     text: str,
     protected_values: list[str],
@@ -834,6 +864,7 @@ def _compress_text(payload: dict[str, Any]) -> dict[str, Any]:
     idempotency_key = payload.get("idempotency_key")
     if idempotency_key is not None and not isinstance(idempotency_key, str):
         raise HTTPException(status_code=400, detail="idempotency_key must be a string")
     protected_values = payload.get("protected_spans") or []
     if not isinstance(protected_values, list) or not all(
         isinstance(value, str) for value in protected_values
@@ -907,8 +938,11 @@ def _compress_text(payload: dict[str, Any]) -> dict[str, Any]:
         cursor = end
     chunks.append(text[cursor:])
     output = "".join(chunks)
-    before = max(1, round(len(text) / 4.0))
-    after = max(1, round(len(output) / 4.0))
     saved = max(0, before - after)
     missing = [value for value in protected_values if value and value not in output]
     code_preserved = all(text[start:end] in output for start, end in code_spans)
@@ -983,6 +1017,9 @@ def _compress_text(payload: dict[str, Any]) -> dict[str, Any]:
         "classifier_drop_chars": sum(end - start for start, end in classifier_ranges),
         "dropped_segments_count": len(drops),
         "dropped_segments": dropped_segments,
     }
     receipt["input_sha256"] = _sha256_text(text)
     receipt["output_sha256"] = _sha256_text(output)
@@ -1027,6 +1064,7 @@ def _merge_batch_item_payload(
             "compression_settings": payload.get("compression_settings"),
             "protected_spans": payload.get("protected_spans"),
             "tool_schemas": payload.get("tool_schemas", payload.get("tools")),
             "request_id": payload.get("request_id"),
             "idempotency_key": payload.get("idempotency_key"),
         }
@@ -1058,6 +1096,7 @@ def _merge_batch_item_payload(
             "tool_schemas",
             item.get("tools", payload.get("tool_schemas", payload.get("tools"))),
         ),
         "request_id": item.get("request_id", payload.get("request_id")),
         "idempotency_key": item.get(
             "idempotency_key",
@@ -1180,6 +1219,7 @@ def _handle_messages(payload: dict[str, Any]) -> dict[str, Any]:
     idempotency_key = payload.get("idempotency_key")
     if idempotency_key is not None and not isinstance(idempotency_key, str):
         raise HTTPException(status_code=400, detail="idempotency_key must be a string")
     output_messages: list[dict[str, Any]] = []
     receipts: list[dict[str, Any]] = []
@@ -1220,6 +1260,7 @@ def _handle_messages(payload: dict[str, Any]) -> dict[str, Any]:
             "compression_settings": settings,
             "protected_spans": item_protected,
             "tool_schemas": payload.get("tool_schemas", payload.get("tools")),
             "request_id": request_id,
             "idempotency_key": idempotency_key,
         })

     return AutoTokenizer.from_pretrained(CLASSIFIER_MODEL)
+@lru_cache(maxsize=8)
+def _get_count_tokenizer(model_name: str):
+    from transformers import AutoTokenizer
+    return AutoTokenizer.from_pretrained(model_name)
 @lru_cache(maxsize=1)
 def _classifier_manifest() -> dict[str, Any] | None:
     if not CLASSIFIER_ARTIFACT_DIR:
     return missing
+def _optional_string(payload: dict[str, Any], key: str) -> str | None:
+    value = payload.get(key)
+    if value is None:
+        return None
+    if not isinstance(value, str):
+        raise HTTPException(status_code=400, detail=f"{key} must be a string")
+    return value
+def _count_tokens(text: str, tokenizer_model: str | None) -> tuple[int, bool, str]:
+    if tokenizer_model:
+        try:
+            tokenizer = _get_count_tokenizer(tokenizer_model)
+            return (
+                len(tokenizer.encode(text, add_special_tokens=False)),
+                True,
+                "tokenizer",
+            )
+        except Exception:
+            pass
+    return max(1, round(len(text) / 4.0)), False, "chars_per_token_estimate"
 def _protected_spans(
     text: str,
     protected_values: list[str],
     idempotency_key = payload.get("idempotency_key")
     if idempotency_key is not None and not isinstance(idempotency_key, str):
         raise HTTPException(status_code=400, detail="idempotency_key must be a string")
+    tokenizer_model = _optional_string(payload, "tokenizer_model")
     protected_values = payload.get("protected_spans") or []
     if not isinstance(protected_values, list) or not all(
         isinstance(value, str) for value in protected_values
         cursor = end
     chunks.append(text[cursor:])
     output = "".join(chunks)
+    before, before_exact, token_method = _count_tokens(text, tokenizer_model)
+    after, after_exact, after_method = _count_tokens(output, tokenizer_model)
+    token_count_exact = before_exact and after_exact
+    if after_method != token_method:
+        token_method = "chars_per_token_estimate"
     saved = max(0, before - after)
     missing = [value for value in protected_values if value and value not in output]
     code_preserved = all(text[start:end] in output for start, end in code_spans)
         "classifier_drop_chars": sum(end - start for start, end in classifier_ranges),
         "dropped_segments_count": len(drops),
         "dropped_segments": dropped_segments,
+        "token_count_exact": token_count_exact,
+        "token_count_method": token_method,
+        "tokenizer_model": tokenizer_model,
     }
     receipt["input_sha256"] = _sha256_text(text)
     receipt["output_sha256"] = _sha256_text(output)
             "compression_settings": payload.get("compression_settings"),
             "protected_spans": payload.get("protected_spans"),
             "tool_schemas": payload.get("tool_schemas", payload.get("tools")),
+            "tokenizer_model": payload.get("tokenizer_model"),
             "request_id": payload.get("request_id"),
             "idempotency_key": payload.get("idempotency_key"),
         }
             "tool_schemas",
             item.get("tools", payload.get("tool_schemas", payload.get("tools"))),
         ),
+        "tokenizer_model": item.get("tokenizer_model", payload.get("tokenizer_model")),
         "request_id": item.get("request_id", payload.get("request_id")),
         "idempotency_key": item.get(
             "idempotency_key",
     idempotency_key = payload.get("idempotency_key")
     if idempotency_key is not None and not isinstance(idempotency_key, str):
         raise HTTPException(status_code=400, detail="idempotency_key must be a string")
+    tokenizer_model = _optional_string(payload, "tokenizer_model")
     output_messages: list[dict[str, Any]] = []
     receipts: list[dict[str, Any]] = []
             "compression_settings": settings,
             "protected_spans": item_protected,
             "tool_schemas": payload.get("tool_schemas", payload.get("tools")),
+            "tokenizer_model": tokenizer_model,
             "request_id": request_id,
             "idempotency_key": idempotency_key,
         })
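
The counting logic added in `app.py` can be exercised in isolation. The sketch below restates the fallback pattern from `_count_tokens` and the before/after merge from `_compress_text`, with one deliberate difference: the tokenizer loader is passed in as a parameter (`load`, a hypothetical name) so the example runs without `transformers` or network access. `WhitespaceTokenizer` and `token_receipt` are likewise illustrative stand-ins, not part of the commit.

```python
from __future__ import annotations

from typing import Any, Callable


def count_tokens(
    text: str,
    tokenizer_model: str | None,
    load: Callable[[str], Any],
) -> tuple[int, bool, str]:
    """Return (count, is_exact, method), falling back to a chars/4 estimate."""
    if tokenizer_model:
        try:
            tokenizer = load(tokenizer_model)
            return len(tokenizer.encode(text)), True, "tokenizer"
        except Exception:
            # Any load or encode failure degrades to the estimate,
            # as in the commit's `_count_tokens`.
            pass
    return max(1, round(len(text) / 4.0)), False, "chars_per_token_estimate"


def token_receipt(
    before_text: str,
    after_text: str,
    tokenizer_model: str | None,
    load: Callable[[str], Any],
) -> dict[str, Any]:
    """Combine before/after counts; exact only if both sides were exact."""
    before, before_exact, method = count_tokens(before_text, tokenizer_model, load)
    after, after_exact, after_method = count_tokens(after_text, tokenizer_model, load)
    if after_method != method:
        # Mixed methods are reported as an estimate.
        method = "chars_per_token_estimate"
    return {
        "saved": max(0, before - after),
        "token_count_exact": before_exact and after_exact,
        "token_count_method": method,
        "tokenizer_model": tokenizer_model,
    }


class WhitespaceTokenizer:
    """Stand-in tokenizer for the example: one token per whitespace-separated word."""

    def encode(self, text: str) -> list[int]:
        return list(range(len(text.split())))


def failing_loader(name: str) -> Any:
    """Simulates a tokenizer that cannot be loaded."""
    raise OSError(f"cannot load {name}")
```

With `load=lambda name: WhitespaceTokenizer()` the receipt is marked exact; with `failing_loader` the same call returns an estimated receipt while still echoing the requested `tokenizer_model`, which is the case the benchmark's `--require-exact-tokens` gate is described as rejecting.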