kompress-v2-base

Extractive prompt compressor for LLM proxies. Predicts a keep/drop label per token; the surviving tokens form a compressed version of the input that preserves meaning while reducing token count.

Based on ModernBERT-base (149M params) with a LoRA adapter (3.4M trainable params, 2.2%) plus a custom dual head (token classifier + 1-D span conv). Trained on 126,617 accepted Pipeline A+B labels (compressor + faithfulness judge) drawn from 17 source datasets spanning narrative, dialog, code, agent traces, healthcare, finance, government, scientific, web, summary, and tool-calling domains.

Quick start

import json

import torch
from transformers import AutoTokenizer

# Use the GPU when one is available; otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Option A: load the merged checkpoint as a raw state dict (no LoRA needed)
state = torch.load("merged.pt", map_location="cpu")

# Option B: load via the kompress package
from kompress.model.architecture import HeadroomCompressorV2
from kompress.model.config import V2_BASE

with open("config.json") as f:
    cfg_dict = json.load(f)
cfg = V2_BASE  # or rebuild from cfg_dict
model = HeadroomCompressorV2(cfg)
# strict=False tolerates key mismatches; inspect the returned
# missing_keys / unexpected_keys if the outputs look wrong.
model.load_state_dict(torch.load("merged.pt", map_location="cpu"), strict=False)
model.eval().to(device)

tokenizer = AutoTokenizer.from_pretrained("chopratejas/kompress-v2-base")

# Compress
text = "The quick brown fox jumps over the lazy dog."
enc = tokenizer(text, return_tensors="pt").to(device)
with torch.no_grad():
    out = model(**enc)
scores = out["final_scores"][0]    # P(keep) per subword
keep = scores >= 0.5               # boolean mask at the default threshold
kept_tokens = enc["input_ids"][0][keep]
print(tokenizer.decode(kept_tokens, skip_special_tokens=True))

Threshold tuning

The model emits final_scores ∈ [0, 1] per subword. Adjust the threshold to trade compression aggressiveness for must-keep recall.

Threshold        keep_rate          must_keep_recall   F1      Best for
0.30             0.917 (8% drop)    0.994              0.904   Conservative
0.40             0.867 (13% drop)   0.987              0.913   Safe
0.50 (default)   0.815 (18% drop)   0.974              0.918   Balanced
0.60             0.765 (23% drop)   0.950              0.915   Aggressive
0.70             0.705 (30% drop)   0.908              0.898   Very aggressive

Evaluated on the held-out test split (n=7,037 examples, stratified by domain).
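
As a rough illustration of the trade-off, the sketch below picks the threshold whose keep rate on your own data comes closest to a target. The helpers keep_rate and threshold_for_keep_rate are hypothetical (they are not part of the kompress package), and the scores are made-up numbers standing in for final_scores converted to a Python list.

```python
from typing import Sequence


def keep_rate(scores: Sequence[float], threshold: float) -> float:
    """Fraction of subword tokens whose P(keep) is at or above the threshold."""
    if not scores:
        return 0.0
    return sum(s >= threshold for s in scores) / len(scores)


def threshold_for_keep_rate(
    scores: Sequence[float],
    target: float,
    candidates: Sequence[float] = (0.30, 0.40, 0.50, 0.60, 0.70),
) -> float:
    """Pick the candidate threshold whose keep rate is closest to the target."""
    return min(candidates, key=lambda t: abs(keep_rate(scores, t) - target))


# Made-up scores for ten subword tokens (illustrative only).
scores = [0.95, 0.10, 0.65, 0.45, 0.88, 0.35, 0.72, 0.55, 0.20, 0.99]

for t in (0.30, 0.50, 0.70):
    print(f"threshold={t:.2f} keep_rate={keep_rate(scores, t):.2f}")

print("closest to 70% keep:", threshold_for_keep_rate(scores, 0.70))
```

In practice the candidate list would be swept over a representative sample of your traffic rather than a single input, since keep rates vary by domain.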

Training data

  • 126,617 labeled examples after min_drop_ratio=0.05 filtering and same-conversation packing.
  • Sources: arxiv, pubmed-scientific, govreport, swe-smith, swe-gym-openhands, toolmind, xlam-fc, fineweb-edu, cnn-dailymail, xsum, glaive-fc, lmsys-chat, claude-code-sessions, meetingbank, the-stack-smol-md, samsum, swe-bench-verified.
  • Labeler: DeepSeek-V4-Flash (compressor) + DeepSeek-V4-Pro (judge) with the Pipeline A+B faithfulness loop. A hard-keep overlay enforces names, dates, numbers, URLs, and code identifiers via GLiNER + regex + lexicons.
  • Bucket split: short=48%, mid=31%, long=21% (max_length 8,192, the native ModernBERT context length).
  • Split: train=126,617 / val=7,037 / test=7,037.
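
The hard-keep overlay mentioned above can be pictured with a regex-only sketch. This is an assumption about how such an overlay might work, not the actual labeling code: the patterns, the apply_overlay helper, and the whitespace tokenization are all hypothetical, and the GLiNER and lexicon stages are left out entirely.

```python
import re
from typing import List, Tuple

# Illustrative patterns only; the real overlay also uses GLiNER and lexicons.
HARD_KEEP_PATTERNS = [
    re.compile(r"https?://\S+"),            # URLs
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),   # ISO-style dates
    re.compile(r"\b\d+(?:\.\d+)?\b"),       # integers and decimals
]


def hard_keep_spans(text: str) -> List[Tuple[int, int]]:
    """Return character spans (start, end) that must survive compression."""
    return [m.span() for p in HARD_KEEP_PATTERNS for m in p.finditer(text)]


def apply_overlay(
    text: str, tokens: List[Tuple[int, int]], labels: List[int]
) -> List[int]:
    """Force label 1 (keep) on every token that overlaps a hard-keep span.

    tokens: (start, end) character offsets, one pair per token.
    labels: 0/1 labels from the labeler, where 1 means keep.
    """
    spans = hard_keep_spans(text)
    out = []
    for (ts, te), label in zip(tokens, labels):
        overlaps = any(ts < se and ss < te for ss, se in spans)
        out.append(1 if overlaps else label)
    return out


text = "Please see https://example.com before 2024-05-01 for 3 updates"
# Whitespace tokenization stands in for the real subword tokenizer.
tokens = [m.span() for m in re.finditer(r"\S+", text)]
labels = [0] * len(tokens)  # pretend the labeler dropped every token
forced = apply_overlay(text, tokens, labels)
kept = [text[s:e] for (s, e), k in zip(tokens, forced) if k]
print(kept)
```

The overlay only ever upgrades a label from drop to keep, so it cannot reduce recall on the protected spans.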

Training details

  • Base: ModernBERT-base (149M params)
  • Encoder fine-tuning: LoRA (r=16, alpha=32, target_modules=Wqkv/Wi/Wo)
  • Heads: per-token CE (must-keep loss weight = 3.0) + 1-D span conv (BCE, weight 0.3 on total loss)
  • Trainable params: 3.4M (2.2% of total)
  • Loss: weighted cross-entropy on token head + BCE-with-logits on span head
  • Optim: AdamW (lr=2e-4 cosine, warmup_ratio=0.06, weight_decay=0.01)
  • Effective batch: 48 (12 × 4 grad-accum)
  • Epochs: 3
  • Precision: bf16 with FlashAttention-2 + gradient checkpointing
  • Hardware: 1×H100 80GB, ~39 min wall-clock
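
To make the loss composition concrete, here is a minimal pure-Python sketch of how the two terms might combine. It rests on several assumptions that the list above does not confirm: the must-keep weight of 3.0 is read as a per-token weight on tokens flagged by the hard-keep overlay, the token loss is normalized by the sum of weights, and the span term is added with a factor of 0.3. The actual training code presumably uses PyTorch loss modules; the function names here are hypothetical.

```python
import math
from typing import Sequence, Tuple

MUST_KEEP_WEIGHT = 3.0   # from the training details above
SPAN_LOSS_WEIGHT = 0.3   # from the training details above


def token_ce(
    logits: Sequence[Tuple[float, float]],
    labels: Sequence[int],
    must_keep: Sequence[bool],
) -> float:
    """Weighted 2-class cross-entropy over tokens.

    logits: per-token (drop_logit, keep_logit); labels: 0 = drop, 1 = keep;
    must_keep: True where the token is assumed to be a hard-keep token.
    """
    total, weight_sum = 0.0, 0.0
    for (drop_logit, keep_logit), y, mk in zip(logits, labels, must_keep):
        # Log-sum-exp, stabilised by subtracting the larger logit.
        m = max(drop_logit, keep_logit)
        lse = m + math.log(math.exp(drop_logit - m) + math.exp(keep_logit - m))
        nll = lse - (keep_logit if y == 1 else drop_logit)
        w = MUST_KEEP_WEIGHT if mk else 1.0
        total += w * nll
        weight_sum += w
    return total / weight_sum


def span_bce_with_logits(
    logits: Sequence[float], targets: Sequence[float]
) -> float:
    """Mean binary cross-entropy computed directly from logits."""
    losses = [
        max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
        for z, t in zip(logits, targets)
    ]
    return sum(losses) / len(losses)


def total_loss(tok_logits, labels, must_keep, span_logits, span_targets) -> float:
    """Token-head loss plus the down-weighted span-head loss."""
    return token_ce(tok_logits, labels, must_keep) + (
        SPAN_LOSS_WEIGHT * span_bce_with_logits(span_logits, span_targets)
    )


# Two tokens, both labeled keep; the first is predicted wrongly.
tok_logits = [(2.0, -2.0), (-2.0, 2.0)]
print(total_loss(tok_logits, [1, 1], [True, False], [0.0, 0.0], [1.0, 0.0]))
```

Under this reading, a mistake on a must-keep token costs three times as much as the same mistake elsewhere, which is consistent with the high must-keep recall reported for the model.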

Final metrics (test split, threshold=0.5)

  • eval_f1: 0.918
  • eval_must_keep_recall: 0.974
  • eval_keep_rate: 0.815 (18% compression)
  • eval_loss: 0.34
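
For readers who want to reproduce comparable numbers on their own data, the sketch below shows one plausible way to compute the three headline metrics from token-level labels. The definitions are assumptions, since the card does not spell them out: F1 is taken on the keep class, must-keep recall is the share of hard-keep tokens that were predicted keep, and keep rate is the share of tokens predicted keep. The compression_metrics helper is hypothetical.

```python
from typing import Dict, Sequence


def compression_metrics(
    pred: Sequence[int], gold: Sequence[int], must_keep: Sequence[int]
) -> Dict[str, float]:
    """Token-level metrics; 1 = keep, 0 = drop in all three sequences."""
    tp = sum(p == 1 and g == 1 for p, g in zip(pred, gold))
    fp = sum(p == 1 and g == 0 for p, g in zip(pred, gold))
    fn = sum(p == 0 and g == 1 for p, g in zip(pred, gold))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    mk_total = sum(must_keep)
    mk_hit = sum(p == 1 and mk == 1 for p, mk in zip(pred, must_keep))

    return {
        "f1": f1,
        # With no must-keep tokens, recall is vacuously perfect.
        "must_keep_recall": mk_hit / mk_total if mk_total else 1.0,
        "keep_rate": sum(pred) / len(pred) if pred else 0.0,
    }


# Made-up labels for eight tokens (illustrative only).
pred      = [1, 1, 0, 1, 0, 1, 1, 0]
gold      = [1, 1, 0, 0, 0, 1, 1, 1]
must_keep = [1, 0, 0, 0, 0, 1, 0, 0]
metrics = compression_metrics(pred, gold, must_keep)
print(metrics)
```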

Files in this repo

config.json           # KompressV2Config + arch metadata
model.safetensors     # ~600 MB, best checkpoint, LoRA merged into the encoder
merged.pt             # ~600 MB, full state dict, alias for safetensors load
tokenizer.json        # ModernBERT-base tokenizer
tokenizer_config.json
special_tokens_map.json
adapter/              # LoRA adapter ONLY (~30 MB), for stacking per-org adapters
  adapter_config.json
  adapter_model.safetensors
  token_head.pt
  span_conv.pt
README.md             # this file
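
Before downloading into memory, it can be handy to check which tensors and dtypes a .safetensors file holds. The sketch below reads only the file header using the standard library, relying on the published safetensors layout (an 8-byte little-endian length followed by a JSON header). The helper names are hypothetical, and the file path assumes you have already downloaded model.safetensors into the working directory.

```python
import json
import os
import struct
from typing import Dict


def read_safetensors_header(path: str) -> Dict[str, dict]:
    """Return the JSON header of a .safetensors file without loading tensors."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64
        header = json.loads(f.read(header_len).decode("utf-8"))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header


def summarize(header: Dict[str, dict]) -> Dict[str, int]:
    """Count tensors per dtype string as stored in the header."""
    counts: Dict[str, int] = {}
    for info in header.values():
        counts[info["dtype"]] = counts.get(info["dtype"], 0) + 1
    return counts


# Only runs if the checkpoint is present locally.
if os.path.exists("model.safetensors"):
    header = read_safetensors_header("model.safetensors")
    print(len(header), "tensors;", summarize(header))
```

The same helper works on adapter/adapter_model.safetensors, which is a quick way to confirm that the adapter file contains only LoRA weights.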

License

Apache 2.0. Free for commercial use. ModernBERT base is also Apache 2.0.
