kompress-v2-base
Extractive prompt compressor for LLM proxies. Predicts a keep/drop label per token; the surviving tokens form a compressed version of the input that preserves meaning while reducing token count.
Based on ModernBERT-base (149M params) with a LoRA adapter (3.4M trainable params, 2.2%) plus a custom dual head (token classifier + 1-D span conv). Trained on 126,617 accepted Pipeline A+B labels (compressor + faithfulness judge) across 17 domains, including narrative, dialog, code, agent traces, healthcare, finance, government, scientific, web, summary, and tool-calling.
Quick start
import json

import torch
from transformers import AutoTokenizer

# Use the GPU when one is available; otherwise run on CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Option A: load the merged checkpoint as a raw state dict (no LoRA needed)
state = torch.load("merged.pt", map_location="cpu")

# Option B: build the model via the kompress package and load the weights
from kompress.model.architecture import HeadroomCompressorV2
from kompress.model.config import V2_BASE

with open("config.json") as f:
    cfg_dict = json.load(f)
cfg = V2_BASE  # or rebuild from cfg_dict
model = HeadroomCompressorV2(cfg)
# strict=False tolerates key mismatches between the checkpoint and the module
model.load_state_dict(torch.load("merged.pt", map_location="cpu"), strict=False)
model.eval().to(device)
tokenizer = AutoTokenizer.from_pretrained("chopratejas/kompress-v2-base")

# Compress
text = "The quick brown fox jumps over the lazy dog."
enc = tokenizer(text, return_tensors="pt").to(device)
with torch.no_grad():
    out = model(**enc)
scores = out["final_scores"][0]  # P(keep) per subword
keep = scores >= 0.5             # boolean mask at the default threshold
kept_tokens = enc["input_ids"][0][keep]
print(tokenizer.decode(kept_tokens, skip_special_tokens=True))
Threshold tuning
The model emits final_scores ∈ [0, 1] per subword. Adjust the threshold to
trade compression aggressiveness for must-keep recall.
| Threshold | keep_rate | must_keep_recall | F1 | best for |
|---|---|---|---|---|
| 0.30 | 0.917 (8% drop) | 0.994 | 0.904 | Conservative |
| 0.40 | 0.867 (13% drop) | 0.987 | 0.913 | Safe |
| 0.50 (default) | 0.815 (18% drop) | 0.974 | 0.918 | Balanced |
| 0.60 | 0.765 (23% drop) | 0.950 | 0.915 | Aggressive |
| 0.70 | 0.705 (30% drop) | 0.908 | 0.898 | Very aggressive |
Evaluated on the held-out test split (n=7,037 examples, stratified by domain).
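To get a feel for how the threshold moves keep_rate and must_keep_recall, the sweep below recomputes both on a handful of made-up scores. It is a minimal sketch: the helper name `sweep_thresholds`, the `must_keep` flags, and the toy values are assumptions for illustration, not part of the kompress package or its evaluation code.

```python
def sweep_thresholds(scores, must_keep, thresholds=(0.3, 0.4, 0.5, 0.6, 0.7)):
    """Return (threshold, keep_rate, must_keep_recall) rows for one example.

    scores:    per-subword P(keep) values in [0, 1]
    must_keep: parallel booleans, True where a token has to survive
               (hypothetical gold labels for this illustration)
    """
    if not scores:
        return []
    total_must_keep = sum(must_keep)
    rows = []
    for t in thresholds:
        kept = [s >= t for s in scores]
        keep_rate = sum(kept) / len(scores)
        hits = sum(1 for k, m in zip(kept, must_keep) if k and m)
        # With no must-keep tokens, recall is trivially perfect.
        recall = hits / total_must_keep if total_must_keep else 1.0
        rows.append((t, keep_rate, recall))
    return rows


# Toy values, not model output.
scores = [0.95, 0.20, 0.65, 0.45, 0.80, 0.35, 0.55, 0.10]
must_keep = [True, False, True, True, True, False, False, False]
for t, rate, recall in sweep_thresholds(scores, must_keep):
    print(f"threshold={t:.2f}  keep_rate={rate:.3f}  must_keep_recall={recall:.3f}")
```

On real data, `scores` would come from `out["final_scores"]` and the sweep would be averaged over a validation set before picking a threshold.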
Training data
- 126,617 labeled examples after min_drop_ratio=0.05 filtering and same-conversation packing.
- Sources: arxiv, pubmed-scientific, govreport, swe-smith, swe-gym-openhands, toolmind, xlam-fc, fineweb-edu, cnn-dailymail, xsum, glaive-fc, lmsys-chat, claude-code-sessions, meetingbank, the-stack-smol-md, samsum, swe-bench-verified.
- Labeler: DeepSeek-V4-Flash (compressor) + DeepSeek-V4-Pro (judge) with Pipeline A + B faithfulness loop. Hard-keep overlay enforces names, dates, numbers, URLs, code identifiers via GLiNER + regex + lexicons.
- Bucket split: short=48%, mid=31%, long=21% (max_length 8,192 native ModernBERT context).
- Split: train=126,617 / val=7,037 / test=7,037.
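The hard-keep overlay mentioned above can be pictured as a post-processing pass that flips labels to "keep" whenever a token looks like a number, date, URL, or identifier. The sketch below covers only a regex portion, and the patterns, the `apply_hard_keep` name, and the word-level granularity are all assumptions; the actual labeling pipeline also uses GLiNER and lexicons, which are not reproduced here.

```python
import re

# Illustrative patterns only (hypothetical); the real overlay is broader.
HARD_KEEP_PATTERNS = [
    re.compile(r"https?://\S+"),              # URLs
    re.compile(r"\d"),                        # anything containing a digit (numbers, dates)
    re.compile(r"[A-Za-z]+_[A-Za-z0-9_]+"),   # snake_case code identifiers
]


def apply_hard_keep(words, keep_labels):
    """Force keep=True for every word that matches a hard-keep pattern."""
    return [
        keep or any(p.search(w) for p in HARD_KEEP_PATTERNS)
        for w, keep in zip(words, keep_labels)
    ]


words = "Call load_state_dict before 2024-05-01 , see https://example.com for details".split()
labels = [False] * len(words)  # pretend the compressor dropped everything
forced = apply_hard_keep(words, labels)
print([w for w, k in zip(words, forced) if k])
# prints ['load_state_dict', '2024-05-01', 'https://example.com']
```

Labels that are already "keep" pass through unchanged, so the overlay can only raise the keep rate, never lower it.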
Training details
- Base: ModernBERT-base (149M params)
- Encoder fine-tuning: LoRA (r=16, alpha=32, target_modules=Wqkv/Wi/Wo)
- Heads: per-token CE (must-keep loss weight = 3.0) + 1-D span conv (BCE, weight 0.3 on total loss)
- Trainable params: 3.4M (2.2% of total)
- Loss: weighted cross-entropy on token head + BCE-with-logits on span head
- Optim: AdamW (lr=2e-4 cosine, warmup_ratio=0.06, weight_decay=0.01)
- Effective batch: 48 (12 × 4 grad-accum)
- Epochs: 3
- Precision: bf16 with FlashAttention-2 + gradient checkpointing
- Hardware: 1×H100 80GB, ~39 min wall-clock
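As a rough illustration of how the two loss terms listed above might combine, here is a dependency-free sketch. It assumes the must-keep weight of 3.0 is applied per token, that the token loss is normalized by the sum of weights, and that the span BCE term is scaled by 0.3 before being added; the card does not spell out these details, and the function names are hypothetical rather than taken from the training code.

```python
import math

MUST_KEEP_WEIGHT = 3.0  # from the card: must-keep loss weight
SPAN_WEIGHT = 0.3       # from the card: span-head weight (assumed to scale the BCE term)


def token_loss(p_keep, labels, must_keep):
    """Weighted cross-entropy over tokens, given P(keep) per token.

    Two-class CE on a keep/drop head reduces to binary CE on P(keep).
    """
    total, weight_sum = 0.0, 0.0
    for p, y, m in zip(p_keep, labels, must_keep):
        w = MUST_KEEP_WEIGHT if m else 1.0
        total += -w * math.log(p if y else 1.0 - p)
        weight_sum += w
    return total / weight_sum


def span_loss(span_logits, span_labels):
    """Mean BCE-with-logits in the numerically stable form."""
    terms = [
        max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
        for z, y in zip(span_logits, span_labels)
    ]
    return sum(terms) / len(terms)


def total_loss(p_keep, labels, must_keep, span_logits, span_labels):
    """Token loss plus the down-weighted span loss."""
    return token_loss(p_keep, labels, must_keep) + SPAN_WEIGHT * span_loss(span_logits, span_labels)


# Toy values, not training data.
print(total_loss([0.9, 0.2, 0.7], [1, 0, 1], [True, False, False],
                 [2.0, -1.0, 0.5], [1, 0, 1]))
```

Weighting must-keep tokens more heavily penalizes dropping them more than it penalizes keeping a droppable token, which is consistent with the high must_keep_recall figures reported for this model.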
Final metrics (test split, threshold=0.5)
- eval_f1: 0.918
- eval_must_keep_recall: 0.974
- eval_keep_rate: 0.815 (18% compression)
- eval_loss: 0.34
Files in this repo
config.json # KompressV2Config + arch metadata
model.safetensors # ~600 MB, best checkpoint, LoRA merged into the encoder
merged.pt # ~600 MB, full state dict, alias for safetensors load
tokenizer.json # ModernBERT-base tokenizer
tokenizer_config.json
special_tokens_map.json
adapter/ # LoRA adapter ONLY (~30 MB), for stacking per-org adapters
adapter_config.json
adapter_model.safetensors
token_head.pt
span_conv.pt
README.md # this file
License
Apache 2.0. Free for commercial use. ModernBERT base is also Apache 2.0.
See also
- chopratejas/kompress-v2-large: larger variant (ModernBERT-large, 395M params, private/enterprise)
- Headroom proxy integration guide: docs/CUSTOMER_QUICKSTART.md
- Per-org fine-tuning (LoRA stacking): docs/DEPLOYMENT_STRATEGY.md
Model tree for chopratejas/kompress-v2-base
Base model
answerdotai/ModernBERT-base