---
license: mit
base_model: FacebookAI/roberta-base
tags:
- roberta
- text-classification
- regression
- inference-routing
- llm-scheduling
- vllm
- odyn
- block
- generated_from_trainer
library_name: transformers
pipeline_tag: text-classification
---

block-length-tagger

A RoBERTa-base regression model that predicts the output token count of an LLM response from the input prompt. It was trained as part of the Odyn Network Phase 3 inference scheduler and implements the query-length tagger described in Block: Balancing Load in LLM Serving with Context, Knowledge and Predictive Scheduling (arXiv 2508.03611, §4.3).

Model Description

The scheduler needs to know, before generation starts, how long a response will be, so that it can route long-response requests to less-loaded instances before KV-cache pressure builds. This model replaces the 7B prompt-based LLM used as the baseline in the Block paper (Table 1: 24.4% average error, 77.15% Acc-100) with a lightweight 124M-parameter RoBERTa-base CrossEncoder that runs in milliseconds.

Architecture: FacebookAI/roberta-base fine-tuned with a single regression head via HuggingFace Trainer and sentence-transformers CrossEncoder.

Label normalisation: Response lengths are log1p-normalised (log1p(tokens) / log1p(2048)) before training to handle the heavy right tail of the ShareGPT response-length distribution. At inference time, predictions are clamped to [0, 1] before the inverse transform (expm1(score × log1p(2048))), so that out-of-range logits do not inflate the error metrics.
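The transform and its inverse can be sketched as below. This is a minimal illustration rather than the training code: the function names are hypothetical, and clipping raw counts to the 2048-token cap before normalising is an assumption based on the "Max token cap (label)" entry in the training table.

```python
import math

MAX_TOKEN_CAP = 2048                 # label cap listed in the training table
LOG_MAX = math.log1p(MAX_TOKEN_CAP)  # normalisation constant log1p(2048)


def normalise_length(tokens: int) -> float:
    """Map a response token count to [0, 1] via log1p(tokens) / log1p(cap)."""
    # Assumption: counts outside [0, cap] are clipped before normalising.
    capped = min(max(tokens, 0), MAX_TOKEN_CAP)
    return math.log1p(capped) / LOG_MAX


def denormalise_score(score: float) -> float:
    """Clamp a model output to [0, 1], then invert the transform with expm1."""
    clamped = min(max(score, 0.0), 1.0)
    return math.expm1(clamped * LOG_MAX)
```

Because of the clamp, any raw output above 1 maps to the 2048-token cap and any output below 0 maps to zero tokens.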

Training

| Parameter | Value |
|---|---|
| Base model | `FacebookAI/roberta-base` |
| Dataset | `Aeala/ShareGPT_Vicuna_unfiltered` |
| Train / eval split | 96,540 / 24,135 examples |
| Max input length | 512 tokens |
| Max token cap (label) | 2048 tokens |
| Batch size | 16 |
| Learning rate | 2e-5 |
| Epochs | 3 |
| Warmup steps | 500 |
| Weight decay | 0.01 |
| Hardware | NVIDIA GB10 Blackwell (DGX Spark), CUDA 13 |
| Precision | fp32 |

Training data: the first human→gpt turn is extracted from each ShareGPT conversation. Examples are shuffled with seed 42 before splitting into train and eval sets.
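The extraction and split might look like the sketch below. It is not the original preprocessing script: the function names are hypothetical, the `conversations` / `from` / `value` field names assume the common ShareGPT JSON layout, and the 0.2 eval fraction is inferred from the 96,540 / 24,135 split in the table rather than stated by the source.

```python
import random


def first_turn_pairs(records):
    """Yield (prompt, response) for the first human→gpt exchange of each record."""
    for record in records:
        turns = record.get("conversations", [])
        # Scan adjacent turns and stop at the first human message answered by gpt.
        for current, following in zip(turns, turns[1:]):
            if current.get("from") == "human" and following.get("from") == "gpt":
                yield current.get("value", ""), following.get("value", "")
                break


def shuffle_and_split(pairs, eval_fraction=0.2, seed=42):
    """Shuffle deterministically, then return (train, eval) lists."""
    items = list(pairs)
    random.Random(seed).shuffle(items)  # seed 42, as stated in the card
    n_eval = int(len(items) * eval_fraction)
    return items[n_eval:], items[:n_eval]
```

Conversations with no human message followed by a gpt reply are skipped. The regression label would then be the token count of each response, computed with whichever tokenizer the training run used (not specified here).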

Evaluation Metrics

Metrics follow Block paper Table 1 definitions:

| Metric | Description | Block baseline |
|---|---|---|
| `eval/mae` | Mean absolute token-count error | 78.8 tokens |
| `eval/error_rate` | `\|pred − actual\| / actual` (lower is better) | 24.4% |
| `eval/acc_50` | Fraction with error < 50 tokens | 69.93% |
| `eval/acc_100` | Fraction with error < 100 tokens | 77.15% |

Best checkpoint selected by eval/mae (lower is better).
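For reference, the four metrics can be computed from de-normalised predictions roughly as follows. This is a sketch of the definitions listed above, not the evaluation code used for this model: the function name is hypothetical, and averaging the relative error over examples while skipping zero-length references is an assumption.

```python
def length_metrics(preds, actuals):
    """Compute MAE, mean relative error, Acc-50 and Acc-100 on token counts."""
    if not preds or len(preds) != len(actuals):
        raise ValueError("preds and actuals must be non-empty and equally long")

    abs_errors = [abs(p - a) for p, a in zip(preds, actuals)]
    # Assumption: references of length 0 are skipped to avoid dividing by zero.
    rel_errors = [e / a for e, a in zip(abs_errors, actuals) if a > 0]
    n = len(abs_errors)

    return {
        "mae": sum(abs_errors) / n,
        "error_rate": sum(rel_errors) / len(rel_errors) if rel_errors else 0.0,
        "acc_50": sum(e < 50 for e in abs_errors) / n,    # error under 50 tokens
        "acc_100": sum(e < 100 for e in abs_errors) / n,  # error under 100 tokens
    }
```

The inputs are token counts, so normalised model outputs need the inverse transform applied first.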

Usage

import math

from transformers import pipeline

# Load the tagger; the pipeline returns one score per input prompt.
pipe = pipeline("text-classification", model="michael-sigamani-odyn/block-length-tagger")
result = pipe("Explain the difference between TCP and UDP in detail.")

# Clamp the normalised score to [0, 1], then invert log1p(tokens) / log1p(2048).
MAX_TOKEN_CAP = 2048
LOG_MAX = math.log1p(MAX_TOKEN_CAP)
score = min(max(result[0]["score"], 0.0), 1.0)
predicted_tokens = math.expm1(score * LOG_MAX)
print(f"Predicted response length: {predicted_tokens:.0f} tokens")

Or using sentence-transformers:

import math

from sentence_transformers import CrossEncoder

# Load the same checkpoint as a single-output CrossEncoder.
model = CrossEncoder("michael-sigamani-odyn/block-length-tagger", num_labels=1, max_length=512)
score = model.predict("Explain the difference between TCP and UDP in detail.")

# Clamp the normalised score to [0, 1], then invert log1p(tokens) / log1p(2048).
MAX_TOKEN_CAP = 2048
LOG_MAX = math.log1p(MAX_TOKEN_CAP)
clamped = min(max(float(score), 0.0), 1.0)
predicted_tokens = math.expm1(clamped * LOG_MAX)
print(f"Predicted response length: {predicted_tokens:.0f} tokens")

Integration with Odyn Scheduler

In the Odyn distributed inference stack, this model runs on the head node to tag each incoming request with a predicted output length before it is dispatched to a vLLM worker. Requests predicted to generate long responses are preferentially routed to workers with available KV-cache capacity, reducing queue latency at high concurrency.

Client → Head Node
           ├─ block-length-tagger (predict output tokens)
           └─ Route to worker with capacity → vLLM Engine
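To make the routing step concrete, here is a deliberately simple greedy policy. It is not the Odyn scheduler's actual algorithm: the `Worker` class, the `free_kv_tokens` field and the `route` function are hypothetical names, and estimating KV-cache demand as prompt tokens plus predicted output tokens is an illustrative assumption.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    """Hypothetical view of a vLLM worker as seen from the head node."""
    name: str
    free_kv_tokens: int  # assumed estimate of remaining KV-cache capacity, in tokens


def route(prompt_tokens: int, predicted_output_tokens: int, workers: list) -> Worker:
    """Pick the worker with the most free KV-cache that can fit the request."""
    if not workers:
        raise ValueError("no workers available")

    # Assumed demand estimate: prompt plus predicted response length.
    needed = prompt_tokens + predicted_output_tokens

    # Prefer workers that can hold the whole request; otherwise fall back to all.
    fitting = [w for w in workers if w.free_kv_tokens >= needed]
    candidates = fitting or workers
    chosen = max(candidates, key=lambda w: w.free_kv_tokens)

    # Reserve the estimated capacity so later requests see the updated load.
    chosen.free_kv_tokens = max(chosen.free_kv_tokens - needed, 0)
    return chosen
```

With workers holding 1,000 and 5,000 free tokens, a request with a 200-token prompt and a predicted 900-token response goes to the second worker, because only it can hold the estimated 1,100 tokens.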

Citation

@article{block2025,
  title   = {Block: Balancing Load in LLM Serving with Context, Knowledge and Predictive Scheduling},
  journal = {arXiv preprint arXiv:2508.03611},
  year    = {2025},
  note    = {§4.3 Query-Length Tagger}
}



Paper: Block: Balancing Load in LLM Serving with Context, Knowledge and Predictive Scheduling (arXiv 2508.03611, published Aug 5, 2025)