YAML Metadata Warning:empty or missing yaml metadata in repo card

Check out the documentation for more information.


license: mit base_model: FacebookAI/roberta-base tags: - roberta - text-classification - regression - inference-routing - llm-scheduling - vllm - odyn - block - generated_from_trainer library_name: transformers pipeline_tag: text-classification

block-length-tagger

A RoBERTa-base regression model that predicts the output token count of an LLM response given an input prompt. Trained as part of the Odyn Network Phase 3 inference scheduler, implementing the query-length tagger described in Block: Scalable LLM Inference (arXiv 2508.03611, §4.3).

Model Description

The scheduler needs to know before generation how long a response will be, so it can route long-response requests to less-loaded instances before KV-cache pressure builds. This model replaces the 7B prompt-based LLM used as baseline in the Block paper (Table 1: 24.4% avg error, 77.15% Acc-100) with a lightweight 124M-parameter RoBERTa-base CrossEncoder that runs in milliseconds.

Architecture: FacebookAI/roberta-base fine-tuned with a single regression head via HuggingFace Trainer and sentence-transformers CrossEncoder.

Label normalisation: Response lengths are log1p-normalised (log1p(tokens) / log1p(2048)) before training to handle the heavy right-tail of ShareGPT response distributions. Predictions are clamped to [0, 1] before the inverse transform (expm1) to prevent out-of-range logits inflating error metrics.

Training

Parameter Value
Base model FacebookAI/roberta-base
Dataset Aeala/ShareGPT_Vicuna_unfiltered
Train / eval split 96,540 / 24,135 examples
Max input length 512 tokens
Max token cap (label) 2048 tokens
Batch size 16
Learning rate 2e-5
Epochs 3
Warmup steps 500
Weight decay 0.01
Hardware NVIDIA GB10 Blackwell (DGX Spark), CUDA 13
Precision fp32

Training data: first human→gpt turn extracted from each ShareGPT conversation. Shuffled with seed 42 before splitting.

Evaluation Metrics

Metrics follow Block paper Table 1 definitions:

Metric Description Block baseline
eval/mae Mean absolute token-count error 78.8 tokens
eval/error_rate |pred − actual| / actual (lower is better) 24.4%
eval/acc_50 Fraction with error < 50 tokens 69.93%
eval/acc_100 Fraction with error < 100 tokens 77.15%

Best checkpoint selected by eval/mae (lower is better).

Usage

from transformers import pipeline
import math

pipe = pipeline("text-classification", model="michael-sigamani-odyn/block-length-tagger")
result = pipe("Explain the difference between TCP and UDP in detail.")

MAX_TOKEN_CAP, LOG_MAX = 2048, math.log1p(2048)
predicted_tokens = math.expm1(min(max(result[0]["score"], 0.0), 1.0) * LOG_MAX)
print(f"Predicted response length: {predicted_tokens:.0f} tokens")

Or using sentence-transformers:

from sentence_transformers import CrossEncoder
import math

model = CrossEncoder("michael-sigamani-odyn/block-length-tagger", num_labels=1, max_length=512)
score = model.predict("Explain the difference between TCP and UDP in detail.")

MAX_TOKEN_CAP, LOG_MAX = 2048, math.log1p(2048)
predicted_tokens = math.expm1(min(max(float(score), 0.0), 1.0) * LOG_MAX)
print(f"Predicted response length: {predicted_tokens:.0f} tokens")

Integration with Odyn Scheduler

In the Odyn distributed inference stack, this model runs on the head node to tag each incoming request with a predicted output length before it is dispatched to a vLLM worker. Requests
predicted to generate long responses are preferentially routed to workers with available KV-cache capacity, reducing queue latency at high concurrency.

Client → Head Node
           ├─ block-length-tagger (predict output tokens)
           └─ Route to worker with capacity → vLLM Engine

Citation

@article{block2025,
  title   = {Block: Scalable LLM Inference},
  journal = {arXiv preprint arXiv:2508.03611},
  year    = {2025},
  note    = {§4.3 Query-Length Tagger}
}

</details>
Downloads last month
-
Safetensors
Model size
0.1B params
Tensor type
F32
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Paper for michael-sigamani-odyn/block-length-tagger