---
license: mit
base_model: FacebookAI/roberta-base
tags:
- roberta
- text-classification
- regression
- inference-routing
- llm-scheduling
- vllm
- odyn
- block
- generated_from_trainer
library_name: transformers
pipeline_tag: text-classification
---
block-length-tagger
A RoBERTa-base regression model that predicts the output token count of an LLM response given an input prompt. Trained as part of the Odyn Network Phase 3 inference scheduler, implementing the query-length tagger described in Block: Scalable LLM Inference (arXiv 2508.03611, §4.3).
Model Description
The scheduler needs to know how long a response will be before generation starts, so that it can route long-response requests to less-loaded instances before KV-cache pressure builds. This model replaces the 7B prompt-based LLM used as the baseline in the Block paper (Table 1: 24.4% avg error, 77.15% Acc-100) with a lightweight 124M-parameter RoBERTa-base CrossEncoder that runs in milliseconds.
Architecture: FacebookAI/roberta-base fine-tuned with a single regression head via HuggingFace Trainer and sentence-transformers CrossEncoder.
Label normalisation: Response lengths are log1p-normalised, log1p(tokens) / log1p(2048), before training to handle the heavy right tail of the ShareGPT response-length distribution.
Predictions are clamped to [0, 1] before the inverse transform (expm1) so that out-of-range outputs do not inflate the error metrics.
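The label transform and its inverse can be sketched as below. This is a minimal illustration built from the formula above, not the training code itself; the function names are hypothetical, and capping raw counts at 2048 before normalising is an assumption based on the "Max token cap (label)" setting.

```python
import math

MAX_TOKEN_CAP = 2048                 # label cap from the training configuration
LOG_MAX = math.log1p(MAX_TOKEN_CAP)  # normalising constant


def normalise_length(tokens: int) -> float:
    """Map a raw token count to a label in [0, 1] (hypothetical helper)."""
    # Assumption: counts are capped at MAX_TOKEN_CAP before normalising.
    capped = min(max(tokens, 0), MAX_TOKEN_CAP)
    return math.log1p(capped) / LOG_MAX


def denormalise_score(score: float) -> float:
    """Clamp a model output to [0, 1], then invert the log1p transform."""
    clamped = min(max(score, 0.0), 1.0)
    return math.expm1(clamped * LOG_MAX)
```

Because of the clamp, any raw output above 1 maps to the 2048-token cap and any output below 0 maps to 0 tokens.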
Training
| Parameter | Value |
|---|---|
| Base model | FacebookAI/roberta-base |
| Dataset | Aeala/ShareGPT_Vicuna_unfiltered |
| Train / eval split | 96,540 / 24,135 examples |
| Max input length | 512 tokens |
| Max token cap (label) | 2048 tokens |
| Batch size | 16 |
| Learning rate | 2e-5 |
| Epochs | 3 |
| Warmup steps | 500 |
| Weight decay | 0.01 |
| Hardware | NVIDIA GB10 Blackwell (DGX Spark), CUDA 13 |
| Precision | fp32 |
Training data: first human→gpt turn extracted from each ShareGPT conversation. Shuffled with seed 42 before splitting.
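A rough sketch of this preparation step follows. It assumes the usual ShareGPT JSON layout (a `conversations` list of `{"from": ..., "value": ...}` turns) and an 80/20 split, which is consistent with the 96,540 / 24,135 counts above but is not stated explicitly; `first_turn_pairs` and `shuffle_and_split` are hypothetical names, not functions from the training script.

```python
import random
from typing import Iterable


def first_turn_pairs(records: Iterable[dict]) -> list[tuple[str, str]]:
    """Collect the first human→gpt exchange from each conversation."""
    pairs = []
    for record in records:
        turns = record.get("conversations", [])
        # Scan adjacent turns and keep only the first human→gpt pair.
        for prev, nxt in zip(turns, turns[1:]):
            if prev.get("from") == "human" and nxt.get("from") == "gpt":
                pairs.append((prev["value"], nxt["value"]))
                break
    return pairs


def shuffle_and_split(pairs, eval_fraction=0.2, seed=42):
    """Shuffle with a fixed seed, then split into (train, eval)."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_fraction)  # assumed 80/20 split
    return shuffled[n_eval:], shuffled[:n_eval]
```

The prompt of each pair is the model input; the token count of the response, measured with whichever tokenizer the scheduler targets, becomes the regression label.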
Evaluation Metrics
Metrics follow Block paper Table 1 definitions:
| Metric | Description | Block baseline |
|---|---|---|
| eval/mae | Mean absolute token-count error | 78.8 tokens |
| eval/error_rate | \|pred − actual\| / actual (lower is better) | 24.4% |
| eval/acc_50 | Fraction with error < 50 tokens | 69.93% |
| eval/acc_100 | Fraction with error < 100 tokens | 77.15% |
Best checkpoint selected by eval/mae (lower is better).
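For reference, the four metrics can be computed from de-normalised predictions roughly as follows. This is a sketch under assumptions: `length_metrics` is a hypothetical helper, the error rate is taken as the mean of per-example relative errors, and examples with a zero-token reference are skipped for that metric; the exact aggregation used in the Block paper may differ.

```python
def length_metrics(preds: list[float], actuals: list[float]) -> dict[str, float]:
    """Compute MAE, mean relative error, and Acc-50 / Acc-100 in token units."""
    if not preds or len(preds) != len(actuals):
        raise ValueError("preds and actuals must be non-empty and equal in length")
    abs_errors = [abs(p - a) for p, a in zip(preds, actuals)]
    n = len(abs_errors)
    # Assumption: relative error is averaged per example, skipping actual == 0.
    rel_errors = [e / a for e, a in zip(abs_errors, actuals) if a > 0]
    return {
        "mae": sum(abs_errors) / n,
        "error_rate": sum(rel_errors) / len(rel_errors) if rel_errors else 0.0,
        "acc_50": sum(e < 50 for e in abs_errors) / n,
        "acc_100": sum(e < 100 for e in abs_errors) / n,
    }
```

Both accuracy thresholds are strict, matching the "error < 50" and "error < 100" definitions in the table.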
Usage
```python
import math

from transformers import pipeline

MAX_TOKEN_CAP = 2048                 # label cap used during training
LOG_MAX = math.log1p(MAX_TOKEN_CAP)  # normalising constant for the log1p labels

pipe = pipeline("text-classification", model="michael-sigamani-odyn/block-length-tagger")
result = pipe("Explain the difference between TCP and UDP in detail.")

# Clamp the normalised score to [0, 1], then invert the log1p transform.
score = min(max(result[0]["score"], 0.0), 1.0)
predicted_tokens = math.expm1(score * LOG_MAX)
print(f"Predicted response length: {predicted_tokens:.0f} tokens")
```
Or using sentence-transformers:
```python
import math

from sentence_transformers import CrossEncoder

MAX_TOKEN_CAP = 2048                 # label cap used during training
LOG_MAX = math.log1p(MAX_TOKEN_CAP)  # normalising constant for the log1p labels

model = CrossEncoder("michael-sigamani-odyn/block-length-tagger", num_labels=1, max_length=512)
score = model.predict("Explain the difference between TCP and UDP in detail.")

# Clamp the normalised score to [0, 1], then invert the log1p transform.
clamped = min(max(float(score), 0.0), 1.0)
predicted_tokens = math.expm1(clamped * LOG_MAX)
print(f"Predicted response length: {predicted_tokens:.0f} tokens")
```
Integration with Odyn Scheduler
In the Odyn distributed inference stack, this model runs on the head node to tag each incoming request with a predicted output length before it is dispatched to a vLLM worker. Requests predicted to generate long responses are preferentially routed to workers with available KV-cache capacity, reducing queue latency at high concurrency.
```
Client → Head Node
           ├─ block-length-tagger (predict output tokens)
           └─ Route to worker with capacity → vLLM Engine
```
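To make the routing step concrete, here is a toy sketch of length-aware dispatch. It is not the Odyn scheduler: the `Worker` class, its `free_kv_tokens` field, and the `route` function are hypothetical, and the policy shown (send each request to the worker with the most free KV-cache capacity, then reserve the predicted footprint) is just one simple way to use the tagger's output.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    free_kv_tokens: int  # hypothetical: remaining KV-cache capacity, in tokens


def route(prompt_tokens: int, predicted_output_tokens: float, workers: list[Worker]) -> Worker:
    """Pick the worker with the most free capacity and reserve the request's footprint."""
    if not workers:
        raise ValueError("no workers available")
    # Estimated KV-cache footprint: prompt plus predicted response length.
    needed = prompt_tokens + int(predicted_output_tokens)
    chosen = max(workers, key=lambda w: w.free_kv_tokens)
    # Reserve capacity up front so later routing decisions see the expected load.
    chosen.free_kv_tokens = max(chosen.free_kv_tokens - needed, 0)
    return chosen
```

Reserving the predicted footprint at dispatch time is what lets the head node account for a long response before the worker's KV cache actually fills.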
Citation
```bibtex
@article{block2025,
  title   = {Block: Scalable LLM Inference},
  journal = {arXiv preprint arXiv:2508.03611},
  year    = {2025},
  note    = {§4.3 Query-Length Tagger}
}
```
</details>