OmniScore mxbai
QCRI/OmniScore-mxbai is a multi-output regression model for automatic text quality evaluation.
It predicts four scalar scores in the range [1, 5]:
- informativeness
- clarity
- plausibility
- faithfulness
The model is built on top of mixedbread-ai/mxbai-embed-large-v1 and published with custom model code (AutoModel + trust_remote_code=True).
Model Details
- Base model: mixedbread-ai/mxbai-embed-large-v1
- Architecture: ScorePredictorModel (custom transformers model)
- Model type: encoder-only text regression
- Max sequence length: 512
- Number of outputs: 4
- Output range: [1, 5] (sigmoid-scaled in the model head)
- Backbone hidden size: 768
- Saved dtype: float32
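The [1, 5] output range listed above is described as sigmoid-scaled in the model head. As a rough sketch of what such a mapping looks like (the exact formula inside ScorePredictorModel is an assumption here, not taken from the published code):

```python
import math

def sigmoid_to_score(logit: float, low: float = 1.0, high: float = 5.0) -> float:
    """Map an unbounded regression logit into [low, high] via a sigmoid.

    Illustrative only: mirrors the "[1, 5] (sigmoid-scaled)" description,
    but the actual head in ScorePredictorModel may differ.
    """
    return low + (high - low) / (1.0 + math.exp(-logit))

print(sigmoid_to_score(0.0))   # -> 3.0, the midpoint of the range
print(sigmoid_to_score(10.0))  # approaches, but never reaches, 5.0
```

A mapping of this shape explains why predicted scores cluster strictly inside the interval rather than hitting the endpoints exactly.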
Quick Access
Model page: https://huggingface.co/QCRI/OmniScore-mxbai
```python
from transformers import AutoTokenizer, AutoModel

repo_id = "QCRI/OmniScore-mxbai"
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
```
What Input To Provide
The model takes a single text string and returns four quality scores. For best results, keep a consistent prompt/input format during inference.
Recommended flat format:
```text
Task: <task_name>
Source: <source text, if available>
Reference: <reference text, if available>
Candidate: <model output being evaluated>
```
Chat-style input can be flattened as:
```text
System: ...
User: ...
Assistant: ...
```
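A small helper keeps the flat format consistent across calls. The sketch below follows the field order shown above; `format_input` is an illustrative name, not a function shipped in the repo:

```python
from typing import Optional

def format_input(task: str, candidate: str, source: Optional[str] = None,
                 reference: Optional[str] = None) -> str:
    """Build the recommended flat input format, omitting absent fields."""
    lines = [f"Task: {task}"]
    if source is not None:
        lines.append(f"Source: {source}")
    if reference is not None:
        lines.append(f"Reference: {reference}")
    lines.append(f"Candidate: {candidate}")
    return "\n".join(lines)

text = format_input("summarization", "A short summary.", source="Full article text.")
print(text)
```

Centralizing the formatting in one place makes it easy to keep training-style and inference-time inputs identical.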
Usage Examples
Install dependencies:
```bash
pip install -U torch transformers sentencepiece
```
1) Single Text Example
```python
import torch
from transformers import AutoTokenizer, AutoModel

repo_id = "QCRI/OmniScore-mxbai"
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()

text = """Task: headline_evaluation
Source: Full article text goes here.
Candidate: Microsoft releases detailed model documentation."""

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)

scores = {
    name: float(outputs.predictions[0, i])
    for i, name in enumerate(model.config.score_names)
}
print(scores)
```
2) Batch Example (GPU/CPU)
```python
import torch
from transformers import AutoTokenizer, AutoModel

repo_id = "QCRI/OmniScore-mxbai"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).to(device).eval()

texts = [
    "Task: summarization\nSource: ...\nCandidate: ...",
    "Task: translation_evaluation\nSource: ...\nReference: ...\nCandidate: ...",
]

batch = tokenizer(texts, return_tensors="pt", truncation=True, padding=True, max_length=512)
batch = {k: v.to(device) for k, v in batch.items()}

with torch.no_grad():
    pred = model(**batch).predictions

results = []
for row in pred.cpu():
    results.append({name: float(row[i]) for i, name in enumerate(model.config.score_names)})
print(results)
```
3) Chat Messages Helper
```python
import torch
from transformers import AutoTokenizer, AutoModel

repo_id = "QCRI/OmniScore-mxbai"
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a concise summary of this article."},
    {"role": "assistant", "content": "Here is a short summary..."},
]

# Flatten chat messages into the "System: ... User: ... Assistant: ..." format.
flat_text = " ".join(f"{m['role'].capitalize()}: {m['content']}" for m in messages)

inputs = tokenizer(flat_text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
print({n: float(outputs.predictions[0, i]) for i, n in enumerate(model.config.score_names)})
```
Programmatic Download (Optional)
```python
from huggingface_hub import snapshot_download

local_dir = snapshot_download("QCRI/OmniScore-mxbai")
print(local_dir)
```
Data and Task Coverage
This checkpoint targets multi-task text quality scoring and is evaluated on a test set covering:
- Chat evaluation
- Headline evaluation
- Paraphrase evaluation
- QA evaluation
- Summarization evaluation
- Translation evaluation
The underlying project data is multilingual and multi-domain.
Intended Use
Use this model to score generated text quality (or response quality) as a supporting signal in:
- evaluation dashboards
- ranking experiments
- offline model comparison
- human-in-the-loop workflows
Not intended as a sole decision maker for high-stakes or safety-critical settings.
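For ranking experiments, one simple aggregation is to order candidates by the mean of the four scores. This is an illustrative sketch; the equal weighting is a choice on the user's side, not something the model prescribes:

```python
def rank_candidates(scored):
    """Sort (text, scores-dict) pairs by mean score, best first.

    `scored` is a list of (candidate_text, {dimension: score}) pairs,
    e.g. built from model.config.score_names as in the examples above.
    """
    def mean_score(item):
        scores = item[1]
        return sum(scores.values()) / len(scores)
    return sorted(scored, key=mean_score, reverse=True)

candidates = [
    ("draft A", {"informativeness": 3.1, "clarity": 4.0, "plausibility": 3.8, "faithfulness": 3.5}),
    ("draft B", {"informativeness": 4.2, "clarity": 4.1, "plausibility": 4.0, "faithfulness": 4.3}),
]
print([text for text, _ in rank_candidates(candidates)])  # -> ['draft B', 'draft A']
```

If one dimension matters more for your application (e.g. faithfulness), replace the mean with a weighted sum validated on your own data.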
Limitations
- Scores are continuous estimates and should not be treated as absolute truth.
- Performance differs by task, language, and domain.
- The model can inherit annotation noise and dataset biases.
- Long inputs beyond 512 tokens are truncated.
- Low correlation metrics on some dimensions indicate that rank ordering can be weak for certain subsets.
Responsible Use
Recommended:
- Use as a decision-support signal, not as a sole decision maker.
- Calibrate thresholds on your own validation set before production use.
- Monitor by language/task slices for fairness and reliability.
Not recommended:
- High-stakes automated decisions without human oversight.
- Out-of-domain deployment without re-validation.
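Threshold calibration can be as simple as sweeping candidate cutoffs over the [1, 5] range on a labeled validation set and keeping the one that maximizes your target metric. A minimal sketch using accuracy on toy data (swap in your own scores, labels, and metric):

```python
def best_threshold(scores, labels, step=0.1):
    """Sweep thresholds over [1, 5] and return (threshold, accuracy).

    scores: model scores for one dimension; labels: 1 = acceptable, 0 = not.
    """
    best_t, best_acc = 1.0, -1.0
    t = 1.0
    while t <= 5.0:
        preds = [1 if s >= t else 0 for s in scores]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
        t = round(t + step, 10)  # avoid float drift in the sweep
    return best_t, best_acc

# Toy validation data: candidates scoring above ~3.5 were judged acceptable.
scores = [2.1, 2.8, 3.2, 3.6, 4.0, 4.7]
labels = [0, 0, 0, 1, 1, 1]
print(best_threshold(scores, labels))  # best threshold falls between 3.2 and 3.6
```

The same sweep works per task or language slice, which is also a convenient place to attach the fairness monitoring recommended above.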
Reproducibility Notes
Published artifacts include:
- model.safetensors
- config.json
- configuration_score_predictor.py
- modeling_score_predictor.py
- tokenizer files
- metrics_final.json
- predictions.jsonl
Load with trust_remote_code=True because the architecture is custom.
Citation
If you use this model, please cite the project/repository and this model URL:
```bibtex
@misc{qcri_omniscore_mxbai,
  title = {OmniScore mxbai},
  author = {QCRI},
  year = {2026},
  howpublished = {\url{https://huggingface.co/QCRI/OmniScore-mxbai}}
}
```