OmniScore mxbai
QCRI/OmniScore-mxbai is a multi-output regression model for automatic text quality evaluation.
It predicts four scalar scores in the range [1, 5]:
- informativeness
- clarity
- plausibility
- faithfulness
The model is built on top of mixedbread-ai/mxbai-embed-large-v1 and published with custom model code (AutoModel + trust_remote_code=True).
Model Details
- Base model: mixedbread-ai/mxbai-embed-large-v1
- Architecture: ScorePredictorModel (custom transformers model)
- Model type: encoder-only text regression
- Max sequence length: 512
- Number of outputs: 4
- Output range: [1, 5] (sigmoid-scaled in the model head)
- Backbone hidden size: 768
- Saved dtype: float32
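The [1, 5] output range listed above is described as sigmoid-scaled in the model head. As a rough sketch of what such a mapping looks like (the exact formula inside ScorePredictorModel is an assumption here, not taken from the published code):

```python
import math

def sigmoid_to_score(logit: float, low: float = 1.0, high: float = 5.0) -> float:
    """Map an unbounded regression logit into [low, high] via a sigmoid.

    Illustrative only: mirrors the "[1, 5] (sigmoid-scaled)" description,
    but the actual head in ScorePredictorModel may differ.
    """
    return low + (high - low) / (1.0 + math.exp(-logit))

print(sigmoid_to_score(0.0))   # -> 3.0, the midpoint of the range
print(sigmoid_to_score(10.0))  # approaches, but never reaches, 5.0
```

A mapping of this shape explains why predicted scores cluster strictly inside the interval rather than hitting the endpoints exactly.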
Quick Access
Model page: https://huggingface.co/QCRI/OmniScore-mxbai
```python
from transformers import AutoTokenizer, AutoModel

repo_id = "QCRI/OmniScore-mxbai"
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
```
What Input To Provide
The model takes a single text string and returns four quality scores. For best results, keep a consistent prompt/input format during inference.
Recommended flat format:
```text
Task: <task_name>
Source: <source text, if available>
Reference: <reference text, if available>
Candidate: <model output being evaluated>
```
Chat-style input can be flattened as:
```text
System: ...
User: ...
Assistant: ...
```
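A small helper keeps the flat format consistent across calls. The sketch below follows the field order shown above; `format_input` is an illustrative name, not a function shipped in the repo:

```python
from typing import Optional

def format_input(task: str, candidate: str, source: Optional[str] = None,
                 reference: Optional[str] = None) -> str:
    """Build the recommended flat input format, omitting absent fields."""
    lines = [f"Task: {task}"]
    if source is not None:
        lines.append(f"Source: {source}")
    if reference is not None:
        lines.append(f"Reference: {reference}")
    lines.append(f"Candidate: {candidate}")
    return "\n".join(lines)

text = format_input("summarization", "A short summary.", source="Full article text.")
print(text)
```

Centralizing the formatting in one place makes it easy to keep training-style and inference-time inputs identical.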
Usage Examples
Install dependencies:
```bash
pip install -U torch transformers sentencepiece
```
1) Single Text Example
```python
import torch
from transformers import AutoTokenizer, AutoModel

repo_id = "QCRI/OmniScore-mxbai"
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()

text = """Task: headline_evaluation
Source: Full article text goes here.
Candidate: Microsoft releases detailed model documentation."""

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)

scores = {
    name: float(outputs.predictions[0, i])
    for i, name in enumerate(model.config.score_names)
}
print(scores)
```
2) Batch Example (GPU/CPU)
```python
import torch
from transformers import AutoTokenizer, AutoModel

repo_id = "QCRI/OmniScore-mxbai"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).to(device).eval()

texts = [
    "Task: summarization\nSource: ...\nCandidate: ...",
    "Task: translation_evaluation\nSource: ...\nReference: ...\nCandidate: ...",
]

batch = tokenizer(texts, return_tensors="pt", truncation=True, padding=True, max_length=512)
batch = {k: v.to(device) for k, v in batch.items()}

with torch.no_grad():
    pred = model(**batch).predictions

results = []
for row in pred.cpu():
    results.append({name: float(row[i]) for i, name in enumerate(model.config.score_names)})
print(results)
```
3) Chat Messages Helper
```python
import torch
from transformers import AutoTokenizer, AutoModel

repo_id = "QCRI/OmniScore-mxbai"
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a concise summary of this article."},
    {"role": "assistant", "content": "Here is a short summary..."},
]

# Flatten chat messages into the "System: ... User: ... Assistant: ..." format.
flat_text = " ".join(f"{m['role'].capitalize()}: {m['content']}" for m in messages)

inputs = tokenizer(flat_text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
print({n: float(outputs.predictions[0, i]) for i, n in enumerate(model.config.score_names)})
```
Programmatic Download (Optional)
```python
from huggingface_hub import snapshot_download

local_dir = snapshot_download("QCRI/OmniScore-mxbai")
print(local_dir)
```
Data and Task Coverage
This checkpoint targets multi-task text quality scoring and is evaluated on a test set covering:
- Chat evaluation
- Headline evaluation
- Paraphrase evaluation
- QA evaluation
- Summarization evaluation
- Translation evaluation
The underlying project data is multilingual and multi-domain.
Intended Use
Use this model to score generated text quality (or response quality) as a supporting signal in:
- evaluation dashboards
- ranking experiments
- offline model comparison
- human-in-the-loop workflows
Not intended as a sole decision maker for high-stakes or safety-critical settings.
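For ranking experiments, one simple aggregation is to order candidates by the mean of the four scores. This is an illustrative sketch; the equal weighting is a choice on the user's side, not something the model prescribes:

```python
def rank_candidates(scored):
    """Sort (text, scores-dict) pairs by mean score, best first.

    `scored` is a list of (candidate_text, {dimension: score}) pairs,
    e.g. built from model.config.score_names as in the examples above.
    """
    def mean_score(item):
        scores = item[1]
        return sum(scores.values()) / len(scores)
    return sorted(scored, key=mean_score, reverse=True)

candidates = [
    ("draft A", {"informativeness": 3.1, "clarity": 4.0, "plausibility": 3.8, "faithfulness": 3.5}),
    ("draft B", {"informativeness": 4.2, "clarity": 4.1, "plausibility": 4.0, "faithfulness": 4.3}),
]
print([text for text, _ in rank_candidates(candidates)])  # -> ['draft B', 'draft A']
```

If one dimension matters more for your application (e.g. faithfulness), replace the mean with a weighted sum validated on your own data.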
Limitations
- Scores are continuous estimates and should not be treated as absolute truth.
- Performance differs by task, language, and domain.
- The model can inherit annotation noise and dataset biases.
- Long inputs beyond 512 tokens are truncated.
- Low correlation metrics on some dimensions indicate that rank ordering can be weak for certain subsets.
Responsible Use
Recommended:
- Use as a decision-support signal, not as a sole decision maker.
- Calibrate thresholds on your own validation set before production use.
- Monitor by language/task slices for fairness and reliability.
Not recommended:
- High-stakes automated decisions without human oversight.
- Out-of-domain deployment without re-validation.
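Threshold calibration can be as simple as sweeping candidate cutoffs over the [1, 5] range on a labeled validation set and keeping the one that maximizes your target metric. A minimal sketch using accuracy on toy data (swap in your own scores, labels, and metric):

```python
def best_threshold(scores, labels, step=0.1):
    """Sweep thresholds over [1, 5] and return (threshold, accuracy).

    scores: model scores for one dimension; labels: 1 = acceptable, 0 = not.
    """
    best_t, best_acc = 1.0, -1.0
    t = 1.0
    while t <= 5.0:
        preds = [1 if s >= t else 0 for s in scores]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
        t = round(t + step, 10)  # avoid float drift in the sweep
    return best_t, best_acc

# Toy validation data: candidates scoring above ~3.5 were judged acceptable.
scores = [2.1, 2.8, 3.2, 3.6, 4.0, 4.7]
labels = [0, 0, 0, 1, 1, 1]
print(best_threshold(scores, labels))  # best threshold falls between 3.2 and 3.6
```

The same sweep works per task or language slice, which is also a convenient place to attach the fairness monitoring recommended above.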
Reproducibility Notes
Published artifacts include:
- model.safetensors
- config.json
- configuration_score_predictor.py
- modeling_score_predictor.py
- tokenizer files
- metrics_final.json
- predictions.jsonl
Load with trust_remote_code=True because the architecture is custom.
Citation
If you use this model, please cite the project/repository and this model URL:
```bibtex
@misc{qcri_omniscore_mxbai,
  title = {OmniScore mxbai},
  author = {QCRI},
  year = {2026},
  howpublished = {\url{https://huggingface.co/QCRI/OmniScore-mxbai}}
}
```