ORCA — OLMo-2-1B-Instruct (Multinomial, seed 99)

ORCA (Open-ended Response Correctness Assessment) scores the correctness of open-ended audio QA responses. Given a question, reference answer, candidate answer, and an LLM-generated rationale, it outputs a correctness score in [0, 1] and an uncertainty estimate.

Paper: ORCA: Open-ended Response Correctness Assessment for Audio Question Answering — accepted to TACL 2026
Code & usage: github.com/BUTSpeechFIT/ORCA
Training data: BUT-FIT/orca-audio-qa-annotations

Model details

Property	Value
Base model	`allenai/OLMo-2-0425-1B-Instruct`
LoRA rank / alpha	128 / 128
Loss function	Multinomial log-likelihood (5-class Likert)
Training seed	99
Training curriculum	Stage 1 (synthetic) → Stage 2 (LLM-judge) → Stage 3 (human)
Precision	bfloat16
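
The model is trained with a 5-class Likert loss but reports a single correctness score in [0, 1] plus an uncertainty estimate. The paper and repository define how ORCA actually derives these values. As a minimal sketch of one plausible reading, the snippet below takes the expected value of the class distribution as the score and its variance as the uncertainty. The function name `likert_to_score`, the evenly spaced class values (0, 0.25, ..., 1), and the use of variance are all assumptions made for illustration, not the ORCA API.

```python
import math


def likert_to_score(logits):
    """Hypothetical helper: map Likert-class logits to (score, variance).

    Assumption: class k of K is assigned the value k / (K - 1), so the
    lowest class maps to 0.0 and the highest to 1.0.
    """
    # Numerically stable softmax over the class logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Evenly spaced class values in [0, 1] (illustrative choice).
    values = [k / (len(probs) - 1) for k in range(len(probs))]

    # Expected value serves as the score; variance as the uncertainty.
    mean = sum(p * v for p, v in zip(probs, values))
    var = sum(p * (v - mean) ** 2 for p, v in zip(probs, values))
    return mean, var


if __name__ == "__main__":
    # Made-up logits for a response the model considers mostly correct.
    score, uncertainty = likert_to_score([-2.0, -1.0, 0.0, 1.5, 2.5])
    print(f"score={score:.3f} uncertainty={uncertainty:.3f}")
```

Under this reading, a distribution concentrated on one class gives a variance near zero, while probability mass spread across distant classes gives a larger variance, flagging a less reliable score.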

Quick start

pip install git+https://github.com/BUTSpeechFIT/ORCA.git
hf download BUT-FIT/orca-olmo-2-1b-multinomial --local-dir orca-olmo-1b
orca-infer --model_path orca-olmo-1b/model --data_jsonl your_data.jsonl --output_dir results/

See the repository for full usage, evaluation scripts, and the download_and_infer.py convenience script.

Citation

@article{sedlacek-etal-2026-orca,
  title={ORCA: Open-ended Response Correctness Assessment for Audio Question Answering},
  author={Sedl\'{a}\v{c}ek, \v{S}imon and Barahona, Sara and Bola\~{n}os, Cecilia and
          Herrera-Alarc\'{o}n, Laura and Udupa, Sathvik and L\'{o}pez, Fernando and
          Ferner, Allison and Lozano-Diez, Alicia and Yusuf, Bolaji and Kesiraju, Santosh and
          Duraiswami, Ramani and \v{C}ernock\'{y}, Jan},
  howpublished={Accepted to Transactions of the Association for Computational Linguistics},
  year={2026},
  url={https://arxiv.org/abs/2512.09066}
}