multilingual-e5-small-RoD-TAL
Fine-tuned sentence embedding model for Romanian legal retrieval in the driving-license domain.
- Base model: intfloat/multilingual-e5-small
- Fine-tuned model: unstpb-nlp/multilingual-e5-small-RoD-TAL
- Dataset: unstpb-nlp/RoD-TAL
- Primary task: dense retrieval of Romanian legal references for exam-style questions
Model details
This checkpoint is domain-adapted for the RoD-TAL benchmark (Romanian driving-law QA/retrieval). It is intended for retrieval and RAG pipelines where legal passages must be matched to user questions.
Setup highlights:
- Contrastive training (MultipleNegativesRankingLoss / InfoNCE)
- Positive pairs: question ↔ correct legal references
- Hard negatives mined from top candidates of a base retriever
- Document encoding with 512-token truncation
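The objective listed above can be sketched as a plain-PyTorch in-batch-negatives InfoNCE, which is the core of sentence-transformers' MultipleNegativesRankingLoss. The scale factor and the random toy embeddings below are illustrative assumptions, not the training configuration used for this checkpoint:

```python
import torch
import torch.nn.functional as F

def multiple_negatives_ranking_loss(q_emb, p_emb, scale=20.0):
    # Row i of q_emb pairs with row i of p_emb; every other passage in the
    # batch acts as an in-batch negative (InfoNCE with the diagonal as target).
    q = F.normalize(q_emb, dim=-1)
    p = F.normalize(p_emb, dim=-1)
    scores = scale * q @ p.T                  # (batch, batch) cosine similarities
    labels = torch.arange(scores.size(0))     # positives sit on the diagonal
    return F.cross_entropy(scores, labels)

# Toy batch with random vectors at e5-small's 384-dim embedding size
torch.manual_seed(0)
q = torch.randn(4, 384)
p = torch.randn(4, 384)
loss = multiple_negatives_ranking_loss(q, p)
```

Mined hard negatives are typically appended as extra passage rows, so each question is scored against its positive, its hard negative, and all other in-batch passages.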
Usage
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("unstpb-nlp/multilingual-e5-small-RoD-TAL")

# E5-style prefixes: "query: " for questions, "passage: " for documents
queries = [
    "query: În ce situație trebuie să acord prioritate pietonilor?"
]
passages = [
    "passage: Art. X - Conducătorul este obligat să acorde prioritate pietonilor angajați regulamentar în traversare.",
    "passage: Art. Y - ...",
]

# With normalized embeddings, the dot product equals cosine similarity
q_emb = model.encode(queries, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)
scores = (q_emb @ p_emb.T)[0]
print(scores)
```
Query formatting recommendation
This is an E5-family model, so explicit query: / passage: prefixes are recommended at inference time.
For RoD-TAL-style retrieval, the best reported performance used the question concatenated with its answer options as the query text, rather than the question alone.
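The recommended query format can be built as follows; the question and options here are illustrative placeholders, not RoD-TAL items:

```python
# Concatenate the question with its answer options into one E5-style query
question = "În ce situație trebuie să acord prioritate pietonilor?"
options = [
    "a) când traversează regulamentar pe trecerea de pietoni",
    "b) doar la semnalul polițistului",
    "c) niciodată",
]
query = "query: " + question + " " + " ".join(options)
```

The resulting string is passed to model.encode exactly like the question-only queries in the usage example above.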
Intended uses
- Legal article retrieval in Romanian traffic-law assistants
- Retrieval stage for Romanian legal RAG systems
- Domain-specific benchmarking on RoD-TAL IR/VIR tasks
Limitations
- Truncation at 512 tokens can reduce recall for long legal documents
- Performance is domain-focused (Romanian driving-law); transfer to other legal domains may drop
- Reported gains depend on retrieval protocol and corpus preparation
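One common mitigation for the 512-token truncation limit (a standard technique, not necessarily the protocol used with this model) is to split long articles into overlapping word windows, embed each window, and score a document by its best-matching chunk:

```python
def chunk_words(text, size=200, overlap=50):
    # Overlapping word windows; choose `size` so each window stays
    # comfortably under the encoder's 512-token limit after tokenization
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

# A long document's retrieval score is then max-pooled over its chunks, e.g.:
#   doc_score = max(q_emb @ model.encode("passage: " + c) for c in chunk_words(doc))
```

Word counts only approximate token counts, so in practice the window size should be validated against the model's tokenizer.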
Training and evaluation summary
Reported IR test metrics for the fine-tuned retriever on RoD-TAL:
- Recall@10: 88.14
- Precision@10: 23.28
- nDCG@10: 81.41
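Standard definitions of these metrics are sketched below (the paper's exact evaluation protocol may differ in tie-breaking or relevance grading). Note that Precision@k is bounded above by the number of relevant references per query divided by k, so a low Precision@10 is expected when most questions have only a few gold articles:

```python
import math

def recall_at_k(retrieved, relevant, k=10):
    # Fraction of the gold references found in the top-k results
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def precision_at_k(retrieved, relevant, k=10):
    # Fraction of the top-k results that are gold references
    return len(set(retrieved[:k]) & set(relevant)) / k

def ndcg_at_k(retrieved, relevant, k=10):
    # Binary-relevance nDCG: rank-discounted gain over the ideal ordering
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(retrieved[:k]) if doc in relevant)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / idcg
```

For example, with gold set {"art1", "art2"} and ranking ["art1", "x", "art2"], Recall@10 is 1.0 while Precision@10 is only 0.2.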
Citation
If you use this model, please cite:
@misc{man2025rodtalbenchmarkansweringquestions,
title={RoD-TAL: A Benchmark for Answering Questions in Romanian Driving License Exams},
author={Andrei Vlad Man and Răzvan-Alexandru Smădu and Cristian-George Craciun and Dumitru-Clementin Cercel and Florin Pop and Mihaela-Claudia Cercel},
year={2025},
eprint={2507.19666},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2507.19666}
}