---
language:
- ro
license: mit
library_name: sentence-transformers
base_model: intfloat/multilingual-e5-small
datasets:
- unstpb-nlp/RoD-TAL
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- information-retrieval
- legal
- romanian
- rag
extra_gated_prompt: >-
  By requesting access you confirm that you will use this model exclusively for
  academic research purposes, will not use it for commercial products and will
  not redistribute the model
extra_gated_fields:
  Full name: text
  Email: text
  Institution / University / Company: text
  Country: country
  Intended use:
    type: select
    options:
    - Academic research
    - University coursework / thesis
    - Non-commercial experimentation
    - label: Other
      value: other
  Project description: text
  Expected usage date: date_picker
---
# multilingual-e5-small-RoD-TAL

Fine-tuned sentence embedding model for Romanian legal retrieval in the driving-license domain.
- **Base model:** intfloat/multilingual-e5-small
- **Fine-tuned model:** unstpb-nlp/multilingual-e5-small-RoD-TAL
- **Dataset:** unstpb-nlp/RoD-TAL
- **Primary task:** dense retrieval of Romanian legal references for exam-style questions
## Model details
This checkpoint is domain-adapted for the RoD-TAL benchmark (Romanian driving-law QA/retrieval). It is intended for retrieval and RAG pipelines where legal passages must be matched to user questions.
Setup highlights:

- Contrastive training (`MultipleNegativesRankingLoss` / InfoNCE)
- Positive pairs: question ↔ correct legal references
- Hard negatives mined from the top candidates of a base retriever
- Document encoding with 512-token truncation
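A minimal sketch of this style of contrastive fine-tuning with sentence-transformers. The sample records, batch size, and epoch count below are illustrative placeholders, not the authors' exact recipe; the shape (E5-prefixed query/positive/hard-negative triplets fed to `MultipleNegativesRankingLoss`) is the standard setup the highlights above describe.

```python
def build_triplets(records):
    """Turn (question, positive_passage, hard_negative) records into
    E5-prefixed triplets for MultipleNegativesRankingLoss."""
    return [
        [f"query: {q}", f"passage: {pos}", f"passage: {neg}"]
        for q, pos, neg in records
    ]


if __name__ == "__main__":
    from torch.utils.data import DataLoader
    from sentence_transformers import InputExample, SentenceTransformer, losses

    model = SentenceTransformer("intfloat/multilingual-e5-small")

    # Illustrative record; real training uses RoD-TAL questions and references.
    triplets = build_triplets([
        ("În ce situație trebuie să acord prioritate pietonilor?",
         "Art. X - Conducătorul este obligat să acorde prioritate pietonilor ...",
         "Art. Z - Un articol irelevant folosit ca negativ dificil ..."),
    ])
    train_examples = [InputExample(texts=t) for t in triplets]

    # With MultipleNegativesRankingLoss, every other passage in the batch
    # also acts as an in-batch negative, so larger batches help.
    loader = DataLoader(train_examples, shuffle=True, batch_size=32)
    loss = losses.MultipleNegativesRankingLoss(model)
    model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
```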
## Usage

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("unstpb-nlp/multilingual-e5-small-RoD-TAL")

# E5-style prefixes: "query: " for questions, "passage: " for documents.
queries = [
    # "In which situation must I give way to pedestrians?"
    "query: În ce situație trebuie să acord prioritate pietonilor?"
]
passages = [
    # "Art. X - The driver must give way to pedestrians lawfully crossing."
    "passage: Art. X - Conducătorul este obligat să acorde prioritate pietonilor angajați regulamentar în traversare.",
    "passage: Art. Y - ...",
]

# With L2-normalized embeddings, the dot product equals cosine similarity.
q_emb = model.encode(queries, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)

scores = (q_emb @ p_emb.T)[0]
print(scores)
```
## Query formatting recommendation

This is an E5-family model, so explicit `query:` / `passage:` prefixes are recommended at inference time.

For RoD-TAL-style retrieval, the best reported performance used the question concatenated with its answer options as the query text, rather than the question alone.
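A small helper illustrating that query construction; the separator format (`a) ...`) is an illustrative choice, not a format specified by the dataset.

```python
def build_query(question, options):
    """Concatenate an exam question with its answer options into a single
    E5-prefixed query string. The "letter) text" separators are illustrative."""
    joined = " ".join(f"{letter}) {text}" for letter, text in options)
    return f"query: {question} {joined}"


query = build_query(
    "În ce situație trebuie să acord prioritate pietonilor?",
    [("a", "Când traversează regulamentar."),
     ("b", "Niciodată."),
     ("c", "Doar pe autostradă.")],
)
print(query)
```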
## Intended uses
- Legal article retrieval in Romanian traffic-law assistants
- Retrieval stage for Romanian legal RAG systems
- Domain-specific benchmarking on RoD-TAL IR/VIR tasks
## Limitations
- Truncation at 512 tokens can reduce recall for long legal documents
- Performance is domain-focused (Romanian driving-law); transfer to other legal domains may drop
- Reported gains depend on retrieval protocol and corpus preparation
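One common way to work around the 512-token truncation is to index long documents as overlapping chunks and retrieve at chunk level. A minimal sketch, using word counts as a rough proxy for token counts (for precise limits, count with the model's tokenizer instead):

```python
def chunk_words(text, max_words=200, overlap=50):
    """Split a long document into overlapping word windows so each chunk
    stays well under the encoder's 512-token limit. Word counts only
    approximate token counts; swap in the model tokenizer for precision."""
    words = text.split()
    if len(words) <= max_words:
        return [text]
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks
```

Each chunk is then encoded with the `passage:` prefix, and a document's score can be taken as the maximum over its chunks.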
## Training and evaluation summary

Reported IR test metrics for the fine-tuned retriever on RoD-TAL:

| Metric | Value |
| --- | --- |
| Recall@10 | 88.14 |
| Precision@10 | 23.28 |
| nDCG@10 | 81.41 |
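For reference, the metrics above follow the standard definitions; a minimal binary-relevance sketch (not the authors' evaluation code):

```python
import math


def recall_at_k(retrieved, relevant, k=10):
    """Fraction of relevant documents that appear in the top-k results."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def precision_at_k(retrieved, relevant, k=10):
    """Fraction of the top-k result slots occupied by relevant documents."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / k


def ndcg_at_k(retrieved, relevant, k=10):
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    rel = set(relevant)
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(retrieved[:k]) if doc in rel)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(len(rel), k)))
    return dcg / idcg


retrieved = ["art_1", "art_7", "art_3"]  # ranked system output (illustrative IDs)
relevant = ["art_1", "art_3"]            # gold legal references
print(recall_at_k(retrieved, relevant), ndcg_at_k(retrieved, relevant))
```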
## Citation

If you use this model, please cite:

```bibtex
@misc{man2025rodtalbenchmarkansweringquestions,
  title={RoD-TAL: A Benchmark for Answering Questions in Romanian Driving License Exams},
  author={Andrei Vlad Man and Răzvan-Alexandru Smădu and Cristian-George Craciun and Dumitru-Clementin Cercel and Florin Pop and Mihaela-Claudia Cercel},
  year={2025},
  eprint={2507.19666},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.19666}
}
```