
multilingual-e5-small-RoD-TAL

Fine-tuned sentence embedding model for Romanian legal retrieval in the driving-license domain.

  • Base model: intfloat/multilingual-e5-small
  • Fine-tuned model: unstpb-nlp/multilingual-e5-small-RoD-TAL
  • Dataset: unstpb-nlp/RoD-TAL
  • Primary task: dense retrieval of Romanian legal references for exam-style questions

Model details

This checkpoint is domain-adapted for the RoD-TAL benchmark (Romanian driving-law QA/retrieval). It is intended for retrieval and RAG pipelines where legal passages must be matched to user questions.

Setup highlights:

  • Contrastive training (MultipleNegativesRankingLoss / InfoNCE)
  • Positive pairs: question ↔ correct legal references
  • Hard negatives mined from top candidates of a base retriever
  • Document encoding with 512-token truncation

Usage

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("unstpb-nlp/multilingual-e5-small-RoD-TAL")

# E5-family models expect "query: " / "passage: " prefixes (see below).
queries = [
    "query: În ce situație trebuie să acord prioritate pietonilor?"
]
passages = [
    "passage: Art. X - Conducătorul este obligat să acorde prioritate pietonilor angajați regulamentar în traversare.",
    "passage: Art. Y - ..."
]

q_emb = model.encode(queries, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)

# With normalized embeddings, the dot product equals cosine similarity;
# the highest-scoring passage is the best match for the query.
scores = (q_emb @ p_emb.T)[0]
print(scores)

Query formatting recommendation

This is an E5-family model, so explicit query: / passage: prefixes are recommended at inference time.

For RoD-TAL-style retrieval, the best reported performance used the question concatenated with its answer options as the query text, rather than the question alone.
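For example, a query can be built by joining the question and its options before adding the E5 prefix. The question and option texts below are illustrative, not drawn from the dataset:

```python
# Hypothetical exam item; texts are illustrative only.
question = "În ce situație trebuie să acord prioritate pietonilor?"
options = [
    "A. Când sunt angajați regulamentar în traversare.",
    "B. Numai la semnalul polițistului.",
]

# Concatenate question + answer options, then add the E5 "query: " prefix.
query_text = "query: " + question + " " + " ".join(options)
print(query_text)
```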

Intended uses

  • Legal article retrieval in Romanian traffic-law assistants
  • Retrieval stage for Romanian legal RAG systems
  • Domain-specific benchmarking on RoD-TAL IR/VIR tasks

Limitations

  • Truncation at 512 tokens can reduce recall for long legal documents
  • Performance is domain-focused (Romanian driving-law); it may drop when transferring to other legal domains
  • Reported gains depend on retrieval protocol and corpus preparation
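One way to mitigate the 512-token truncation is to split long documents into overlapping chunks before encoding and index each chunk separately. A rough sketch, using whitespace tokens to approximate subword counts (swap in the model's tokenizer for exact limits; chunk and overlap sizes are illustrative):

```python
def chunk_passage(text, max_tokens=480, stride=64):
    # Whitespace tokens approximate subword counts; max_tokens leaves
    # headroom under 512 for the "passage: " prefix and special tokens.
    words = text.split()
    step = max_tokens - stride  # overlap consecutive chunks by `stride`
    chunks = []
    for start in range(0, max(len(words), 1), step):
        window = words[start:start + max_tokens]
        chunks.append("passage: " + " ".join(window))
        if start + max_tokens >= len(words):
            break
    return chunks

chunks = chunk_passage("cuvânt " * 1000)
print(len(chunks))  # the 1000-word document becomes 3 overlapping chunks
```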

Training and evaluation summary

Reported IR test metrics (percentages) for the fine-tuned retriever on RoD-TAL:

  • Recall@10: 88.14
  • Precision@10: 23.28
  • nDCG@10: 81.41

Citation

If you use this model, please cite:

@misc{man2025rodtalbenchmarkansweringquestions,
  title={RoD-TAL: A Benchmark for Answering Questions in Romanian Driving License Exams},
  author={Andrei Vlad Man and Răzvan-Alexandru Smădu and Cristian-George Craciun and Dumitru-Clementin Cercel and Florin Pop and Mihaela-Claudia Cercel},
  year={2025},
  eprint={2507.19666},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.19666}
}