language:
  - ro
license: mit
library_name: sentence-transformers
base_model: intfloat/multilingual-e5-small
datasets:
  - unstpb-nlp/RoD-TAL
pipeline_tag: sentence-similarity
tags:
  - sentence-transformers
  - information-retrieval
  - legal
  - romanian
  - rag
extra_gated_prompt: >-
  By requesting access you confirm that you will use this model exclusively for
  academic research purposes, will not use it for commercial products and will
  not redistribute the model
extra_gated_fields:
  Full name: text
  Email: text
  Institution / University / Company: text
  Country: country
  Intended use:
    type: select
    options:
      - Academic research
      - University coursework / thesis
      - Non-commercial experimentation
      - label: Other
        value: other
  Project description: text
  Expected usage date: date_picker

multilingual-e5-small-RoD-TAL

Fine-tuned sentence embedding model for Romanian legal retrieval in the driving-license domain.

  • Base model: intfloat/multilingual-e5-small
  • Fine-tuned model: unstpb-nlp/multilingual-e5-small-RoD-TAL
  • Dataset: unstpb-nlp/RoD-TAL
  • Primary task: dense retrieval of Romanian legal references for exam-style questions

Model details

This checkpoint is domain-adapted for the RoD-TAL benchmark (Romanian driving-law QA/retrieval). It is intended for retrieval and RAG pipelines where legal passages must be matched to user questions.

Setup highlights:

  • Contrastive training (MultipleNegativesRankingLoss / InfoNCE)
  • Positive pairs: question ↔ correct legal references
  • Hard negatives mined from top candidates of a base retriever
  • Document encoding with 512-token truncation
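The listed objective can be sketched numerically. In MultipleNegativesRankingLoss / InfoNCE, each question's correct legal reference is the positive, and the other references in the batch (plus any mined hard negatives) act as negatives. A minimal NumPy sketch, assuming L2-normalized embeddings where row i of the passage matrix is the positive for row i of the query matrix (the function name and temperature value are illustrative, not taken from the training config):

```python
import numpy as np

def info_nce_loss(q_emb, p_emb, temperature=0.05):
    # Scaled similarity matrix; entry (i, j) scores query i against passage j.
    logits = (q_emb @ p_emb.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive for query i is passage i; all other passages are negatives.
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
q /= np.linalg.norm(q, axis=1, keepdims=True)

aligned = info_nce_loss(q, q)                   # positives match their queries
mismatched = info_nce_loss(q, q[::-1].copy())   # positives deliberately wrong
print(aligned, mismatched)
```

Training pushes each question's embedding toward its correct reference and away from the in-batch and hard negatives, which is why the aligned batch above scores a much lower loss than the mismatched one.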

Usage

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("unstpb-nlp/multilingual-e5-small-RoD-TAL")

# E5-family models expect "query: " / "passage: " prefixes on the inputs.
queries = [
    "query: În ce situație trebuie să acord prioritate pietonilor?"
    # ("In what situation must I give way to pedestrians?")
]
passages = [
    "passage: Art. X - Conducătorul este obligat să acorde prioritate pietonilor angajați regulamentar în traversare.",
    # ("The driver must give way to pedestrians lawfully engaged in crossing.")
    "passage: Art. Y - ..."
]

q_emb = model.encode(queries, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)

# With L2-normalized embeddings, the dot product equals cosine similarity.
scores = (q_emb @ p_emb.T)[0]
print(scores)

Query formatting recommendation

This is an E5-family model, so explicit query: / passage: prefixes are recommended at inference time.

For RoD-TAL-style retrieval, the best reported performance was obtained by concatenating the question with its answer options as the query text, rather than using the question alone.
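One way to build such a query is to append lettered options to the question before adding the E5 prefix. The helper below is a plausible sketch; the exact concatenation format used in the reported experiments is an assumption, not taken from the paper:

```python
def build_query(question: str, options: list[str]) -> str:
    # Hypothetical formatting: question followed by lettered answer options,
    # with the E5 "query: " prefix prepended.
    lettered = " ".join(f"{letter}) {text}" for letter, text in zip("ABCD", options))
    return f"query: {question} {lettered}"

query = build_query(
    "În ce situație trebuie să acord prioritate pietonilor?",
    # ("In what situation must I give way to pedestrians?")
    ["La trecerea de pietoni", "Doar pe timp de zi", "Niciodată"],
)
print(query)
```

The resulting string is then encoded exactly like any other query in the usage example above.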

Intended uses

  • Legal article retrieval in Romanian traffic-law assistants
  • Retrieval stage for Romanian legal RAG systems
  • Domain-specific benchmarking on RoD-TAL IR/VIR tasks

Limitations

  • Truncation at 512 tokens can reduce recall for long legal documents
  • Performance is domain-focused (Romanian driving-law); transfer to other legal domains may drop
  • Reported gains depend on retrieval protocol and corpus preparation
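The 512-token truncation listed above can be mitigated by splitting long articles into overlapping chunks before encoding and scoring a document by its best chunk. This is a common workaround, not part of the reported setup; the sketch below uses word count as a rough proxy for subword token count, and chunk_words is a hypothetical helper:

```python
def chunk_words(text: str, max_words: int = 350, overlap: int = 50) -> list[str]:
    # Overlapping word windows keep each chunk comfortably under the
    # 512-token encoder limit (word count only approximates subword count).
    words = text.split()
    step = max_words - overlap
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, max(len(words) - overlap, 1), step)
    ]

doc = " ".join(str(i) for i in range(1000))
chunks = chunk_words(doc)
print(len(chunks))
```

Each chunk is then encoded as a separate "passage: ..." entry, and a document's retrieval score is taken as the maximum over its chunks.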

Training and evaluation summary

Reported IR test metrics for the fine-tuned retriever on RoD-TAL:

  • Recall@10: 88.14
  • Precision@10: 23.28
  • nDCG@10: 81.41

Citation

If you use this model, please cite:

@misc{man2025rodtalbenchmarkansweringquestions,
  title={RoD-TAL: A Benchmark for Answering Questions in Romanian Driving License Exams},
  author={Andrei Vlad Man and Răzvan-Alexandru Smădu and Cristian-George Craciun and Dumitru-Clementin Cercel and Florin Pop and Mihaela-Claudia Cercel},
  year={2025},
  eprint={2507.19666},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.19666}
}