# jua-4B-mixed
jua-4B-mixed is a Brazilian Portuguese legal embedding model based on Qwen/Qwen3-Embedding-4B. It was adapted with a mixed supervision regime that combines legal-domain supervision with broader question-passage supervision, and is intended for heterogeneous retrieval settings where legal specialization and broader semantic robustness are both important.
This model is presented in the paper *Domain-Adaptive Dense Retrieval for Brazilian Legal Search* and corresponds to the mixed training condition discussed there.
## Model Overview
- Base model: Qwen/Qwen3-Embedding-4B
- Model type: text embedding
- Primary language: Brazilian Portuguese
- Intended use: dense retrieval for Brazilian legal search
- Training profile: mixed supervision
The mixed training regime uses:
- JUÁ-Juris training pairs
- Ulysses-derived legislative supervision
- a small synthetic legislative extension based on alternative query formulations
- SQuAD-pt as a broader question-passage supervision source
## Intended Use
This model is best suited for:
- heterogeneous legal retrieval
- question-driven legal search
- retrieval setups that need a stronger balance between legal specialization and semantic robustness
- upstream retrieval for RAG-like legal pipelines
If your use case is narrowly specialized and institutionally framed, the legal-only variant may be preferable: `ufca-llms/jua-4B-legal-only`.
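For RAG-style pipelines, the retrieval step typically reduces to ranking the corpus by cosine similarity to the query embedding and keeping the top-k hits. A minimal sketch of that ranking step, using made-up 3-d vectors in place of real embeddings (in practice the vectors would come from this model's `encode`, as shown in the Usage section):

```python
# Sketch of top-k retrieval by cosine similarity. The vectors below are
# hand-made stand-ins; real embeddings would come from
# SentenceTransformer("ufca-llms/jua-4B-mixed").encode(...).
import numpy as np


def top_k(query_emb: np.ndarray, doc_embs: np.ndarray, k: int = 3) -> list[int]:
    """Return indices of the k most similar documents by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = d @ q
    return np.argsort(-scores)[:k].tolist()


# Toy example with 3-d vectors (stand-ins for real embeddings).
query = np.array([1.0, 0.0, 0.0])
docs = np.array([
    [0.9, 0.1, 0.0],  # most similar to the query
    [0.0, 1.0, 0.0],  # orthogonal
    [0.7, 0.7, 0.0],  # partially similar
])
print(top_k(query, docs, k=2))  # -> [0, 2]
```

The top-k indices can then be used to pass the corresponding passages to a downstream generator.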
## Usage

### Sentence Transformers
```python
# Requires transformers>=4.51.0
# Requires sentence-transformers>=2.7.0
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ufca-llms/jua-4B-mixed")

queries = [
    "Instruct: Given a Brazilian legal search query, retrieve relevant legal passages or documents.\nQuery: aposentadoria por pensão estatutária",
    "Instruct: Given a Brazilian legal search query, retrieve relevant legal passages or documents.\nQuery: por que dividir um país em estados?",
]
documents = [
    "O art. 5º da Lei 9.717/1998 trata do regime previdenciário dos servidores públicos.",
    "A divisão de um país em estados distribui competências administrativas e políticas em sistemas federativos.",
]

query_embeddings = model.encode(queries)
document_embeddings = model.encode(documents)

similarity = model.similarity(query_embeddings, document_embeddings)
print(similarity)
```
### Transformers
```python
# Requires transformers>=4.51.0
import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoModel, AutoTokenizer


def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # With left padding, every sequence's final position is a real token,
    # so pooling reduces to taking the last hidden state.
    left_padding = attention_mask[:, -1].sum() == attention_mask.shape[0]
    if left_padding:
        return last_hidden_states[:, -1]
    # Otherwise, gather each sequence's last non-padding position.
    sequence_lengths = attention_mask.sum(dim=1) - 1
    batch_size = last_hidden_states.shape[0]
    return last_hidden_states[
        torch.arange(batch_size, device=last_hidden_states.device),
        sequence_lengths,
    ]


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f"Instruct: {task_description}\nQuery: {query}"


task = "Given a Brazilian legal search query, retrieve relevant legal passages or documents."

queries = [
    get_detailed_instruct(task, "aposentadoria por pensão estatutária"),
    get_detailed_instruct(task, "por que dividir um país em estados?"),
]
documents = [
    "O art. 5º da Lei 9.717/1998 trata do regime previdenciário dos servidores públicos.",
    "A divisão de um país em estados distribui competências administrativas e políticas em sistemas federativos.",
]
input_texts = queries + documents

tokenizer = AutoTokenizer.from_pretrained("ufca-llms/jua-4B-mixed", padding_side="left")
model = AutoModel.from_pretrained("ufca-llms/jua-4B-mixed")

batch_dict = tokenizer(
    input_texts,
    padding=True,
    truncation=True,
    max_length=8192,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    outputs = model(**batch_dict)

embeddings = last_token_pool(outputs.last_hidden_state, batch_dict["attention_mask"])
embeddings = F.normalize(embeddings, p=2, dim=1)

# Query-to-document cosine similarities (embeddings are L2-normalized).
scores = embeddings[: len(queries)] @ embeddings[len(queries) :].T
print(scores.tolist())
```
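As a quick intuition check for `last_token_pool`: with left padding, the final position of every row is a real token, so pooling is just indexing position `-1`. A toy illustration with made-up 2-d hidden states (no model required):

```python
# Toy check of the left-padding fast path in last_token_pool (values made up).
hidden = [
    [[0, 0], [0, 0], [1, 1], [2, 2]],  # padding at positions 0-1
    [[0, 0], [3, 3], [4, 4], [5, 5]],  # padding at position 0
]
mask = [
    [0, 0, 1, 1],
    [0, 1, 1, 1],
]
# Left padding => every row's last mask entry is 1, so the final hidden
# state is always a real token and pooling reduces to index -1.
assert all(row[-1] == 1 for row in mask)
pooled = [seq[-1] for seq in hidden]
print(pooled)  # -> [[2, 2], [5, 5]]
```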
## Evaluation

### JUÁ + Quati
The table below reproduces the results reported in the paper for the mixed condition over the five legal datasets in the JUÁ evaluation environment plus Quati.
| Dataset | NDCG@10 | MRR@10 | MAP@10 |
|---|---|---|---|
| JUÁ-Juris | 0.290 | 0.230 | 0.231 |
| JurisTCU | 0.363 | 0.641 | 0.170 |
| NormasTCU | 0.305 | 0.474 | 0.184 |
| Ulysses-RFCorpus | 0.441 | 0.624 | 0.315 |
| BR-TaxQA-R | 0.777 | 0.800 | 0.701 |
| Quati | 0.503 | 0.799 | 0.247 |
| Average | 0.447 | 0.595 | 0.308 |
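Assuming the Average row is the unweighted mean over the six datasets, it can be reproduced directly from the per-dataset rows:

```python
# Unweighted means over the six per-dataset rows in the table above.
ndcg = [0.290, 0.363, 0.305, 0.441, 0.777, 0.503]
mrr = [0.230, 0.641, 0.474, 0.624, 0.800, 0.799]
map10 = [0.231, 0.170, 0.184, 0.315, 0.701, 0.247]

for name, vals in [("NDCG@10", ndcg), ("MRR@10", mrr), ("MAP@10", map10)]:
    print(f"{name}: {sum(vals) / len(vals):.4f}")
```

The means come out to 0.4465, 0.5947, and 0.3080, matching the reported 0.447 / 0.595 / 0.308 up to rounding.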
### Shared legal comparison against broader baselines
On the four legal datasets shared by all baselines in the paper's broader comparison (JUÁ-Juris, JurisTCU, NormasTCU, and BR-TaxQA-R), this model obtains:
- NDCG@10: 0.434
- MRR@10: 0.536
- MAP@10: 0.321
## Notes
- Query-side instructions are recommended.
- This model is intended as the more robust of the two adapted variants discussed in the paper.
- It preserves most of the legal-only model's legal-domain effectiveness while improving broader and more question-driven retrieval settings.
## Citation
If you use this model, please cite:
```bibtex
@misc{pereira2026domainadaptivedenseretrievalbrazilian,
  title={Domain-Adaptive Dense Retrieval for Brazilian Legal Search},
  author={Jayr Pereira and Roberto Lotufo and Luiz Bonifacio},
  year={2026},
  eprint={2605.04005},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2605.04005},
}
```