
jua-4B-mixed

jua-4B-mixed is a Brazilian Portuguese legal embedding model based on Qwen/Qwen3-Embedding-4B. It was adapted with a mixed supervision regime that combines legal-domain supervision with broader question-passage supervision, and is intended for heterogeneous retrieval settings where legal specialization and broader semantic robustness are both important.

This model is presented in the paper Domain-Adaptive Dense Retrieval for Brazilian Legal Search, where it corresponds to the mixed training condition.

Model Overview

  • Base model: Qwen/Qwen3-Embedding-4B
  • Model type: text embedding
  • Primary language: Brazilian Portuguese
  • Intended use: dense retrieval for Brazilian legal search
  • Training profile: mixed supervision

The mixed training regime uses:

  • JUÁ-Juris training pairs
  • Ulysses-derived legislative supervision
  • a small synthetic legislative extension based on alternative query formulations
  • SQuAD-pt as a broader question-passage supervision source
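The paper describes the exact sampling strategy; as a rough, hypothetical illustration of how heterogeneous supervision sources can be combined into one training stream, a simple round-robin interleaving of pair lists looks like this (the pair values below are placeholders, not real training data):

```python
def interleave_sources(*sources):
    """Round-robin over several (query, passage) pair lists so that the
    combined stream alternates between supervision sources until each
    source is exhausted."""
    active = [iter(s) for s in sources]
    mixed = []
    while active:
        still_active = []
        for it in active:
            try:
                mixed.append(next(it))
                still_active.append(it)
            except StopIteration:
                pass
        active = still_active
    return mixed

# Placeholder pairs standing in for legal-domain and broader QA supervision.
legal_pairs = [("q_juris_1", "p1"), ("q_juris_2", "p2")]
general_pairs = [("q_squad_1", "s1"), ("q_squad_2", "s2"), ("q_squad_3", "s3")]
print(interleave_sources(legal_pairs, general_pairs))
# → [('q_juris_1', 'p1'), ('q_squad_1', 's1'), ('q_juris_2', 'p2'), ('q_squad_2', 's2'), ('q_squad_3', 's3')]
```

This is only a sketch of the mixing idea; the actual training recipe (loss, batching, sampling ratios) is documented in the paper.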

Intended Use

This model is best suited for:

  • heterogeneous legal retrieval
  • question-driven legal search
  • retrieval setups that need a stronger balance between legal specialization and semantic robustness
  • upstream retrieval for RAG-like legal pipelines

If your use case is narrowly specialized and institutionally framed, the legal-only model may be preferable:

  • ufca-llms/jua-4B-legal-only

Usage

Sentence Transformers

# Requires transformers>=4.51.0
# Requires sentence-transformers>=2.7.0

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ufca-llms/jua-4B-mixed")

queries = [
    "Instruct: Given a Brazilian legal search query, retrieve relevant legal passages or documents.\nQuery: aposentadoria por pensão estatutária",
    "Instruct: Given a Brazilian legal search query, retrieve relevant legal passages or documents.\nQuery: por que dividir um país em estados?",
]

documents = [
    "O art. 5º da Lei 9.717/1998 trata do regime previdenciário dos servidores públicos.",
    "A divisão de um país em estados distribui competências administrativas e políticas em sistemas federativos.",
]

query_embeddings = model.encode(queries)
document_embeddings = model.encode(documents)

similarity = model.similarity(query_embeddings, document_embeddings)
print(similarity)
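`model.similarity` computes pairwise scores between the query and document embedding matrices; for this model family the configured similarity function is typically cosine similarity. As a toy illustration of what that computation does (not the library internals), in plain NumPy:

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between the row vectors of a and b:
    normalize each row to unit length, then take dot products."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Two toy 2-D "query" and "document" embeddings.
q = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([[1.0, 0.0], [1.0, 1.0]])
print(cosine_similarity_matrix(q, d))
```

Each entry (i, j) of the result scores query i against document j, exactly the shape of matrix the snippet above prints.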

Transformers

# Requires transformers>=4.51.0

import torch
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoModel, AutoTokenizer


def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    sequence_lengths = attention_mask.sum(dim=1) - 1
    batch_size = last_hidden_states.shape[0]
    return last_hidden_states[
        torch.arange(batch_size, device=last_hidden_states.device),
        sequence_lengths,
    ]


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f"Instruct: {task_description}\nQuery: {query}"


task = "Given a Brazilian legal search query, retrieve relevant legal passages or documents."
queries = [
    get_detailed_instruct(task, "aposentadoria por pensão estatutária"),
    get_detailed_instruct(task, "por que dividir um país em estados?"),
]

documents = [
    "O art. 5º da Lei 9.717/1998 trata do regime previdenciário dos servidores públicos.",
    "A divisão de um país em estados distribui competências administrativas e políticas em sistemas federativos.",
]

input_texts = queries + documents

tokenizer = AutoTokenizer.from_pretrained(
    "ufca-llms/jua-4B-mixed",
    padding_side="left",
)
model = AutoModel.from_pretrained("ufca-llms/jua-4B-mixed")

batch_dict = tokenizer(
    input_texts,
    padding=True,
    truncation=True,
    max_length=8192,
    return_tensors="pt",
)
batch_dict = batch_dict.to(model.device)

with torch.no_grad():
    outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict["attention_mask"])
embeddings = F.normalize(embeddings, p=2, dim=1)

scores = embeddings[: len(queries)] @ embeddings[len(queries) :].T
print(scores.tolist())
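The `last_token_pool` helper selects, for each sequence, the hidden state of its final non-padding token. With left padding that is always the last column; with right padding the index is `sum(attention_mask) - 1` per row. A toy sketch of the right-padded case, using plain lists instead of tensors:

```python
def last_token_indices(attention_mask):
    """For right-padded batches, the embedding comes from the hidden state
    at the last non-padding position: index sum(mask) - 1 per row."""
    return [sum(row) - 1 for row in attention_mask]

mask = [
    [1, 1, 1, 0, 0],  # 3 real tokens -> pool from index 2
    [1, 1, 1, 1, 1],  # full sequence -> pool from index 4
]
print(last_token_indices(mask))  # → [2, 4]
```

This is why the tokenizer is loaded with `padding_side="left"` above: left padding lets the pooling step simply take the last position for every row.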

Evaluation

JUÁ + Quati

The table below reproduces the mixed results reported in the paper over the five legal datasets in the JUÁ evaluation environment plus Quati.

Dataset             NDCG@10   MRR@10   MAP@10
JUÁ-Juris             0.290    0.230    0.231
JurisTCU              0.363    0.641    0.170
NormasTCU             0.305    0.474    0.184
Ulysses-RFCorpus      0.441    0.624    0.315
BR-TaxQA-R            0.777    0.800    0.701
Quati                 0.503    0.799    0.247
Average               0.447    0.595    0.308
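The Average row is consistent with a plain macro-average (unweighted mean) of the six per-dataset scores, which can be checked directly:

```python
# Per-dataset scores in the order: JUÁ-Juris, JurisTCU, NormasTCU,
# Ulysses-RFCorpus, BR-TaxQA-R, Quati.
ndcg = [0.290, 0.363, 0.305, 0.441, 0.777, 0.503]
mrr  = [0.230, 0.641, 0.474, 0.624, 0.800, 0.799]
map_ = [0.231, 0.170, 0.184, 0.315, 0.701, 0.247]

# Each macro-average matches the reported Average row to three decimals.
for name, vals in [("NDCG@10", ndcg), ("MRR@10", mrr), ("MAP@10", map_)]:
    print(name, sum(vals) / len(vals))
```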

Shared legal comparison against broader baselines

On the four legal datasets shared by all baselines in the paper's broader comparison (JUÁ-Juris, JurisTCU, NormasTCU, and BR-TaxQA-R), this model obtains:

  • NDCG@10: 0.434
  • MRR@10: 0.536
  • MAP@10: 0.321

Notes

  • Query-side instructions are recommended.
  • This model is intended as the more robust of the two adapted variants discussed in the paper.
  • It preserves most of the legal-only model's legal-domain effectiveness while improving broader and more question-driven retrieval settings.

Citation

If you use this model, please cite:

@misc{pereira2026domainadaptivedenseretrievalbrazilian,
      title={Domain-Adaptive Dense Retrieval for Brazilian Legal Search}, 
      author={Jayr Pereira and Roberto Lotufo and Luiz Bonifacio},
      year={2026},
      eprint={2605.04005},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2605.04005}, 
}