⚖️🇮🇹​ Legal-Embedding-Ita-0.6b

Legal-Embedding-Ita-0.6b is a Sentence Transformers embedding model fine-tuned from Qwen/Qwen3-Embedding-0.6B for Italian retrieval tasks, with a particular focus on the legal domain.

The model maps queries and documents into a 1024-dimensional dense vector space and is designed for semantic search, retrieval-augmented generation (RAG), document ranking, and legal information retrieval in Italian.

A GGUF version of this model is also available.

⚠️ DISCLAIMER

This model has been created for research purposes. It is under no circumstances intended for use in production environments. By using this model, you accept all liability.


Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base Model: Qwen/Qwen3-Embedding-0.6B
  • Fine-tuned Model: Legal-Embedding-Ita-0.6b
  • Organization: ReDiX
  • Language: Italian
  • Primary Domain: Legal
  • Maximum Sequence Length: 1024 tokens
  • Output Dimensionality: 1024
  • Similarity Function: Cosine Similarity
  • Supported Modality: Text
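Because the model's pipeline ends with a normalization step (see the architecture listing further down), cosine similarity between two embeddings reduces to a plain dot product. The following is a minimal sketch of that relationship using toy 3-dimensional vectors; the real model outputs 1024 dimensions, and the vectors and helper names here are illustrative assumptions only.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """Cosine similarity for arbitrary non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for 1024-dimensional embeddings (hypothetical values).
query = [0.2, 0.7, 0.1]
passage = [0.1, 0.9, 0.3]

q_unit, p_unit = l2_normalize(query), l2_normalize(passage)
dot_of_units = sum(x * y for x, y in zip(q_unit, p_unit))

# For unit-length vectors the dot product equals the cosine similarity.
print(math.isclose(dot_of_units, cosine_similarity(query, passage)))
```

In practice this means a vector index configured for inner-product search should rank passages in the same order as one configured for cosine similarity, as long as the stored embeddings are the normalized model outputs.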

Intended Use

This model is intended for Italian text retrieval tasks, especially in legal and domain-specific RAG pipelines.

Recommended use cases:

  • Legal semantic search
  • Italian legal document retrieval
  • Retrieval-augmented generation over Italian documents
  • Dense retrieval benchmarking
  • Domain-specific document ranking
  • Question-answering retrieval pipelines

This model is not a generative language model. It only produces dense embeddings.


Performance Summary

The model was evaluated against the base qwen3-embedding-0.6b model on Italian MTEB datasets and internal ReDiX domain-specific retrieval benchmarks.

The main improvement is observed on the legal retrieval dataset.

Key Result

On the legal MTEB dataset MuPLeR-retrieval, Legal-Embedding-Ita-0.6b outperforms the base qwen3-embedding-0.6b model by:

+11.45% main score (absolute difference, from 0.76233 to 0.87685)

This indicates a clear gain in Italian legal retrieval performance after fine-tuning.


MTEB Results — Italian Datasets

| Dataset | qwen3-embedding-0.6b | Legal-Embedding-Ita-0.6b | Difference |
|---|---|---|---|
| MintakaRetrieval | 0.36852 | 0.36433 | -0.4% |
| MKQARetrieval | 0.10112 | 0.09974 | -0.13% |
| MuPLeR-retrieval (Legal) | 0.76233 | 0.87685 | +11.45% |
| WikipediaRetrievalMultilingual | 0.88135 | 0.90066 | +1.931% |
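The Difference column appears to be the absolute gap between the two scores multiplied by 100 (percentage points), not a relative change. The sketch below recomputes both quantities from the table values so the two readings can be compared; the function names are illustrative, and small deviations from the published column are to be expected because the table scores are themselves rounded.

```python
# Scores copied from the table above: (base model, fine-tuned model).
scores = {
    "MintakaRetrieval": (0.36852, 0.36433),
    "MKQARetrieval": (0.10112, 0.09974),
    "MuPLeR-retrieval": (0.76233, 0.87685),
    "WikipediaRetrievalMultilingual": (0.88135, 0.90066),
}

def point_difference(base, tuned):
    """Absolute difference in score, expressed in percentage points."""
    return (tuned - base) * 100

def relative_change(base, tuned):
    """Change relative to the base model's score, in percent."""
    return (tuned - base) / base * 100

for name, (base, tuned) in scores.items():
    points = point_difference(base, tuned)
    relative = relative_change(base, tuned)
    print(f"{name}: {points:+.2f} points, {relative:+.2f}% relative")
```

For scores below 1.0 and a positive gain, the relative change is always larger than the point difference, so the choice of reading matters when comparing against other model cards.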

ReDiX Domain Benchmark Results

Evaluation metric: nDCG@10.

| Domain | qwen3-embedding-0.6b | Legal-Embedding-Ita-0.6b | Difference |
|---|---|---|---|
| Legal | 0.6281 | 0.6751 | +4.70% |
| Finance | 0.6155 | 0.6819 | +6.64% |
| Medical | 0.5855 | 0.6243 | +3.87% |
| STEM | 0.6807 | 0.7258 | +4.51% |
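For readers unfamiliar with the metric, the following is a minimal sketch of nDCG@10 for a single query. It assumes the common formulation with linear gain and a log2 rank discount; the exact evaluation harness behind the numbers above is not described in this card, so treat the function names and the toy relevance labels as illustrative.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k relevance labels, in ranked order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the actual ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# Toy ranking: the only relevant passage was returned in third position.
print(ndcg_at_k([0, 0, 1, 0, 0]))  # 0.5
```

A benchmark score such as those in the table would then be the mean of this value over all evaluation queries.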

Although the model improves across all internal ReDiX benchmark domains, it should primarily be considered a legal-focused Italian embedding model, since the fine-tuning process was designed around Italian legal retrieval.


Full Model Architecture

SentenceTransformer(
  (0): Transformer({
      'transformer_task': 'feature-extraction',
      'modality_config': {
          'text': {
              'method': 'forward',
              'method_output_name': 'last_hidden_state'
          }
      },
      'module_output_name': 'token_embeddings',
      'architecture': 'Qwen3Model'
  })
  (1): Pooling({
      'embedding_dimension': 1024,
      'pooling_mode': 'lasttoken',
      'include_prompt': True
  })
  (2): Normalize({})
)
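The last two modules can be illustrated without any deep learning framework. The sketch below shows last-token pooling followed by L2 normalization for one toy sequence; the 2-dimensional token vectors, the mask, and the helper names are assumptions made for illustration and do not reproduce the library's internal implementation.

```python
import math

def last_token_pool(token_embeddings, attention_mask):
    """Return the embedding of the last non-padding token.

    token_embeddings: list of per-token vectors for one sequence.
    attention_mask:   list of 1 (real token) / 0 (padding), same length.
    """
    last_index = max(i for i, keep in enumerate(attention_mask) if keep)
    return token_embeddings[last_index]

def l2_normalize(vec):
    """Scale a vector to unit length, as the Normalize module does conceptually."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Toy sequence: three real tokens followed by one padding position.
tokens = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

sentence_embedding = l2_normalize(last_token_pool(tokens, mask))
print(sentence_embedding)  # [0.6, 0.8]
```

Selecting the index through the attention mask makes the sketch work for both left-padded and right-padded batches; with left padding the selected position is simply the final one.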

Usage

Installation

pip install -U sentence-transformers

Direct Usage with Sentence Transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ReDiX/Legal-Embedding-Ita-0.6b")

queries = [
    "Qual è il lasso di tempo obbligatorio prima di ripresentare una domanda di adesione al codice dopo un rifiuto?"
]

documents = [
    "L’eventuale mancata conferma della adesione al Codice di condotta presentata da parte di un Produttore del Software deve essere motivata da parte dell’OdM, fermo restando che tale diniego non preclude la possibilità per il Produttore di successiva presentazione della domanda di adesione che può avvenire non prima di un anno unitamente ad una breve nota che illustri le misure adottate per superare le ragioni che avevano condotto al precedente diniego.",
    "La Corte costituzionale ha affrontato il tema delle intercettazioni indirette relative ai parlamentari, distinguendo tra intercettazioni fortuite e mirate.",
    "Il danneggiato è tenuto a dimostrare davanti al giudice civile la sussistenza del nesso di causalità tra condotta e danno e a quantificare quest’ultimo."
]

query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)

similarities = model.similarity(query_embeddings, document_embeddings)

print(similarities)
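To turn one row of the similarity matrix into a ranked result list, the scores can be sorted together with their passages. The helper below is a small sketch in plain Python; rank_documents, the example scores, and the placeholder passages are hypothetical and not part of the Sentence Transformers API.

```python
def rank_documents(scores, documents, top_k=3):
    """Pair each document with its score and return the top_k pairs, best first."""
    ranked = sorted(zip(scores, documents), key=lambda pair: pair[0], reverse=True)
    return ranked[:top_k]

# Hypothetical scores for one query against three passages. With the snippet
# above, such a row could be obtained from `similarities[0].tolist()`.
example_scores = [0.71, 0.18, 0.25]
example_docs = ["passage A", "passage B", "passage C"]

for score, doc in rank_documents(example_scores, example_docs, top_k=2):
    print(f"{score:.2f}  {doc}")
```

For large collections a vector index would normally replace the full sort, but the ranking logic stays the same.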

Prompt Format

Since the model is based on Qwen3 Embedding, asymmetric retrieval prompting is recommended.

Query Prompt

Instruct: Given an Italian legal search query, retrieve the most relevant legal passage that answers the query.
Query:

Document Prompt

The document prompt is intentionally empty.
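The asymmetric format can be sketched as two small helpers: queries receive the instruction prefix, documents are passed through unchanged. The prompt string matches the one listed in the training hyperparameters of this card; the helper names and the example query are illustrative assumptions.

```python
QUERY_PROMPT = (
    "Instruct: Given an Italian legal search query, retrieve the most relevant "
    "legal passage that answers the query.\nQuery: "
)
DOCUMENT_PROMPT = ""  # documents are embedded without any prefix

def format_query(query: str) -> str:
    """Prepend the instruction prefix used for queries."""
    return QUERY_PROMPT + query

def format_document(document: str) -> str:
    """Documents are left unchanged because the document prompt is empty."""
    return DOCUMENT_PROMPT + document

# Hypothetical query, for illustration only.
print(format_query("Chi deve provare il nesso di causalità?"))
```

When encoding with Sentence Transformers, a prefix like this can be supplied through the `prompt` argument of `encode`. Whether `encode_query` already applies it automatically depends on the prompts stored in the model configuration, which is not verified here.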

Training Details

Training Dataset

The model was fine-tuned on an Italian retrieval dataset containing:

  • 244,907 training samples
  • Columns: anchor, positive, negative
  • Approximately 120,000 strictly legal-related samples
  • Training format: query, relevant passage, hard negative passage

The evaluation dataset contains:

  • 3,310 evaluation samples
  • Columns: anchor, positive
  • Domain: Italian legal retrieval

Training Loss

The model was trained using a patched version of CachedGISTEmbedLoss.

{
  "guide": "SentenceTransformer('intfloat/multilingual-e5-large-instruct')",
  "temperature": 0.01,
  "mini_batch_size": 32,
  "margin_strategy": "absolute",
  "margin": 0.0,
  "contrast_anchors": true,
  "contrast_positives": false,
  "gather_across_devices": false
}
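The configuration above follows the guided in-batch negative idea behind GISTEmbedLoss: a guide model's similarities are used to drop candidates that look at least as relevant to the anchor as the labelled positive does, and the remaining scores pass through a temperature-scaled softmax. The sketch below is a simplified single-anchor illustration in plain Python, not the patched loss used for training. The function gist_masked_loss and all similarity values are hypothetical, and caching as well as the contrast_anchors term are left out.

```python
import math

def gist_masked_loss(student_sims, guide_sims, pos_idx, temperature=0.01, margin=0.0):
    """Cross-entropy over candidates, skipping likely false negatives.

    student_sims: similarities from the model being trained (anchor vs. candidates).
    guide_sims:   similarities from the guide model for the same candidates.
    pos_idx:      index of the labelled positive among the candidates.
    """
    # "absolute" margin strategy: mask candidates the guide rates above this value.
    threshold = guide_sims[pos_idx] - margin
    logits = []
    for i, (student, guide) in enumerate(zip(student_sims, guide_sims)):
        if i != pos_idx and guide > threshold:
            continue  # treated as a probable false negative and ignored
        logits.append(student / temperature)
    # Numerically stable log-sum-exp of the remaining logits.
    max_logit = max(logits)
    log_denominator = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_denominator - student_sims[pos_idx] / temperature

# Toy values: candidate 0 is the positive, candidate 2 is a hard in-batch negative.
student = [0.9, 0.2, 0.8]
guide_masks_candidate_2 = [0.85, 0.10, 0.95]  # guide prefers candidate 2, so it is masked
guide_keeps_all = [0.85, 0.10, 0.50]          # guide agrees with the label, nothing masked

loss_masked = gist_masked_loss(student, guide_masks_candidate_2, pos_idx=0)
loss_unmasked = gist_masked_loss(student, guide_keeps_all, pos_idx=0)
print(loss_masked, loss_unmasked)
```

With a temperature as low as 0.01, similarity differences are amplified a hundredfold before the softmax, so a single unmasked hard negative can dominate the loss. This is one reason filtering false negatives with a guide model can matter.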

Training Hyperparameters

Non-Default Hyperparameters
  • learning_rate: 2e-06
  • lr_scheduler_type: cosine
  • warmup_steps: 0.03
  • weight_decay: 0.01
  • gradient_accumulation_steps: 4
  • bf16: True
  • load_best_model_at_end: True
  • data_seed: 42
  • dataloader_num_workers: 4
  • remove_unused_columns: False
  • prompts: {'anchor': 'Instruct: Given an Italian legal search query, retrieve the most relevant legal passage that answers the query.\nQuery: ', 'positive': '', 'negative': ''}
  • batch_sampler: no_duplicates
All Hyperparameters
  • per_device_train_batch_size: 8
  • num_train_epochs: 3.0
  • max_steps: -1
  • learning_rate: 2e-06
  • lr_scheduler_type: cosine
  • lr_scheduler_kwargs: None
  • warmup_steps: 0.03
  • optim: adamw_torch_fused
  • optim_args: None
  • weight_decay: 0.01
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • optim_target_modules: None
  • gradient_accumulation_steps: 4
  • average_tokens_across_devices: True
  • max_grad_norm: 1.0
  • label_smoothing_factor: 0.0
  • bf16: True
  • fp16: False
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • use_liger_kernel: False
  • liger_kernel_config: None
  • use_cache: False
  • neftune_noise_alpha: None
  • torch_empty_cache_steps: None
  • auto_find_batch_size: False
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • include_num_input_tokens_seen: no
  • log_level: passive
  • log_level_replica: warning
  • disable_tqdm: False
  • project: huggingface
  • trackio_space_id: trackio
  • per_device_eval_batch_size: 8
  • prediction_loss_only: True
  • eval_on_start: False
  • eval_do_concat_batches: True
  • eval_use_gather_object: False
  • eval_accumulation_steps: None
  • include_for_metrics: []
  • batch_eval_metrics: False
  • save_only_model: False
  • save_on_each_node: False
  • enable_jit_checkpoint: False
  • push_to_hub: False
  • hub_private_repo: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_always_push: False
  • hub_revision: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • restore_callback_states_from_checkpoint: False
  • full_determinism: False
  • seed: 42
  • data_seed: 42
  • use_cpu: False
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • dataloader_drop_last: False
  • dataloader_num_workers: 4
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • dataloader_prefetch_factor: None
  • remove_unused_columns: False
  • label_names: None
  • train_sampling_strategy: random
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • ddp_backend: None
  • ddp_timeout: 1800
  • fsdp: []
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • deepspeed: None
  • debug: []
  • skip_memory_metrics: True
  • do_predict: False
  • resume_from_checkpoint: None
  • warmup_ratio: None
  • local_rank: -1
  • prompts: {'anchor': 'Instruct: Given an Italian legal search query, retrieve the most relevant legal passage that answers the query.\nQuery: ', 'positive': '', 'negative': ''}
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional
  • router_mapping: {}
  • learning_rate_mapping: {}
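The learning-rate settings listed above (learning_rate 2e-06, cosine scheduler, warmup_steps 0.03) imply a schedule that can be sketched as follows. The sketch assumes the fractional warmup value is read as a share of the total optimizer steps and that the cosine phase decays to zero. The total step count of 1000 is a placeholder, since the real number depends on dataset size, batch size, and device count.

```python
import math

def lr_at_step(step, total_steps, peak_lr=2e-06, warmup=0.03):
    """Learning rate at an optimizer step: linear warmup, then cosine decay to zero."""
    # Values below 1 are treated as a fraction of the total steps (assumption).
    warmup_steps = math.ceil(total_steps * warmup) if warmup < 1 else int(warmup)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

total = 1000  # hypothetical total number of optimizer steps
for step in (0, 15, 30, 515, 1000):
    print(step, f"{lr_at_step(step, total):.2e}")
```

The rate climbs linearly to its peak at the end of warmup, then falls along a half cosine until it reaches zero at the final step.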

Training Time

  • Training: 1.3 days
  • Evaluation: 7.2 hours
  • Total: 1.6 days

Framework Versions

  • Python: 3.12.3
  • Sentence Transformers: 5.4.1
  • Transformers: 5.5.4
  • PyTorch: 2.11.0+cu130
  • Accelerate: 1.13.0
  • Datasets: 4.8.4
  • Tokenizers: 0.22.2

Limitations

  • The model is optimized primarily for Italian legal retrieval.
  • Performance gains on non-legal datasets should be interpreted cautiously.
  • The model may underperform on domains or languages not represented in the fine-tuning data.
  • The model does not generate answers; it only produces embeddings for retrieval.
  • Legal retrieval results do not imply legal correctness or legal advice.

Citation

BibTeX

Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}