---
language: pt
license: apache-2.0
tags:
- information-retrieval
- sparse-retrieval
- splade
- portuguese
- bert
datasets:
- unicamp-dl/mmarco
- unicamp-dl/mrobust
base_model: neuralmind/bert-base-portuguese-cased
---
# SPLADE-PT-BR
SPLADE (SParse Lexical AnD Expansion) model fine-tuned for **Portuguese** text retrieval. Based on [BERTimbau](https://huggingface.co/neuralmind/bert-base-portuguese-cased) and trained on the Portuguese portion of the mMARCO passage-ranking dataset.
**GitHub Repository**: https://github.com/AxelPCG/SPLADE-PT-BR
## Model Description
SPLADE is a neural retrieval model that learns to expand queries and documents with contextually relevant terms while maintaining sparsity. Unlike dense retrievers, SPLADE produces sparse vectors (typically ~99% sparse) that are:
- **Interpretable**: Each dimension corresponds to a vocabulary token
- **Efficient**: Can use inverted indexes for fast retrieval (see the toy sketch after this list)
- **Effective**: Combines lexical matching with semantic expansion
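To make the inverted-index point concrete, here is a toy sketch of how sparse query and document vectors can be scored by iterating only over the terms the query activates. The token weights and document IDs below are made up for illustration; real vectors come from the model (see Usage):
```python
# Toy illustration: sparse vectors as {token: weight} maps scored via an
# inverted index. Weights are fabricated, not actual model output.
query_vec = {"capital": 1.8, "brasil": 1.5, "brasilia": 0.9}

inverted_index = {
    "capital":  {"doc1": 1.2, "doc3": 0.4},
    "brasil":   {"doc1": 0.9, "doc2": 1.1},
    "brasilia": {"doc1": 1.6},
}

scores = {}
for term, q_weight in query_vec.items():
    for doc_id, d_weight in inverted_index.get(term, {}).items():
        scores[doc_id] = scores.get(doc_id, 0.0) + q_weight * d_weight

# Only the postings of the query's active terms are ever touched
print(sorted(scores.items(), key=lambda s: s[1], reverse=True))
```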
### Key Features
- **Base Model**: `neuralmind/bert-base-portuguese-cased` (BERTimbau)
- **Vocabulary Size**: 29,794 tokens (Portuguese-optimized)
- **Training Iterations**: 150,000
- **Final Training Loss**: 0.000047
- **Sparsity**: ~99.5% (100-150 active dimensions per vector)
- **Max Sequence Length**: 256 tokens
## Training Details
### Training Data
- **Training Dataset**: mMARCO Portuguese (`unicamp-dl/mmarco`)
- **Validation Dataset**: mRobust (`unicamp-dl/mrobust`)
- **Format**: Triplets (query, positive document, negative document)
### Training Configuration
```yaml
Learning Rate: 2e-5
Batch Size: 8 (effective: 32 with gradient accumulation)
Gradient Accumulation Steps: 4
Weight Decay: 0.01
Warmup Steps: 6,000
Mixed Precision: FP16
Optimizer: AdamW
```
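For illustration, a minimal sketch of this optimization setup (AdamW, linear warmup, gradient accumulation of 4 micro-batches for an effective batch size of 32, FP16). A dummy linear model and placeholder loss stand in for the SPLADE encoder and triplet loss; this is not the actual training script:
```python
import torch
from torch import nn
from transformers import get_linear_schedule_with_warmup

use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"

model = nn.Linear(768, 1).to(device)  # stand-in for the SPLADE encoder
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=6_000, num_training_steps=150_000)
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)  # FP16 only on GPU
accum_steps = 4  # 8 x 4 = effective batch size 32

for step in range(8):  # toy loop; the actual training runs 150k iterations
    batch = torch.randn(8, 768, device=device)  # micro-batch of size 8
    with torch.cuda.amp.autocast(enabled=use_cuda):
        loss = model(batch).pow(2).mean() / accum_steps  # placeholder loss
    scaler.scale(loss).backward()
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
        scheduler.step()
```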
### Regularization
FLOPS regularization is applied to enforce sparsity:
- **Lambda Query**: 0.0003 (higher weight, so query vectors are sparser)
- **Lambda Document**: 0.0001 (lower weight, keeping document vectors less sparse for better recall)
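For reference, a minimal sketch of the FLOPS penalty as defined in the SPLADE papers, applied separately to query and document representations with the lambdas above (an illustration, not the exact training code; variable names are hypothetical):
```python
import torch

def flops_loss(reps: torch.Tensor) -> torch.Tensor:
    """FLOPS regularizer: sum over vocabulary dims of the squared mean activation.

    reps: (batch_size, vocab_size) non-negative SPLADE representations.
    """
    return torch.sum(torch.mean(torch.abs(reps), dim=0) ** 2)

# Combined with the weights above (q_reps / d_reps are hypothetical batches):
# reg = 0.0003 * flops_loss(q_reps) + 0.0001 * flops_loss(d_reps)
```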
## Performance
**Dataset**: mRobust (528k docs, 250 queries)
| Metric | Score |
|--------|-------|
| **MRR@10** | **0.453** |
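For reference, MRR@10 is the mean over queries of the reciprocal rank of the first relevant document within the top 10 results; a minimal computation sketch:
```python
def mrr_at_10(rankings, relevant):
    """rankings: one ranked list of doc ids per query.
    relevant: one set of relevant doc ids per query, aligned with rankings."""
    total = 0.0
    for ranked_ids, rel_ids in zip(rankings, relevant):
        for rank, doc_id in enumerate(ranked_ids[:10], start=1):
            if doc_id in rel_ids:
                total += 1.0 / rank
                break
    return total / len(rankings)

print(mrr_at_10([["d2", "d7", "d1"]], [{"d7"}]))  # 0.5
```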
## Usage
### Installation
```bash
pip install torch transformers
```
### Basic Usage
**Option 1: Using HuggingFace Hub (Recommended)**
```python
import torch
from transformers import AutoTokenizer
from modeling_splade import Splade  # custom Splade class (see the model repository)

# Load model and tokenizer
model = Splade.from_pretrained("AxelPCG/splade-pt-br")
tokenizer = AutoTokenizer.from_pretrained("AxelPCG/splade-pt-br")
model.eval()

# Encode a query
query = "Qual é a capital do Brasil?"
with torch.no_grad():
    query_tokens = tokenizer(query, return_tensors="pt", max_length=256, truncation=True)
    query_vec = model(q_kwargs=query_tokens)["q_rep"].squeeze()

# Encode a document
document = "Brasília é a capital federal do Brasil desde 1960."
with torch.no_grad():
    doc_tokens = tokenizer(document, return_tensors="pt", max_length=256, truncation=True)
    doc_vec = model(d_kwargs=doc_tokens)["d_rep"].squeeze()

# Calculate similarity (dot product)
similarity = torch.dot(query_vec, doc_vec).item()
print(f"Similarity: {similarity:.4f}")

# Get sparse representation
indices = torch.nonzero(query_vec).squeeze().tolist()
values = query_vec[indices].tolist()
print(f"Active dimensions: {len(indices)} / {query_vec.shape[0]}")
```
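Since each dimension corresponds to a vocabulary token, the expansion terms of a query can be inspected directly. Continuing from the snippet above (reusing `query_vec` and `tokenizer`):
```python
# Inspect the highest-weighted expansion terms of the query vector
weights, ids = torch.topk(query_vec, k=10)
tokens = tokenizer.convert_ids_to_tokens(ids.tolist())
for token, weight in zip(tokens, weights.tolist()):
    print(f"{token}: {weight:.3f}")
```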
**Option 2: Using SPLADE Library**
```python
from splade.models.transformer_rep import Splade
from transformers import AutoTokenizer
# Load model by pointing to HuggingFace repo
model = Splade(model_type_or_dir="AxelPCG/splade-pt-br", agg="max", fp16=True)
tokenizer = AutoTokenizer.from_pretrained("AxelPCG/splade-pt-br")
```
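Encoding then follows the same `q_kwargs`/`d_kwargs` convention as Option 1; the sketch below assumes the standard naver/splade forward interface:
```python
import torch

query_tokens = tokenizer("Qual é a capital do Brasil?", return_tensors="pt",
                         max_length=256, truncation=True)
with torch.no_grad():
    query_vec = model(q_kwargs=query_tokens)["q_rep"].squeeze()
```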
## Limitations and Bias
- Model trained on machine-translated Portuguese data (mMARCO)
- May not capture all socio-cultural aspects of native Brazilian Portuguese
- Performance may vary on domain-specific tasks
- Inherits biases from BERTimbau base model and training data
## Citation
```bibtex
@misc{splade-pt-br-2025,
author = {Axel Chepanski},
title = {SPLADE-PT-BR: Sparse Retrieval for Portuguese},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/AxelPCG/splade-pt-br}
}
```
## Acknowledgments
- **SPLADE** by NAVER Labs and [leobavila/splade](https://github.com/leobavila/splade) fork
- **BERTimbau** by Neuralmind
- **mMARCO & mRobust Portuguese** by UNICAMP-DL
- The **Quati** dataset research, an inspiration for native Portuguese IR
## License
Apache 2.0