---
language: pt
license: apache-2.0
tags:
- information-retrieval
- sparse-retrieval
- splade
- portuguese
- bert
datasets:
- unicamp-dl/mmarco
- unicamp-dl/mrobust
base_model: neuralmind/bert-base-portuguese-cased
---

# SPLADE-PT-BR

A SPLADE (SParse Lexical AnD Expansion) model fine-tuned for **Portuguese** text retrieval. It is based on [BERTimbau](https://huggingface.co/neuralmind/bert-base-portuguese-cased) and trained on the Portuguese mMARCO passage-ranking dataset.

**GitHub Repository**: https://github.com/AxelPCG/SPLADE-PT-BR

## Model Description

SPLADE is a neural retrieval model that learns to expand queries and documents with contextually relevant terms while maintaining sparsity. Unlike dense retrievers, SPLADE produces sparse vectors (typically ~99% sparse) that are:
- **Interpretable**: Each dimension corresponds to a vocabulary token
- **Efficient**: Can use inverted indexes for fast retrieval
- **Effective**: Combines lexical matching with semantic expansion
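
For intuition, here is a minimal sketch of the standard SPLADE-max formulation that produces such vectors (an assumption based on the original SPLADE recipe, not code taken from this repository; the actual model is invoked under Usage below):

```python
import torch

# Sketch of SPLADE-max aggregation over the MLM-head logits of a BERT encoder.
def splade_rep(logits: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # logits: [batch, seq_len, vocab_size]; attention_mask: [batch, seq_len]
    weights = torch.log1p(torch.relu(logits))         # non-negative, log-saturated weights
    weights = weights * attention_mask.unsqueeze(-1)  # zero out padding positions
    return weights.max(dim=1).values                  # max-pool -> one weight per vocab term
```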

### Key Features

- **Base Model**: `neuralmind/bert-base-portuguese-cased` (BERTimbau)
- **Vocabulary Size**: 29,794 tokens (Portuguese-optimized)
- **Training Iterations**: 150,000
- **Final Training Loss**: 0.000047
- **Sparsity**: ~99.5% (100-150 active dimensions per vector)
- **Max Sequence Length**: 256 tokens

## Training Details

### Training Data

- **Training Dataset**: mMARCO Portuguese (`unicamp-dl/mmarco`)
- **Validation Dataset**: mRobust (`unicamp-dl/mrobust`)
- **Format**: Triplets (query, positive document, negative document)
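
For illustration, one such triplet could look like the following (field names are hypothetical, not the dataset's actual schema):

```python
triplet = {
    "query": "qual é a capital do brasil",
    "positive": "Brasília é a capital federal do Brasil desde 1960.",
    "negative": "O Rio de Janeiro foi a capital do Brasil até 1960.",
}
```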

### Training Configuration

```yaml
Learning Rate: 2e-5
Batch Size: 8 (effective: 32 with gradient accumulation)
Gradient Accumulation Steps: 4
Weight Decay: 0.01
Warmup Steps: 6,000
Mixed Precision: FP16
Optimizer: AdamW
```

### Regularization

FLOPS regularization is applied to enforce sparsity, with separate weights for queries and documents (see the sketch after this list):
- **Lambda Query**: 0.0003 (stronger penalty, so query vectors are sparser)
- **Lambda Document**: 0.0001 (weaker penalty, so document vectors stay denser for better recall)
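
A minimal sketch of the penalty (an assumption based on the published FLOPS regularizer, not code taken from this repository; the lambda values match those above):

```python
import torch

# FLOPS regularizer: squared mean activation per vocabulary dimension, summed
# over the vocabulary. Penalizing this pushes representations toward sparsity.
def flops_loss(reps: torch.Tensor) -> torch.Tensor:
    # reps: batch of SPLADE vectors, shape [batch_size, vocab_size]
    return torch.sum(torch.mean(torch.abs(reps), dim=0) ** 2)

# Illustrative random activations standing in for real query/document vectors
query_reps = torch.rand(8, 29794)
doc_reps = torch.rand(8, 29794)
regularization = 0.0003 * flops_loss(query_reps) + 0.0001 * flops_loss(doc_reps)
```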

## Performance

**Dataset**: mRobust (528k docs, 250 queries)

| Metric | Score |
|--------|-------|
| **MRR@10** | **0.453** |

## Usage

### Installation

```bash
pip install torch transformers
```

### Basic Usage

**Option 1: Using HuggingFace Hub (Recommended)**

```python
import torch
from transformers import AutoTokenizer
from modeling_splade import Splade

# Load model and tokenizer
model = Splade.from_pretrained("AxelPCG/splade-pt-br")
tokenizer = AutoTokenizer.from_pretrained("AxelPCG/splade-pt-br")
model.eval()

# Encode a query
query = "Qual é a capital do Brasil?"
with torch.no_grad():
    query_tokens = tokenizer(query, return_tensors="pt", max_length=256, truncation=True)
    query_vec = model(q_kwargs=query_tokens)["q_rep"].squeeze()

# Encode a document
document = "Brasília é a capital federal do Brasil desde 1960."
with torch.no_grad():
    doc_tokens = tokenizer(document, return_tensors="pt", max_length=256, truncation=True)
    doc_vec = model(d_kwargs=doc_tokens)["d_rep"].squeeze()

# Calculate similarity (dot product)
similarity = torch.dot(query_vec, doc_vec).item()
print(f"Similarity: {similarity:.4f}")

# Get sparse representation
indices = torch.nonzero(query_vec).squeeze().tolist()
values = query_vec[indices].tolist()
print(f"Active dimensions: {len(indices)} / {query_vec.shape[0]}")
```
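
To inspect which vocabulary terms the query was expanded to (a minimal follow-up sketch reusing `query_vec` and `tokenizer` from the example above; each dimension maps to a tokenizer vocabulary id):

```python
# Map active dimensions back to tokens and sort by weight
weights = {
    tokenizer.convert_ids_to_tokens(idx): query_vec[idx].item()
    for idx in torch.nonzero(query_vec).squeeze().tolist()
}
top_terms = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:10]
print(top_terms)  # highest-weighted (token, weight) pairs: exact matches plus expansion terms
```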

**Option 2: Using SPLADE Library**

```python
from splade.models.transformer_rep import Splade
from transformers import AutoTokenizer

# Load model by pointing to HuggingFace repo
model = Splade(model_type_or_dir="AxelPCG/splade-pt-br", agg="max", fp16=True)
tokenizer = AutoTokenizer.from_pretrained("AxelPCG/splade-pt-br")
```

## Limitations and Bias

- Model trained on machine-translated Portuguese data (mMARCO)
- May not capture all socio-cultural aspects of native Brazilian Portuguese
- Performance may vary on domain-specific tasks
- Inherits biases from BERTimbau base model and training data

## Citation

```bibtex
@misc{splade-pt-br-2025,
  author = {Axel Chepanski},
  title = {SPLADE-PT-BR: Sparse Retrieval for Portuguese},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/AxelPCG/splade-pt-br}
}
```

## Acknowledgments

- **SPLADE** by NAVER Labs, and the [leobavila/splade](https://github.com/leobavila/splade) fork
- **BERTimbau** by Neuralmind
- **mMARCO & mRobust Portuguese** by UNICAMP-DL
- **Quati Dataset** research, an inspiration for native Portuguese IR

## License

Apache 2.0