# SPLADE-PT-BR

SPLADE (Sparse Lexical AnD Expansion) model fine-tuned for **Portuguese** text retrieval. Based on [BERTimbau](https://huggingface.co/neuralmind/bert-base-portuguese-cased) and trained on Portuguese question-answering datasets.

## Model Description

SPLADE is a neural retrieval model that learns to expand queries and documents with semantically related terms, producing sparse vectors that can be served from a standard inverted index.

- **Vocabulary Size**: 29,794 tokens (Portuguese-optimized)
- **Training Iterations**: 150,000
- **Final Training Loss**: 0.000047
- **Sparsity**: ~99.5% (100-150 active dimensions per vector)
- **Max Sequence Length**: 256 tokens
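As a quick check on the sparsity figure above: 100-150 active dimensions out of the 29,794-entry vocabulary corresponds to roughly 99.5-99.7% zeros:

```python
VOCAB_SIZE = 29_794  # BERTimbau vocabulary size

def sparsity(active_dims, vocab_size=VOCAB_SIZE):
    """Fraction of exactly-zero dimensions in a sparse vector."""
    return 1.0 - active_dims / vocab_size

densest = sparsity(150)   # ~0.995 at 150 active dimensions
sparsest = sparsity(100)  # ~0.997 at 100 active dimensions
```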
## Training Details
### Training Data

- **Training Dataset**: mMARCO Portuguese (`unicamp-dl/mmarco`)
  - Created by the UNICAMP-DL team as part of their Portuguese IR research
- **Validation Dataset**: mRobust (`unicamp-dl/mrobust`) - TREC Robust04 translated to Portuguese
  - Used for validation and evaluation during training
  - Part of the UNICAMP-DL Portuguese IR datasets collection
- **Format**: Triplets (query, positive document, negative document)

**Note**: This model was inspired by research on native Portuguese information retrieval, particularly the [Quati dataset](https://arxiv.org/abs/2404.06976) work by Bueno et al. (2024), which demonstrated the importance of native Portuguese datasets over translated ones for better capturing socio-cultural aspects of Brazilian Portuguese.
### Training Configuration
```yaml
Learning Rate: 2e-5
Batch Size: 8 (effective: 32 with gradient accumulation)
Gradient Accumulation Steps: 4
Weight Decay: 0.01
Warmup Steps: 6,000
Mixed Precision: FP16
Optimizer: AdamW
```
### Regularization

FLOPS regularization is applied to enforce sparsity:

- **Lambda Query**: 0.0003 (queries are sparser)
- **Lambda Document**: 0.0001 (documents are kept less sparse for better recall)
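For reference, the FLOPS regularizer penalizes the squared mean activation of each vocabulary dimension over a batch, which drives rarely useful dimensions to exact zero. A minimal pure-Python sketch (the actual training loop applies the same computation to PyTorch batch tensors, weighted by the lambdas above):

```python
def flops_loss(batch_reps):
    """FLOPS regularizer: for each vocabulary dimension, average its
    activation over the batch, square it, and sum over all dimensions."""
    n_docs = len(batch_reps)
    n_dims = len(batch_reps[0])
    loss = 0.0
    for j in range(n_dims):
        mean_j = sum(vec[j] for vec in batch_reps) / n_docs
        loss += mean_j ** 2
    return loss

# Toy batch: 4 sparse vectors over a 6-term vocabulary
batch = [
    [0.0, 2.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 3.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 2.0, 0.0],
]
reg = flops_loss(batch)  # scaled by lambda_q / lambda_d in the total loss
```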
## Usage
### Installation

Encode a query and a document, then compare their sparse vectors:

```python
with torch.no_grad():
    # query side mirrors the d_kwargs / d_rep document API
    query_vec = model(q_kwargs=query_tokens)["q_rep"].squeeze()
    doc_vec = model(d_kwargs=doc_tokens)["d_rep"].squeeze()

# Calculate similarity (dot product)
similarity = torch.dot(query_vec, doc_vec).item()
print(f"Similarity: {similarity:.4f}")

# Get sparse representation
indices = torch.nonzero(query_vec).squeeze().tolist()
values = query_vec[indices].tolist()
print(f"Active dimensions: {len(indices)} / {query_vec.shape[0]}")
```

### Using Sparse Vectors for Retrieval

```python
# Build an inverted index from documents
inverted_index = {}

def add_to_index(doc_id, text):
    """Add a document to the inverted index."""
    sparse_vec = encode_sparse(text, is_query=False)
    for idx, value in zip(sparse_vec["indices"], sparse_vec["values"]):
        if idx not in inverted_index:
            inverted_index[idx] = []
        inverted_index[idx].append((doc_id, value))

# Index documents
docs = {
    1: "Brasília é a capital do Brasil",
    2: "São Paulo é a maior cidade do Brasil",
    3: "Python é uma linguagem de programação"
}

for doc_id, text in docs.items():
    add_to_index(doc_id, text)

# Search using the inverted index
def search(query, top_k=5):
    """Search documents using sparse vectors."""
    query_vec = encode_sparse(query, is_query=True)

    # Accumulate dot-product scores per document
    scores = {}
    for idx, q_value in zip(query_vec["indices"], query_vec["values"]):
        if idx in inverted_index:
            for doc_id, d_value in inverted_index[idx]:
                scores[doc_id] = scores.get(doc_id, 0) + (q_value * d_value)

    # Sort by score
    results = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_k]
    return [(doc_id, docs[doc_id], score) for doc_id, score in results]

# Example search
results = search("capital brasileira", top_k=3)
for doc_id, text, score in results:
    print(f"Score: {score:.2f} - {text}")
```

## Performance

### Evaluation Metrics

Evaluated on the mRobust validation set (528k docs, 250 queries):

| Metric | Score |
|--------|-------|
| **MRR@10** | **0.453** |

Expected ranges for additional metrics (to be confirmed with a complete evaluation):

- **Recall@100**: ~0.85-0.95
- **L0 (Sparsity)**: ~100-150 active dimensions

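MRR@10 is straightforward to compute from a ranked run; a small illustrative sketch, where `ranked` and `rels` are toy stand-ins for a run file and qrels:

```python
def mrr_at_k(ranked_lists, relevant, k=10):
    """Mean Reciprocal Rank@k: per query, 1/rank of the first relevant
    document within the top-k results (0 if none); averaged over queries."""
    total = 0.0
    for qid, ranking in ranked_lists.items():
        for rank, doc_id in enumerate(ranking[:k], start=1):
            if doc_id in relevant.get(qid, set()):
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

# Toy run: first relevant hit at rank 2 for q1, rank 1 for q2
ranked = {"q1": ["d3", "d7", "d1"], "q2": ["d2", "d9"]}
rels = {"q1": {"d7"}, "q2": {"d2"}}
score = mrr_at_k(ranked, rels)
```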
## Comparison with the Original SPLADE

| Aspect | Original SPLADE | SPLADE-PT-BR |
|--------|----------------|--------------|
| Language | English | Portuguese |
| Base Model | BERT-base-uncased | BERTimbau (BERT-base-cased-pt) |
| Vocabulary | 30,522 tokens | 29,794 tokens |
| Training Data | MS MARCO | mMARCO Portuguese |
| Query Expansion | English context | Portuguese context |

**Advantages for Portuguese:**

- Native vocabulary tokens (far less subword splitting for Portuguese words)
- Semantic expansion using Portuguese linguistic patterns
- Better performance on Brazilian Portuguese queries

## Model Architecture

```
Input Text → BERTimbau Tokenizer → BERT Encoder → MLM Head →
ReLU → log(1 + x) → Attention Masking → Max/Sum Pooling → Sparse Vector
```

The model outputs a vector of size 29,794 (the vocabulary size) where:

- Most values are exactly 0 (sparse)
- Non-zero values represent term importance plus learned expansions
- The vector can be used directly with inverted indexes

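The pooling step of the pipeline can be made concrete with a small pure-Python sketch; here `logits` stands in for the MLM-head output of a single text (seq_len × vocab_size), an assumption for illustration:

```python
import math

def splade_pool(logits, attention_mask):
    """ReLU -> log(1 + x) -> mask padding -> max over sequence positions,
    producing one weight per vocabulary entry."""
    vocab_size = len(logits[0])
    pooled = [0.0] * vocab_size
    for position, row in enumerate(logits):
        if attention_mask[position] == 0:
            continue  # padding positions contribute nothing
        for j, logit in enumerate(row):
            weight = math.log1p(max(logit, 0.0))  # ReLU then log(1+x)
            pooled[j] = max(pooled[j], weight)    # max pooling
    return pooled

# Two token positions over a 3-entry vocabulary; second position is padding
vec = splade_pool([[1.0, -2.0, 0.0], [3.0, 0.5, -1.0]], [1, 0])
```

Negative logits are zeroed by the ReLU, so most dimensions stay exactly 0, which is what makes the output compatible with inverted indexes.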
## Limitations and Bias

- **Language**: Optimized for Brazilian Portuguese; may work for European Portuguese but is not tested
- **Translated Training Data**: Trained on machine-translated Portuguese (mMARCO), so it may not capture all socio-cultural aspects of native Brazilian Portuguese
- **Domain**: Trained on general question-answering; may need fine-tuning for specific domains
- **Sequence Length**: Maximum 256 tokens; longer documents should be split
- **Computational Cost**: Requires a GPU for efficient encoding of large collections
- **Bias**: Inherits biases from the BERTimbau base model and the training data

## Citation

If you use this model, please cite:

```bibtex
@misc{splade-pt-br-2025,
  author = {Axel Chepanski},
  title = {SPLADE-PT-BR},
  year = {2025},
  url = {https://github.com/AxelPCG/SPLADE-PT-BR}
}
```

Original SPLADE paper:

```bibtex
@inproceedings{formal2021splade,
  title={SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking},
  author={Formal, Thibault and Piwowarski, Benjamin and Clinchant, St{\'e}phane},
  booktitle={Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  pages={2288--2292},
  year={2021}
}
```

## References

This work builds upon the following research:

1. **Quati Dataset**: Bueno, M., de Oliveira, E. S., Nogueira, R., Lotufo, R., & Pereira, J. (2024). *Quati: A Brazilian Portuguese Information Retrieval Dataset from Native Speakers*. arXiv:2404.06976. [https://arxiv.org/abs/2404.06976](https://arxiv.org/abs/2404.06976)
2. **mMARCO**: Bonifacio, L., Campiotti, I., Lotufo, R., & Nogueira, R. (2021). *mMARCO: A Multilingual Version of MS MARCO Passage Ranking Dataset*. Proceedings of STIL 2021. [https://sol.sbc.org.br/index.php/stil/article/view/31136](https://sol.sbc.org.br/index.php/stil/article/view/31136)
3. **SPLADE**: Formal, T., Piwowarski, B., & Clinchant, S. (2021). *SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking*. SIGIR 2021.
   - Original implementation: [naver/splade](https://github.com/naver/splade)
   - Fork used in this project: [leobavila/splade](https://github.com/leobavila/splade)
4. **BERTimbau**: Souza, F., Nogueira, R., & Lotufo, R. (2020). *BERTimbau: Pretrained BERT Models for Brazilian Portuguese*. BRACIS 2020.

## Acknowledgments

- **NAVER Labs** for the original SPLADE implementation
- **leobavila** for the SPLADE fork that enabled the Portuguese adaptations
- **Neuralmind** for BERTimbau
- **UNICAMP-DL** for the mMARCO and mRobust Portuguese datasets
- The **Quati dataset** authors, whose work inspired this native Portuguese IR effort

## License
Apache 2.0

## Contact

For questions or issues, please open an issue on the [GitHub repository](https://github.com/AxelPCG/SPLADE-PT-BR).