Upload SPLADE-PT-BR model v1.0.0
README.md
CHANGED
|
@@ -39,10 +39,14 @@ SPLADE is a neural retrieval model that learns to expand queries and documents w
|
|
| 39 |
|
| 40 |
- **Training Dataset**: mMARCO Portuguese (`unicamp-dl/mmarco`) - MS MARCO translated to Portuguese
|
| 41 |
- Used for training with triplets (query, positive document, negative document)
|
| 42 |
- **Validation Dataset**: mRobust (`unicamp-dl/mrobust`) - TREC Robust04 translated to Portuguese
|
| 43 |
- Used for validation and evaluation during training
|
| 44 |
- **Format**: Triplets (query, positive document, negative document)
|
| 45 |
|
| 46 |
### Training Configuration
|
| 47 |
|
| 48 |
```yaml
|
|
@@ -223,6 +227,26 @@ Original SPLADE paper:
|
|
| 223 |
}
|
| 224 |
```
|
| 225 |
|
| 226 |
## License
|
| 227 |
|
| 228 |
Apache 2.0
|
|
|
|
| 39 |
|
| 40 |
- **Training Dataset**: mMARCO Portuguese (`unicamp-dl/mmarco`) - MS MARCO translated to Portuguese
|
| 41 |
- Used for training with triplets (query, positive document, negative document)
|
| 42 |
+
- Created by the UNICAMP-DL team as part of their Portuguese IR research
|
| 43 |
- **Validation Dataset**: mRobust (`unicamp-dl/mrobust`) - TREC Robust04 translated to Portuguese
|
| 44 |
- Used for validation and evaluation during training
|
| 45 |
+
- Part of the UNICAMP-DL Portuguese IR datasets collection
|
| 46 |
- **Format**: Triplets (query, positive document, negative document)
|
| 47 |
|
| 48 |
+
**Note**: This model was inspired by research on native Portuguese information retrieval, particularly the [Quati dataset](https://arxiv.org/abs/2404.06976) work by Bueno et al. (2024), which demonstrated the importance of native Portuguese datasets over translated ones for better capturing socio-cultural aspects of Brazilian Portuguese.
|
| 49 |
+
|
| 50 |
### Training Configuration
|
| 51 |
|
| 52 |
```yaml
|
|
|
|
| 227 |
}
|
| 228 |
```
|
| 229 |
|
| 230 |
+
## References
|
| 231 |
+
|
| 232 |
+
This work builds upon the following research:
|
| 233 |
+
|
| 234 |
+
1. **Quati Dataset**: Bueno, M., de Oliveira, E. S., Nogueira, R., Lotufo, R., & Pereira, J. (2024). *Quati: A Brazilian Portuguese Information Retrieval Dataset from Native Speakers*. arXiv:2404.06976. [https://arxiv.org/abs/2404.06976](https://arxiv.org/abs/2404.06976)
|
| 235 |
+
|
| 236 |
+
2. **mMARCO**: Bonifacio, L., Campiotti, I., Lotufo, R., & Nogueira, R. (2021). *mMARCO: A Multilingual Version of MS MARCO Passage Ranking Dataset*. Proceedings of STIL 2021. [https://sol.sbc.org.br/index.php/stil/article/view/31136](https://sol.sbc.org.br/index.php/stil/article/view/31136)
|
| 237 |
+
|
| 238 |
+
3. **SPLADE**: Formal, T., Piwowarski, B., & Clinchant, S. (2021). *SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking*. SIGIR 2021.
|
| 239 |
+
|
| 240 |
+
4. **BERTimbau**: Souza, F., Nogueira, R., & Lotufo, R. (2020). *BERTimbau: Pretrained BERT Models for Brazilian Portuguese*. BRACIS 2020.
|
| 241 |
+
|
| 242 |
+
## Acknowledgments
|
| 243 |
+
|
| 244 |
+
Special thanks to:
|
| 245 |
+
- **UNICAMP-DL team** for the mMARCO and mRobust Portuguese datasets
|
| 246 |
+
- **Quati dataset authors** for pioneering native Portuguese IR research
|
| 247 |
+
- **NeuralMind** for the BERTimbau model
|
| 248 |
+
- **Original SPLADE authors** for the model architecture
|
| 249 |
+
|
| 250 |
## License
|
| 251 |
|
| 252 |
Apache 2.0
|