AxelPCG/splade-pt-br

information-retrieval

sparse-retrieval


AxelPCG committed on Dec 1, 2025

Commit 829a3ee · verified · 1 Parent(s): b62a8d0

Upload SPLADE-PT-BR model v1.0.0

Files changed (1)

README.md +24 -0


@@ -39,10 +39,14 @@ SPLADE is a neural retrieval model that learns to expand queries and documents w
 - **Training Dataset**: mMARCO Portuguese (`unicamp-dl/mmarco`) - MS MARCO translated to Portuguese
   - Used for training with triplets (query, positive document, negative document)
+  - Created by UNICAMP-DL team as part of their Portuguese IR research
 - **Validation Dataset**: mRobust (`unicamp-dl/mrobust`) - TREC Robust04 translated to Portuguese
   - Used for validation and evaluation during training
+  - Part of the UNICAMP-DL Portuguese IR datasets collection
 - **Format**: Triplets (query, positive document, negative document)
+**Note**: This model was inspired by research on native Portuguese information retrieval, particularly the [Quati dataset](https://arxiv.org/abs/2404.06976) work by Bueno et al. (2024), which demonstrated the importance of native Portuguese datasets over translated ones for better capturing socio-cultural aspects of Brazilian Portuguese.
 ### Training Configuration
 ```yaml
@@ -223,6 +227,26 @@ Original SPLADE paper:
 }
 ```
+## References
+This work builds upon the following research:
+1. **Quati Dataset**: Bueno, M., de Oliveira, E. S., Nogueira, R., Lotufo, R., & Pereira, J. (2024). *Quati: A Brazilian Portuguese Information Retrieval Dataset from Native Speakers*. arXiv:2404.06976. [https://arxiv.org/abs/2404.06976](https://arxiv.org/abs/2404.06976)
+2. **mMARCO**: Bonifacio, L., Campiotti, I., Lotufo, R., & Nogueira, R. (2021). *mMARCO: A Multilingual Version of MS MARCO Passage Ranking Dataset*. Proceedings of STIL 2021. [https://sol.sbc.org.br/index.php/stil/article/view/31136](https://sol.sbc.org.br/index.php/stil/article/view/31136)
+3. **SPLADE**: Formal, T., Piwowarski, B., & Clinchant, S. (2021). *SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking*. SIGIR 2021.
+4. **BERTimbau**: Souza, F., Nogueira, R., & Lotufo, R. (2020). *BERTimbau: Pretrained BERT Models for Brazilian Portuguese*. BRACIS 2020.
+## Acknowledgments
+Special thanks to:
+- **UNICAMP-DL team** for the mMARCO and mRobust Portuguese datasets
+- **Quati dataset authors** for pioneering native Portuguese IR research
+- **NeuralMind** for the BERTimbau model
+- **Original SPLADE authors** for the model architecture
 ## License
 Apache 2.0