AxelPCG committed on
Commit 829a3ee · verified · 1 Parent(s): b62a8d0

Upload SPLADE-PT-BR model v1.0.0

Files changed (1)
  1. README.md +24 -0
README.md CHANGED
@@ -39,10 +39,14 @@ SPLADE is a neural retrieval model that learns to expand queries and documents w
 
  - **Training Dataset**: mMARCO Portuguese (`unicamp-dl/mmarco`) - MS MARCO translated to Portuguese
  - Used for training with triplets (query, positive document, negative document)
+ - Created by UNICAMP-DL team as part of their Portuguese IR research
  - **Validation Dataset**: mRobust (`unicamp-dl/mrobust`) - TREC Robust04 translated to Portuguese
  - Used for validation and evaluation during training
+ - Part of the UNICAMP-DL Portuguese IR datasets collection
  - **Format**: Triplets (query, positive document, negative document)
 
+ **Note**: This model was inspired by research on native Portuguese information retrieval, particularly the [Quati dataset](https://arxiv.org/abs/2404.06976) work by Bueno et al. (2024), which demonstrated the importance of native Portuguese datasets over translated ones for better capturing socio-cultural aspects of Brazilian Portuguese.
+
  ### Training Configuration
 
  ```yaml
@@ -223,6 +227,26 @@ Original SPLADE paper:
  }
  ```
 
+ ## References
+
+ This work builds upon the following research:
+
+ 1. **Quati Dataset**: Bueno, M., de Oliveira, E. S., Nogueira, R., Lotufo, R., & Pereira, J. (2024). *Quati: A Brazilian Portuguese Information Retrieval Dataset from Native Speakers*. arXiv:2404.06976. [https://arxiv.org/abs/2404.06976](https://arxiv.org/abs/2404.06976)
+
+ 2. **mMARCO**: Bonifacio, L., Campiotti, I., Lotufo, R., & Nogueira, R. (2021). *mMARCO: A Multilingual Version of MS MARCO Passage Ranking Dataset*. Proceedings of STIL 2021. [https://sol.sbc.org.br/index.php/stil/article/view/31136](https://sol.sbc.org.br/index.php/stil/article/view/31136)
+
+ 3. **SPLADE**: Formal, T., Piwowarski, B., & Clinchant, S. (2021). *SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking*. SIGIR 2021.
+
+ 4. **BERTimbau**: Souza, F., Nogueira, R., & Lotufo, R. (2020). *BERTimbau: Pretrained BERT Models for Brazilian Portuguese*. BRACIS 2020.
+
+ ## Acknowledgments
+
+ Special thanks to:
+ - **UNICAMP-DL team** for the mMARCO and mRobust Portuguese datasets
+ - **Quati dataset authors** for pioneering native Portuguese IR research
+ - **NeuralMind** for the BERTimbau model
+ - **Original SPLADE authors** for the model architecture
+
  ## License
 
  Apache 2.0