AxelPCG committed on
Commit 61eb3c1 · verified · 1 Parent(s): d4c65e4

Upload SPLADE-PT-BR model v1.0.0

Files changed (1):
  1. README.md +30 -149
README.md CHANGED
@@ -15,7 +15,7 @@ base_model: neuralmind/bert-base-portuguese-cased
 
 # SPLADE-PT-BR
 
- SPLADE (Sparse Lexical AnD Expansion) model fine-tuned for **Portuguese** text retrieval. This model is based on [BERTimbau](https://huggingface.co/neuralmind/bert-base-portuguese-cased) and trained on Portuguese question-answering datasets.
 
 ## Model Description
 
@@ -30,33 +30,27 @@ SPLADE is a neural retrieval model that learns to expand queries and documents w
 - **Vocabulary Size**: 29,794 tokens (Portuguese-optimized)
 - **Training Iterations**: 150,000
 - **Final Training Loss**: 0.000047
- - **Sparsity**: ~99% (100-150 active dimensions per vector)
 - **Max Sequence Length**: 256 tokens
 
 ## Training Details
 
 ### Training Data
 
- - **Training Dataset**: mMARCO Portuguese (`unicamp-dl/mmarco`) - MS MARCO translated to Portuguese
-   - Used for training with triplets (query, positive document, negative document)
-   - Created by UNICAMP-DL team as part of their Portuguese IR research
- - **Validation Dataset**: mRobust (`unicamp-dl/mrobust`) - TREC Robust04 translated to Portuguese
-   - Used for validation and evaluation during training
-   - Part of the UNICAMP-DL Portuguese IR datasets collection
 - **Format**: Triplets (query, positive document, negative document)
 
- **Note**: This model was inspired by research on native Portuguese information retrieval, particularly the [Quati dataset](https://arxiv.org/abs/2404.06976) work by Bueno et al. (2024), which demonstrated the importance of native Portuguese datasets over translated ones for better capturing socio-cultural aspects of Brazilian Portuguese.
-
 ### Training Configuration
 
 ```yaml
- - Learning Rate: 2e-5
- - Batch Size: 8 (effective: 32 with gradient accumulation)
- - Gradient Accumulation Steps: 4
- - Weight Decay: 0.01
- - Warmup Steps: 6,000
- - Mixed Precision: FP16
- - Optimizer: AdamW
 ```
 
 ### Regularization
@@ -65,6 +59,14 @@ FLOPS regularization is applied to enforce sparsity:
 - **Lambda Query**: 0.0003 (queries are more sparse)
 - **Lambda Document**: 0.0001 (documents less sparse for better recall)
 
 ## Usage
 
 ### Installation
@@ -98,113 +100,24 @@ with torch.no_grad():
     doc_vec = model(d_kwargs=doc_tokens)["d_rep"].squeeze()
 
 # Calculate similarity (dot product)
- similarity = (query_vec * doc_vec).sum().item()
 print(f"Similarity: {similarity:.4f}")
 
 # Get sparse representation
 indices = torch.nonzero(query_vec).squeeze().tolist()
 values = query_vec[indices].tolist()
- print(f"Query sparsity: {len(indices)} / {query_vec.shape[0]} active dimensions")
- ```
-
- ### Using Sparse Vectors for Retrieval
-
- ```python
- # Build inverted index from documents
- inverted_index = {}
-
- def add_to_index(doc_id, text):
-     """Add document to inverted index"""
-     sparse_vec = encode_sparse(text, is_query=False)
-     for idx, value in zip(sparse_vec["indices"], sparse_vec["values"]):
-         if idx not in inverted_index:
-             inverted_index[idx] = []
-         inverted_index[idx].append((doc_id, value))
-
- # Index documents
- docs = {
-     1: "Brasília é a capital do Brasil",
-     2: "São Paulo é a maior cidade do Brasil",
-     3: "Python é uma linguagem de programação"
- }
-
- for doc_id, text in docs.items():
-     add_to_index(doc_id, text)
-
- # Search using inverted index
- def search(query, top_k=5):
-     """Search documents using sparse vectors"""
-     query_vec = encode_sparse(query, is_query=True)
-
-     # Calculate scores for each document
-     scores = {}
-     for idx, q_value in zip(query_vec["indices"], query_vec["values"]):
-         if idx in inverted_index:
-             for doc_id, d_value in inverted_index[idx]:
-                 scores[doc_id] = scores.get(doc_id, 0) + (q_value * d_value)
-
-     # Sort by score
-     results = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_k]
-     return [(doc_id, docs[doc_id], score) for doc_id, score in results]
-
- # Example search
- results = search("capital brasileira", top_k=3)
- for doc_id, text, score in results:
-     print(f"Score: {score:.2f} - {text}")
 ```
 
- ## Performance
-
- ### Evaluation Metrics
-
- *Metrics will be updated after complete evaluation on validation set.*
-
- Expected performance on Portuguese retrieval tasks:
- - **MRR@10**: ~0.25-0.35
- - **Recall@100**: ~0.85-0.95
- - **L0 (Sparsity)**: ~100-150 active dimensions
 
- ### Comparison with Original SPLADE
-
- The original SPLADE model was trained on English data. Key differences:
-
- | Aspect | Original SPLADE | SPLADE-PT-BR |
- |--------|----------------|--------------|
- | Language | English | Portuguese |
- | Base Model | BERT-base-uncased | BERTimbau (BERT-base-cased-pt) |
- | Vocabulary | 30,522 tokens | 29,794 tokens |
- | Training Data | MS MARCO | mMARCO Portuguese |
- | Query Expansion | English context | Portuguese context |
-
- **Advantages for Portuguese:**
- - Native vocabulary tokens (no subword splitting for Portuguese words)
- - Semantic expansion using Portuguese linguistic patterns
- - Better performance on Brazilian Portuguese queries
-
- ## Model Architecture
-
- ```
- Input Text → BERTimbau Tokenizer → BERT Encoder → MLM Head →
- ReLU → log(1 + x) → Attention Masking → Max/Sum Pooling → Sparse Vector
- ```
-
- The model outputs a vector of size 29,794 (vocabulary size) where:
- - Most values are exactly 0 (sparse)
- - Non-zero values represent term importance + learned expansions
- - Can be used directly with inverted indexes
-
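A toy numeric sketch of the activation pipeline above (pure Python; assumes max pooling and hand-picked logits rather than real model outputs):

```python
import math

# Toy sketch of SPLADE aggregation: for each vocabulary dimension j,
# take the max over token positions of log(1 + ReLU(logit)).
# Assumes max pooling; the real model operates on torch tensors.
def splade_pool(logits):
    """logits: tokens x vocab matrix (list of lists) -> sparse vector."""
    vocab = len(logits[0])
    return [max(math.log1p(max(tok[j], 0.0)) for tok in logits)
            for j in range(vocab)]

logits = [[-1.0, 2.0, 0.0],   # token 1 logits over a 3-word toy vocab
          [ 0.5, 0.0, -3.0]]  # token 2
print(splade_pool(logits))    # negative/zero logits stay exactly 0.0
```

Because ReLU zeroes out non-positive logits before the log, most dimensions remain exactly 0, which is what makes the output usable with an inverted index.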
- ## Limitations
-
- - **Language**: Optimized for Brazilian Portuguese; may work for European Portuguese but not tested
- - **Domain**: Trained on general question-answering; may need fine-tuning for specific domains
- - **Sequence Length**: Maximum 256 tokens; longer documents should be split
- - **Computational Cost**: Requires GPU for efficient encoding of large collections
 
 ## Citation
 
- If you use this model, please cite:
-
 ```bibtex
 @misc{splade-pt-br-2025,
   author = {Axel Chepanski},
@@ -215,46 +128,14 @@ If you use this model, please cite:
 }
 ```
 
- Original SPLADE paper:
-
- ```bibtex
- @inproceedings{formal2021splade,
-   title={SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking},
-   author={Formal, Thibault and Piwowarski, Benjamin and Clinchant, St{\'e}phane},
-   booktitle={Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval},
-   pages={2288--2292},
-   year={2021}
- }
- ```
-
- ## References
-
- This work builds upon the following research:
-
- 1. **Quati Dataset**: Bueno, M., de Oliveira, E. S., Nogueira, R., Lotufo, R., & Pereira, J. (2024). *Quati: A Brazilian Portuguese Information Retrieval Dataset from Native Speakers*. arXiv:2404.06976. [https://arxiv.org/abs/2404.06976](https://arxiv.org/abs/2404.06976)
-
- 2. **mMARCO**: Bonifacio, L., Campiotti, I., Lotufo, R., & Nogueira, R. (2021). *mMARCO: A Multilingual Version of MS MARCO Passage Ranking Dataset*. Proceedings of STIL 2021. [https://sol.sbc.org.br/index.php/stil/article/view/31136](https://sol.sbc.org.br/index.php/stil/article/view/31136)
-
- 3. **SPLADE**: Formal, T., Piwowarski, B., & Clinchant, S. (2021). *SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking*. SIGIR 2021.
-    - Original implementation: [naver/splade](https://github.com/naver/splade)
-    - Fork used in this project: [leobavila/splade](https://github.com/leobavila/splade)
-
- 4. **BERTimbau**: Souza, F., Nogueira, R., & Lotufo, R. (2020). *BERTimbau: Pretrained BERT Models for Brazilian Portuguese*. BRACIS 2020.
-
 ## Acknowledgments
 
- Special thanks to:
- - **UNICAMP-DL team** for the mMARCO and mRobust Portuguese datasets
- - **Quati dataset authors** for pioneering native Portuguese IR research
- - **NeuralMind** for the BERTimbau model
- - **NAVER Labs** for the original SPLADE implementation
- - **leobavila** for the SPLADE fork that enabled Portuguese adaptations
 
 ## License
 
 Apache 2.0
 
- ## Contact
-
- For questions or issues, please open an issue on the [GitHub repository](https://github.com/AxelPCG/SPLADE-PT-BR).
 
 # SPLADE-PT-BR
 
+ SPLADE (Sparse Lexical AnD Expansion) model fine-tuned for **Portuguese** text retrieval. Based on [BERTimbau](https://huggingface.co/neuralmind/bert-base-portuguese-cased) and trained on Portuguese question-answering datasets.
 
 ## Model Description
 
 - **Vocabulary Size**: 29,794 tokens (Portuguese-optimized)
 - **Training Iterations**: 150,000
 - **Final Training Loss**: 0.000047
+ - **Sparsity**: ~99.5% (100-150 active dimensions per vector)
 - **Max Sequence Length**: 256 tokens
 
 ## Training Details
 
 ### Training Data
 
+ - **Training Dataset**: mMARCO Portuguese (`unicamp-dl/mmarco`)
+ - **Validation Dataset**: mRobust (`unicamp-dl/mrobust`)
 - **Format**: Triplets (query, positive document, negative document)
 
 ### Training Configuration
 
 ```yaml
+ Learning Rate: 2e-5
+ Batch Size: 8 (effective: 32 with gradient accumulation)
+ Gradient Accumulation Steps: 4
+ Weight Decay: 0.01
+ Warmup Steps: 6,000
+ Mixed Precision: FP16
+ Optimizer: AdamW
 ```
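The effective batch size of 32 above comes from gradient accumulation (8 × 4). A minimal sketch of the mechanism, with toy scalar gradients standing in for real tensors:

```python
# Toy sketch of gradient accumulation (batch 8 x 4 steps = effective 32).
# Scalar "gradients" stand in for real tensors; the optimizer step is a stub.
accum_steps = 4
micro_grads = [0.1, 0.2, 0.3, 0.4]   # one per micro-batch

grad = 0.0
for step, g in enumerate(micro_grads, start=1):
    grad += g / accum_steps          # scale so the sum becomes a mean
    if step % accum_steps == 0:
        effective_grad = grad        # optimizer.step() would consume this
        grad = 0.0

print(effective_grad)                # ~0.25, the mean micro-batch gradient
```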
 
 ### Regularization
 
 - **Lambda Query**: 0.0003 (queries are more sparse)
 - **Lambda Document**: 0.0001 (documents less sparse for better recall)
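These lambdas weight the FLOPS regularizer from the SPLADE paper, which penalizes the squared mean activation of each vocabulary dimension over a batch. A pure-Python sketch with toy 4-dimensional activations (the real implementation works on torch tensors):

```python
# FLOPS regularizer sketch: loss = sum_j (mean_i w_ij)^2 over a batch of
# non-negative sparse activations w. Toy 4-dimensional vectors below.
def flops_loss(batch):
    n, dim = len(batch), len(batch[0])
    return sum((sum(vec[j] for vec in batch) / n) ** 2 for j in range(dim))

batch = [
    [0.0, 2.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 3.0],
]
lambda_q = 0.0003  # query weight from the config above
print(lambda_q * flops_loss(batch))  # dims average to [0, 1, 0, 2] -> loss 5.0
```

Because the penalty grows with how often a dimension fires across the batch, training is pushed toward representations where most dimensions stay at zero.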
 
+ ## Performance
+
+ **Dataset**: mRobust (528k docs, 250 queries)
+
+ | Metric | Score |
+ |--------|-------|
+ | **MRR@10** | **0.453** |
+
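For reference, MRR@10 averages the reciprocal rank of the first relevant document per query, counting only the top 10 results. A small illustration with hypothetical ranks (not taken from the actual evaluation):

```python
# MRR@10 sketch: average 1/rank of the first relevant hit per query,
# scoring 0 when no relevant document appears in the top 10.
def mrr_at_10(first_relevant_ranks):
    """first_relevant_ranks: 1-based rank per query, or None if absent."""
    rr = [1.0 / r if r is not None and r <= 10 else 0.0
          for r in first_relevant_ranks]
    return sum(rr) / len(rr)

print(mrr_at_10([1, 2, None]))  # (1 + 0.5 + 0) / 3 = 0.5
```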
 ## Usage
 
 ### Installation
 
     doc_vec = model(d_kwargs=doc_tokens)["d_rep"].squeeze()
 
 # Calculate similarity (dot product)
+ similarity = torch.dot(query_vec, doc_vec).item()
 print(f"Similarity: {similarity:.4f}")
 
 # Get sparse representation
 indices = torch.nonzero(query_vec).squeeze().tolist()
 values = query_vec[indices].tolist()
+ print(f"Active dimensions: {len(indices)} / {query_vec.shape[0]}")
 ```
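Beyond pairwise scoring, these sparse vectors can feed an inverted index directly. A self-contained toy sketch, where hand-made `{dimension: weight}` dicts stand in for real model outputs:

```python
# Toy inverted-index retrieval with SPLADE-style sparse vectors.
# Dicts map vocabulary dimension -> weight; real vectors come from the model.
docs = {
    1: {101: 1.2, 205: 0.8},   # imagine dims for "capital", "Brasil"
    2: {205: 0.9, 333: 1.1},
}

# Build the index: dimension -> [(doc_id, weight), ...]
index = {}
for doc_id, vec in docs.items():
    for dim, w in vec.items():
        index.setdefault(dim, []).append((doc_id, w))

def search(query_vec, top_k=5):
    """Score = dot product accumulated over shared dimensions."""
    scores = {}
    for dim, q_w in query_vec.items():
        for doc_id, d_w in index.get(dim, []):
            scores[doc_id] = scores.get(doc_id, 0.0) + q_w * d_w
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_k]

print(search({101: 1.0, 205: 0.5}))  # doc 1 scores highest (~1.6 vs ~0.45)
```

Only the dimensions a query activates are ever touched, which is why SPLADE representations work with standard inverted-index engines.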
 
+ ## Limitations and Bias
 
+ - Model trained on machine-translated Portuguese data (mMARCO)
+ - May not capture all socio-cultural aspects of native Brazilian Portuguese
+ - Performance may vary on domain-specific tasks
+ - Inherits biases from BERTimbau base model and training data
 
 
 ## Citation
 
 ```bibtex
 @misc{splade-pt-br-2025,
   author = {Axel Chepanski},
 
 }
 ```
 
 ## Acknowledgments
 
+ - **SPLADE** by NAVER Labs and the [leobavila/splade](https://github.com/leobavila/splade) fork
+ - **BERTimbau** by NeuralMind
+ - **mMARCO & mRobust Portuguese** by UNICAMP-DL
+ - **Quati dataset** research, an inspiration for native Portuguese IR
 
 ## License
 
 Apache 2.0