---
language: pt
license: apache-2.0
tags:
- information-retrieval
- sparse-retrieval
- splade
- portuguese
- bert
datasets:
- unicamp-dl/mmarco
- unicamp-dl/mrobust
base_model: neuralmind/bert-base-portuguese-cased
---
# SPLADE-PT-BR
SPLADE (SParse Lexical AnD Expansion) model fine-tuned for **Portuguese** text retrieval. Based on [BERTimbau](https://huggingface.co/neuralmind/bert-base-portuguese-cased) and trained on Portuguese passage-ranking data (mMARCO).
**GitHub Repository**: https://github.com/AxelPCG/SPLADE-PT-BR
## Model Description
SPLADE is a neural retrieval model that learns to expand queries and documents with contextually relevant terms while maintaining sparsity. Unlike dense retrievers, SPLADE produces sparse vectors (typically ~99% sparse) that are:
- **Interpretable**: Each dimension corresponds to a vocabulary token
- **Efficient**: Can use inverted indexes for fast retrieval
- **Effective**: Combines lexical matching with semantic expansion
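As a toy illustration of these properties (hypothetical terms and weights, not actual model output): a SPLADE vector can be stored as a token-to-weight map, and relevance is a sparse dot product over shared terms, which is exactly the computation an inverted index accelerates.
```python
# Toy illustration only: hypothetical token weights, not real model output.
query = {"capital": 1.8, "brasil": 1.5, "cidade": 0.4}
doc = {"brasília": 2.1, "capital": 1.6, "brasil": 1.2, "1960": 0.7}

# Sparse dot product over shared terms (what an inverted index computes)
score = sum(w * doc[t] for t, w in query.items() if t in doc)
print(f"{score:.2f}")  # 1.8*1.6 + 1.5*1.2 = 4.68
```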
### Key Features
- **Base Model**: `neuralmind/bert-base-portuguese-cased` (BERTimbau)
- **Vocabulary Size**: 29,794 tokens (Portuguese-optimized)
- **Training Iterations**: 150,000
- **Final Training Loss**: 0.000047
- **Sparsity**: ~99.5% (100-150 active dimensions per vector)
- **Max Sequence Length**: 256 tokens
## Training Details
### Training Data
- **Training Dataset**: mMARCO Portuguese (`unicamp-dl/mmarco`)
- **Validation Dataset**: mRobust (`unicamp-dl/mrobust`)
- **Format**: Triplets (query, positive document, negative document)
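A minimal sketch of how such triplets are typically turned into a training signal (hedged: the exact ranking loss used for this model is not documented; softmax cross-entropy over the positive/negative scores is a common choice for SPLADE-style training):
```python
import torch
import torch.nn.functional as F

def triplet_ranking_loss(q, pos, neg):
    # q, pos, neg: (batch_size, vocab_size) sparse representations.
    s_pos = (q * pos).sum(dim=1)  # dot-product relevance scores
    s_neg = (q * neg).sum(dim=1)
    scores = torch.stack([s_pos, s_neg], dim=1)            # (batch_size, 2)
    labels = torch.zeros(q.size(0), dtype=torch.long)      # positive is index 0
    return F.cross_entropy(scores, labels)
```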
### Training Configuration
```yaml
Learning Rate: 2e-5
Batch Size: 8 (effective: 32 with gradient accumulation)
Gradient Accumulation Steps: 4
Weight Decay: 0.01
Warmup Steps: 6,000
Mixed Precision: FP16
Optimizer: AdamW
```
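A sketch of the optimizer and schedule implied by this configuration (assuming a linear warmup/decay schedule via the `transformers` helper; the actual training script may differ):
```python
import torch
from transformers import get_linear_schedule_with_warmup

def build_optimization(model):
    # Hedged sketch: assumes a linear warmup/decay schedule.
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=6_000, num_training_steps=150_000
    )
    scaler = torch.cuda.amp.GradScaler()  # FP16 mixed precision
    return optimizer, scheduler, scaler

# Gradient accumulation: 4 steps of batch size 8 -> effective batch size 32
```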
### Regularization
FLOPS regularization is applied to enforce sparsity, with separate weights for queries and documents:
- **Lambda Query**: 0.0003 (queries are kept sparser)
- **Lambda Document**: 0.0001 (documents are kept less sparse to preserve recall)
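For reference, the FLOPS regularizer (Paria et al., 2020) penalizes the squared mean activation of each vocabulary dimension across the batch; a minimal sketch:
```python
import torch

def flops_loss(reps: torch.Tensor) -> torch.Tensor:
    # reps: (batch_size, vocab_size) non-negative SPLADE activations.
    # Squared mean activation per vocabulary dimension, summed over the vocab.
    return torch.sum(torch.mean(reps, dim=0) ** 2)

# Combined with the ranking loss using the lambdas above:
# loss = ranking_loss + 0.0003 * flops_loss(q_reps) + 0.0001 * flops_loss(d_reps)
```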
## Performance
**Dataset**: mRobust (528k docs, 250 queries)
| Metric | Score |
|--------|-------|
| **MRR@10** | **0.453** |
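For reference, MRR@10 is the mean reciprocal rank of the first relevant document per query, counting only hits within the top 10:
```python
def mrr_at_10(first_relevant_ranks):
    # first_relevant_ranks: 1-based rank of the first relevant document per
    # query; use None (or any rank > 10) when nothing relevant is in the top 10.
    return sum(1.0 / r for r in first_relevant_ranks if r and r <= 10) / len(first_relevant_ranks)

# mrr_at_10([1, 3, None]) == (1.0 + 1/3 + 0.0) / 3
```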
## Usage
### Installation
```bash
pip install torch transformers
```
### Basic Usage
**Option 1: Using HuggingFace Hub (Recommended)**
```python
import torch
from transformers import AutoTokenizer
from modeling_splade import Splade  # modeling_splade.py ships with the model repository

# Load model and tokenizer
model = Splade.from_pretrained("AxelPCG/splade-pt-br")
tokenizer = AutoTokenizer.from_pretrained("AxelPCG/splade-pt-br")
model.eval()

# Encode a query
query = "Qual é a capital do Brasil?"
with torch.no_grad():
    query_tokens = tokenizer(query, return_tensors="pt", max_length=256, truncation=True)
    query_vec = model(q_kwargs=query_tokens)["q_rep"].squeeze()

# Encode a document
document = "Brasília é a capital federal do Brasil desde 1960."
with torch.no_grad():
    doc_tokens = tokenizer(document, return_tensors="pt", max_length=256, truncation=True)
    doc_vec = model(d_kwargs=doc_tokens)["d_rep"].squeeze()

# Calculate similarity (dot product)
similarity = torch.dot(query_vec, doc_vec).item()
print(f"Similarity: {similarity:.4f}")

# Get sparse representation
indices = torch.nonzero(query_vec).squeeze().tolist()
values = query_vec[indices].tolist()
print(f"Active dimensions: {len(indices)} / {query_vec.shape[0]}")
```
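To see the interpretability property in practice, the nonzero dimensions can be mapped back to vocabulary tokens (continuing from the example above):
```python
# Map nonzero dimensions back to their vocabulary tokens (continues from above)
id2token = {idx: tok for tok, idx in tokenizer.get_vocab().items()}
top_terms = sorted(zip(indices, values), key=lambda iv: iv[1], reverse=True)[:10]
for i, v in top_terms:
    print(f"{id2token[i]:>15s}  {v:.3f}")
```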
**Option 2: Using the SPLADE Library**
This option assumes the [NAVER SPLADE library](https://github.com/naver/splade) (or the fork listed under Acknowledgments) is installed from source.
```python
from splade.models.transformer_rep import Splade
from transformers import AutoTokenizer
# Load model by pointing to HuggingFace repo
model = Splade(model_type_or_dir="AxelPCG/splade-pt-br", agg="max", fp16=True)
tokenizer = AutoTokenizer.from_pretrained("AxelPCG/splade-pt-br")
```
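Encoding then mirrors Option 1; a minimal sketch (assuming the library's `q_kwargs`/`d_kwargs` forward interface, as in the NAVER reference implementation):
```python
import torch

model.eval()
with torch.no_grad():
    tokens = tokenizer("Qual é a capital do Brasil?", return_tensors="pt",
                       max_length=256, truncation=True)
    query_rep = model(q_kwargs=tokens)["q_rep"]  # (1, vocab_size) sparse vector
```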
## Limitations and Bias
- Model trained on machine-translated Portuguese data (mMARCO)
- May not capture all socio-cultural aspects of native Brazilian Portuguese
- Performance may vary on domain-specific tasks
- Inherits biases from BERTimbau base model and training data
## Citation
```bibtex
@misc{splade-pt-br-2025,
author = {Axel Chepanski},
title = {SPLADE-PT-BR: Sparse Retrieval for Portuguese},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/AxelPCG/splade-pt-br}
}
```
## Acknowledgments
- **SPLADE** by NAVER Labs and [leobavila/splade](https://github.com/leobavila/splade) fork
- **BERTimbau** by Neuralmind
- **mMARCO & mRobust Portuguese** by UNICAMP-DL
- **Quati** dataset research, which inspired work on native Portuguese IR
## License
Apache 2.0