|
|
--- |
|
|
language: pt |
|
|
license: apache-2.0 |
|
|
tags: |
|
|
- information-retrieval |
|
|
- sparse-retrieval |
|
|
- splade |
|
|
- portuguese |
|
|
- bert |
|
|
datasets: |
|
|
- unicamp-dl/mmarco |
|
|
- unicamp-dl/mrobust |
|
|
base_model: neuralmind/bert-base-portuguese-cased |
|
|
--- |
|
|
|
|
|
# SPLADE-PT-BR |
|
|
|
|
|
SPLADE (SParse Lexical AnD Expansion) model fine-tuned for **Portuguese** text retrieval. Based on [BERTimbau](https://huggingface.co/neuralmind/bert-base-portuguese-cased) and trained on the Portuguese mMARCO passage-ranking dataset. |
|
|
|
|
|
**GitHub Repository**: https://github.com/AxelPCG/SPLADE-PT-BR |
|
|
|
|
|
## Model Description |
|
|
|
|
|
SPLADE is a neural retrieval model that learns to expand queries and documents with contextually relevant terms while maintaining sparsity. Unlike dense retrievers, SPLADE produces sparse vectors (typically ~99% sparse) that are: |
|
|
- **Interpretable**: Each dimension corresponds to a vocabulary token (see the sketch after this list) |
|
|
- **Efficient**: Can use inverted indexes for fast retrieval |
|
|
- **Effective**: Combines lexical matching with semantic expansion |
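
To make the interpretability point concrete, here is a minimal sketch that maps the active dimensions of a sparse vector back to vocabulary tokens (the vector below is hand-built for illustration; in practice it comes from the model, as shown in the Usage section):

```python
import torch
from transformers import AutoTokenizer

# Hand-built sparse vector over the BERTimbau vocabulary, for illustration only
tokenizer = AutoTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")
sparse_vec = torch.zeros(tokenizer.vocab_size)
sparse_vec[tokenizer.convert_tokens_to_ids("capital")] = 1.7  # lexical match
sparse_vec[tokenizer.convert_tokens_to_ids("cidade")] = 0.9   # expansion term

# Every nonzero dimension is a vocabulary token, so the representation is
# readable and can be stored directly in an inverted index
for idx in torch.nonzero(sparse_vec).squeeze(-1):
    print(tokenizer.convert_ids_to_tokens(idx.item()), f"{sparse_vec[idx].item():.1f}")
```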
|
|
|
|
|
### Key Features |
|
|
|
|
|
- **Base Model**: `neuralmind/bert-base-portuguese-cased` (BERTimbau) |
|
|
- **Vocabulary Size**: 29,794 tokens (Portuguese-optimized) |
|
|
- **Training Iterations**: 150,000 |
|
|
- **Final Training Loss**: 0.000047 |
|
|
- **Sparsity**: ~99.5% (100-150 active dimensions per vector) |
|
|
- **Max Sequence Length**: 256 tokens |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
- **Training Dataset**: mMARCO Portuguese (`unicamp-dl/mmarco`) |
|
|
- **Validation Dataset**: mRobust (`unicamp-dl/mrobust`) |
|
|
- **Format**: Triplets (query, positive document, negative document), illustrated below |
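
For illustration, one training triplet has the following shape (the texts here are invented, not actual mMARCO rows):

```python
triplet = {
    "query": "qual a capital do brasil",                    # short user query
    "positive": "Brasília é a capital federal do Brasil.",  # relevant passage
    "negative": "O Rio de Janeiro tem praias famosas.",     # non-relevant passage
}
```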
|
|
|
|
|
### Training Configuration |
|
|
|
|
|
```yaml |
|
|
Learning Rate: 2e-5 |
|
|
Batch Size: 8 (effective: 32 with gradient accumulation) |
|
|
Gradient Accumulation Steps: 4 |
|
|
Weight Decay: 0.01 |
|
|
Warmup Steps: 6,000 |
|
|
Mixed Precision: FP16 |
|
|
Optimizer: AdamW |
|
|
``` |
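
As a rough sketch, the configuration above maps onto the `transformers` Trainer API as follows (an approximation for orientation, not the exact training script):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="splade-pt-br",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,  # effective batch size 32
    weight_decay=0.01,
    warmup_steps=6000,
    max_steps=150_000,              # 150,000 training iterations
    fp16=True,                      # mixed precision
    optim="adamw_torch",            # AdamW optimizer
)
```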
|
|
|
|
|
### Regularization |
|
|
|
|
|
FLOPS regularization is applied to enforce sparsity; a short sketch of the regularizer follows the list: |


- **Lambda Query**: 0.0003 (queries are kept sparser) |


- **Lambda Document**: 0.0001 (documents are kept less sparse for better recall) |
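
A minimal sketch of the FLOPS regularizer (Paria et al., 2020) as used in SPLADE, assuming batched representations of shape `(batch_size, vocab_size)`:

```python
import torch

def flops_loss(reps: torch.Tensor) -> torch.Tensor:
    """Mean activation of each vocabulary dimension over the batch,
    squared and summed; penalizes dense, always-on dimensions."""
    return torch.sum(torch.mean(torch.abs(reps), dim=0) ** 2)

# Weighted as in the configuration above (lambda_q > lambda_d):
# loss = ranking_loss + 0.0003 * flops_loss(q_reps) + 0.0001 * flops_loss(d_reps)
```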
|
|
|
|
|
## Performance |
|
|
|
|
|
**Dataset**: mRobust (528k documents, 250 queries) |
|
|
|
|
|
| Metric | Score | |
|
|
|--------|-------| |
|
|
| **MRR@10** | **0.453** | |
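
For reference, MRR@10 is the reciprocal rank of the first relevant document within the top 10 results, averaged over queries. A minimal sketch (function and argument names are illustrative):

```python
def mrr_at_10(rankings: list[list[str]], relevant: list[set[str]]) -> float:
    """rankings: ranked doc ids per query; relevant: relevant doc ids per query."""
    total = 0.0
    for docs, rels in zip(rankings, relevant):
        for rank, doc_id in enumerate(docs[:10], start=1):
            if doc_id in rels:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)
```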
|
|
|
|
|
## Usage |
|
|
|
|
|
### Installation |
|
|
|
|
|
```bash |
|
|
pip install torch transformers |
|
|
``` |
|
|
|
|
|
### Basic Usage |
|
|
|
|
|
**Option 1: Using the Hugging Face Hub (Recommended)** |
|
|
|
|
|
```python |
|
|
import torch |
|
|
from transformers import AutoTokenizer |
|
|
# modeling_splade.py is provided in the model repository; keep it next to this script |
from modeling_splade import Splade |
|
|
|
|
|
# Load model and tokenizer |
|
|
model = Splade.from_pretrained("AxelPCG/splade-pt-br") |
|
|
tokenizer = AutoTokenizer.from_pretrained("AxelPCG/splade-pt-br") |
|
|
model.eval() |
|
|
|
|
|
# Encode a query |
|
|
query = "Qual é a capital do Brasil?" |
|
|
with torch.no_grad(): |
|
|
query_tokens = tokenizer(query, return_tensors="pt", max_length=256, truncation=True) |
|
|
query_vec = model(q_kwargs=query_tokens)["q_rep"].squeeze() |
|
|
|
|
|
# Encode a document |
|
|
document = "Brasília é a capital federal do Brasil desde 1960." |
|
|
with torch.no_grad(): |
|
|
doc_tokens = tokenizer(document, return_tensors="pt", max_length=256, truncation=True) |
|
|
doc_vec = model(d_kwargs=doc_tokens)["d_rep"].squeeze() |
|
|
|
|
|
# Calculate similarity (dot product) |
|
|
similarity = torch.dot(query_vec, doc_vec).item() |
|
|
print(f"Similarity: {similarity:.4f}") |
|
|
|
|
|
# Get sparse representation |
|
|
indices = torch.nonzero(query_vec).squeeze().tolist() |
|
|
values = query_vec[indices].tolist() |
|
|
print(f"Active dimensions: {len(indices)} / {query_vec.shape[0]}") |
|
|
``` |
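
The same pattern extends to scoring several candidate documents at once; a sketch building on the snippet above (it assumes the model accepts padded batches, as the SPLADE library does):

```python
# Rank a few candidates against the query vector computed above
documents = [
    "Brasília é a capital federal do Brasil desde 1960.",
    "O carnaval do Rio de Janeiro é famoso no mundo todo.",
]
with torch.no_grad():
    doc_tokens = tokenizer(documents, return_tensors="pt", max_length=256,
                           truncation=True, padding=True)
    doc_vecs = model(d_kwargs=doc_tokens)["d_rep"]  # (num_docs, vocab_size)

scores = doc_vecs @ query_vec  # dot product against the query
for doc, score in sorted(zip(documents, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.4f}  {doc}")
```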
|
|
|
|
|
**Option 2: Using the SPLADE Library** |
|
|
|
|
|
```python |
|
|
from splade.models.transformer_rep import Splade |
|
|
from transformers import AutoTokenizer |
|
|
|
|
|
# Load model by pointing to HuggingFace repo |
|
|
model = Splade(model_type_or_dir="AxelPCG/splade-pt-br", agg="max", fp16=True) |
|
|
tokenizer = AutoTokenizer.from_pretrained("AxelPCG/splade-pt-br") |
|
|
``` |
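
Encoding then follows the same `q_kwargs`/`d_kwargs` convention as in Option 1:

```python
import torch

model.eval()
with torch.no_grad():
    doc = "Brasília é a capital federal do Brasil desde 1960."
    doc_rep = model(d_kwargs=tokenizer(doc, return_tensors="pt"))["d_rep"].squeeze()
print(f"Active dimensions: {int((doc_rep != 0).sum())}")
```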
|
|
|
|
|
## Limitations and Bias |
|
|
|
|
|
- Model trained on machine-translated Portuguese data (mMARCO) |
|
|
- May not capture all socio-cultural aspects of native Brazilian Portuguese |
|
|
- Performance may vary on domain-specific tasks |
|
|
- Inherits biases from BERTimbau base model and training data |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
@misc{splade-pt-br-2025, |
|
|
author = {Axel Chepanski}, |
|
|
title = {SPLADE-PT-BR: Sparse Retrieval for Portuguese}, |
|
|
year = {2025}, |
|
|
publisher = {Hugging Face}, |
|
|
url = {https://huggingface.co/AxelPCG/splade-pt-br} |
|
|
} |
|
|
``` |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
- **SPLADE** by NAVER Labs and the [leobavila/splade](https://github.com/leobavila/splade) fork |
|
|
- **BERTimbau** by Neuralmind |
|
|
- **mMARCO & mRobust Portuguese** by UNICAMP-DL |
|
|
- **Quati Dataset** research, an inspiration for native Portuguese IR |
|
|
|
|
|
## License |
|
|
|
|
|
Apache 2.0 |
|
|
|
|
|
|