---
language: pt
license: apache-2.0
tags:
- information-retrieval
- sparse-retrieval
- splade
- portuguese
- bert
datasets:
- unicamp-dl/mmarco
- unicamp-dl/mrobust
base_model: neuralmind/bert-base-portuguese-cased
---

# SPLADE-PT-BR

SPLADE (Sparse Lexical AnD Expansion) model fine-tuned for **Portuguese** text retrieval. Based on [BERTimbau](https://huggingface.co/neuralmind/bert-base-portuguese-cased) and trained on machine-translated Portuguese retrieval data (mMARCO).

**GitHub Repository**: https://github.com/AxelPCG/SPLADE-PT-BR

## Model Description

SPLADE is a neural retrieval model that learns to expand queries and documents with contextually relevant terms while maintaining sparsity. Unlike dense retrievers, SPLADE produces sparse vectors (typically ~99% sparse) that are:

- **Interpretable**: Each dimension corresponds to a vocabulary token
- **Efficient**: Can use inverted indexes for fast retrieval
- **Effective**: Combines lexical matching with semantic expansion

### Key Features

- **Base Model**: `neuralmind/bert-base-portuguese-cased` (BERTimbau)
- **Vocabulary Size**: 29,794 tokens (Portuguese-optimized)
- **Training Iterations**: 150,000
- **Final Training Loss**: 0.000047
- **Sparsity**: ~99.5% (100-150 active dimensions per vector)
- **Max Sequence Length**: 256 tokens

## Training Details

### Training Data

- **Training Dataset**: mMARCO Portuguese (`unicamp-dl/mmarco`)
- **Validation Dataset**: mRobust (`unicamp-dl/mrobust`)
- **Format**: Triplets (query, positive document, negative document)

### Training Configuration

```yaml
Learning Rate: 2e-5
Batch Size: 8 (effective: 32 with gradient accumulation)
Gradient Accumulation Steps: 4
Weight Decay: 0.01
Warmup Steps: 6,000
Mixed Precision: FP16
Optimizer: AdamW
```

### Regularization

FLOPS regularization is applied to enforce sparsity:

- **Lambda Query**: 0.0003 (queries are kept sparser)
- **Lambda Document**: 0.0001 (documents are kept less sparse, for better recall)

## Performance

**Dataset**: mRobust (528k docs, 250 queries)

| Metric | Score |
|--------|-------|
| **MRR@10** | **0.453** |

## Usage

### Installation

```bash
pip install torch transformers
```

### Basic Usage

**Option 1: Using HuggingFace Hub (Recommended)**

```python
import torch
from transformers import AutoTokenizer
from modeling_splade import Splade  # requires modeling_splade.py to be available locally

# Load model and tokenizer
model = Splade.from_pretrained("AxelPCG/splade-pt-br")
tokenizer = AutoTokenizer.from_pretrained("AxelPCG/splade-pt-br")
model.eval()

# Encode a query
query = "Qual é a capital do Brasil?"
with torch.no_grad():
    query_tokens = tokenizer(query, return_tensors="pt", max_length=256, truncation=True)
    query_vec = model(q_kwargs=query_tokens)["q_rep"].squeeze()
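
# Optional: inspect the query's top-weighted expansion terms. This is an
# illustrative sketch (not part of the original example); it assumes the
# output dimensions map 1:1 onto the tokenizer's vocabulary ids.
top_values, top_ids = torch.topk(query_vec, k=10)
for token, weight in zip(tokenizer.convert_ids_to_tokens(top_ids.tolist()), top_values.tolist()):
    print(f"{token}: {weight:.3f}")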

# Encode a document
document = "Brasília é a capital federal do Brasil desde 1960."
with torch.no_grad():
    doc_tokens = tokenizer(document, return_tensors="pt", max_length=256, truncation=True)
    doc_vec = model(d_kwargs=doc_tokens)["d_rep"].squeeze()

# Calculate similarity (dot product)
similarity = torch.dot(query_vec, doc_vec).item()
print(f"Similarity: {similarity:.4f}")

# Get the sparse representation
indices = torch.nonzero(query_vec).squeeze().tolist()
values = query_vec[indices].tolist()
print(f"Active dimensions: {len(indices)} / {query_vec.shape[0]}")
```

**Option 2: Using the SPLADE Library**

```python
from splade.models.transformer_rep import Splade
from transformers import AutoTokenizer

# Load the model by pointing to the HuggingFace repo
model = Splade(model_type_or_dir="AxelPCG/splade-pt-br", agg="max", fp16=True)
tokenizer = AutoTokenizer.from_pretrained("AxelPCG/splade-pt-br")
```

## Limitations and Bias

- Trained on machine-translated Portuguese data (mMARCO)
- May not capture all socio-cultural aspects of native Brazilian Portuguese
- Performance may vary on domain-specific tasks
- Inherits biases from the BERTimbau base model and the training data

## Citation

```bibtex
@misc{splade-pt-br-2025,
  author    = {Axel Chepanski},
  title     = {SPLADE-PT-BR: Sparse Retrieval for Portuguese},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/AxelPCG/splade-pt-br}
}
```

## Acknowledgments

- **SPLADE** by NAVER Labs and the [leobavila/splade](https://github.com/leobavila/splade) fork
- **BERTimbau** by Neuralmind
- **mMARCO & mRobust Portuguese** by UNICAMP-DL
- **Quati Dataset** research - inspiration for native Portuguese IR

## License

Apache 2.0