🤏 Smol-Data Collection Tried and tested mixes for strong pretraining. Inspired by https://huggingface.co/blog/codelion/optimal-dataset-mixing • 14 items • Updated 12 days ago • 12
RexBERT: Context Specialized Bidirectional Encoders for E-commerce Paper • 2602.04605 • Published Feb 4 • 2
Article Nano-BEIR: A Multilingual Information Retrieval Benchmark with Quality-Enhanced Queries Dec 22, 2025 • 9
Bharat-NanoBEIR: Indian Language Retrieval Benchmarks Collection NanoBEIR retrieval benchmarks translated into 22 Indian languages across 13 datasets. • 22 items • Updated Dec 13, 2025 • 5
Bharat-NanoBEIR Collection Indian Language Information Retrieval Dataset • 286 items • Updated Jan 26, 2025 • 2
M3DR: Towards Universal Multilingual Multimodal Document Retrieval Paper • 2512.03514 • Published Dec 3, 2025 • 9