---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen3-Embedding-4B
library_name: sentence-transformers
---
|
|
## Description |
|
|
This is a [CSRv2](https://arxiv.org/abs/2602.05735) model finetuned on [MTEB](https://huggingface.co/mteb)
STS datasets with [Qwen3-Embedding-4B](https://huggingface.co/Qwen/Qwen3-Embedding-4B) as the backbone.
|
|
|
|
|
For more details, including benchmark evaluation, hardware requirements, and inference performance, please
refer to our [GitHub repository](https://github.com/Y-Research-SBU/CSRv2).
|
|
|
|
|
## Sentence Transformer Usage |
|
|
You can evaluate this model with Sentence Transformers using the following code snippet (taking STS12 as an example):
|
|
```python
import mteb
from sentence_transformers import SparseEncoder

model = SparseEncoder(
    "Y-Research-Group/CSRv2-sts",
    trust_remote_code=True,
)
model.prompts = {
    "STS12": "Instruct: Retrieve semantically similar text\n Query:"
}

task = mteb.get_tasks(tasks=["STS12"])
evaluation = mteb.MTEB(tasks=task)
evaluation.run(
    model,
    eval_splits=["test"],
    output_folder="./results/STS12",
    show_progress_bar=True,
    encode_kwargs={"convert_to_sparse_tensor": False, "batch_size": 8},
)  # MTEB does not support sparse tensors yet, so we convert to dense tensors
```
|
|
|
|
|
We suggest using our [default prompts](https://github.com/Y-Research-SBU/CSRv2/blob/main/text/dataset_to_prompt.json)
for evaluation.
|
|
|
|
|
## Multi-TopK Support |
|
|
|
|
|
Our model supports different sparsity levels thanks to the **Multi-TopK** loss used during training.
You can change the sparsity level by adjusting the `k` parameter in `3_SparseAutoEncoder/config.json`.
The sparsity level is set to 2 by default.
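Conceptually, `k` controls a TopK activation: only the `k` largest hidden activations are kept and the rest are zeroed out. A minimal NumPy sketch of this idea (illustrative only, not the training code; `topk_sparsify` is a hypothetical helper):

```python
import numpy as np

def topk_sparsify(x: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest activations per vector and zero out the rest."""
    out = np.zeros_like(x)
    # Indices of the k largest entries along the last axis
    idx = np.argpartition(x, -k, axis=-1)[..., -k:]
    np.put_along_axis(out, idx, np.take_along_axis(x, idx, axis=-1), axis=-1)
    return out

# A toy 8-dimensional hidden activation; k=2 leaves only 2 nonzeros
h = np.array([[0.1, 0.9, 0.0, 0.4, 0.7, 0.2, 0.05, 0.3]])
sparse_h = topk_sparsify(h, k=2)
print(sparse_h)  # only the 0.9 and 0.7 entries survive
```

A larger `k` trades sparsity for representation fidelity, which is why the same checkpoint can be evaluated at several sparsity levels.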
|
|
|
|
|
For instance, if you want to evaluate with sparsity level $K=8$ (meaning 8 activated neurons in
each embedding vector), `3_SparseAutoEncoder/config.json` should look like this:
|
|
|
|
|
```json
{
    "input_dim": 2560,
    "hidden_dim": 10240,
    "k": 8,
    "k_aux": 1024,
    "normalize": false,
    "dead_threshold": 30
}
```
|
|
|
|
|
## CSRv2 Qwen Series |
|
|
We will release a series of [CSRv2](https://arxiv.org/abs/2602.05735) models finetuned on common tasks in
[MTEB](https://huggingface.co/mteb) with [Qwen3-Embedding-4B](https://huggingface.co/Qwen/Qwen3-Embedding-4B)
as the backbone. These tasks are:
|
|
|
|
|
- **[Classification](https://huggingface.co/Y-Research-Group/CSRv2-classification)**
- **[Clustering](https://huggingface.co/Y-Research-Group/CSRv2-clustering)**
- **[Retrieval](https://huggingface.co/Y-Research-Group/CSRv2-retrieval)**
- **[STS](https://huggingface.co/Y-Research-Group/CSRv2-sts)**
- **[Pair Classification](https://huggingface.co/Y-Research-Group/CSRv2-pair_classification)**
- **[Reranking](https://huggingface.co/Y-Research-Group/CSRv2-reranking)**
|
|
|
|
|
## Citation |
|
|
```bibtex
@inproceedings{guo2026csrv2,
    title={{CSR}v2: Unlocking Ultra-sparse Embeddings},
    author={Guo, Lixuan and Wang, Yifei and Wen, Tiansheng and Wang, Yifan and Feng, Aosong and Chen, Bo and Jegelka, Stefanie and You, Chenyu},
    booktitle={International Conference on Learning Representations (ICLR)},
    year={2026}
}
```