---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen3-Embedding-4B
library_name: sentence-transformers
---
|
|
## Description |
|
|
This is a [CSRv2](https://arxiv.org/abs/2602.05735) model finetuned on [MTEB](https://huggingface.co/mteb)
STS datasets with [Qwen3-Embedding-4B](https://huggingface.co/Qwen/Qwen3-Embedding-4B) as the backbone.
|
|
|
|
|
For more details, including benchmark evaluation, hardware requirements, and inference performance, please
refer to our [GitHub repository](https://github.com/Y-Research-SBU/CSRv2).
|
|
|
|
|
## Sentence Transformer Usage |
|
|
You can evaluate this model with Sentence Transformers using the following code snippet (taking STS12 as an example):
|
|
```python
import mteb
from sentence_transformers import SparseEncoder

model = SparseEncoder(
    "Y-Research-Group/CSRv2-sts",
    trust_remote_code=True,
)
model.prompts = {
    "STS12": "Instruct: Retrieve semantically similar text\n Query:"
}

task = mteb.get_tasks(tasks=["STS12"])
evaluation = mteb.MTEB(tasks=task)
evaluation.run(
    model,
    eval_splits=["test"],
    output_folder="./results/STS12",
    show_progress_bar=True,
    encode_kwargs={"convert_to_sparse_tensor": False, "batch_size": 8},
)  # MTEB does not support sparse tensors yet, so we convert to dense tensors
```
|
|
|
|
|
We suggest using our [default prompts](https://github.com/Y-Research-SBU/CSRv2/blob/main/text/dataset_to_prompt.json)
for evaluation.
|
|
|
|
|
## Multi-TopK Support |
|
|
|
|
|
Our model supports different sparsity levels thanks to the **Multi-TopK** loss used during training.
You can change the sparsity level by adjusting the `k` parameter in `3_SparseAutoEncoder/config.json`.
The sparsity level is set to 2 by default.
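Conceptually, `k` controls a TopK activation: only the `k` largest hidden activations are kept and the rest are zeroed out. A minimal NumPy sketch of this idea (illustrative only, not the training code; `topk_sparsify` is a hypothetical helper):

```python
import numpy as np

def topk_sparsify(x: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest activations per vector and zero out the rest."""
    out = np.zeros_like(x)
    # Indices of the k largest entries along the last axis
    idx = np.argpartition(x, -k, axis=-1)[..., -k:]
    np.put_along_axis(out, idx, np.take_along_axis(x, idx, axis=-1), axis=-1)
    return out

# A toy 8-dimensional hidden activation; k=2 leaves only 2 nonzeros
h = np.array([[0.1, 0.9, 0.0, 0.4, 0.7, 0.2, 0.05, 0.3]])
sparse_h = topk_sparsify(h, k=2)
print(sparse_h)  # only the 0.9 and 0.7 entries survive
```

A larger `k` trades sparsity for representation fidelity, which is why the same checkpoint can be evaluated at several sparsity levels.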
|
|
|
|
|
For instance, if you want to evaluate with sparsity level $K=8$ (meaning 8 activated neurons in
each embedding vector), `3_SparseAutoEncoder/config.json` should look like this:
|
|
|
|
|
```json
{
    "input_dim": 2560,
    "hidden_dim": 10240,
    "k": 8,
    "k_aux": 1024,
    "normalize": false,
    "dead_threshold": 30
}
```
|
|
|
|
|
## CSRv2 Qwen Series |
|
|
We will release a series of [CSRv2](https://arxiv.org/abs/2602.05735) models finetuned on common tasks in
[MTEB](https://huggingface.co/mteb) with [Qwen3-Embedding-4B](https://huggingface.co/Qwen/Qwen3-Embedding-4B)
as the backbone. These tasks are:
|
|
|
|
|
- **[Classification](https://huggingface.co/Y-Research-Group/CSRv2-classification)**
- **[Clustering](https://huggingface.co/Y-Research-Group/CSRv2-clustering)**
- **[Retrieval](https://huggingface.co/Y-Research-Group/CSRv2-retrieval)**
- **[STS](https://huggingface.co/Y-Research-Group/CSRv2-sts)**
- **[Pair Classification](https://huggingface.co/Y-Research-Group/CSRv2-pair_classification)**
- **[Reranking](https://huggingface.co/Y-Research-Group/CSRv2-reranking)**
|
|
|
|
|
## Citation |
|
|
```bibtex
@inproceedings{guo2026csrv2,
    title={{CSR}v2: Unlocking Ultra-sparse Embeddings},
    author={Guo, Lixuan and Wang, Yifei and Wen, Tiansheng and Wang, Yifan and Feng, Aosong and Chen, Bo and Jegelka, Stefanie and You, Chenyu},
    booktitle={International Conference on Learning Representations (ICLR)},
    year={2026}
}
```