GIRCSE: Generative Iterative Refinement for Contrastive Sentence Embeddings
GIRCSE is a generative embedding framework that turns Large Language Models (LLMs) into text encoders by leveraging their autoregressive generative capabilities. Unlike traditional encoder-only approaches that embed text in a single forward pass, GIRCSE generates a sequence of "soft refinement tokens" and applies an Iterative Contrastive Refinement (ICR) objective to progressively distill semantics into high-quality embeddings.
Model Details
Model Description
GIRCSE addresses the limitations of static LLM-based embeddings by treating the representation learning process as an iterative refinement task.
- Key Innovation: Instead of a single forward pass, the model generates $k$ auxiliary soft tokens. These tokens capture latent concepts and implicit semantics (e.g., task-specific instructions) that are often missed by standard pooling methods.
- Iterative Contrastive Refinement (ICR): A stepwise objective that ensures each additional generated token monotonically improves the embedding quality.
- Test-time Scaling: An emergent property where generating more tokens at inference time (e.g., 5 to 20 tokens) leads to better performance on downstream tasks, analogous to "Chain-of-Thought" for embeddings.
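The refinement loop above can be sketched in a few lines. This is a toy NumPy illustration of the idea (pool the states seen so far, "generate" one soft token from that context, append it, repeat), not the released GIRCSE implementation; the function names, the mean-pooling choice, and the `project` callable are all illustrative assumptions.

```python
import numpy as np

def refine_embedding(hidden_states, steps, project):
    """Toy sketch of GIRCSE-style iterative refinement (illustrative only).

    hidden_states: (seq_len, dim) final-layer states for the input text.
    steps: number of soft refinement tokens k to generate.
    project: callable mapping a (dim,) context vector to the next soft token.
    """
    states = list(hidden_states)
    for _ in range(steps):
        context = np.mean(states, axis=0)  # pool everything seen so far
        states.append(project(context))    # "generate" the next soft token
    # The final embedding pools the input states plus all refinement tokens.
    emb = np.mean(states, axis=0)
    return emb / np.linalg.norm(emb)

# Stand-in "model": a fixed random projection instead of a real LLM head.
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 16)) * 0.1
hidden = rng.standard_normal((5, 16))

# Test-time scaling knob: the same input with more refinement steps.
emb_5 = refine_embedding(hidden, steps=5, project=lambda c: np.tanh(W @ c))
emb_20 = refine_embedding(hidden, steps=20, project=lambda c: np.tanh(W @ c))
print(emb_5.shape, emb_20.shape)
```

Generating more soft tokens changes (and, per the paper's claim, improves) the pooled embedding, which is what makes the token budget a test-time scaling knob.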
Developed by: Yu-Che (Roy) Tsai et al.
Model type: Generative Text Embedding (based on Decoder-only LLM)
Language(s) (NLP): English
License: Apache 2.0
Finetuned from model: mistralai/Mistral-7B-v0.1
Model Sources
- Repository: Roytsai27/GIRCSE
- Paper: Let LLMs Speak Embedding Languages: Generative Text Embeddings via Iterative Contrastive Refinement
Training Details
Training Data
The model was trained on a curated mix of contrastive datasets (e.g., MS-MARCO, NLI) totaling approximately 200K samples.
Training Procedure
- Method: LoRA (Low-Rank Adaptation)
- Objective: Iterative Contrastive Refinement (ICR) with Stepwise Contrastive Loss.
- Steps: 5 refinement steps were used during training.
- Framework: PEFT 0.15.2 + Transformers.
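A stepwise contrastive objective of the kind described above can be sketched as an InfoNCE loss evaluated at every refinement step and averaged. This is a minimal NumPy sketch under stated assumptions (in-batch negatives, a 0.05 temperature, uniform averaging over steps), not the exact loss from the paper.

```python
import numpy as np

def info_nce(q, p, temperature=0.05):
    """In-batch InfoNCE: q[i] matches p[i]; every other p[j] is a negative."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    logits = q @ p.T / temperature                 # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))            # -log p(correct match)

def stepwise_contrastive_loss(q_steps, p_steps):
    """Average InfoNCE over refinement steps, so each additional soft token
    is pressured to improve query/positive alignment (ICR-style objective)."""
    return float(np.mean([info_nce(q, p) for q, p in zip(q_steps, p_steps)]))

# Illustrative setup: embeddings at K=5 refinement steps for a batch of 4.
rng = np.random.default_rng(1)
B, D, K = 4, 8, 5
q_steps = [rng.standard_normal((B, D)) for _ in range(K)]
p_steps = [rng.standard_normal((B, D)) for _ in range(K)]
loss = stepwise_contrastive_loss(q_steps, p_steps)
print(loss >= 0.0)
```

Because the loss is applied per step rather than only to the final embedding, gradients reach every intermediate refinement token, which is what drives the monotonic-improvement behavior the card describes.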
Citation
If you find this work helpful, please cite:
BibTeX:
@article{tsai2025gircse,
  title={Let LLMs Speak Embedding Languages: Generative Text Embeddings via Iterative Contrastive Refinement},
  author={Tsai, Yu-Che and others},
  journal={arXiv preprint arXiv:2509.24291},
  year={2025}
}
Contact
For questions, please open an issue in the GitHub Repository.