GIRCSE: Generative Iterative Refinement for Contrastive Sentence Embeddings

GIRCSE is a novel generative embedding framework that transforms Large Language Models (LLMs) into text encoders by leveraging their autoregressive generative capabilities. Unlike traditional encoder-only embedding models, GIRCSE generates a sequence of "soft refinement tokens" and uses an Iterative Contrastive Refinement (ICR) objective to progressively distill semantics into high-quality embeddings.

Model Details

Model Description

GIRCSE addresses the limitations of static LLM-based embeddings by treating the representation learning process as an iterative refinement task.

  • Key Innovation: Instead of a single forward pass, the model generates $k$ auxiliary soft tokens. These tokens capture latent concepts and implicit semantics (e.g., task-specific instructions) that are often missed by standard pooling methods.

  • Iterative Contrastive Refinement (ICR): A stepwise objective that ensures each additional generated token monotonically improves the embedding quality.

  • Test-time Scaling: An emergent property where generating more soft tokens at inference time (e.g., increasing from 5 to 20 tokens) improves performance on downstream tasks, analogous to "Chain-of-Thought" reasoning for embeddings.

  • Developed by: Yu-Che (Roy) Tsai, et al.

  • Model type: Generative Text Embedding (based on Decoder-only LLM)

  • Language(s) (NLP): English

  • License: Apache 2.0

  • Finetuned from model: mistralai/Mistral-7B-v0.1
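To make the refinement loop concrete, here is a minimal sketch of the idea in numpy. Everything below is hypothetical: `toy_decoder_step` stands in for a full transformer forward pass, and the hidden size and pooling choices are toy values, not the model's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN = 16  # toy hidden size (the real Mistral-7B backbone uses 4096)
K = 5        # number of soft refinement tokens to generate

def toy_decoder_step(states: np.ndarray) -> np.ndarray:
    """Stand-in for one autoregressive step of the LLM: maps the current
    token states to the hidden state of the next soft token.
    (Hypothetical; the real model runs a full transformer pass.)"""
    w = rng.standard_normal((HIDDEN, HIDDEN)) / np.sqrt(HIDDEN)
    h = np.tanh(states.mean(axis=0) @ w)
    return h / np.linalg.norm(h)

def embed(input_states: np.ndarray, k: int = K) -> np.ndarray:
    """Generate k soft refinement tokens, appending each to the sequence,
    then mean-pool into a single unit-norm embedding."""
    states = input_states
    for _ in range(k):
        soft = toy_decoder_step(states)
        states = np.vstack([states, soft[None, :]])  # append soft token
    emb = states.mean(axis=0)
    return emb / np.linalg.norm(emb)

tokens = rng.standard_normal((8, HIDDEN))  # toy "encoded input sequence"
e5 = embed(tokens, k=5)
e20 = embed(tokens, k=20)  # test-time scaling: spend more tokens at inference
```

The point of the sketch is the control flow, not the math: each generated soft token is fed back into the sequence before the next one is produced, so the embedding is refined step by step rather than read out from a single forward pass.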

Training Details

Training Data

The model was trained on a curated mix of contrastive datasets (e.g., MS-MARCO, NLI) totaling approximately 200K samples.

Training Procedure

  • Method: LoRA (Low-Rank Adaptation)
  • Objective: Iterative Contrastive Refinement (ICR) with Stepwise Contrastive Loss.
  • Steps: 5 refinement steps were used during training.
  • Framework: PEFT 0.15.2 + Transformers.
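A stepwise contrastive objective of the kind described above can be sketched as an in-batch InfoNCE loss summed over the k refinement steps. This is a simplified illustration, not the paper's exact loss: the temperature, batch construction, and any per-step weighting here are assumptions.

```python
import numpy as np

def info_nce(q: np.ndarray, pos: np.ndarray, temp: float = 0.05) -> float:
    """In-batch InfoNCE: each query's positive is the same-index passage;
    all other passages in the batch act as negatives."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    pos = pos / np.linalg.norm(pos, axis=1, keepdims=True)
    logits = q @ pos.T / temp                    # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def icr_loss(q_steps, pos_steps) -> float:
    """Stepwise contrastive loss: apply InfoNCE to the embedding at every
    refinement step, so each extra soft token is pushed to help."""
    return sum(info_nce(q, p) for q, p in zip(q_steps, pos_steps))

rng = np.random.default_rng(0)
B, D, K = 4, 16, 5  # toy batch size, embedding dim, refinement steps
q_steps = [rng.standard_normal((B, D)) for _ in range(K)]
pos_steps = [rng.standard_normal((B, D)) for _ in range(K)]
loss = icr_loss(q_steps, pos_steps)
```

Summing the per-step losses is one simple way to pressure every intermediate embedding, not just the final one, toward separating positives from in-batch negatives.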

Citation

If you find this work helpful, please cite:

BibTeX:

@article{tsai2025gircse,
  title={Let LLMs Speak Embedding Languages: Generative Text Embeddings via Iterative Contrastive Refinement},
  author={Tsai, Yu-Che and others},
  journal={arXiv preprint arXiv:2509.24291},
  year={2025}
}

Contact

For questions, please open an issue in the GitHub Repository.
