RTriever-4B

RTriever-4B is a 4-billion-parameter dense retriever based on Qwen/Qwen3-Embedding-4B, specialized for reasoning-intensive information retrieval.

Model details

Base model: Qwen/Qwen3-Embedding-4B
Parameters: 4 B
Hidden size: 2,560
Layers: 36
Attention heads: 32
Embedding dimension: 2,560
Pooling: last-token + L2 normalization
Max sequence length: 32,768 tokens
Tokenizer: Qwen3 (vocab size 151,665)
Format: safetensors (sharded, bf16); also exposes a sentence-transformers interface
License: MIT

The model can be loaded either as a sentence-transformers model or with plain transformers (AutoModel) plus manual last-token pooling.

Quick start

Option A — sentence-transformers (recommended)

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("yale-nlp/RTriever-4B")

# Queries should use a query prompt; documents should NOT have a prompt.
query = "Why are insects attracted to light at night?"
docs = [
    "Recent flight-tracking studies show insects orient their dorsal axis toward "
    "the brightest visual region; near a point light source, this dorsal-light "
    "response disrupts flight stability and traps the insect.",
    "Fluorescent lamps emit in the UV range, which can be perceived by some "
    "nocturnal insects as a navigational cue similar to moonlight.",
]

q_emb = model.encode(query, prompt_name="query")
d_emb = model.encode(docs)

# Cosine similarity (the embeddings are L2-normalized, so a dot product suffices)
scores = q_emb @ d_emb.T
print(scores)
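
If you prefer a helper over the manual dot product, sentence-transformers ships util.semantic_search, which accepts the numpy embeddings produced above (a minimal sketch):

from sentence_transformers import util

# Equivalent top-k ranking via the built-in helper; each hit is a dict
# with "corpus_id" and "score" keys.
hits = util.semantic_search(q_emb, d_emb, top_k=2)
print(hits[0])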

The prompt_name="query" argument prepends the default query instruction stored in config_sentence_transformers.json:

Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery:
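
To see which prompt templates ship with the checkpoint, inspect the loaded model (both attributes are standard sentence-transformers fields):

# Stored prompt templates and the default prompt name, if one is configured.
print(model.prompts)              # e.g. {"query": "Instruct: ...\nQuery:"}
print(model.default_prompt_name)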

For task-specific instructions (e.g. a domain-tuned prompt template), pass prompt=... directly:

custom_prompt = (
    "Instruct: Given a Biology post, retrieve relevant passages that help answer the post\nPost: "
)
q_emb = model.encode(query, prompt=custom_prompt)

Option B — transformers with manual pooling

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yale-nlp/RTriever-4B")
model = AutoModel.from_pretrained(
    "yale-nlp/RTriever-4B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

QUERY_PROMPT = "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery:"


def last_token_pool(last_hidden_states, attention_mask):
    # Right-padding-aware last-token pooling. Works whether or not the tokenizer
    # left-pads.
    left_padding = attention_mask[:, -1].sum() == attention_mask.shape[0]
    if left_padding:
        return last_hidden_states[:, -1]
    seq_lens = attention_mask.sum(dim=1) - 1
    return last_hidden_states[torch.arange(last_hidden_states.size(0)), seq_lens]


def encode(texts, prompt: str = ""):
    if prompt:
        texts = [prompt + t for t in texts]   # match sentence-transformers behavior: no separator
    batch = tokenizer(texts, padding=True, truncation=True, max_length=8192, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**batch)
    pooled = last_token_pool(out.last_hidden_state, batch["attention_mask"])
    return F.normalize(pooled, p=2, dim=1)


queries = ["Why are insects attracted to light at night?"]
docs = [
    "Recent flight-tracking studies show insects orient their dorsal axis toward "
    "the brightest visual region; near a point light source, this dorsal-light "
    "response disrupts flight stability and traps the insect.",
]

q_emb = encode(queries, prompt=QUERY_PROMPT)
d_emb = encode(docs, prompt="")
scores = (q_emb @ d_emb.T).cpu().tolist()
print(scores)
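
To sanity-check the pooling logic in isolation, here is a toy example with made-up tensors (reusing last_token_pool from above):

# Batch of 2 right-padded sequences, 3 positions, hidden size 3.
toy_hidden = torch.arange(2 * 3 * 3, dtype=torch.float32).reshape(2, 3, 3)
toy_mask = torch.tensor([[1, 1, 0],   # 2 real tokens -> pool position 1
                         [1, 1, 1]])  # 3 real tokens -> pool position 2
pooled = last_token_pool(toy_hidden, toy_mask)
assert torch.equal(pooled[0], toy_hidden[0, 1])
assert torch.equal(pooled[1], toy_hidden[1, 2])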

Option C — batched retrieval over a corpus

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("yale-nlp/RTriever-4B")

corpus = [...]                          # list[str], thousands–millions of docs
doc_emb = model.encode(corpus, batch_size=16, show_progress_bar=True)

queries = [...]                         # list[str]
q_emb = model.encode(queries, prompt_name="query", batch_size=16)

# Top-k retrieval (cosine similarity == dot product, since both sides are L2-normalized)
scores = q_emb @ doc_emb.T              # (n_query, n_doc)
top_k = np.argsort(-scores, axis=1)[:, :100]
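
For corpora past a few hundred thousand documents, the dense score matrix above gets unwieldy, and an index library is the usual fix. A minimal sketch with FAISS (install faiss-cpu separately; IndexFlatIP computes exact inner product, which equals cosine here because the embeddings are L2-normalized):

import faiss

index = faiss.IndexFlatIP(doc_emb.shape[1])               # exact inner-product index
index.add(doc_emb.astype("float32"))
scores, ids = index.search(q_emb.astype("float32"), 100)  # both (n_query, 100)

For million-scale corpora, an approximate index such as faiss.IndexHNSWFlat trades a little recall for much faster search.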

Notes on inputs

  • Query prompt is required. Use prompt_name="query" (sentence-transformers) or prepend QUERY_PROMPT manually (transformers). Documents are encoded without a prompt.
  • Domain-specific prompts typically improve retrieval quality on reasoning-intensive queries; a generic web-search prompt is provided as the default.
  • Long inputs are supported up to 32 K tokens; for retrieval you usually want to truncate documents to 4–8 K tokens to keep encoding cost manageable (see the sketch after this list).
  • Embeddings are L2-normalized, so cosine similarity reduces to a dot product. Both sentence-transformers (util.cos_sim) and a plain q @ d.T work.
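
As referenced in the notes, truncation can be enforced at encode time by capping the model's sequence length. A sketch, reusing corpus from Option C, with 8192 as an arbitrary token budget:

model = SentenceTransformer("yale-nlp/RTriever-4B")
model.max_seq_length = 8192   # tokenizer truncates anything past this budget
doc_emb = model.encode(corpus, batch_size=16)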

Intended use

RTriever-4B is intended for reasoning-intensive retrieval: queries that require multi-step inference and the integration of complementary evidence rather than surface-level keyword or paraphrase matching. It can also be used as a drop-in replacement for any general-purpose dense retriever in retrieval-augmented generation, scholarly search, and agentic search pipelines.

License

Released under the MIT License. The base model (Qwen/Qwen3-Embedding-4B) retains its original license; consult the Qwen3-Embedding model card for upstream attribution.
