---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- feature-extraction
- sentence-similarity
- search
- retrieval
- ranking
- embeddings
- semantic-search
- bi-encoder
- rag
- transformers
- pytorch
model_size: 1B
base_model: google/gemma-3-1b-pt
datasets:
- ms_marco
- natural_questions
- hotpot_qa
pipeline_tag: feature-extraction
---

# Rank-Embed-1B

![image](https://cdn-uploads.huggingface.co/production/uploads/69625a973527f984b1d0cec1/M1Z3pkuYe4U0jyT71HEaL.png)

Rank-Embed-1B is a specialized 1B-parameter **bi-encoder** model from GorankLabs, fine-tuned from [`google/gemma-3-1b-pt`](https://huggingface.co/google/gemma-3-1b-pt). It converts text into dense vector representations so systems can reason about semantic meaning rather than relying solely on keyword overlap.

Built for retrieval-first workloads, Rank-Embed-1B is intended for **complex search**, semantic retrieval, ranking, retrieval-augmented generation, clustering, and duplicate detection. It combines efficient inference with strong language understanding, making it well suited for production retrieval pipelines on practical hardware.

The model uses mean pooling over the final hidden states to produce embeddings and introduces three task-specific special tokens: `<|query_token|>`, `<|document_token|>`, and `<|passage_token|>`. These tokens help distinguish input types and improve alignment between queries and candidate texts. For best results, prepend the appropriate token to every input before encoding.
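In symbols, given final hidden states $\mathbf{h}_t$ and an attention mask $m_t \in \{0, 1\}$ over a length-$T$ input, mean pooling produces the embedding as the mask-weighted average of the hidden states (padding positions contribute nothing):

$$
\mathbf{e} = \frac{\sum_{t=1}^{T} m_t \, \mathbf{h}_t}{\sum_{t=1}^{T} m_t}
$$

The resulting vector is then L2-normalized, so dot products between embeddings are cosine similarities.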
## Model Summary

| Property | Value |
|----------|-------|
| Architecture | Custom Gemma-based embedding model |
| Base model | `google/gemma-3-1b-pt` |
| Parameters | ~1.24B |
| Embedding dimension | 2048 |
| Maximum sequence length | 131,072 tokens |
| Pooling | Mean pooling over final hidden states |
| Precision | bfloat16 |
| Framework | PyTorch / Transformers |
| License | Apache 2.0 |

## Key Capabilities

- Dense embedding generation for queries, documents, and passages
- Retrieval and semantic similarity support with task-specific token prefixes
- Strong performance on complex, multi-hop, and semantically rich search queries
- Long-context support up to 131,072 tokens
- Compatibility with Hugging Face Transformers through `trust_remote_code=True`

## Quick Start

### Installation

```bash
pip install transformers torch sentencepiece
```

### Embedding Queries and Documents

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "GorankLabs/Rank-Embed-1B"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
model.eval()


def mean_pool(token_embeddings, attention_mask):
    # Zero out padding positions, then average the remaining hidden states.
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )


def embed(texts: list[str], prefix_token: str) -> torch.Tensor:
    # Prepend the task-specific special token to every input.
    prefixed = [f"{prefix_token} {text}" for text in texts]
    encoded = tokenizer(
        prefixed,
        padding=True,
        truncation=True,
        max_length=512,
        return_tensors="pt",
    )
    with torch.no_grad():
        outputs = model(**encoded)
    embeddings = mean_pool(outputs.last_hidden_state, encoded["attention_mask"])
    # L2-normalize so that dot products equal cosine similarity.
    return torch.nn.functional.normalize(embeddings, p=2, dim=-1)


queries = ["What is dense retrieval?"]
documents = [
    "Dense retrieval uses learned embeddings to match queries and documents.",
    "Sparse retrieval relies on exact keyword overlap such as BM25.",
]

# query_token / document_token are attributes exposed by this repository's
# custom tokenizer class.
query_embeddings = embed(queries, tokenizer.query_token)
document_embeddings = embed(documents, tokenizer.document_token)

scores = (query_embeddings @ document_embeddings.T) * 100
print(scores.tolist())
```

### Pipeline Usage

```python
from transformers import pipeline

pipe = pipeline(
    "feature-extraction",
    model="GorankLabs/Rank-Embed-1B",
    trust_remote_code=True,
    torch_dtype="bfloat16",
)

embeddings = pipe("<|query_token|> What causes aurora borealis?", return_tensors=True)
print(embeddings[0].shape)
```

## Special Tokens

The model extends the base Gemma vocabulary with three additional special tokens:

| Token | Token ID | Purpose |
|-------|----------|---------|
| `<\|query_token\|>` | 128256 | Prefix for search query inputs |
| `<\|document_token\|>` | 128257 | Prefix for full documents or chunks |
| `<\|passage_token\|>` | 128258 | Prefix for short passages or sentence-level inputs |

These tokens should always be prepended to the corresponding input type.

## Architecture Details

Rank-Embed-1B is derived from the Gemma architecture and adapted for embedding-focused workloads. The repository includes custom configuration, tokenizer, and model classes exposed through `trust_remote_code=True`.

Key implementation details:

- Custom model registration through `trust_remote_code=True`
- `pooling_type`: `mean`
- Custom `AutoConfig`, `AutoModel`, and `AutoTokenizer` registration
- Long-context support up to 131,072 tokens

## What This Model Is

Rank-Embed-1B transforms text into mathematical vectors, or embeddings, that capture semantic meaning. Instead of depending purely on lexical overlap, it enables systems to compare inputs based on intent, topic, and contextual similarity.

As a compact 1B-parameter model built on Gemma 3 1B PT, it is optimized for efficient deployment while retaining the capacity needed for nuanced retrieval tasks.
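Because `embed()` in the Quick Start L2-normalizes its outputs, the matrix product there is exactly cosine similarity, and retrieval reduces to a top-k selection over scores. A self-contained sketch with toy vectors standing in for real model output (no model download required):

```python
import torch
import torch.nn.functional as F

# Toy vectors standing in for embed() output; rows are L2-normalized,
# so the dot products below are cosine similarities.
query_embeddings = F.normalize(torch.tensor([[0.9, 0.1, 0.0]]), dim=-1)
document_embeddings = F.normalize(torch.tensor([
    [0.8, 0.2, 0.1],  # close in direction to the query
    [0.0, 0.1, 0.9],  # off-topic
    [0.7, 0.3, 0.2],  # somewhat related
]), dim=-1)

scores = query_embeddings @ document_embeddings.T  # shape (1, 3)
top = torch.topk(scores, k=2, dim=-1)
print(top.indices.tolist())  # [[0, 2]]
```

In a real pipeline the document matrix would be precomputed once and cached (or held in a vector index), so each query costs one matrix-vector product plus a top-k.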
Rank-Embed-1B is therefore a strong fit for teams that need practical inference performance without sacrificing retrieval quality.

Unlike a generative chatbot, Rank-Embed-1B is purpose-built for information retrieval. Its role is not to generate responses, but to identify, compare, and surface the most relevant pieces of information from a corpus.

## What It Can Do

- **Semantic search**: retrieves relevant content even when queries and documents use different wording.
- **Complex search**: handles nuanced, intent-heavy queries where the right result depends on context, relationships, and meaning rather than exact phrasing.
- **Retrieval-augmented generation**: serves as the retrieval layer for RAG systems by selecting relevant context for downstream language models.
- **Clustering and organization**: groups large collections of documents, tickets, or records by semantic similarity.
- **Duplicate detection**: identifies differently phrased inputs that express the same or highly similar meaning.

## Loading the Model Safely

This repository provides custom Python modules, so the model must be loaded with `trust_remote_code=True`:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "GorankLabs/Rank-Embed-1B",
    trust_remote_code=True,
)

model = AutoModel.from_pretrained(
    "GorankLabs/Rank-Embed-1B",
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="auto",
)
```

Only enable `trust_remote_code=True` for repositories you trust, and review the custom code before deploying in production environments.

## License

This model is released under the **Apache License 2.0**. The base model weights are derived from [`google/gemma-3-1b-pt`](https://huggingface.co/google/gemma-3-1b-pt). Use of this repository must comply with the applicable Gemma license terms in addition to the license for this repository where required.

## Contact and Citation

Maintained by **GorankLabs**.
For questions, issues, or collaboration inquiries, please use the repository.
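## Appendix: Duplicate Detection Sketch

As a companion to the duplicate-detection use case listed above, here is a minimal sketch of threshold-based near-duplicate detection over precomputed embeddings. The vectors are toys standing in for model output, and the 0.95 threshold is a hypothetical value to tune per corpus:

```python
import torch
import torch.nn.functional as F

# Toy normalized embeddings standing in for model output.
emb = F.normalize(torch.tensor([
    [0.90, 0.10, 0.00],  # "reset my password"
    [0.88, 0.12, 0.02],  # "how do I change my password" (near-duplicate)
    [0.00, 0.20, 0.90],  # "cancel my subscription"
]), dim=-1)

sim = emb @ emb.T        # pairwise cosine similarity matrix
threshold = 0.95         # hypothetical cutoff; tune per corpus
pairs = [
    (i, j)
    for i in range(len(sim))
    for j in range(i + 1, len(sim))
    if sim[i, j] > threshold
]
print(pairs)  # [(0, 1)]
```

The quadratic pairwise comparison is fine for small batches; at corpus scale the same check would typically run through an approximate-nearest-neighbor index instead.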