---
license: gemma
language:
- en
- ko
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- embedding
- gemma
- text-embedding
- retrieval
- matryoshka
- academic-search
- scientific-search
library_name: transformers
base_model: google/gemma-3-300m
datasets:
- ms_marco
---

# LinerAI/embeddinggemma-300m-academic for Academic Search

This is a fine-tuned version of [google/gemma-3-300m](https://huggingface.co/google/gemma-3-300m) optimized for **academic and scientific literature search**. The model was trained with contrastive learning on hard negatives mined specifically for academic search scenarios.

## Highlights

- **Optimized for Academic Search**: Fine-tuned on datasets specifically designed for academic literature retrieval
- **Hard Negative Mining**: Trained with carefully mined hard negatives to improve discrimination between similar academic papers
- **Matryoshka Representation Learning (MRL)**: Supports flexible embedding dimensions (768, 512, 256, 128) for efficiency
- **Lightweight**: Based on Gemma-3 300M, offering a good balance between performance and computational efficiency

## Model Description

| Attribute | Value |
|-----------|-------|
| Base Model | google/gemma-3-300m |
| Architecture | Gemma |
| Embedding Dimension | 768 |
| MRL Dimensions | 768, 512, 256, 128 |
| Max Sequence Length | 2048 |
| Pooling | Mean pooling |
| Precision | bfloat16 |

## Evaluation Results

All scores are Recall@10. Avg. is the unweighted mean over the six benchmarks.

| Model | Avg. | SciFact | TRECCOVID | NFCorpus | SCIDOCS | LitSearch | QASA |
| --- | --- | --- | --- | --- | --- | --- | --- |
| embeddinggemma-300m-academic | 0.3767 | 0.8863 | 0.0186 | 0.1735 | 0.1879 | 0.625 | 0.369 |
| embeddinggemma-300m | 0.3732 | 0.9159 | 0.0215 | 0.1954 | 0.2037 | 0.6099 | 0.2926 |
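
The evaluation script is not included in this card. As a rough illustration of the metric, Recall@10 is commonly defined as the fraction of a query's relevant documents that appear among the top 10 retrieved results. The sketch below assumes that definition; the function name `recall_at_k` and the toy document IDs are hypothetical.

```python
from typing import Iterable, Sequence


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Iterable[str], k: int = 10) -> float:
    """Fraction of the relevant documents that appear in the top-k ranking."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    # Use a set so a document ID repeated in the ranking is counted once.
    retrieved = set(ranked_ids[:k])
    return len(relevant & retrieved) / len(relevant)


# Toy example: 3 relevant documents, 2 of them ranked in the top 10.
ranking = ["d7", "d2", "d9", "d4", "d1", "d8", "d3", "d6", "d5", "d0", "d11"]
relevant = {"d2", "d4", "d11"}

score = recall_at_k(ranking, relevant, k=10)
print(f"Recall@10: {score:.4f}")
```

A benchmark score is then the mean of this value over all queries in the test set.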

## Training Details

### Training Configuration

| Parameter | Value |
|-----------|-------|
| Learning Rate | 2e-5 |
| Batch Size | 8192 (effective) |
| Per-Device Batch Size | 32 |
| Warmup Steps | 100 |
| Weight Decay | 0.1 |
| Precision | bf16 |
| Max Length | 2048 |
| Loss Function | InfoNCE (Contrastive) |
| Temperature (τ) | 0.02 |
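
The table names the loss and its temperature but not the implementation. The following is a minimal sketch of what InfoNCE with τ = 0.02 could look like, assuming in-batch negatives plus a fixed number of hard negatives per query. The function name `info_nce_loss`, the tensor shapes, and the use of in-batch negatives are illustrative assumptions, not the authors' training code.

```python
import torch
import torch.nn.functional as F


def info_nce_loss(query_emb, pos_emb, neg_emb=None, temperature=0.02):
    """InfoNCE over in-batch negatives, plus optional hard negatives.

    query_emb: (B, D), pos_emb: (B, D), neg_emb: (B, N, D) or None.
    """
    q = F.normalize(query_emb.float(), dim=-1)
    p = F.normalize(pos_emb.float(), dim=-1)
    # (B, B) cosine similarities: the diagonal holds the positive pairs,
    # off-diagonal entries act as in-batch negatives.
    logits = q @ p.T
    if neg_emb is not None:
        n = F.normalize(neg_emb.float(), dim=-1)
        # (B, N) similarity of each query to its own hard negatives.
        hard_logits = torch.einsum("bd,bnd->bn", q, n)
        logits = torch.cat([logits, hard_logits], dim=1)
    # The correct "class" for query i is column i (its positive).
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits / temperature, labels)


torch.manual_seed(0)
queries = torch.randn(4, 768)
positives = queries + 0.01 * torch.randn(4, 768)  # near-duplicates of the queries
hard_negatives = torch.randn(4, 2, 768)           # two hard negatives per query

loss = info_nce_loss(queries, positives, hard_negatives)
print(f"InfoNCE loss: {loss.item():.6f}")
```

Dividing by a small temperature such as 0.02 sharpens the softmax, so even small similarity gaps between the positive and a hard negative produce a strong training signal.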

### Training Data

The model was trained on [LEAD (Liner Embedding Academic Dataset)](https://huggingface.co/datasets/LinerAI/LEAD), a combination of ~55,560 samples tailored for academic search:
- **MS MARCO** (49%): General passage retrieval dataset with hard negatives
- **Academic Search Dataset** (51%): Custom dataset built specifically for academic literature search, with two-stage hard negative mining

### Matryoshka Representation Learning (MRL)

This model supports [Matryoshka Representation Learning](https://arxiv.org/abs/2205.13147). You can truncate embeddings to smaller dimensions (512, 256, 128) for faster computation and reduced storage.

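```python
# `embeddings` is a (batch_size, 768) tensor, e.g. the output of
# encode_query / encode_passage in the Usage section.

# Full dimension (768)
full_embedding = embeddings[:, :768]

# MRL dimensions: keep the leading components.
# Truncated vectors are no longer unit-length, so re-normalize them
# before using dot-product similarity.
embedding_512 = embeddings[:, :512]
embedding_256 = embeddings[:, :256]
embedding_128 = embeddings[:, :128]
```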
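
A truncated slice of a unit-length vector is generally shorter than unit length, which matters if similarity is computed as a plain dot product. The sketch below wraps truncation and L2 re-normalization in one helper. The name `truncate_embeddings` and the `MRL_DIMS` constant are hypothetical, and random vectors stand in for real model outputs.

```python
import torch
import torch.nn.functional as F

MRL_DIMS = (768, 512, 256, 128)  # dimensions listed in the model description


def truncate_embeddings(embeddings: torch.Tensor, dim: int) -> torch.Tensor:
    """Keep the first `dim` components and rescale each row to unit length."""
    if dim not in MRL_DIMS:
        raise ValueError(f"dim must be one of {MRL_DIMS}, got {dim}")
    return F.normalize(embeddings[:, :dim], p=2, dim=1)


# Random unit vectors stand in for real model outputs.
torch.manual_seed(0)
embeddings = F.normalize(torch.randn(3, 768), p=2, dim=1)

small = truncate_embeddings(embeddings, 256)
print(small.shape)        # torch.Size([3, 256])
print(small.norm(dim=1))  # each row is back to (approximately) unit norm
```

Queries and passages should be truncated to the same dimension before they are compared.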

## Usage

### Using Transformers

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_path = "LinerAI/embeddinggemma-300m-academic"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16)
model.eval()

# For queries
def encode_query(text):
    input_text = f"task: search result | query: {text}"
    inputs = tokenizer(input_text, return_tensors="pt", max_length=2048, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
        embeddings = outputs.last_hidden_state.mean(dim=1)  # Mean pooling
        embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
    return embeddings

# For passages
def encode_passage(text):
    input_text = f"title: none | text: {text}"
    inputs = tokenizer(input_text, return_tensors="pt", max_length=2048, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
        embeddings = outputs.last_hidden_state.mean(dim=1)  # Mean pooling
        embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
    return embeddings

# Example: Academic search
query = "transformer models for protein structure prediction"
abstract = "We introduce AlphaFold, a deep learning system that predicts protein structures..."

query_emb = encode_query(query)
passage_emb = encode_passage(abstract)

similarity = torch.nn.functional.cosine_similarity(query_emb, passage_emb)
print(f"Similarity: {similarity.item():.4f}")
```
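
The functions above encode one text at a time, so no padding tokens are present and a plain `mean(dim=1)` is sufficient. When several texts are tokenized together with `padding=True`, padding positions should be excluded from the average. The following is a minimal sketch of mask-aware mean pooling; the helper name `masked_mean_pool` is hypothetical, and small synthetic tensors stand in for `outputs.last_hidden_state` and `inputs["attention_mask"]`.

```python
import torch
import torch.nn.functional as F


def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (B, T, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (B, D)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts


# Synthetic stand-in for model outputs: batch of 2, 4 tokens, hidden size 3.
hidden = torch.tensor([
    [[1.0, 1.0, 1.0], [3.0, 3.0, 3.0], [9.0, 9.0, 9.0], [9.0, 9.0, 9.0]],
    [[2.0, 0.0, 0.0], [2.0, 0.0, 0.0], [2.0, 0.0, 0.0], [2.0, 0.0, 0.0]],
])
mask = torch.tensor([[1, 1, 0, 0], [1, 1, 1, 1]])  # first row ends with 2 padding tokens

pooled = masked_mean_pool(hidden, mask)
embeddings = F.normalize(pooled, p=2, dim=1)
print(pooled)
```

With the real model, the same helper would be called as `masked_mean_pool(outputs.last_hidden_state, inputs["attention_mask"])` after tokenizing a list of prompt-formatted texts with `padding=True, truncation=True, max_length=2048, return_tensors="pt"`.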

## Input Format

### Query Format
```
task: search result | query: {your_query_text}
```

### Passage Format
```
title: none | text: {your_passage_text}
```

## Intended Use

- **Academic Paper Search**: Finding relevant research papers given a research query
- **Literature Review**: Discovering related work in academic literature
- **Scientific Document Retrieval**: Retrieving scientific documents, abstracts, and articles
- **Research Question Answering**: Finding papers that address specific research questions

## Limitations

- Maximum sequence length is 2048 tokens
- Best performance achieved when using the specific input formats described above
- The model uses asymmetric encoding (different prompts for queries vs passages)

## License

This model is released under the [Gemma license](https://ai.google.dev/gemma/terms). Please review Google's usage license before using this model.