---
license: gemma
language:
- en
- ko
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- embedding
- gemma
- text-embedding
- retrieval
- matryoshka
- academic-search
- scientific-search
library_name: transformers
base_model: google/gemma-3-300m
datasets:
- ms_marco
---

# LinerAI/embeddinggemma-300m-academic for Academic Search

This is a fine-tuned version of [google/gemma-3-300m](https://huggingface.co/google/gemma-3-300m) optimized for **academic and scientific literature search**. The model was trained with contrastive learning and hard negative mining on data curated specifically for academic search scenarios.

## Highlights

- **Optimized for Academic Search**: Fine-tuned on datasets specifically designed for academic literature retrieval
- **Hard Negative Mining**: Trained with carefully mined hard negatives to improve discrimination between similar academic papers
- **Matryoshka Representation Learning (MRL)**: Supports flexible embedding dimensions (768, 512, 256, 128) for efficiency
- **Lightweight**: Based on Gemma-3 300M, offering a good balance between performance and computational efficiency

## Model Description

| Attribute | Value |
|-----------|-------|
| Base Model | google/gemma-3-300m |
| Architecture | Gemma |
| Embedding Dimension | 768 |
| MRL Dimensions | 768, 512, 256, 128 |
| Max Sequence Length | 2048 tokens |
| Pooling | Mean pooling |
| Precision | bfloat16 |

## Evaluation Results

All scores are Recall@10.

| Model | Avg. | SciFact | TRECCOVID | NFCorpus | SCIDOCS | LitSearch | QASA |
| --- | --- | --- | --- | --- | --- | --- | --- |
| embeddinggemma-300m-academic | 0.3767 | 0.8863 | 0.0186 | 0.1735 | 0.1879 | 0.6250 | 0.3690 |
| embeddinggemma-300m | 0.3732 | 0.9159 | 0.0215 | 0.1954 | 0.2037 | 0.6099 | 0.2926 |

## Training Details

### Training Configuration

| Parameter | Value |
|-----------|-------|
| Learning Rate | 2e-5 |
| Batch Size | 8192 (effective) |
| Per-Device Batch Size | 32 |
| Warmup Steps | 100 |
| Weight Decay | 0.1 |
| Precision | bf16 |
| Max Length | 2048 |
| Loss Function | InfoNCE (contrastive; see the sketch below) |
| Temperature (τ) | 0.02 |
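
For reference, a minimal sketch of an InfoNCE-style contrastive loss with in-batch negatives and temperature τ = 0.02, assuming L2-normalized query and passage embeddings. This illustrates the objective in general form and is not the exact training code.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, passage_emb, temperature=0.02):
    """InfoNCE with in-batch negatives.

    query_emb, passage_emb: (batch, dim) L2-normalized tensors; the i-th
    passage is the positive for the i-th query, and all other passages in
    the batch serve as negatives.
    """
    # Temperature-scaled cosine similarities: (batch, batch)
    logits = query_emb @ passage_emb.T / temperature
    # Positives lie on the diagonal
    targets = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, targets)
```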

### Training Data

The model was trained on [LEAD (Liner Embedding Academic Dataset)](https://huggingface.co/datasets/LinerAI/LEAD), a combination of ~55,560 samples tailored for academic search:

- **MS MARCO** (49%): General passage retrieval dataset with hard negatives
- **Academic Search Dataset** (51%): Custom dataset built specifically for academic literature search, with two-stage hard negative mining

### Matryoshka Representation Learning (MRL)

This model supports [Matryoshka Representation Learning](https://arxiv.org/abs/2205.13147). You can truncate embeddings to smaller dimensions (512, 256, or 128) for faster computation and reduced storage.

```python
import torch.nn.functional as F

# `embeddings` is a (batch, 768) tensor produced as in the Usage section below.

# Full dimension (768)
full_embedding = embeddings[:, :768]

# MRL dimensions
embedding_512 = embeddings[:, :512]
embedding_256 = embeddings[:, :256]
embedding_128 = embeddings[:, :128]

# Truncated embeddings are no longer unit-norm; re-normalize if you score with
# dot products (cosine similarity is unaffected by the scaling), for example:
embedding_128 = F.normalize(embedding_128, p=2, dim=1)
```

## Usage

### Using Transformers

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_path = "LinerAI/embeddinggemma-300m-academic"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16)
model.eval()

# For queries
def encode_query(text):
    input_text = f"task: search result | query: {text}"
    inputs = tokenizer(input_text, return_tensors="pt", max_length=2048, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean pooling over all tokens (single, unpadded input; mask out padding when batching)
    embeddings = outputs.last_hidden_state.mean(dim=1)
    embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
    return embeddings

# For passages
def encode_passage(text):
    input_text = f"title: none | text: {text}"
    inputs = tokenizer(input_text, return_tensors="pt", max_length=2048, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean pooling over all tokens (single, unpadded input; mask out padding when batching)
    embeddings = outputs.last_hidden_state.mean(dim=1)
    embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
    return embeddings

# Example: academic search
query = "transformer models for protein structure prediction"
abstract = "We introduce AlphaFold, a deep learning system that predicts protein structures..."

query_emb = encode_query(query)
passage_emb = encode_passage(abstract)

similarity = torch.nn.functional.cosine_similarity(query_emb, passage_emb)
print(f"Similarity: {similarity.item():.4f}")
```
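
### Using Sentence Transformers

The `sentence-transformers` tag suggests the model may also be loadable with the Sentence Transformers library. The sketch below assumes this repository ships a Sentence Transformers configuration; if it does not, use the plain Transformers example above. The prompts are prepended manually, following the Input Format section below.

```python
from sentence_transformers import SentenceTransformer

# Assumes a Sentence Transformers configuration is available in the repository.
model = SentenceTransformer("LinerAI/embeddinggemma-300m-academic")

# Prepend the same prompts described under "Input Format" below.
query_emb = model.encode(["task: search result | query: transformer models for protein structure prediction"])
doc_emb = model.encode(["title: none | text: We introduce AlphaFold, a deep learning system that predicts protein structures..."])

# Similarity function from the model config (cosine by default)
print(model.similarity(query_emb, doc_emb))
```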

## Input Format

### Query Format
```
task: search result | query: {your_query_text}
```

### Passage Format
```
title: none | text: {your_passage_text}
```
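
As a small end-to-end illustration of these formats, the sketch below ranks a few placeholder abstracts against one query, reusing the `encode_query` / `encode_passage` helpers from the Usage section; the abstracts and scores are illustrative only.

```python
import torch

# Reuses encode_query / encode_passage from the "Using Transformers" example above.
query_emb = encode_query("contrastive learning for dense retrieval")

abstracts = [
    "We propose a dense retriever trained with in-batch negatives...",
    "This survey reviews recent advances in protein structure prediction...",
]
passage_embs = torch.cat([encode_passage(a) for a in abstracts], dim=0)

# Embeddings are L2-normalized, so the dot product equals cosine similarity
scores = (query_emb @ passage_embs.T).squeeze(0)
for rank, idx in enumerate(scores.argsort(descending=True).tolist()):
    print(f"{rank + 1}. score={scores[idx].item():.4f}  {abstracts[idx][:60]}")
```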

## Intended Use

- **Academic Paper Search**: Finding relevant research papers given a research query
- **Literature Review**: Discovering related work in academic literature
- **Scientific Document Retrieval**: Retrieving scientific documents, abstracts, and articles
- **Research Question Answering**: Finding papers that address specific research questions

## Limitations

- Maximum sequence length is 2048 tokens
- Best performance is achieved when using the specific input formats described above
- The model uses asymmetric encoding (different prompts for queries and passages)

## License

This model is released under the [Gemma license](https://ai.google.dev/gemma/terms). Please review Google's Gemma Terms of Use before using this model.