---
license: gemma
language:
- en
- ko
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- embedding
- gemma
- text-embedding
- retrieval
- matryoshka
- academic-search
- scientific-search
library_name: transformers
base_model: google/gemma-3-300m
datasets:
- ms_marco
---

# LinerAI/embeddinggemma-300m-academic for Academic Search

This is a fine-tuned version of [google/gemma-3-300m](https://huggingface.co/google/gemma-3-300m) optimized for **academic and scientific literature search**. The model was trained with contrastive learning and hard negative mining on data curated specifically for academic search scenarios.

## Highlights

- **Optimized for Academic Search**: Fine-tuned on datasets specifically designed for academic literature retrieval
- **Hard Negative Mining**: Trained with carefully mined hard negatives to improve discrimination between similar academic papers
- **Matryoshka Representation Learning (MRL)**: Supports flexible embedding dimensions (768, 512, 256, 128) for efficiency
- **Lightweight**: Based on Gemma-3 300M, offering a good balance between performance and computational efficiency

## Model Description

| Attribute | Value |
|-----------|-------|
| Base Model | google/gemma-3-300m |
| Architecture | Gemma |
| Embedding Dimension | 768 |
| MRL Dimensions | 768, 512, 256, 128 |
| Max Sequence Length | 2048 |
| Pooling | Mean pooling |
| Precision | bfloat16 |

## Evaluation Results

| **Model** | Avg. | SciFact: Recall@10 | TRECCOVID: Recall@10 | NFCorpus: Recall@10 | SCIDOCS: Recall@10 | LitSearch: Recall@10 | QASA: Recall@10 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| embeddinggemma-300m-academic | 0.3767 | 0.8863 | 0.0186 | 0.1735 | 0.1879 | 0.6250 | 0.3690 |
| embeddinggemma-300m | 0.3732 | 0.9159 | 0.0215 | 0.1954 | 0.2037 | 0.6099 | 0.2926 |

## Training Details

### Training Configuration

| Parameter | Value |
|-----------|-------|
| Learning Rate | 2e-5 |
| Batch Size | 8192 (effective) |
| Per-Device Batch Size | 32 |
| Warmup Steps | 100 |
| Weight Decay | 0.1 |
| Precision | bf16 |
| Max Length | 2048 |
| Loss Function | InfoNCE (Contrastive) |
| Temperature (τ) | 0.02 |

### Training Data

The model was trained on [LEAD (Liner Embedding Academic Dataset)](https://huggingface.co/datasets/LinerAI/LEAD), a combination of ~55,560 samples tailored for academic search:

- **MS MARCO** (49%): General passage retrieval dataset with hard negatives
- **Academic Search Dataset** (51%): Custom dataset built specifically for academic literature search, with two-stage hard negative mining

### Matryoshka Representation Learning (MRL)

This model supports [Matryoshka Representation Learning](https://arxiv.org/abs/2205.13147). You can truncate embeddings to smaller dimensions (512, 256, 128) for faster computation and reduced storage.
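Note that truncated embeddings are no longer unit-length, so they should be re-normalized before cosine similarity is computed. A minimal sketch with PyTorch (random vectors stand in for real model outputs here):

```python
import torch
import torch.nn.functional as F

# Stand-in for model output: a batch of 4 full-dimension (768-d) embeddings,
# L2-normalized as in the usage example below.
embeddings = F.normalize(torch.randn(4, 768), p=2, dim=1)

# Truncate to a smaller MRL dimension, then re-normalize: the sliced
# vectors have norm < 1, so cosine similarity needs unit vectors again.
embedding_256 = F.normalize(embeddings[:, :256], p=2, dim=1)

# With unit vectors, cosine similarity reduces to a dot product.
sim = embedding_256[0] @ embedding_256[1]
print(f"256-d similarity: {sim.item():.4f}")
```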
```python
# Full dimension (768)
full_embedding = embeddings[:, :768]

# MRL dimensions (re-normalize truncated embeddings before computing cosine similarity)
embedding_512 = embeddings[:, :512]
embedding_256 = embeddings[:, :256]
embedding_128 = embeddings[:, :128]
```

## Usage

### Using Transformers

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_path = "LinerAI/embeddinggemma-300m-academic"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16)
model.eval()


# For queries
def encode_query(text):
    input_text = f"task: search result | query: {text}"
    inputs = tokenizer(input_text, return_tensors="pt", max_length=2048, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    embeddings = outputs.last_hidden_state.mean(dim=1)  # Mean pooling
    embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
    return embeddings


# For passages
def encode_passage(text):
    input_text = f"title: none | text: {text}"
    inputs = tokenizer(input_text, return_tensors="pt", max_length=2048, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    embeddings = outputs.last_hidden_state.mean(dim=1)  # Mean pooling
    embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
    return embeddings


# Example: Academic search
query = "transformer models for protein structure prediction"
abstract = "We introduce AlphaFold, a deep learning system that predicts protein structures..."

query_emb = encode_query(query)
passage_emb = encode_passage(abstract)

similarity = torch.nn.functional.cosine_similarity(query_emb, passage_emb)
print(f"Similarity: {similarity.item():.4f}")
```

## Input Format

### Query Format

```
task: search result | query: {your_query_text}
```

### Passage Format

```
title: none | text: {your_passage_text}
```

## Intended Use

- **Academic Paper Search**: Finding relevant research papers given a research query
- **Literature Review**: Discovering related work in academic literature
- **Scientific Document Retrieval**: Retrieving scientific documents, abstracts, and articles
- **Research Question Answering**: Finding papers that address specific research questions

## Limitations

- Maximum sequence length is 2048 tokens
- Best performance is achieved with the specific input formats described above
- The model uses asymmetric encoding (different prompts for queries and passages)

## License

This model is released under the [Gemma license](https://ai.google.dev/gemma/terms). Please review Google's usage license before using this model.
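A note on batching: the usage snippets above pool over every token position, which is exact when encoding a single input, but a padded batch should exclude padding tokens from the mean via the attention mask. A minimal mask-aware pooling sketch on synthetic tensors (no model download required; the shapes are illustrative only):

```python
import torch
import torch.nn.functional as F

# Synthetic stand-ins: batch of 2 sequences, max length 4, hidden size 8.
# Sequence 0 uses all 4 tokens; sequence 1 has 2 real tokens + 2 pads.
last_hidden_state = torch.randn(2, 4, 8)
attention_mask = torch.tensor([[1, 1, 1, 1],
                               [1, 1, 0, 0]])

# Zero out padded positions, then divide by the real token count per sequence.
mask = attention_mask.unsqueeze(-1).float()        # (2, 4, 1)
summed = (last_hidden_state * mask).sum(dim=1)     # (2, 8)
counts = mask.sum(dim=1).clamp(min=1)              # (2, 1)
pooled = F.normalize(summed / counts, p=2, dim=1)  # unit-length embeddings

print(pooled.shape)  # torch.Size([2, 8])
```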