Generated from Research Library: https://github.com/peytontolbert/Research_Library

1M Paper Embedding Model

1M is a LoRA adapter for allenai/scibert_scivocab_uncased trained for scientific paper retrieval inside the Research Library project. It embeds paper queries and metadata cards so a user can search, rank, and navigate papers by title, abstract, category, and author metadata.

This repository contains the PEFT adapter only. Load it on top of allenai/scibert_scivocab_uncased.

Intended Use

  • Encode paper search queries for retrieval.
  • Encode paper metadata records for nearest-neighbor search.
  • Rank candidate papers in a research-library interface.

This is not a generative model and should not be used to synthesize paper text.

Training Data

The adapter was trained on PeytonT/1m_papers_text, a 1M-paper full-text and metadata dataset. Training used the metadata fields available in the local Research Library pipeline:

  • title
  • abstract
  • categories
  • authors

Training covered one full epoch over the 1M-paper corpus using positive and negative contrastive pairs built from these fields.
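
Before embedding, a metadata record has to be flattened into a single text card. The exact card template is not documented in this card; the helper below is a hypothetical sketch that follows the "Field: value" layout visible in the Usage example further down:

def format_card(record):
    # Hypothetical helper: flatten the four training metadata fields
    # into one retrieval card, skipping fields that are missing.
    parts = [
        ("Title", record.get("title")),
        ("Abstract", record.get("abstract")),
        ("Categories", record.get("categories")),
        ("Authors", record.get("authors")),
    ]
    return "\n".join(f"{name}: {value}" for name, value in parts if value)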

Training Procedure

  • Base model: allenai/scibert_scivocab_uncased
  • Adapter: LoRA
  • Task type: FEATURE_EXTRACTION
  • LoRA rank: 8
  • LoRA alpha: 32
  • LoRA dropout: 0.05
  • Target modules: query, value
  • Objective: contrastive metadata retrieval
  • Batch size: 512
  • Max source tokens: 256
  • Precision: bf16
  • Optimizer: AdamW
  • Learning rate: 1e-4
  • Warmup steps: 1000
  • Training steps: 3907
  • Epochs: 1.0
  • Final train loss: 0.0250
  • Hardware: single H100-class GPU
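
The hyperparameters above correspond to a PEFT LoRA configuration roughly like the following. This is a reconstruction from the listed values, not the original training script:

from peft import LoraConfig, TaskType

# Reconstructed from the hyperparameters listed above; an illustrative
# sketch, not the actual training configuration file.
lora_config = LoraConfig(
    task_type=TaskType.FEATURE_EXTRACTION,
    r=8,                                # LoRA rank
    lora_alpha=32,                      # LoRA alpha
    lora_dropout=0.05,                  # LoRA dropout
    target_modules=["query", "value"],  # attention projections to adapt
)

The exact contrastive objective is not spelled out in this card. A common in-batch formulation, shown here purely as an assumption, treats the metadata card at the same batch index as the positive and all other cards in the batch as negatives:

import torch
import torch.nn.functional as F

def contrastive_loss(query_emb, card_emb, temperature=0.05):
    # Hypothetical in-batch InfoNCE loss; the temperature value and the
    # exact pairing scheme are assumptions, not documented settings.
    logits = (query_emb @ card_emb.T) / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)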

Usage

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
from peft import PeftModel

repo_id = "PeytonT/1m-paper-embedding-model"
base_id = "allenai/scibert_scivocab_uncased"

# The adapter repo ships its own tokenizer; the encoder weights come
# from the SciBERT base model, with the LoRA adapter applied on top.
tokenizer = AutoTokenizer.from_pretrained(repo_id)
base = AutoModel.from_pretrained(base_id)
model = PeftModel.from_pretrained(base, repo_id)
model.eval()

def embed(texts):
    # Tokenize with the same 256-token limit used during training.
    batch = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=256,
        return_tensors="pt",
    )
    with torch.no_grad():
        outputs = model(**batch)
        # Mean-pool token embeddings, masking out padding positions.
        mask = batch["attention_mask"].unsqueeze(-1)
        pooled = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp_min(1)
        # L2-normalize so that dot products are cosine similarities.
        return F.normalize(pooled, dim=1)

query = embed(["retrieval augmented generation for scientific literature"])
docs = embed([
    "Title: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks\nCategories: cs.CL",
    "Title: Quantum error correction with superconducting qubits\nCategories: quant-ph",
])

# Score each candidate card against the query.
scores = query @ docs.T
print(scores)
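
Because embed L2-normalizes its outputs, the scores above are cosine similarities in [-1, 1]. Ranking a candidate set then reduces to a top-k over the score row; the snippet below is an illustrative extension of the example above:

# Rank the candidate cards for the first (and only) query;
# k=2 matches the toy example and is purely illustrative.
top = torch.topk(scores[0], k=2)
for score, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{score:.3f}  candidate {idx}")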

Limitations

  • The adapter is optimized for metadata retrieval, not full-text semantic chunk retrieval.
  • It is an adapter only: it must be loaded on top of allenai/scibert_scivocab_uncased via PEFT and is not a standalone checkpoint.
  • Training was contrastive and library-oriented; external benchmarks have not yet been run.
  • Metadata quality, missing abstracts, and noisy category labels can affect retrieval quality.

Project Context

This model is part of the Research Library system for exploring repositories and scientific papers through search, metadata views, paper graphs, and 3D universe visualizations.

Framework Versions

  • PEFT 0.19.1