Generated from Research Library: https://github.com/peytontolbert/Research_Library

1M Paper Embedding Model

1M is a LoRA adapter for allenai/scibert_scivocab_uncased trained for scientific paper retrieval inside the Research Library project. It embeds paper queries and metadata cards so a user can search, rank, and navigate papers by title, abstract, category, and author metadata.

This repository contains the PEFT adapter only. Load it on top of allenai/scibert_scivocab_uncased.

Intended Use

  • Encode paper search queries for retrieval.
  • Encode paper metadata records for nearest-neighbor search.
  • Rank candidate papers in a research-library interface.

This is not a generative model and should not be used to synthesize paper text.

Training Data

The adapter was trained on PeytonT/1m_papers_text, a 1M-paper full-text and metadata dataset. Training used the metadata fields available in the local Research Library pipeline:

  • title
  • abstract
  • categories
  • authors

Training covered one full epoch over the 1M-paper corpus using positive and negative contrastive pairs built from these fields.
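
Before embedding, a metadata record has to be flattened into a single text card. The exact card template is not documented in this card; the helper below is a hypothetical sketch that follows the "Field: value" layout visible in the Usage example further down:

def format_card(record):
    # Hypothetical helper: flatten the four training metadata fields
    # into one retrieval card, skipping fields that are missing.
    parts = [
        ("Title", record.get("title")),
        ("Abstract", record.get("abstract")),
        ("Categories", record.get("categories")),
        ("Authors", record.get("authors")),
    ]
    return "\n".join(f"{name}: {value}" for name, value in parts if value)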

Training Procedure

  • Base model: allenai/scibert_scivocab_uncased
  • Adapter: LoRA
  • Task type: FEATURE_EXTRACTION
  • LoRA rank: 8
  • LoRA alpha: 32
  • LoRA dropout: 0.05
  • Target modules: query, value
  • Objective: contrastive metadata retrieval
  • Batch size: 512
  • Max source tokens: 256
  • Precision: bf16
  • Optimizer: AdamW
  • Learning rate: 1e-4
  • Warmup steps: 1000
  • Training steps: 3907
  • Epochs: 1.0
  • Final train loss: 0.0250
  • Hardware: single H100-class GPU
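
The hyperparameters above correspond to a PEFT LoRA configuration roughly like the following. This is a reconstruction from the listed values, not the original training script:

from peft import LoraConfig, TaskType

# Reconstructed from the hyperparameters listed above; an illustrative
# sketch, not the actual training configuration file.
lora_config = LoraConfig(
    task_type=TaskType.FEATURE_EXTRACTION,
    r=8,                                # LoRA rank
    lora_alpha=32,                      # LoRA alpha
    lora_dropout=0.05,                  # LoRA dropout
    target_modules=["query", "value"],  # attention projections to adapt
)

The exact contrastive objective is not spelled out in this card. A common in-batch formulation, shown here purely as an assumption, treats the metadata card at the same batch index as the positive and all other cards in the batch as negatives:

import torch
import torch.nn.functional as F

def contrastive_loss(query_emb, card_emb, temperature=0.05):
    # Hypothetical in-batch InfoNCE loss; the temperature value and the
    # exact pairing scheme are assumptions, not documented settings.
    logits = (query_emb @ card_emb.T) / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)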

Usage

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
from peft import PeftModel

repo_id = "PeytonT/1m-paper-embedding-model"
base_id = "allenai/scibert_scivocab_uncased"

# The adapter repo ships its own tokenizer; the encoder weights come
# from the SciBERT base model, with the LoRA adapter applied on top.
tokenizer = AutoTokenizer.from_pretrained(repo_id)
base = AutoModel.from_pretrained(base_id)
model = PeftModel.from_pretrained(base, repo_id)
model.eval()

def embed(texts):
    # Tokenize with the same 256-token limit used during training.
    batch = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=256,
        return_tensors="pt",
    )
    with torch.no_grad():
        outputs = model(**batch)
        # Mean-pool token embeddings, masking out padding positions.
        mask = batch["attention_mask"].unsqueeze(-1)
        pooled = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp_min(1)
        # L2-normalize so that dot products are cosine similarities.
        return F.normalize(pooled, dim=1)

query = embed(["retrieval augmented generation for scientific literature"])
docs = embed([
    "Title: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks\nCategories: cs.CL",
    "Title: Quantum error correction with superconducting qubits\nCategories: quant-ph",
])

# Score each candidate card against the query.
scores = query @ docs.T
print(scores)
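
Because embed L2-normalizes its outputs, the scores above are cosine similarities in [-1, 1]. Ranking a candidate set then reduces to a top-k over the score row; the snippet below is an illustrative extension of the example above:

# Rank the candidate cards for the first (and only) query;
# k=2 matches the toy example and is purely illustrative.
top = torch.topk(scores[0], k=2)
for score, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{score:.3f}  candidate {idx}")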

Limitations

  • The adapter is optimized for metadata retrieval, not full-text semantic chunk retrieval.
  • It is an adapter only: it must be loaded on top of allenai/scibert_scivocab_uncased via PEFT and is not a standalone checkpoint.
  • Training was contrastive and library-oriented; external benchmarks have not yet been run.
  • Metadata quality, missing abstracts, and noisy category labels can affect retrieval quality.

Project Context

This model is part of the Research Library system for exploring repositories and scientific papers through search, metadata views, paper graphs, and 3D universe visualizations.

Framework Versions

  • PEFT 0.19.1