PapersRAG-1.5B 🧪
A retrieval-augmented generation system for querying recent scientific literature — continuously updated.
PapersRAG-1.5B helps researchers explore and answer questions across a growing corpus of recent NLP papers from arXiv. It pairs a lightweight language model with a curated knowledge base of paper abstracts and a retrieval pipeline that prioritizes faithful, citation-backed answers over hallucination.
The knowledge base is automatically refreshed every day with the latest cs.CL papers. It expands on its own, with no manual upkeep required.
Model description
- Type: Retrieval-augmented generation (RAG)
- Base language model: Qwen 2.5 1.5B — small, fast, coherent when grounded with good context
- Knowledge base: A continuously growing collection of abstracts from the most recent cs.CL papers on arXiv, updated daily via an automated pipeline
- Retrieval pipeline: Dense embeddings for initial candidate retrieval, cross-encoder for re-ranking — only the most relevant chunks reach the language model
- Answer style: Every answer cites the paper title it draws from. If no relevant paper is found, the model says so instead of fabricating one
Intended use
PapersRAG is a research assistant. It helps scientists and students locate information within indexed NLP papers, ask comparative questions like "What are the latest trends in retrieval-augmented generation?", and surface specific details about a paper's methodology or findings.
It is not a general-purpose chatbot. It does not have access to full paper text. It only knows what has been explicitly indexed. It will tell you when it doesn't know something.
How it works
- Indexing — Paper abstracts are split into overlapping chunks, embedded with a dense bi-encoder, and stored in a FAISS index
- Retrieval — The bi-encoder fetches a pool of candidate chunks for any given question
- Re-ranking — A cross-encoder scores each candidate; only chunks above a confidence threshold are kept
- Generation — Retained chunks are passed as context to the 1.5B model, which generates a cited answer
- Safety — If nothing clears the confidence threshold, the model refuses to answer rather than hallucinate
No relevant chunk, no answer. That's the rule.
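The retrieve, re-rank, and refuse steps above can be sketched in miniature. The embedding and scoring functions below are toy stand-ins for illustration only, not the repository's actual bi-encoder or cross-encoder:

```python
import re
import numpy as np

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def embed(texts):
    # Toy stand-in for the dense bi-encoder: hash words into a small vector.
    vecs = np.zeros((len(texts), 32))
    for i, text in enumerate(texts):
        for word in tokenize(text):
            vecs[i, hash(word) % 32] += 1.0
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.maximum(norms, 1e-9)

def cross_score(question, chunk):
    # Toy stand-in for the cross-encoder: question-word overlap with the chunk.
    q, c = set(tokenize(question)), set(tokenize(chunk))
    return len(q & c) / max(len(q), 1)

def answer(question, chunks, top_k=3, threshold=0.5):
    # 1. Retrieval: cosine similarity between the question and every chunk.
    q_vec = embed([question])[0]
    sims = embed(chunks) @ q_vec
    candidates = np.argsort(sims)[::-1][:top_k]
    # 2. Re-ranking: keep only candidates that clear the confidence threshold.
    kept = [chunks[i] for i in candidates
            if cross_score(question, chunks[i]) >= threshold]
    # 3. Safety: refuse rather than hallucinate when nothing qualifies.
    if not kept:
        return "No relevant paper found in the knowledge base."
    return "Context for generation: " + " | ".join(kept)

chunks = [
    "Retrieval-augmented generation grounds language models in retrieved documents.",
    "This paper studies image segmentation with diffusion models.",
]
print(answer("What is retrieval-augmented generation?", chunks))
```

In the real pipeline, the retained chunks would be formatted into a prompt for the 1.5B model together with the source paper titles.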
Automated daily updates
Every day, the update pipeline:
- Downloads the existing index and chunk store from this repository
- Scrapes the 100 most recent cs.CL papers from arXiv
- Chunks, embeds, and appends the new papers to the existing knowledge base
- Rebuilds the FAISS index and uploads everything back
The knowledge base grows by roughly 100 papers per day, automatically.
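A minimal sketch of the chunk-and-append step. The chunk size, overlap, and embedder here are assumptions for illustration, and a plain numpy matrix stands in for the actual FAISS index:

```python
import numpy as np

def chunk_abstract(text, size=40, overlap=10):
    # Split an abstract into overlapping word windows (sizes are illustrative).
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def toy_embed(chunks):
    # Toy embedder: one row per chunk; the real pipeline uses a bi-encoder.
    return np.array([[len(c.split())] for c in chunks], dtype=float)

def update_knowledge_base(store, new_abstracts, embed):
    # Chunk and embed the new papers, then append to the existing store.
    new_chunks = [c for a in new_abstracts for c in chunk_abstract(a)]
    new_vecs = embed(new_chunks)
    store["chunks"].extend(new_chunks)
    # "Rebuilding" the index here is just re-stacking the embedding matrix;
    # the real pipeline rebuilds a FAISS index instead.
    store["index"] = (np.vstack([store["index"], new_vecs])
                      if store["index"].size else new_vecs)
    return store

store = {"chunks": [], "index": np.empty((0, 1))}
store = update_knowledge_base(store, ["word " * 60], toy_embed)
print(len(store["chunks"]), store["index"].shape)
```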
Quick start
```python
import sys
from huggingface_hub import snapshot_download

# Download the model, index, and pipeline code from the Hub
model_dir = snapshot_download("metaresearch/PapersRAG-1.5B")
sys.path.insert(0, model_dir)  # pipeline.py ships inside the snapshot

from pipeline import PapersRAG

rag = PapersRAG(model_dir)
print(rag.ask("What are the latest approaches to retrieval-augmented generation?"))
```
Requires transformers, sentence-transformers, and faiss (the faiss-cpu package on PyPI). Everything else is handled by pipeline.py.
Model composition
| Component | Description |
|---|---|
| Language Model | Qwen 2.5 1.5B (float16) |
| Bi-encoder | Dense embedding model for initial retrieval |
| Cross-encoder | Re-ranking model that scores chunks for relevance |
| Vector Index | FAISS index of embedded paper chunks |
| Knowledge Chunks | Processed snippets from indexed arXiv abstracts |
| Pipeline | pipeline.py — one class, handles loading, retrieval, and generation |
Exact model names for the bi-encoder and cross-encoder are in the repository's configuration files.
Limitations
Knowledge base scope. Only cs.CL papers from arXiv. Papers from other fields are not included unless manually added.
Abstracts only. Full paper text is not indexed. Deep methodological comparisons may be incomplete.
Small language model. 1.5B parameters is lightweight. The retrieval pipeline handles factual accuracy well, but nuanced multi-paper synthesis has limits.
English only.
License
Apache-2.0.
PapersRAG is part of the Meta Research initiative — building open tools that accelerate scientific discovery.