---
base_model: sentence-transformers/all-MiniLM-L6-v2
datasets:
- s8frbroy/talk2ref
language: en
library_name: transformers
license: cc-by-4.0
pipeline_tag: feature-extraction
tags:
- scientific-retrieval
- dense-passage-retrieval
- dual-encoder
- citation-prediction
- talk2ref
- SBERT
---

# 📚 Talk2Ref Cited Paper Encoder

This model encodes **scientific papers** (titles, abstracts, and publication years) into dense embeddings for **Reference Prediction from Talks (RPT)** within the [Talk2Ref](https://huggingface.co/datasets/s8frbroy/talk2ref) framework.
It serves as the **key-side encoder** in a **dual-encoder (DPR-style)** retrieval setup, paired with the [Talk2Ref Query Talk Encoder](https://huggingface.co/s8frbroy/talk2ref_query_talk_encoder).

---

## 🎯 Usage

Example with `transformers` (tokenization followed by mean pooling, matching the SBERT backbone):

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load tokenizer and model
model_name = "s8frbroy/talk2ref_ref_key_cited_paper_encoder"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Example input
title = "Attention Is All You Need"
year = 2017
abstract = "The Transformer model replaces recurrence with attention mechanisms for ..."

# Build input in Talk2Ref format
key_text = f"Title: {title}. Published in {year}. Abstract: {abstract}"

# Tokenize and encode
inputs = tokenizer([key_text], padding=True, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean pooling over non-padding tokens gives the paper embedding
mask = inputs["attention_mask"].unsqueeze(-1).float()
embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

print(embedding.shape)  # (1, hidden_dim)
```

## 🧩 Model Overview

| Property | Description |
|-----------|-------------|
| **Architecture** | Sentence-BERT (all-MiniLM-L6-v2 backbone) |
| **Pooling** | Mean pooling |
| **Max sequence length** | 512 tokens |
| **Training data** | Talk2Ref dataset (≈43k cited papers linked to ≈6k talks) |
| **Objective** | Contrastive binary (DPR-style) loss |
| **Task** | Encode cited papers into a shared semantic space with talk transcripts |

---

## 🧠 Input Features

| Feature | Description |
|----------|-------------|
| **Title** | Title of the cited paper |
| **Abstract** | Abstract text content |
| **Year** | Publication year |

These inputs are short enough to fit within the model's 512-token limit, so no chunking is required.

---

## 🧮 Training Setup

The cited-paper encoder was trained jointly with the query-talk encoder under a **dual-encoder contrastive framework** inspired by Dense Passage Retrieval (Karpukhin et al., 2020).
Each talk $T_i$ and paper $R_j$ is encoded into embeddings $f_T(T_i)$ and $f_R(R_j)$, and their dot-product similarity $s_{ij} = f_T(T_i) \cdot f_R(R_j)$ is optimized with a sigmoid-based binary loss that supports multiple positives per query:

$$
L = - \sum_{i,j} \left[ y_{ij} \log \sigma(s_{ij}) + (1 - y_{ij}) \log\left(1 - \sigma(s_{ij})\right) \right]
$$

where $y_{ij} = 1$ if talk $T_i$ cites paper $R_j$ and $0$ otherwise. Negatives are sampled in-batch from the other talk–paper pairs.
Before the contrastive training, a **domain adaptation stage** aligned each talk with the abstract of its own paper to adapt the encoders to scientific and spoken-language data.

---

## Citation

If you use this model or the Talk2Ref dataset, please cite the following paper:

```bibtex
@misc{broy2025talk2refdatasetreferenceprediction,
  title         = {Talk2Ref: A Dataset for Reference Prediction from Scientific Talks},
  author        = {Frederik Broy and Maike Züfle and Jan Niehues},
  year          = {2025},
  eprint        = {2510.24478},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2510.24478}
}
```
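
---

## 🧪 Training Objective in Code (Sketch)

To make the objective in the Training Setup section concrete, the snippet below sketches how dot-product scores between talk and paper embeddings can be turned into the sigmoid-based binary loss with in-batch negatives. This is a minimal illustration, not the released training code: the function name and the tensors `talk_emb`, `paper_emb`, and `labels` are hypothetical placeholders, and only the scoring and loss computation mirror the description above.

```python
import torch
import torch.nn.functional as F

def dual_encoder_binary_loss(talk_emb: torch.Tensor,
                             paper_emb: torch.Tensor,
                             labels: torch.Tensor) -> torch.Tensor:
    """Sigmoid-based binary loss over all in-batch talk–paper pairs (sketch).

    talk_emb:  (num_talks, dim)   embeddings f_T(T_i)
    paper_emb: (num_papers, dim)  embeddings f_R(R_j)
    labels:    (num_talks, num_papers), 1.0 where talk i cites paper j, else 0.0
    """
    # Dot-product similarity s_ij = f_T(T_i) · f_R(R_j)
    scores = talk_emb @ paper_emb.T
    # Binary cross-entropy on every pair: cited papers are positives,
    # all other in-batch papers act as negatives; multiple positives per talk are allowed.
    return F.binary_cross_entropy_with_logits(scores, labels.float(), reduction="sum")

# Toy usage with random 384-dimensional embeddings (the MiniLM hidden size)
talk_emb = torch.randn(4, 384)
paper_emb = torch.randn(8, 384)
labels = torch.zeros(4, 8)
labels[0, 0] = labels[0, 3] = 1.0  # talk 0 cites papers 0 and 3
labels[1, 1] = 1.0
print(dual_encoder_binary_loss(talk_emb, paper_emb, labels).item())
```

With `reduction="sum"` this matches the loss $L$ written above; in practice a mean reduction or other normalization may be used.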