---
base_model: sentence-transformers/all-MiniLM-L6-v2
datasets:
- s8frbroy/talk2ref
language: en
library_name: transformers
license: cc-by-4.0
pipeline_tag: feature-extraction
tags:
- scientific-retrieval
- dense-passage-retrieval
- dual-encoder
- citation-prediction
- talk2ref
- SBERT
---
# 📚 Talk2Ref Cited Paper Encoder
This model encodes **scientific papers** (titles, abstracts, and publication years) into dense embeddings for **Reference Prediction from Talks (RPT)** within the [Talk2Ref](https://huggingface.co/datasets/s8frbroy/talk2ref) framework.
It serves as the **key-side encoder** in a **dual-encoder (DPR-style)** retrieval setup, paired with the [Talk2Ref Query Talk Encoder](https://huggingface.co/s8frbroy/talk2ref_query_talk_encoder).
---
## 🎯 Usage
Example with `transformers`:
```python
from transformers import AutoModel, AutoTokenizer
import torch

# Load tokenizer and model
model_name = "s8frbroy/talk2ref_ref_key_cited_paper_encoder"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Example input
title = "Attention Is All You Need"
year = 2017
abstract = "The Transformer model replaces recurrence with attention mechanisms for ..."

# Build input in Talk2Ref format
key_text = f"Title: {title}. Published in {year}. Abstract: {abstract}"

# Tokenize and compute token-level hidden states
inputs = tokenizer([key_text], padding=True, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state

# Mean-pool over valid tokens to obtain the paper embedding
mask = inputs["attention_mask"].unsqueeze(-1).float()
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

print(embedding.shape)  # (1, hidden_dim)
```
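To rank papers for a talk, pair this key-side encoder with the [Talk2Ref Query Talk Encoder](https://huggingface.co/s8frbroy/talk2ref_query_talk_encoder). The sketch below assumes both checkpoints load as plain `transformers` encoders and use the mean pooling shown above; the `encode` helper is illustrative, not part of either repository:

```python
from transformers import AutoModel, AutoTokenizer
import torch

def encode(model, tokenizer, texts):
    """Tokenize, run the encoder, and mean-pool over valid tokens."""
    inputs = tokenizer(texts, padding=True, truncation=True,
                       max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Key side: this model. Query side: the talk encoder.
key_name = "s8frbroy/talk2ref_ref_key_cited_paper_encoder"
query_name = "s8frbroy/talk2ref_query_talk_encoder"
key_tok, key_model = AutoTokenizer.from_pretrained(key_name), AutoModel.from_pretrained(key_name)
query_tok, query_model = AutoTokenizer.from_pretrained(query_name), AutoModel.from_pretrained(query_name)

papers = [
    "Title: Attention Is All You Need. Published in 2017. Abstract: ...",
    "Title: Dense Passage Retrieval for Open-Domain QA. Published in 2020. Abstract: ...",
]
talk_transcript = "Today I will talk about attention-based architectures ..."

paper_embeddings = encode(key_model, key_tok, papers)
talk_embedding = encode(query_model, query_tok, [talk_transcript])

# Dot-product similarity; highest score = most likely cited paper
scores = (talk_embedding @ paper_embeddings.T).squeeze(0)
print(scores.argsort(descending=True))  # paper indices, best first
```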
## 🧩 Model Overview
| Property | Description |
|-----------|-------------|
| **Architecture** | Sentence-BERT (all-MiniLM-L6-v2 backbone) |
| **Pooling** | Mean pooling |
| **Max sequence length** | 512 tokens |
| **Training data** | Talk2Ref dataset (≈ 43 k cited papers linked to 6 k talks) |
| **Objective** | Contrastive binary (DPR-style) loss |
| **Task** | Encode cited papers into a shared semantic space with talk transcripts |
---
## 🧠 Input Features
| Feature | Description |
|----------|-------------|
| **Title** | Title of the cited paper |
| **Abstract** | Abstract text content |
| **Year** | Publication year |
These inputs are short enough to fit within the model’s 512-token limit — no chunking required.
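As a minimal illustration (the helper name is hypothetical), the three features combine into the single key string used throughout this card:

```python
def build_key_text(title: str, year: int, abstract: str) -> str:
    """Combine title, year, and abstract into the Talk2Ref key format."""
    return f"Title: {title}. Published in {year}. Abstract: {abstract}"

print(build_key_text("Attention Is All You Need", 2017, "The Transformer model ..."))
```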
---
## 🧮 Training Setup
The cited-paper encoder was trained jointly with the query-talk encoder under a **dual-encoder contrastive framework** inspired by Dense Passage Retrieval (Karpukhin et al., 2020).
Each talk $T_i$ and paper $R_j$ is encoded into embeddings $f_T(T_i)$ and $f_R(R_j)$.
Their dot-product similarity $s_{ij} = f_T(T_i) \cdot f_R(R_j)$ is optimized using a sigmoid-based binary loss that supports multiple positives per query:
$$
L = - \sum_{i,j} \left[ y_{ij} \log \sigma(s_{ij}) + (1 - y_{ij}) \log\left(1 - \sigma(s_{ij})\right) \right]
$$
where $y_{ij} = 1$ if paper $R_j$ is cited by talk $T_i$.
Negatives are sampled in-batch from other talk–paper pairs.
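For illustration only, a minimal PyTorch sketch of this loss under those assumptions (the function and tensor names are hypothetical, not the actual training code):

```python
import torch
import torch.nn.functional as F

def contrastive_binary_loss(talk_emb, paper_emb, labels):
    """DPR-style sigmoid loss over all in-batch talk–paper pairs.

    talk_emb:  (B_t, d) embeddings from the query talk encoder
    paper_emb: (B_p, d) embeddings from this cited-paper encoder
    labels:    (B_t, B_p) binary matrix; labels[i, j] = 1 if paper j is
               cited by talk i (multiple positives per talk are allowed)
    """
    scores = talk_emb @ paper_emb.T  # pairwise similarities s_ij
    # reduction="sum" matches the summed form of the loss above
    return F.binary_cross_entropy_with_logits(scores, labels.float(), reduction="sum")
```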
Before contrastive training, a **domain adaptation stage** aligned each talk with the abstract of its own paper, adapting the encoders to scientific and spoken-language data.
---
## Citation
If you use this model, please cite the following paper:
```bibtex
@misc{broy2025talk2refdatasetreferenceprediction,
  title         = {Talk2Ref: A Dataset for Reference Prediction from Scientific Talks},
  author        = {Frederik Broy and Maike Züfle and Jan Niehues},
  year          = {2025},
  eprint        = {2510.24478},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2510.24478}
}
```