---
base_model: sentence-transformers/all-MiniLM-L6-v2
datasets:
- s8frbroy/talk2ref
language: en
library_name: transformers
license: cc-by-4.0
pipeline_tag: feature-extraction
tags:
- scientific-retrieval
- dense-passage-retrieval
- dual-encoder
- citation-prediction
- talk2ref
- SBERT
---
# 📚 Talk2Ref Cited Paper Encoder
This model encodes **scientific papers** (titles, abstracts, and publication years) into dense embeddings for **Reference Prediction from Talks (RPT)** within the [Talk2Ref](https://huggingface.co/datasets/s8frbroy/talk2ref) framework.
It serves as the **key-side encoder** in a **dual-encoder (DPR-style)** retrieval setup, paired with the [Talk2Ref Query Talk Encoder](https://huggingface.co/s8frbroy/talk2ref_query_talk_encoder).
---
## 🎯 Usage
Example with `transformers`:
```python
from transformers import AutoModel, AutoTokenizer
import torch

# Load tokenizer and model
model_name = "s8frbroy/talk2ref_ref_key_cited_paper_encoder"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Example input
title = "Attention Is All You Need"
year = 2017
abstract = "The Transformer model replaces recurrence with attention mechanisms for ..."

# Build input in Talk2Ref format
key_text = f"Title: {title}. Published in {year}. Abstract: {abstract}"

# Tokenize and compute token-level hidden states
inputs = tokenizer([key_text], padding=True, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state

# Mean-pool over valid tokens to obtain the paper embedding
mask = inputs["attention_mask"].unsqueeze(-1).float()
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

print(embedding.shape)  # (1, hidden_dim)
```
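To rank papers for a talk, pair this key-side encoder with the [Talk2Ref Query Talk Encoder](https://huggingface.co/s8frbroy/talk2ref_query_talk_encoder). The sketch below assumes both checkpoints load as plain `transformers` encoders and use the mean pooling shown above; the `encode` helper is illustrative, not part of either repository:

```python
from transformers import AutoModel, AutoTokenizer
import torch

def encode(model, tokenizer, texts):
    """Tokenize, run the encoder, and mean-pool over valid tokens."""
    inputs = tokenizer(texts, padding=True, truncation=True,
                       max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Key side: this model. Query side: the talk encoder.
key_name = "s8frbroy/talk2ref_ref_key_cited_paper_encoder"
query_name = "s8frbroy/talk2ref_query_talk_encoder"
key_tok, key_model = AutoTokenizer.from_pretrained(key_name), AutoModel.from_pretrained(key_name)
query_tok, query_model = AutoTokenizer.from_pretrained(query_name), AutoModel.from_pretrained(query_name)

papers = [
    "Title: Attention Is All You Need. Published in 2017. Abstract: ...",
    "Title: Dense Passage Retrieval for Open-Domain QA. Published in 2020. Abstract: ...",
]
talk_transcript = "Today I will talk about attention-based architectures ..."

paper_embeddings = encode(key_model, key_tok, papers)
talk_embedding = encode(query_model, query_tok, [talk_transcript])

# Dot-product similarity; highest score = most likely cited paper
scores = (talk_embedding @ paper_embeddings.T).squeeze(0)
print(scores.argsort(descending=True))  # paper indices, best first
```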
## 🧩 Model Overview
| Property | Description |
|-----------|-------------|
| **Architecture** | Sentence-BERT (all-MiniLM-L6-v2 backbone) |
| **Pooling** | Mean pooling |
| **Max sequence length** | 512 tokens |
| **Training data** | Talk2Ref dataset (≈ 43 k cited papers linked to 6 k talks) |
| **Objective** | Contrastive binary (DPR-style) loss |
| **Task** | Encode cited papers into a shared semantic space with talk transcripts |
---
## 🧠 Input Features
| Feature | Description |
|----------|-------------|
| **Title** | Title of the cited paper |
| **Abstract** | Abstract text content |
| **Year** | Publication year |
These inputs are short enough to fit within the model’s 512-token limit — no chunking required.
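As a minimal illustration (the helper name is hypothetical), the three features combine into the single key string used throughout this card:

```python
def build_key_text(title: str, year: int, abstract: str) -> str:
    """Combine title, year, and abstract into the Talk2Ref key format."""
    return f"Title: {title}. Published in {year}. Abstract: {abstract}"

print(build_key_text("Attention Is All You Need", 2017, "The Transformer model ..."))
```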
---
## 🧮 Training Setup
The cited-paper encoder was trained jointly with the query-talk encoder under a **dual-encoder contrastive framework** inspired by Dense Passage Retrieval (Karpukhin et al., 2020).
Each talk $T_i$ and paper $R_j$ is encoded into embeddings $f_T(T_i)$ and $f_R(R_j)$.
Their dot-product similarity $s_{ij} = f_T(T_i) \cdot f_R(R_j)$ is optimized using a sigmoid-based binary loss that supports multiple positives per query:
$$
L = - \sum_{i,j} \left[ y_{ij} \log \sigma(s_{ij}) + (1 - y_{ij}) \log\left(1 - \sigma(s_{ij})\right) \right]
$$
where $y_{ij} = 1$ if paper $R_j$ is cited by talk $T_i$.
Negatives are sampled in-batch from other talk–paper pairs.
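For illustration only, a minimal PyTorch sketch of this loss under those assumptions (the function and tensor names are hypothetical, not the actual training code):

```python
import torch
import torch.nn.functional as F

def contrastive_binary_loss(talk_emb, paper_emb, labels):
    """DPR-style sigmoid loss over all in-batch talk–paper pairs.

    talk_emb:  (B_t, d) embeddings from the query talk encoder
    paper_emb: (B_p, d) embeddings from this cited-paper encoder
    labels:    (B_t, B_p) binary matrix; labels[i, j] = 1 if paper j is
               cited by talk i (multiple positives per talk are allowed)
    """
    scores = talk_emb @ paper_emb.T  # pairwise similarities s_ij
    # reduction="sum" matches the summed form of the loss above
    return F.binary_cross_entropy_with_logits(scores, labels.float(), reduction="sum")
```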
Before contrastive training, a **domain adaptation stage** aligned each talk with the abstract of its own paper, adapting the encoders to scientific and spoken-language data.
---
## Citation
If you use this model, please cite the following paper:
```bibtex
@misc{broy2025talk2refdatasetreferenceprediction,
  title         = {Talk2Ref: A Dataset for Reference Prediction from Scientific Talks},
  author        = {Frederik Broy and Maike Züfle and Jan Niehues},
  year          = {2025},
  eprint        = {2510.24478},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2510.24478}
}
```