---
base_model: sentence-transformers/all-MiniLM-L6-v2
datasets:
- s8frbroy/talk2ref
language: en
library_name: transformers
license: cc-by-4.0
pipeline_tag: feature-extraction
tags:
- scientific-retrieval
- dense-passage-retrieval
- dual-encoder
- talk2ref
- speech-to-text
- sentence-embedding
- SBERT
---

# 🗣️ Talk2Ref Query Talk Encoder

This model encodes **scientific talks** (transcripts, titles, and years) into dense vector representations, designed for **Reference Prediction from Talks (RPT)** — the task of retrieving relevant cited papers for a given talk.
It was trained as part of the [Talk2Ref dataset](https://huggingface.co/datasets/s8frbroy/talk2ref) project.

The model forms the **query-side encoder** in a **dual-encoder (DPR-style)** setup, paired with the [Talk2Ref Cited Paper Encoder](https://huggingface.co/s8frbroy/talk2ref_ref_key_cited_paper_encoder).

---

## 🎯 Usage

Example with `transformers`. `AutoModel` returns raw token embeddings, so mean pooling (the model's pooling strategy) is applied manually:

```python
from transformers import AutoModel, AutoTokenizer
import torch

# Load tokenizer and model
model_id = "s8frbroy/talk2ref_query_talk_encoder"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

# Example input: the query combines the talk's title, year, and transcript
title = "Attention Is All You Need"
year = 2017
query_text = (
    f"The following presentation is about the paper of the title: '{title}'. "
    f"Published in {year}. "
    "In this talk, we introduce the Transformer architecture and discuss its impact on sequence modeling."
)

# Tokenize and run the encoder
inputs = tokenizer([query_text], padding=True, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state

# Mean pooling over non-padding tokens
mask = inputs["attention_mask"].unsqueeze(-1).float()
embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

print(embedding.shape)  # torch.Size([1, 384])
```
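
For retrieval, the talk embedding is compared against paper embeddings from the paired document-side encoder. Below is a minimal sketch continuing from the snippet above; it assumes the cited-paper encoder is used the same way (tokenize, encode, mean-pool) and that candidate papers are represented as plain text such as title plus abstract (the exact paper input format is an assumption here, not taken from this card):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

def encode(texts, tokenizer, model):
    # Same tokenize + mean-pool recipe as the usage example above
    inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        token_embeddings = model(**inputs).last_hidden_state
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    return (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

# Paired document-side encoder
paper_id = "s8frbroy/talk2ref_ref_key_cited_paper_encoder"
paper_tokenizer = AutoTokenizer.from_pretrained(paper_id)
paper_model = AutoModel.from_pretrained(paper_id).eval()

# Hypothetical candidate papers (title + abstract snippet)
papers = [
    "Attention Is All You Need. We propose the Transformer, a model based solely on attention mechanisms.",
    "BLEU: a Method for Automatic Evaluation of Machine Translation.",
]
paper_embeddings = encode(papers, paper_tokenizer, paper_model)
talk_embedding = encode([query_text], tokenizer, model)

# Rank candidates by cosine similarity to the talk
scores = F.cosine_similarity(talk_embedding, paper_embeddings)
print(scores.argsort(descending=True))  # candidate indices, most relevant first
```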

---

## 🧩 Model Overview

| Property | Description |
|----------|-------------|
| **Architecture** | Sentence-BERT (all-MiniLM-L6-v2 backbone) |
| **Pooling** | Mean pooling |
| **Max sequence length** | 512 tokens |
| **Training data** | Talk2Ref dataset (≈ 43k cited papers linked to 6k talks) |
| **Objective** | Contrastive binary (DPR-style) loss |
| **Task** | Encode talks into a shared semantic space with cited papers |
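
As a rough illustration of the DPR-style objective, a common formulation scores each talk against its cited paper and uses the other papers in the batch as negatives. The card's "contrastive binary" wording may refer to a pairwise variant instead, so treat this as a schematic sketch rather than the exact training loss:

```python
import torch
import torch.nn.functional as F

def dpr_loss(talk_emb: torch.Tensor, paper_emb: torch.Tensor) -> torch.Tensor:
    """In-batch-negatives contrastive loss: talk_emb[i] should match paper_emb[i]."""
    scores = talk_emb @ paper_emb.T          # (batch, batch) similarity matrix
    targets = torch.arange(scores.size(0))   # positives lie on the diagonal
    return F.cross_entropy(scores, targets)

# Toy check with random 384-dim embeddings
talks, papers = torch.randn(4, 384), torch.randn(4, 384)
print(dpr_loss(talks, papers))
```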

---

## Citation

If you use this model or the Talk2Ref dataset, please cite the following paper:

```bibtex
@misc{broy2025talk2refdatasetreferenceprediction,
  title         = {Talk2Ref: A Dataset for Reference Prediction from Scientific Talks},
  author        = {Frederik Broy and Maike Züfle and Jan Niehues},
  year          = {2025},
  eprint        = {2510.24478},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2510.24478}
}
```