---
base_model: sentence-transformers/all-MiniLM-L6-v2
datasets:
- fbroy/talk2ref
language: en
library_name: transformers
license: cc-by-4.0
pipeline_tag: feature-extraction
tags:
- scientific-retrieval
- dense-passage-retrieval
- dual-encoder
- talk2ref
- speech-to-text
- sentence-embedding
- SBERT
---
# 🗣️ Talk2Ref Query Talk Encoder
This model encodes **scientific talks** (transcripts, titles, and years) into dense vector representations, designed for **Reference Prediction from Talks (RPT)**, the task of retrieving relevant cited papers for a given talk.
It was trained as part of the [Talk2Ref dataset](https://huggingface.co/datasets/s8frbroy/talk2ref) project.
The model forms the **query-side encoder** in a **dual-encoder (DPR-style)** setup, paired with the [Talk2Ref Cited Paper Encoder](https://huggingface.co/s8frbroy/talk2ref_ref_key_cited_paper_encoder).
---
## 🎯 Usage
Example with `transformers`, computing a mean-pooled sentence embedding (the pooling this model uses):
```python
from transformers import AutoModel, AutoTokenizer
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("s8frbroy/talk2ref_query_talk_encoder")
model = AutoModel.from_pretrained("s8frbroy/talk2ref_query_talk_encoder")
model.eval()

def embed(texts, model, tokenizer):
    # Mean pooling over token embeddings, masked by the attention mask
    batch = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        token_emb = model(**batch).last_hidden_state        # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()    # (B, T, 1)
    return (token_emb * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

# Example input: a talk is represented by its title, year, and transcript
title = "Attention Is All You Need"
year = 2017
query_text = (
    f"The following presentation is about the paper of the title: '{title}'. "
    f"Published in {year}. In this talk, we introduce the Transformer "
    "architecture and discuss its impact on sequence modeling."
)

embedding = embed([query_text], model, tokenizer)
print(embedding.shape)  # (1, 384)
```
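For retrieval, talk embeddings are compared against embeddings produced by the paired cited-paper encoder. A minimal sketch, reusing the `embed` helper above and assuming the paper encoder shares the same tokenizer setup and mean pooling; the candidate papers here are purely illustrative:
```python
from transformers import AutoModel, AutoTokenizer
import torch.nn.functional as F

# Paired key-side encoder of the dual-encoder setup
paper_tokenizer = AutoTokenizer.from_pretrained("s8frbroy/talk2ref_ref_key_cited_paper_encoder")
paper_model = AutoModel.from_pretrained("s8frbroy/talk2ref_ref_key_cited_paper_encoder")

# Illustrative candidates; in practice these come from your paper corpus
papers = [
    "Attention Is All You Need. We propose the Transformer, based solely on attention mechanisms.",
    "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.",
]

query_emb = embed([query_text], model, tokenizer)
paper_emb = embed(papers, paper_model, paper_tokenizer)

# Rank candidate papers by cosine similarity to the talk embedding
scores = F.cosine_similarity(query_emb, paper_emb)  # (num_papers,)
for i in scores.argsort(descending=True).tolist():
    print(f"{scores[i].item():.3f}  {papers[i][:60]}")
```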
---
## 🧩 Model Overview
| Property | Description |
|-----------|-------------|
| **Architecture** | Sentence-BERT (all-MiniLM-L6-v2 backbone) |
| **Pooling** | Mean pooling |
| **Max sequence length** | 512 tokens |
| **Training data** | Talk2Ref dataset (≈ 43 k cited papers linked to 6 k talks) |
| **Objective** | Contrastive binary (DPR-style) loss |
| **Task** | Encode talks into a shared semantic space with cited papers |
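
For context, here is a minimal sketch of a DPR-style in-batch contrastive objective, assuming the common setup in which the other talk/paper pairs in a batch serve as negatives; the exact loss used for training may differ:
```python
import torch
import torch.nn.functional as F

def dpr_in_batch_loss(talk_emb, paper_emb):
    # talk_emb, paper_emb: (B, H); row i of paper_emb is the positive for row i of talk_emb
    scores = talk_emb @ paper_emb.T                              # (B, B) similarity matrix
    labels = torch.arange(scores.size(0), device=scores.device)  # positives lie on the diagonal
    return F.cross_entropy(scores, labels)
```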
---
## Citation
If you use this model or the Talk2Ref dataset, please cite the following paper:
```bibtex
@misc{broy2025talk2refdatasetreferenceprediction,
  title         = {Talk2Ref: A Dataset for Reference Prediction from Scientific Talks},
  author        = {Frederik Broy and Maike Züfle and Jan Niehues},
  year          = {2025},
  eprint        = {2510.24478},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2510.24478}
}
```