---
base_model: sentence-transformers/all-MiniLM-L6-v2
datasets:
- fbroy/talk2ref
language: en
library_name: transformers
license: cc-by-4.0
pipeline_tag: feature-extraction
tags:
- scientific-retrieval
- dense-passage-retrieval
- dual-encoder
- talk2ref
- speech-to-text
- sentence-embedding
- SBERT
---
# 🗣️ Talk2Ref Query Talk Encoder
This model encodes **scientific talks** (transcripts, titles, and years) into dense vector representations, designed for **Reference Prediction from Talks (RPT)**, the task of retrieving relevant cited papers for a given talk.
It was trained as part of the [Talk2Ref dataset](https://huggingface.co/datasets/s8frbroy/talk2ref) project.
The model forms the **query-side encoder** in a **dual-encoder (DPR-style)** setup, paired with the [Talk2Ref Cited Paper Encoder](https://huggingface.co/s8frbroy/talk2ref_ref_key_cited_paper_encoder).
---
## 🎯 Usage
Example with `transformers`, computing a mean-pooled sentence embedding (the pooling this model uses):
```python
from transformers import AutoModel, AutoTokenizer
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("s8frbroy/talk2ref_query_talk_encoder")
model = AutoModel.from_pretrained("s8frbroy/talk2ref_query_talk_encoder")
model.eval()

def embed(texts, model, tokenizer):
    # Mean pooling over token embeddings, masked by the attention mask
    batch = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        token_emb = model(**batch).last_hidden_state        # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()    # (B, T, 1)
    return (token_emb * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

# Example input: a talk is represented by its title, year, and transcript
title = "Attention Is All You Need"
year = 2017
query_text = (
    f"The following presentation is about the paper of the title: '{title}'. "
    f"Published in {year}. In this talk, we introduce the Transformer "
    "architecture and discuss its impact on sequence modeling."
)

embedding = embed([query_text], model, tokenizer)
print(embedding.shape)  # (1, 384)
```
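For retrieval, talk embeddings are compared against embeddings produced by the paired cited-paper encoder. A minimal sketch, reusing the `embed` helper above and assuming the paper encoder shares the same tokenizer setup and mean pooling; the candidate papers here are purely illustrative:
```python
from transformers import AutoModel, AutoTokenizer
import torch.nn.functional as F

# Paired key-side encoder of the dual-encoder setup
paper_tokenizer = AutoTokenizer.from_pretrained("s8frbroy/talk2ref_ref_key_cited_paper_encoder")
paper_model = AutoModel.from_pretrained("s8frbroy/talk2ref_ref_key_cited_paper_encoder")

# Illustrative candidates; in practice these come from your paper corpus
papers = [
    "Attention Is All You Need. We propose the Transformer, based solely on attention mechanisms.",
    "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.",
]

query_emb = embed([query_text], model, tokenizer)
paper_emb = embed(papers, paper_model, paper_tokenizer)

# Rank candidate papers by cosine similarity to the talk embedding
scores = F.cosine_similarity(query_emb, paper_emb)  # (num_papers,)
for i in scores.argsort(descending=True).tolist():
    print(f"{scores[i].item():.3f}  {papers[i][:60]}")
```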
---
## 🧩 Model Overview
| Property | Description |
|-----------|-------------|
| **Architecture** | Sentence-BERT (all-MiniLM-L6-v2 backbone) |
| **Pooling** | Mean pooling |
| **Max sequence length** | 512 tokens |
| **Training data** | Talk2Ref dataset (≈ 43 k cited papers linked to 6 k talks) |
| **Objective** | Contrastive binary (DPR-style) loss |
| **Task** | Encode talks into a shared semantic space with cited papers |
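
For context, here is a minimal sketch of a DPR-style in-batch contrastive objective, assuming the common setup in which the other talk/paper pairs in a batch serve as negatives; the exact loss used for training may differ:
```python
import torch
import torch.nn.functional as F

def dpr_in_batch_loss(talk_emb, paper_emb):
    # talk_emb, paper_emb: (B, H); row i of paper_emb is the positive for row i of talk_emb
    scores = talk_emb @ paper_emb.T                              # (B, B) similarity matrix
    labels = torch.arange(scores.size(0), device=scores.device)  # positives lie on the diagonal
    return F.cross_entropy(scores, labels)
```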
---
## Citation
If you use this model or the Talk2Ref dataset, please cite the following paper:
```bibtex
@misc{broy2025talk2refdatasetreferenceprediction,
  title         = {Talk2Ref: A Dataset for Reference Prediction from Scientific Talks},
  author        = {Frederik Broy and Maike Züfle and Jan Niehues},
  year          = {2025},
  eprint        = {2510.24478},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2510.24478}
}
```