---
language: en
license: cc-by-4.0
tags:
- scientific-retrieval
- dense-passage-retrieval
- dual-encoder
- talk2ref
- speech-to-text
- sentence-embedding
- SBERT
library_name: transformers
pipeline_tag: feature-extraction
base_model: sentence-transformers/all-MiniLM-L6-v2
datasets:
- fbroy/talk2ref
---
# 🗣️ Talk2Ref Query Talk Encoder
This model encodes scientific talks (transcripts, titles, and years) into dense vector representations. It is designed for Reference Prediction from Talks (RPT), the task of retrieving the relevant cited papers for a given talk.
It was trained as part of the Talk2Ref dataset project.
The model forms the query-side encoder in a dual-encoder (DPR-style) setup, paired with the Talk2Ref Cited Paper Encoder.
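At retrieval time, the talk embedding from this encoder is compared against paper embeddings from the cited paper encoder, and candidates are ranked by similarity. Below is a minimal sketch of that ranking step, assuming cosine similarity as the scoring function (the exact score used in training is not spelled out here) and embeddings already computed from both encoders:

```python
import torch
import torch.nn.functional as F

def rank_papers(talk_emb: torch.Tensor, paper_embs: torch.Tensor) -> torch.Tensor:
    """Rank candidate papers for one talk, most similar first.

    talk_emb:   (dim,) vector from this query talk encoder.
    paper_embs: (num_papers, dim) vectors from the cited paper encoder.
    """
    scores = F.cosine_similarity(talk_emb.unsqueeze(0), paper_embs, dim=-1)
    return torch.argsort(scores, descending=True)
```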
## 🧩 Model Overview
| Property | Description |
|---|---|
| Architecture | Sentence-BERT (all-MiniLM-L6-v2 backbone) |
| Pooling | Weighted mean aggregation over transcript chunks |
| Max tokens per chunk | 512 |
| Trained on | Talk2Ref dataset — transcripts of 6,279 scientific talks |
| Objective | Contrastive learning (DPR-style) using binary similarity loss |
| Task | Encode scientific talks into a shared semantic space with their cited papers |
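Since full transcripts usually exceed the 512-token chunk limit, encoding a talk means splitting the transcript into chunks, embedding each chunk, and aggregating. Here is a sketch of that pipeline under two assumptions: mean pooling within each chunk, and chunk weights proportional to chunk length (the card states weighted mean aggregation but not the exact weighting):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("s8frbroy/talk2ref_query_talk_encoder")

def encode_talk(transcript: str, chunk_tokens: int = 512) -> torch.Tensor:
    # Split the transcript into chunks, leaving room for special tokens.
    ids = tokenizer(transcript, add_special_tokens=False)["input_ids"]
    step = chunk_tokens - 2
    chunks = [tokenizer.decode(ids[i:i + step]) for i in range(0, len(ids), step)]
    batch = tokenizer(chunks, return_tensors="pt", padding=True,
                      truncation=True, max_length=chunk_tokens)
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (num_chunks, seq_len, dim)
    # Mean-pool each chunk over its non-padding tokens.
    mask = batch["attention_mask"].unsqueeze(-1).float()
    chunk_embs = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    # Weighted mean over chunks; weights assumed proportional to chunk length.
    weights = mask.sum(dim=(1, 2))
    return (chunk_embs * weights.unsqueeze(-1)).sum(dim=0) / weights.sum()
```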
## 🎯 Usage
Example with `transformers`:

```python
from transformers import AutoTokenizer, AutoModel
import torch

model = AutoModel.from_pretrained("s8frbroy/talk2ref_query_talk_encoder")
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

text = "In this talk, we present a new transformer model for scientific retrieval..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state  # (batch, seq_len, dim)

# Mean-pool over non-padding tokens (or substitute the chunk-weighted pooling above)
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
```
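Note that a single forward pass truncates input at 512 tokens, so a full talk transcript will be cut off; for complete talks, apply the chunk-and-aggregate scheme sketched in the Model Overview section above.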