---
language: en
license: cc-by-4.0
tags:
  - scientific-retrieval
  - dense-passage-retrieval
  - dual-encoder
  - talk2ref
  - speech-to-text
  - sentence-embedding
  - SBERT
library_name: transformers
pipeline_tag: feature-extraction
base_model: sentence-transformers/all-MiniLM-L6-v2
datasets:
  - fbroy/talk2ref
---

# 🗣️ Talk2Ref Query Talk Encoder

This model encodes scientific talks (transcripts, titles, and years) into dense vector representations, designed for Reference Prediction from Talks (RPT) — the task of retrieving relevant cited papers for a given talk.
It was trained as part of the Talk2Ref dataset project.

The model forms the query-side encoder in a dual-encoder (DPR-style) setup, paired with the Talk2Ref Cited Paper Encoder.


## 🧩 Model Overview

| Property | Description |
|---|---|
| Architecture | Sentence-BERT (all-MiniLM-L6-v2 backbone) |
| Pooling | Weighted mean aggregation over transcript chunks |
| Max tokens per chunk | 512 |
| Trained on | Talk2Ref dataset (transcripts of 6,279 scientific talks) |
| Objective | Contrastive learning (DPR-style) using binary similarity loss |
| Task | Encode scientific talks into a shared semantic space with their cited papers |
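Long transcripts are split into chunks of at most 512 tokens, encoded separately, and combined by weighted mean aggregation. The exact weighting scheme is not specified here; the sketch below assumes one plausible choice, weighting each chunk embedding by its token count:

```python
import torch

def aggregate_chunks(chunk_embeddings: torch.Tensor, chunk_lengths: torch.Tensor) -> torch.Tensor:
    """Combine per-chunk embeddings into one talk embedding.

    chunk_embeddings: (num_chunks, dim) one embedding per transcript chunk
    chunk_lengths:    (num_chunks,)    token count per chunk (assumed weighting)
    """
    weights = chunk_lengths.float() / chunk_lengths.float().sum()
    return (chunk_embeddings * weights.unsqueeze(-1)).sum(dim=0)

# Example: three chunks of one long transcript (toy 2-d embeddings)
embs = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
lengths = torch.tensor([512, 512, 256])
talk_vec = aggregate_chunks(embs, lengths)  # single (dim,) vector for the talk
```

Length-proportional weighting is only one option; uniform averaging over chunks would drop the `weights` computation entirely.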

## 🎯 Usage

Example with transformers:

```python
from transformers import AutoModel, AutoTokenizer
import torch

model = AutoModel.from_pretrained("s8frbroy/talk2ref_query_talk_encoder")
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

text = "In this talk, we present a new transformer model for scientific retrieval..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)

with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state  # (batch, seq_len, dim)

# Attention-mask-aware mean pooling so padding tokens do not skew the average
# (substitute the custom weighted chunk pooling for long transcripts)
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
```
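For retrieval, a talk embedding is scored against embeddings produced by the paired cited-paper encoder. The paper-side repo id is not shown here, so the sketch below uses random placeholder vectors purely to illustrate the ranking step (cosine similarity; a dot product would also fit a DPR-style setup):

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings standing in for real encoder outputs;
# all-MiniLM-L6-v2-based encoders produce 384-dimensional vectors.
torch.manual_seed(0)
talk_emb = torch.randn(1, 384)        # one encoded talk
paper_embs = torch.randn(1000, 384)   # one row per candidate cited paper

# Score every candidate paper, then keep the top-k
scores = F.cosine_similarity(talk_emb, paper_embs, dim=-1)  # (1000,)
topk = torch.topk(scores, k=5)
print(topk.indices.tolist())  # indices of the 5 highest-scoring papers
```

In practice the candidate matrix would be precomputed once over the paper corpus and searched with an ANN index for large collections.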