Dataset Information Extraction from Scientific Text
A flan-t5-base model fine-tuned to extract dataset identifiers and repository references from snippets of scientific publications.
What it does: Finds dataset IDs (GSE123456, DOIs, etc.) and repository names (GEO, Zenodo, etc.) in text snippets from papers.
Trained on: vida-nyu/pmc-articles-dataset-mentions-snippets - PMC article snippets extracted using Data Gatherer
Quick Start
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model_id = "vida-nyu/flan-t5-base-dataref-info-extract"
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The input must start with the task prefix used during training
text = "Extract dataset information: The data are in GEO under GSE123456."
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
# Decode with beam search and print the extracted dataset info
outputs = model.generate(**inputs, max_length=256, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Training Details
- Base: flan-t5-base (250M params)
- Data: PMC snippets with and without dataset mentions, split 80/20 into train and test sets
- Config: 5 epochs, learning rate 3e-4, effective batch size 16
- Input: Max 512 tokens with prefix "Extract dataset information: "
- Output: Structured dataset info or null for no-dataset cases
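The input and output conventions above can be wrapped in two small helpers. This is a minimal sketch, not part of the released model: `build_prompt` and `parse_output` are hypothetical names, the whitespace normalization is an added convenience, and the assumption that the no-dataset case decodes to the literal string "null" should be checked against real model output.

```python
from typing import Optional

# Task prefix stated in the training details
PREFIX = "Extract dataset information: "


def build_prompt(snippet: str) -> str:
    """Prepend the training prefix to a snippet (whitespace is collapsed first)."""
    cleaned = " ".join(snippet.split())
    return PREFIX + cleaned


def parse_output(decoded: str) -> Optional[str]:
    """Return None for the no-dataset case, otherwise the raw decoded string.

    Assumption: the no-dataset case decodes to the literal text "null".
    The structured format itself is left unparsed here.
    """
    text = decoded.strip()
    if not text or text.lower() == "null":
        return None
    return text
```

The string returned by `build_prompt` can be passed to the tokenizer exactly as in the Quick Start, and `parse_output` applied to the decoded generation.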
Limitations
- Input is limited to 512 tokens (longer passages are truncated)
- Trained on biomedical literature (may underperform on other domains)
- English only
- May miss novel/obscure repositories
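One possible workaround for the 512-token limit is to split a long passage into overlapping token windows and run the model on each window separately. The sketch below is an assumption-laden illustration, not the authors' method: `split_into_windows` and `chunk_text` are hypothetical helpers, the overlap of 32 tokens is an arbitrary choice, and decoding a window back to text can change its token count slightly when it is re-tokenized.

```python
from typing import List, Sequence

PREFIX = "Extract dataset information: "


def split_into_windows(tokens: Sequence, window: int, overlap: int) -> List[list]:
    """Split a token sequence into windows of at most `window` items.

    Consecutive windows share `overlap` items so that a mention cut at one
    window boundary is more likely to appear whole in the next window.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(list(tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # the last window already reaches the end
    return windows


def chunk_text(text: str, tokenizer, max_tokens: int = 512, overlap: int = 32) -> List[str]:
    """Build prefixed prompts that should each fit the model's input limit.

    `tokenizer` is expected to be the Hugging Face tokenizer from the Quick Start.
    """
    prefix_len = len(tokenizer(PREFIX, add_special_tokens=False)["input_ids"])
    budget = max_tokens - prefix_len - 1  # reserve one slot for the end-of-sequence token
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    return [
        PREFIX + tokenizer.decode(w, skip_special_tokens=True)
        for w in split_into_windows(ids, budget, overlap)
    ]
```

Each prompt returned by `chunk_text` can then go through the same `tokenizer(...)` and `model.generate(...)` calls as in the Quick Start. Overlapping windows may report the same dataset more than once, so the per-window results would need deduplication.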
Citation
@software{flan_t5_dataset_extraction,
title={Flan-T5 for Dataset Information Extraction},
author={VIDA Lab, NYU},
year={2025},
url={https://huggingface.co/vida-nyu/flan-t5-base-dataref-info-extract}
}
Contact: GitHub Issues | VIDA Lab, NYU
Model tree for vida-nyu/flan-t5-base-dataref-info-extract
Base model
google/flan-t5-base