Dataset Information Extraction from Scientific Text

A fine-tuned flan-t5-base model that extracts dataset identifiers and repository references from snippets of scientific publications.

What it does: Finds dataset IDs (GSE123456, DOIs, etc.) and repository names (GEO, Zenodo, etc.) in text snippets from papers.

Trained on: vida-nyu/pmc-articles-dataset-mentions-snippets - PMC article snippets extracted using Data Gatherer

Quick Start

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned checkpoint and its tokenizer from the Hugging Face Hub
model_id = "vida-nyu/flan-t5-base-dataref-info-extract"
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The input must start with the task prefix used during fine-tuning
text = "Extract dataset information: The data are in GEO under GSE123456."
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)

# Beam search with 4 beams, output capped at 256 tokens
outputs = model.generate(**inputs, max_length=256, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Training Details

Base: flan-t5-base (250M params)
Data: PMC snippets with and without dataset mentions, split 80/20 into train and test sets
Config: 5 epochs, lr=3e-4, batch=16 (effective)
Input: Max 512 tokens with prefix "Extract dataset information: "
Output: Structured dataset info or null for no-dataset cases
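
The input prefix and the null convention above can be wrapped in two small helpers. This is a minimal sketch, not the authors' code: `build_prompt` and `parse_output` are hypothetical names, and it assumes the no-dataset case decodes to the literal string "null". The layout of the structured output is not documented here, so the sketch returns it unparsed.

```python
from typing import Optional

# Prefix stated in the training details above
PREFIX = "Extract dataset information: "
# Assumption: no-dataset snippets decode to this literal string
NULL_MARKER = "null"


def build_prompt(snippet: str) -> str:
    """Prepend the task prefix unless the snippet already carries it."""
    snippet = snippet.strip()
    if snippet.startswith(PREFIX.strip()):
        return snippet
    return PREFIX + snippet


def parse_output(decoded: str) -> Optional[str]:
    """Return None for the no-dataset case, else the raw output string."""
    cleaned = decoded.strip()
    if not cleaned or cleaned.lower() == NULL_MARKER:
        return None
    return cleaned
```

In use, `build_prompt(snippet)` would be passed to the tokenizer and `parse_output(...)` applied to the decoded generation, so downstream code can test for `None` instead of comparing strings.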

Limitations

Max 512 tokens input (truncates long passages)
Trained on biomedical literature (may underperform on other domains)
English only
May miss novel/obscure repositories
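
One possible workaround for the 512-token limit is to split a long passage into overlapping windows and run the model on each window separately. The sketch below is an assumption-laden illustration, not part of the released model: `chunk_token_ids` is a hypothetical helper, the window size and overlap are arbitrary defaults, and the model was not necessarily trained on windowed input, so results may differ from single-snippet use.

```python
from typing import List


def chunk_token_ids(token_ids: List[int], max_len: int = 512,
                    stride: int = 64) -> List[List[int]]:
    """Split token ids into windows of at most max_len tokens.

    Consecutive windows share `stride` tokens so that an identifier
    falling on a window boundary appears whole in at least one window.
    """
    if max_len <= 0 or not 0 <= stride < max_len:
        raise ValueError("need max_len > 0 and 0 <= stride < max_len")
    if len(token_ids) <= max_len:
        return [list(token_ids)]
    step = max_len - stride
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(list(token_ids[start:start + max_len]))
        if start + max_len >= len(token_ids):
            break  # the last window already reaches the end
    return chunks
```

Each window would then be decoded back to text, given the "Extract dataset information: " prefix, and passed through `generate`. Because the prefix and the end-of-sequence token also consume input tokens, `max_len` should be set somewhat below 512 in practice.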

Citation

@software{flan_t5_dataset_extraction,
  title={Flan-T5 for Dataset Information Extraction},
  author={VIDA Lab, NYU},
  year={2025},
  url={https://huggingface.co/vida-nyu/flan-t5-base-dataref-info-extract}
}

Contact: GitHub Issues | VIDA Lab, NYU
