Dataset Information Extraction from Scientific Text

A fine-tuned flan-t5-base model that extracts dataset identifiers and repository references from snippets of scientific publications.

What it does: Finds dataset IDs (e.g. GEO accessions such as GSE123456, DOIs) and repository names (e.g. GEO, Zenodo) in text snippets from papers.

Trained on: vida-nyu/pmc-articles-dataset-mentions-snippets - PMC article snippets extracted using Data Gatherer

Quick Start

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned checkpoint and its tokenizer from the Hugging Face Hub
model = AutoModelForSeq2SeqLM.from_pretrained("vida-nyu/flan-t5-base-dataref-info-extract")
tokenizer = AutoTokenizer.from_pretrained("vida-nyu/flan-t5-base-dataref-info-extract")

# The input must start with the task prefix used during training
text = "Extract dataset information: The data are in GEO under GSE123456."
# Inputs longer than 512 tokens are truncated
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
# Beam search with 4 beams, generating at most 256 tokens
outputs = model.generate(**inputs, max_length=256, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Training Details

  • Base: flan-t5-base (250M params)
  • Data: PMC snippets with and without dataset mentions, using an 80/20 train/test split
  • Config: 5 epochs, lr=3e-4, effective batch size 16
  • Input: Max 512 tokens with prefix "Extract dataset information: "
  • Output: Structured dataset info or null for no-dataset cases
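The input and output conventions listed above can be wrapped in two small helpers. This is a minimal sketch, not part of the released code: the names build_prompt and parse_output are hypothetical, and it assumes the no-dataset case decodes to the literal text "null" (an empty string is treated the same way). The structured output is returned as a raw string because its exact format is not documented here.

```python
from typing import Optional

# Task prefix stated in the training details
PREFIX = "Extract dataset information: "


def build_prompt(snippet: str) -> str:
    """Collapse whitespace and prepend the task prefix (without doubling it)."""
    cleaned = " ".join(snippet.split())
    if cleaned.startswith(PREFIX.strip()):
        return cleaned
    return PREFIX + cleaned


def parse_output(decoded: str) -> Optional[str]:
    """Return None for the no-dataset case, else the raw structured string.

    Assumption: the no-dataset case decodes to the literal text "null".
    """
    text = decoded.strip()
    if text == "" or text.lower() == "null":
        return None
    return text
```

With these helpers, build_prompt(snippet) would be passed to the tokenizer in place of a hand-written string, and parse_output would be applied to the decoded generation.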

Limitations

  • Input is limited to 512 tokens (longer passages are truncated)
  • Trained on biomedical literature (may underperform on other domains)
  • English only
  • May miss novel/obscure repositories
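One possible way to work around the 512-token limit is to split a long passage into overlapping windows and run the model on each window separately. The sketch below is an illustration under stated assumptions, not the authors' method: chunk_words is a hypothetical helper, and it counts whitespace-separated words as a rough proxy for tokens. The default of 300 words is a guess at a safe budget; token counts differ from word counts, so window sizes should be checked against the model's tokenizer.

```python
from typing import List


def chunk_words(text: str, max_words: int = 300, overlap: int = 50) -> List[str]:
    """Split text into overlapping windows of at most max_words words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    if not words:
        return []
    step = max_words - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window reaches the end of the text
        if start + max_words >= len(words):
            break
    return chunks
```

Each chunk still needs the "Extract dataset information: " prefix before it is passed to the model. The overlap reduces the chance that a dataset mention is cut in half at a window boundary, at the cost of possible duplicate extractions that would need to be merged afterwards.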

Citation

@software{flan_t5_dataset_extraction,
  title={Flan-T5 for Dataset Information Extraction},
  author={VIDA Lab, NYU},
  year={2025},
  url={https://huggingface.co/vida-nyu/flan-t5-base-dataref-info-extract}
}

Contact: GitHub Issues | VIDA Lab, NYU
