Claim Information Extractor for Brain Tumor Research

Model: Flan-T5-base (fine-tuned for structured extraction)

Task: Extract structured claim information from research paper sentences.

Overview

This model extracts structured information from claim sentences in brain tumor research papers. It identifies key fields: model, task, dataset, metric, value, comparison, and domain.

Performance

  • Validation Overall Exact Match: 0.4649
  • Test Overall Exact Match: 0.4538
  • Validation Macro Presence F1: 0.4944
  • Test Macro Presence F1: 0.5408
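
This card does not define Macro Presence F1. One plausible reading, shown here only as an assumption, is that each field is scored on whether the model correctly predicts that a value is present (non-null) or absent (null), and the per-field F1 scores are then averaged. The function names and toy data below are hypothetical.

```python
from typing import Dict, List, Optional, Sequence

Claim = Dict[str, Optional[str]]


def presence_f1(preds: List[Claim], golds: List[Claim], field: str) -> float:
    """F1 for detecting that `field` is non-null (assumed definition)."""
    tp = fp = fn = 0
    for pred, gold in zip(preds, golds):
        pred_present = pred.get(field) is not None
        gold_present = gold.get(field) is not None
        tp += int(pred_present and gold_present)
        fp += int(pred_present and not gold_present)
        fn += int(gold_present and not pred_present)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_presence_f1(preds: List[Claim], golds: List[Claim],
                      fields: Sequence[str]) -> float:
    """Unweighted mean of the per-field presence F1 scores."""
    return sum(presence_f1(preds, golds, f) for f in fields) / len(fields)


# Toy data, not taken from the real evaluation set
golds = [{"model": "U-Net", "metric": "Dice"}, {"model": None, "metric": "accuracy"}]
preds = [{"model": "U-Net", "metric": None}, {"model": None, "metric": "accuracy"}]
score = macro_presence_f1(preds, golds, fields=("model", "metric"))
```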

Field-level Performance

Field        Exact Match   Non-Null Accuracy   Null Accuracy
model        0.8072        0.1441              0.9974
task         0.6526        0.1707              0.9898
dataset      0.8173        0.0217              0.9975
metric       0.6807        0.2935              0.9428
value        0.8594        0.3333              0.9851
comparison   0.8273        0.1961              0.9899
domain       0.6205        0.3245              0.9571

Dataset

  • Training: ~3,200 claim sentences with ground-truth annotations
  • Validation: ~400 claim sentences
  • Test: ~400 claim sentences
  • Total: 3,997 claim sentences from 1,496 research papers

Slot Fields

The model extracts the following structured fields:

  • model: Name of the machine learning or deep learning model
  • task: Type of task (e.g., classification, segmentation)
  • dataset: Dataset or benchmark used
  • metric: Evaluation metric (e.g., accuracy, Dice score)
  • value: Numeric value or performance result
  • comparison: Comparative statements (e.g., "outperforms ResNet")
  • domain: Application domain (e.g., medical imaging)
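
As a rough sketch of how the seven slots could be represented in downstream code, the snippet below defines a typed dictionary and a helper that fills missing keys with None. The names `Claim`, `FIELDS`, and `to_claim` are hypothetical, and typing every slot as an optional string is an assumption based on the Quick Start example.

```python
from typing import Optional, TypedDict

FIELDS = ("model", "task", "dataset", "metric", "value", "comparison", "domain")


class Claim(TypedDict):
    model: Optional[str]
    task: Optional[str]
    dataset: Optional[str]
    metric: Optional[str]
    value: Optional[str]
    comparison: Optional[str]
    domain: Optional[str]


def to_claim(raw: dict) -> Claim:
    """Keep only the seven known keys; absent keys become None."""
    return Claim(**{key: raw.get(key) for key in FIELDS})


example = to_claim({"model": "Dilated SE-DenseNet", "dataset": "BraTS",
                    "metric": "Dice", "value": "0.95"})
print(example["task"])  # None
```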

Quick Start

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import json

model_name = "nawazishpatana/claim-extractor-brain-tumor"  # Example
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Example input
prompt = '''Extract structured claim information as JSON with keys model, task, dataset, metric, value, comparison, domain. Use null for missing values.
Title: Dilated SE-DenseNet for Brain Tumor Segmentation
Year: 2024
Claim Sentence: Our model achieved 95% Dice score on BraTS dataset.'''

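# Truncate to the 256-token input length used during training
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=256)
outputs = model.generate(**inputs, max_new_tokens=128)
prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(prediction)
# Illustrative output (actual text may differ):
# {"model": "Dilated SE-DenseNet", "dataset": "BraTS", "metric": "Dice", "value": "0.95", ...}

# Parse the prediction; fall back to None if the model emits invalid JSON
try:
    claim = json.loads(prediction)
except json.JSONDecodeError:
    claim = None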

Training Details

  • Base Model: google/flan-t5-base
  • Optimizer: AdamW
  • Learning Rate: 0.0001
  • Batch Size: 16
  • Epochs: 10 (with early stopping)
  • Max Input Length: 256
  • Max Output Length: 128
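
To give a sense of scale, the hyperparameters above can be collected in a small config object and used to estimate the number of optimizer steps. This is only a sketch: `TrainConfig` and `steps_per_epoch` are hypothetical names, the training-set size is the approximate figure from the Dataset section, and the estimate assumes a single device with no gradient accumulation.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    base_model: str = "google/flan-t5-base"
    optimizer: str = "AdamW"
    learning_rate: float = 1e-4
    batch_size: int = 16
    max_epochs: int = 10          # upper bound; early stopping may end sooner
    max_input_length: int = 256
    max_output_length: int = 128


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Number of batches needed to see every example once."""
    return math.ceil(num_examples / batch_size)


cfg = TrainConfig()
n_train = 3200  # approximate training-set size (assumption: ~3,200)
per_epoch = steps_per_epoch(n_train, cfg.batch_size)
max_steps = per_epoch * cfg.max_epochs
print(per_epoch, max_steps)  # 200 2000
```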

Evaluation Metrics

  • Exact Match (EM): Fraction of predictions that exactly match the ground truth (reported on a 0 to 1 scale)
  • Non-Null Accuracy: Accuracy on fields where ground truth is not null
  • Null Accuracy: Accuracy in predicting null when ground truth is null
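
The three metrics above can be sketched as follows. The field-level functions follow the definitions given here; treating overall exact match as "all seven fields match" and comparing strings without any normalization are assumptions, and the toy data is made up for illustration.

```python
from typing import Dict, List, Optional

FIELDS = ("model", "task", "dataset", "metric", "value", "comparison", "domain")
Claim = Dict[str, Optional[str]]


def field_scores(preds: List[Claim], golds: List[Claim],
                 field: str) -> Dict[str, Optional[float]]:
    """Exact match, non-null accuracy, and null accuracy for one field."""
    exact = non_null_hits = non_null_total = null_hits = null_total = 0
    for pred, gold in zip(preds, golds):
        pv, gv = pred.get(field), gold.get(field)
        exact += int(pv == gv)
        if gv is None:
            null_total += 1
            null_hits += int(pv is None)
        else:
            non_null_total += 1
            non_null_hits += int(pv == gv)
    n = len(golds)
    return {
        "exact_match": exact / n if n else None,
        "non_null_accuracy": non_null_hits / non_null_total if non_null_total else None,
        "null_accuracy": null_hits / null_total if null_total else None,
    }


def overall_exact_match(preds: List[Claim], golds: List[Claim]) -> float:
    """Assumed definition: a prediction counts only if all seven fields match."""
    hits = sum(
        all(pred.get(f) == gold.get(f) for f in FIELDS)
        for pred, gold in zip(preds, golds)
    )
    return hits / len(golds)


# Toy data, not taken from the real evaluation set
golds = [{"model": "U-Net", "metric": "Dice"}, {"model": None, "metric": "accuracy"}]
preds = [{"model": "U-Net", "metric": None}, {"model": None, "metric": "accuracy"}]
metric_scores = field_scores(preds, golds, "metric")
overall = overall_exact_match(preds, golds)
```

Under these definitions, a field's exact match is a weighted mix of its non-null and null accuracy, which is why a field can show a high exact match alongside a low non-null accuracy when most gold values are null.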

Post-Processing

Model outputs are post-processed automatically:

  1. Fuzzy matching for model names
  2. Dataset name normalization
  3. Metric standardization
  4. Numeric value parsing
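
The card does not publish the post-processing code, so the following is only a minimal sketch of what the four steps could look like using the Python standard library. The vocabularies, alias tables, function names, and the 0.8 similarity cutoff are all hypothetical; converting percentages to fractions mirrors the Quick Start example (95% becoming 0.95) but is likewise an assumption.

```python
import difflib
import re
from typing import Dict, Optional

# Hypothetical vocabularies; the real lists are not given in this card
KNOWN_MODELS = ["U-Net", "ResNet", "DenseNet", "VGG16"]
DATASET_ALIASES = {"brats": "BraTS", "brats dataset": "BraTS"}
METRIC_ALIASES = {"dice": "Dice", "dice score": "Dice", "dsc": "Dice",
                  "acc": "accuracy", "accuracy": "accuracy"}


def fuzzy_model(name: Optional[str], cutoff: float = 0.8) -> Optional[str]:
    """Step 1: snap a model name to the closest known name, if close enough."""
    if name is None:
        return None
    lowered = {m.lower(): m for m in KNOWN_MODELS}
    match = difflib.get_close_matches(name.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[match[0]] if match else name


def normalize(name: Optional[str], aliases: Dict[str, str]) -> Optional[str]:
    """Steps 2 and 3: map dataset or metric aliases to a canonical spelling."""
    if name is None:
        return None
    cleaned = name.strip()
    return aliases.get(cleaned.lower(), cleaned)


def parse_value(text: Optional[str]) -> Optional[float]:
    """Step 4: pull the first number out of a string; percentages become fractions."""
    if text is None:
        return None
    found = re.search(r"-?\d+(?:\.\d+)?", text)
    if found is None:
        return None
    number = float(found.group())
    return number / 100 if "%" in text else number
```

For example, `fuzzy_model("UNet")` resolves to "U-Net", while a name with no close match, such as "Dilated SE-DenseNet", is returned unchanged.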

Citation

If you use this model, please cite:

@misc{claim-extractor-brain-tumor,
  title={Claim Information Extractor for Brain Tumor Research},
  author={Your Name},
  year={2025},
  howpublished={\url{https://huggingface.co/username/claim-extractor-brain-tumor}}
}