Anonymous NeurIPS 2026 submission. Companion code: see the linked anonymized GitHub repository. Preprocessing pipeline, training code, and evaluation cookbook will be released upon acceptance.

Model Description

smb-v1 models the complex time-varying dynamics of cancer biology through structured clinical signals. It treats structured clinical data as a multimodal environment, fusing heterogeneous data streams into a unified patient state representation.

The model was trained with a multi-objective approach combining supervised fine-tuning (SFT) and a joint-embedding predictive architecture (JEPA) objective.

Unlike general-purpose models, this checkpoint is designed to ingest and synthesize diverse structured modalities across the patient journey, including:

  • Temporal Physiological Signals: Modeling continuous longitudinal trajectories of laboratory values, vital signs, and functional status markers to capture disease progression and physiological drift over time.
  • Clinical Events & Phenotypes: Encoding discrete, high-cardinality sequences of diagnosis codes (ICD), procedure events (CPT), and adverse events to reconstruct the semantic history of the patient's care.
  • Therapeutic Interventions: Integrating complex treatment histories—including systemic therapies (chemotherapy, immunotherapy), radiation dosing schedules, and surgical interventions—to understand causal treatment-response dynamics.
  • Molecular & Genomic Profiles: Embedding high-dimensional static and dynamic biomarker panels (somatic mutations, gene expression signatures, proteomic markers) directly alongside clinical phenotypes.
  • Oncologic Staging & Outcomes: Processing structured tumor staging (TNM), histology classifications, and survival endpoints to anchor representations in ground-truth biological states.

Note: Future model versions will introduce unstructured modalities. This model establishes the foundation using the highest-fidelity structured signals available in modern oncology data warehouses.

Intended Use Cases

This model is optimized for downstream tasks requiring a deep understanding of longitudinal patient history:

  1. Predictive Risk Stratification: Forecasting adverse events, toxicity, or rapid progression based on historical trajectories.
  2. Treatment Response Modeling: Simulating potential patient outcomes under different therapeutic regimens.
  3. Patient Similarity Search: Identifying cohorts with similar biological and clinical progressions for real-world evidence generation.
  4. Clinical Trial Matching: Aligning complex patient states with structured eligibility criteria.
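To illustrate the patient-similarity use case, here is a minimal sketch of cosine-similarity retrieval over pooled patient embeddings. Everything here is illustrative: the embedding bank is random, and the dimension (2048) is a placeholder, not a claim about this checkpoint's hidden size.

```python
import torch
import torch.nn.functional as F

# Hypothetical setup: a bank of fixed-size patient embeddings (one row per
# patient) and a query embedding, e.g. obtained by pooling the model's
# last-layer hidden states as described in the Usage section.
torch.manual_seed(0)
cohort_embeddings = F.normalize(torch.randn(100, 2048), dim=-1)  # 100 patients
query_embedding = F.normalize(torch.randn(2048), dim=-1)

# After L2 normalization, cosine similarity reduces to a dot product.
scores = cohort_embeddings @ query_embedding  # shape: (100,)
top_k = torch.topk(scores, k=5)
print("Top-5 most similar patient indices:", top_k.indices.tolist())
```

For real-world-evidence cohort building, the same scores can be thresholded instead of top-k'd to select all patients above a similarity cutoff.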

Usage

The input must be a pre-formatted patient timeline (the output of the preprocessing pipeline, which will be released upon acceptance). The companion code repository ships a synthetic example demonstrating the expected text format.

1. Installation

pip install transformers torch accelerate

2. Inference Example

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<anonymous-hf-org>/smb-structure-qwen3-1.7b"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# A pre-formatted patient timeline (see the companion code repo for an example).
with open("examples/synthetic_patient.txt") as f:
    input_text = f.read()

inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
with torch.inference_mode():
    outputs = model(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        output_hidden_states=True,
        return_dict=True,
    )

# Last-layer hidden state; pool externally (e.g. mean over seq_len) for a fixed-size embedding.
patient_embedding = outputs.hidden_states[-1]
print(f"Patient representation shape: {patient_embedding.shape}")
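The comment above leaves pooling to the caller. One common recipe is attention-mask-weighted mean pooling over the sequence dimension, so padded positions do not dilute the embedding. A sketch with a dummy tensor standing in for `outputs.hidden_states[-1]` (batch size, sequence length, and hidden size here are illustrative, not the checkpoint's actual dimensions):

```python
import torch

# Dummy stand-ins for the model outputs above (batch=1, seq_len=6, hidden=2048).
hidden = torch.randn(1, 6, 2048)                     # outputs.hidden_states[-1]
attention_mask = torch.tensor([[1, 1, 1, 1, 0, 0]])  # right-padded sequence

# Zero out padded positions, then average over the real tokens only.
mask = attention_mask.unsqueeze(-1).float()          # (1, 6, 1)
pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
print(pooled.shape)  # torch.Size([1, 2048])
```

The `clamp` guards against division by zero for fully masked rows; with real model outputs, `hidden` and `attention_mask` come straight from the inference example above.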

Training Data

Single-institution longitudinal cancer EHR data containing structured information derived from pathology reports, clinical reports, biomarkers, and staging. The model was trained only on these structured derivations, not on the raw source documents. Full data documentation will accompany the camera-ready release.

Citation

@misc{smb_structure_anon_2026,
  title  = {The Patient is not a Moving Document: A World Model Training Paradigm for Longitudinal EHR},
  author = {Anonymous Authors},
  year   = {2026},
  note   = {NeurIPS 2026 submission. Under review.}
}
Base Model

Fine-tuned from Qwen/Qwen3-1.7B.