---
license: cc-by-nc-4.0
language:
- en
tags:
- medical
- text-generation
- language-model
- biopan
- jepa
library_name: transformers
pipeline_tag: text-generation
---

# SMB-v1-1.7B-Structure

## Documentation & Quickstart

For a guide to getting started, architecture details, and advanced usage, see the official documentation: [**📖 SMB-v1 Quickstart Guide**](https://docs.standardmodel.bio/get-started/quickstart)

## Model Details

* **Model Name:** SMB-v1-1.7B-Structure
* **Organization:** Standard Model Biomedicine
* **Model Family:** SMB-v1 (Biomedical Foundational Models)
* **LLM Backbone:** Qwen3-1.7B
* **Training Method:** SFT + JEPA Multi-objective
* **License:** Apache 2.0

## Model Description

**SMB-v1-1.7B-Structure** is the initial release of the SMB-v1 family, designed to model the complex, time-varying dynamics of cancer biology through structured clinical signals. It treats structured clinical data as a multimodal environment, fusing heterogeneous data streams into a unified patient state representation.

Unlike general-purpose models, SMB-v1 is designed to ingest and synthesize diverse structured modalities across the patient journey, including:

* **Temporal Physiological Signals:** Modeling continuous longitudinal trajectories of laboratory values, vital signs, and functional status markers to capture disease progression and physiological drift over time.
* **Clinical Events & Phenotypes:** Encoding discrete, high-cardinality sequences of diagnosis codes (ICD), procedure events (CPT), and adverse events to reconstruct the semantic history of the patient's care.
* **Therapeutic Interventions:** Integrating complex treatment histories, including systemic therapies (chemotherapy, immunotherapy), radiation dosing schedules, and surgical interventions, to understand causal treatment-response dynamics.
* **Molecular & Genomic Profiles:** Embedding high-dimensional static and dynamic biomarker panels (somatic mutations, gene expression signatures, proteomic markers) directly alongside clinical phenotypes.
* **Oncologic Staging & Outcomes:** Processing structured tumor staging (TNM), histology classifications, and survival endpoints to anchor representations in ground-truth biological states.

> **Note:** While the full `SMB-v1` family will introduce unstructured modalities, this **-Structure** variant establishes the foundation using the highest-fidelity structured signals available in modern oncology data warehouses.

## Intended Use Cases

This model is optimized for downstream tasks requiring a deep understanding of longitudinal patient history:

1. **Predictive Risk Stratification:** Forecasting adverse events, toxicity, or rapid progression based on historical trajectories.
2. **Treatment Response Modeling:** Simulating potential patient outcomes under different therapeutic regimens.
3. **Patient Similarity Search:** Identifying cohorts with similar biological and clinical progressions for real-world evidence generation.
4. **Clinical Trial Matching:** Aligning complex patient states with structured eligibility criteria.

## Usage

To use this model effectively, your input data must be in the [**MEDS**](https://medical-event-data-standard.github.io/docs/intro_pages/what_is_MEDS) (Medical Event Data Standard) format and processed using the `smb_biopan_utils` package. This ensures that patient event timelines are correctly serialized into the structured text format the model expects.

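For orientation, a MEDS-style event table can be sketched as one row per patient event. The snippet below is a minimal illustration, not the utility's implementation: the `subject_id`, `time`, `code`, and `numeric_value` columns follow the MEDS core schema, the `table` column is taken from the inference example in this card, and every code string and `table` label is a hypothetical placeholder. The filter at the end is only an assumption about what selecting a subject and a prediction timepoint means.

```python
import pandas as pd

# Illustrative values only: the code strings and 'table' labels are hypothetical.
events = pd.DataFrame(
    {
        "subject_id": ["patient_123"] * 4,
        "time": pd.to_datetime(
            ["2023-03-01", "2023-03-01", "2023-06-15", "2024-02-10"]
        ),
        "code": [
            "ICD10CM//C34.90",
            "LAB//HEMOGLOBIN",
            "LAB//HEMOGLOBIN",
            "LAB//HEMOGLOBIN",
        ],
        "numeric_value": [None, 13.1, 11.4, 10.2],  # missing for non-numeric events
        "table": ["conditions", "labs", "labs", "labs"],
    }
)

# Keep one subject's events up to a prediction timepoint, in time order.
# This mirrors the idea of a per-subject timeline cut off at `end_time`;
# whether the real utility uses an inclusive cutoff is not stated in this card.
end_time = pd.Timestamp("2024-01-01")
history = (
    events[(events["subject_id"] == "patient_123") & (events["time"] <= end_time)]
    .sort_values("time")
    .reset_index(drop=True)
)
print(history)
```

Events dated after the cutoff (here, the February 2024 lab value) are excluded, so nothing that happened after the prediction timepoint leaks into the history.
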
### 1. Installation

Install the model dependencies and the data utility package. PyTorch is required for `return_tensors="pt"`, and `accelerate` is required for `device_map="auto"`:

```bash
pip install transformers pandas torch accelerate
pip install git+https://github.com/standardmodelbio/smb-biopan-utils.git
```

### 2. Inference Example

The following example demonstrates how to load the model, process raw MEDS data using `process_ehr_info`, and generate a patient representation.

```python
import pandas as pd
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from smb_biopan_utils import process_ehr_info

# 1. Load Model and Tokenizer
model_id = "standardmodelbio/SMB-v1-1.7B-Structure"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    device_map="auto",
)

# 2. Load Patient Data (MEDS Format)
# Ensure your dataframe contains columns for 'time', 'code', 'table', etc.
df_meds = pd.read_parquet("path/to/patient_data.parquet")

# 3. Format Data for Inference
# This utility converts the DataFrame into the structured text format
# (e.g., <conditions>...</conditions>) expected by SMB-v1.
input_text = process_ehr_info(
    df=df_meds,
    subject_id="patient_123",  # Specify the subject to process
    end_time=pd.Timestamp("2024-01-01"),  # Prediction timepoint
)

# 4. Generate Representation
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        output_hidden_states=True,
        return_dict=True,
    )

# Last-layer hidden states, shape (batch, seq_len, hidden_size),
# used as the patient representation
patient_embedding = outputs.hidden_states[-1]
print(f"Patient Representation Shape: {patient_embedding.shape}")
```

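The `patient_embedding` above holds one vector per token, so tasks such as patient similarity search usually need it reduced to a single fixed-size vector per patient. This card does not specify a pooling strategy. The sketch below assumes masked mean pooling followed by cosine similarity, which is a common choice rather than the authors' documented method. It uses NumPy arrays with random stand-in values so it runs without the model; real tensors could be converted first with `tensor.detach().float().cpu().numpy()`. The helper names `mean_pool` and `cosine_similarity` are hypothetical.

```python
import numpy as np


def mean_pool(hidden, mask):
    """Average token vectors over non-padding positions.

    hidden: (batch, seq_len, hidden_size) last-layer hidden states
    mask:   (batch, seq_len) with 1 for real tokens and 0 for padding
    """
    mask = mask[..., None].astype(hidden.dtype)      # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)             # (batch, hidden_size)
    counts = np.clip(mask.sum(axis=1), 1.0, None)    # avoid division by zero
    return summed / counts


def cosine_similarity(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T


# Random stand-ins for two patients' hidden states (not real model output).
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 5, 8)).astype(np.float32)
mask = np.array([[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]])  # second patient is padded

pooled = mean_pool(hidden, mask)          # one vector per patient
sims = cosine_similarity(pooled, pooled)  # patient-by-patient similarity matrix
print(pooled.shape, sims.shape)
```

Masking matters once several patients are batched together with padding: without it, padding positions would be averaged into the shorter patient's vector.
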
## Citation

If you use this model in your research or application, please cite:

```bibtex
@misc{biopan_omni,
  author    = {standardmodelbio},
  title     = {SMB-v1-1.7B-Structure},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/standardmodelbio/SMB-v1-1.7B-Structure}
}
```