---
license: cc-by-nc-4.0
language:
- en
tags:
- medical
- text-generation
- language-model
- biopan
- jepa
library_name: transformers
pipeline_tag: text-generation
---

# SMB-v1-1.7B-Structure

## Documentation & Quickstart

For a comprehensive guide covering setup, architecture details, and advanced usage, see the official documentation: [**📖 SMB-v1 Quickstart Guide**](https://docs.standardmodel.bio/get-started/quickstart)

## Model Details

  * **Model Name:** SMB-v1-1.7B-Structure
  * **Organization:** Standard Model Biomedicine
  * **Model Family:** SMB-v1 (Biomedical Foundational Models)
  * **LLM Backbone:** Qwen3-1.7B
  * **Training Method:** SFT + JEPA Multi-objective
  * **License:** Apache 2.0

## Model Description

**SMB-v1-1.7B-Structure** is the initial release of the SMB-v1 family, specifically engineered to model the complex, time-varying dynamics of cancer biology through structured clinical signals. It treats structured clinical data as a multimodal environment, fusing heterogeneous data streams into a unified patient state representation.

Unlike general-purpose models, SMB-v1 is designed to ingest and synthesize diverse structured modalities across the patient journey, including:

  * **Temporal Physiological Signals:** Modeling continuous longitudinal trajectories of laboratory values, vital signs, and functional status markers to capture disease progression and physiological drift over time.
  * **Clinical Events & Phenotypes:** Encoding discrete, high-cardinality sequences of diagnosis codes (ICD), procedure events (CPT), and adverse events to reconstruct the semantic history of the patient's care.
  * **Therapeutic Interventions:** Integrating complex treatment histories—including systemic therapies (chemotherapy, immunotherapy), radiation dosing schedules, and surgical interventions—to understand causal treatment-response dynamics.
  * **Molecular & Genomic Profiles:** Embedding high-dimensional static and dynamic biomarker panels (somatic mutations, gene expression signatures, proteomic markers) directly alongside clinical phenotypes.
  * **Oncologic Staging & Outcomes:** Processing structured tumor staging (TNM), histology classifications, and survival endpoints to anchor representations in ground-truth biological states.

> **Note:** While the full `SMB-v1` family will introduce unstructured modalities, this **-Structure** variant establishes the foundation using the highest-fidelity structured signals available in modern oncology data warehouses.

## Intended Use Cases

This model is optimized for downstream tasks requiring a deep understanding of longitudinal patient history:

1.  **Predictive Risk Stratification:** Forecasting adverse events, toxicity, or rapid progression based on historical trajectories.
2.  **Treatment Response Modeling:** Simulating potential patient outcomes under different therapeutic regimens.
3.  **Patient Similarity Search:** Identifying cohorts with similar biological and clinical progressions for real-world evidence generation.
4.  **Clinical Trial Matching:** Aligning complex patient states with structured eligibility criteria.
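
As an illustration of the patient similarity use case, the sketch below ranks a small cohort by cosine similarity to a query patient. It assumes each patient has already been reduced to a single fixed-length vector; the toy 3-dimensional vectors and the helper name `rank_by_cosine_similarity` are hypothetical and not part of any SMB package, and cosine similarity is only one possible choice of metric.

```python
import numpy as np


def rank_by_cosine_similarity(query, candidates):
    """Return (indices, scores) of candidates sorted by descending cosine similarity."""
    query = np.asarray(query, dtype=np.float64)
    candidates = np.asarray(candidates, dtype=np.float64)
    # Normalise to unit length so the dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(-scores)
    return order, scores[order]


# Toy vectors standing in for pooled patient embeddings (hypothetical data).
query_patient = [1.0, 0.0, 0.0]
cohort = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 0.0],  # same direction, different magnitude
    [1.0, 1.0, 0.0],  # 45 degrees from the query
]

order, scores = rank_by_cosine_similarity(query_patient, cohort)
print(order.tolist(), np.round(scores, 3).tolist())
```

For large cohorts, an approximate nearest-neighbour index would usually replace the exhaustive comparison shown here.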


## Usage

To use this model effectively, your input data must be in the [**MEDS**](https://medical-event-data-standard.github.io/docs/intro_pages/what_is_MEDS) (Medical Event Data Standard) format and processed using the `smb_biopan_utils` package. This ensures that patient event timelines are correctly serialized into the structured text format the model expects.
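
To make the shape of a MEDS-style timeline concrete, here is a minimal sketch using plain Python records. The field names `subject_id`, `time`, `code`, and `numeric_value` are assumed from the MEDS core schema; the codes and values are invented for illustration, and any additional columns that `smb_biopan_utils` expects are not shown. The helper `events_before` is hypothetical: it only illustrates restricting one subject's history to events at or before a chosen timepoint, and does not reproduce what the utility does internally.

```python
from datetime import datetime

# Hypothetical toy events, one record per clinical event.
events = [
    {"subject_id": "patient_123", "time": datetime(2024, 2, 10), "code": "LAB//hemoglobin", "numeric_value": 11.2},
    {"subject_id": "patient_123", "time": datetime(2023, 6, 1), "code": "DIAGNOSIS//example", "numeric_value": None},
    {"subject_id": "patient_456", "time": datetime(2023, 7, 4), "code": "LAB//hemoglobin", "numeric_value": 13.8},
    {"subject_id": "patient_123", "time": datetime(2023, 9, 15), "code": "LAB//hemoglobin", "numeric_value": 12.5},
]


def events_before(events, subject_id, end_time):
    """Return one subject's events with time <= end_time, oldest first."""
    selected = [
        e for e in events
        if e["subject_id"] == subject_id and e["time"] <= end_time
    ]
    return sorted(selected, key=lambda e: e["time"])


timeline = events_before(events, "patient_123", datetime(2024, 1, 1))
for event in timeline:
    print(event["time"].date(), event["code"], event["numeric_value"])
```

Cutting the timeline at a fixed timepoint keeps later events out of the model input, which matters when the downstream task is a forecast.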

### 1. Installation

Install the model dependencies and the data utility package:

```bash
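# accelerate is required for device_map="auto"; pyarrow lets pandas read Parquet files
pip install transformers accelerate torch pandas pyarrow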
pip install git+https://github.com/standardmodelbio/smb-biopan-utils.git
```

### 2. Inference Example

The following example demonstrates how to load the model, process raw MEDS data using `process_ehr_info`, and generate a patient representation.

```python
import pandas as pd
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from smb_biopan_utils import process_ehr_info

# 1. Load Model and Tokenizer
model_id = "standardmodelbio/SMB-v1-1.7B-Structure"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    device_map="auto"
)

# 2. Load Patient Data (MEDS Format)
# Ensure your dataframe contains columns for 'time', 'code', 'table', etc.
df_meds = pd.read_parquet("path/to/patient_data.parquet")

# 3. Format Data for Inference
# This utility converts the DataFrame into the structured text format
# (e.g., <conditions>...</conditions>) expected by SMB-v1.
input_text = process_ehr_info(
    df=df_meds,
    subject_id="patient_123",  # Specify the subject to process
    end_time=pd.Timestamp("2024-01-01")  # Prediction timepoint
)

# 4. Generate Representation
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# Gradients are not needed for inference; disabling them saves memory.
with torch.no_grad():
    outputs = model(
        **inputs,  # passes input_ids together with the attention mask
        output_hidden_states=True,
        return_dict=True
    )

# Extract the final-layer hidden states as the patient representation.
# Shape: (batch_size, sequence_length, hidden_size), one vector per token.
patient_embedding = outputs.hidden_states[-1]
print(f"Patient Representation Shape: {patient_embedding.shape}")
```
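
The hidden states extracted above contain one vector per input token. Many downstream tasks need a single fixed-length vector per patient, so a pooling step is typically applied. The sketch below shows masked mean pooling, which is one common choice; this model card does not state which pooling strategy is recommended for SMB-v1, so treat it as an assumption. It uses NumPy arrays for brevity, and the helper name `masked_mean_pool` is hypothetical. A PyTorch tensor can be converted first with `.detach().cpu().float().numpy()`.

```python
import numpy as np


def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, ignoring positions where the mask is 0.

    hidden_states: array of shape (batch, seq_len, hidden)
    attention_mask: array of shape (batch, seq_len) containing 0s and 1s
    """
    hidden_states = np.asarray(hidden_states, dtype=np.float64)
    mask = np.asarray(attention_mask, dtype=np.float64)[..., None]  # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(axis=1)
    # Clip the token count so an all-zero mask does not divide by zero.
    counts = np.clip(mask.sum(axis=1), 1.0, None)
    return summed / counts


# Toy input: one sequence, three tokens, hidden size 2; the last token is padding.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])

pooled = masked_mean_pool(hidden, mask)
print(pooled.shape, pooled.tolist())
```

For a decoder-only backbone, taking the hidden state of the last non-padding token is another common option; which works better depends on the downstream task.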

## Citation

If you use this model in your research or application, please cite:

```bibtex
@misc{biopan_omni,
  author = {standardmodelbio},
  title = {SMB-v1-1.7B-Structure},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/standardmodelbio/SMB-v1-1.7B-Structure}
}
```