---
license: cc-by-nc-4.0
language:
- en
tags:
- medical
- text-generation
- language-model
- biopan
- jepa
library_name: transformers
pipeline_tag: text-generation
---
# SMB-v1-1.7B-Structure
## Documentation & Quickstart
For a comprehensive guide on getting started, architecture details, and advanced usage, please visit our official documentation: [**📖 SMB-v1 Quickstart Guide**](https://docs.standardmodel.bio/get-started/quickstart)
## Model Details
* **Model Name:** SMB-v1-1.7B-Structure
* **Organization:** Standard Model Biomedicine
* **Model Family:** SMB-v1 (Biomedical Foundational Models)
* **LLM Backbone:** Qwen3-1.7B
* **Training Method:** SFT + JEPA Multi-objective
* **License:** CC BY-NC 4.0
## Model Description
**SMB-v1-1.7B-Structure** is the initial release of the SMB-v1 family, specifically engineered to model the complex, time-varying dynamics of cancer biology through structured clinical signals. It treats structured clinical data as a multimodal environment, fusing heterogeneous data streams into a unified patient state representation.
Unlike general-purpose models, SMB-v1 is designed to ingest and synthesize diverse structured modalities across the patient journey, including:
* **Temporal Physiological Signals:** Modeling continuous longitudinal trajectories of laboratory values, vital signs, and functional status markers to capture disease progression and physiological drift over time.
* **Clinical Events & Phenotypes:** Encoding discrete, high-cardinality sequences of diagnosis codes (ICD), procedure events (CPT), and adverse events to reconstruct the semantic history of the patient's care.
* **Therapeutic Interventions:** Integrating complex treatment histories—including systemic therapies (chemotherapy, immunotherapy), radiation dosing schedules, and surgical interventions—to understand causal treatment-response dynamics.
* **Molecular & Genomic Profiles:** Embedding high-dimensional static and dynamic biomarker panels (somatic mutations, gene expression signatures, proteomic markers) directly alongside clinical phenotypes.
* **Oncologic Staging & Outcomes:** Processing structured tumor staging (TNM), histology classifications, and survival endpoints to anchor representations in ground-truth biological states.
> **Note:** While the full `SMB-v1` family will introduce unstructured modalities, this **-Structure** variant establishes the foundation using the highest-fidelity structured signals available in modern oncology data warehouses.
## Intended Use Cases
This model is optimized for downstream tasks requiring a deep understanding of longitudinal patient history:
1. **Predictive Risk Stratification:** Forecasting adverse events, toxicity, or rapid progression based on historical trajectories.
2. **Treatment Response Modeling:** Simulating potential patient outcomes under different therapeutic regimens.
3. **Patient Similarity Search:** Identifying cohorts with similar biological and clinical progressions for real-world evidence generation.
4. **Clinical Trial Matching:** Aligning complex patient states with structured eligibility criteria.
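
As a rough illustration of the patient similarity use case, the sketch below ranks a cohort by cosine similarity to a query patient. It assumes each patient has already been reduced to a single fixed-length embedding vector; random vectors stand in for real embeddings, and `top_k_similar` is a hypothetical helper, not part of any SMB package.

```python
import numpy as np


def top_k_similar(query, cohort, k=3):
    """Return indices and cosine similarities of the k cohort rows closest to query."""
    # Normalize so that a dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    c = cohort / np.linalg.norm(cohort, axis=1, keepdims=True)
    sims = c @ q
    order = np.argsort(-sims)[:k]  # highest similarity first
    return order, sims[order]


# Random vectors stand in for real patient embeddings.
rng = np.random.default_rng(0)
cohort = rng.normal(size=(100, 64))
query = cohort[7] + 0.01 * rng.normal(size=64)  # near-duplicate of patient 7

indices, scores = top_k_similar(query, cohort)
print(indices, scores)
```

For large cohorts, an approximate nearest-neighbor index would likely replace the brute-force matrix product used here.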
## Usage
To use this model effectively, your input data must be in the [**MEDS**](https://medical-event-data-standard.github.io/docs/intro_pages/what_is_MEDS) (Medical Event Data Standard) format and processed using the `smb_biopan_utils` package. This ensures that patient event timelines are correctly serialized into the structured text format the model expects.
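
As a rough illustration of the input shape, the sketch below builds a tiny MEDS-style event table with pandas. The `subject_id`, `time`, `code`, and `numeric_value` column names follow the core MEDS schema; the `table` column, the string subject ID, the codes, and the values are made-up placeholders. The exact columns that `smb_biopan_utils` requires should be checked against that package.

```python
import pandas as pd

# Toy MEDS-style event table: one row per (subject, time, code) event.
# All codes and values are illustrative placeholders, not real patient data.
df_meds = pd.DataFrame(
    {
        "subject_id": ["patient_123"] * 3,
        "time": pd.to_datetime(["2023-06-02", "2023-03-01", "2023-03-15"]),
        "code": ["CPT//32480", "ICD10//C34.90", "LAB//hemoglobin"],
        "numeric_value": [None, None, 11.2],
        "table": ["procedures", "conditions", "labs"],  # assumed column
    }
)

# Illustrative filter only: keep events before a prediction timepoint,
# in chronological order.
end_time = pd.Timestamp("2024-01-01")
history = df_meds[df_meds["time"] < end_time].sort_values("time")
print(history)
```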
### 1. Installation
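Install Transformers (with PyTorch and Accelerate, which `device_map="auto"` requires), pandas with a Parquet engine, and the `smb_biopan_utils` data utility package:
```bash
pip install transformers accelerate torch pandas pyarrow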
pip install git+https://github.com/standardmodelbio/smb-biopan-utils.git
```
### 2. Inference Example
The following example demonstrates how to load the model, process raw MEDS data using `process_ehr_info`, and generate a patient representation.
```python
import pandas as pd
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from smb_biopan_utils import process_ehr_info

# 1. Load model and tokenizer
model_id = "standardmodelbio/SMB-v1-1.7B-Structure"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    device_map="auto",
)

# 2. Load patient data (MEDS format)
# Ensure your DataFrame contains columns for 'time', 'code', 'table', etc.
df_meds = pd.read_parquet("path/to/patient_data.parquet")

# 3. Format data for inference
# This utility converts the DataFrame into the structured text format
# (e.g., <conditions>...</conditions>) expected by SMB-v1.
input_text = process_ehr_info(
    df=df_meds,
    subject_id="patient_123",  # Specify the subject to process
    end_time=pd.Timestamp("2024-01-01"),  # Prediction timepoint
)

# 4. Generate representation
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
with torch.no_grad():  # Inference only, so no gradients are needed
    outputs = model(
        input_ids=inputs.input_ids,
        output_hidden_states=True,
        return_dict=True,
    )

# Use the final layer's hidden states as the patient representation.
# Shape: (batch, sequence_length, hidden_size), one vector per input token.
patient_embedding = outputs.hidden_states[-1]
print(f"Patient Representation Shape: {patient_embedding.shape}")
```
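
The final hidden states contain one vector per input token, while downstream tasks such as similarity search usually need a single fixed-length vector per patient. This card does not specify a pooling strategy, so the sketch below shows two common options as assumptions: masked mean pooling and last-token pooling. A random tensor stands in for real hidden states, and `pool_hidden_states` is a hypothetical helper, not part of `transformers` or `smb_biopan_utils`.

```python
import torch


def pool_hidden_states(hidden, attention_mask, mode="mean"):
    """Reduce (batch, seq_len, hidden_size) token states to (batch, hidden_size)."""
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)  # (batch, seq_len, 1)
    if mode == "mean":
        # Average over real tokens only; padding positions contribute zero.
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    if mode == "last":
        # Index of the last real token in each row, assuming right padding.
        last = attention_mask.sum(dim=1) - 1
        return hidden[torch.arange(hidden.size(0)), last]
    raise ValueError(f"unknown pooling mode: {mode}")


# Stand-in data: 2 sequences, 5 tokens each, hidden size 8.
hidden = torch.randn(2, 5, 8)
attention_mask = torch.tensor([[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]])

mean_vec = pool_hidden_states(hidden, attention_mask, "mean")
last_vec = pool_hidden_states(hidden, attention_mask, "last")
print(mean_vec.shape, last_vec.shape)
```

With the inference example above, the call would look like `pool_hidden_states(patient_embedding, inputs.attention_mask)`. Which pooling works best for a given task is an empirical question.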
## Citation
If you use this model in your research or application, please cite:
```bibtex
@misc{biopan_omni,
author = {standardmodelbio},
title = {SMB-v1-1.7B-Structure},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/standardmodelbio/SMB-v1-1.7B-Structure}
}
```