	---
	license: cc-by-nc-4.0
	language:
	- en
	tags:
	- medical
	- text-generation
	- language-model
	- biopan
	- jepa
	library_name: transformers
	pipeline_tag: text-generation
	---

	# SMB-v1-1.7B-Structure

	## Documentation & Quickstart

	For a comprehensive guide on getting started, architecture details, and advanced usage, please visit our official documentation: [📖 SMB-v1 Quickstart Guide](https://docs.standardmodel.bio/get-started/quickstart)

	## Model Details

	* Model Name: SMB-v1-1.7B-Structure
	* Organization: Standard Model Biomedicine
	* Model Family: SMB-v1 (Biomedical Foundational Models)
	* LLM Backbone: Qwen3-1.7B
	* Training Method: SFT + JEPA Multi-objective
	* License: CC BY-NC 4.0

	## Model Description

	SMB-v1-1.7B-Structure is the initial release of the SMB-v1 family, specifically engineered to model the complex, time-varying dynamics of cancer biology through structured clinical signals. It treats structured clinical data as a multimodal environment, fusing heterogeneous data streams into a unified patient state representation.

	Unlike general-purpose models, SMB-v1 is designed to ingest and synthesize diverse structured modalities across the patient journey, including:

	* Temporal Physiological Signals: Modeling continuous longitudinal trajectories of laboratory values, vital signs, and functional status markers to capture disease progression and physiological drift over time.
	* Clinical Events & Phenotypes: Encoding discrete, high-cardinality sequences of diagnosis codes (ICD), procedure events (CPT), and adverse events to reconstruct the semantic history of the patient's care.
	* Therapeutic Interventions: Integrating complex treatment histories, including systemic therapies (chemotherapy, immunotherapy), radiation dosing schedules, and surgical interventions, to understand causal treatment-response dynamics.
	* Molecular & Genomic Profiles: Embedding high-dimensional static and dynamic biomarker panels (somatic mutations, gene expression signatures, proteomic markers) directly alongside clinical phenotypes.
	* Oncologic Staging & Outcomes: Processing structured tumor staging (TNM), histology classifications, and survival endpoints to anchor representations in ground-truth biological states.

	> Note: While the full `SMB-v1` family will introduce unstructured modalities, the `-Structure` variant establishes the foundation using the highest-fidelity structured signals available in modern oncology data warehouses.

	## Intended Use Cases

	This model is optimized for downstream tasks requiring a deep understanding of longitudinal patient history:

	1. Predictive Risk Stratification: Forecasting adverse events, toxicity, or rapid progression based on historical trajectories.
	2. Treatment Response Modeling: Simulating potential patient outcomes under different therapeutic regimens.
	3. Patient Similarity Search: Identifying cohorts with similar biological and clinical progressions for real-world evidence generation.
	4. Clinical Trial Matching: Aligning complex patient states with structured eligibility criteria.
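
	As a rough illustration of the patient similarity use case, the sketch below ranks a cohort by cosine similarity to a query patient. It assumes each patient has already been reduced to one fixed-size embedding vector; the function name `top_k_similar` and the toy vectors are hypothetical and are not part of the SMB-v1 tooling.

	```python
	import numpy as np

	def top_k_similar(query: np.ndarray, cohort: np.ndarray, k: int = 2) -> list[int]:
	    """Return indices of the k cohort rows most similar to the query (cosine)."""
	    q = query / np.linalg.norm(query)
	    c = cohort / np.linalg.norm(cohort, axis=1, keepdims=True)
	    scores = c @ q  # cosine similarity of each cohort row to the query
	    return np.argsort(-scores)[:k].tolist()

	# Toy 2-D embeddings standing in for real patient vectors (hypothetical data)
	cohort = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
	query = np.array([1.0, 0.0])
	print(top_k_similar(query, cohort))  # [0, 1]
	```

	For large cohorts, an approximate nearest-neighbor index would likely replace the brute-force matrix product used here.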


	## Usage

	To use this model effectively, your input data must be in the [MEDS](https://medical-event-data-standard.github.io/docs/intro_pages/what_is_MEDS) (Medical Event Data Standard) format and processed using the `smb_biopan_utils` package. This ensures that patient event timelines are correctly serialized into the structured text format the model expects.
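
	For orientation, here is a minimal sketch of what a MEDS-style event table might look like in pandas. The `subject_id`, `time`, `code`, and `numeric_value` columns follow the MEDS core schema; the `table` column, the code strings, and all values are illustrative assumptions, and the string subject ID simply mirrors the inference example in this README (MEDS itself specifies integer subject IDs). The time filter only mimics the idea of a prediction cutoff and is not the actual logic of `process_ehr_info`.

	```python
	import pandas as pd

	# Hypothetical events for one subject; codes and values are made up.
	df_meds = pd.DataFrame(
	    {
	        "subject_id": ["patient_123"] * 3,
	        "time": pd.to_datetime(["2023-06-15", "2023-03-01", "2023-03-01"]),
	        "code": ["LAB//hemoglobin", "ICD10CM//C34.90", "LAB//hemoglobin"],
	        "numeric_value": [10.4, float("nan"), 11.2],  # NaN for non-numeric events
	        "table": ["labs", "conditions", "labs"],  # assumed source-table label
	    }
	)

	# Keep one subject's events up to a cutoff, ordered chronologically.
	end_time = pd.Timestamp("2024-01-01")
	history = df_meds[
	    (df_meds["subject_id"] == "patient_123") & (df_meds["time"] <= end_time)
	].sort_values("time", kind="stable")
	print(history[["time", "code", "numeric_value"]])
	```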

	### 1. Installation

	Ensure you have the model package and the data utility package installed:

	```bash
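	# accelerate is needed for device_map="auto"; pyarrow lets pandas read Parquet files
	pip install torch transformers accelerate pandas pyarrow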
	pip install git+https://github.com/standardmodelbio/smb-biopan-utils.git
	```

	### 2. Inference Example

	The following example demonstrates how to load the model, process raw MEDS data using `process_ehr_info`, and generate a patient representation.

	```python
	import pandas as pd
	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer
	from smb_biopan_utils import process_ehr_info

	# 1. Load Model and Tokenizer
	model_id = "standardmodelbio/SMB-v1-1.7B-Structure"
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForCausalLM.from_pretrained(
	    model_id,
	    trust_remote_code=True,
	    device_map="auto",
	)
	model.eval()  # Inference only: disable dropout

	# 2. Load Patient Data (MEDS Format)
	# Ensure your dataframe contains columns for 'time', 'code', 'table', etc.
	df_meds = pd.read_parquet("path/to/patient_data.parquet")

	# 3. Format Data for Inference
	# This utility converts the DataFrame into the structured text format
	# (e.g., <conditions>...</conditions>) expected by SMB-v1.
	input_text = process_ehr_info(
	    df=df_meds,
	    subject_id="patient_123",  # Specify the subject to process
	    end_time=pd.Timestamp("2024-01-01"),  # Prediction timepoint
	)

	# 4. Generate Representation
	inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

	with torch.no_grad():  # Gradients are not needed for feature extraction
	    outputs = model(
	        input_ids=inputs.input_ids,
	        attention_mask=inputs.attention_mask,
	        output_hidden_states=True,
	        return_dict=True,
	    )

	# Extract the last hidden state as the patient representation.
	# Shape: (batch_size, sequence_length, hidden_size), one vector per token.
	patient_embedding = outputs.hidden_states[-1]
	print(f"Patient Representation Shape: {patient_embedding.shape}")
	```
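
	The last hidden state holds one vector per input token, so downstream tasks that need a single fixed-size vector per patient usually pool over the sequence axis. The sketch below shows masked mean pooling, which is one common choice and an assumption here rather than a documented SMB-v1 recommendation; `mean_pool` is a hypothetical helper, and the small tensors stand in for `outputs.hidden_states[-1]` and `inputs.attention_mask`.

	```python
	import torch

	def mean_pool(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
	    """Average token vectors over the sequence axis, ignoring padded positions."""
	    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)  # (batch, seq_len, 1)
	    summed = (hidden * mask).sum(dim=1)                   # (batch, hidden_size)
	    counts = mask.sum(dim=1).clamp(min=1.0)               # avoid division by zero
	    return summed / counts

	# Stand-in tensors: batch of 1, three tokens (the last one is padding), hidden size 2
	hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
	attention_mask = torch.tensor([[1, 1, 0]])
	pooled = mean_pool(hidden, attention_mask)
	print(pooled)  # tensor([[2., 3.]])
	```

	With the inference example above, the equivalent call would be `mean_pool(patient_embedding, inputs.attention_mask)`. Taking the final token's vector is another option sometimes used with decoder-only models.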

	## Citation

	If you use this model in your research or application, please cite:

	```bibtex
	@misc{biopan_omni,
	  author    = {standardmodelbio},
	  title     = {SMB-v1-1.7B-Structure},
	  year      = {2025},
	  publisher = {HuggingFace},
	  url       = {https://huggingface.co/standardmodelbio/SMB-v1-1.7B-Structure}
	}
	```