Starling oral-bioavailability transfer model
Given two molecules (SMILES + study metadata) and molecule A's measured oral bioavailability, this model predicts whether oral-bioavailability behavior transfers from A to B, i.e. whether the two molecules behave similarly under the given study context. It is self-contained: the frozen encoders are bundled with the trained head, so it runs end-to-end on raw inputs.
Architecture
Per molecule (siamese: the same encoders + projections are applied to A and B; only the head is position-aware):
- Molecule encoder: ibm-research/MoLFormer-XL-both-10pct (MolFormer-XL), frozen. SMILES → mean-pooled token embedding → 768-d, then a 2-layer MLP (768→1024→768).
- Metadata encoder: sentence-transformers/all-MiniLM-L6-v2 (MiniLM), frozen. Each of the 7 metadata fields is embedded separately (mean-pooled, L2-normalized) → 384-d, then a learned per-field projection → 64-d (7×64 = 448-d total). A missing/empty field uses a learned per-field "missing" embedding instead of the text embedding, so absent metadata is handled gracefully and distinctly from any real value.
- Per molecule = [mol_mlp (768) | metadata (448)] = 1216-d.
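The "mean-pooled, L2-normalized" step above can be illustrated with a minimal pure-Python sketch on toy 2-d vectors. The helper names `mean_pool` and `l2_normalize` are hypothetical, and skipping padding tokens via an attention mask is an assumption; the card does not say how the bundled encoders handle padding.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit Euclidean length (eps guards against zero norm)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]

# Toy 2-d "token embeddings"; the last row stands in for padding and is masked out.
tokens = [[3.0, 0.0], [0.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)   # mean of the first two rows
unit = l2_normalize(pooled)        # same direction, length 1
print(pooled, unit)
```

In the real model the vectors are 768-d (molecule) or 384-d (metadata field) rather than 2-d.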
Pair head:
- Concatenate [z_A, z_B] (2×1216) + molecule A's bioavailability scalar (value_A / 100) → 2433-d input.
- A pre-norm residual SwiGLU MLP (32 blocks, width 1024, FFN 4096) → one logit. sigmoid(logit) = P(transfer). ~407M trainable params; encoders frozen.
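As a sanity check on the dimensions and the ~407M figure, here is a rough back-of-the-envelope count. It assumes a layer layout that the card does not spell out: bias-free weight matrices only, three matrices per SwiGLU block (gate, up, down), a linear input projection to width 1024, and 64-d "missing" embeddings. Treat it as an estimate under those assumptions, not the actual module list.

```python
# Dimensions quoted in the architecture description.
MOL_DIM, MOL_HIDDEN = 768, 1024
N_FIELDS, TEXT_DIM, FIELD_DIM = 7, 384, 64
WIDTH, FFN, N_BLOCKS = 1024, 4096, 32

per_molecule = MOL_DIM + N_FIELDS * FIELD_DIM   # 768 + 448
head_input = 2 * per_molecule + 1               # z_A, z_B, and A's scalar

# ASSUMPTION: weights only (no biases or norm parameters), layout guessed.
mol_mlp = MOL_DIM * MOL_HIDDEN + MOL_HIDDEN * MOL_DIM   # 768 -> 1024 -> 768
field_proj = N_FIELDS * TEXT_DIM * FIELD_DIM            # one 384 -> 64 map per field
missing_emb = N_FIELDS * FIELD_DIM                      # one 64-d vector per field
input_proj = head_input * WIDTH                         # 2433 -> 1024
swiglu_blocks = N_BLOCKS * 3 * WIDTH * FFN              # gate, up, down per block
output_proj = WIDTH                                     # 1024 -> 1 logit

total = (mol_mlp + field_proj + missing_emb
         + input_proj + swiglu_blocks + output_proj)
print(per_molecule, head_input, f"{total / 1e6:.1f}M")
```

Under these assumptions the residual blocks account for nearly all of the trainable parameters, and the total lands close to the stated ~407M.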
Metadata fields (order matters)
molecule_name, species_or_population, dose, oral_exposure_mode, qualifying_conditions, comparator, extra_details
Pass a dict per molecule keyed by these names. Omit a key, or pass None/"", for a missing field; the model then uses its learned per-field "missing" embedding.
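To make the "order matters" and missing-field rules concrete, here is a small user-side sketch that lays a metadata dict out in the fixed field order and marks absent values. `ordered_fields` is a hypothetical helper, not part of the model's code; rejecting unknown keys and stripping whitespace are assumptions added for input hygiene.

```python
FIELDS = (
    "molecule_name", "species_or_population", "dose", "oral_exposure_mode",
    "qualifying_conditions", "comparator", "extra_details",
)

def ordered_fields(metadata):
    """Return [(field, text_or_None)] in the fixed field order.

    None marks a field that is omitted, None, or empty, i.e. one for which
    the model would fall back to its learned "missing" embedding.
    """
    unknown = set(metadata) - set(FIELDS)
    if unknown:  # ASSUMPTION: catch typos early; the model may behave differently
        raise KeyError(f"unknown metadata fields: {sorted(unknown)}")
    rows = []
    for name in FIELDS:
        value = metadata.get(name)
        text = str(value).strip() if value is not None else ""
        rows.append((name, text or None))
    return rows

rows = ordered_fields({"species_or_population": "human", "dose": "", "comparator": None})
for name, text in rows:
    print(f"{name:24s} {'<missing>' if text is None else text}")
```

A check like this is mainly useful for catching misspelled keys, which would otherwise look the same as a missing field.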
Usage
from transformers import AutoModel
m = AutoModel.from_pretrained("jiosephlee/starling-transfer-ssv2-srcval", trust_remote_code=True).eval()
out = m(
    smiles_a=["CC(=O)Oc1ccccc1C(=O)O"],  # molecule A (bioavailability known)
    smiles_b=["CCO"],                    # molecule B (candidate)
    metadata_a=[{"species_or_population": "human", "dose": "325 mg", "oral_exposure_mode": "tablet"}],
    metadata_b=[{"species_or_population": "human"}],  # missing fields are fine
    source_value=[68.0],                 # molecule A's RAW oral_bioavailability_value (e.g. percent)
)
p_transfer = out.logits.sigmoid() # batched: pass parallel lists for many pairs
source_value is molecule A's raw oral_bioavailability_value (e.g. 68.0 for 68%); the model divides it by 100 internally, so do not pre-scale it. All inputs are batched lists of equal length, one entry per pair.
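For many pairs, it can help to keep one record per pair and convert to parallel lists just before the call. The sketch below does that; `build_batch` and the per-pair dict layout are hypothetical, while the keyword names match the call shown above. Passing an empty dict when a pair has no metadata is an assumption based on the rule that omitted keys count as missing.

```python
def build_batch(pairs):
    """Turn a list of per-pair dicts into the parallel lists the model call takes."""
    batch = {"smiles_a": [], "smiles_b": [], "metadata_a": [],
             "metadata_b": [], "source_value": []}
    for pair in pairs:
        batch["smiles_a"].append(pair["smiles_a"])
        batch["smiles_b"].append(pair["smiles_b"])
        batch["metadata_a"].append(pair.get("metadata_a", {}))
        batch["metadata_b"].append(pair.get("metadata_b", {}))
        # Raw value as measured (e.g. percent); it is NOT divided by 100 here.
        batch["source_value"].append(float(pair["source_value"]))
    return batch

pairs = [
    {"smiles_a": "CC(=O)Oc1ccccc1C(=O)O", "smiles_b": "CCO",
     "metadata_a": {"species_or_population": "human", "dose": "325 mg"},
     "metadata_b": {"species_or_population": "human"},
     "source_value": 68.0},
    {"smiles_a": "CCO", "smiles_b": "CC(=O)Oc1ccccc1C(=O)O",
     "source_value": 80},  # no metadata given for either molecule
]
batch = build_batch(pairs)
print({key: len(values) for key, values in batch.items()})
```

With the model loaded as `m`, the call would then be `out = m(**batch)`.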
Training & performance
Trained on the same_species_v2 oral-bioavailability transfer split (~338M molecule pairs). The frozen
encoder embeddings are precomputed once and the head is trained on top of them. The label is a
thresholded |value_A - value_B|, so the model uses A's known value as an anchor and, in effect, learns
to estimate B's bioavailability from its structure + metadata.
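The labeling rule can be sketched as a one-line function. The threshold value is not stated in this card, so the 20.0 below is purely a placeholder. The direction of the comparison (a small difference means "transfers") is also an assumption, inferred from the description of transfer as the two molecules behaving similarly.

```python
def transfer_label(value_a, value_b, threshold):
    """Return 1 if the two bioavailability values are within `threshold`, else 0.

    ASSUMPTION: both the inequality direction and any concrete threshold are
    illustrative; the card only says |value_A - value_B| is thresholded.
    """
    return int(abs(value_a - value_b) <= threshold)

HYPOTHETICAL_THRESHOLD = 20.0  # placeholder, in the same units as the raw values

close_pair = transfer_label(68.0, 60.0, HYPOTHETICAL_THRESHOLD)
far_pair = transfer_label(68.0, 10.0, HYPOTHETICAL_THRESHOLD)
print(close_pair, far_pair)
```

Because the label depends only on the absolute difference, it is symmetric in A and B, even though the model itself only sees A's value.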
- same_species_v2 validation: AUROC ~0.87, accuracy ~0.83, macro-F1 ~0.79
- tianang (cross-dataset) validation: AUROC ~0.95, accuracy ~0.91, macro-F1 ~0.89 (test: AUROC ~0.95)
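For readers comparing the accuracy and macro-F1 numbers above, here is a minimal pure-Python sketch of both metrics on toy labels. The labels are invented for illustration and are not drawn from either validation set; AUROC is left out because it needs scores rather than hard predictions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_for_class(y_true, y_pred, cls):
    """F1 for one class: 2*TP / (2*TP + FP + FN), or 0.0 if undefined."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

# Toy, imbalanced example: six positives, two negatives, one negative misclassified.
y_true = [1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 0]

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.3f} macro-F1={mf1:.3f}")
```

Macro-F1 weights both classes equally, so a single error on the minority class pulls it below accuracy, which is one common reason the two numbers diverge on imbalanced data.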