BioAssayAlign Qwen3-Embedding-0.6B Compatibility


What this model is

BioAssayAlign is an assay-conditioned small-molecule ranking model.

It takes:

  • one assay definition
  • a submitted list of candidate SMILES

and returns:

  • one compatibility score per candidate
  • a ranked shortlist for that assay

This model is designed to answer a practical question:

> Given this assay, which molecules in my current candidate list should I screen first?

It is not:

  • a chatbot
  • a generative chemistry model
  • a direct potency regressor
  • a calibrated probability model

Companion dataset

Public dataset:

The published model was trained on the prepared compatibility-ranking subset inside that dataset release.

Intended use

Use this model when you already have a candidate set and want a ranking signal for one assay at a time.

Reasonable uses:

  • shortlist triage before wet-lab screening
  • retrospective ranking experiments
  • assay-conditioned ranking features in a downstream workflow

Not reasonable uses:

  • reading the raw score as a probability of success
  • predicting exact IC50 / EC50 / Ki values
  • comparing raw scores across unrelated runs as if they were globally calibrated

How to run it locally

This repository is self-contained for inference. You do not need the original training codebase to run the published model.

Install

```bash
python -m pip install -r requirements.txt
```

Minimal local example

```python
from bioassayalign_compatibility import (
    AssayQuery,
    load_compatibility_model_from_hub,
    rank_compounds,
    serialize_assay_query,
)

# Download the published weights from the Hub and build the scorer.
model = load_compatibility_model_from_hub(
    "lighteternal/BioAssayAlign-Qwen3-Embedding-0.6B-Compatibility"
)

# Serialize the structured assay definition into the model's text format.
assay_text = serialize_assay_query(
    AssayQuery(
        title="JAK2 inhibition assay",
        description="Cell-based luminescence assay measuring JAK2 inhibition in HEK293 cells.",
        organism="Homo sapiens",
        readout="luminescence",
        assay_format="cell-based",
        assay_type="inhibition",
        target_uniprot=["O60674"],
    )
)

# Score and rank the candidate SMILES against this assay.
results = rank_compounds(
    model,
    assay_text=assay_text,
    smiles_list=[
        "CC(=O)Nc1ncc(C#N)c(Nc2ccc(F)c(Cl)c2)n1",
        "c1ccccc1",
        "CCO",
    ],
)

for row in results:
    print(row)
```

What to provide

Best practice:

  • provide structured assay fields rather than one free-form paragraph
  • include target, readout, organism, and format when known
  • submit one parent or cleaned SMILES per candidate

Recommended assay fields:

  • title
  • description
  • organism
  • readout
  • assay_format
  • assay_type
  • target_uniprot

The model is reasonably robust to wording changes, but missing metadata can reduce ranking quality.

Model details

Published artifact configuration:

| Component | Value |
|---|---|
| Assay encoder | Qwen/Qwen3-Embedding-0.6B |
| Assay encoder training | Frozen |
| Assay metadata features | Enabled, 128 dims |
| Molecule features | Morgan fingerprints (radii 2 and 3, 2048 bits each), chirality, MACCS keys, 30 RDKit descriptors |
| Projection dimension | 512 |
| Hidden dimension | 1024 |
| Dropout | 0.12 |
| Final score | Learned compatibility head output |

Important:

  • the published score is not a raw embedding dot product
  • the ranking comes from the learned scorer head
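
The head described in the table above can be sketched as a small MLP over projected assay and molecule features. The projection and hidden dimensions come from the table; everything else here (layer wiring, activations, toy input sizes) is an assumption for illustration, not the published implementation:

```python
import random

random.seed(0)

# Dimensions from the model-details table; the exact wiring of the head
# is an assumption, not the published architecture.
PROJ_DIM, HIDDEN_DIM = 512, 1024

def relu(v):
    return [x if x > 0.0 else 0.0 for x in v]

def linear(vec, weights):
    """Dense layer: weights is a list of per-output weight rows."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def rand_matrix(n_out, n_in):
    return [[random.gauss(0.0, 0.02) for _ in range(n_in)] for _ in range(n_out)]

def compat_score(assay_vec, mol_vec, params):
    """Project each side, concatenate, and score with one hidden layer."""
    a = relu(linear(assay_vec, params["assay_proj"]))
    m = relu(linear(mol_vec, params["mol_proj"]))
    h = relu(linear(a + m, params["hidden"]))  # list concat = feature concat
    return sum(w * x for w, x in zip(params["out"], h))  # scalar, logit-like

ASSAY_DIM, MOL_DIM = 64, 128  # toy sizes; the real feature vectors are larger
params = {
    "assay_proj": rand_matrix(PROJ_DIM, ASSAY_DIM),
    "mol_proj": rand_matrix(PROJ_DIM, MOL_DIM),
    "hidden": rand_matrix(HIDDEN_DIM, 2 * PROJ_DIM),
    "out": [random.gauss(0.0, 0.02) for _ in range(HIDDEN_DIM)],
}
s = compat_score([random.gauss(0, 1) for _ in range(ASSAY_DIM)],
                 [random.gauss(0, 1) for _ in range(MOL_DIM)],
                 params)
print(s)  # one uncalibrated scalar per (assay, molecule) pair
```

The point of the sketch is the shape of the computation: the score is the output of a learned head over both inputs, not a dot product between two embeddings.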

Training data

The public artifact was trained on a frozen assay-compound corpus derived from:

  • PubChem BioAssay
  • ChEMBL

The published model uses the prepared compatibility-ranking subset from:

Prepared training dataset

| Field | Value |
|---|---|
| Assays | 11,195 |
| Candidate-pool rows | 1,432,532 |
| Training groups | 508,216 |
| Train assays | 8,967 |
| Validation assays | 1,117 |
| Test assays | 1,111 |

Preparation rules

| Rule | Value |
|---|---|
| Minimum actives per assay | 4 |
| Minimum inactives per assay | 16 |
| Maximum actives per assay | 48 |
| Maximum inactives per assay | 192 |
| Molecule standardization | Enabled |
| Source manifest SHA256 | e4766477b64860952258cb4b76567b83061d5de44bb5f3b322ecdfe54f19910b |

Each training group contains:

  • one assay
  • one positive compound
  • multiple explicit same-assay inactive compounds

This is a ranking setup, not a generic text-retrieval setup.
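
A training group of this shape can be represented as a simple record (the class and field names here are illustrative, not the repository's actual data structures):

```python
from dataclasses import dataclass, field

@dataclass
class TrainingGroup:
    """One ranking group: an assay, one active, and same-assay inactives."""
    assay_text: str
    positive_smiles: str
    negative_smiles: list = field(default_factory=list)

group = TrainingGroup(
    assay_text="JAK2 inhibition assay | cell-based | luminescence",
    positive_smiles="CC(=O)Nc1ncc(C#N)c(Nc2ccc(F)c(Cl)c2)n1",
    negative_smiles=["CCO", "c1ccccc1"],  # explicit same-assay inactives
)
print(len(group.negative_smiles))
```

The key property is that the negatives come from the same assay as the positive, which is what makes the objective a within-assay ranking task rather than generic retrieval.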

Training configuration

| Field | Value |
|---|---|
| Framework | pytorch_head_only_compatibility_ranking |
| Learning rate | 1.5e-3 |
| Batch size | 192 |
| Weight decay | 1e-4 |
| Hard-negative fraction | 0.5 |
| Negatives per example | 15 |
| Negative sets per positive | 2 |
| Max epochs | 30 |
| Early stopping patience | 5 |
| Early stopping min delta | 0.001 |
| Best epoch | 9 |
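
The negative-sampling settings (15 negatives per example, hard-negative fraction 0.5, 2 sets per positive) can be sketched as follows. The "hard" pool here is a placeholder; how the published pipeline defines hard negatives is not part of this card:

```python
import random

def sample_negatives(inactive_pool, hard_pool, n_neg=15, hard_frac=0.5,
                     n_sets=2, seed=0):
    """Draw n_sets negative sets per positive: ~half hard, rest random."""
    rng = random.Random(seed)
    sets = []
    for _ in range(n_sets):
        n_hard = min(int(n_neg * hard_frac), len(hard_pool))
        negs = rng.sample(hard_pool, n_hard)
        remaining = [s for s in inactive_pool if s not in negs]
        negs += rng.sample(remaining, n_neg - n_hard)
        sets.append(negs)
    return sets

inactives = [f"SMILES_{i}" for i in range(100)]  # placeholder inactive pool
hard = inactives[:20]                            # placeholder "hard" subset
sets = sample_negatives(inactives, hard)
print(len(sets), len(sets[0]))  # 2 sets of 15 negatives
```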

Results

Main evaluation

| Split | Mean AUPRC | Random-baseline AUPRC | Hit@10 | Mean AUROC | Mean nDCG@50 |
|---|---|---|---|---|---|
| Validation | 0.6214 | 0.2678 | 0.9722 | 0.7767 | 0.7140 |
| Test | 0.6339 | 0.2749 | 0.9739 | 0.7815 | 0.7250 |

Interpretation:

  • the model materially beats the random ranking baseline
  • it is strongest as a within-list ranking tool
  • the main output to trust is the ranking order and shortlist separation, not the raw score magnitude
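
As a concrete reading of Hit@10: for each assay it asks whether at least one true active lands in the model's top 10 candidates. A minimal sketch with synthetic scores and labels:

```python
def hit_at_k(scores, labels, k=10):
    """1.0 if any true active is among the top-k scored candidates, else 0.0."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    return 1.0 if any(lbl == 1 for _, lbl in ranked[:k]) else 0.0

# Synthetic example: one active sitting at rank 3 of a 12-candidate list.
scores = [6.2, 4.1, 3.9, 2.0, 1.5, 1.1, 0.9, 0.4, 0.2, -0.3, -1.0, -2.5]
labels = [0,   0,   1,   0,   0,   0,   0,   0,   0,   0,    0,    0]
print(hit_at_k(scores, labels))       # 1.0: the active is inside the top 10
print(hit_at_k(scores, labels, k=2))  # 0.0: but not inside the top 2
```

The reported Hit@10 is the mean of this per-assay indicator over the evaluation assays.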

Score interpretation

The raw output is a learned, logit-like compatibility score.

What it means:

  • higher is better
  • differences are meaningful within the same submitted list
  • absolute values are not calibrated across unrelated runs

Example:

  • candidate A score: 6.25
  • candidate B score: -8.65
  • candidate C score: -23.37

This does not mean A has a literal probability or potency attached to it. It means A ranked substantially above B and C for that submitted assay and candidate set.

For user-facing interpretation, the recommended order is:

  1. rank
  2. relative shortlist score within the submitted list
  3. chemistry context columns
  4. raw model score only for debugging or export

If you want a normalized within-list view, you can compute:

  • min-max scaling to 0–100
  • or softmax over the submitted list

Those are still not calibrated biological probabilities.
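
The two normalized views mentioned above can be computed directly from the raw scores (a minimal stdlib sketch, using the JAK2 example scores from this card; the outputs remain uncalibrated):

```python
import math

def minmax_0_100(scores):
    """Rescale within-list scores to 0-100; top candidate maps to 100."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [50.0] * len(scores)
    return [100.0 * (s - lo) / (hi - lo) for s in scores]

def softmax(scores):
    """Softmax over the submitted list (stabilized by subtracting the max)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

raw = [6.2590, -8.6542, -12.8678, -23.3741]  # JAK2 example raw scores
print(minmax_0_100(raw))  # top candidate maps to 100.0, bottom to 0.0
print(softmax(raw))       # mass concentrates on the top-ranked candidate
```

Both transforms only re-express within-list separation; neither turns the scores into biological probabilities.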

Example predictions

These examples were produced from the published weights.

Example: JAK2 cell assay

Assay:

  • title: JAK2 inhibition assay
  • description: Cell-based luminescence assay measuring JAK2 inhibition in HEK293 cells.
  • organism: Homo sapiens
  • readout: luminescence
  • assay format: cell-based
  • assay type: inhibition
  • target UniProt: O60674

| Rank | Candidate SMILES | Raw score |
|---|---|---|
| 1 | CC(=O)Nc1ncc(C#N)c(Nc2ccc(F)c(Cl)c2)n1 | 6.2590 |
| 2 | Cc1cc(=O)n(C)c(=O)[nH]1 | -8.6542 |
| 3 | CCO | -12.8678 |
| 4 | CCOc1ccc2nc(N3CCN(C)CC3)n(C)c(=O)c2c1 | -23.3741 |

Example: ALDH1A1 fluorescence assay

Assay:

  • title: ALDH1A1 inhibition assay
  • description: Cell-based fluorescence assay measuring ALDH1A1 inhibition in human cells.
  • organism: Homo sapiens
  • readout: fluorescence
  • assay format: cell-based
  • assay type: inhibition
  • target UniProt: P00352

| Rank | Candidate SMILES | Raw score |
|---|---|---|
| 1 | CCOc1ccccc1 | -26.9257 |
| 2 | Cc1cc(=O)n(C)c(=O)[nH]1 | -38.5073 |
| 3 | CCN(CC)CCOc1ccccc1 | -39.1753 |
| 4 | CCO | -42.9016 |

Limitations

  • The score is not a calibrated probability.
  • The model does not predict exact potency values.
  • The benchmark is assay-held-out, not a universal unseen-scaffold benchmark.
  • Public assay data is noisy and assay protocols are heterogeneous.
  • Some assays remain difficult and yield only moderate separation.
  • Use the model as a ranking aid, not as a stand-alone medicinal chemistry decision system.

Repository contents

Files provided in this HF model repo:

  • best_model.pt
  • training_metadata.json
  • training_summary.json
  • bioassayalign_compatibility.py
  • requirements.txt

Interactive Space

Use the model in the companion Space:

  • https://huggingface.co/spaces/lighteternal/BioAssayAlign-Compatibility-Explorer