InventMol-R1

Target-Conditioned Molecular Ideation Model for Drug Discovery Research

Research prototype. Not for clinical use. No experimental validation.

Model Description

InventMol-R1 is a fine-tuned version of Qwen2.5-0.5B that generates novel drug-like molecules conditioned on biological context. Given a protein target, disease, mutation, and mechanism of action, the model outputs molecular structures in SELFIES format.

It is a proof of concept for reasoning-guided molecular ideation, the early candidate-generation step of an AI-driven drug discovery pipeline.
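For readers new to the output format: a SELFIES string is a sequence of bracketed symbols, and the format is designed so that sequences of valid symbols decode to valid molecules. The sketch below is illustrative only and uses just the standard library. The helper name `split_symbols` is hypothetical, and the sketch ignores the `.` separator that SELFIES allows between disconnected fragments. The `selfies` package provides its own `split_selfies` function for real use.

```python
import re

# One SELFIES symbol is a bracketed token such as [C], [=O] or [Branch1].
SYMBOL = re.compile(r"\[[^\[\]]+\]")


def split_symbols(selfies: str) -> list[str]:
    """Split a SELFIES string into symbols; raise ValueError if any text is left over."""
    symbols = SYMBOL.findall(selfies)
    if "".join(symbols) != selfies:
        raise ValueError(f"not a pure sequence of bracketed symbols: {selfies!r}")
    return symbols


ethanol = "[C][C][O]"  # SELFIES for ethanol (SMILES: CCO)
print(split_symbols(ethanol))  # ['[C]', '[C]', '[O]']
```

Counting symbols this way is also a cheap sanity check on model output before handing it to a decoder.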

Intended Use

Research prototype for computational drug discovery
Molecular ideation and scaffold hopping
Educational demonstration of LLMs in cheminformatics
Target-conditioned molecular generation

Training Data

Trained on kinase inhibitors (predominantly tyrosine kinase inhibitors) with bioactivity data from ChEMBL, filtered for drug-likeness and converted to SELFIES representation. The dataset includes:

9 protein targets (EGFR, BRAF, ALK, KIT, VEGFR, BTK, FGFR, MET, RET)
Multiple disease contexts (NSCLC, melanoma, GIST, AML, etc.)
Clinically relevant mutations (T790M, V600E, D816V, etc.)
Mechanism of action annotations
Potency labels (active, intermediate, inactive)
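The card does not say which drug-likeness filter or potency cut-offs were used. As a rough sketch of what such a preprocessing step could look like, the example below applies Lipinski's rule of five (allowing one violation) to precomputed descriptors and buckets compounds by pIC50. The `Compound` class, the function names, and the pIC50 thresholds of 7 and 5 are all assumptions made for illustration. They are not the values used to build this dataset.

```python
from dataclasses import dataclass


@dataclass
class Compound:
    """Hypothetical record holding precomputed descriptors for one molecule."""
    mol_wt: float      # molecular weight in Da
    logp: float        # octanol-water partition coefficient
    h_donors: int      # hydrogen-bond donors
    h_acceptors: int   # hydrogen-bond acceptors
    pic50: float       # -log10 of IC50 in mol/L


def passes_rule_of_five(c: Compound, max_violations: int = 1) -> bool:
    """Lipinski's rule of five, tolerating up to `max_violations` broken criteria."""
    violations = sum([
        c.mol_wt > 500,
        c.logp > 5,
        c.h_donors > 5,
        c.h_acceptors > 10,
    ])
    return violations <= max_violations


def potency_label(pic50: float) -> str:
    """Bucket a pIC50 value; the cut-offs are assumed, not taken from the model card."""
    if pic50 >= 7.0:
        return "active"
    if pic50 >= 5.0:
        return "intermediate"
    return "inactive"


example = Compound(mol_wt=350.0, logp=3.2, h_donors=2, h_acceptors=5, pic50=8.1)
print(passes_rule_of_five(example), potency_label(example.pic50))
```

In practice the descriptors would come from a cheminformatics toolkit such as RDKit rather than being entered by hand.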

Quick Start

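from unsloth import FastLanguageModel
from selfies import decoder, DecoderError
from rdkit import Chem
from rdkit.Chem import Descriptors
import re

model, tokenizer = FastLanguageModel.from_pretrained("Hamdan003/InventMol-R1")
FastLanguageModel.for_inference(model)  # put the Unsloth model into inference mode

# A SELFIES string is an unbroken run of bracketed symbols, e.g. [C][=C][Branch1].
SELFIES_RUN = re.compile(r'(?:\[[^\[\]\s]+\])+')

def extract_selfies(text, min_symbols=5):
    # Return the first unbroken run of at least `min_symbols` bracketed symbols.
    # Isolated tags such as "[Target]" are skipped because they form runs of length 1.
    for match in SELFIES_RUN.finditer(text):
        if match.group(0).count('[') >= min_symbols:
            return match.group(0)
    return ""

def generate_molecule(target, disease, mutation, mechanism):
    prompt = f"[Target]: {target}\n[Disease]: {disease}\n[Mutation]: {mutation}\n[Mechanism]: {mechanism}\n[Potency]: High\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=300, temperature=0.7, do_sample=True, top_p=0.95)
    # Decode only the newly generated tokens so the prompt is not parsed as SELFIES.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    generated = tokenizer.decode(new_tokens, skip_special_tokens=True)
    selfies_str = extract_selfies(generated)
    if selfies_str:
        try:
            smiles = decoder(selfies_str)
        except DecoderError:
            # The run contained a symbol that is not valid SELFIES.
            return None, 0, 0
        mol = Chem.MolFromSmiles(smiles) if smiles else None
        if mol:
            return smiles, Descriptors.MolWt(mol), Descriptors.MolLogP(mol)
    return None, 0, 0

smiles, mw, logp = generate_molecule("EGFR", "NSCLC", "T790M", "Irreversible covalent inhibition")
if smiles:
    print(f"SMILES: {smiles}\nMW: {mw:.0f}\nLogP: {logp:.1f}")
else:
    print("No valid molecule was generated. Sampling is stochastic, so try again.")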
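Because the Quick Start samples with `do_sample=True`, repeated calls can return different molecules, and some calls return no valid molecule at all. One possible way to use this is to draw several samples and keep the unique valid ones. The sketch below assumes a generator that returns `(smiles, mw, logp)` with `smiles` set to `None` on failure, as `generate_molecule` does. The helper `collect_candidates` is hypothetical, and the stand-in generator with placeholder values only exists so the example runs without the model.

```python
from typing import Callable, Optional

Result = tuple[Optional[str], float, float]  # (smiles, molecular weight, logP)


def collect_candidates(generate: Callable[[], Result], n_samples: int = 10) -> list[Result]:
    """Call `generate` n_samples times; keep unique results whose SMILES is not None."""
    seen: set[str] = set()
    kept: list[Result] = []
    for _ in range(n_samples):
        smiles, mw, logp = generate()
        if smiles is None or smiles in seen:
            continue  # skip failed generations and exact duplicates
        seen.add(smiles)
        kept.append((smiles, mw, logp))
    return kept


# Stand-in for generate_molecule; the descriptor values are placeholders.
_fake_outputs = iter([
    ("CCO", 46.1, 0.0),
    (None, 0, 0),
    ("CCO", 46.1, 0.0),
    ("c1ccccc1", 78.1, 1.7),
])
for smiles, mw, logp in collect_candidates(lambda: next(_fake_outputs), n_samples=4):
    print(f"{smiles}\tMW {mw:.0f}\tLogP {logp:.1f}")
```

With the model loaded, the call would look like `collect_candidates(lambda: generate_molecule("EGFR", "NSCLC", "T790M", "Irreversible covalent inhibition"))`. Deduplication here compares raw strings, so two different SMILES spellings of the same molecule count as distinct. Canonicalizing first, for example with RDKit's `Chem.MolToSmiles(Chem.MolFromSmiles(smiles))`, would avoid that.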
