metadata
language: en
license: apache-2.0
tags:
  - chemistry
  - biology
  - smiles
  - toxicity
  - molecular-prediction

molecular_toxicity_predictor

Overview

This model is a RoBERTa-based transformer (ChemBERTa) for binary classification of chemical compounds by potential molecular toxicity. It takes SMILES (Simplified Molecular-Input Line-Entry System) strings as input and predicts whether a compound is likely to be toxic to human cells.
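
The input/output contract above can be sketched as follows. This is a minimal example that assumes the checkpoint loads through the standard Hugging Face `transformers` sequence-classification classes. The card does not state the repository ID or the label order, so `model_id` is left as a parameter and `toxic_index=1` is an assumption to verify against `id2label` in the checkpoint's config.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toxic_probability(logits, toxic_index=1):
    """Probability of the toxic class.

    ASSUMPTION: index 1 is the toxic label; confirm against id2label
    in the checkpoint's config before relying on it.
    """
    return softmax(logits)[toxic_index]


def predict(smiles, model_id):
    """Score one SMILES string with a sequence-classification checkpoint.

    model_id is the Hub repository or local path holding the weights;
    this card does not state it, so it is a required argument.
    """
    # Imported lazily so the helpers above work without these packages.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()

    inputs = tokenizer(smiles, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return toxic_probability(logits)
```

Calling `predict("CCO", model_id)` with the model's repository ID would return a value between 0 and 1 for ethanol; how to threshold that value is a separate decision.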

Model Architecture

The encoder is pre-trained BERT-style (masked-token prediction) on chemical structures written as SMILES text, and a classification head produces the toxicity label.

  • Input: Tokenized SMILES sequences representing molecular graphs.
  • Architecture: A RoBERTa encoder with 6 hidden layers (standard RoBERTa-base has 12), optimized for chemical informatics.
  • Vocabulary: A custom BPE (Byte-Pair Encoding) tokenizer trained on 77 million molecules from the ZINC database.
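
To illustrate what a BPE vocabulary does with SMILES text, here is a toy sketch that learns merges from a three-string corpus, starting from single characters. It is not the tokenizer shipped with this model: the corpus, the merge count, and the function names are all made up for illustration.

```python
from collections import Counter


def apply_merge(seq, pair):
    """Replace every adjacent occurrence of `pair` in seq with one joined token."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def learn_bpe(corpus, num_merges):
    """Learn up to num_merges BPE merges from a list of strings."""
    sequences = [list(s) for s in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in sequences:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        # Most frequent adjacent pair; ties go to the pair seen first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        sequences = [apply_merge(seq, best) for seq in sequences]
    return merges


def encode(smiles, merges):
    """Tokenize a string by replaying the learned merges in order."""
    seq = list(smiles)
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq


# Toy corpus of small carbonyl-containing SMILES (illustrative only).
corpus = ["CC(=O)O", "CC(=O)N", "OC(=O)C"]
merges = learn_bpe(corpus, num_merges=4)
print(encode("CC(=O)O", merges))  # ['C', 'C(=O)', 'O']
```

After four merges the recurring carbonyl fragment `C(=O)` becomes a single token, which is the effect a BPE vocabulary trained on a large molecule corpus has on common substructures.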

Intended Use

  • Drug Discovery: Early-stage screening of candidate molecules to filter out toxic compounds.
  • Regulatory Safety: Preliminary safety assessment for industrial chemicals.
  • Environmental Health: Predicting the impact of synthetic compounds on aquatic ecosystems.
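
For the screening use case, filtering might look like the sketch below. `score_fn` stands in for any function that maps a SMILES string to a predicted toxicity probability, and the 0.5 threshold is a placeholder, not a recommendation from this card; in practice it would be tuned on validation data.

```python
def screen(candidates, score_fn, threshold=0.5):
    """Split candidates into (kept, flagged) by predicted toxicity.

    score_fn: callable mapping a SMILES string to a probability in [0, 1].
    threshold: placeholder cutoff (an assumption); scores at or above it
    are flagged as likely toxic.
    """
    kept, flagged = [], []
    for smiles in candidates:
        if score_fn(smiles) >= threshold:
            flagged.append(smiles)
        else:
            kept.append(smiles)
    return kept, flagged
```

Flagged molecules are candidates for follow-up testing, not confirmed toxicants.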

Limitations

  • Stereochemistry: Limited ability to distinguish between enantiomers or specific spatial isomers that may have differing toxicities.
  • Domain Gap: May not generalize well to extremely large biological macromolecules (e.g., proteins or long peptides).
  • In Vitro vs. In Vivo: Predicts molecular-level interactions only; it does not simulate systemic metabolism or organ-specific toxicity (e.g., liver vs. kidney).
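
The stereochemistry limitation is easier to see in the SMILES text itself: two enantiomers can differ by a single `@` character. The sketch below flags inputs that carry stereo annotations so their predictions can be treated with extra caution. The helper names are hypothetical, and the comparison is purely textual, so it only recognizes pairs written with the same atom order.

```python
import re

# Chirality tags (@, @@) and double-bond direction marks (/ and \) in SMILES.
_STEREO = re.compile(r"@+|[/\\]")


def has_stereo(smiles):
    """True if the SMILES string carries any stereochemical annotation."""
    return bool(_STEREO.search(smiles))


def strip_stereo(smiles):
    """Remove stereo marks, leaving a flat string for comparison.

    Purely textual: brackets such as [CH] are left as they are, so the
    result is meant for comparison, not as canonical chemistry output.
    """
    return _STEREO.sub("", smiles)


def stereo_twins(a, b):
    """True if two SMILES strings differ only in their stereo marks."""
    return a != b and strip_stereo(a) == strip_stereo(b)


# The two enantiomers of alanine differ by one '@'.
print(stereo_twins("N[C@@H](C)C(=O)O", "N[C@H](C)C(=O)O"))  # True
```

If a candidate set contains such twins, near-identical scores for both are more likely a sign of this limitation than evidence of equal toxicity.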