---
library_name: transformers
tags:
- chemistry
---


# ๐Ÿงฌ MoLLaMA-Small
MoLLaMA-Small is a lightweight LLaMA-based causal language model (57.2M parameters) trained from scratch to generate valid chemical molecules using SMILES strings. 

This model uses DeepChem's `SmilesTokenizer` and was trained on a combined dataset of ZINC15 and MuMOInstruct. It is designed for unconditional molecule generation.
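DeepChem's `SmilesTokenizer` splits SMILES strings at the level of atoms, bonds, and ring-closure digits using a regular expression. As a stdlib-only illustration, the widely used atom-level SMILES pattern (the same style of pattern DeepChem uses) can be applied like this; note the real tokenizer additionally handles special tokens such as `[bos]`:

```python
import re

# Atom-level SMILES regex (illustrative; the real SmilesTokenizer also
# manages a vocabulary and special tokens such as [bos]).
SMILES_PATTERN = (
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\."
    r"|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)
_tokenizer_re = re.compile(SMILES_PATTERN)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into atom/bond/ring tokens."""
    return _tokenizer_re.findall(smiles)

tokens = tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
print(tokens)
```

Joining the tokens back together recovers the original string, which is why the generation code below strips spaces from the decoded output.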

## ๐Ÿ“Š Model Performance
The model was evaluated on 30 randomly generated samples. It achieves perfect validity and high structural diversity.

| Metric | Score |
| :--- | :--- |
| **Parameters** | 57.2 M |
| **Validity** | 100.0% |
| **Average QED** | 0.6400 |
| **Diversity** | 0.8363 |
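Diversity is typically reported as 1 minus the average pairwise Tanimoto similarity over fingerprints of the generated molecules; the specific fingerprint used for this card is not stated. A stdlib-only sketch of that computation, using toy bit sets in place of real Morgan fingerprints:

```python
from itertools import combinations

def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity between two fingerprint bit sets."""
    return len(a & b) / len(a | b) if a | b else 1.0

def diversity(fingerprints: list) -> float:
    """1 minus the average pairwise Tanimoto similarity."""
    pairs = list(combinations(fingerprints, 2))
    return 1.0 - sum(tanimoto(a, b) for a, b in pairs) / len(pairs)

# Toy bit sets standing in for Morgan fingerprints of generated molecules.
fps = [{1, 2, 3}, {2, 3, 4}, {5, 6, 7}]
print(round(diversity(fps), 4))
```

In practice the fingerprints would come from a cheminformatics toolkit such as RDKit; the arithmetic is the same.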

## ๐Ÿ—๏ธ Model Architecture
A custom, scaled-down LLaMA architecture tailored to chemical language modeling:
* **Hidden Size**: 768
* **Intermediate Size**: 2048
* **Number of Hidden Layers**: 8
* **Number of Attention Heads**: 8
* **Max Position Embeddings**: 1024
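As a sanity check, the 57.2M figure can be roughly reproduced from these hyperparameters. The vocabulary size and embedding tying below are assumptions (DeepChem's `SmilesTokenizer` vocabulary is on the order of a few hundred tokens), not values stated in this card:

```python
# Rough LLaMA parameter count from the hyperparameters above.
# Assumptions (not stated in the card): vocab_size ~591 and tied
# input/output embeddings. Rotary position embeddings add no learned
# parameters, so max_position_embeddings does not enter the count.
hidden, intermediate, layers, vocab = 768, 2048, 8, 591

attn = 4 * hidden * hidden        # q, k, v, o projections (no biases)
mlp = 3 * hidden * intermediate   # gate, up, down projections
norms = 2 * hidden                # two RMSNorms per layer
per_layer = attn + mlp + norms

# Total = decoder layers + final norm + (tied) token embeddings.
total = layers * per_layer + hidden + vocab * hidden
print(f"~{total / 1e6:.1f}M parameters")
```

This lands within rounding distance of the reported 57.2M, suggesting the count is dominated by the 8 decoder layers rather than the small SMILES vocabulary.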

## ๐Ÿš€ How to Use
You can easily load this model using the standard `transformers` library. The model generates SMILES strings by prompting it with the `[bos]` (Beginning of Sequence) token.

### Prerequisites
Make sure you have the required libraries installed:

```bash
pip install transformers torch deepchem
```

### Generation Code

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. Load Model and Tokenizer
model_id = "jonghyunlee/MoLLaMA"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, 
    torch_dtype=torch.float16, 
    device_map="auto"
)

# 2. Prepare Prompt for Unconditional Generation
prompt = "[bos]"
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)

# 3. Generate SMILES
model.eval()
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )

# 4. Decode the output
generated_smiles = tokenizer.decode(outputs[0], skip_special_tokens=True).replace(" ", "")
print(f"Generated SMILES: {generated_smiles}")
```

## ๐Ÿ“š Training Details

* **Dataset**: `ZINC15` + `MuMOInstruct` (Parquet format)
* **Epochs**: 5
* **Batch Size**: 512 (with gradient accumulation steps of 4)
* **Learning Rate**: 1e-4 (Cosine scheduler, 10% Warmup)
* **Precision**: bf16 (Mixed Precision)
* **Early Stopping Patience**: 5 epochs
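The learning-rate schedule above can be sketched with stdlib math: linear warmup over the first 10% of steps to the peak rate, then cosine decay. Decaying to zero is an assumption; the card does not state the final learning rate:

```python
import math

def lr_at(step: int, total_steps: int, peak_lr: float = 1e-4,
          warmup_frac: float = 0.1) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero (assumed)."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 1000
print(lr_at(0, total))     # start of warmup
print(lr_at(100, total))   # peak, end of 10% warmup
print(lr_at(1000, total))  # end of cosine decay
```

This mirrors the shape of the common `get_cosine_schedule_with_warmup` helper in `transformers`, with warmup steps set to 10% of the total.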