---
library_name: transformers
tags:
- chemistry
- deepchem
---

# 🧬 MoLLaMA-Small

MoLLaMA-Small is a lightweight LLaMA-based causal language model (57.2M parameters) trained from scratch to generate valid chemical molecules as SMILES strings. The model uses DeepChem's `SmilesTokenizer` and was trained on a combined dataset of ZINC15 and MuMOInstruct. It is designed for unconditional molecule generation.

## 📊 Model Performance

The model was evaluated on 30 randomly generated samples. It achieved perfect validity and high diversity in the generated chemical structures.

| Metric | Score |
| :--- | :--- |
| **Parameters** | 57.2 M |
| **Validity** | 100.0% |
| **Average QED** | 0.6400 |
| **Diversity** | 0.8363 |

## 🏗️ Model Architecture

A custom, scaled-down LLaMA architecture was used for chemical language modeling:

* **Hidden Size**: 768
* **Intermediate Size**: 2048
* **Number of Hidden Layers**: 8
* **Number of Attention Heads**: 8
* **Max Position Embeddings**: 1024

## 🚀 How to Use

You can load this model with the standard `transformers` library. To generate a SMILES string, prompt the model with the `[bos]` (beginning-of-sequence) token.

### Prerequisites

Make sure the required libraries are installed:

```bash
pip install transformers torch deepchem
```

### Generation Code

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. Load model and tokenizer
model_id = "jonghyunlee/MoLLaMA"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# 2. Prepare the prompt for unconditional generation
prompt = "[bos]"
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)

# 3. Generate SMILES
model.eval()
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )

# 4. Decode the output (the tokenizer emits space-separated tokens, so strip the spaces)
generated_smiles = tokenizer.decode(outputs[0], skip_special_tokens=True).replace(" ", "")
print(f"Generated SMILES: {generated_smiles}")
```

## 📚 Training Details

* **Dataset**: `ZINC15` + `MuMOInstruct` (Parquet format)
* **Epochs**: 5
* **Batch Size**: 512 (with gradient accumulation steps of 4)
* **Learning Rate**: 1e-4 (cosine scheduler, 10% warmup)
* **Precision**: bf16 (mixed precision)
* **Early Stopping Patience**: 5 epochs
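## 🧮 Parameter Count Sketch

The 57.2M figure can be roughly reproduced from the architecture numbers above. The sketch below counts a standard (bias-free) LLaMA decoder layer: Q/K/V/O attention projections, a gated MLP (gate, up, down), and two RMSNorms. The vocabulary size of 591 is an assumption (the DeepChem `SmilesTokenizer` default vocabulary), and the exact total also depends on whether the embedding and output head are tied, so treat this as a back-of-the-envelope check rather than the model's exact accounting.

```python
def llama_layer_params(hidden: int, intermediate: int) -> int:
    """Parameters in one bias-free LLaMA decoder layer."""
    attn = 4 * hidden * hidden       # q, k, v, o projections
    mlp = 3 * hidden * intermediate  # gate, up, down projections
    norms = 2 * hidden               # two RMSNorm weight vectors
    return attn + mlp + norms

def llama_total_params(hidden, intermediate, layers, vocab, tied=False):
    """Embedding + decoder stack + final norm + (untied) LM head."""
    embed = vocab * hidden
    head = 0 if tied else vocab * hidden
    return embed + layers * llama_layer_params(hidden, intermediate) + hidden + head

# Numbers from the model card; vocab=591 is an assumption.
per_layer = llama_layer_params(768, 2048)
total = llama_total_params(768, 2048, layers=8, vocab=591)
print(per_layer)  # 7,079,424 per decoder layer
print(total)      # ~57.5M, in the ballpark of the reported 57.2M
```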
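## 🎲 What `temperature` and `top_p` Do

The generation call above samples with temperature 0.7 and nucleus (top-p) filtering at 0.9. A minimal pure-Python sketch of what these two knobs do to a next-token distribution (not `transformers`' actual implementation, which operates on logit tensors in torch): temperature rescales the logits before the softmax, and top-p keeps only the smallest set of tokens whose cumulative probability reaches `top_p`, then renormalises.

```python
import math

def nucleus_filter(logits, temperature=0.7, top_p=0.9):
    """Temperature-scale logits, softmax, then keep the top-p nucleus.
    Returns a dict mapping kept token indices to renormalised probabilities."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]   # numerically stable softmax
    probs = [e / sum(exps) for e in exps]
    # Accumulate tokens from most to least probable until top_p is reached.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# With a peaked distribution, the low-probability tail is dropped.
filtered = nucleus_filter([5.0, 4.0, 1.0, 0.1])
print(sorted(filtered))  # → [0, 1]
```

Lower temperatures sharpen the distribution (safer, more repetitive SMILES); higher values flatten it (more novel but more likely invalid strings).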
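## 📐 How the Diversity Metric Is Computed

Diversity in the table above is conventionally computed as 1 minus the mean pairwise Tanimoto similarity over the generated molecules' fingerprints; the model card does not state the exact fingerprint used, and RDKit Morgan fingerprints are a common choice. A dependency-free sketch with fingerprints represented as sets of "on" bits (the toy bit-sets are illustrative only):

```python
from itertools import combinations

def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprint bit-sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def diversity(fingerprints) -> float:
    """1 - mean pairwise Tanimoto similarity over all molecule pairs."""
    pairs = list(combinations(fingerprints, 2))
    mean_sim = sum(tanimoto(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_sim

# Three toy fingerprints: one overlapping pair, two disjoint pairs.
fps = [{1, 2, 3}, {3, 4, 5}, {6, 7, 8}]
print(round(diversity(fps), 3))  # → 0.933
```

Validity is typically checked by whether RDKit's `Chem.MolFromSmiles` parses the generated string, and QED (quantitative estimate of drug-likeness, in [0, 1]) via `rdkit.Chem.QED.qed`.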