---
library_name: transformers
tags:
- chemistry
---
# 🧬 MoLLaMA-Small
MoLLaMA-Small is a lightweight LLaMA-based causal language model (57.2M parameters) trained from scratch to generate valid chemical molecules using SMILES strings.
This model uses DeepChem's `SmilesTokenizer` and was trained on a combined dataset of ZINC15 and MuMOInstruct. It is designed for unconditional molecule generation.
## πŸ“Š Model Performance
The model was evaluated on 30 randomly generated samples, demonstrating perfect validity and high diversity in the generated chemical structures.
| Metric | Score |
| :--- | :--- |
| **Parameters** | 57.2 M |
| **Validity** | 100.0% |
| **Average QED** | 0.6400 |
| **Diversity** | 0.8363 |
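
These metrics can be reproduced with RDKit. The sketch below is illustrative only: the exact evaluation script is not published, so the fingerprint settings (Morgan, radius 2, 2048 bits) and the diversity definition (1 minus mean pairwise Tanimoto similarity) are assumptions.

```python
from itertools import combinations

from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, QED

def evaluate(smiles_list):
    """Compute validity, mean QED, and pairwise diversity for generated SMILES."""
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    valid = [m for m in mols if m is not None]
    validity = len(valid) / len(smiles_list)

    avg_qed = sum(QED.qed(m) for m in valid) / len(valid)

    # Diversity = 1 - mean pairwise Tanimoto similarity of Morgan fingerprints
    # (radius 2, 2048 bits -- assumed settings, not stated in this card).
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in valid]
    sims = [DataStructs.TanimotoSimilarity(a, b) for a, b in combinations(fps, 2)]
    diversity = 1.0 - sum(sims) / len(sims)

    return validity, avg_qed, diversity
```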
## πŸ—οΈ Model Architecture
A custom, scaled-down LLaMA architecture was used to optimize for chemical language modeling:
* **Hidden Size**: 768
* **Intermediate Size**: 2048
* **Number of Hidden Layers**: 8
* **Number of Attention Heads**: 8
* **Max Position Embeddings**: 1024
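
As a sanity check, the reported parameter count can be estimated from these dimensions. This is a back-of-the-envelope sketch: the vocabulary size is an assumption (the card does not state the `SmilesTokenizer` vocabulary size), and the input embedding is assumed to be tied to the output head.

```python
hidden = 768
intermediate = 2048
layers = 8
vocab = 600  # assumed: a small SMILES vocabulary, not stated in the card

# Per LLaMA decoder layer (no bias terms):
attn = 4 * hidden * hidden        # q/k/v/o projections
mlp = 3 * hidden * intermediate   # gate/up/down projections
norms = 2 * hidden                # two RMSNorms
per_layer = attn + mlp + norms

# Layers + tied token embedding + final RMSNorm
total = layers * per_layer + vocab * hidden + hidden
print(f"{total / 1e6:.1f}M parameters")  # ~57.1M, in line with the reported 57.2M
```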
## πŸš€ How to Use
You can load this model with the standard `transformers` library. The model generates SMILES strings when prompted with the `[bos]` (beginning-of-sequence) token.
### Prerequisites
Make sure you have the required libraries installed:
```bash
pip install transformers torch deepchem
```
### Generation Code
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. Load Model and Tokenizer
model_id = "jonghyunlee/MoLLaMA"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# 2. Prepare Prompt for Unconditional Generation
prompt = "[bos]"
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)

# 3. Generate SMILES
model.eval()
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )

# 4. Decode the output (the tokenizer joins tokens with spaces, so strip them)
generated_smiles = tokenizer.decode(outputs[0], skip_special_tokens=True).replace(" ", "")
print(f"Generated SMILES: {generated_smiles}")
```
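
Sampled sequences can also be screened with a post-hoc validity filter. A minimal sketch using RDKit (`filter_valid` is a hypothetical helper, not part of the model):

```python
from rdkit import Chem

def filter_valid(smiles_list):
    """Keep only parseable SMILES, returned in canonical form."""
    valid = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            valid.append(Chem.MolToSmiles(mol))
    return valid
```

This combines naturally with batched sampling, e.g. passing `num_return_sequences=30` to `model.generate` and filtering the decoded batch.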
## πŸ“š Training Details
* **Dataset**: `ZINC15` + `MuMOInstruct` (Parquet format)
* **Epochs**: 5
* **Batch Size**: 512 (with gradient accumulation steps of 4)
* **Learning Rate**: 1e-4 (Cosine scheduler, 10% Warmup)
* **Precision**: bf16 (Mixed Precision)
* **Early Stopping Patience**: 5 epochs
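
The hyperparameters above map onto a `transformers` `TrainingArguments` configuration roughly as follows. This is a sketch under assumptions: single-GPU training (so a per-device batch of 128 with 4 accumulation steps yields the effective batch of 512) and an illustrative output path.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="mollama-small",       # illustrative path
    num_train_epochs=5,
    per_device_train_batch_size=128,  # 128 x 4 accumulation steps = 512 effective
    gradient_accumulation_steps=4,
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    bf16=True,
)
```

The early-stopping patience of 5 would be applied separately via `transformers.EarlyStoppingCallback(early_stopping_patience=5)` attached to the `Trainer`.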