---
library_name: transformers
tags:
- chemistry
- DeepChem
---

# 🧬 MoLLaMA-Small
MoLLaMA-Small is a lightweight LLaMA-based causal language model (57.2M parameters) trained from scratch to generate valid chemical molecules as SMILES strings.

This model uses DeepChem's `SmilesTokenizer` and was trained on a combined dataset of ZINC15 and MuMOInstruct. It is designed for unconditional molecule generation.

## 📊 Model Performance
The model was evaluated on 30 randomly generated samples. It demonstrates perfect validity and high diversity in the generated chemical structures.

| Metric | Score |
| :--- | :--- |
| **Parameters** | 57.2 M |
| **Validity** | 100.0% |
| **Average QED** | 0.6400 |
| **Diversity** | 0.8363 |

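These metrics are standard for generative chemistry models: validity is commonly the fraction of outputs that RDKit can parse, QED is drug-likeness as computed by `rdkit.Chem.QED`, and diversity is typically reported as one minus the mean pairwise Tanimoto similarity over molecular fingerprints. A minimal sketch of that diversity calculation, using toy on-bit sets in place of real Morgan fingerprints (the fingerprint values below are illustrative, not taken from the model):

```python
from itertools import combinations

def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def diversity(fingerprints: list) -> float:
    """1 - mean pairwise Tanimoto similarity over all molecule pairs."""
    pairs = list(combinations(fingerprints, 2))
    return 1.0 - sum(tanimoto(a, b) for a, b in pairs) / len(pairs)

# Toy on-bit sets standing in for Morgan fingerprints of three molecules
fps = [{1, 2, 3, 4}, {3, 4, 5, 6}, {1, 6, 7, 8}]
print(round(diversity(fps), 4))  # → 0.7937
```

With real molecules, the sets would come from RDKit Morgan fingerprints; the arithmetic is identical.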
## 🏗️ Model Architecture
A custom, scaled-down LLaMA architecture was used to optimize for chemical language modeling:
* **Hidden Size**: 768
* **Intermediate Size**: 2048
* **Number of Hidden Layers**: 8
* **Number of Attention Heads**: 8
* **Max Position Embeddings**: 1024

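As a rough cross-check, the 57.2M parameter figure can be approximated from these dimensions alone. The sketch below assumes a standard LLaMA block (four h×h attention projections, a three-matrix SwiGLU MLP, two RMSNorms), untied embeddings, and a vocabulary of 591 tokens (the size of DeepChem's published SMILES vocab — an assumption, not stated in this card):

```python
hidden, inter, layers, vocab = 768, 2048, 8, 591  # vocab size is an assumption

attn = 4 * hidden * hidden          # q, k, v, o projections
mlp = 3 * hidden * inter            # gate, up, down (SwiGLU)
norms = 2 * hidden                  # two RMSNorms per block
per_layer = attn + mlp + norms

embeddings = 2 * vocab * hidden     # input embeddings + untied LM head
total = layers * per_layer + embeddings + hidden  # + final RMSNorm

print(f"{total / 1e6:.1f}M parameters")  # → 57.5M parameters
```

This lands within a few hundred thousand parameters of the reported 57.2M; the residual gap comes down to the exact vocabulary size and whether the LM head is tied to the input embeddings.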
## 🚀 How to Use
You can load this model with the standard `transformers` library. The model generates SMILES strings when prompted with the `[bos]` (beginning-of-sequence) token.

### Prerequisites
Make sure you have the required libraries installed:

```bash
pip install transformers torch deepchem
```

### Generation Code

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. Load model and tokenizer
model_id = "jonghyunlee/MoLLaMA"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

# 2. Prepare the prompt for unconditional generation
prompt = "[bos]"
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)

# 3. Generate SMILES
model.eval()
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )

# 4. Decode the output (the tokenizer emits space-separated tokens, so strip spaces)
generated_smiles = tokenizer.decode(outputs[0], skip_special_tokens=True).replace(" ", "")
print(f"Generated SMILES: {generated_smiles}")
```
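Sampled strings are not guaranteed to be well formed, so a cheap structural pre-filter can reject obviously broken outputs before full validation. The helper below is illustrative, not part of the model's API; it only checks that `()` and `[]` are balanced, which is necessary but far from sufficient for SMILES validity:

```python
def looks_like_smiles(s: str) -> bool:
    """Cheap pre-filter: non-empty with balanced () and [] — not a full validity check."""
    if not s:
        return False
    depth = {"(": 0, "[": 0}
    pairs = {")": "(", "]": "["}
    for ch in s:
        if ch in depth:
            depth[ch] += 1
        elif ch in pairs:
            depth[pairs[ch]] -= 1
            if depth[pairs[ch]] < 0:
                return False
    return all(v == 0 for v in depth.values())

print(looks_like_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # balanced → True
print(looks_like_smiles("CC(=O"))                  # unbalanced → False
```

Strings passing this filter still need full parsing (e.g. RDKit's `Chem.MolFromSmiles`) before computing any molecular properties.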

## 📚 Training Details

* **Dataset**: `ZINC15` + `MuMOInstruct` (Parquet format)
* **Epochs**: 5
* **Batch Size**: 512 (with gradient accumulation over 4 steps)
* **Learning Rate**: 1e-4 (cosine scheduler, 10% warmup)
* **Precision**: bf16 (mixed precision)
* **Early Stopping Patience**: 5 epochs
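The learning-rate schedule above (linear warmup over the first 10% of steps, then cosine decay from the 1e-4 peak) can be sketched as a plain function; it mirrors what `transformers`' `get_cosine_schedule_with_warmup` produces, though the step counts below are illustrative rather than the actual training configuration:

```python
import math

def lr_at(step: int, total_steps: int, peak_lr: float = 1e-4,
          warmup_frac: float = 0.10) -> float:
    """Linear warmup to peak_lr over warmup_frac of training, then cosine decay to 0."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

print(lr_at(0, 1000))     # 0.0 at the start of warmup
print(lr_at(100, 1000))   # peak (1e-4) at the end of warmup
print(lr_at(1000, 1000))  # ~0.0, fully decayed
```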