---
library_name: transformers
tags:
- chemistry
- deepchem
---

# 🧬 MoLLaMA-Small

MoLLaMA-Small is a lightweight LLaMA-based causal language model (57.2M parameters) trained from scratch to generate valid chemical molecules as SMILES strings. The model uses DeepChem's `SmilesTokenizer` and was trained on a combined dataset of ZINC15 and MuMOInstruct. It is designed for unconditional molecule generation.

## 📊 Model Performance

The model was evaluated on 30 randomly generated samples. It achieved perfect validity and high diversity in the generated chemical structures.

| Metric | Score |
| :--- | :--- |
| **Parameters** | 57.2 M |
| **Validity** | 100.0% |
| **Average QED** | 0.6400 |
| **Diversity** | 0.8363 |

## 🏗️ Model Architecture

A custom, scaled-down LLaMA architecture was used for chemical language modeling:

* **Hidden Size**: 768
* **Intermediate Size**: 2048
* **Number of Hidden Layers**: 8
* **Number of Attention Heads**: 8
* **Max Position Embeddings**: 1024

## 🚀 How to Use

You can load this model with the standard `transformers` library. To generate a SMILES string, prompt the model with the `[bos]` (beginning-of-sequence) token.

### Prerequisites

Make sure the required libraries are installed:

```bash
pip install transformers torch deepchem
```

### Generation Code

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. Load model and tokenizer
model_id = "jonghyunlee/MoLLaMA"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# 2. Prepare the prompt for unconditional generation
prompt = "[bos]"
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)

# 3. Generate SMILES
model.eval()
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )

# 4. Decode the output (the tokenizer emits space-separated tokens, so strip the spaces)
generated_smiles = tokenizer.decode(outputs[0], skip_special_tokens=True).replace(" ", "")
print(f"Generated SMILES: {generated_smiles}")
```

## 📚 Training Details

* **Dataset**: `ZINC15` + `MuMOInstruct` (Parquet format)
* **Epochs**: 5
* **Batch Size**: 512 (with gradient accumulation steps of 4)
* **Learning Rate**: 1e-4 (cosine scheduler, 10% warmup)
* **Precision**: bf16 (mixed precision)
* **Early Stopping Patience**: 5 epochs
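## 🧮 Parameter Count Sketch

The 57.2M figure can be roughly reproduced from the architecture numbers above. The sketch below counts a standard (bias-free) LLaMA decoder layer: Q/K/V/O attention projections, a gated MLP (gate, up, down), and two RMSNorms. The vocabulary size of 591 is an assumption (the DeepChem `SmilesTokenizer` default vocabulary), and the exact total also depends on whether the embedding and output head are tied, so treat this as a back-of-the-envelope check rather than the model's exact accounting.

```python
def llama_layer_params(hidden: int, intermediate: int) -> int:
    """Parameters in one bias-free LLaMA decoder layer."""
    attn = 4 * hidden * hidden       # q, k, v, o projections
    mlp = 3 * hidden * intermediate  # gate, up, down projections
    norms = 2 * hidden               # two RMSNorm weight vectors
    return attn + mlp + norms

def llama_total_params(hidden, intermediate, layers, vocab, tied=False):
    """Embedding + decoder stack + final norm + (untied) LM head."""
    embed = vocab * hidden
    head = 0 if tied else vocab * hidden
    return embed + layers * llama_layer_params(hidden, intermediate) + hidden + head

# Numbers from the model card; vocab=591 is an assumption.
per_layer = llama_layer_params(768, 2048)
total = llama_total_params(768, 2048, layers=8, vocab=591)
print(per_layer)  # 7,079,424 per decoder layer
print(total)      # ~57.5M, in the ballpark of the reported 57.2M
```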
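## 🎲 What `temperature` and `top_p` Do

The generation call above samples with temperature 0.7 and nucleus (top-p) filtering at 0.9. A minimal pure-Python sketch of what these two knobs do to a next-token distribution (not `transformers`' actual implementation, which operates on logit tensors in torch): temperature rescales the logits before the softmax, and top-p keeps only the smallest set of tokens whose cumulative probability reaches `top_p`, then renormalises.

```python
import math

def nucleus_filter(logits, temperature=0.7, top_p=0.9):
    """Temperature-scale logits, softmax, then keep the top-p nucleus.
    Returns a dict mapping kept token indices to renormalised probabilities."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]   # numerically stable softmax
    probs = [e / sum(exps) for e in exps]
    # Accumulate tokens from most to least probable until top_p is reached.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# With a peaked distribution, the low-probability tail is dropped.
filtered = nucleus_filter([5.0, 4.0, 1.0, 0.1])
print(sorted(filtered))  # → [0, 1]
```

Lower temperatures sharpen the distribution (safer, more repetitive SMILES); higher values flatten it (more novel but more likely invalid strings).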
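## 📐 How the Diversity Metric Is Computed

Diversity in the table above is conventionally computed as 1 minus the mean pairwise Tanimoto similarity over the generated molecules' fingerprints; the model card does not state the exact fingerprint used, and RDKit Morgan fingerprints are a common choice. A dependency-free sketch with fingerprints represented as sets of "on" bits (the toy bit-sets are illustrative only):

```python
from itertools import combinations

def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprint bit-sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def diversity(fingerprints) -> float:
    """1 - mean pairwise Tanimoto similarity over all molecule pairs."""
    pairs = list(combinations(fingerprints, 2))
    mean_sim = sum(tanimoto(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_sim

# Three toy fingerprints: one overlapping pair, two disjoint pairs.
fps = [{1, 2, 3}, {3, 4, 5}, {6, 7, 8}]
print(round(diversity(fps), 3))  # → 0.933
```

Validity is typically checked by whether RDKit's `Chem.MolFromSmiles` parses the generated string, and QED (quantitative estimate of drug-likeness, in [0, 1]) via `rdkit.Chem.QED.qed`.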