---
library_name: transformers
tags:
- chemistry
---
# 🧬 MoLLaMA-Small
MoLLaMA-Small is a lightweight LLaMA-based causal language model (57.2M parameters) trained from scratch to generate valid chemical molecules using SMILES strings.
This model uses DeepChem's `SmilesTokenizer` and was trained on a combined dataset of ZINC15 and MuMOInstruct. It is designed for unconditional molecule generation.
## πŸ“Š Model Performance
The model was evaluated on 30 randomly generated samples, demonstrating perfect validity and high diversity in the generated chemical structures.
| Metric | Score |
| :--- | :--- |
| **Parameters** | 57.2 M |
| **Validity** | 100.0% |
| **Average QED** | 0.6400 |
| **Diversity** | 0.8363 |
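
These metrics can be reproduced with RDKit. The sketch below is illustrative only: the exact evaluation script is not published, so the fingerprint settings (Morgan, radius 2, 2048 bits) and the diversity definition (1 minus mean pairwise Tanimoto similarity) are assumptions.

```python
from itertools import combinations

from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, QED

def evaluate(smiles_list):
    """Compute validity, mean QED, and pairwise diversity for generated SMILES."""
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    valid = [m for m in mols if m is not None]
    validity = len(valid) / len(smiles_list)

    avg_qed = sum(QED.qed(m) for m in valid) / len(valid)

    # Diversity = 1 - mean pairwise Tanimoto similarity of Morgan fingerprints
    # (radius 2, 2048 bits -- assumed settings, not stated in this card).
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in valid]
    sims = [DataStructs.TanimotoSimilarity(a, b) for a, b in combinations(fps, 2)]
    diversity = 1.0 - sum(sims) / len(sims)

    return validity, avg_qed, diversity
```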
## πŸ—οΈ Model Architecture
A custom, scaled-down LLaMA architecture was used to optimize for chemical language modeling:
* **Hidden Size**: 768
* **Intermediate Size**: 2048
* **Number of Hidden Layers**: 8
* **Number of Attention Heads**: 8
* **Max Position Embeddings**: 1024
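
As a sanity check, the reported parameter count can be estimated from these dimensions. This is a back-of-the-envelope sketch: the vocabulary size is an assumption (the card does not state the `SmilesTokenizer` vocabulary size), and the input embedding is assumed to be tied to the output head.

```python
hidden = 768
intermediate = 2048
layers = 8
vocab = 600  # assumed: a small SMILES vocabulary, not stated in the card

# Per LLaMA decoder layer (no bias terms):
attn = 4 * hidden * hidden        # q/k/v/o projections
mlp = 3 * hidden * intermediate   # gate/up/down projections
norms = 2 * hidden                # two RMSNorms
per_layer = attn + mlp + norms

# Layers + tied token embedding + final RMSNorm
total = layers * per_layer + vocab * hidden + hidden
print(f"{total / 1e6:.1f}M parameters")  # ~57.1M, in line with the reported 57.2M
```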
## πŸš€ How to Use
You can load this model with the standard `transformers` library. The model generates SMILES strings when prompted with the `[bos]` (beginning-of-sequence) token.
### Prerequisites
Make sure you have the required libraries installed:
```bash
pip install transformers torch deepchem
```
### Generation Code
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. Load Model and Tokenizer
model_id = "jonghyunlee/MoLLaMA"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# 2. Prepare Prompt for Unconditional Generation
prompt = "[bos]"
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)

# 3. Generate SMILES
model.eval()
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )

# 4. Decode the output (the tokenizer joins tokens with spaces, so strip them)
generated_smiles = tokenizer.decode(outputs[0], skip_special_tokens=True).replace(" ", "")
print(f"Generated SMILES: {generated_smiles}")
```
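
Sampled sequences can also be screened with a post-hoc validity filter. A minimal sketch using RDKit (`filter_valid` is a hypothetical helper, not part of the model):

```python
from rdkit import Chem

def filter_valid(smiles_list):
    """Keep only parseable SMILES, returned in canonical form."""
    valid = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            valid.append(Chem.MolToSmiles(mol))
    return valid
```

This combines naturally with batched sampling, e.g. passing `num_return_sequences=30` to `model.generate` and filtering the decoded batch.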
## πŸ“š Training Details
* **Dataset**: `ZINC15` + `MuMOInstruct` (Parquet format)
* **Epochs**: 5
* **Batch Size**: 512 (with gradient accumulation steps of 4)
* **Learning Rate**: 1e-4 (Cosine scheduler, 10% Warmup)
* **Precision**: bf16 (Mixed Precision)
* **Early Stopping Patience**: 5 epochs
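
The hyperparameters above map onto a `transformers` `TrainingArguments` configuration roughly as follows. This is a sketch under assumptions: single-GPU training (so a per-device batch of 128 with 4 accumulation steps yields the effective batch of 512) and an illustrative output path.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="mollama-small",       # illustrative path
    num_train_epochs=5,
    per_device_train_batch_size=128,  # 128 x 4 accumulation steps = 512 effective
    gradient_accumulation_steps=4,
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    bf16=True,
)
```

The early-stopping patience of 5 would be applied separately via `transformers.EarlyStoppingCallback(early_stopping_patience=5)` attached to the `Trainer`.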