	---
	license: mit
	---
	# LLaDA-8B-BioGRID-BioPAX

	This repository contains a specialized LoRA adapter for [GSAI-ML/LLaDA-8B-Instruct](https://huggingface.co/GSAI-ML/LLaDA-8B-Instruct), fine-tuned by Proximile LLC for protein interaction network prediction using the BioPAX format. This adapter combines LLaDA's diffusion-based generation with comprehensive biological knowledge from BioGRID, UniProt, and AlphaFold databases.

	## 🧬 Model Description

	LLaDA-8B-BioGRID-BioPAX is a LoRA (Low-Rank Adaptation) adapter that specializes the base LLaDA model for predicting and completing protein interaction networks. The adapter enables the model to understand both sequence-level and structural characteristics of proteins while maintaining LLaDA's iterative denoising process to generate biologically plausible protein networks in compressed BioPAX format.

	### Key Capabilities

	- Sequence-Aware Network Prediction: Generate complete interaction networks from protein lists with sequence/structure context
	- Structure-Guided Network Completion: Complete partial networks using structural compatibility information
	- New Protein Integration: Predict interactions for novel proteins based on sequence similarity and structural features
	- Multi-Modal Biological Reasoning: Combine interaction patterns with sequence and structural data
	- BioPAX Format Generation: Output structured biological pathway data in compressed BioPAX XML

	## 🚀 Quick Start

	### Installation

	```bash
	pip install transformers peft torch bitsandbytes
	```

	### Basic Usage

	```python
	from transformers import AutoTokenizer, AutoModel
	from peft import PeftModel
	import torch

	# Load base model and tokenizer
	base_model_name = "GSAI-ML/LLaDA-8B-Instruct"
	adapter_name = "Proximile/LLaDA-8B-BioGRID-BioPAX"

	tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
	base_model = AutoModel.from_pretrained(base_model_name, trust_remote_code=True, device_map="auto")

	# Load LoRA adapter
	model = PeftModel.from_pretrained(base_model, adapter_name)

	# Example: Predict protein network
	messages = [
	    {
	        "role": "system",
	        "content": "You are a protein interaction prediction system. Given a list of proteins with their sequence and structural information, predict all likely interactions between them in compressed BioPAX format.",
	    },
	    {
	        "role": "user",
	        "content": """Predict the protein interaction network for these proteins:

	PROTEIN: TP53
	UniProt ID: P04637
	Full Name: Tumor protein p53
	Organism: Homo sapiens
	Sequence Length: 393 amino acids
	AlphaFold Structure: Available
	Function: Tumor suppressor that prevents cancer formation

	PROTEIN: MDM2
	UniProt ID: Q00987
	Full Name: E3 ubiquitin-protein ligase Mdm2
	Organism: Homo sapiens
	Sequence Length: 491 amino acids
	AlphaFold Structure: Available
	Function: Regulates p53 tumor suppressor""",
	    },
	]

	# Generate network prediction using LLaDA's diffusion process
	# (Implementation of generate() function needed - see full example below)
	```
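The user message in the example follows a fixed field layout (`PROTEIN`, `UniProt ID`, `Full Name`, and so on). A minimal sketch of building that message from records might look like the following. `ProteinRecord`, `format_protein`, and `build_user_prompt` are hypothetical names introduced here for illustration, and the `Not available` wording for a missing structure is an assumption; the model card only shows `Available`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProteinRecord:
    """Hypothetical container for the fields used in the example prompt."""
    symbol: str
    uniprot_id: str
    full_name: str
    organism: str
    sequence_length: int
    has_alphafold: bool = True
    function: Optional[str] = None


def format_protein(rec: ProteinRecord) -> str:
    """Render one protein in the field layout shown in the example prompt."""
    # "Not available" is an assumed label; the source only shows "Available".
    structure = "Available" if rec.has_alphafold else "Not available"
    lines = [
        f"PROTEIN: {rec.symbol}",
        f"UniProt ID: {rec.uniprot_id}",
        f"Full Name: {rec.full_name}",
        f"Organism: {rec.organism}",
        f"Sequence Length: {rec.sequence_length} amino acids",
        f"AlphaFold Structure: {structure}",
    ]
    if rec.function:
        lines.append(f"Function: {rec.function}")
    return "\n".join(lines)


def build_user_prompt(records: List[ProteinRecord]) -> str:
    """Join the protein blocks under the instruction line used in the example."""
    header = "Predict the protein interaction network for these proteins:"
    return header + "\n\n" + "\n\n".join(format_protein(r) for r in records)


records = [
    ProteinRecord("TP53", "P04637", "Tumor protein p53", "Homo sapiens", 393,
                  function="Tumor suppressor that prevents cancer formation"),
    ProteinRecord("MDM2", "Q00987", "E3 ubiquitin-protein ligase Mdm2", "Homo sapiens", 491,
                  function="Regulates p53 tumor suppressor"),
]
prompt = build_user_prompt(records)
messages = [{"role": "user", "content": prompt}]
print(prompt)
```

Keeping the layout in one function makes it easier to stay close to the format the adapter saw during training; how well the adapter tolerates missing or reordered fields is not documented here.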

	## 🔬 Training Details

	### Base Model
	- Architecture: LLaDA (Large Language Diffusion with mAsking)
	- Base Model: GSAI-ML/LLaDA-8B-Instruct
	- Parameters: 8.02B (base model)
	- Adapter Type: LoRA (Low-Rank Adaptation)

	### LoRA Configuration
	- Method: Supervised Fine-Tuning (SFT) with LoRA
	- LoRA Settings:
	  - Rank (r): 256 (16 × 16 multiplier)
	  - Alpha: 512 (256 × 2, i.e. an alpha/r ratio of 2)
	  - Target Modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
	- Training Data: BioGRID-Conv dataset with 5,000+ protein neighborhoods
	- Context Length: Up to 1,024 tokens (context) + 512 tokens (generation)
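
To make these settings concrete, the sketch below records them in a small dataclass and derives two quantities that follow from standard LoRA: the scaling factor alpha/r applied to the low-rank update, and the number of parameters a rank-r adapter adds to one weight matrix of shape d_in × d_out, which is r · (d_in + d_out). `LoraSettings` is a hypothetical helper, and the 4096 × 4096 projection size is an assumed illustrative shape, not a figure taken from this model card. In `peft`, the same values would be passed to `LoraConfig` through `r`, `lora_alpha`, and `target_modules`.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LoraSettings:
    """Hypothetical record of the LoRA settings listed in this model card."""
    r: int = 256
    alpha: int = 512
    target_modules: Tuple[str, ...] = (
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    )

    @property
    def scaling(self) -> float:
        """Standard LoRA scaling applied to the low-rank update (alpha / r)."""
        return self.alpha / self.r

    def added_params(self, d_in: int, d_out: int) -> int:
        """Parameters added to one d_in x d_out matrix: A is r x d_in, B is d_out x r."""
        return self.r * (d_in + d_out)


settings = LoraSettings()
print("scaling:", settings.scaling)
# 4096 x 4096 is an assumed example shape, not a documented layer size.
print("added params for a 4096 x 4096 matrix:", settings.added_params(4096, 4096))
```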

	### Data Sources
	- BioGRID 4.4.246: 2.8M+ protein/genetic interactions from 86K+ publications
	- UniProt: Protein sequences, functional annotations, organism data
	- AlphaFold: AI-predicted protein structures, confidence scores

	## 💻 Complete Generation Example

	```python
	import torch
	import json
	from transformers import AutoTokenizer, AutoModel
	from peft import PeftModel

	# Constants for LLaDA generation
	MASK_TOKEN_ID = 126336

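	def add_gumbel_noise(logits, temperature):
	    """Add Gumbel noise for categorical sampling in diffusion models.

	    argmax(exp(l) / (-log u)^T) equals argmax(l + T * g) with g = -log(-log u),
	    a standard Gumbel variable, so it samples from softmax(l / T) for T > 0.
	    """
	    if temperature <= 0:
	        # Greedy decoding: leave the logits unchanged.
	        return logits

	    # Work in float64 to limit precision loss in the exp/log arithmetic.
	    logits = logits.to(torch.float64)
	    noise = torch.rand_like(logits, dtype=torch.float64)
	    gumbel_noise = (- torch.log(noise)) ** temperature
	    return logits.exp() / gumbel_noise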

	def get_num_transfer_tokens(mask_index, steps):
	    """Compute tokens to transition at each denoising step."""
	    mask_num = mask_index.sum(dim=1, keepdim=True)

	    if steps == 0:
	        steps = 1

	    base = mask_num // steps
	    remainder = mask_num % steps

	    num_transfer_tokens = torch.zeros(mask_num.size(0), steps, device=mask_index.device, dtype=torch.int64) + base

	    # Give the first `remainder` steps one extra token so every masked token is covered.
	    for i in range(mask_num.size(0)):
	        if remainder[i] > 0:
	            num_transfer_tokens[i, :remainder[i]] += 1

	    return num_transfer_tokens

	def generate(model, prompt, steps=128, gen_length=128, block_length=32, temperature=0.,
	             remasking='low_confidence', mask_id=MASK_TOKEN_ID):
	    """Generate text using LLaDA's diffusion-based process."""
	    device = next(model.parameters()).device
	    prompt = prompt.to(device)

	    # Start from the prompt followed by gen_length mask tokens.
	    x = torch.full((1, prompt.shape[1] + gen_length), mask_id, dtype=torch.long).to(device)
	    x[:, :prompt.shape[1]] = prompt.clone()

	    prompt_index = (x != mask_id)

	    assert gen_length % block_length == 0
	    num_blocks = gen_length // block_length

	    assert steps % num_blocks == 0
	    steps_per_block = steps // num_blocks

	    # Blocks are unmasked left to right; each block gets steps_per_block denoising steps.
	    for num_block in range(num_blocks):
	        block_mask_index = (x[:, prompt.shape[1] + num_block * block_length: prompt.shape[1] + (num_block + 1) * block_length] == mask_id)
	        num_transfer_tokens = get_num_transfer_tokens(block_mask_index, steps_per_block)

	        for i in range(steps_per_block):
	            mask_index = (x == mask_id)
	            if not mask_index.any():
	                break

	            outputs = model(x)
	            logits = outputs.logits

	            logits_with_noise = add_gumbel_noise(logits, temperature=temperature)
	            x0 = torch.argmax(logits_with_noise, dim=-1)

	            if remasking == 'low_confidence':
	                # Confidence is the model probability of each predicted token.
	                p = torch.nn.functional.softmax(logits.to(torch.float64), dim=-1)
	                x0_p = torch.squeeze(
	                    torch.gather(p, dim=-1, index=torch.unsqueeze(x0, -1)), -1)
	            elif remasking == 'random':
	                x0_p = torch.rand((x0.shape[0], x0.shape[1]), device=x0.device)
	            else:
	                raise NotImplementedError(remasking)

	            # Positions after the current block are never selected in this step.
	            x0_p[:, prompt.shape[1] + (num_block + 1) * block_length:] = -float('inf')

	            x0 = torch.where(mask_index, x0, x)
	            confidence = torch.where(mask_index, x0_p, -float('inf'))

	            # Unmask the k most confident positions; the rest stay masked.
	            transfer_index = torch.zeros_like(x0, dtype=torch.bool, device=x0.device)
	            for j in range(confidence.shape[0]):
	                _, select_index = torch.topk(confidence[j], k=num_transfer_tokens[j, i])
	                transfer_index[j, select_index] = True
	            x[transfer_index] = x0[transfer_index]

	    return x

	def predict_protein_network(model, tokenizer, messages, temperature=0.1, gen_length=512, steps=128):
	    """Generate protein network prediction."""
	    formatted_input = tokenizer.apply_chat_template(
	        messages,
	        tokenize=False,
	        add_generation_prompt=True
	    )

	    input_ids = tokenizer(formatted_input, return_tensors="pt")["input_ids"]

	    with torch.no_grad():
	        output_ids = generate(
	            model,
	            input_ids,
	            steps=steps,
	            gen_length=gen_length,
	            block_length=32,
	            temperature=temperature,
	            remasking='low_confidence'
	        )

	    # Decode only the generated part and cut at the first special token ("<|...").
	    generated_text = tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=False).split("<|")[0]
	    return generated_text

	# Load model
	base_model_name = "GSAI-ML/LLaDA-8B-Instruct"
	adapter_name = "Proximile/LLaDA-8B-BioGRID-BioPAX"

	tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
	base_model = AutoModel.from_pretrained(base_model_name, trust_remote_code=True, device_map="auto")
	model = PeftModel.from_pretrained(base_model, adapter_name)

	# Example prediction
	messages = [
	    {
	        "role": "user",
	        "content": """Predict the protein interaction network for these proteins in compressed BioPAX format:

	PROTEIN: TP53
	UniProt ID: P04637
	Full Name: Tumor protein p53
	Organism: Homo sapiens
	Sequence Length: 393 amino acids
	AlphaFold Structure: Available

	PROTEIN: MDM2
	UniProt ID: Q00987
	Full Name: E3 ubiquitin-protein ligase Mdm2
	Organism: Homo sapiens
	Sequence Length: 491 amino acids
	AlphaFold Structure: Available""",
	    },
	]

	result = predict_protein_network(model, tokenizer, messages)
	print("Predicted Network:")
	print(result)
	```
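
The `generate()` function above asserts that `gen_length` is divisible by `block_length` and that `steps` is divisible by the number of blocks. The pure-Python sketch below mirrors that arithmetic, together with the per-step split done by `get_num_transfer_tokens` for a single sequence, so the constraints can be checked before the model is loaded. `plan_schedule` is a hypothetical helper written for this illustration; it assumes every position in a block starts out masked, which is how `generate()` initializes the sequence.

```python
from typing import List, Tuple


def plan_schedule(gen_length: int, block_length: int, steps: int) -> Tuple[int, int, List[int]]:
    """Return (num_blocks, steps_per_block, tokens unmasked at each step of a block)."""
    if steps <= 0:
        raise ValueError("steps must be positive")
    if gen_length % block_length != 0:
        raise ValueError("gen_length must be a multiple of block_length")
    num_blocks = gen_length // block_length
    if steps % num_blocks != 0:
        raise ValueError("steps must be a multiple of gen_length // block_length")
    steps_per_block = steps // num_blocks

    # Same split as get_num_transfer_tokens: an even share per step,
    # with the remainder given to the earliest steps.
    base, remainder = divmod(block_length, steps_per_block)
    per_step = [base + (1 if i < remainder else 0) for i in range(steps_per_block)]
    return num_blocks, steps_per_block, per_step


# Defaults used by predict_protein_network: gen_length=512, block_length=32, steps=128.
num_blocks, steps_per_block, per_step = plan_schedule(512, 32, 128)
print(num_blocks, steps_per_block, per_step)
```

With those defaults the schedule is 16 blocks of 8 steps, unmasking 4 tokens per step. Raising `steps` to 512 would unmask one token per step at four times the number of forward passes.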

	## 📊 BioPAX Output Format

	The model generates protein networks in compressed BioPAX format:

	```xml
	<biopax>
	  <proteins>
	    <p id="tp53" name="TP53" uniprot="P04637" fullname="Tumor protein p53"/>
	    <p id="mdm2" name="MDM2" uniprot="Q00987" fullname="E3 ubiquitin-protein ligase Mdm2"/>
	  </proteins>
	  <interactions>
	    <i id="1" a="tp53" b="mdm2" type="Affinity Capture-Western"/>
	    <i id="2" a="tp53" b="mdm2" type="Biochemical Activity"/>
	  </interactions>
	</biopax>
	```
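
Because the compressed format is plain XML, the Python standard library can read it. The sketch below parses the example above with `xml.etree.ElementTree`. It assumes the element and attribute names shown there (`p`, `i`, `a`, `b`, `type`) and that the text is well-formed XML, which generated output may not always be. `parse_network` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

EXAMPLE = """<biopax>
  <proteins>
    <p id="tp53" name="TP53" uniprot="P04637" fullname="Tumor protein p53"/>
    <p id="mdm2" name="MDM2" uniprot="Q00987" fullname="E3 ubiquitin-protein ligase Mdm2"/>
  </proteins>
  <interactions>
    <i id="1" a="tp53" b="mdm2" type="Affinity Capture-Western"/>
    <i id="2" a="tp53" b="mdm2" type="Biochemical Activity"/>
  </interactions>
</biopax>"""


def parse_network(xml_text: str):
    """Return ({protein id: attributes}, [(a, b, interaction type), ...])."""
    root = ET.fromstring(xml_text)
    proteins = {p.get("id"): dict(p.attrib) for p in root.findall("./proteins/p")}
    interactions = [
        (i.get("a"), i.get("b"), i.get("type"))
        for i in root.findall("./interactions/i")
    ]
    return proteins, interactions


proteins, interactions = parse_network(EXAMPLE)
for a, b, kind in interactions:
    print(f"{proteins[a]['name']} <-> {proteins[b]['name']}: {kind}")
```

The same protein pair can appear more than once with different `type` values, so the interactions are kept as a list rather than a dictionary keyed by pair.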

	## 🧪 Supported Task Types

	1. Complete Network Prediction: Generate full interaction networks from protein lists
	2. New Protein Integration: Predict interactions for new proteins in existing networks
	3. Partial Network Completion: Fill in missing interactions in incomplete networks
	4. Property-Constrained Generation: Generate networks meeting specific biological constraints

	## ⚠️ Limitations

	- Diffusion-Based Generation: LLaDA's iterative denoising may behave differently from standard autoregressive models
	- BioPAX Format Specificity: Output must precisely match the compressed BioPAX XML schema
	- Biological Accuracy: Predictions are based on training data patterns and may not reflect all biological realities
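	- Computational Requirements: Each denoising step is a full forward pass over the whole sequence (up to `steps` passes per generation, 128 by default in the example), so diffusion generation requires more compute than standard inference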
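
Given the format and accuracy caveats above, it can help to check generated text before using it downstream. The sketch below is one possible structural check, not a validator shipped with this adapter. It reports XML parse failures and interactions whose `a` or `b` attribute refers to a protein `id` that was never declared. The function name `check_network` and the specific rules are assumptions made for illustration, and passing the check says nothing about biological correctness.

```python
import xml.etree.ElementTree as ET
from typing import List


def check_network(xml_text: str) -> List[str]:
    """Return a list of structural problems; an empty list means none were found."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # Truncated or malformed output ends up here.
        return [f"not well-formed XML: {exc}"]

    problems = []
    if root.tag != "biopax":
        problems.append(f"unexpected root element: {root.tag!r}")

    declared = {p.get("id") for p in root.findall("./proteins/p")}
    for inter in root.findall("./interactions/i"):
        for end in ("a", "b"):
            ref = inter.get(end)
            if ref not in declared:
                problems.append(
                    f"interaction {inter.get('id')!r}: unknown protein {ref!r}"
                )
    return problems


good = (
    '<biopax><proteins><p id="tp53"/><p id="mdm2"/></proteins>'
    '<interactions><i id="1" a="tp53" b="mdm2" type="Biochemical Activity"/></interactions></biopax>'
)
truncated = '<biopax><proteins><p id="tp53"/>'
print(check_network(good))
print(check_network(truncated))
```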

	## 📚 Citation

	If you use this model in your research, please cite:

	```bibtex
	@misc{llada-8b-biogrid-biopax,
	author = {Proximile LLC},
	title = {LLaDA-8B-BioGRID-BioPAX: LoRA Adapter for Diffusion-Based Protein Interaction Network Prediction},
	year = {2025},
	publisher = {Hugging Face},
	howpublished = {\url{https://huggingface.co/Proximile/LLaDA-8B-BioGRID-BioPAX}}
	}
	```

	Also cite the original LLaDA paper and BioGRID database.

	## 🏢 About Proximile LLC

	Proximile LLC provides secure, cost-effective, and private AI solutions tailored to small and medium-sized businesses. We specialize in:

	- On-premise AI inference solutions that ensure unparalleled privacy
	- Cost-effective hardware configurations including specialized bioinformatics workstations
	- Secure Local AI applications for life sciences, including protein analysis and drug discovery tools
	- Specialized services for compliance & governance in regulated industries

	Visit [proximile.llc](https://proximile.llc) to learn more about our secure, local AI solutions for your business.

	## 🔄 Model Updates

	- June 16, 2025 – Initial LoRA adapter release with BioGRID 4.4.246 training data
	- Enhanced with UniProt and AlphaFold integration for comprehensive protein context

	## 📄 License

	This LoRA adapter is released under the MIT license, the same license as the base LLaDA model.