---
license: mit
---

# LLaDA-8B-BioGRID-BioPAX

This repository contains a specialized LoRA adapter for [GSAI-ML/LLaDA-8B-Instruct](https://huggingface.co/GSAI-ML/LLaDA-8B-Instruct), fine-tuned by **Proximile LLC** for protein interaction network prediction using the BioPAX format. The adapter combines LLaDA's diffusion-based generation with biological knowledge from the BioGRID, UniProt, and AlphaFold databases.

## 🧬 Model Description

LLaDA-8B-BioGRID-BioPAX is a LoRA (Low-Rank Adaptation) adapter that specializes the base LLaDA model for predicting and completing protein interaction networks. The adapter lets the model use both sequence-level and structural characteristics of proteins, while keeping LLaDA's iterative denoising process, to generate biologically plausible protein networks in compressed BioPAX format.

### Key Capabilities

- **Sequence-Aware Network Prediction**: Generate complete interaction networks from protein lists with sequence/structure context
- **Structure-Guided Network Completion**: Complete partial networks using structural compatibility information
- **New Protein Integration**: Predict interactions for novel proteins based on sequence similarity and structural features
- **Multi-Modal Biological Reasoning**: Combine interaction patterns with sequence and structural data
- **BioPAX Format Generation**: Output structured biological pathway data in compressed BioPAX XML

## 🚀 Quick Start

### Installation

```bash
pip install transformers peft torch bitsandbytes
```

### Basic Usage

```python
from transformers import AutoTokenizer, AutoModel
from peft import PeftModel
import torch

# Load base model and tokenizer
base_model_name = "GSAI-ML/LLaDA-8B-Instruct"
adapter_name = "Proximile/LLaDA-8B-BioGRID-BioPAX"

tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
base_model = AutoModel.from_pretrained(base_model_name, trust_remote_code=True, device_map="auto")

# Load LoRA adapter on top of the base model
model = PeftModel.from_pretrained(base_model, adapter_name)

# Example: Predict protein network
messages = [
    {
        "role": "system",
        "content": "You are a protein interaction prediction system. Given a list of proteins with their sequence and structural information, predict all likely interactions between them in compressed BioPAX format."
    },
    {
        "role": "user",
        "content": """Predict the protein interaction network for these proteins:

PROTEIN: TP53
UniProt ID: P04637
Full Name: Tumor protein p53
Organism: Homo sapiens
Sequence Length: 393 amino acids
AlphaFold Structure: Available
Function: Tumor suppressor that prevents cancer formation

PROTEIN: MDM2
UniProt ID: Q00987
Full Name: E3 ubiquitin-protein ligase Mdm2
Organism: Homo sapiens
Sequence Length: 491 amino acids
AlphaFold Structure: Available
Function: Regulates p53 tumor suppressor"""
    }
]

# Generate network prediction using LLaDA's diffusion process
# (Implementation of generate() function needed - see full example below)
```

## 🔬 Training Details

### Base Model

- **Architecture**: LLaDA (Large Language Diffusion with mAsking)
- **Base Model**: GSAI-ML/LLaDA-8B-Instruct
- **Parameters**: 8.02B (base model)
- **Adapter Type**: LoRA (Low-Rank Adaptation)

### LoRA Configuration

- **Method**: Supervised Fine-Tuning (SFT) with LoRA
- **LoRA Settings**:
  - Rank (r): 256 (16 × 16 multiplier)
  - Alpha: 512 (256 × 2 alpha/r ratio)
  - Target Modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
- **Training Data**: BioGRID-Conv dataset with 5,000+ protein neighborhoods
- **Context Length**: Up to 1,024 tokens (context) + 512 tokens (generation)

### Data Sources

- **BioGRID 4.4.246**: 2.8M+ protein/genetic interactions from 86K+ publications
- **UniProt**: Protein sequences, functional annotations, organism data
- **AlphaFold**: AI-predicted protein structures, confidence scores

## 💻 Complete Generation Example

```python
import torch
from transformers import AutoTokenizer, AutoModel
from peft import PeftModel

# Token ID that LLaDA uses for masked (not yet generated) positions
MASK_TOKEN_ID = 126336


def add_gumbel_noise(logits, temperature):
    """Add Gumbel noise for categorical sampling in diffusion models."""
    if temperature <= 0:
        return logits
    logits = logits.to(torch.float64)
    noise = torch.rand_like(logits, dtype=torch.float64)
    gumbel_noise = (-torch.log(noise)) ** temperature
    return logits.exp() / gumbel_noise


def get_num_transfer_tokens(mask_index, steps):
    """Compute how many tokens to unmask at each denoising step."""
    mask_num = mask_index.sum(dim=1, keepdim=True)
    if steps == 0:
        steps = 1
    base = mask_num // steps
    remainder = mask_num % steps
    num_transfer_tokens = torch.zeros(
        mask_num.size(0), steps, device=mask_index.device, dtype=torch.int64
    ) + base
    # Spread the remainder over the first steps so the counts sum to mask_num
    for i in range(mask_num.size(0)):
        if remainder[i] > 0:
            num_transfer_tokens[i, :remainder[i]] += 1
    return num_transfer_tokens


def generate(model, prompt, steps=128, gen_length=128, block_length=32,
             temperature=0., remasking='low_confidence', mask_id=MASK_TOKEN_ID):
    """Generate text using LLaDA's diffusion-based process."""
    device = next(model.parameters()).device
    prompt = prompt.to(device)
    prompt_len = prompt.shape[1]

    # Start from the prompt followed by gen_length masked positions
    x = torch.full((1, prompt_len + gen_length), mask_id, dtype=torch.long, device=device)
    x[:, :prompt_len] = prompt.clone()

    assert gen_length % block_length == 0
    num_blocks = gen_length // block_length
    assert steps % num_blocks == 0
    steps_per_block = steps // num_blocks

    # Blocks are denoised left to right (semi-autoregressive)
    for num_block in range(num_blocks):
        block_start = prompt_len + num_block * block_length
        block_end = prompt_len + (num_block + 1) * block_length
        block_mask_index = (x[:, block_start:block_end] == mask_id)
        num_transfer_tokens = get_num_transfer_tokens(block_mask_index, steps_per_block)

        for i in range(steps_per_block):
            mask_index = (x == mask_id)
            if not mask_index.any():
                break

            outputs = model(x)
            logits = outputs.logits
            logits_with_noise = add_gumbel_noise(logits, temperature=temperature)
            x0 = torch.argmax(logits_with_noise, dim=-1)

            if remasking == 'low_confidence':
                p = torch.nn.functional.softmax(logits.to(torch.float64), dim=-1)
                x0_p = torch.squeeze(
                    torch.gather(p, dim=-1, index=torch.unsqueeze(x0, -1)), -1)
            elif remasking == 'random':
                x0_p = torch.rand((x0.shape[0], x0.shape[1]), device=x0.device)
            else:
                raise NotImplementedError(remasking)

            # Never unmask positions beyond the current block
            x0_p[:, block_end:] = -float('inf')

            x0 = torch.where(mask_index, x0, x)
            confidence = torch.where(mask_index, x0_p, -float('inf'))

            # Commit the most confident predictions for this step
            transfer_index = torch.zeros_like(x0, dtype=torch.bool, device=x0.device)
            for j in range(confidence.shape[0]):
                _, select_index = torch.topk(confidence[j], k=int(num_transfer_tokens[j, i]))
                transfer_index[j, select_index] = True
            x[transfer_index] = x0[transfer_index]

    return x


def predict_protein_network(model, tokenizer, messages, temperature=0.1,
                            gen_length=512, steps=128):
    """Generate protein network prediction."""
    formatted_input = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    input_ids = tokenizer(formatted_input, return_tensors="pt")["input_ids"]

    with torch.no_grad():
        output_ids = generate(
            model, input_ids,
            steps=steps,
            gen_length=gen_length,
            block_length=32,
            temperature=temperature,
            remasking='low_confidence'
        )

    # Decode only the generated part and cut at the first special token
    generated_text = tokenizer.decode(
        output_ids[0, input_ids.shape[1]:], skip_special_tokens=False
    ).split("<|")[0]
    return generated_text


# Load model
base_model_name = "GSAI-ML/LLaDA-8B-Instruct"
adapter_name = "Proximile/LLaDA-8B-BioGRID-BioPAX"

tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
base_model = AutoModel.from_pretrained(base_model_name, trust_remote_code=True, device_map="auto")
model = PeftModel.from_pretrained(base_model, adapter_name)

# Example prediction
messages = [
    {
        "role": "user",
        "content": """Predict the protein interaction network for these proteins in compressed BioPAX format:

PROTEIN: TP53
UniProt ID: P04637
Full Name: Tumor protein p53
Organism: Homo sapiens
Sequence Length: 393 amino acids
AlphaFold Structure: Available

PROTEIN: MDM2
UniProt ID: Q00987
Full Name: E3 ubiquitin-protein ligase Mdm2
Organism: Homo sapiens
Sequence Length: 491 amino acids
AlphaFold Structure: Available"""
    }
]

result = predict_protein_network(model, tokenizer, messages)
print("Predicted Network:")
print(result)
```

## 📊 BioPAX Output Format

The model generates protein networks in compressed BioPAX format:

```xml

```

## 🧪 Supported Task Types

1. **Complete Network Prediction**: Generate full interaction networks from protein lists
2. **New Protein Integration**: Predict interactions for new proteins in existing networks
3. **Partial Network Completion**: Fill in missing interactions in incomplete networks
4. **Property-Constrained Generation**: Generate networks meeting specific biological constraints

## ⚠️ Limitations

- **Diffusion-Based Generation**: LLaDA's iterative denoising may behave differently from standard autoregressive models
- **BioPAX Format Specificity**: Output must precisely match the compressed BioPAX XML schema
- **Biological Accuracy**: Predictions are based on training data patterns and may not reflect all biological realities
- **Computational Requirements**: Each denoising step runs a forward pass over the full sequence, so diffusion generation requires more compute than standard autoregressive inference

## 📚 Citation

If you use this model in your research, please cite:

```bibtex
@misc{llada-8b-biogrid-biopax,
  author = {Proximile LLC},
  title = {LLaDA-8B-BioGRID-BioPAX: LoRA Adapter for Diffusion-Based Protein Interaction Network Prediction},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Proximile/LLaDA-8B-BioGRID-BioPAX}}
}
```

Please also cite the original LLaDA paper and the BioGRID database.

## 🏢 About Proximile LLC

Proximile LLC provides secure, cost-effective, and private AI solutions tailored to small and medium-sized businesses. We specialize in:

- **On-premise AI inference** solutions that ensure unparalleled privacy
- **Cost-effective hardware configurations** including specialized bioinformatics workstations
- **Secure local AI applications** for life sciences, including protein analysis and drug discovery tools
- **Specialized services** for compliance & governance in regulated industries

Visit [proximile.llc](https://proximile.llc) to learn more about our secure, local AI solutions for your business.
## 🔄 Model Updates

- **June 16, 2025** – Initial LoRA adapter release with BioGRID 4.4.246 training data
- Enhanced with UniProt and AlphaFold integration for comprehensive protein context

## 📄 License

This LoRA adapter is released under the same license as the base LLaDA model.
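
## 🧮 Checking Generation Parameters

The `generate()` function in the Complete Generation Example asserts that `gen_length` is a multiple of `block_length` and that `steps` is a multiple of the number of blocks, and `get_num_transfer_tokens()` then splits each block's masked tokens across its denoising steps. The sketch below reproduces only that arithmetic in plain Python so a parameter combination can be checked before loading the model. The helper name `unmasking_schedule` is hypothetical and not part of LLaDA or this adapter, and it assumes every position in a block starts out masked.

```python
def unmasking_schedule(gen_length: int, steps: int, block_length: int) -> list[list[int]]:
    """Return, per block, how many tokens are unmasked at each denoising step.

    Hypothetical helper: it mirrors the assertions in generate() and the
    base/remainder split in get_num_transfer_tokens(), assuming each block
    starts fully masked.
    """
    if gen_length <= 0 or steps <= 0 or block_length <= 0:
        raise ValueError("gen_length, steps and block_length must be positive")
    if gen_length % block_length != 0:
        raise ValueError("gen_length must be a multiple of block_length")
    num_blocks = gen_length // block_length
    if steps % num_blocks != 0:
        raise ValueError("steps must be a multiple of gen_length // block_length")

    steps_per_block = steps // num_blocks
    base, remainder = divmod(block_length, steps_per_block)
    # The first `remainder` steps unmask one extra token each.
    per_block = [base + 1 if i < remainder else base for i in range(steps_per_block)]
    return [list(per_block) for _ in range(num_blocks)]


# Defaults of predict_protein_network(): 16 blocks, 8 steps per block
print(unmasking_schedule(gen_length=512, steps=128, block_length=32)[0])
# [4, 4, 4, 4, 4, 4, 4, 4]
```

With the defaults used above, each denoising step commits 4 tokens. Raising `steps` lowers that number at the cost of more forward passes, since `generate()` calls the model once per step.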