---
license: mit
---
# LLaDA-8B-BioGRID-BioPAX

This repository contains a specialized LoRA adapter for [GSAI-ML/LLaDA-8B-Instruct](https://huggingface.co/GSAI-ML/LLaDA-8B-Instruct), fine-tuned by **Proximile LLC** for protein interaction network prediction using the BioPAX format. This adapter combines LLaDA's diffusion-based generation with comprehensive biological knowledge from BioGRID, UniProt, and AlphaFold databases.

## Model Description

LLaDA-8B-BioGRID-BioPAX is a LoRA (Low-Rank Adaptation) adapter that specializes the base LLaDA model for predicting and completing protein interaction networks. The adapter lets the model use both sequence-level and structural characteristics of proteins, and it keeps LLaDA's iterative denoising process to generate biologically plausible protein networks in compressed BioPAX format.

### Key Capabilities

- **Sequence-Aware Network Prediction**: Generate complete interaction networks from protein lists with sequence/structure context
- **Structure-Guided Network Completion**: Complete partial networks using structural compatibility information
- **New Protein Integration**: Predict interactions for novel proteins based on sequence similarity and structural features
- **Multi-Modal Biological Reasoning**: Combine interaction patterns with sequence and structural data
- **BioPAX Format Generation**: Output structured biological pathway data in compressed BioPAX XML

## Quick Start

### Installation

```bash
pip install transformers peft torch bitsandbytes
```

### Basic Usage

```python
from transformers import AutoTokenizer, AutoModel
from peft import PeftModel
import torch

# Load base model and tokenizer
base_model_name = "GSAI-ML/LLaDA-8B-Instruct"
adapter_name = "Proximile/LLaDA-8B-BioGRID-BioPAX"

tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
base_model = AutoModel.from_pretrained(base_model_name, trust_remote_code=True, device_map="auto")

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, adapter_name)

# Example: Predict protein network
messages = [
    {
        "role": "system",
        "content": "You are a protein interaction prediction system. Given a list of proteins with their sequence and structural information, predict all likely interactions between them in compressed BioPAX format."
    },
    {
        "role": "user",
        "content": """Predict the protein interaction network for these proteins:

PROTEIN: TP53
UniProt ID: P04637
Full Name: Tumor protein p53
Organism: Homo sapiens
Sequence Length: 393 amino acids
AlphaFold Structure: Available
Function: Tumor suppressor that prevents cancer formation

PROTEIN: MDM2
UniProt ID: Q00987
Full Name: E3 ubiquitin-protein ligase Mdm2
Organism: Homo sapiens
Sequence Length: 491 amino acids
AlphaFold Structure: Available
Function: Regulates p53 tumor suppressor"""
    }
]

# Generating the prediction requires LLaDA's diffusion sampling loop;
# see generate() in the Complete Generation Example section below.
```
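
The user message follows a fixed field layout per protein, so it can be assembled from structured records instead of being typed by hand. The following is a minimal sketch under stated assumptions: `format_protein` and `build_user_prompt` are hypothetical helper names (not part of this adapter or any library), the record keys are invented for illustration, and the field order simply mirrors the example prompt.

```python
def format_protein(record):
    """Render one protein using the field labels from the example prompt."""
    lines = [
        f"PROTEIN: {record['name']}",
        f"UniProt ID: {record['uniprot_id']}",
        f"Full Name: {record['full_name']}",
        f"Organism: {record['organism']}",
        f"Sequence Length: {record['sequence_length']} amino acids",
    ]
    # Assumption: the card only ever shows the value "Available", so this
    # sketch emits the line only when a structure exists instead of guessing
    # how a missing structure should be written.
    if record.get("has_alphafold"):
        lines.append("AlphaFold Structure: Available")
    # The Function line is optional in the card's examples.
    if record.get("function"):
        lines.append(f"Function: {record['function']}")
    return "\n".join(lines)


def build_user_prompt(records):
    """Join the per-protein blocks under the instruction line from the example."""
    header = "Predict the protein interaction network for these proteins:"
    return header + "\n\n" + "\n\n".join(format_protein(r) for r in records)


proteins = [
    {
        "name": "TP53",
        "uniprot_id": "P04637",
        "full_name": "Tumor protein p53",
        "organism": "Homo sapiens",
        "sequence_length": 393,
        "has_alphafold": True,
        "function": "Tumor suppressor that prevents cancer formation",
    },
    {
        "name": "MDM2",
        "uniprot_id": "Q00987",
        "full_name": "E3 ubiquitin-protein ligase Mdm2",
        "organism": "Homo sapiens",
        "sequence_length": 491,
        "has_alphafold": True,
        "function": "Regulates p53 tumor suppressor",
    },
]

messages = [{"role": "user", "content": build_user_prompt(proteins)}]
print(messages[0]["content"])
```

Keeping the layout in one function makes it easier to stay close to the prompt format shown in this card, which is presumably what the adapter saw during fine-tuning.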

## Training Details

### Base Model
- **Architecture**: LLaDA (Large Language Diffusion with mAsking)
- **Base Model**: GSAI-ML/LLaDA-8B-Instruct
- **Parameters**: 8.02B (base model)
- **Adapter Type**: LoRA (Low-Rank Adaptation)

### LoRA Configuration
- **Method**: Supervised Fine-Tuning (SFT) with LoRA
- **LoRA Settings**:
  - Rank (r): 256 (16 × 16 multiplier)
  - Alpha: 512 (256 × 2, i.e. an alpha/r ratio of 2)
  - Target Modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
- **Training Data**: BioGRID-Conv dataset with 5,000+ protein neighborhoods
- **Context Length**: Up to 1,024 tokens (context) + 512 tokens (generation)
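
For orientation, the settings above imply a scaling factor of alpha / r = 2 under the standard LoRA formulation. The sketch below works that out and estimates how many adapter parameters a single wrapped weight matrix receives. `LoraSettings` and `lora_params_per_matrix` are hypothetical names used only here, and the 4096 × 4096 projection shape is an assumption for illustration, not a figure taken from this card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSettings:
    """Values listed in this card's LoRA configuration."""
    r: int = 256
    alpha: int = 512
    target_modules: tuple = (
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    )

    @property
    def scaling(self) -> float:
        # Standard LoRA scales the low-rank update by alpha / r
        return self.alpha / self.r


def lora_params_per_matrix(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) next to a d_out x d_in weight."""
    return r * (d_in + d_out)


settings = LoraSettings()
print(settings.scaling)  # 2.0

# Assumed 4096 x 4096 projection, for illustration only
print(lora_params_per_matrix(4096, 4096, settings.r))  # 2097152
```

With the `peft` library, these values correspond to the `r`, `lora_alpha`, and `target_modules` arguments of `LoraConfig`.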

### Data Sources
- **BioGRID 4.4.246**: 2.8M+ protein/genetic interactions from 86K+ publications
- **UniProt**: Protein sequences, functional annotations, organism data
- **AlphaFold**: AI-predicted protein structures, confidence scores

## Complete Generation Example

```python
import torch
from transformers import AutoTokenizer, AutoModel
from peft import PeftModel

# Constants for LLaDA generation
MASK_TOKEN_ID = 126336  # token ID that marks positions not yet generated


def add_gumbel_noise(logits, temperature):
    """Add Gumbel noise for categorical sampling in diffusion models."""
    if temperature <= 0:
        # No noise: the caller's argmax then performs greedy decoding
        return logits

    logits = logits.to(torch.float64)
    noise = torch.rand_like(logits, dtype=torch.float64)
    gumbel_noise = (-torch.log(noise)) ** temperature
    return logits.exp() / gumbel_noise


def get_num_transfer_tokens(mask_index, steps):
    """Compute how many tokens to unmask at each denoising step."""
    mask_num = mask_index.sum(dim=1, keepdim=True)

    if steps == 0:
        steps = 1

    base = mask_num // steps
    remainder = mask_num % steps

    num_transfer_tokens = torch.zeros(mask_num.size(0), steps, device=mask_index.device, dtype=torch.int64) + base

    # Spread the remainder over the earliest steps
    for i in range(mask_num.size(0)):
        if remainder[i] > 0:
            num_transfer_tokens[i, :int(remainder[i])] += 1

    return num_transfer_tokens


def generate(model, prompt, steps=128, gen_length=128, block_length=32, temperature=0.,
             remasking='low_confidence', mask_id=MASK_TOKEN_ID):
    """Generate text using LLaDA's diffusion-based process."""
    device = next(model.parameters()).device
    prompt = prompt.to(device)
    prompt_len = prompt.shape[1]

    # Start from the prompt followed by gen_length masked positions (batch size 1)
    x = torch.full((1, prompt_len + gen_length), mask_id, dtype=torch.long, device=device)
    x[:, :prompt_len] = prompt.clone()

    assert gen_length % block_length == 0
    num_blocks = gen_length // block_length

    assert steps % num_blocks == 0
    steps_per_block = steps // num_blocks

    # Blocks are denoised left to right
    for num_block in range(num_blocks):
        block_start = prompt_len + num_block * block_length
        block_end = block_start + block_length
        block_mask_index = (x[:, block_start:block_end] == mask_id)
        num_transfer_tokens = get_num_transfer_tokens(block_mask_index, steps_per_block)

        for i in range(steps_per_block):
            mask_index = (x == mask_id)
            if not mask_index.any():
                break

            # One full forward pass over prompt + generation window
            outputs = model(x)
            logits = outputs.logits

            logits_with_noise = add_gumbel_noise(logits, temperature=temperature)
            x0 = torch.argmax(logits_with_noise, dim=-1)

            if remasking == 'low_confidence':
                # Confidence = probability assigned to the predicted token
                p = torch.nn.functional.softmax(logits.to(torch.float64), dim=-1)
                x0_p = torch.squeeze(
                    torch.gather(p, dim=-1, index=torch.unsqueeze(x0, -1)), -1)
            elif remasking == 'random':
                x0_p = torch.rand((x0.shape[0], x0.shape[1]), device=x0.device)
            else:
                raise NotImplementedError(remasking)

            # Never unmask positions beyond the current block
            x0_p[:, block_end:] = -float('inf')

            x0 = torch.where(mask_index, x0, x)
            confidence = torch.where(mask_index, x0_p, -float('inf'))

            # Commit the most confident predictions for this step
            transfer_index = torch.zeros_like(x0, dtype=torch.bool, device=x0.device)
            for j in range(confidence.shape[0]):
                _, select_index = torch.topk(confidence[j], k=int(num_transfer_tokens[j, i]))
                transfer_index[j, select_index] = True
            x[transfer_index] = x0[transfer_index]

    return x


def predict_protein_network(model, tokenizer, messages, temperature=0.1, gen_length=512, steps=128):
    """Generate protein network prediction."""
    formatted_input = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    input_ids = tokenizer(formatted_input, return_tensors="pt")["input_ids"]

    with torch.no_grad():
        output_ids = generate(
            model,
            input_ids,
            steps=steps,
            gen_length=gen_length,
            block_length=32,
            temperature=temperature,
            remasking='low_confidence'
        )

    # Decode only the generated part and cut at the first special token
    generated_text = tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=False).split("<|")[0]
    return generated_text


# Load model
base_model_name = "GSAI-ML/LLaDA-8B-Instruct"
adapter_name = "Proximile/LLaDA-8B-BioGRID-BioPAX"

tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
base_model = AutoModel.from_pretrained(base_model_name, trust_remote_code=True, device_map="auto")
model = PeftModel.from_pretrained(base_model, adapter_name)

# Example prediction
messages = [
    {
        "role": "user",
        "content": """Predict the protein interaction network for these proteins in compressed BioPAX format:

PROTEIN: TP53
UniProt ID: P04637
Full Name: Tumor protein p53
Organism: Homo sapiens
Sequence Length: 393 amino acids
AlphaFold Structure: Available

PROTEIN: MDM2
UniProt ID: Q00987
Full Name: E3 ubiquitin-protein ligase Mdm2
Organism: Homo sapiens
Sequence Length: 491 amino acids
AlphaFold Structure: Available"""
    }
]

result = predict_protein_network(model, tokenizer, messages)
print("Predicted Network:")
print(result)
```
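
How the `gen_length`, `block_length`, and `steps` arguments turn into a denoising schedule can be checked without loading the model. The sketch below repeats the arithmetic of `generate` and `get_num_transfer_tokens` in plain Python; `denoising_schedule` is a hypothetical helper introduced here for illustration, and it assumes every position in a block is still masked when that block starts.

```python
def denoising_schedule(gen_length=512, block_length=32, steps=128):
    """Return (num_blocks, steps_per_block, tokens unmasked at each step of a block)."""
    if gen_length % block_length != 0:
        raise ValueError("gen_length must be a multiple of block_length")
    num_blocks = gen_length // block_length

    if steps % num_blocks != 0:
        raise ValueError("steps must be a multiple of the number of blocks")
    steps_per_block = steps // num_blocks

    # Each block starts fully masked; the remainder goes to the earliest steps
    base, remainder = divmod(block_length, steps_per_block)
    tokens_per_step = [base + 1 if i < remainder else base
                       for i in range(steps_per_block)]
    return num_blocks, steps_per_block, tokens_per_step


# Defaults used by predict_protein_network
print(denoising_schedule())              # (16, 8, [4, 4, 4, 4, 4, 4, 4, 4])

# An uneven split: 32 tokens over 3 steps per block
print(denoising_schedule(128, 32, 12))   # (4, 3, [11, 11, 10])
```

Each step corresponds to one forward pass over the full sequence, so `steps` is an upper bound on the number of model calls per prediction.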

## BioPAX Output Format

The model generates protein networks in compressed BioPAX format:

```xml
<biopax>
  <proteins>
    <p id="tp53" name="TP53" uniprot="P04637" fullname="Tumor protein p53"/>
    <p id="mdm2" name="MDM2" uniprot="Q00987" fullname="E3 ubiquitin-protein ligase Mdm2"/>
  </proteins>
  <interactions>
    <i id="1" a="tp53" b="mdm2" type="Affinity Capture-Western"/>
    <i id="2" a="tp53" b="mdm2" type="Biochemical Activity"/>
  </interactions>
</biopax>
```
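
Output in this shape can be read with Python's standard library. The following is a minimal sketch, assuming the generated text is well-formed and uses only the element and attribute names visible in the example above (`proteins/p`, `interactions/i`, `a`, `b`, `type`); `parse_compressed_biopax` is a hypothetical helper name, and the full schema is not documented in this card.

```python
import xml.etree.ElementTree as ET


def parse_compressed_biopax(xml_text):
    """Return (proteins keyed by id, list of (a, b, type) interaction tuples)."""
    root = ET.fromstring(xml_text)
    proteins = {p.get("id"): dict(p.attrib) for p in root.iterfind("./proteins/p")}
    interactions = [
        (i.get("a"), i.get("b"), i.get("type"))
        for i in root.iterfind("./interactions/i")
    ]
    return proteins, interactions


example = """<biopax>
  <proteins>
    <p id="tp53" name="TP53" uniprot="P04637" fullname="Tumor protein p53"/>
    <p id="mdm2" name="MDM2" uniprot="Q00987" fullname="E3 ubiquitin-protein ligase Mdm2"/>
  </proteins>
  <interactions>
    <i id="1" a="tp53" b="mdm2" type="Affinity Capture-Western"/>
    <i id="2" a="tp53" b="mdm2" type="Biochemical Activity"/>
  </interactions>
</biopax>"""

proteins, interactions = parse_compressed_biopax(example)
print(sorted(proteins))   # ['mdm2', 'tp53']
for a, b, kind in interactions:
    print(f"{proteins[a]['name']} - {proteins[b]['name']}: {kind}")
```

The edge tuples can then be passed to a graph library or written to a table for downstream analysis.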

## Supported Task Types

1. **Complete Network Prediction**: Generate full interaction networks from protein lists
2. **New Protein Integration**: Predict interactions for new proteins in existing networks
3. **Partial Network Completion**: Fill in missing interactions in incomplete networks
4. **Property-Constrained Generation**: Generate networks meeting specific biological constraints

## Limitations

- **Diffusion-Based Generation**: LLaDA's iterative denoising may behave differently from standard autoregressive models; settings such as step count, block length, and remasking strategy affect the result
- **BioPAX Format Specificity**: Output is only usable if it precisely matches the compressed BioPAX XML schema, so generated text should be checked before further processing
- **Biological Accuracy**: Predictions are based on patterns in the training data and may miss or misstate real interactions
- **Computational Requirements**: Diffusion generation runs a full forward pass over the prompt and generation window at every denoising step, so it requires more compute than standard autoregressive inference
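
Because the format must match exactly, it can help to check generated text before using it. The sketch below is one possible check, not an official validator: `check_network` is a hypothetical helper, and the rules it applies (root tag `biopax`, unique protein ids, interaction endpoints that refer to declared proteins) are assumptions inferred from the example output in this card.

```python
import xml.etree.ElementTree as ET


def check_network(xml_text):
    """Return a list of problems found; an empty list means none were detected."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # Truncated or malformed generations end up here
        return [f"not well-formed XML: {exc}"]

    problems = []
    if root.tag != "biopax":
        problems.append(f"unexpected root element <{root.tag}>")

    ids = [p.get("id") for p in root.iterfind("./proteins/p")]
    known = set(ids)
    if len(known) != len(ids):
        problems.append("duplicate protein ids")

    # Every interaction endpoint should point at a declared protein
    for inter in root.iterfind("./interactions/i"):
        for end in ("a", "b"):
            ref = inter.get(end)
            if ref not in known:
                problems.append(
                    f"interaction {inter.get('id')} references unknown protein {ref!r}"
                )
    return problems


good = ('<biopax><proteins><p id="tp53"/><p id="mdm2"/></proteins>'
        '<interactions><i id="1" a="tp53" b="mdm2" type="Biochemical Activity"/>'
        '</interactions></biopax>')
dangling = ('<biopax><proteins><p id="tp53"/></proteins>'
            '<interactions><i id="1" a="tp53" b="mdm2" type="Biochemical Activity"/>'
            '</interactions></biopax>')

print(check_network(good))                    # []
print(check_network(dangling))                # one unknown-protein problem
print(check_network("<biopax><proteins>"))    # one well-formedness problem
```

A check like this only catches structural errors; it says nothing about whether a predicted interaction is biologically real.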

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{llada-8b-biogrid-biopax,
  author = {Proximile LLC},
  title = {LLaDA-8B-BioGRID-BioPAX: LoRA Adapter for Diffusion-Based Protein Interaction Network Prediction},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Proximile/LLaDA-8B-BioGRID-BioPAX}}
}
```

Also cite the original LLaDA paper and the BioGRID database.

## About Proximile LLC

Proximile LLC provides secure, cost-effective, and private AI solutions tailored to small and medium-sized businesses. We specialize in:

- **On-premise AI inference** solutions that ensure unparalleled privacy
- **Cost-effective hardware configurations** including specialized bioinformatics workstations
- **Secure Local AI applications** for life sciences, including protein analysis and drug discovery tools
- **Specialized services** for compliance & governance in regulated industries

Visit [proximile.llc](https://proximile.llc) to learn more about our secure, local AI solutions for your business.

## Model Updates

- **June 16, 2025**: Initial LoRA adapter release with BioGRID 4.4.246 training data
- Enhanced with UniProt and AlphaFold integration for comprehensive protein context

## License

This LoRA adapter is released under the same license as the base LLaDA model (MIT, as declared in this card's metadata).