---
license: mit
---
# LLaDA-8B-BioGRID-BioPAX

This repository contains a specialized LoRA adapter for [GSAI-ML/LLaDA-8B-Instruct](https://huggingface.co/GSAI-ML/LLaDA-8B-Instruct), fine-tuned by **Proximile LLC** to predict protein interaction networks in the BioPAX format. The adapter combines LLaDA's diffusion-based generation with biological knowledge drawn from the BioGRID, UniProt, and AlphaFold databases.

## 🧬 Model Description

LLaDA-8B-BioGRID-BioPAX is a LoRA (Low-Rank Adaptation) adapter that specializes the base LLaDA model for predicting and completing protein interaction networks. The adapter lets the model take both sequence-level and structural characteristics of proteins into account, and it keeps LLaDA's iterative denoising process to generate biologically plausible protein networks in compressed BioPAX format.

### Key Capabilities

- **Sequence-Aware Network Prediction**: Generate complete interaction networks from protein lists with sequence/structure context
- **Structure-Guided Network Completion**: Complete partial networks using structural compatibility information  
- **New Protein Integration**: Predict interactions for novel proteins based on sequence similarity and structural features
- **Multi-Modal Biological Reasoning**: Combine interaction patterns with sequence and structural data
- **BioPAX Format Generation**: Output structured biological pathway data in compressed BioPAX XML

## 🚀 Quick Start

### Installation

```bash
pip install transformers peft torch bitsandbytes
```

### Basic Usage

```python
from transformers import AutoTokenizer, AutoModel
from peft import PeftModel
import torch

# Load base model and tokenizer
base_model_name = "GSAI-ML/LLaDA-8B-Instruct"
adapter_name = "Proximile/LLaDA-8B-BioGRID-BioPAX"

tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
base_model = AutoModel.from_pretrained(base_model_name, trust_remote_code=True, device_map="auto")

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, adapter_name)

# Example: Predict protein network
messages = [
    {
        "role": "system",
        "content": "You are a protein interaction prediction system. Given a list of proteins with their sequence and structural information, predict all likely interactions between them in compressed BioPAX format."
    },
    {
        "role": "user",
        "content": """Predict the protein interaction network for these proteins:

PROTEIN: TP53
  UniProt ID: P04637
  Full Name: Tumor protein p53
  Organism: Homo sapiens
  Sequence Length: 393 amino acids
  AlphaFold Structure: Available
  Function: Tumor suppressor that prevents cancer formation

PROTEIN: MDM2  
  UniProt ID: Q00987
  Full Name: E3 ubiquitin-protein ligase Mdm2
  Organism: Homo sapiens
  Sequence Length: 491 amino acids
  AlphaFold Structure: Available
  Function: Regulates p53 tumor suppressor"""
    }
]

# Generate the network prediction using LLaDA's diffusion process.
# This requires a custom generate() function; see "Complete Generation Example" below.
```

## 🔬 Training Details

### Base Model
- **Architecture**: LLaDA (Large Language Diffusion with mAsking)
- **Base Model**: GSAI-ML/LLaDA-8B-Instruct
- **Parameters**: 8.02B (base model)
- **Adapter Type**: LoRA (Low-Rank Adaptation)

### LoRA Configuration
- **Method**: Supervised Fine-Tuning (SFT) with LoRA
- **LoRA Settings**:
  - Rank (r): 256 (16 × 16 multiplier)
  - Alpha: 512 (2 × rank, i.e. alpha/r = 2)
  - Target Modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
- **Training Data**: BioGRID-Conv dataset with 5,000+ protein neighborhoods
- **Context Length**: Up to 1,024 tokens (context) + 512 tokens (generation)
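
To make the rank and alpha values above concrete, the following minimal sketch computes the LoRA scaling factor and the number of trainable parameters added to a single linear layer. It assumes the standard LoRA formulation (update scaled by alpha / r, not the rank-stabilized variant). The function names and the 4096 × 4096 layer shape are hypothetical and chosen only for illustration; the real layer dimensions should be read from the base model's configuration.

```python
def lora_scaling(alpha: int, rank: int) -> float:
    """Scaling factor applied to the low-rank update in standard LoRA (alpha / r)."""
    return alpha / rank


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one linear layer of shape d_in x d_out.

    LoRA learns two matrices: A (rank x d_in) and B (d_out x rank).
    """
    return rank * (d_in + d_out)


# Values taken from the configuration listed above.
RANK, ALPHA = 256, 512
print(lora_scaling(ALPHA, RANK))  # 2.0

# HYPOTHETICAL layer shape, for illustration only.
print(lora_param_count(4096, 4096, RANK))  # 2097152
```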

### Data Sources
- **BioGRID 4.4.246**: 2.8M+ protein/genetic interactions from 86K+ publications
- **UniProt**: Protein sequences, functional annotations, organism data
- **AlphaFold**: AI-predicted protein structures, confidence scores

## 💻 Complete Generation Example

```python
import torch
from transformers import AutoTokenizer, AutoModel
from peft import PeftModel

# Constants for LLaDA generation
MASK_TOKEN_ID = 126336

def add_gumbel_noise(logits, temperature):
    """Add Gumbel noise for categorical sampling in diffusion models."""
    if temperature <= 0:
        return logits
        
    logits = logits.to(torch.float64)
    noise = torch.rand_like(logits, dtype=torch.float64)
    gumbel_noise = (- torch.log(noise)) ** temperature
    return logits.exp() / gumbel_noise

def get_num_transfer_tokens(mask_index, steps):
    """Compute tokens to transition at each denoising step."""
    mask_num = mask_index.sum(dim=1, keepdim=True)
    
    if steps == 0:
        steps = 1
        
    base = mask_num // steps
    remainder = mask_num % steps
    
    num_transfer_tokens = torch.zeros(mask_num.size(0), steps, device=mask_index.device, dtype=torch.int64) + base
    
    for i in range(mask_num.size(0)):
        if remainder[i] > 0:
            num_transfer_tokens[i, :remainder[i]] += 1
            
    return num_transfer_tokens

def generate(model, prompt, steps=128, gen_length=128, block_length=32, temperature=0.,
             remasking='low_confidence', mask_id=MASK_TOKEN_ID):
    """Generate text using LLaDA's diffusion-based process."""
    device = next(model.parameters()).device
    prompt = prompt.to(device)
    
    x = torch.full((1, prompt.shape[1] + gen_length), mask_id, dtype=torch.long).to(device)
    x[:, :prompt.shape[1]] = prompt.clone()
    
    prompt_index = (x != mask_id)
    
    assert gen_length % block_length == 0
    num_blocks = gen_length // block_length
    
    assert steps % num_blocks == 0
    steps_per_block = steps // num_blocks
    
    for num_block in range(num_blocks):
        # Slice out the current block of the generation window.
        block_start = prompt.shape[1] + num_block * block_length
        block_end = block_start + block_length
        block_mask_index = (x[:, block_start:block_end] == mask_id)
        num_transfer_tokens = get_num_transfer_tokens(block_mask_index, steps_per_block)
        
        for i in range(steps_per_block):
            mask_index = (x == mask_id)
            if not mask_index.any():
                break
                
            outputs = model(x)
            logits = outputs.logits
            
            logits_with_noise = add_gumbel_noise(logits, temperature=temperature)
            x0 = torch.argmax(logits_with_noise, dim=-1)
            
            if remasking == 'low_confidence':
                p = torch.nn.functional.softmax(logits.to(torch.float64), dim=-1)
                x0_p = torch.squeeze(
                    torch.gather(p, dim=-1, index=torch.unsqueeze(x0, -1)), -1)
            elif remasking == 'random':
                x0_p = torch.rand((x0.shape[0], x0.shape[1]), device=x0.device)
            else:
                raise NotImplementedError(remasking)
            
            # Never unmask positions that lie beyond the current block.
            x0_p[:, prompt.shape[1] + (num_block + 1) * block_length:] = -float('inf')
            
            # Keep tokens that are already decoded; only masked positions compete.
            x0 = torch.where(mask_index, x0, x)
            confidence = torch.where(mask_index, x0_p, -float('inf'))
            
            # Commit the k most confident predictions scheduled for this step.
            transfer_index = torch.zeros_like(x0, dtype=torch.bool, device=x0.device)
            for j in range(confidence.shape[0]):
                _, select_index = torch.topk(confidence[j], k=num_transfer_tokens[j, i])
                transfer_index[j, select_index] = True
            x[transfer_index] = x0[transfer_index]
    
    return x

def predict_protein_network(model, tokenizer, messages, temperature=0.1, gen_length=512, steps=128):
    """Generate protein network prediction."""
    formatted_input = tokenizer.apply_chat_template(
        messages, 
        tokenize=False, 
        add_generation_prompt=True
    )
    
    input_ids = tokenizer(formatted_input, return_tensors="pt")["input_ids"]
    
    with torch.no_grad():
        output_ids = generate(
            model, 
            input_ids, 
            steps=steps,
            gen_length=gen_length,
            block_length=32,
            temperature=temperature,
            remasking='low_confidence'
        )
    
    generated_text = tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=False).split("<|")[0]
    return generated_text

# Load model
base_model_name = "GSAI-ML/LLaDA-8B-Instruct"
adapter_name = "Proximile/LLaDA-8B-BioGRID-BioPAX"

tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
base_model = AutoModel.from_pretrained(base_model_name, trust_remote_code=True, device_map="auto")
model = PeftModel.from_pretrained(base_model, adapter_name)

# Example prediction
messages = [
    {
        "role": "user",
        "content": """Predict the protein interaction network for these proteins in compressed BioPAX format:

PROTEIN: TP53
  UniProt ID: P04637
  Full Name: Tumor protein p53
  Organism: Homo sapiens
  Sequence Length: 393 amino acids
  AlphaFold Structure: Available

PROTEIN: MDM2
  UniProt ID: Q00987  
  Full Name: E3 ubiquitin-protein ligase Mdm2
  Organism: Homo sapiens
  Sequence Length: 491 amino acids
  AlphaFold Structure: Available"""
    }
]

result = predict_protein_network(model, tokenizer, messages)
print("Predicted Network:")
print(result)
```
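
The `steps`, `gen_length`, and `block_length` arguments of `generate()` are tied together by the two assertions in that function. The sketch below mirrors the same arithmetic with plain integers, so a parameter combination can be checked without loading the model. The helper names `tokens_per_step` and `denoising_schedule` are hypothetical and not part of the code above.

```python
def tokens_per_step(masked: int, steps: int) -> list:
    """Split `masked` tokens over `steps` denoising steps.

    Follows get_num_transfer_tokens(): every step gets the integer share,
    and the first `remainder` steps get one extra token.
    """
    base, remainder = divmod(masked, steps)
    return [base + 1 if i < remainder else base for i in range(steps)]


def denoising_schedule(gen_length: int, block_length: int, steps: int):
    """Return (num_blocks, steps_per_block, tokens unmasked at each step of a block)."""
    if gen_length % block_length != 0:
        raise ValueError("gen_length must be a multiple of block_length")
    num_blocks = gen_length // block_length
    if steps % num_blocks != 0:
        raise ValueError("steps must be a multiple of gen_length // block_length")
    steps_per_block = steps // num_blocks
    return num_blocks, steps_per_block, tokens_per_step(block_length, steps_per_block)


# Defaults used by predict_protein_network(): 512 tokens, blocks of 32, 128 steps.
print(denoising_schedule(512, 32, 128))  # (16, 8, [4, 4, 4, 4, 4, 4, 4, 4])
```

With these defaults the generation window is split into 16 blocks, each block is denoised in 8 steps, and 4 tokens are committed per step.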

## 📊 BioPAX Output Format

The model generates protein networks in compressed BioPAX format:

```xml
<biopax>
  <proteins>
    <p id="tp53" name="TP53" uniprot="P04637" fullname="Tumor protein p53"/>
    <p id="mdm2" name="MDM2" uniprot="Q00987" fullname="E3 ubiquitin-protein ligase Mdm2"/>
  </proteins>
  <interactions>
    <i id="1" a="tp53" b="mdm2" type="Affinity Capture-Western"/>
    <i id="2" a="tp53" b="mdm2" type="Biochemical Activity"/>
  </interactions>
</biopax>
```
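
Output in this format can be read with Python's standard library. The following is a minimal sketch that assumes the element and attribute names shown in the example above (`<p>` with `id`, `<i>` with `a`, `b`, `type`); the function name `parse_compressed_biopax` is hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<biopax>
  <proteins>
    <p id="tp53" name="TP53" uniprot="P04637" fullname="Tumor protein p53"/>
    <p id="mdm2" name="MDM2" uniprot="Q00987" fullname="E3 ubiquitin-protein ligase Mdm2"/>
  </proteins>
  <interactions>
    <i id="1" a="tp53" b="mdm2" type="Affinity Capture-Western"/>
    <i id="2" a="tp53" b="mdm2" type="Biochemical Activity"/>
  </interactions>
</biopax>"""


def parse_compressed_biopax(xml_text: str):
    """Return (proteins, interactions) from a compressed BioPAX string.

    proteins:     dict mapping protein id -> attribute dict
    interactions: list of (a, b, type) tuples in document order
    """
    root = ET.fromstring(xml_text)
    proteins = {p.get("id"): dict(p.attrib) for p in root.iterfind("./proteins/p")}
    interactions = [
        (i.get("a"), i.get("b"), i.get("type"))
        for i in root.iterfind("./interactions/i")
    ]
    return proteins, interactions


proteins, interactions = parse_compressed_biopax(SAMPLE)
for a, b, kind in interactions:
    print(f"{proteins[a]['name']} -- {proteins[b]['name']}: {kind}")
```

The resulting edge list can be passed to a graph library or written to a table for downstream analysis.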

## 🧪 Supported Task Types

1. **Complete Network Prediction**: Generate full interaction networks from protein lists
2. **New Protein Integration**: Predict interactions for new proteins in existing networks  
3. **Partial Network Completion**: Fill in missing interactions in incomplete networks
4. **Property-Constrained Generation**: Generate networks meeting specific biological constraints
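
For the complete network prediction task, prompts can be assembled programmatically. The sketch below reproduces the field layout of the example prompts in this card. The helper names and the dictionary keys (`name`, `uniprot_id`, and so on) are hypothetical, and prompt layouts for the other task types are not shown here because this card does not specify them.

```python
# Field labels are copied from the example prompts in this card; the
# dictionary keys are hypothetical names chosen for this sketch.
FIELDS = [
    ("UniProt ID", "uniprot_id"),
    ("Full Name", "full_name"),
    ("Organism", "organism"),
    ("Sequence Length", "sequence_length"),
    ("AlphaFold Structure", "alphafold"),
]


def format_protein(record: dict) -> str:
    """Render one protein record as an indented PROTEIN block."""
    lines = [f"PROTEIN: {record['name']}"]
    for label, key in FIELDS:
        if key not in record:
            continue  # skip fields that are not available for this protein
        value = record[key]
        if key == "sequence_length":
            value = f"{value} amino acids"
        lines.append(f"  {label}: {value}")
    return "\n".join(lines)


def build_messages(records: list) -> list:
    """Build a chat message list for complete network prediction."""
    header = ("Predict the protein interaction network for these proteins "
              "in compressed BioPAX format:")
    body = "\n\n".join(format_protein(r) for r in records)
    return [{"role": "user", "content": f"{header}\n\n{body}"}]


messages = build_messages([
    {"name": "TP53", "uniprot_id": "P04637", "full_name": "Tumor protein p53",
     "organism": "Homo sapiens", "sequence_length": 393, "alphafold": "Available"},
    {"name": "MDM2", "uniprot_id": "Q00987",
     "full_name": "E3 ubiquitin-protein ligase Mdm2",
     "organism": "Homo sapiens", "sequence_length": 491, "alphafold": "Available"},
])
print(messages[0]["content"])
```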

## ⚠️ Limitations

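- **Diffusion-Based Generation**: LLaDA decodes by iterative denoising rather than left to right, so it may behave differently from standard autoregressive models, and its output depends on sampling settings such as `steps`, `block_length`, and `remasking`
- **BioPAX Format Specificity**: Downstream use requires output that precisely matches the compressed BioPAX XML schema shown above
- **Biological Accuracy**: Predictions reflect patterns in the training data and may not correspond to real biological interactions
- **Computational Requirements**: Each denoising step runs a forward pass over the full sequence (prompt plus generation window), so diffusion generation typically requires more compute than standard autoregressive inference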
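
Because generation fills a fixed-length window, generated XML can be truncated or can reference proteins that were never declared. One way to guard against this is a structural check before the output is used. The sketch below is an assumption about what such a check might look like, not part of the released code; `validate_network` is a hypothetical helper that relies only on the element names shown in the output format section.

```python
import xml.etree.ElementTree as ET


def validate_network(xml_text: str) -> list:
    """Return a list of problems found in a compressed BioPAX string.

    An empty list means the text is well-formed XML, has a <biopax> root,
    and every interaction endpoint refers to a declared protein id.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]

    problems = []
    if root.tag != "biopax":
        problems.append(f"unexpected root element: {root.tag!r}")

    declared = {p.get("id") for p in root.iterfind("./proteins/p")}
    for inter in root.iterfind("./interactions/i"):
        for end in ("a", "b"):
            ref = inter.get(end)
            if ref not in declared:
                problems.append(
                    f"interaction {inter.get('id')!r}: unknown protein {ref!r}")
    return problems


good = ('<biopax><proteins><p id="tp53"/><p id="mdm2"/></proteins>'
        '<interactions><i id="1" a="tp53" b="mdm2" type="Biochemical Activity"/>'
        '</interactions></biopax>')
truncated = '<biopax><proteins><p id="tp53"/>'

print(validate_network(good))       # []
print(validate_network(truncated))  # one "not well-formed XML" entry
```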

## 📚 Citation

If you use this model in your research, please cite:

```bibtex
@misc{llada-8b-biogrid-biopax,
  author = {Proximile LLC},
  title = {LLaDA-8B-BioGRID-BioPAX: LoRA Adapter for Diffusion-Based Protein Interaction Network Prediction},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Proximile/LLaDA-8B-BioGRID-BioPAX}}
}
```

Please also cite the original LLaDA paper and the BioGRID database.

## 🏢 About Proximile LLC

Proximile LLC provides secure, cost-effective, and private AI solutions tailored to small and medium-sized businesses. We specialize in:

- **On-premise AI inference** solutions that ensure unparalleled privacy
- **Cost-effective hardware configurations** including specialized bioinformatics workstations
- **Secure Local AI applications** for life sciences, including protein analysis and drug discovery tools
- **Specialized services** for compliance & governance in regulated industries

Visit [proximile.llc](https://proximile.llc) to learn more about our secure, local AI solutions for your business.

## 🔄 Model Updates

- **June 16, 2025** – Initial LoRA adapter release with BioGRID 4.4.246 training data
- Enhanced with UniProt and AlphaFold integration for comprehensive protein context

## 📄 License

This LoRA adapter is released under the MIT license, the same license as the base LLaDA model.