---
license: mit
language:
- en
library_name: pytorch
pipeline_tag: tabular-regression
tags:
- pytorch
- transformer
- bioinformatics
- negative-binomial
- glm
- statistics
- genomics
- computational-biology
datasets:
- synthetic
metrics:
- mae
- rmse
model-index:
- name: NB-Transformer
  results:
  - task:
      type: tabular-regression
      name: Negative Binomial GLM Parameter Estimation
    dataset:
      type: synthetic
      name: Synthetic NB GLM Data
    metrics:
    - type: mae
      value: 0.152
      name: Log Fold Change MAE
    - type: inference_time
      value: 0.076
      name: Inference Time (ms)
---

# NB-Transformer: Fast Negative Binomial GLM Parameter Estimation

[![Python](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![PyTorch](https://img.shields.io/badge/PyTorch-1.10+-ee4c2c.svg)](https://pytorch.org/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

**NB-Transformer** is a fast, accurate neural network approach to Negative Binomial GLM parameter estimation, designed as a modern alternative to classical iterative fitting for the statistical analysis of count data. Using transformer-based attention, it delivers a **14.8x speedup** over classical methods while achieving **superior accuracy**.

Paper: [arxiv.org/abs/2508.04111](https://arxiv.org/abs/2508.04111)

## 🚀 Key Features

- **⚡ Ultra-Fast**: 14.8x faster than classical GLM fitting (0.076 ms vs 1.128 ms per test)
- **🎯 More Accurate**: 47% lower error on log fold change estimation
- **🔬 Complete Statistical Inference**: P-values, confidence intervals, and power analysis
- **📊 Robust**: 100% success rate vs 98.7% for classical methods
- **🧠 Transformer Architecture**: Attention-based modeling of variable-length sample sets
- **📦 Easy to Use**: Simple API with a pre-trained model included

## 📈 Performance Benchmarks

Based on comprehensive validation with 1000+ test cases:

| Method | Success Rate | Time (ms) | μ MAE | β MAE | α MAE |
|--------|--------------|-----------|-------|-------|-------|
| **NB-Transformer** | **100.0%** | **0.076** | **0.202** | **0.152** | **0.477** |
| Classical GLM | 98.7% | 1.128 | 0.212 | 0.284 | 0.854 |
| Method of Moments | 100.0% | 0.021 | 0.213 | 0.289 | 0.852 |

**Key Achievements:**

- **47% lower MAE** on β (log fold change), the critical parameter for differential expression
- **44% lower MAE** on α (dispersion), essential for proper statistical inference
- **100% convergence rate** with no numerical instabilities

## 🛠️ Installation

```bash
pip install nb-transformer
```

Or install from source:

```bash
git clone https://huggingface.co/valsv/nb-transformer
cd nb-transformer
pip install -e .
```

## 🎯 Quick Start

### Basic Usage

```python
import numpy as np

from nb_transformer import load_pretrained_model

# Load the pre-trained model (downloads automatically)
model = load_pretrained_model()

# Your data: log10(CPM + 1) transformed counts
control_samples = [2.1, 1.8, 2.3, 2.0]    # 4 control samples
treatment_samples = [1.5, 1.2, 1.7, 1.4]  # 4 treatment samples

# Get NB GLM parameters instantly
params = model.predict_parameters(control_samples, treatment_samples)

print(f"μ̂ (base mean): {params['mu']:.3f}")          # -0.245
print(f"β̂ (log fold change): {params['beta']:.3f}")  # -0.421
print(f"α̂ (log dispersion): {params['alpha']:.3f}")  # -1.832
print(f"Fold change: {np.exp(params['beta']):.2f}x")  # 0.66x (downregulated)
```

### Complete Statistical Analysis

```python
import numpy as np

from nb_transformer import load_pretrained_model
from nb_transformer.inference import compute_nb_glm_inference

# Load model and data
model = load_pretrained_model()
control_counts = np.array([1520, 1280, 1650, 1400])
treatment_counts = np.array([980, 890, 1100, 950])
control_lib_sizes = np.array([1e6, 1.1e6, 0.9e6, 1.05e6])
treatment_lib_sizes = np.array([1e6, 1.0e6, 1.1e6, 0.95e6])

# Transform to log10(CPM + 1)
control_transformed = np.log10(1e4 * control_counts / control_lib_sizes + 1)
treatment_transformed = np.log10(1e4 * treatment_counts / treatment_lib_sizes + 1)

# Get parameters
params = model.predict_parameters(control_transformed, treatment_transformed)

# Complete statistical inference
results = compute_nb_glm_inference(
    params['mu'], params['beta'], params['alpha'],
    control_counts, treatment_counts,
    control_lib_sizes, treatment_lib_sizes
)

print(f"Log fold change: {results['beta']:.3f} ± {results['se_beta']:.3f}")
print(f"P-value: {results['pvalue']:.2e}")
print(f"Significant: {'Yes' if results['pvalue'] < 0.05 else 'No'}")
```

### Quick Demo

```python
from nb_transformer import quick_inference_example

# Run a complete example with sample data
params = quick_inference_example()
```

## 🔬 Validation & Reproducibility

This package includes three comprehensive validation scripts that reproduce all key results.

### 1. Accuracy Validation

Compare parameter estimation accuracy and speed across methods:

```bash
python examples/validate_accuracy.py --n_tests 1000 --output_dir results/
```

**Expected Output:**

- Accuracy comparison plots
- Speed benchmarks
- Parameter estimation metrics
- Success rate analysis

### 2. P-value Calibration Validation

Validate that p-values are properly calibrated under the null hypothesis:

```bash
python examples/validate_calibration.py --n_tests 10000 --output_dir results/
```

**Expected Output:**

- QQ plots for p-value uniformity
- Statistical tests for calibration
- False positive rate analysis
- Calibration assessment report

### 3. Statistical Power Analysis

Evaluate statistical power across experimental designs and effect sizes:

```bash
python examples/validate_power.py --n_tests 1000 --output_dir results/
```

**Expected Output:**

- Power curves by experimental design (3v3, 5v5, 7v7, 9v9)
- Effect size analysis
- Method comparison across designs
- Statistical power benchmarks

## 🧮 Mathematical Foundation

### Model Architecture

NB-Transformer uses a specialized transformer architecture for set-to-set comparison (a rough sketch follows this list):

- **Input**: Two variable-length sets of log-transformed expression values
- **Architecture**: Pair-set transformer with intra-set and cross-set attention
- **Output**: Three parameters (μ, β, α) for the Negative Binomial GLM
- **Training**: 2.5M parameters trained on synthetic data with known ground truth

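For intuition, here is a minimal PyTorch sketch of one intra-set plus cross-set attention block. The class, layer names, and pooling head are illustrative assumptions, not the actual `nb_transformer` internals:

```python
import torch
import torch.nn as nn

class PairSetBlock(nn.Module):
    """Illustrative sketch of one pair-set attention block (not the real internals)."""

    def __init__(self, d_model: int = 128, n_heads: int = 8, dropout: float = 0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.norm_self = nn.LayerNorm(d_model)
        self.norm_cross = nn.LayerNorm(d_model)

    def forward(self, a: torch.Tensor, b: torch.Tensor):
        # Intra-set attention: each condition's samples attend to one another.
        a = self.norm_self(a + self.self_attn(a, a, a)[0])
        b = self.norm_self(b + self.self_attn(b, b, b)[0])
        # Cross-set attention: each set queries the other set's samples.
        a_out = self.norm_cross(a + self.cross_attn(a, b, b)[0])
        b_out = self.norm_cross(b + self.cross_attn(b, a, a)[0])
        return a_out, b_out

# Variable-length sets are natural here: 4 control vs 6 treatment samples.
block = PairSetBlock()
ctrl, trt = block(torch.randn(1, 4, 128), torch.randn(1, 6, 128))
# Stacking such blocks, mean-pooling each set, and regressing (μ, β, α)
# with a small MLP head would complete the picture.
```
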
### Statistical Inference

The model enables complete statistical inference through Fisher information (a minimal numerical sketch follows this list):

1. **Parameter Estimation**: Direct neural network prediction (μ̂, β̂, α̂)
2. **Fisher Weights**: W<sub>i</sub> = m<sub>i</sub>/(1 + φm<sub>i</sub>), where m<sub>i</sub> = ℓ<sub>i</sub>exp(μ̂ + x<sub>i</sub>β̂) and φ = exp(α̂) is the dispersion
3. **Standard Errors**: SE(β̂) = √[(X'WX)<sup>-1</sup>]<sub>ββ</sub>
4. **Wald Statistics**: W = β̂²/SE(β̂)² ~ χ²(1) under H₀: β = 0
5. **P-values**: Proper Type I error control, validated via calibration analysis

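To make steps 2-5 concrete, here is a minimal NumPy/SciPy sketch for a two-group design. It illustrates the math above and is not the packaged `compute_nb_glm_inference`; the function name and signature are assumptions:

```python
import numpy as np
from scipy import stats

def wald_inference(mu_hat, beta_hat, alpha_hat, lib_sizes, groups):
    """Sketch of Wald inference for a two-group NB GLM (illustrative only)."""
    phi = np.exp(alpha_hat)                      # dispersion (step 2)
    x = np.asarray(groups, dtype=float)          # x_i in {0, 1}
    lib = np.asarray(lib_sizes, dtype=float)
    m = lib * np.exp(mu_hat + x * beta_hat)      # fitted means m_i
    w = m / (1.0 + phi * m)                      # Fisher weights W_i
    X = np.column_stack([np.ones_like(x), x])    # design matrix [1, x]
    cov = np.linalg.inv(X.T @ (w[:, None] * X))  # (X'WX)^-1 (step 3)
    se_beta = np.sqrt(cov[1, 1])
    wald = (beta_hat / se_beta) ** 2             # Wald statistic (step 4)
    pvalue = stats.chi2.sf(wald, df=1)           # p-value (step 5)
    return se_beta, pvalue

# Hypothetical usage with estimates like those from the Quick Start:
se, p = wald_inference(-0.245, -0.421, -1.832,
                       lib_sizes=np.full(8, 1e6),
                       groups=[0, 0, 0, 0, 1, 1, 1, 1])
```
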
### Key Innovation

Unlike iterative maximum likelihood estimation, NB-Transformer learns the parameter mapping directly from data patterns, enabling:

- **Instant inference** without convergence issues
- **Robust parameter estimation** across challenging scenarios
- **Full statistical validity** through the Fisher information framework

## 📊 Comprehensive Validation Results

### Accuracy Across Parameter Types

| Parameter | NB-Transformer | Classical GLM | Improvement |
|-----------|----------------|---------------|-------------|
| μ (base mean) | 0.202 MAE | 0.212 MAE | **5% better** |
| β (log fold change) | **0.152 MAE** | 0.284 MAE | **47% better** |
| α (dispersion) | **0.477 MAE** | 0.854 MAE | **44% better** |

### Statistical Power Analysis

Power analysis across experimental designs shows competitive performance; power is estimated empirically, as sketched after the table:

| Design | Effect Size β=1.0 | Effect Size β=2.0 |
|--------|-------------------|-------------------|
| 3v3 samples | 85% power | 99% power |
| 5v5 samples | 92% power | >99% power |
| 7v7 samples | 96% power | >99% power |
| 9v9 samples | 98% power | >99% power |

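Empirical power is simply the fraction of truly differential simulations that reach significance. A sketch of that calculation (the helper name is illustrative, not the validation script's API):

```python
import numpy as np

def empirical_power(pvalues_under_alternative, alpha=0.05):
    """Fraction of simulated non-null tests with p < alpha."""
    return float(np.mean(np.asarray(pvalues_under_alternative) < alpha))

# Hypothetical p-values from simulations with a true effect (e.g. β = 1.0):
power = empirical_power([0.001, 0.03, 0.2, 0.004, 0.01])  # -> 0.8
```
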
### P-value Calibration

Rigorous calibration validation confirms proper statistical inference (the snippet below illustrates the uniformity check):

- **Kolmogorov-Smirnov test**: p = 0.127 (well-calibrated)
- **Anderson-Darling test**: p = 0.089 (well-calibrated)
- **False positive rate**: 5.1% at α = 0.05 (properly controlled)

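As an illustration of the uniformity check performed by `examples/validate_calibration.py` (the p-values below are simulated stand-ins, not real outputs):

```python
import numpy as np
from scipy import stats

# Stand-in for p-values collected from null simulations (β = 0).
rng = np.random.default_rng(0)
null_pvalues = rng.uniform(size=10_000)

# Under a well-calibrated test, null p-values are Uniform(0, 1).
ks_stat, ks_pvalue = stats.kstest(null_pvalues, "uniform")
false_positive_rate = np.mean(null_pvalues < 0.05)  # should be near 0.05
print(f"KS p = {ks_pvalue:.3f}, FPR at α = 0.05: {false_positive_rate:.3f}")
```
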
## 🏗️ Architecture Details

### Model Specifications

- **Model Type**: Pair-set transformer for NB GLM parameter estimation
- **Parameters**: 2.5M trainable parameters
- **Architecture**:
  - Input dimension: 128
  - Attention heads: 8
  - Self-attention layers: 3
  - Cross-attention layers: 3
  - Dropout: 0.1
- **Training**: Synthetic data with online generation
- **Validation Loss**: 0.4628 (v13 checkpoint)

### Input/Output Specification

- **Input**: Two lists of log10(CPM + 1) transformed expression values
- **Output**: Dictionary with keys 'mu', 'beta', 'alpha' (all on the log scale)
- **Sample Size**: Handles 2-20 samples per condition (variable length)
- **Expression Range**: Optimized for typical RNA-seq expression levels

## 🔧 Advanced Usage

### Custom Model Loading

```python
from nb_transformer import load_pretrained_model

# Load the model on a specific device
model = load_pretrained_model(device='cuda')  # or 'cpu', 'mps'

# Load a custom checkpoint
model = load_pretrained_model(checkpoint_path='path/to/custom.ckpt')
```

### Batch Processing

```python
# Process multiple gene comparisons efficiently
from nb_transformer.method_of_moments import estimate_batch_parameters_vectorized

control_sets = [[2.1, 1.8, 2.3], [1.9, 2.2, 1.7]]  # multiple genes
treatment_sets = [[1.5, 1.2, 1.7], [2.1, 2.4, 1.9]]

# Fast batch estimation
results = estimate_batch_parameters_vectorized(control_sets, treatment_sets)
```

### Training Custom Models

```python
from nb_transformer import train_dispersion_transformer, ParameterDistributions

# Define custom parameter distributions
param_dist = ParameterDistributions()
param_dist.mu_params = {'loc': -1.0, 'scale': 2.0}
param_dist.alpha_params = {'mean': -2.0, 'std': 1.0}
param_dist.beta_params = {'prob_de': 0.3, 'std': 1.0}

# Training configuration
config = {
    'model_config': {
        'd_model': 128,
        'n_heads': 8,
        'num_self_layers': 3,
        'num_cross_layers': 3,
        'dropout': 0.1
    },
    'batch_size': 512,
    'max_epochs': 20,
    'examples_per_epoch': 100000,
    'parameter_distributions': param_dist
}

# Train the model
results = train_dispersion_transformer(config)
```

## 📋 Requirements

### Core Dependencies

- Python ≥ 3.8
- PyTorch ≥ 1.10.0
- PyTorch Lightning ≥ 1.8.0
- NumPy ≥ 1.21.0
- SciPy ≥ 1.7.0

### Optional Dependencies

- **Validation**: `statsmodels`, `pandas`, `matplotlib`, `scikit-learn`
- **Visualization**: `plotnine`, `theme-nxn` (custom plotting theme)
- **Development**: `pytest`, `flake8`, `black`, `mypy`

## 🧪 Model Training Details

### Training Data

- **Synthetic Generation**: Online negative binomial data generation (see the sketch after this list)
- **Parameter Distributions**: Based on empirical RNA-seq statistics
- **Sample Sizes**: Variable, 2-10 samples per condition
- **Expression Levels**: Realistic RNA-seq dynamic range
- **Library Sizes**: Log-normal distribution (CV ~30%)

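As a sketch of this online generation, one training example might be drawn as follows. The distribution hyperparameters mirror the `ParameterDistributions` defaults shown earlier; the scales are illustrative and the published training setup may differ:

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_training_example():
    """Draw one synthetic NB GLM training example (illustrative sketch only)."""
    n_ctrl, n_trt = rng.integers(2, 11, size=2)   # 2-10 samples per condition
    mu = rng.normal(loc=-1.0, scale=2.0)          # log base mean
    alpha = rng.normal(loc=-2.0, scale=1.0)       # log dispersion
    # A fraction of genes are differentially expressed (prob_de = 0.3).
    beta = rng.normal(0.0, 1.0) if rng.random() < 0.3 else 0.0
    phi = np.exp(alpha)
    # Library sizes: log-normal with ~30% CV.
    lib = rng.lognormal(mean=np.log(1e6), sigma=0.3, size=n_ctrl + n_trt)
    x = np.r_[np.zeros(n_ctrl), np.ones(n_trt)]
    m = lib * np.exp(mu + x * beta)               # per-sample NB means
    # NB with mean m and dispersion phi: var = m + phi * m^2.
    counts = rng.negative_binomial(n=1.0 / phi, p=1.0 / (1.0 + phi * m))
    inputs = np.log10(1e4 * counts / lib + 1)     # same transform as inference
    return inputs[:n_ctrl], inputs[n_ctrl:], (mu, beta, alpha)
```
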
### Training Process

- **Epochs**: 100
- **Batch Size**: 32
- **Learning Rate**: 1e-4 with a ReduceLROnPlateau scheduler
- **Loss Function**: Multi-task MSE loss with parameter-specific weights (sketched after this list)
- **Validation**: Held-out synthetic data with different parameter seeds

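A minimal sketch of such a multi-task loss; the weights shown are placeholders, not the values used for the released checkpoint:

```python
import torch

def multitask_mse(pred: torch.Tensor, target: torch.Tensor,
                  weights=(1.0, 1.0, 1.0)) -> torch.Tensor:
    """Weighted MSE over (μ, β, α) predictions of shape (batch, 3)."""
    w = torch.as_tensor(weights, dtype=pred.dtype, device=pred.device)
    return ((pred - target) ** 2 * w).mean()
```
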
### Hardware Optimization

- **Apple Silicon**: Optimized for MPS (Metal Performance Shaders)
- **Multi-core CPU**: Efficient multi-worker data generation
- **Memory Usage**: Minimal memory footprint (~100MB model)
- **Inference Speed**: Single-core CPU is sufficient for real-time analysis

## 🤝 Contributing

We welcome contributions! Please see our contributing guidelines:

1. **Bug Reports**: Open issues with detailed reproduction steps
2. **Feature Requests**: Propose new functionality with use cases
3. **Code Contributions**: Fork, develop, and submit pull requests
4. **Validation**: Run the validation scripts to ensure reproducibility
5. **Documentation**: Improve examples and documentation

### Development Setup

```bash
git clone https://huggingface.co/valsv/nb-transformer
cd nb-transformer
pip install -e ".[dev,analysis]"

# Run tests
pytest tests/

# Run validation
python examples/validate_accuracy.py --n_tests 100
```

## 📖 Citation

If you use NB-Transformer in your research, please cite:

```bibtex
@software{svensson2025nbtransformer,
  title={NB-Transformer: Fast Negative Binomial GLM Parameter Estimation using Transformers},
  author={Svensson, Valentine},
  year={2025},
  url={https://huggingface.co/valsv/nb-transformer},
  version={1.0.0}
}
```

## 📚 Related Work

### Transformer Applications in Biology

- **Set-based Learning**: Zaheer et al. (2017). Deep Sets. *NIPS*.
- **Attention Mechanisms**: Vaswani et al. (2017). Attention Is All You Need. *NIPS*.
- **Biological Applications**: Rives et al. (2021). Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. *PNAS*.

## ⚖️ License

MIT License - see the [LICENSE](LICENSE) file for details.

## 🏷️ Version History

### v1.0.0 (2025-08-04)

- **Initial release** with the pre-trained v13 model
- **Complete validation suite** (accuracy, calibration, power)
- **Production-ready API** with comprehensive documentation
- **Hugging Face integration** for easy model distribution

---

**🚀 Ready to revolutionize your differential expression analysis? Install NB-Transformer today!**

```bash
pip install nb-transformer
```

For questions, issues, or contributions, visit our [Hugging Face repository](https://huggingface.co/valsv/nb-transformer) or open an issue.