Update README.md
README.md
CHANGED
|
@@ -1,3 +1,216 @@
|
|
| 1 |
-
---
|
| 2 |
-
license: apache-2.0
|
| 3 |
-
|
| 1 |
+
---
|
| 2 |
+
license: apache-2.0
|
| 3 |
+
datasets:
|
| 4 |
+
- Allanatrix/Scientific_Research_Tokenized
|
| 5 |
+
language:
|
| 6 |
+
- en
|
| 7 |
+
base_model:
|
| 8 |
+
- Allanatrix/NexaMOE_Mini
|
| 9 |
+
pipeline_tag: text-generation
|
| 10 |
+
tags:
|
| 11 |
+
- Science
|
| 12 |
+
- Hypothesis
|
| 13 |
+
- Methodology
|
| 14 |
+
---
|
| 15 |
+
|
| 16 |
+
# NexaMOE Family of Models
|
| 17 |
+
|
| 18 |
+
## Welcome to the NexaMOE Repository!
|
| 19 |
+
|
| 20 |
+
Get ready to supercharge your scientific research with the **NexaMOE family of models**! This Hugging Face repository hosts a powerful suite of Mixture-of-Experts (MoE) models designed to generate hypotheses and methodologies across **physics**, **biology**, and **materials science**. Built with efficiency and scalability in mind, the NexaMOE family includes the baseline **NexaMOE**, the reasoning-enhanced **NEXA-CoT**, and the long-context powerhouse **NEXA-Ultramax**. Whether you’re a researcher tackling complex STEM problems, a data scientist exploring scientific ML, or a student learning about domain-specific AI, this repository is your go-to resource for cutting-edge scientific computation.
|
| 21 |
+
|
| 22 |
+
## Model Overview
|
| 23 |
+
|
| 24 |
+
The NexaMOE family spans models from ~110 million to ~2.2 billion parameters. Each model uses a **Semantic Router** to direct queries to domain-specific expert modules (Physics, Biology, Materials Science). It is optimized for resource-constrained environments through its training strategy, hardware optimizations, and techniques such as reinforcement learning and sparse attention. Below are the current and planned models:
|
| 25 |
+
|
| 26 |
+
### 1. NexaMOE_Mini (In Progress)
|
| 27 |
+
- **Parameters**: ~110 million
|
| 28 |
+
- **Purpose**: Generates hypotheses and methodological scaffolding for scientific tasks in physics, biology, and materials science.
|
| 29 |
+
- **Architecture**:
|
| 30 |
+
- **Semantic Router**: BERT-based classifier routes queries to domain-specific experts.
|
| 31 |
+
- **Expert Modules**: T5-based submodules for Physics, Biology, and Materials Science.
|
| 32 |
+
- **Inference & Validation Pipeline**: Aggregates expert outputs and ensures consistency.
|
| 33 |
+
- **Knowledge Feedback Loop**: Refines routing using reinforcement learning.
|
| 34 |
+
- **Training**:
|
| 35 |
+
- Pretrained on ~325M tokens from arXiv, PubMed, and other scientific corpora.
|
| 36 |
+
- Fine-tuned with QLoRA on 300k instruction-style samples.
|
| 37 |
+
- Uses AzureSky Optimizer (Stochastic Approximation + Adam hybrid).
|
| 38 |
+
- **Use Cases**:
|
| 39 |
+
- Generate plausible hypotheses (e.g., new material properties).
|
| 40 |
+
- Suggest experimental methods (e.g., protein folding protocols).
|
| 41 |
+
- Summarize scientific texts with domain-specific insights.
|
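The routing step described above can be sketched in a few lines. This is only an illustration: a simple keyword-overlap score stands in for the BERT-based classifier, and the keyword lists, `route`, and the placeholder experts are hypothetical names that do not come from the NexaMOE code.

```python
from typing import Callable, Dict, Set

# Hypothetical keyword lists; a real router would use a trained classifier.
DOMAIN_KEYWORDS: Dict[str, Set[str]] = {
    "physics": {"quantum", "particle", "dark", "matter", "gravity"},
    "biology": {"protein", "cell", "gene", "enzyme", "folding"},
    "materials": {"alloy", "superconductor", "crystal", "polymer", "lattice"},
}


def route(query: str) -> str:
    """Return the domain whose keyword set overlaps the query the most."""
    cleaned = query.lower().replace(".", " ").replace(",", " ")
    words = set(cleaned.split())
    scores = {domain: len(words & kws) for domain, kws in DOMAIN_KEYWORDS.items()}
    # Ties (including a query with no matches) fall back to the first domain listed.
    return max(scores, key=scores.get)


# Placeholder experts: each one just labels the query with its domain.
EXPERTS: Dict[str, Callable[[str], str]] = {
    domain: (lambda q, d=domain: f"[{d} expert] draft for: {q}")
    for domain in DOMAIN_KEYWORDS
}


def answer(query: str) -> str:
    """Route the query, then hand it to the selected expert only."""
    return EXPERTS[route(query)](query)


if __name__ == "__main__":
    print(answer("Propose a method to predict protein folding."))
```

Only the selected expert runs, which is the property that keeps a Mixture-of-Experts model cheap at inference time.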
| 42 |
+
|
| 43 |
+
### 2. NEXA-CoT (Coming Soon)
|
| 44 |
+
- **Parameters**: ~110 million
|
| 45 |
+
- **Purpose**: Enhances step-by-step logical reasoning for complex STEM tasks, like physics problem-solving or interdisciplinary hypothesis generation.
|
| 46 |
+
- **Architecture**:
|
| 47 |
+
- Adds a **Chain of Thought (CoT) Processor** with sparse attention (Longformer-style) for multi-step reasoning.
|
| 48 |
+
- Includes **Conditional Routing** to engage the CoT Processor based on a “reasoning_required” flag.
|
| 49 |
+
- Integrates with expert modules for structured, logical outputs.
|
| 50 |
+
- **Training**:
|
| 51 |
+
- Trained in three stages: Easy (basic logic), Moderate (complex tasks), Hard (advanced reasoning).
|
| 52 |
+
- Uses ~425-500M tokens, including a Reasoning Curriculum Dataset (50-75M tokens) for CoT optimization.
|
| 53 |
+
- Employs AzureSky Optimizer with reinforcement learning fine-tuning.
|
| 54 |
+
- **Use Cases**:
|
| 55 |
+
- Solve multi-step physics problems (e.g., astrophysics simulations).
|
| 56 |
+
- Generate detailed, logical methodologies (e.g., combining CFD and alloy modeling).
|
| 57 |
+
- Teach scientific reasoning in educational settings.
|
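As a rough sketch of the Conditional Routing idea above, the snippet below sends a query through a reasoning path only when its `reasoning_required` flag is set. The `Query` class, `cot_processor`, and `expert_draft` are hypothetical stand-ins, not NEXA-CoT internals; the "reasoning" here is just splitting a task into numbered steps.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Query:
    text: str
    reasoning_required: bool = False


def expert_draft(text: str) -> str:
    """Stand-in for a direct expert-module answer."""
    return f"draft: {text}"


def cot_processor(text: str) -> List[str]:
    """Stand-in for multi-step reasoning: one numbered step per ';'-separated part."""
    parts = [p.strip() for p in text.split(";") if p.strip()]
    return [f"step {i}: {p}" for i, p in enumerate(parts, start=1)]


def handle(query: Query) -> Dict[str, object]:
    """Engage the CoT path only when the flag is set; otherwise answer directly."""
    if query.reasoning_required:
        return {"path": "cot", "steps": cot_processor(query.text)}
    return {"path": "direct", "output": expert_draft(query.text)}


if __name__ == "__main__":
    print(handle(Query("estimate the orbit; check energy conservation", True)))
    print(handle(Query("define redshift")))
```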
| 58 |
+
|
| 59 |
+
### 3. NEXA-Ultramax (Coming Soon)
|
| 60 |
+
- **Parameters**: ~2.2 billion
|
| 61 |
+
- **Purpose**: Processes large scientific documents (up to 20,000 tokens) with deep contextual understanding.
|
| 62 |
+
- **Architecture**:
|
| 63 |
+
- Features a **Long Context Attention Layer** with two Flash Attention v2 layers for efficient long-sequence processing.
|
| 64 |
+
- Includes a **Longform Context Manager** to chunk inputs while preserving semantic coherence.
|
| 65 |
+
- Scales parameters using mixed precision training and gradient checkpointing.
|
| 66 |
+
- **Training**:
|
| 67 |
+
- Trained on ~600-650M tokens, including a Long-Context Corpus (100-150M tokens) of full arXiv papers and NIH grants.
|
| 68 |
+
- Uses AzureSky Optimizer with mixed precision (FP16/BF16) and gradient checkpointing.
|
| 69 |
+
- **Use Cases**:
|
| 70 |
+
- Summarize or analyze long scientific papers (e.g., 20K-token preprints).
|
| 71 |
+
- Generate hypotheses from extended contexts (e.g., patent methods).
|
| 72 |
+
- Support multi-query tasks requiring deep document understanding.
|
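One common way to chunk a long input while keeping some shared context between neighbours is a sliding window with overlap. The sketch below assumes that approach; the README does not say how the Longform Context Manager actually chunks, and `chunk_tokens` with its parameters is a hypothetical helper.

```python
from typing import List


def chunk_tokens(tokens: List[str], max_len: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of at most max_len that share `overlap` tokens."""
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("need max_len > 0 and 0 <= overlap < max_len")
    step = max_len - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


if __name__ == "__main__":
    words = [f"w{i}" for i in range(10)]
    for chunk in chunk_tokens(words, max_len=4, overlap=1):
        print(chunk)
```

The overlap trades a little redundant computation for continuity: each chunk starts with the tail of the previous one, so a sentence cut at a boundary is still seen whole by at least one window when the overlap is large enough.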
| 73 |
+
|
| 74 |
+
### Future Models (Planned)
|
| 75 |
+
- **NEXA-MOE-Mini**: A lightweight version (~50M parameters) optimized for edge devices, using ~325M tokens. Planned for low-resource environments.
|
| 76 |
+
- **NEXA-MOE-Super**: A larger-scale model (~10B parameters) for advanced scientific tasks, using ~1B tokens. Planned for high-performance computing clusters.
|
| 77 |
+
- **NEXA-MOE-MultiModal**: Integrates text, images, and graphs for scientific data analysis (e.g., protein structures, simulation plots). Planned for future research.
|
| 78 |
+
|
| 79 |
+
## Dataset and Training Details
|
| 80 |
+
|
| 81 |
+
The NexaMOE family is trained on a **tiered token strategy** to maximize efficiency and domain specificity, as outlined in the architecture document:
|
| 82 |
+
|
| 83 |
+
- **Warm Start Corpus** (100M tokens): General language understanding from FineWeb-Edu, OpenWebMath, Wikipedia, and Aristo Science Questions.
|
| 84 |
+
- **Scientific Pretraining Corpus** (200-300M tokens): Domain-specific data from arXiv (physics), PubMed/BioRxiv (biology), and Materials Project/ChemRxiv (materials science).
|
| 85 |
+
- **Instruction Fine-Tune Dataset** (25-30M tokens): 300k high-quality instruction-style samples for hypothesis and method generation.
|
| 86 |
+
- **Reasoning Curriculum Dataset** (50-75M tokens, CoT only): SciBench, OpenBookQA, and others for step-by-step reasoning.
|
| 87 |
+
- **Long-Context Corpus** (100-150M tokens, Ultramax only): Full arXiv papers, NIH grants, and USPTO patents for long-context alignment.
|
| 88 |
+
|
| 89 |
+
**Token Efficiency Strategies**:
|
| 90 |
+
- Entropy scoring to remove low-information samples.
|
| 91 |
+
- Semantic tagging (e.g., [PHYS], [BIO], [MTH]) for domain routing.
|
| 92 |
+
- Distillation using larger models (e.g., GPT-4) to summarize and structure data.
|
| 93 |
+
- Routing and filtering to activate only relevant expert paths.
|
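Entropy scoring can be illustrated with Shannon entropy over a sample's token frequencies. This is a minimal sketch under that assumption; the README does not specify the scoring formula or threshold, so `token_entropy`, `filter_low_information`, and `min_bits` are hypothetical.

```python
import math
from collections import Counter
from typing import List


def token_entropy(text: str) -> float:
    """Shannon entropy (in bits) of the whitespace-token distribution of `text`."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def filter_low_information(samples: List[str], min_bits: float = 2.0) -> List[str]:
    """Keep only samples whose token entropy reaches the threshold."""
    return [s for s in samples if token_entropy(s) >= min_bits]


if __name__ == "__main__":
    samples = ["the the the the", "spin glass order parameter"]
    print(filter_low_information(samples))
```

A sample that repeats one token scores 0 bits, while four distinct tokens score 2 bits, so repetitive or templated text is dropped first.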
| 94 |
+
|
| 95 |
+
**Total Token Budget**:
|
| 96 |
+
- NexaMOE_Mini: ~325M tokens
|
| 97 |
+
- NEXA-CoT: ~425-500M tokens
|
| 98 |
+
- NEXA-Ultramax: ~600-650M tokens
|
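As a quick arithmetic check, the tier sizes listed above can be added up as (low, high) ranges in millions of tokens. The tier keys and the `budget` helper are hypothetical; which tiers each model actually combines is an assumption here, and the stated totals may reflect filtering or a different mix.

```python
from typing import Dict, Tuple

# (low, high) token counts in millions, copied from the tier list above.
TIERS: Dict[str, Tuple[int, int]] = {
    "warm_start": (100, 100),
    "scientific_pretraining": (200, 300),
    "instruction_finetune": (25, 30),
    "reasoning_curriculum": (50, 75),
    "long_context": (100, 150),
}

BASE = ("warm_start", "scientific_pretraining", "instruction_finetune")


def budget(*names: str) -> Tuple[int, int]:
    """Sum the low and high ends of the named tiers."""
    low = sum(TIERS[n][0] for n in names)
    high = sum(TIERS[n][1] for n in names)
    return low, high


if __name__ == "__main__":
    print("base tiers:", budget(*BASE))
    print("base + reasoning:", budget(*BASE, "reasoning_curriculum"))
    print("base + long context:", budget(*BASE, "long_context"))
    print("all tiers:", budget(*BASE, "reasoning_curriculum", "long_context"))
```

The base tiers sum to 325-430M, whose low end matches the ~325M quoted for NexaMOE_Mini; adding the reasoning tier gives 375-505M, and adding every tier gives 475-655M.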
| 99 |
+
|
| 100 |
+
**Hardware**:
|
| 101 |
+
- CPU: Intel i5 vPro 8th Gen (overclocked to 6.0 GHz) with 16 GB RAM.
|
| 102 |
+
- GPUs: Dual NVIDIA T4 GPUs (cloud-hosted) at 90%+ capacity.
|
| 103 |
+
- Performance: 47-50 petaflops with an optimized CPU-GPU pipeline.
|
| 104 |
+
|
| 105 |
+
**Optimization Techniques**:
|
| 106 |
+
- Sparse attention, mixed precision training, gradient checkpointing.
|
| 107 |
+
- Hyperparameter tuning with Optuna, Just-in-Time (JIT) compilation, multi-threading.
|
| 108 |
+
- AzureSky Optimizer for efficient convergence.
|
| 109 |
+
|
| 110 |
+
|
| 111 |
+
## Download Models
|
| 112 |
+
|
| 113 |
+
Model weights are hosted on Hugging Face. Download them using the transformers library or directly from the repository’s model card.
|
| 114 |
+
Example: `huggingface-cli download your-username/nexamoe-base`
|
| 115 |
+
|
| 116 |
+
|
| 117 |
+
## Usage
|
| 118 |
+
|
| 119 |
+
**Load a Model**: Use the transformers library to load NexaMOE models:
|
| 120 |
+
from transformers import AutoModelForCausalLM, AutoTokenizer
|
| 121 |
+
|
| 122 |
+
model_name = "your-username/nexamoe-base"
|
| 123 |
+
tokenizer = AutoTokenizer.from_pretrained(model_name)
|
| 124 |
+
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
|
| 125 |
+
|
| 126 |
+
|
| 127 |
+
**Generate Hypotheses or Methods**: Provide a prompt with optional domain tags:
|
| 128 |
+
prompt = "[PHYS] Suggest a hypothesis for dark matter detection."
|
| 129 |
+
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)  # follow the device chosen by device_map="auto"
|
| 130 |
+
outputs = model.generate(**inputs, max_length=200)
|
| 131 |
+
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
|
| 132 |
+
|
| 133 |
+
|
| 134 |
+
**Use NEXA-CoT for Reasoning**: Enable the CoT Processor for step-by-step logic:
|
| 135 |
+
prompt = "[BIO] [reasoning_required] Propose a method to predict protein folding."
|
| 136 |
+
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)  # follow the device chosen by device_map="auto"
|
| 137 |
+
outputs = model.generate(**inputs, max_length=500)
|
| 138 |
+
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
|
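The tagged prompts used in these examples can be assembled with a small helper. This is a sketch only: `build_prompt` and `parse_prompt` are hypothetical, and the tag set `PHYS`, `BIO`, `MAT` is assumed from the usage examples in this README rather than from a published specification.

```python
from typing import Tuple

REASONING_FLAG = "[reasoning_required]"
KNOWN_DOMAINS = ("PHYS", "BIO", "MAT")  # assumed from the usage examples


def build_prompt(domain: str, task: str, reasoning: bool = False) -> str:
    """Prefix the task with a domain tag and, optionally, the reasoning flag."""
    if domain not in KNOWN_DOMAINS:
        raise ValueError(f"unknown domain tag: {domain}")
    parts = [f"[{domain}]"]
    if reasoning:
        parts.append(REASONING_FLAG)
    parts.append(task.strip())
    return " ".join(parts)


def parse_prompt(prompt: str) -> Tuple[str, bool, str]:
    """Recover (domain, reasoning flag, task) from a prompt built above."""
    tag, _, rest = prompt.partition("] ")
    domain = tag.lstrip("[")
    reasoning = rest.startswith(REASONING_FLAG)
    if reasoning:
        rest = rest[len(REASONING_FLAG):].lstrip()
    return domain, reasoning, rest


if __name__ == "__main__":
    prompt = build_prompt("BIO", "Propose a method to predict protein folding.", reasoning=True)
    print(prompt)
    print(parse_prompt(prompt))
```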
| 139 |
+
|
| 140 |
+
|
| 141 |
+
**Process Long Documents with NEXA-Ultramax**: Handle large inputs (up to 20,000 tokens):
|
| 142 |
+
with open("arxiv_paper.txt", "r") as f:
|
| 143 |
+
    document = f.read()
|
| 144 |
+
prompt = f"[MAT] Summarize this document: {document}"
|
| 145 |
+
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=20000).to(model.device)
|
| 146 |
+
outputs = model.generate(**inputs, max_new_tokens=1000)  # cap new tokens; max_length would also count the long prompt
|
| 147 |
+
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
|
| 148 |
+
|
| 149 |
+
|
| 150 |
+
**Fine-Tune with QLoRA**: Use the provided instruction dataset for fine-tuning:
|
| 151 |
+
from peft import LoraConfig, get_peft_model
|
| 152 |
+
from datasets import load_dataset
|
| 153 |
+
|
| 154 |
+
dataset = load_dataset("your-username/nexamoe-instruction-data")
|
| 155 |
+
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q", "v"])
|
| 156 |
+
model = get_peft_model(model, lora_config)
|
| 157 |
+
|
| 158 |
+
# Train with your preferred trainer (e.g., Hugging Face Trainer)
|
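To get a feel for what `r=8, lora_alpha=16` implies, the sketch below counts the trainable parameters LoRA adds. For a weight matrix of shape `(d_out, d_in)`, LoRA trains two low-rank factors with `r * (d_in + d_out)` parameters in total, and scales their product by `lora_alpha / r`. The hidden size of 768 and the 12 layers are hypothetical values chosen for illustration, not NexaMOE's actual dimensions.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one LoRA adapter: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: int, r: int) -> float:
    """Factor applied to the low-rank update under standard LoRA scaling."""
    return lora_alpha / r


if __name__ == "__main__":
    # Hypothetical model shape: 12 layers, hidden size 768, adapters on "q" and "v".
    hidden, layers, targets_per_layer = 768, 12, 2
    per_matrix = lora_params(hidden, hidden, r=8)
    total = per_matrix * layers * targets_per_layer
    print("per matrix:", per_matrix)
    print("total trainable:", total)
    print("scaling:", lora_scaling(16, 8))
```

Under these assumed dimensions the adapters add roughly 0.3M trainable parameters, a small fraction of a ~110M-parameter model, which is why this style of fine-tuning fits on modest GPUs.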
| 159 |
+
|
| 160 |
+
**Run Inference via CLI or GUI**:
|
| 161 |
+
|
| 162 |
+
- Command-Line: `python inference.py --model your-username/nexamoe-base --prompt "[PHYS] Hypothesize a new superconductor."`
|
| 163 |
+
|
| 164 |
+
|
| 165 |
+
- Gradio GUI: `python app.py`
|
| 166 |
+
|
| 167 |
+
Opens a web interface to interact with the model.
|
| 168 |
+
|
| 169 |
+
|
| 170 |
+
## Model Weights and Datasets
|
| 171 |
+
|
| 172 |
+
**Models**:
|
| 173 |
+
- `your-username/nexamoe-base`: Baseline NexaMOE (110M parameters).
|
| 174 |
+
- `your-username/nexamoe-cot`: NEXA-CoT (110M parameters).
|
| 175 |
+
- `your-username/nexamoe-ultramax`: NEXA-Ultramax (2.2B parameters).
|
| 176 |
+
|
| 177 |
+
|
| 178 |
+
**Datasets**:
|
| 179 |
+
- `your-username/nexamoe-instruction-data`: 300k instruction-style samples for QLoRA fine-tuning.
|
| 180 |
+
- `your-username/nexamoe-reasoning-data`: Reasoning Curriculum Dataset for CoT training.
|
| 181 |
+
- `your-username/nexamoe-long-context-data`: Long-Context Corpus for Ultramax training.
|
| 182 |
+
|
| 183 |
+
|
| 184 |
+
## Requirements
|
| 185 |
+
|
| 186 |
+
- **Hardware**: NVIDIA GPU with 16-24 GB VRAM (e.g., T4, A100) for training and inference. CPU fallback is supported for preprocessing.
|
| 187 |
+
- **Software**: Python 3.10, PyTorch, Transformers, Accelerate, PEFT, Optuna, Gradio.
|
| 188 |
+
|
| 189 |
+
## Performance Metrics
|
| 190 |
+
|
| 191 |
+
- **Extreme Specialization**: Modular experts improve response fidelity and interpretability.
|
| 192 |
+
- **Distributed Training**: Full hardware saturation stabilizes runtimes and reduces crashes.
|
| 193 |
+
- **Generalizability**: Robust across physics, biology, and materials science tasks.
|
| 194 |
+
- **Optimizer Efficiency**: AzureSky Optimizer enhances convergence speed and precision.
|
| 195 |
+
|
| 196 |
+
See the architecture document for detailed loss curves and metrics.
|
| 197 |
+
## Similar Models
|
| 198 |
+
Explore related models for inspiration:
|
| 199 |
+
|
| 200 |
+
- Grok (xAI): General-purpose conversational AI with scientific capabilities.
|
| 201 |
+
- LLaMA (Meta AI): Efficient research models for NLP tasks.
|
| 202 |
+
- SciBERT: BERT variant for scientific text processing.
|
| 203 |
+
- Galactica (Meta AI): Scientific language model for paper summarization.
|
| 204 |
+
- BioBERT: BERT variant for biomedical text.
|
| 205 |
+
|
| 206 |
+
For the models, cite:
|
| 207 |
+
|
| 208 |
+
Allanatrix. (2025). NexaMOE Family of Models. Retrieved 6/17/2025.
|
| 209 |
+
|
| 210 |
+
## Acknowledgements
|
| 211 |
+
We thank the scientific and AI communities for advancing Mixture-of-Experts architectures and domain-specific LLMs. Special thanks to the authors of the datasets used (arXiv, PubMed, Materials Project) and the developers of tools like Transformers, PEFT, and Optuna.
|
| 212 |
+
For more information, see: https://materialsproject.org/, https://arxiv.org/, https://pubmed.ncbi.nlm.nih.gov/
|
| 213 |
+
## License
|
| 214 |
+
Apache License 2.0, matching the `license: apache-2.0` field in this README's metadata (see LICENSE file for details).
|
| 215 |
+
|
| 216 |
+
Have questions or ideas? Open an issue on GitHub or join the discussion on Hugging Face. Happy researching!
|