---
language:
- en
license: mit
tags:
- vortex
- science
- physics
- chemistry
- biology
- mathematics
- ssm
- mamba
- hybrid-architecture
- custom-tokenizer
- from-scratch
- matrix-corp
pipeline_tag: text-generation
library_name: transformers
model_type: vortex
---
# Vortex Scientific
**Vortex Scientific** is a from-scratch AI model family designed for deep scientific reasoning. It is built from the ground up around a novel hybrid state-space + attention architecture and optimized for consumer laptop hardware (Apple Silicon MacBooks and NVIDIA RTX 4060 laptop GPUs).
## 🌟 Features
- **Novel Architecture**: Hybrid State-Space Model (SSM) + Local Attention blocks
- **Science-Specialized**: Custom tokenizer, domain-aware gating, and specialized modules for equations, numerical reasoning, citations, and molecular structures
- **Hardware Optimized**: Runs smoothly on 8GB VRAM (NVIDIA RTX 4060 laptop GPU) and 16GB unified memory (MacBook Pro M2/M3)
- **Two Model Sizes**:
- **Vortex-7B**: 7 billion parameters, fits in 8GB VRAM
- **Vortex-13B**: 13 billion parameters, fits in 16GB VRAM with quantization
- **HuggingFace Compatible**: Full integration with `transformers` library
- **From Scratch**: No base model; everything is built bottom-up, including the tokenizer and weights
## 🏗️ Architecture
Vortex uses a two-block hybrid architecture:
1. **SSM-Only Blocks**: State-space layers for efficient long-context processing (O(n) complexity)
2. **Attention+Science Blocks**: Local windowed attention + science modules + SciGate FFN
Layer ratios:
- 7B: 60% SSM, 40% Attention (pattern: SSM, SSM, Attn, ...)
- 13B: 50% SSM, 50% Attention (pattern: SSM, Attn, SSM, Attn, ...)
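The ratios above can be realized with a simple error-accumulation interleaver. This is a hypothetical sketch (`build_layer_pattern` is not a function in the repo), but it reproduces both patterns listed:

```python
def build_layer_pattern(num_layers: int, ssm_ratio: float) -> list[str]:
    """Interleave SSM and attention blocks to hit the target SSM ratio.

    Accumulates the attention 'budget' each layer and emits an attention
    block whenever a whole unit has accrued; otherwise emits an SSM block.
    """
    pattern = []
    attn_budget = 0.0
    for _ in range(num_layers):
        attn_budget += 1.0 - ssm_ratio
        if attn_budget >= 1.0:
            pattern.append("attn")
            attn_budget -= 1.0
        else:
            pattern.append("ssm")
    return pattern
```

With `ssm_ratio=0.6` this yields `ssm, ssm, attn, ssm, attn, ...` (exactly 60% SSM per 5 layers), and with `ssm_ratio=0.5` it strictly alternates, matching the two patterns above.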
### Science Modules
- **EquationModule**: LaTeX equation detection and structural understanding
- **NumericalReasoningModule**: Digit-level encoding, scientific notation, unit awareness
- **CitationModule**: Citation span detection, provenance tracking, confidence scoring
- **MolecularModule**: Element embeddings, SMILES understanding, amino acid sequences
## 📦 Project Structure
```
Vortex/
├── configs/
│   ├── vortex_7b_config.py      # 7B model configuration
│   ├── vortex_13b_config.py     # 13B model configuration
│   └── training_config.py       # Training hyperparameters
├── models/
│   ├── ssm_layer.py             # State-space layer
│   ├── attention_layer.py       # Local windowed attention
│   ├── scigate_ffn.py           # Science-gated feed-forward
│   ├── vortex_model.py          # Main model class
│   └── science_modules/         # Specialized science modules
├── tokenizer/
│   └── vortex_tokenizer.py      # Custom BPE tokenizer with science vocab
├── data/
│   ├── dataset_loader.py        # Open dataset loading (Pile, S2ORC, etc.)
│   ├── quality_filter.py        # Multi-stage quality filtering
│   ├── domain_classifier.py     # 7-domain classifier
│   ├── deduplication.py         # MinHash LSH deduplication
│   └── scraper.py               # Web scraping (arXiv, PubMed, etc.)
├── training/
│   ├── trainer.py               # Main training loop
│   ├── losses.py                # Science-aware loss functions
│   └── curriculum.py            # Curriculum learning scheduler
├── inference/
│   ├── cuda_optimize.py         # CUDA optimizations (Flash Attention, INT8)
│   ├── mps_optimize.py          # MPS optimizations for Apple Silicon
│   └── inference.py             # Inference entry point
├── evaluation/                  # Science benchmarks (coming soon)
├── configuration_vortex.py      # HF config class
├── tokenization_vortex.py       # HF tokenizer wrapper
├── modeling_vortex.py           # HF model integration
├── train.py                     # Training entry point
└── requirements.txt
```
## 🚀 Quick Start
### Installation
```bash
# Clone and setup
cd Vortex
pip install -r requirements.txt
# For CUDA optimizations
pip install flash-attn
pip install bitsandbytes
```
### Training
```bash
# Train 7B model on CUDA
python train.py \
--model_size 7b \
--device cuda \
--data_dir ./data/processed \
--output_dir ./checkpoints \
--max_steps 100000
# Train 13B model with INT8 quantization (for 8GB VRAM)
python train.py \
--model_size 13b \
--device cuda \
--quantization int8 \
--data_dir ./data/processed \
--output_dir ./checkpoints_13b
```
### Inference
```bash
# Generate text with 7B model
python inference/inference.py \
--model_path ./checkpoints/latest.pt \
--model_size 7b \
--device cuda \
--prompt "The equation E = mc^2 describes" \
--max_new_tokens 100
# Interactive mode
python inference/inference.py \
--model_path ./checkpoints/latest.pt \
--model_size 7b \
--device cuda \
--interactive
# On Apple Silicon (MPS)
python inference/inference.py \
--model_path ./checkpoints/latest.pt \
--model_size 7b \
--use_mps \
--prompt "Explain quantum mechanics"
```
### HuggingFace Integration
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("./checkpoints")
tokenizer = AutoTokenizer.from_pretrained("./checkpoints")
# Generate
input_text = "The energy of a photon is given by"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
```
## 📊 Data Pipeline
1. **Open Datasets**: Automatically download from HuggingFace (Pile, S2ORC, Math datasets, PubMed QA)
2. **Quality Filtering**: Multi-stage checks (length, language, equations, repetition, citations)
3. **Deduplication**: MinHash LSH for near-duplicate detection
4. **Domain Classification**: Classify into 7 science domains
5. **Tokenization**: Custom science-aware BPE tokenizer
6. **Sharding**: Write to Parquet with statistics
```python
from data.dataset_loader import VortexDatasetLoader
from data.quality_filter import ScienceQualityFilter
from data.deduplication import MinHashLSH
# Load and process data
loader = VortexDatasetLoader()
quality_filter = ScienceQualityFilter()
lsh = MinHashLSH()
# Stream datasets, filter, deduplicate, and shard
for sample in loader.load_multiple_datasets(["pile_scientific", "automath"]):
if quality_filter.filter(sample["text"]):
lsh.add_document(sample["id"], sample["text"])
# Tokenize and save
```
## 🎯 Training Strategy
### Curriculum Learning
Training progresses through 4 stages:
1. **Foundation** (0-20%): Basic science text, simple equations, definitions
2. **Domain** (20-50%): Domain-specific deep content per science area
3. **Reasoning** (50-80%): Scientific problem solving, multi-step derivations
4. **Integration** (80-100%): Cross-domain science, full dataset
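The stage boundaries above map directly to a progress lookup. A minimal sketch, assuming the stage names used here (the actual `curriculum.py` scheduler may differ):

```python
def curriculum_stage(step: int, max_steps: int) -> str:
    """Map training progress (fraction of max_steps) to a curriculum stage."""
    progress = step / max_steps
    if progress < 0.20:
        return "foundation"   # basic science text, simple equations
    if progress < 0.50:
        return "domain"       # domain-specific deep content
    if progress < 0.80:
        return "reasoning"    # multi-step derivations, problem solving
    return "integration"      # cross-domain science, full dataset
```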
### Science-Aware Loss
```python
total_loss = (
lm_loss * 1.0 # Standard next token prediction
+ equation_loss * 0.3 # Equation reconstruction accuracy
+ domain_loss * 0.1 # Domain classification head
+ citation_loss * 0.1 # Citation detection accuracy
+ numerical_loss * 0.2 # Numerical reasoning accuracy
)
```
## ⚙️ Configuration
### 7B Config (VORTEX_7B_CONFIG)
- `d_model`: 4096
- `num_layers`: 32
- `num_heads`: 32
- `d_state`: 16
- `ssm_ratio`: 0.6
- `vocab_size`: 50000
- `max_seq_len`: 16384
### 13B Config (VORTEX_13B_CONFIG)
- `d_model`: 5120
- `num_layers`: 40
- `num_heads`: 40
- `d_state`: 32
- `ssm_ratio`: 0.5
- `vocab_size`: 50000
- `max_seq_len`: 16384
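The two configurations differ only in width, depth, and SSM state size, so they can be expressed with one dataclass. This is a hypothetical mirror of the repo's config objects, with field names taken from the lists above:

```python
from dataclasses import dataclass

@dataclass
class VortexConfig:
    """Illustrative config container (the repo's actual classes may differ)."""
    d_model: int
    num_layers: int
    num_heads: int
    d_state: int
    ssm_ratio: float
    vocab_size: int = 50000
    max_seq_len: int = 16384

VORTEX_7B_CONFIG = VortexConfig(
    d_model=4096, num_layers=32, num_heads=32, d_state=16, ssm_ratio=0.6)
VORTEX_13B_CONFIG = VortexConfig(
    d_model=5120, num_layers=40, num_heads=40, d_state=32, ssm_ratio=0.5)
```

Note that both sizes keep a per-head dimension of 128 (`d_model / num_heads`), which is the dimension Flash Attention kernels are commonly tuned for.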
## 🔧 Hardware Targets
### Nvidia 4060 Laptop (8GB VRAM)
- **7B**: BF16, no quantization, Flash Attention 2, torch.compile
- **13B**: INT8 quantization, Flash Attention 2, torch.compile
- Target TPS: 25-40 (7B), 15-25 (13B)
### Apple Silicon (M2/M3)
- **7B on M3**: BF16 (via float16), SDPA, no compile
- **13B on M3 Max**: BF16, unified memory, SDPA
- Target TPS: 20-35 (7B), 12-20 (13B)
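The per-backend settings above amount to a small preset table. This is an illustrative sketch only, not the repo's actual inference code (`INFERENCE_PRESETS` and `preset` are assumed names):

```python
# Hypothetical mapping from (backend, model size) to inference settings,
# transcribed from the hardware targets listed above.
INFERENCE_PRESETS = {
    ("cuda", "7b"):  {"dtype": "bf16", "quant": None,   "attn": "flash2", "compile": True},
    ("cuda", "13b"): {"dtype": "bf16", "quant": "int8", "attn": "flash2", "compile": True},
    ("mps",  "7b"):  {"dtype": "fp16", "quant": None,   "attn": "sdpa",   "compile": False},
    ("mps",  "13b"): {"dtype": "bf16", "quant": None,   "attn": "sdpa",   "compile": False},
}

def preset(device: str, size: str) -> dict:
    """Look up the inference settings for a backend/size pair."""
    return INFERENCE_PRESETS[(device, size)]
```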
## 🧪 Science Domains
1. **Physics** (`[PHYS]`)
2. **Mathematics** (`[MATH]`)
3. **Chemistry** (`[CHEM]`)
4. **Biology** (`[BIO]`)
5. **Earth Science** (`[EARTH]`)
6. **Space Science** (`[SPACE]`)
7. **Zoology** (`[ZOO]`)
Domain tags can be included in training data to guide the SciGate FFN routing.
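In practice, tagging a sample is just prepending the domain token to the text. A hypothetical helper (tag placement in the real pipeline may differ):

```python
# Domain tags from the list above; keys are assumed spellings.
DOMAIN_TAGS = {
    "physics": "[PHYS]", "mathematics": "[MATH]", "chemistry": "[CHEM]",
    "biology": "[BIO]", "earth_science": "[EARTH]",
    "space_science": "[SPACE]", "zoology": "[ZOO]",
}

def tag_sample(text: str, domain: str) -> str:
    """Prepend the domain tag so SciGate FFN routing can condition on it."""
    return f"{DOMAIN_TAGS[domain]} {text}"
```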
## 📝 Tokenizer
Custom BPE tokenizer with:
- 40,000 base BPE tokens trained on scientific corpus
- 10,000 science-specific tokens:
- 500 LaTeX math symbols (`\alpha`, `\sum`, `\int`, etc.)
- 118 chemical element symbols
- 200 SI and derived units
- 300 scientific abbreviations (DNA, RNA, ATP, etc.)
- 500 mathematical operators
- Amino acid codes
- Greek alphabet (α, β, γ, etc.)
- Special tokens: `[EQUATION]`, `[CITATION]`, `[MOLECULE]`, `[FIGURE]`, `[TABLE]`, domain tags
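The `[EQUATION]` token is most useful when inserted around detected LaTeX spans during preprocessing. A minimal sketch of that convention (the marker-wrapping scheme is an assumption, not the repo's documented behavior):

```python
import re

# Inline $...$ math spans; display environments would need more patterns.
EQ_PATTERN = re.compile(r"\$[^$]+\$")

def mark_equations(text: str) -> str:
    """Wrap inline LaTeX spans in [EQUATION] markers for the EquationModule."""
    return EQ_PATTERN.sub(lambda m: f"[EQUATION] {m.group(0)} [EQUATION]", text)
```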
## 🧪 Evaluation
Science benchmarks across all 7 domains will be added. Planned benchmarks:
- **Physics**: Feynman Questions, Physics GRE
- **Math**: MATH dataset, GSM8K
- **Chemistry**: Chemistry problem-solving, molecular property prediction
- **Biology**: PubMed QA, bioinformatics tasks
- **Earth Science**: Climate modeling questions
- **Space Science**: Astronomy problem sets
- **Zoology**: Species classification, ecological reasoning
## 📄 License
This is a school science project. Code is provided for educational purposes.
## 🙏 Acknowledgments
- **Mamba** (Gu et al.) for SSM architecture inspiration
- **Flash Attention** (Dao et al.) for efficient attention
- **HuggingFace** for transformers library
- All open scientific data sources: arXiv, PubMed, S2ORC, etc.
## 📧 Contact
For questions or issues, please open an issue on GitHub.
---
**Built with ❤️ for scientific AI research**