---
language:
- en
license: mit
tags:
- vortex
- science
- physics
- chemistry
- biology
- mathematics
- ssm
- mamba
- hybrid-architecture
- custom-tokenizer
- from-scratch
- matrix-corp
pipeline_tag: text-generation
library_name: transformers
model_type: vortex
---
# Vortex Scientific
**Vortex Scientific** is a from-scratch AI model family designed for deep scientific reasoning. It is built from the ground up on a novel hybrid state-space + attention architecture and optimized for consumer laptop hardware (Apple Silicon MacBooks and Nvidia RTX 4060 laptop GPUs).
## 🌟 Features
- **Novel Architecture**: Hybrid State-Space Model (SSM) + Local Attention blocks
- **Science-Specialized**: Custom tokenizer, domain-aware gating, and specialized modules for equations, numerical reasoning, citations, and molecular structures
- **Hardware Optimized**: Runs smoothly on 8GB VRAM (4060 laptop) and 16GB unified memory (MacBook Pro M2/M3)
- **Two Model Sizes**:
- **Vortex-7B**: 7 billion parameters, fits in 8GB VRAM
- **Vortex-13B**: 13 billion parameters, fits in 16GB VRAM with quantization
- **HuggingFace Compatible**: Full integration with `transformers` library
- **From Scratch**: No base model — everything built bottom-up including tokenizer and weights
## 🏗️ Architecture
Vortex uses a two-block hybrid architecture:
1. **SSM-Only Blocks**: State-space layers for efficient long-context processing (O(n) complexity)
2. **Attention+Science Blocks**: Local windowed attention + science modules + SciGate FFN
Layer ratios:
- 7B: 60% SSM, 40% Attention (pattern: SSM, SSM, Attn, ...)
- 13B: 50% SSM, 50% Attention (pattern: SSM, Attn, SSM, Attn, ...)
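The interleaving above can be sketched with a greedy pattern builder. This is an illustrative helper, not code from the repo (`build_layer_pattern` and its exact assignment rule are assumptions); it places an SSM block whenever the running SSM fraction would otherwise fall below the target ratio:

```python
def build_layer_pattern(num_layers: int, ssm_ratio: float) -> list[str]:
    """Interleave 'ssm' and 'attn' blocks so the overall mix
    approximates ssm_ratio (e.g. 0.6 -> ssm, ssm, attn, ...)."""
    pattern = []
    for i in range(num_layers):
        # Add an SSM block while the running SSM count trails the target
        if pattern.count("ssm") < ssm_ratio * (i + 1):
            pattern.append("ssm")
        else:
            pattern.append("attn")
    return pattern
```

For the 7B settings (32 layers, ratio 0.6) this yields the `SSM, SSM, Attn, ...` opening from the list above; for the 13B settings (40 layers, ratio 0.5) it alternates `SSM, Attn, SSM, Attn, ...`.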
### Science Modules
- **EquationModule**: LaTeX equation detection and structural understanding
- **NumericalReasoningModule**: Digit-level encoding, scientific notation, unit awareness
- **CitationModule**: Citation span detection, provenance tracking, confidence scoring
- **MolecularModule**: Element embeddings, SMILES understanding, amino acid sequences
## 📦 Project Structure
```
Vortex/
├── configs/
│   ├── vortex_7b_config.py    # 7B model configuration
│   ├── vortex_13b_config.py   # 13B model configuration
│   └── training_config.py     # Training hyperparameters
├── models/
│   ├── ssm_layer.py           # State-space layer
│   ├── attention_layer.py     # Local windowed attention
│   ├── scigate_ffn.py         # Science-gated feed-forward
│   ├── vortex_model.py        # Main model class
│   └── science_modules/       # Specialized science modules
├── tokenizer/
│   └── vortex_tokenizer.py    # Custom BPE tokenizer with science vocab
├── data/
│   ├── dataset_loader.py      # Open dataset loading (Pile, S2ORC, etc.)
│   ├── quality_filter.py      # Multi-stage quality filtering
│   ├── domain_classifier.py   # 7-domain classifier
│   ├── deduplication.py       # MinHash LSH deduplication
│   └── scraper.py             # Web scraping (arXiv, PubMed, etc.)
├── training/
│   ├── trainer.py             # Main training loop
│   ├── losses.py              # Science-aware loss functions
│   └── curriculum.py          # Curriculum learning scheduler
├── inference/
│   ├── inference.py           # Inference entry point
│   ├── cuda_optimize.py       # CUDA optimizations (Flash Attention, INT8)
│   └── mps_optimize.py        # MPS optimizations for Apple Silicon
├── evaluation/                # Science benchmarks (coming soon)
├── configuration_vortex.py    # HF config class
├── tokenization_vortex.py     # HF tokenizer wrapper
├── modeling_vortex.py         # HF model integration
├── train.py                   # Training entry point
└── requirements.txt
```
## 🚀 Quick Start
### Installation
```bash
# Clone and setup
cd Vortex
pip install -r requirements.txt
# For CUDA optimizations
pip install flash-attn
pip install bitsandbytes
```
### Training
```bash
# Train the 7B model on CUDA
python train.py \
    --model_size 7b \
    --device cuda \
    --data_dir ./data/processed \
    --output_dir ./checkpoints \
    --max_steps 100000

# Train the 13B model with INT8 quantization (for 8GB VRAM)
python train.py \
    --model_size 13b \
    --device cuda \
    --quantization int8 \
    --data_dir ./data/processed \
    --output_dir ./checkpoints_13b
```
### Inference
```bash
# Generate text with the 7B model
python inference/inference.py \
    --model_path ./checkpoints/latest.pt \
    --model_size 7b \
    --device cuda \
    --prompt "The equation E = mc^2 describes" \
    --max_new_tokens 100

# Interactive mode
python inference/inference.py \
    --model_path ./checkpoints/latest.pt \
    --model_size 7b \
    --device cuda \
    --interactive

# On Apple Silicon (MPS)
python inference/inference.py \
    --model_path ./checkpoints/latest.pt \
    --model_size 7b \
    --use_mps \
    --prompt "Explain quantum mechanics"
```
### HuggingFace Integration
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("./checkpoints")
tokenizer = AutoTokenizer.from_pretrained("./checkpoints")
# Generate
input_text = "The energy of a photon is given by"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
```
## 📊 Data Pipeline
1. **Open Datasets**: Automatically download from HuggingFace (Pile, S2ORC, Math datasets, PubMed QA)
2. **Quality Filtering**: Multi-stage checks (length, language, equations, repetition, citations)
3. **Deduplication**: MinHash LSH for near-duplicate detection
4. **Domain Classification**: Classify into 7 science domains
5. **Tokenization**: Custom science-aware BPE tokenizer
6. **Sharding**: Write to Parquet with statistics
```python
from data.dataset_loader import VortexDatasetLoader
from data.quality_filter import ScienceQualityFilter
from data.deduplication import MinHashLSH
# Load and process data
loader = VortexDatasetLoader()
quality_filter = ScienceQualityFilter()
lsh = MinHashLSH()
# Stream datasets, filter, deduplicate, and shard
for sample in loader.load_multiple_datasets(["pile_scientific", "automath"]):
    if quality_filter.filter(sample["text"]):
        lsh.add_document(sample["id"], sample["text"])
        # Tokenize and save each retained, unique sample
```
## 🎯 Training Strategy
### Curriculum Learning
Training progresses through 4 stages:
1. **Foundation** (0-20%): Basic science text, simple equations, definitions
2. **Domain** (20-50%): Domain-specific deep content per science area
3. **Reasoning** (50-80%): Scientific problem solving, multi-step derivations
4. **Integration** (80-100%): Cross-domain science, full dataset
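The stage schedule can be sketched as a small helper that maps training progress to a stage name. This is a minimal illustration assuming stages switch exactly at the listed progress fractions; `curriculum_stage` is an illustrative name, not the repo's API:

```python
def curriculum_stage(step: int, max_steps: int) -> str:
    """Map training progress (step / max_steps) to a curriculum stage."""
    progress = step / max_steps
    if progress < 0.20:
        return "foundation"   # basic science text, simple equations
    if progress < 0.50:
        return "domain"       # domain-specific deep content
    if progress < 0.80:
        return "reasoning"    # multi-step derivations, problem solving
    return "integration"      # cross-domain science, full dataset
```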
### Science-Aware Loss
```python
total_loss = (
    lm_loss * 1.0           # Standard next-token prediction
    + equation_loss * 0.3   # Equation reconstruction accuracy
    + domain_loss * 0.1     # Domain classification head
    + citation_loss * 0.1   # Citation detection accuracy
    + numerical_loss * 0.2  # Numerical reasoning accuracy
)
```
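The same weighted sum can be factored into a reusable helper. The weights are taken from the snippet above; `combine_losses` and the `LOSS_WEIGHTS` mapping are illustrative names, not the repo's actual API:

```python
# Objective weights from the science-aware loss above
LOSS_WEIGHTS = {
    "lm": 1.0,         # standard next-token prediction
    "equation": 0.3,   # equation reconstruction
    "domain": 0.1,     # domain classification head
    "citation": 0.1,   # citation detection
    "numerical": 0.2,  # numerical reasoning
}

def combine_losses(losses: dict[str, float]) -> float:
    """Weighted sum of the per-objective loss terms."""
    return sum(LOSS_WEIGHTS[name] * value for name, value in losses.items())
```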
## ⚙️ Configuration
### 7B Config (VORTEX_7B_CONFIG)
- `d_model`: 4096
- `num_layers`: 32
- `num_heads`: 32
- `d_state`: 16
- `ssm_ratio`: 0.6
- `vocab_size`: 50000
- `max_seq_len`: 16384
### 13B Config (VORTEX_13B_CONFIG)
- `d_model`: 5120
- `num_layers`: 40
- `num_heads`: 40
- `d_state`: 32
- `ssm_ratio`: 0.5
- `vocab_size`: 50000
- `max_seq_len`: 16384
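The two configurations differ only in scale; everything else is shared. A minimal dataclass sketch of the fields listed above (the class name `VortexConfig` is illustrative; the repo's `configs/` modules may define these differently):

```python
from dataclasses import dataclass

@dataclass
class VortexConfig:
    d_model: int
    num_layers: int
    num_heads: int
    d_state: int
    ssm_ratio: float
    vocab_size: int = 50000
    max_seq_len: int = 16384

VORTEX_7B_CONFIG = VortexConfig(d_model=4096, num_layers=32, num_heads=32,
                                d_state=16, ssm_ratio=0.6)
VORTEX_13B_CONFIG = VortexConfig(d_model=5120, num_layers=40, num_heads=40,
                                 d_state=32, ssm_ratio=0.5)
```

Note that both sizes keep the same per-head dimension (`d_model / num_heads` = 128), so the attention blocks scale by adding heads and layers rather than widening each head.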
## 🔧 Hardware Targets
### Nvidia 4060 Laptop (8GB VRAM)
- **7B**: BF16, no quantization, Flash Attention 2, `torch.compile`
- **13B**: INT8 quantization, Flash Attention 2, `torch.compile`
- Target throughput: 25-40 tokens/s (7B), 15-25 tokens/s (13B)
### Apple Silicon (M2/M3)
- **7B on M3**: FP16 (BF16 weights cast to float16 for MPS), SDPA, no `torch.compile`
- **13B on M3 Max**: BF16, unified memory, SDPA
- Target throughput: 20-35 tokens/s (7B), 12-20 tokens/s (13B)
## 🧪 Science Domains
1. **Physics** (`[PHYS]`)
2. **Mathematics** (`[MATH]`)
3. **Chemistry** (`[CHEM]`)
4. **Biology** (`[BIO]`)
5. **Earth Science** (`[EARTH]`)
6. **Space Science** (`[SPACE]`)
7. **Zoology** (`[ZOO]`)
Domain tags can be included in training data to guide the SciGate FFN routing.
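Tagging a training sample is as simple as prepending the marker. A hypothetical helper (`tag_sample` and the `DOMAIN_TAGS` mapping are illustrative, not part of the repo):

```python
# Domain markers from the list above
DOMAIN_TAGS = {
    "physics": "[PHYS]", "mathematics": "[MATH]", "chemistry": "[CHEM]",
    "biology": "[BIO]", "earth": "[EARTH]", "space": "[SPACE]",
    "zoology": "[ZOO]",
}

def tag_sample(text: str, domain: str) -> str:
    """Prepend the domain tag so SciGate FFN routing sees it first."""
    return f"{DOMAIN_TAGS[domain]} {text}"
```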
## 📝 Tokenizer
Custom BPE tokenizer with:
- 40,000 base BPE tokens trained on scientific corpus
- 10,000 science-specific tokens:
- 500 LaTeX math symbols (`\alpha`, `\sum`, `\int`, etc.)
- 118 chemical element symbols
- 200 SI and derived units
- 300 scientific abbreviations (DNA, RNA, ATP, etc.)
- 500 mathematical operators
- Amino acid codes
- Greek alphabet (α, β, γ, etc.)
- Special tokens: `[EQUATION]`, `[CITATION]`, `[MOLECULE]`, `[FIGURE]`, `[TABLE]`, domain tags
## 🧪 Evaluation
Science benchmarks across all 7 domains will be added. Planned benchmarks:
- **Physics**: Feynman Questions, Physics GRE
- **Math**: MATH dataset, GSM8K
- **Chemistry**: Chemistry problem-solving, molecular property prediction
- **Biology**: PubMed QA, bioinformatics tasks
- **Earth Science**: Climate modeling questions
- **Space Science**: Astronomy problem sets
- **Zoology**: Species classification, ecological reasoning
## 📄 License
This is a school science project. Code is provided for educational purposes.
## 🙏 Acknowledgments
- **Mamba** (Gu et al.) for SSM architecture inspiration
- **Flash Attention** (Dao et al.) for efficient attention
- **HuggingFace** for transformers library
- All open scientific data sources: arXiv, PubMed, S2ORC, etc.
## 📧 Contact
For questions or issues, please open an issue on GitHub.
---
**Built with ❤️ for scientific AI research**