---
language:
  - en
license: mit
tags:
  - vortex
  - science
  - physics
  - chemistry
  - biology
  - mathematics
  - ssm
  - mamba
  - hybrid-architecture
  - custom-tokenizer
  - from-scratch
  - matrix-corp
pipeline_tag: text-generation
library_name: transformers
model_type: vortex
---

# Vortex Scientific

**Vortex Scientific** is a from-scratch AI model family designed for deep scientific reasoning. It is built from the ground up on a novel hybrid state-space + attention architecture and optimized for consumer laptop hardware (Apple Silicon MacBooks and Nvidia 4060 laptop GPUs).

## 🌟 Features

- **Novel Architecture**: Hybrid State-Space Model (SSM) + Local Attention blocks
- **Science-Specialized**: Custom tokenizer, domain-aware gating, and specialized modules for equations, numerical reasoning, citations, and molecular structures
- **Hardware Optimized**: Runs smoothly on 8GB VRAM (4060 laptop) and 16GB unified memory (MacBook Pro M2/M3)
- **Two Model Sizes**:
  - **Vortex-7B**: 7 billion parameters, fits in 8GB VRAM
  - **Vortex-13B**: 13 billion parameters, fits in 16GB VRAM with quantization
- **HuggingFace Compatible**: Full integration with `transformers` library
- **From Scratch**: No base model; everything, including the tokenizer and weights, is built bottom-up

## πŸ—οΈ Architecture

Vortex uses a two-block hybrid architecture:

1. **SSM-Only Blocks**: State-space layers for efficient long-context processing (O(n) complexity)
2. **Attention+Science Blocks**: Local windowed attention + science modules + SciGate FFN
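The O(n) claim follows from the sequential scan an SSM performs over the sequence. A minimal scalar sketch of that recurrence (purely illustrative; the real `ssm_layer.py` presumably uses learned, vectorized parameters):

```python
def ssm_scan(xs, a=0.9, b=0.5, c=1.0):
    """Scalar state-space recurrence: h_t = a*h_(t-1) + b*x_t, y_t = c*h_t.
    A single sequential pass over the input, so cost is O(n) in length."""
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x   # state update
        ys.append(c * h)    # readout
    return ys

# An impulse input decays geometrically through the hidden state
print(ssm_scan([1.0, 0.0, 0.0, 0.0]))
```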

Layer ratios:
- 7B: 60% SSM, 40% Attention (pattern: SSM, SSM, Attn, ...)
- 13B: 50% SSM, 50% Attention (pattern: SSM, Attn, SSM, Attn, ...)
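One way to realize these ratios is a cumulative quota that rounds up, which reproduces both opening patterns above (a sketch; the actual block layout in `vortex_model.py` may differ):

```python
import math

def layer_pattern(num_layers, ssm_ratio):
    """Interleave SSM and attention blocks so the running SSM count
    tracks ssm_ratio of the layers seen so far."""
    pattern, ssm_used = [], 0
    for i in range(num_layers):
        if ssm_used < math.ceil((i + 1) * ssm_ratio):
            pattern.append("ssm")
            ssm_used += 1
        else:
            pattern.append("attn")
    return pattern

print(layer_pattern(6, 0.6))  # starts SSM, SSM, Attn as in the 7B pattern
print(layer_pattern(6, 0.5))  # alternates SSM, Attn as in the 13B pattern
```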

### Science Modules

- **EquationModule**: LaTeX equation detection and structural understanding
- **NumericalReasoningModule**: Digit-level encoding, scientific notation, unit awareness
- **CitationModule**: Citation span detection, provenance tracking, confidence scoring
- **MolecularModule**: Element embeddings, SMILES understanding, amino acid sequences
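To give a flavor of what equation detection involves, here is a minimal regex-based sketch for locating LaTeX spans in raw text (purely illustrative; the actual EquationModule operates on model representations, not regexes):

```python
import re

# Match display ($$...$$) before inline ($...$) spans; the inline branch
# forbids '$' and newlines inside so adjacent equations stay separate.
LATEX_SPAN = re.compile(r"\$\$.+?\$\$|\$[^$\n]+\$", re.DOTALL)

def find_equations(text):
    """Return the LaTeX-like spans found in `text`, in order."""
    return [m.group() for m in LATEX_SPAN.finditer(text)]

print(find_equations("Energy is $E = mc^2$; momentum is $p = mv$."))
```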

## 📦 Project Structure

```
Vortex/
├── configs/
│   ├── vortex_7b_config.py      # 7B model configuration
│   ├── vortex_13b_config.py     # 13B model configuration
│   └── training_config.py       # Training hyperparameters
├── models/
│   ├── ssm_layer.py             # State-space layer
│   ├── attention_layer.py       # Local windowed attention
│   ├── scigate_ffn.py           # Science-gated feed-forward
│   ├── vortex_model.py          # Main model class
│   └── science_modules/         # Specialized science modules
├── tokenizer/
│   └── vortex_tokenizer.py      # Custom BPE tokenizer with science vocab
├── data/
│   ├── dataset_loader.py        # Open dataset loading (Pile, S2ORC, etc.)
│   ├── quality_filter.py        # Multi-stage quality filtering
│   ├── domain_classifier.py     # 7-domain classifier
│   ├── deduplication.py         # MinHash LSH deduplication
│   └── scraper.py               # Web scraping (arXiv, PubMed, etc.)
├── training/
│   ├── trainer.py               # Main training loop
│   ├── losses.py                # Science-aware loss functions
│   └── curriculum.py            # Curriculum learning scheduler
├── inference/
│   ├── cuda_optimize.py         # CUDA optimizations (Flash Attention, INT8)
│   └── mps_optimize.py          # MPS optimizations for Apple Silicon
├── evaluation/                  # Science benchmarks (coming soon)
├── configuration_vortex.py      # HF config class
├── tokenization_vortex.py       # HF tokenizer wrapper
├── modeling_vortex.py           # HF model integration
├── train.py                     # Training entry point
├── inference/inference.py       # Inference entry point
└── requirements.txt
```

## 🚀 Quick Start

### Installation

```bash
# Clone and setup
cd Vortex
pip install -r requirements.txt

# For CUDA optimizations
pip install flash-attn
pip install bitsandbytes
```

### Training

```bash
# Train 7B model on CUDA
python train.py \
    --model_size 7b \
    --device cuda \
    --data_dir ./data/processed \
    --output_dir ./checkpoints \
    --max_steps 100000

# Train 13B model with INT8 quantization (for 8GB VRAM)
python train.py \
    --model_size 13b \
    --device cuda \
    --quantization int8 \
    --data_dir ./data/processed \
    --output_dir ./checkpoints_13b
```

### Inference

```bash
# Generate text with 7B model
python inference/inference.py \
    --model_path ./checkpoints/latest.pt \
    --model_size 7b \
    --device cuda \
    --prompt "The equation E = mc^2 describes" \
    --max_new_tokens 100

# Interactive mode
python inference/inference.py \
    --model_path ./checkpoints/latest.pt \
    --model_size 7b \
    --device cuda \
    --interactive

# On Apple Silicon (MPS)
python inference/inference.py \
    --model_path ./checkpoints/latest.pt \
    --model_size 7b \
    --use_mps \
    --prompt "Explain quantum mechanics"
```

### HuggingFace Integration

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("./checkpoints")
tokenizer = AutoTokenizer.from_pretrained("./checkpoints")

# Generate
input_text = "The energy of a photon is given by"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
```

## 📊 Data Pipeline

1. **Open Datasets**: Automatically download from HuggingFace (Pile, S2ORC, Math datasets, PubMed QA)
2. **Quality Filtering**: Multi-stage checks (length, language, equations, repetition, citations)
3. **Deduplication**: MinHash LSH for near-duplicate detection
4. **Domain Classification**: Classify into 7 science domains
5. **Tokenization**: Custom science-aware BPE tokenizer
6. **Sharding**: Write to Parquet with statistics

```python
from data.dataset_loader import VortexDatasetLoader
from data.quality_filter import ScienceQualityFilter
from data.deduplication import MinHashLSH

# Load and process data
loader = VortexDatasetLoader()
quality_filter = ScienceQualityFilter()
lsh = MinHashLSH()

# Stream datasets, filter, deduplicate, and shard
for sample in loader.load_multiple_datasets(["pile_scientific", "automath"]):
    if quality_filter.filter(sample["text"]):
        lsh.add_document(sample["id"], sample["text"])
        # Tokenize and save
```

## 🎯 Training Strategy

### Curriculum Learning

Training progresses through 4 stages:

1. **Foundation** (0-20%): Basic science text, simple equations, definitions
2. **Domain** (20-50%): Domain-specific deep content per science area
3. **Reasoning** (50-80%): Scientific problem solving, multi-step derivations
4. **Integration** (80-100%): Cross-domain science, full dataset
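These boundaries can be expressed as a simple stage lookup (a sketch of the schedule above; `training/curriculum.py` may implement stage switching differently):

```python
def curriculum_stage(progress):
    """Map training progress in [0, 1] to one of the four curriculum stages,
    using the percentage boundaries listed above."""
    if progress < 0.2:
        return "foundation"
    if progress < 0.5:
        return "domain"
    if progress < 0.8:
        return "reasoning"
    return "integration"

print(curriculum_stage(0.35))  # falls inside the Domain stage
```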

### Science-Aware Loss

```python
total_loss = (
    lm_loss * 1.0              # Standard next token prediction
    + equation_loss * 0.3      # Equation reconstruction accuracy
    + domain_loss * 0.1        # Domain classification head
    + citation_loss * 0.1      # Citation detection accuracy
    + numerical_loss * 0.2     # Numerical reasoning accuracy
)
```

## ⚙️ Configuration

### 7B Config (VORTEX_7B_CONFIG)

- `d_model`: 4096
- `num_layers`: 32
- `num_heads`: 32
- `d_state`: 16
- `ssm_ratio`: 0.6
- `vocab_size`: 50000
- `max_seq_len`: 16384

### 13B Config (VORTEX_13B_CONFIG)

- `d_model`: 5120
- `num_layers`: 40
- `num_heads`: 40
- `d_state`: 32
- `ssm_ratio`: 0.5
- `vocab_size`: 50000
- `max_seq_len`: 16384
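A few quantities follow directly from these numbers; both sizes share a 128-dimensional attention head, for instance. Illustrative dicts mirroring the configurations above (the real `configs/vortex_*_config.py` files are Python modules and may expose these fields differently):

```python
import math

# Field values copied from the two config listings above
VORTEX_7B = dict(d_model=4096, num_layers=32, num_heads=32, d_state=16,
                 ssm_ratio=0.6, vocab_size=50_000, max_seq_len=16_384)
VORTEX_13B = dict(d_model=5120, num_layers=40, num_heads=40, d_state=32,
                  ssm_ratio=0.5, vocab_size=50_000, max_seq_len=16_384)

for name, cfg in [("7B", VORTEX_7B), ("13B", VORTEX_13B)]:
    n_ssm = math.ceil(cfg["num_layers"] * cfg["ssm_ratio"])  # SSM block count (rounded up)
    n_attn = cfg["num_layers"] - n_ssm                       # attention block count
    head_dim = cfg["d_model"] // cfg["num_heads"]            # per-head width
    print(f"{name}: {n_ssm} SSM + {n_attn} attention layers, head_dim={head_dim}")
```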

## 🔧 Hardware Targets

### Nvidia 4060 Laptop (8GB VRAM)

- **7B**: BF16, no quantization, Flash Attention 2, torch.compile
- **13B**: INT8 quantization, Flash Attention 2, torch.compile
- Target TPS: 25-40 (7B), 15-25 (13B)

### Apple Silicon (M2/M3)

- **7B on M3**: FP16 (BF16 falls back to float16 on MPS), SDPA, no torch.compile
- **13B on M3 Max**: BF16, unified memory, SDPA
- Target TPS: 20-35 (7B), 12-20 (13B)

## 🧪 Science Domains

1. **Physics** (`[PHYS]`)
2. **Mathematics** (`[MATH]`)
3. **Chemistry** (`[CHEM]`)
4. **Biology** (`[BIO]`)
5. **Earth Science** (`[EARTH]`)
6. **Space Science** (`[SPACE]`)
7. **Zoology** (`[ZOO]`)

Domain tags can be included in training data to guide the SciGate FFN routing.
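A hypothetical helper for attaching these tags during data preparation (the `tag_sample` name and the prefix placement are assumptions for illustration, not the actual pipeline API):

```python
# Tag strings come from the domain list above; dictionary keys are
# illustrative names chosen here, not identifiers from the codebase.
DOMAIN_TAGS = {
    "physics": "[PHYS]", "mathematics": "[MATH]", "chemistry": "[CHEM]",
    "biology": "[BIO]", "earth_science": "[EARTH]", "space_science": "[SPACE]",
    "zoology": "[ZOO]",
}

def tag_sample(text, domain):
    """Prefix a training sample with its science-domain tag."""
    return f"{DOMAIN_TAGS[domain]} {text}"

print(tag_sample("Newton's second law states F = ma.", "physics"))
# → [PHYS] Newton's second law states F = ma.
```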

## 📝 Tokenizer

Custom BPE tokenizer with:

- 40,000 base BPE tokens trained on scientific corpus
- 10,000 science-specific tokens:
  - 500 LaTeX math symbols (`\alpha`, `\sum`, `\int`, etc.)
  - 118 chemical element symbols
  - 200 SI and derived units
  - 300 scientific abbreviations (DNA, RNA, ATP, etc.)
  - 500 mathematical operators
  - Amino acid codes
  - Greek alphabet (α, β, γ, etc.)
- Special tokens: `[EQUATION]`, `[CITATION]`, `[MOLECULE]`, `[FIGURE]`, `[TABLE]`, domain tags
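As one illustration of how the special tokens might be used, a pre-tokenization pass could wrap detected equations and bracketed citations in their markers. The paired-wrapper convention below is an assumption for the sketch; `vortex_tokenizer.py` may annotate text differently:

```python
import re

def annotate(text):
    """Wrap inline LaTeX spans and numeric [n] citations in the tokenizer's
    special tokens (assumed pairing convention, for illustration only)."""
    text = re.sub(r"\$[^$\n]+\$", r"[EQUATION]\g<0>[EQUATION]", text)
    text = re.sub(r"\[(\d+)\]", r"[CITATION]\1[CITATION]", text)
    return text

print(annotate("Planck's law $E = h\\nu$ was derived in 1900 [1]."))
```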

## 🧪 Evaluation

Science benchmarks across all seven domains are planned:

- **Physics**: Feynman Questions, Physics GRE
- **Math**: MATH dataset, GSM8K
- **Chemistry**: Chemistry problem-solving, molecular property prediction
- **Biology**: PubMed QA, bioinformatics tasks
- **Earth Science**: Climate modeling questions
- **Space Science**: Astronomy problem sets
- **Zoology**: Species classification, ecological reasoning

## 📄 License

Released under the MIT license. This is a school science project, and the code is provided for educational purposes.

## 🙏 Acknowledgments

- **Mamba** (Gu et al.) for SSM architecture inspiration
- **Flash Attention** (Dao et al.) for efficient attention
- **HuggingFace** for transformers library
- All open scientific data sources: arXiv, PubMed, S2ORC, etc.

## 📧 Contact

For questions or issues, please open an issue on GitHub.

---

**Built with ❤️ for scientific AI research**