Matrix-Corp/Vortex-13b-V1

@@ -1,289 +1,311 @@
-# Vortex Scientific
-**Vortex Scientific** is a from-scratch AI model family designed for deep scientific reasoning. Built from the ground up with a novel hybrid state-space + attention architecture, optimized for consumer laptop hardware (Apple Silicon MacBooks and Nvidia 4060 laptop GPUs).
-## 🌟 Features
-- **Novel Architecture**: Hybrid State-Space Model (SSM) + Local Attention blocks
-- **Science-Specialized**: Custom tokenizer, domain-aware gating, and specialized modules for equations, numerical reasoning, citations, and molecular structures
-- **Hardware Optimized**: Runs smoothly on 8GB VRAM (4060 laptop) and 16GB unified memory (MacBook Pro M2/M3)
-- **Two Model Sizes**:
-  - **Vortex-7B**: 7 billion parameters, fits in 8GB VRAM
-  - **Vortex-13B**: 13 billion parameters, fits in 16GB VRAM with quantization
-- **HuggingFace Compatible**: Full integration with `transformers` library
-- **From Scratch**: No base model — everything built bottom-up including tokenizer and weights
-## 🏗️ Architecture
-Vortex uses a two-block hybrid architecture:
-1. **SSM-Only Blocks**: State-space layers for efficient long-context processing (O(n) complexity)
-2. **Attention+Science Blocks**: Local windowed attention + science modules + SciGate FFN
-Layer ratios:
-- 7B: 60% SSM, 40% Attention (pattern: SSM, SSM, Attn, ...)
-- 13B: 50% SSM, 50% Attention (pattern: SSM, Attn, SSM, Attn, ...)
-### Science Modules
-- **EquationModule**: LaTeX equation detection and structural understanding
-- **NumericalReasoningModule**: Digit-level encoding, scientific notation, unit awareness
-- **CitationModule**: Citation span detection, provenance tracking, confidence scoring
-- **MolecularModule**: Element embeddings, SMILES understanding, amino acid sequences
-## 📦 Project Structure
-```
-Vortex/
-├── configs/
-│   ├── vortex_7b_config.py      # 7B model configuration
-│   ├── vortex_13b_config.py     # 13B model configuration
-│   └── training_config.py       # Training hyperparameters
-├── models/
-│   ├── ssm_layer.py             # State-space layer
-│   ├── attention_layer.py       # Local windowed attention
-│   ├── scigate_ffn.py           # Science-gated feed-forward
-│   ├── vortex_model.py          # Main model class
-│   └── science_modules/         # Specialized science modules
-├── tokenizer/
-│   └── vortex_tokenizer.py      # Custom BPE tokenizer with science vocab
-├── data/
-│   ├── dataset_loader.py        # Open dataset loading (Pile, S2ORC, etc.)
-│   ├── quality_filter.py        # Multi-stage quality filtering
-│   ├── domain_classifier.py     # 7-domain classifier
-│   ├── deduplication.py         # MinHash LSH deduplication
-│   └── scraper.py               # Web scraping (arXiv, PubMed, etc.)
-├── training/
-│   ├── trainer.py               # Main training loop
-│   ├── losses.py                # Science-aware loss functions
-│   └── curriculum.py            # Curriculum learning scheduler
-├── inference/
-│   ├── cuda_optimize.py         # CUDA optimizations (Flash Attention, INT8)
-│   └── mps_optimize.py          # MPS optimizations for Apple Silicon
-├── evaluation/                  # Science benchmarks (coming soon)
-├── configuration_vortex.py      # HF config class
-├── tokenization_vortex.py       # HF tokenizer wrapper
-├── modeling_vortex.py           # HF model integration
-├── train.py                     # Training entry point
-├── inference/inference.py       # Inference entry point
-└── requirements.txt
-```
-## 🚀 Quick Start
-### Installation
-```bash
-# Clone and setup
-cd Vortex
-pip install -r requirements.txt
-# For CUDA optimizations
-pip install flash-attn
-pip install bitsandbytes
-```
-### Training
-```bash
-# Train 7B model on CUDA
-python train.py \
-    --model_size 7b \
-    --device cuda \
-    --data_dir ./data/processed \
-    --output_dir ./checkpoints \
-    --max_steps 100000
-# Train 13B model with INT8 quantization (for 8GB VRAM)
-python train.py \
-    --model_size 13b \
-    --device cuda \
-    --quantization int8 \
-    --data_dir ./data/processed \
-    --output_dir ./checkpoints_13b
-```
-### Inference
-```bash
-# Generate text with 7B model
-python inference/inference.py \
-    --model_path ./checkpoints/latest.pt \
-    --model_size 7b \
-    --device cuda \
-    --prompt "The equation E = mc^2 describes" \
-    --max_new_tokens 100
-# Interactive mode
-python inference/inference.py \
-    --model_path ./checkpoints/latest.pt \
-    --model_size 7b \
-    --device cuda \
-    --interactive
-# On Apple Silicon (MPS)
-python inference/inference.py \
-    --model_path ./checkpoints/latest.pt \
-    --model_size 7b \
-    --use_mps \
-    --prompt "Explain quantum mechanics"
-```
-### HuggingFace Integration
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-# Load model and tokenizer
-model = AutoModelForCausalLM.from_pretrained("./checkpoints")
-tokenizer = AutoTokenizer.from_pretrained("./checkpoints")
-# Generate
-input_text = "The energy of a photon is given by"
-inputs = tokenizer(input_text, return_tensors="pt")
-outputs = model.generate(**inputs, max_new_tokens=50)
-print(tokenizer.decode(outputs[0]))
-```
-## 📊 Data Pipeline
-1. **Open Datasets**: Automatically download from HuggingFace (Pile, S2ORC, Math datasets, PubMed QA)
-2. **Quality Filtering**: Multi-stage checks (length, language, equations, repetition, citations)
-3. **Deduplication**: MinHash LSH for near-duplicate detection
-4. **Domain Classification**: Classify into 7 science domains
-5. **Tokenization**: Custom science-aware BPE tokenizer
-6. **Sharding**: Write to Parquet with statistics
-```python
-from data.dataset_loader import VortexDatasetLoader
-from data.quality_filter import ScienceQualityFilter
-from data.deduplication import MinHashLSH
-# Load and process data
-loader = VortexDatasetLoader()
-quality_filter = ScienceQualityFilter()
-lsh = MinHashLSH()
-# Stream datasets, filter, deduplicate, and shard
-for sample in loader.load_multiple_datasets(["pile_scientific", "automath"]):
-    if quality_filter.filter(sample["text"]):
-        lsh.add_document(sample["id"], sample["text"])
-        # Tokenize and save
-```
-## 🎯 Training Strategy
-### Curriculum Learning
-Training progresses through 4 stages:
-1. **Foundation** (0-20%): Basic science text, simple equations, definitions
-2. **Domain** (20-50%): Domain-specific deep content per science area
-3. **Reasoning** (50-80%): Scientific problem solving, multi-step derivations
-4. **Integration** (80-100%): Cross-domain science, full dataset
-### Science-Aware Loss
-```python
-total_loss = (
-    lm_loss * 1.0              # Standard next token prediction
-    + equation_loss * 0.3      # Equation reconstruction accuracy
-    + domain_loss * 0.1        # Domain classification head
-    + citation_loss * 0.1      # Citation detection accuracy
-    + numerical_loss * 0.2     # Numerical reasoning accuracy
-)
-```
-## ⚙️ Configuration
-### 7B Config (VORTEX_7B_CONFIG)
-- `d_model`: 4096
-- `num_layers`: 32
-- `num_heads`: 32
-- `d_state`: 16
-- `ssm_ratio`: 0.6
-- `vocab_size`: 50000
-- `max_seq_len`: 16384
-### 13B Config (VORTEX_13B_CONFIG)
-- `d_model`: 5120
-- `num_layers`: 40
-- `num_heads`: 40
-- `d_state`: 32
-- `ssm_ratio`: 0.5
-- `vocab_size`: 50000
-- `max_seq_len`: 16384
-## 🔧 Hardware Targets
-### Nvidia 4060 Laptop (8GB VRAM)
-- **7B**: BF16, no quantization, Flash Attention 2, torch.compile
-- **13B**: INT8 quantization, Flash Attention 2, torch.compile
-- Target TPS: 25-40 (7B), 15-25 (13B)
-### Apple Silicon (M2/M3)
-- **7B on M3**: BF16 (via float16), SDPA, no compile
-- **13B on M3 Max**: BF16, unified memory, SDPA
-- Target TPS: 20-35 (7B), 12-20 (13B)
-## 🧪 Science Domains
-1. **Physics** (`[PHYS]`)
-2. **Mathematics** (`[MATH]`)
-3. **Chemistry** (`[CHEM]`)
-4. **Biology** (`[BIO]`)
-5. **Earth Science** (`[EARTH]`)
-6. **Space Science** (`[SPACE]`)
-7. **Zoology** (`[ZOO]`)
-Domain tags can be included in training data to guide the SciGate FFN routing.
-## 📝 Tokenizer
-Custom BPE tokenizer with:
-- 40,000 base BPE tokens trained on scientific corpus
-- 10,000 science-specific tokens:
-  - 500 LaTeX math symbols (`\alpha`, `\sum`, `\int`, etc.)
-  - 118 chemical element symbols
-  - 200 SI and derived units
-  - 300 scientific abbreviations (DNA, RNA, ATP, etc.)
-  - 500 mathematical operators
-  - Amino acid codes
-  - Greek alphabet (α, β, γ, etc.)
-- Special tokens: `[EQUATION]`, `[CITATION]`, `[MOLECULE]`, `[FIGURE]`, `[TABLE]`, domain tags
-## 🧪 Evaluation
-Science benchmarks across all 7 domains will be added. Planned benchmarks:
-- **Physics**: Feynman Questions, Physics GRE
-- **Math**: MATH dataset, GSM8K
-- **Chemistry**: Chemistry problem-solving, molecular property prediction
-- **Biology**: PubMed QA, bioinformatics tasks
-- **Earth Science**: Climate modeling questions
-- **Space Science**: Astronomy problem sets
-- **Zoology**: Species classification, ecological reasoning
-## 📄 License
-This is a school science project. Code is provided for educational purposes.
-## 🙏 Acknowledgments
-- **Mamba** (Gu et al.) for SSM architecture inspiration
-- **Flash Attention** (Dao et al.) for efficient attention
-- **HuggingFace** for transformers library
-- All open scientific data sources: arXiv, PubMed, S2ORC, etc.
-## 📧 Contact
-For questions or issues, please open an issue on GitHub.
----
-**Built with ❤️ for scientific AI research**

+---
+language:
+  - en
+license: mit
+tags:
+  - vortex
+  - science
+  - physics
+  - chemistry
+  - biology
+  - mathematics
+  - ssm
+  - mamba
+  - hybrid-architecture
+  - custom-tokenizer
+  - from-scratch
+  - matrix-corp
+pipeline_tag: text-generation
+library_name: transformers
+model_type: vortex
+---
+# Vortex Scientific
+**Vortex Scientific** is a from-scratch AI model family designed for deep scientific reasoning. It is built from the ground up with a novel hybrid state-space + attention architecture and optimized for consumer laptop hardware (Apple Silicon MacBooks and Nvidia 4060 laptop GPUs).
+## 🌟 Features
+- **Novel Architecture**: Hybrid State-Space Model (SSM) + Local Attention blocks
+- **Science-Specialized**: Custom tokenizer, domain-aware gating, and specialized modules for equations, numerical reasoning, citations, and molecular structures
+- **Hardware Optimized**: Targets 8GB VRAM (4060 laptop) and 16GB unified memory (MacBook Pro M2/M3)
+- **Two Model Sizes**:
+  - **Vortex-7B**: 7 billion parameters, fits in 8GB VRAM
+  - **Vortex-13B**: 13 billion parameters, fits in 16GB VRAM with quantization
+- **HuggingFace Compatible**: Full integration with the `transformers` library
+- **From Scratch**: No base model — everything built bottom-up including tokenizer and weights
+## 🏗️ Architecture
+Vortex uses a two-block hybrid architecture:
+1. **SSM-Only Blocks**: State-space layers for efficient long-context processing (O(n) complexity)
+2. **Attention+Science Blocks**: Local windowed attention + science modules + SciGate FFN
+Layer ratios:
+- 7B: 60% SSM, 40% Attention (pattern: SSM, SSM, Attn, ...)
+- 13B: 50% SSM, 50% Attention (pattern: SSM, Attn, SSM, Attn, ...)
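The layer ratios above can be turned into a concrete block order in several ways. The following is a minimal sketch, assuming the blocks are spread as evenly as possible and that the SSM count is rounded to the nearest whole layer (19 of 32 for the 7B ratio). `build_layer_pattern` is a hypothetical helper, not a function from the repository, and the real configs may order layers differently.

```python
def build_layer_pattern(num_layers: int, ssm_ratio: float) -> list[str]:
    """Interleave 'ssm' and 'attn' blocks so that about ssm_ratio of them are SSM."""
    # Assumption: round the SSM share to a whole number of layers.
    ssm_total = round(num_layers * ssm_ratio)
    pattern: list[str] = []
    ssm_so_far = 0
    for i in range(num_layers):
        # Integer ceiling of (i + 1) * ssm_total / num_layers: the number of
        # SSM blocks that should have been placed by position i.
        target = -(-(i + 1) * ssm_total // num_layers)
        if ssm_so_far < target:
            pattern.append("ssm")
            ssm_so_far += 1
        else:
            pattern.append("attn")
    return pattern


if __name__ == "__main__":
    print(build_layer_pattern(32, 0.6)[:6])
    print(build_layer_pattern(40, 0.5)[:6])
```

With the listed `num_layers` and `ssm_ratio` values, this sketch starts the 7B stack with SSM, SSM, Attn and strictly alternates the 13B stack, which agrees with the patterns given above.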
+### Science Modules
+- **EquationModule**: LaTeX equation detection and structural understanding
+- **NumericalReasoningModule**: Digit-level encoding, scientific notation, unit awareness
+- **CitationModule**: Citation span detection, provenance tracking, confidence scoring
+- **MolecularModule**: Element embeddings, SMILES understanding, amino acid sequences
+## 📦 Project Structure
+```
+Vortex/
+├── configs/
+│   ├── vortex_7b_config.py      # 7B model configuration
+│   ├── vortex_13b_config.py     # 13B model configuration
+│   └── training_config.py       # Training hyperparameters
+├── models/
+│   ├── ssm_layer.py             # State-space layer
+│   ├── attention_layer.py       # Local windowed attention
+│   ├── scigate_ffn.py           # Science-gated feed-forward
+│   ├── vortex_model.py          # Main model class
+│   └── science_modules/         # Specialized science modules
+├── tokenizer/
+│   └── vortex_tokenizer.py      # Custom BPE tokenizer with science vocab
+├── data/
+│   ├── dataset_loader.py        # Open dataset loading (Pile, S2ORC, etc.)
+│   ├── quality_filter.py        # Multi-stage quality filtering
+│   ├── domain_classifier.py     # 7-domain classifier
+│   ├── deduplication.py         # MinHash LSH deduplication
+│   └── scraper.py               # Web scraping (arXiv, PubMed, etc.)
+├── training/
+│   ├── trainer.py               # Main training loop
+│   ├── losses.py                # Science-aware loss functions
+│   └── curriculum.py            # Curriculum learning scheduler
+├── inference/
+│   ├── cuda_optimize.py         # CUDA optimizations (Flash Attention, INT8)
+│   ├── mps_optimize.py          # MPS optimizations for Apple Silicon
+│   └── inference.py             # Inference entry point
+├── evaluation/                  # Science benchmarks (coming soon)
+├── configuration_vortex.py      # HF config class
+├── tokenization_vortex.py       # HF tokenizer wrapper
+├── modeling_vortex.py           # HF model integration
+├── train.py                     # Training entry point
+└── requirements.txt
+```
+## 🚀 Quick Start
+### Installation
+```bash
+# Set up from a local clone of the repository
+cd Vortex
+pip install -r requirements.txt
+# For CUDA optimizations
+pip install flash-attn
+pip install bitsandbytes
+```
+### Training
+```bash
+# Train 7B model on CUDA
+python train.py \
+    --model_size 7b \
+    --device cuda \
+    --data_dir ./data/processed \
+    --output_dir ./checkpoints \
+    --max_steps 100000
+# Train 13B model with INT8 quantization (for 8GB VRAM)
+python train.py \
+    --model_size 13b \
+    --device cuda \
+    --quantization int8 \
+    --data_dir ./data/processed \
+    --output_dir ./checkpoints_13b
+```
+### Inference
+```bash
+# Generate text with 7B model
+python inference/inference.py \
+    --model_path ./checkpoints/latest.pt \
+    --model_size 7b \
+    --device cuda \
+    --prompt "The equation E = mc^2 describes" \
+    --max_new_tokens 100
+# Interactive mode
+python inference/inference.py \
+    --model_path ./checkpoints/latest.pt \
+    --model_size 7b \
+    --device cuda \
+    --interactive
+# On Apple Silicon (MPS)
+python inference/inference.py \
+    --model_path ./checkpoints/latest.pt \
+    --model_size 7b \
+    --use_mps \
+    --prompt "Explain quantum mechanics"
+```
+### HuggingFace Integration
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+# Load model and tokenizer
+model = AutoModelForCausalLM.from_pretrained("./checkpoints")
+tokenizer = AutoTokenizer.from_pretrained("./checkpoints")
+# Generate
+input_text = "The energy of a photon is given by"
+inputs = tokenizer(input_text, return_tensors="pt")
+outputs = model.generate(**inputs, max_new_tokens=50)
+print(tokenizer.decode(outputs[0]))
+```
+## 📊 Data Pipeline
+1. **Open Datasets**: Automatically download from HuggingFace (Pile, S2ORC, Math datasets, PubMed QA)
+2. **Quality Filtering**: Multi-stage checks (length, language, equations, repetition, citations)
+3. **Deduplication**: MinHash LSH for near-duplicate detection
+4. **Domain Classification**: Classify into 7 science domains
+5. **Tokenization**: Custom science-aware BPE tokenizer
+6. **Sharding**: Write to Parquet with statistics
+```python
+from data.dataset_loader import VortexDatasetLoader
+from data.quality_filter import ScienceQualityFilter
+from data.deduplication import MinHashLSH
+# Load and process data
+loader = VortexDatasetLoader()
+quality_filter = ScienceQualityFilter()
+lsh = MinHashLSH()
+# Stream datasets, filter, deduplicate, and shard
+for sample in loader.load_multiple_datasets(["pile_scientific", "automath"]):
+    if quality_filter.filter(sample["text"]):
+        lsh.add_document(sample["id"], sample["text"])
+        # Tokenize and save
+```
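The deduplication step relies on MinHash signatures to estimate how similar two documents are. The internals of the repository's `MinHashLSH` class are not shown here, so the following is an independent, minimal sketch of the signature part only (no LSH banding). The shingle size, the number of hash functions, and all function names are assumptions chosen for illustration.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of k-word shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def _hash(shingle: str, seed: int) -> int:
    # Salting BLAKE2b with the seed stands in for a family of hash functions.
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text: str, num_perm: int = 64) -> list[int]:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    doc_shingles = shingles(text)
    return [min(_hash(s, seed) for s in doc_shingles) for seed in range(num_perm)]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    a = "the energy of a photon is given by the planck constant times its frequency"
    b = "the energy of a photon is given by the planck constant times its wavelength"
    print(estimated_jaccard(minhash_signature(a), minhash_signature(b)))
```

A full LSH index would additionally split each signature into bands and bucket documents by band, so that only documents sharing a bucket are compared.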
+## 🎯 Training Strategy
+### Curriculum Learning
+Training progresses through 4 stages:
+1. **Foundation** (0-20%): Basic science text, simple equations, definitions
+2. **Domain** (20-50%): Domain-specific deep content per science area
+3. **Reasoning** (50-80%): Scientific problem solving, multi-step derivations
+4. **Integration** (80-100%): Cross-domain science, full dataset
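The stage boundaries above map directly onto training progress. This is a minimal sketch of such a lookup, assuming progress is measured as `step / max_steps` and that a step exactly on a boundary belongs to the later stage. `curriculum_stage` is a hypothetical name; the scheduler in `training/curriculum.py` may work differently.

```python
# Upper progress bound (exclusive) and name of each curriculum stage.
STAGES = [
    (0.20, "foundation"),
    (0.50, "domain"),
    (0.80, "reasoning"),
    (1.00, "integration"),
]


def curriculum_stage(step: int, max_steps: int) -> str:
    """Return the curriculum stage for a given training step."""
    progress = min(max(step / max_steps, 0.0), 1.0)
    for upper, name in STAGES:
        if progress < upper:
            return name
    # progress == 1.0 falls through to the final stage.
    return STAGES[-1][1]


if __name__ == "__main__":
    for step in (0, 20_000, 60_000, 100_000):
        print(step, curriculum_stage(step, 100_000))
```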
+### Science-Aware Loss
+```python
+total_loss = (
+    lm_loss * 1.0              # Standard next token prediction
+    + equation_loss * 0.3      # Equation reconstruction accuracy
+    + domain_loss * 0.1        # Domain classification head
+    + citation_loss * 0.1      # Citation detection accuracy
+    + numerical_loss * 0.2     # Numerical reasoning accuracy
+)
+```
+## ⚙️ Configuration
+### 7B Config (VORTEX_7B_CONFIG)
+- `d_model`: 4096
+- `num_layers`: 32
+- `num_heads`: 32
+- `d_state`: 16
+- `ssm_ratio`: 0.6
+- `vocab_size`: 50000
+- `max_seq_len`: 16384
+### 13B Config (VORTEX_13B_CONFIG)
+- `d_model`: 5120
+- `num_layers`: 40
+- `num_heads`: 40
+- `d_state`: 32
+- `ssm_ratio`: 0.5
+- `vocab_size`: 50000
+- `max_seq_len`: 16384
+## 🔧 Hardware Targets
+### Nvidia 4060 Laptop (8GB VRAM)
+- **7B**: BF16, no quantization, Flash Attention 2, torch.compile
+- **13B**: INT8 quantization, Flash Attention 2, torch.compile
+- Target TPS: 25-40 (7B), 15-25 (13B)
+### Apple Silicon (M2/M3)
+- **7B on M3**: FP16 (float16 in place of BF16), SDPA, no torch.compile
+- **13B on M3 Max**: BF16, unified memory, SDPA
+- Target TPS: 20-35 (7B), 12-20 (13B)
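A rough, weights-only check of the memory figures can be sketched as follows. It assumes nominal parameter counts of 7e9 and 13e9 and counts only weight storage; activations, attention caches, and SSM state are ignored. `weight_memory_gb` is a hypothetical helper, not part of the repository.

```python
# Bytes needed to store one parameter at each precision named in this section.
BYTES_PER_PARAM = {"bf16": 2.0, "fp16": 2.0, "int8": 1.0}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Weight storage in gigabytes (1 GB = 1e9 bytes), excluding runtime overhead."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for name, params, dtype in [("7B", 7e9, "bf16"), ("13B", 13e9, "int8")]:
        print(f"{name} @ {dtype}: {weight_memory_gb(params, dtype):.1f} GB")
```

Under these assumptions the weights alone come to roughly 14 GB for 7B at BF16 and 13 GB for 13B at INT8, so the 8GB targets should be checked against measured checkpoint sizes.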
+## 🧪 Science Domains
+1. **Physics** (`[PHYS]`)
+2. **Mathematics** (`[MATH]`)
+3. **Chemistry** (`[CHEM]`)
+4. **Biology** (`[BIO]`)
+5. **Earth Science** (`[EARTH]`)
+6. **Space Science** (`[SPACE]`)
+7. **Zoology** (`[ZOO]`)
+Domain tags can be included in training data to guide the SciGate FFN routing.
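One simple way to include a domain tag is to prepend it to each training sample. The sketch below assumes the tag is followed by a single space; the mapping keys and the `tag_sample` helper are illustrative and not taken from the repository.

```python
# Tag strings as listed in the Science Domains section.
DOMAIN_TAGS = {
    "physics": "[PHYS]",
    "mathematics": "[MATH]",
    "chemistry": "[CHEM]",
    "biology": "[BIO]",
    "earth_science": "[EARTH]",
    "space_science": "[SPACE]",
    "zoology": "[ZOO]",
}


def tag_sample(text: str, domain: str) -> str:
    """Prefix a training sample with its domain tag (raises KeyError if unknown)."""
    return f"{DOMAIN_TAGS[domain]} {text}"


if __name__ == "__main__":
    print(tag_sample("The energy of a photon is given by E = hf.", "physics"))
```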
+## 📝 Tokenizer
+Custom BPE tokenizer with:
+- 40,000 base BPE tokens trained on a scientific corpus
+- 10,000 science-specific tokens:
+  - 500 LaTeX math symbols (`\alpha`, `\sum`, `\int`, etc.)
+  - 118 chemical element symbols
+  - 200 SI and derived units
+  - 300 scientific abbreviations (DNA, RNA, ATP, etc.)
+  - 500 mathematical operators
+  - Amino acid codes
+  - Greek alphabet (α, β, γ, etc.)
+- Special tokens: `[EQUATION]`, `[CITATION]`, `[MOLECULE]`, `[FIGURE]`, `[TABLE]`, domain tags
+## 🧪 Evaluation
+Science benchmarks across all 7 domains will be added. Planned benchmarks:
+- **Physics**: Feynman Questions, Physics GRE
+- **Math**: MATH dataset, GSM8K
+- **Chemistry**: Chemistry problem-solving, molecular property prediction
+- **Biology**: PubMed QA, bioinformatics tasks
+- **Earth Science**: Climate modeling questions
+- **Space Science**: Astronomy problem sets
+- **Zoology**: Species classification, ecological reasoning
+## 📄 License
+This is a school science project. The code is provided for educational purposes under the MIT license declared in the model card metadata.
+## 🙏 Acknowledgments
+- **Mamba** (Gu et al.) for SSM architecture inspiration
+- **Flash Attention** (Dao et al.) for efficient attention
+- **HuggingFace** for the `transformers` library
+- All open scientific data sources: arXiv, PubMed, S2ORC, etc.
+## 📧 Contact
+For questions or issues, please open an issue on GitHub.
+---
+**Built with ❤️ for scientific AI research**