---
language:
- en
license: mit
tags:
- vortex
- science
- physics
- chemistry
- biology
- mathematics
- ssm
- mamba
- hybrid-architecture
- custom-tokenizer
- from-scratch
- matrix-corp
pipeline_tag: text-generation
library_name: transformers
model_type: vortex
---

# Vortex Scientific

**Vortex Scientific** is a from-scratch AI model family designed for deep scientific reasoning. It is built from the ground up around a novel hybrid state-space + attention architecture and optimized for consumer laptop hardware (Apple Silicon MacBooks and Nvidia RTX 4060 laptop GPUs).

## 🌟 Features

- **Novel Architecture**: Hybrid State-Space Model (SSM) + local attention blocks
- **Science-Specialized**: Custom tokenizer, domain-aware gating, and specialized modules for equations, numerical reasoning, citations, and molecular structures
- **Hardware Optimized**: Runs smoothly on 8GB VRAM (RTX 4060 laptop GPU) and 16GB unified memory (MacBook Pro M2/M3)
- **Two Model Sizes**:
  - **Vortex-7B**: 7 billion parameters, fits in 8GB VRAM
  - **Vortex-13B**: 13 billion parameters, fits in 16GB VRAM with quantization
- **HuggingFace Compatible**: Full integration with the `transformers` library
- **From Scratch**: No base model; everything is built bottom-up, including the tokenizer and weights

## 🏗️ Architecture

Vortex uses a two-block hybrid architecture:

1. **SSM-Only Blocks**: State-space layers for efficient long-context processing (O(n) complexity)
2. **Attention+Science Blocks**: Local windowed attention + science modules + SciGate FFN

Layer ratios:

- 7B: 60% SSM, 40% attention (pattern: SSM, SSM, Attn, ...)
- 13B: 50% SSM, 50% attention (pattern: SSM, Attn, SSM, Attn, ...)
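
As a rough illustration, a layer pattern like the ones above can be generated from `num_layers` and `ssm_ratio` by error diffusion. This is only a sketch; the function name and logic are assumptions, and the real construction lives in `models/vortex_model.py`:

```python
def layer_pattern(num_layers: int, ssm_ratio: float) -> list[str]:
    """Interleave SSM and attention blocks so that roughly
    ssm_ratio of all layers are SSM-only blocks."""
    pattern = []
    attn_budget = 0.0
    for _ in range(num_layers):
        attn_budget += 1.0 - ssm_ratio
        if attn_budget >= 1.0 - 1e-9:  # time for an attention block
            pattern.append("attn")
            attn_budget -= 1.0
        else:
            pattern.append("ssm")
    return pattern

# ssm_ratio=0.6 yields SSM, SSM, Attn, SSM, Attn, ...
# ssm_ratio=0.5 yields SSM, Attn, SSM, Attn, ...
```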

### Science Modules

- **EquationModule**: LaTeX equation detection and structural understanding
- **NumericalReasoningModule**: Digit-level encoding, scientific notation, unit awareness
- **CitationModule**: Citation span detection, provenance tracking, confidence scoring
- **MolecularModule**: Element embeddings, SMILES understanding, amino acid sequences
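
To make the EquationModule's first step concrete, here is a minimal, hypothetical sketch of LaTeX equation-span detection over raw text. The regex, pattern, and function name are illustrative assumptions, not the actual implementation, which operates inside the model:

```python
import re

# Matches display math ($$...$$) or inline math ($...$); illustrative only
EQ_PATTERN = re.compile(r"\$\$.+?\$\$|\$.+?\$", re.DOTALL)

def find_equation_spans(text: str) -> list[tuple[int, int]]:
    """Return (start, end) character spans of LaTeX math segments."""
    return [m.span() for m in EQ_PATTERN.finditer(text)]
```

Spans found this way could then be wrapped in the `[EQUATION]` special token before tokenization.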

## 📦 Project Structure

```
Vortex/
├── configs/
│   ├── vortex_7b_config.py   # 7B model configuration
│   ├── vortex_13b_config.py  # 13B model configuration
│   └── training_config.py    # Training hyperparameters
├── models/
│   ├── ssm_layer.py          # State-space layer
│   ├── attention_layer.py    # Local windowed attention
│   ├── scigate_ffn.py        # Science-gated feed-forward
│   ├── vortex_model.py       # Main model class
│   └── science_modules/      # Specialized science modules
├── tokenizer/
│   └── vortex_tokenizer.py   # Custom BPE tokenizer with science vocab
├── data/
│   ├── dataset_loader.py     # Open dataset loading (Pile, S2ORC, etc.)
│   ├── quality_filter.py     # Multi-stage quality filtering
│   ├── domain_classifier.py  # 7-domain classifier
│   ├── deduplication.py      # MinHash LSH deduplication
│   └── scraper.py            # Web scraping (arXiv, PubMed, etc.)
├── training/
│   ├── trainer.py            # Main training loop
│   ├── losses.py             # Science-aware loss functions
│   └── curriculum.py         # Curriculum learning scheduler
├── inference/
│   ├── inference.py          # Inference entry point
│   ├── cuda_optimize.py      # CUDA optimizations (Flash Attention, INT8)
│   └── mps_optimize.py       # MPS optimizations for Apple Silicon
├── evaluation/               # Science benchmarks (coming soon)
├── configuration_vortex.py   # HF config class
├── tokenization_vortex.py    # HF tokenizer wrapper
├── modeling_vortex.py        # HF model integration
├── train.py                  # Training entry point
└── requirements.txt
```

## 🚀 Quick Start

### Installation

```bash
# Clone and set up
cd Vortex
pip install -r requirements.txt

# For CUDA optimizations
pip install flash-attn
pip install bitsandbytes
```

### Training

```bash
# Train the 7B model on CUDA
python train.py \
    --model_size 7b \
    --device cuda \
    --data_dir ./data/processed \
    --output_dir ./checkpoints \
    --max_steps 100000

# Train the 13B model with INT8 quantization (for 8GB VRAM)
python train.py \
    --model_size 13b \
    --device cuda \
    --quantization int8 \
    --data_dir ./data/processed \
    --output_dir ./checkpoints_13b
```

### Inference

```bash
# Generate text with the 7B model
python inference/inference.py \
    --model_path ./checkpoints/latest.pt \
    --model_size 7b \
    --device cuda \
    --prompt "The equation E = mc^2 describes" \
    --max_new_tokens 100

# Interactive mode
python inference/inference.py \
    --model_path ./checkpoints/latest.pt \
    --model_size 7b \
    --device cuda \
    --interactive

# On Apple Silicon (MPS)
python inference/inference.py \
    --model_path ./checkpoints/latest.pt \
    --model_size 7b \
    --use_mps \
    --prompt "Explain quantum mechanics"
```

### HuggingFace Integration

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer; a custom model_type like "vortex"
# requires trust_remote_code=True
model = AutoModelForCausalLM.from_pretrained("./checkpoints", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("./checkpoints", trust_remote_code=True)

# Generate
input_text = "The energy of a photon is given by"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
```

## 📊 Data Pipeline

1. **Open Datasets**: Automatically download from HuggingFace (the Pile, S2ORC, math datasets, PubMedQA)
2. **Quality Filtering**: Multi-stage checks (length, language, equations, repetition, citations)
3. **Deduplication**: MinHash LSH for near-duplicate detection
4. **Domain Classification**: Classify into 7 science domains
5. **Tokenization**: Custom science-aware BPE tokenizer
6. **Sharding**: Write to Parquet with statistics

```python
from data.dataset_loader import VortexDatasetLoader
from data.quality_filter import ScienceQualityFilter
from data.deduplication import MinHashLSH

# Set up the pipeline stages
loader = VortexDatasetLoader()
quality_filter = ScienceQualityFilter()
lsh = MinHashLSH()

# Stream datasets, filter, deduplicate, and shard
for sample in loader.load_multiple_datasets(["pile_scientific", "automath"]):
    if quality_filter.filter(sample["text"]):
        lsh.add_document(sample["id"], sample["text"])
        # Tokenize and save
```

## 🎯 Training Strategy

### Curriculum Learning

Training progresses through 4 stages:

1. **Foundation** (0-20%): Basic science text, simple equations, definitions
2. **Domain** (20-50%): Domain-specific deep content per science area
3. **Reasoning** (50-80%): Scientific problem solving, multi-step derivations
4. **Integration** (80-100%): Cross-domain science, full dataset
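
A minimal sketch of how the scheduler in `training/curriculum.py` might map training progress to one of the four stages (the function name and stage labels are illustrative assumptions):

```python
def curriculum_stage(step: int, max_steps: int) -> str:
    """Map training progress (step / max_steps) to a curriculum stage."""
    progress = step / max_steps
    if progress < 0.2:
        return "foundation"
    if progress < 0.5:
        return "domain"
    if progress < 0.8:
        return "reasoning"
    return "integration"
```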

### Science-Aware Loss

```python
total_loss = (
    lm_loss * 1.0          # Standard next-token prediction
    + equation_loss * 0.3  # Equation reconstruction accuracy
    + domain_loss * 0.1    # Domain classification head
    + citation_loss * 0.1  # Citation detection accuracy
    + numerical_loss * 0.2 # Numerical reasoning accuracy
)
```

## ⚙️ Configuration

### 7B Config (`VORTEX_7B_CONFIG`)

- `d_model`: 4096
- `num_layers`: 32
- `num_heads`: 32
- `d_state`: 16
- `ssm_ratio`: 0.6
- `vocab_size`: 50000
- `max_seq_len`: 16384

### 13B Config (`VORTEX_13B_CONFIG`)

- `d_model`: 5120
- `num_layers`: 40
- `num_heads`: 40
- `d_state`: 32
- `ssm_ratio`: 0.5
- `vocab_size`: 50000
- `max_seq_len`: 16384
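
For illustration, the two configs could be expressed as a single dataclass like the one below. This is a sketch; the real definitions live in `configs/vortex_7b_config.py` and `configs/vortex_13b_config.py` and may be shaped differently:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VortexConfig:
    """Hyperparameters shared by all Vortex model sizes."""
    d_model: int
    num_layers: int
    num_heads: int
    d_state: int
    ssm_ratio: float
    vocab_size: int = 50_000
    max_seq_len: int = 16_384

VORTEX_7B_CONFIG = VortexConfig(d_model=4096, num_layers=32,
                                num_heads=32, d_state=16, ssm_ratio=0.6)
VORTEX_13B_CONFIG = VortexConfig(d_model=5120, num_layers=40,
                                 num_heads=40, d_state=32, ssm_ratio=0.5)
```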

## 🔧 Hardware Targets

### Nvidia RTX 4060 Laptop GPU (8GB VRAM)

- **7B**: BF16, no quantization, Flash Attention 2, `torch.compile`
- **13B**: INT8 quantization, Flash Attention 2, `torch.compile`
- Target throughput: 25-40 tokens/s (7B), 15-25 tokens/s (13B)

### Apple Silicon (M2/M3)

- **7B on M3**: BF16 (via float16), SDPA, no compile
- **13B on M3 Max**: BF16, unified memory, SDPA
- Target throughput: 20-35 tokens/s (7B), 12-20 tokens/s (13B)

## 🧪 Science Domains

1. **Physics** (`[PHYS]`)
2. **Mathematics** (`[MATH]`)
3. **Chemistry** (`[CHEM]`)
4. **Biology** (`[BIO]`)
5. **Earth Science** (`[EARTH]`)
6. **Space Science** (`[SPACE]`)
7. **Zoology** (`[ZOO]`)

Domain tags can be included in training data to guide the SciGate FFN routing.

## 📝 Tokenizer

Custom BPE tokenizer with:

- 40,000 base BPE tokens trained on a scientific corpus
- 10,000 science-specific tokens:
  - 500 LaTeX math symbols (`\alpha`, `\sum`, `\int`, etc.)
  - 118 chemical element symbols
  - 200 SI and derived units
  - 300 scientific abbreviations (DNA, RNA, ATP, etc.)
  - 500 mathematical operators
  - Amino acid codes
  - Greek alphabet (α, β, γ, etc.)
- Special tokens: `[EQUATION]`, `[CITATION]`, `[MOLECULE]`, `[FIGURE]`, `[TABLE]`, domain tags
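
As a hedged sketch of how the science vocabulary might be appended after the 40,000 base BPE ids (the helper name and the abbreviated token lists are illustrative assumptions; see `tokenizer/vortex_tokenizer.py` for the real construction):

```python
def build_science_vocab(base_vocab_size: int = 40_000) -> dict[str, int]:
    """Assign ids to science tokens, starting right after the base BPE vocab."""
    special = ["[EQUATION]", "[CITATION]", "[MOLECULE]", "[FIGURE]", "[TABLE]"]
    domains = ["[PHYS]", "[MATH]", "[CHEM]", "[BIO]", "[EARTH]", "[SPACE]", "[ZOO]"]
    latex = [r"\alpha", r"\sum", r"\int"]            # sample of the ~500 LaTeX symbols
    greek = [chr(c) for c in range(0x03B1, 0x03CA)]  # α through ω
    tokens = special + domains + latex + greek
    return {tok: base_vocab_size + i for i, tok in enumerate(tokens)}
```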

## 🧪 Evaluation

Science benchmarks across all 7 domains will be added. Planned benchmarks:

- **Physics**: Feynman questions, Physics GRE
- **Math**: MATH dataset, GSM8K
- **Chemistry**: Chemistry problem solving, molecular property prediction
- **Biology**: PubMedQA, bioinformatics tasks
- **Earth Science**: Climate modeling questions
- **Space Science**: Astronomy problem sets
- **Zoology**: Species classification, ecological reasoning
|
| | ## 📄 License |
| |
|
| | This is a school science project. Code is provided for educational purposes. |
| |
|
| | ## 🙏 Acknowledgments |
| |
|
| | - **Mamba** (Gu et al.) for SSM architecture inspiration |
| | - **Flash Attention** (Dao et al.) for efficient attention |
| | - **HuggingFace** for transformers library |
| | - All open scientific data sources: arXiv, PubMed, S2ORC, etc. |
| |
|
| | ## 📧 Contact |
| |
|
| | For questions or issues, please open an issue on GitHub. |
| |
|
| | --- |
| |
|
| | **Built with ❤️ for scientific AI research** |
| |
|