---
language:
- en
license: mit
tags:
- vortex
- science
- physics
- chemistry
- biology
- mathematics
- ssm
- mamba
- hybrid-architecture
- custom-tokenizer
- from-scratch
- matrix-corp
pipeline_tag: text-generation
library_name: transformers
model_type: vortex
---

# Vortex Scientific

**Vortex Scientific** is an AI model family for deep scientific reasoning, built from scratch around a novel hybrid state-space + attention architecture and optimized for consumer laptop hardware (Apple Silicon MacBooks and Nvidia 4060 laptop GPUs).

## 🌟 Features

- **Novel Architecture**: Hybrid State-Space Model (SSM) + local attention blocks
- **Science-Specialized**: Custom tokenizer, domain-aware gating, and specialized modules for equations, numerical reasoning, citations, and molecular structures
- **Hardware Optimized**: Runs smoothly on 8GB VRAM (4060 laptop) and 16GB unified memory (MacBook Pro M2/M3)
- **Two Model Sizes**:
  - **Vortex-7B**: 7 billion parameters, fits in 8GB VRAM
  - **Vortex-13B**: 13 billion parameters, fits in 16GB VRAM with quantization
- **HuggingFace Compatible**: Full integration with the `transformers` library
- **From Scratch**: No base model; everything, including the tokenizer and weights, is built bottom-up

## 🏗️ Architecture

Vortex uses a two-block hybrid architecture:

1. **SSM-Only Blocks**: State-space layers for efficient long-context processing (O(n) complexity)
2. **Attention+Science Blocks**: Local windowed attention + science modules + SciGate FFN

Layer ratios (a sketch of the interleaving follows this list):

- 7B: 60% SSM, 40% Attention (pattern: SSM, SSM, Attn, ...)
- 13B: 50% SSM, 50% Attention (pattern: SSM, Attn, SSM, Attn, ...)
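The interleaving is governed by `ssm_ratio` in the model configs. The sketch below is only an illustration of one way such a schedule can be derived from that ratio; `build_layer_pattern` is a hypothetical helper, not the implementation in `models/vortex_model.py`.

```python
import math

# Illustrative sketch only: derive an SSM/attention layer schedule from ssm_ratio.
# The real schedule lives in the Vortex configs and models/vortex_model.py.

def build_layer_pattern(num_layers: int, ssm_ratio: float) -> list[str]:
    """Interleave blocks so that roughly ssm_ratio of the layers are SSM blocks."""
    pattern = []
    for i in range(1, num_layers + 1):
        # Place an SSM block whenever the cumulative SSM quota ticks up by one.
        if math.ceil(i * ssm_ratio) > math.ceil((i - 1) * ssm_ratio):
            pattern.append("ssm")
        else:
            pattern.append("attention")
    return pattern

print(build_layer_pattern(4, 0.5))  # ['ssm', 'attention', 'ssm', 'attention']
print(build_layer_pattern(6, 0.6))  # ['ssm', 'ssm', 'attention', 'ssm', 'attention', 'ssm']
```

Spreading the SSM blocks evenly through the stack (rather than grouping them all at the bottom) keeps local attention available at every depth of the network.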
### Science Modules

- **EquationModule**: LaTeX equation detection and structural understanding
- **NumericalReasoningModule**: Digit-level encoding, scientific notation, unit awareness
- **CitationModule**: Citation span detection, provenance tracking, confidence scoring
- **MolecularModule**: Element embeddings, SMILES understanding, amino acid sequences

## 📦 Project Structure

```
Vortex/
├── configs/
│   ├── vortex_7b_config.py      # 7B model configuration
│   ├── vortex_13b_config.py     # 13B model configuration
│   └── training_config.py       # Training hyperparameters
├── models/
│   ├── ssm_layer.py             # State-space layer
│   ├── attention_layer.py       # Local windowed attention
│   ├── scigate_ffn.py           # Science-gated feed-forward
│   ├── vortex_model.py          # Main model class
│   └── science_modules/         # Specialized science modules
├── tokenizer/
│   └── vortex_tokenizer.py      # Custom BPE tokenizer with science vocab
├── data/
│   ├── dataset_loader.py        # Open dataset loading (Pile, S2ORC, etc.)
│   ├── quality_filter.py        # Multi-stage quality filtering
│   ├── domain_classifier.py     # 7-domain classifier
│   ├── deduplication.py         # MinHash LSH deduplication
│   └── scraper.py               # Web scraping (arXiv, PubMed, etc.)
├── training/
│   ├── trainer.py               # Main training loop
│   ├── losses.py                # Science-aware loss functions
│   └── curriculum.py            # Curriculum learning scheduler
├── inference/
│   ├── cuda_optimize.py         # CUDA optimizations (Flash Attention, INT8)
│   └── mps_optimize.py          # MPS optimizations for Apple Silicon
├── evaluation/                  # Science benchmarks (coming soon)
├── configuration_vortex.py      # HF config class
├── tokenization_vortex.py       # HF tokenizer wrapper
├── modeling_vortex.py           # HF model integration
├── train.py                     # Training entry point
├── inference/inference.py       # Inference entry point
└── requirements.txt
```

## 🚀 Quick Start

### Installation

```bash
# Clone and setup
cd Vortex
pip install -r requirements.txt

# For CUDA optimizations
pip install flash-attn
pip install bitsandbytes
```

### Training

```bash
# Train 7B model on CUDA
python train.py \
    --model_size 7b \
    --device cuda \
    --data_dir ./data/processed \
    --output_dir ./checkpoints \
    --max_steps 100000

# Train 13B model with INT8 quantization (for 8GB VRAM)
python train.py \
    --model_size 13b \
    --device cuda \
    --quantization int8 \
    --data_dir ./data/processed \
    --output_dir ./checkpoints_13b
```

### Inference

```bash
# Generate text with 7B model
python inference/inference.py \
    --model_path ./checkpoints/latest.pt \
    --model_size 7b \
    --device cuda \
    --prompt "The equation E = mc^2 describes" \
    --max_new_tokens 100

# Interactive mode
python inference/inference.py \
    --model_path ./checkpoints/latest.pt \
    --model_size 7b \
    --device cuda \
    --interactive

# On Apple Silicon (MPS)
python inference/inference.py \
    --model_path ./checkpoints/latest.pt \
    --model_size 7b \
    --use_mps \
    --prompt "Explain quantum mechanics"
```

### HuggingFace Integration

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("./checkpoints")
tokenizer = AutoTokenizer.from_pretrained("./checkpoints")

# Generate
input_text = "The energy of a photon is given by"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
```

## 📊 Data Pipeline

1. **Open Datasets**: Automatically download from HuggingFace (Pile, S2ORC, Math datasets, PubMed QA)
2. **Quality Filtering**: Multi-stage checks (length, language, equations, repetition, citations)
3. **Deduplication**: MinHash LSH for near-duplicate detection
4. **Domain Classification**: Classify into 7 science domains
5. **Tokenization**: Custom science-aware BPE tokenizer
6. **Sharding**: Write to Parquet with statistics

```python
from data.dataset_loader import VortexDatasetLoader
from data.quality_filter import ScienceQualityFilter
from data.deduplication import MinHashLSH

# Load and process data
loader = VortexDatasetLoader()
quality_filter = ScienceQualityFilter()
lsh = MinHashLSH()

# Stream datasets, filter, deduplicate, and shard
for sample in loader.load_multiple_datasets(["pile_scientific", "automath"]):
    if quality_filter.filter(sample["text"]):
        lsh.add_document(sample["id"], sample["text"])
        # Tokenize and save
```

## 🎯 Training Strategy

### Curriculum Learning

Training progresses through 4 stages (see the sketch after this list):

1. **Foundation** (0-20%): Basic science text, simple equations, definitions
2. **Domain** (20-50%): Domain-specific deep content per science area
3. **Reasoning** (50-80%): Scientific problem solving, multi-step derivations
4. **Integration** (80-100%): Cross-domain science, full dataset
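The stage scheduler is implemented in `training/curriculum.py`. The sketch below is only a rough picture of stage selection using the percentage boundaries above; the names `CURRICULUM_STAGES` and `curriculum_stage` are assumptions for illustration, not the repository's API.

```python
# Illustrative sketch only: map training progress to a curriculum stage.
# The actual scheduler lives in training/curriculum.py.

CURRICULUM_STAGES = [
    (0.20, "foundation"),   # 0-20%:   basic science text, simple equations, definitions
    (0.50, "domain"),       # 20-50%:  domain-specific deep content
    (0.80, "reasoning"),    # 50-80%:  problem solving, multi-step derivations
    (1.00, "integration"),  # 80-100%: cross-domain science, full dataset
]

def curriculum_stage(step: int, max_steps: int) -> str:
    """Return the curriculum stage name for the current training step."""
    progress = step / max_steps
    for upper_bound, stage in CURRICULUM_STAGES:
        if progress <= upper_bound:
            return stage
    return CURRICULUM_STAGES[-1][1]  # beyond 100% (e.g. extended runs) stays in integration

print(curriculum_stage(30_000, 100_000))  # -> domain
print(curriculum_stage(90_000, 100_000))  # -> integration
```

In practice a scheduler like this would typically also control which data shards are sampled during each stage.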
### Science-Aware Loss

```python
total_loss = (
    lm_loss * 1.0          # Standard next-token prediction
    + equation_loss * 0.3  # Equation reconstruction accuracy
    + domain_loss * 0.1    # Domain classification head
    + citation_loss * 0.1  # Citation detection accuracy
    + numerical_loss * 0.2 # Numerical reasoning accuracy
)
```

## ⚙️ Configuration

### 7B Config (VORTEX_7B_CONFIG)

- `d_model`: 4096
- `num_layers`: 32
- `num_heads`: 32
- `d_state`: 16
- `ssm_ratio`: 0.6
- `vocab_size`: 50000
- `max_seq_len`: 16384

### 13B Config (VORTEX_13B_CONFIG)

- `d_model`: 5120
- `num_layers`: 40
- `num_heads`: 40
- `d_state`: 32
- `ssm_ratio`: 0.5
- `vocab_size`: 50000
- `max_seq_len`: 16384

## 🔧 Hardware Targets

### Nvidia 4060 Laptop (8GB VRAM)

- **7B**: BF16, no quantization, Flash Attention 2, torch.compile
- **13B**: INT8 quantization, Flash Attention 2, torch.compile
- Target TPS (tokens/sec): 25-40 (7B), 15-25 (13B)

### Apple Silicon (M2/M3)

- **7B on M3**: BF16 (via float16), SDPA, no compile
- **13B on M3 Max**: BF16, unified memory, SDPA
- Target TPS (tokens/sec): 20-35 (7B), 12-20 (13B)

## 🧪 Science Domains

1. **Physics** (`[PHYS]`)
2. **Mathematics** (`[MATH]`)
3. **Chemistry** (`[CHEM]`)
4. **Biology** (`[BIO]`)
5. **Earth Science** (`[EARTH]`)
6. **Space Science** (`[SPACE]`)
7. **Zoology** (`[ZOO]`)

Domain tags can be included in training data to guide the SciGate FFN routing.

## 📝 Tokenizer

Custom BPE tokenizer with:

- 40,000 base BPE tokens trained on a scientific corpus
- 10,000 science-specific tokens:
  - 500 LaTeX math symbols (`\alpha`, `\sum`, `\int`, etc.)
  - 118 chemical element symbols
  - 200 SI and derived units
  - 300 scientific abbreviations (DNA, RNA, ATP, etc.)
  - 500 mathematical operators
  - Amino acid codes
  - Greek alphabet (α, β, γ, etc.)
- Special tokens: `[EQUATION]`, `[CITATION]`, `[MOLECULE]`, `[FIGURE]`, `[TABLE]`, domain tags

## 🧪 Evaluation

Science benchmarks across all 7 domains will be added. Planned benchmarks:

- **Physics**: Feynman Questions, Physics GRE
- **Math**: MATH dataset, GSM8K
- **Chemistry**: Chemistry problem-solving, molecular property prediction
- **Biology**: PubMed QA, bioinformatics tasks
- **Earth Science**: Climate modeling questions
- **Space Science**: Astronomy problem sets
- **Zoology**: Species classification, ecological reasoning

## 📄 License

This is a school science project. Code is provided for educational purposes.

## 🙏 Acknowledgments

- **Mamba** (Gu et al.) for SSM architecture inspiration
- **Flash Attention** (Dao et al.) for efficient attention
- **HuggingFace** for the `transformers` library
- All open scientific data sources: arXiv, PubMed, S2ORC, etc.

## 📧 Contact

For questions or issues, please open an issue on GitHub.

---

**Built with ❤️ for scientific AI research**