---
language:
- en
license: mit
tags:
- vortex
- science
- physics
- chemistry
- biology
- mathematics
- ssm
- mamba
- hybrid-architecture
- custom-tokenizer
- from-scratch
- matrix-corp
pipeline_tag: text-generation
library_name: transformers
model_type: vortex
---

# Vortex Scientific

**Vortex Scientific** is a from-scratch AI model family designed for deep scientific reasoning. It is built from the ground up around a novel hybrid state-space + attention architecture and optimized for consumer laptop hardware (Apple Silicon MacBooks and Nvidia RTX 4060 laptop GPUs).

## 🌟 Features

- **Novel Architecture**: Hybrid State-Space Model (SSM) + local attention blocks
- **Science-Specialized**: Custom tokenizer, domain-aware gating, and specialized modules for equations, numerical reasoning, citations, and molecular structures
- **Hardware Optimized**: Runs smoothly on 8GB VRAM (RTX 4060 laptop GPU) and 16GB unified memory (MacBook Pro M2/M3)
- **Two Model Sizes**:
  - **Vortex-7B**: 7 billion parameters, fits in 8GB VRAM
  - **Vortex-13B**: 13 billion parameters, fits in 16GB VRAM with quantization
- **HuggingFace Compatible**: Full integration with the `transformers` library
- **From Scratch**: No base model; everything is built bottom-up, including the tokenizer and weights

## 🏗️ Architecture

Vortex uses a two-block hybrid architecture:

1. **SSM-Only Blocks**: State-space layers for efficient long-context processing (O(n) complexity)
2. **Attention+Science Blocks**: Local windowed attention + science modules + SciGate FFN

Layer ratios:

- 7B: 60% SSM, 40% attention (pattern: SSM, SSM, Attn, ...)
- 13B: 50% SSM, 50% attention (pattern: SSM, Attn, SSM, Attn, ...)
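
As a rough illustration, a layer pattern like the ones above can be generated from `num_layers` and `ssm_ratio` by error diffusion. This is only a sketch; the function name and logic are assumptions, and the real construction lives in `models/vortex_model.py`:

```python
def layer_pattern(num_layers: int, ssm_ratio: float) -> list[str]:
    """Interleave SSM and attention blocks so that roughly
    ssm_ratio of all layers are SSM-only blocks."""
    pattern = []
    attn_budget = 0.0
    for _ in range(num_layers):
        attn_budget += 1.0 - ssm_ratio
        if attn_budget >= 1.0 - 1e-9:  # time for an attention block
            pattern.append("attn")
            attn_budget -= 1.0
        else:
            pattern.append("ssm")
    return pattern

# ssm_ratio=0.6 yields SSM, SSM, Attn, SSM, Attn, ...
# ssm_ratio=0.5 yields SSM, Attn, SSM, Attn, ...
```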

### Science Modules

- **EquationModule**: LaTeX equation detection and structural understanding
- **NumericalReasoningModule**: Digit-level encoding, scientific notation, unit awareness
- **CitationModule**: Citation span detection, provenance tracking, confidence scoring
- **MolecularModule**: Element embeddings, SMILES understanding, amino acid sequences
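
To make the EquationModule's first step concrete, here is a minimal, hypothetical sketch of LaTeX equation-span detection over raw text. The regex, pattern, and function name are illustrative assumptions, not the actual implementation, which operates inside the model:

```python
import re

# Matches display math ($$...$$) or inline math ($...$); illustrative only
EQ_PATTERN = re.compile(r"\$\$.+?\$\$|\$.+?\$", re.DOTALL)

def find_equation_spans(text: str) -> list[tuple[int, int]]:
    """Return (start, end) character spans of LaTeX math segments."""
    return [m.span() for m in EQ_PATTERN.finditer(text)]
```

Spans found this way could then be wrapped in the `[EQUATION]` special token before tokenization.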

## 📦 Project Structure

```
Vortex/
├── configs/
│   ├── vortex_7b_config.py   # 7B model configuration
│   ├── vortex_13b_config.py  # 13B model configuration
│   └── training_config.py    # Training hyperparameters
├── models/
│   ├── ssm_layer.py          # State-space layer
│   ├── attention_layer.py    # Local windowed attention
│   ├── scigate_ffn.py        # Science-gated feed-forward
│   ├── vortex_model.py       # Main model class
│   └── science_modules/      # Specialized science modules
├── tokenizer/
│   └── vortex_tokenizer.py   # Custom BPE tokenizer with science vocab
├── data/
│   ├── dataset_loader.py     # Open dataset loading (Pile, S2ORC, etc.)
│   ├── quality_filter.py     # Multi-stage quality filtering
│   ├── domain_classifier.py  # 7-domain classifier
│   ├── deduplication.py      # MinHash LSH deduplication
│   └── scraper.py            # Web scraping (arXiv, PubMed, etc.)
├── training/
│   ├── trainer.py            # Main training loop
│   ├── losses.py             # Science-aware loss functions
│   └── curriculum.py         # Curriculum learning scheduler
├── inference/
│   ├── inference.py          # Inference entry point
│   ├── cuda_optimize.py      # CUDA optimizations (Flash Attention, INT8)
│   └── mps_optimize.py       # MPS optimizations for Apple Silicon
├── evaluation/               # Science benchmarks (coming soon)
├── configuration_vortex.py   # HF config class
├── tokenization_vortex.py    # HF tokenizer wrapper
├── modeling_vortex.py        # HF model integration
├── train.py                  # Training entry point
└── requirements.txt
```

## 🚀 Quick Start

### Installation

```bash
# Clone and set up
cd Vortex
pip install -r requirements.txt

# For CUDA optimizations
pip install flash-attn
pip install bitsandbytes
```

### Training

```bash
# Train the 7B model on CUDA
python train.py \
    --model_size 7b \
    --device cuda \
    --data_dir ./data/processed \
    --output_dir ./checkpoints \
    --max_steps 100000

# Train the 13B model with INT8 quantization (for 8GB VRAM)
python train.py \
    --model_size 13b \
    --device cuda \
    --quantization int8 \
    --data_dir ./data/processed \
    --output_dir ./checkpoints_13b
```

### Inference

```bash
# Generate text with the 7B model
python inference/inference.py \
    --model_path ./checkpoints/latest.pt \
    --model_size 7b \
    --device cuda \
    --prompt "The equation E = mc^2 describes" \
    --max_new_tokens 100

# Interactive mode
python inference/inference.py \
    --model_path ./checkpoints/latest.pt \
    --model_size 7b \
    --device cuda \
    --interactive

# On Apple Silicon (MPS)
python inference/inference.py \
    --model_path ./checkpoints/latest.pt \
    --model_size 7b \
    --use_mps \
    --prompt "Explain quantum mechanics"
```

### HuggingFace Integration

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer; a custom model_type like "vortex"
# requires trust_remote_code=True
model = AutoModelForCausalLM.from_pretrained("./checkpoints", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("./checkpoints", trust_remote_code=True)

# Generate
input_text = "The energy of a photon is given by"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
```

## 📊 Data Pipeline

1. **Open Datasets**: Automatically download from HuggingFace (the Pile, S2ORC, math datasets, PubMedQA)
2. **Quality Filtering**: Multi-stage checks (length, language, equations, repetition, citations)
3. **Deduplication**: MinHash LSH for near-duplicate detection
4. **Domain Classification**: Classify into 7 science domains
5. **Tokenization**: Custom science-aware BPE tokenizer
6. **Sharding**: Write to Parquet with statistics

```python
from data.dataset_loader import VortexDatasetLoader
from data.quality_filter import ScienceQualityFilter
from data.deduplication import MinHashLSH

# Set up the pipeline stages
loader = VortexDatasetLoader()
quality_filter = ScienceQualityFilter()
lsh = MinHashLSH()

# Stream datasets, filter, deduplicate, and shard
for sample in loader.load_multiple_datasets(["pile_scientific", "automath"]):
    if quality_filter.filter(sample["text"]):
        lsh.add_document(sample["id"], sample["text"])
        # Tokenize and save
```

## 🎯 Training Strategy

### Curriculum Learning

Training progresses through 4 stages:

1. **Foundation** (0-20%): Basic science text, simple equations, definitions
2. **Domain** (20-50%): Domain-specific deep content per science area
3. **Reasoning** (50-80%): Scientific problem solving, multi-step derivations
4. **Integration** (80-100%): Cross-domain science, full dataset
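
A minimal sketch of how the scheduler in `training/curriculum.py` might map training progress to one of the four stages (the function name and stage labels are illustrative assumptions):

```python
def curriculum_stage(step: int, max_steps: int) -> str:
    """Map training progress (step / max_steps) to a curriculum stage."""
    progress = step / max_steps
    if progress < 0.2:
        return "foundation"
    if progress < 0.5:
        return "domain"
    if progress < 0.8:
        return "reasoning"
    return "integration"
```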

### Science-Aware Loss

```python
total_loss = (
    lm_loss * 1.0          # Standard next-token prediction
    + equation_loss * 0.3  # Equation reconstruction accuracy
    + domain_loss * 0.1    # Domain classification head
    + citation_loss * 0.1  # Citation detection accuracy
    + numerical_loss * 0.2 # Numerical reasoning accuracy
)
```

## ⚙️ Configuration

### 7B Config (`VORTEX_7B_CONFIG`)

- `d_model`: 4096
- `num_layers`: 32
- `num_heads`: 32
- `d_state`: 16
- `ssm_ratio`: 0.6
- `vocab_size`: 50000
- `max_seq_len`: 16384

### 13B Config (`VORTEX_13B_CONFIG`)

- `d_model`: 5120
- `num_layers`: 40
- `num_heads`: 40
- `d_state`: 32
- `ssm_ratio`: 0.5
- `vocab_size`: 50000
- `max_seq_len`: 16384
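
For illustration, the two configs could be expressed as a single dataclass like the one below. This is a sketch; the real definitions live in `configs/vortex_7b_config.py` and `configs/vortex_13b_config.py` and may be shaped differently:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VortexConfig:
    """Hyperparameters shared by all Vortex model sizes."""
    d_model: int
    num_layers: int
    num_heads: int
    d_state: int
    ssm_ratio: float
    vocab_size: int = 50_000
    max_seq_len: int = 16_384

VORTEX_7B_CONFIG = VortexConfig(d_model=4096, num_layers=32,
                                num_heads=32, d_state=16, ssm_ratio=0.6)
VORTEX_13B_CONFIG = VortexConfig(d_model=5120, num_layers=40,
                                 num_heads=40, d_state=32, ssm_ratio=0.5)
```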

## 🔧 Hardware Targets

### Nvidia RTX 4060 Laptop GPU (8GB VRAM)

- **7B**: BF16, no quantization, Flash Attention 2, `torch.compile`
- **13B**: INT8 quantization, Flash Attention 2, `torch.compile`
- Target throughput: 25-40 tokens/s (7B), 15-25 tokens/s (13B)

### Apple Silicon (M2/M3)

- **7B on M3**: BF16 (via float16), SDPA, no compile
- **13B on M3 Max**: BF16, unified memory, SDPA
- Target throughput: 20-35 tokens/s (7B), 12-20 tokens/s (13B)

## 🧪 Science Domains

1. **Physics** (`[PHYS]`)
2. **Mathematics** (`[MATH]`)
3. **Chemistry** (`[CHEM]`)
4. **Biology** (`[BIO]`)
5. **Earth Science** (`[EARTH]`)
6. **Space Science** (`[SPACE]`)
7. **Zoology** (`[ZOO]`)

Domain tags can be included in training data to guide the SciGate FFN routing.

## 📝 Tokenizer

Custom BPE tokenizer with:

- 40,000 base BPE tokens trained on a scientific corpus
- 10,000 science-specific tokens:
  - 500 LaTeX math symbols (`\alpha`, `\sum`, `\int`, etc.)
  - 118 chemical element symbols
  - 200 SI and derived units
  - 300 scientific abbreviations (DNA, RNA, ATP, etc.)
  - 500 mathematical operators
  - Amino acid codes
  - Greek alphabet (α, β, γ, etc.)
- Special tokens: `[EQUATION]`, `[CITATION]`, `[MOLECULE]`, `[FIGURE]`, `[TABLE]`, domain tags
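
As a hedged sketch of how the science vocabulary might be appended after the 40,000 base BPE ids (the helper name and the abbreviated token lists are illustrative assumptions; see `tokenizer/vortex_tokenizer.py` for the real construction):

```python
def build_science_vocab(base_vocab_size: int = 40_000) -> dict[str, int]:
    """Assign ids to science tokens, starting right after the base BPE vocab."""
    special = ["[EQUATION]", "[CITATION]", "[MOLECULE]", "[FIGURE]", "[TABLE]"]
    domains = ["[PHYS]", "[MATH]", "[CHEM]", "[BIO]", "[EARTH]", "[SPACE]", "[ZOO]"]
    latex = [r"\alpha", r"\sum", r"\int"]            # sample of the ~500 LaTeX symbols
    greek = [chr(c) for c in range(0x03B1, 0x03CA)]  # α through ω
    tokens = special + domains + latex + greek
    return {tok: base_vocab_size + i for i, tok in enumerate(tokens)}
```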

## 🧪 Evaluation

Science benchmarks across all 7 domains will be added. Planned benchmarks:

- **Physics**: Feynman questions, Physics GRE
- **Math**: MATH dataset, GSM8K
- **Chemistry**: Chemistry problem solving, molecular property prediction
- **Biology**: PubMedQA, bioinformatics tasks
- **Earth Science**: Climate modeling questions
- **Space Science**: Astronomy problem sets
- **Zoology**: Species classification, ecological reasoning
|
| | ## 📄 License |
| |
|
| | This is a school science project. Code is provided for educational purposes. |
| |
|
| | ## 🙏 Acknowledgments |
| |
|
| | - **Mamba** (Gu et al.) for SSM architecture inspiration |
| | - **Flash Attention** (Dao et al.) for efficient attention |
| | - **HuggingFace** for transformers library |
| | - All open scientific data sources: arXiv, PubMed, S2ORC, etc. |
| |
|
| | ## 📧 Contact |
| |
|
| | For questions or issues, please open an issue on GitHub. |
| |
|
| | --- |
| |
|
| | **Built with ❤️ for scientific AI research** |
| |
|