---
language:
  - en
license: mit
tags:
  - vortex
  - science
  - physics
  - chemistry
  - biology
  - mathematics
  - ssm
  - mamba
  - hybrid-architecture
  - custom-tokenizer
  - from-scratch
  - matrix-corp
pipeline_tag: text-generation
library_name: transformers
model_type: vortex
---

# Vortex Scientific

**Vortex Scientific** is a from-scratch AI model family designed for deep scientific reasoning. It is built from the ground up on a novel hybrid state-space + attention architecture and optimized for consumer laptop hardware (Apple Silicon MacBooks and Nvidia 4060 laptop GPUs).

## 🌟 Features

- **Novel Architecture**: Hybrid State-Space Model (SSM) + Local Attention blocks
- **Science-Specialized**: Custom tokenizer, domain-aware gating, and specialized modules for equations, numerical reasoning, citations, and molecular structures
- **Hardware Optimized**: Runs smoothly on 8GB VRAM (4060 laptop) and 16GB unified memory (MacBook Pro M2/M3)
- **Two Model Sizes**:
  - **Vortex-7B**: 7 billion parameters, fits in 8GB VRAM
  - **Vortex-13B**: 13 billion parameters, fits in 16GB VRAM with quantization
- **HuggingFace Compatible**: Full integration with `transformers` library
- **From Scratch**: No base model; everything, including the tokenizer and weights, is built bottom-up

## πŸ—οΈ Architecture

Vortex uses a two-block hybrid architecture:

1. **SSM-Only Blocks**: State-space layers for efficient long-context processing (O(n) complexity)
2. **Attention+Science Blocks**: Local windowed attention + science modules + SciGate FFN
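The O(n) claim follows from the sequential scan an SSM performs over the sequence. A minimal scalar sketch of that recurrence (purely illustrative; the real `ssm_layer.py` presumably uses learned, vectorized parameters):

```python
def ssm_scan(xs, a=0.9, b=0.5, c=1.0):
    """Scalar state-space recurrence: h_t = a*h_(t-1) + b*x_t, y_t = c*h_t.
    A single sequential pass over the input, so cost is O(n) in length."""
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x   # state update
        ys.append(c * h)    # readout
    return ys

# An impulse input decays geometrically through the hidden state
print(ssm_scan([1.0, 0.0, 0.0, 0.0]))
```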

Layer ratios:
- 7B: 60% SSM, 40% Attention (pattern: SSM, SSM, Attn, ...)
- 13B: 50% SSM, 50% Attention (pattern: SSM, Attn, SSM, Attn, ...)
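One way to realize these ratios is a cumulative quota that rounds up, which reproduces both opening patterns above (a sketch; the actual block layout in `vortex_model.py` may differ):

```python
import math

def layer_pattern(num_layers, ssm_ratio):
    """Interleave SSM and attention blocks so the running SSM count
    tracks ssm_ratio of the layers seen so far."""
    pattern, ssm_used = [], 0
    for i in range(num_layers):
        if ssm_used < math.ceil((i + 1) * ssm_ratio):
            pattern.append("ssm")
            ssm_used += 1
        else:
            pattern.append("attn")
    return pattern

print(layer_pattern(6, 0.6))  # starts SSM, SSM, Attn as in the 7B pattern
print(layer_pattern(6, 0.5))  # alternates SSM, Attn as in the 13B pattern
```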

### Science Modules

- **EquationModule**: LaTeX equation detection and structural understanding
- **NumericalReasoningModule**: Digit-level encoding, scientific notation, unit awareness
- **CitationModule**: Citation span detection, provenance tracking, confidence scoring
- **MolecularModule**: Element embeddings, SMILES understanding, amino acid sequences
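To give a flavor of what equation detection involves, here is a minimal regex-based sketch for locating LaTeX spans in raw text (purely illustrative; the actual EquationModule operates on model representations, not regexes):

```python
import re

# Match display ($$...$$) before inline ($...$) spans; the inline branch
# forbids '$' and newlines inside so adjacent equations stay separate.
LATEX_SPAN = re.compile(r"\$\$.+?\$\$|\$[^$\n]+\$", re.DOTALL)

def find_equations(text):
    """Return the LaTeX-like spans found in `text`, in order."""
    return [m.group() for m in LATEX_SPAN.finditer(text)]

print(find_equations("Energy is $E = mc^2$; momentum is $p = mv$."))
```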

## 📦 Project Structure

```
Vortex/
├── configs/
│   ├── vortex_7b_config.py      # 7B model configuration
│   ├── vortex_13b_config.py     # 13B model configuration
│   └── training_config.py       # Training hyperparameters
├── models/
│   ├── ssm_layer.py             # State-space layer
│   ├── attention_layer.py       # Local windowed attention
│   ├── scigate_ffn.py           # Science-gated feed-forward
│   ├── vortex_model.py          # Main model class
│   └── science_modules/         # Specialized science modules
├── tokenizer/
│   └── vortex_tokenizer.py      # Custom BPE tokenizer with science vocab
├── data/
│   ├── dataset_loader.py        # Open dataset loading (Pile, S2ORC, etc.)
│   ├── quality_filter.py        # Multi-stage quality filtering
│   ├── domain_classifier.py     # 7-domain classifier
│   ├── deduplication.py         # MinHash LSH deduplication
│   └── scraper.py               # Web scraping (arXiv, PubMed, etc.)
├── training/
│   ├── trainer.py               # Main training loop
│   ├── losses.py                # Science-aware loss functions
│   └── curriculum.py            # Curriculum learning scheduler
├── inference/
│   ├── cuda_optimize.py         # CUDA optimizations (Flash Attention, INT8)
│   └── mps_optimize.py          # MPS optimizations for Apple Silicon
├── evaluation/                  # Science benchmarks (coming soon)
├── configuration_vortex.py      # HF config class
├── tokenization_vortex.py       # HF tokenizer wrapper
├── modeling_vortex.py           # HF model integration
├── train.py                     # Training entry point
├── inference/inference.py       # Inference entry point
└── requirements.txt
```

## 🚀 Quick Start

### Installation

```bash
# Clone and setup
cd Vortex
pip install -r requirements.txt

# For CUDA optimizations
pip install flash-attn
pip install bitsandbytes
```

### Training

```bash
# Train 7B model on CUDA
python train.py \
    --model_size 7b \
    --device cuda \
    --data_dir ./data/processed \
    --output_dir ./checkpoints \
    --max_steps 100000

# Train 13B model with INT8 quantization (for 8GB VRAM)
python train.py \
    --model_size 13b \
    --device cuda \
    --quantization int8 \
    --data_dir ./data/processed \
    --output_dir ./checkpoints_13b
```

### Inference

```bash
# Generate text with 7B model
python inference/inference.py \
    --model_path ./checkpoints/latest.pt \
    --model_size 7b \
    --device cuda \
    --prompt "The equation E = mc^2 describes" \
    --max_new_tokens 100

# Interactive mode
python inference/inference.py \
    --model_path ./checkpoints/latest.pt \
    --model_size 7b \
    --device cuda \
    --interactive

# On Apple Silicon (MPS)
python inference/inference.py \
    --model_path ./checkpoints/latest.pt \
    --model_size 7b \
    --use_mps \
    --prompt "Explain quantum mechanics"
```

### HuggingFace Integration

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("./checkpoints")
tokenizer = AutoTokenizer.from_pretrained("./checkpoints")

# Generate
input_text = "The energy of a photon is given by"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
```

## 📊 Data Pipeline

1. **Open Datasets**: Automatically download from HuggingFace (Pile, S2ORC, Math datasets, PubMed QA)
2. **Quality Filtering**: Multi-stage checks (length, language, equations, repetition, citations)
3. **Deduplication**: MinHash LSH for near-duplicate detection
4. **Domain Classification**: Classify into 7 science domains
5. **Tokenization**: Custom science-aware BPE tokenizer
6. **Sharding**: Write to Parquet with statistics

```python
from data.dataset_loader import VortexDatasetLoader
from data.quality_filter import ScienceQualityFilter
from data.deduplication import MinHashLSH

# Load and process data
loader = VortexDatasetLoader()
quality_filter = ScienceQualityFilter()
lsh = MinHashLSH()

# Stream datasets, filter, deduplicate, and shard
for sample in loader.load_multiple_datasets(["pile_scientific", "automath"]):
    if quality_filter.filter(sample["text"]):
        lsh.add_document(sample["id"], sample["text"])
        # Tokenize and save
```

## 🎯 Training Strategy

### Curriculum Learning

Training progresses through 4 stages:

1. **Foundation** (0-20%): Basic science text, simple equations, definitions
2. **Domain** (20-50%): Domain-specific deep content per science area
3. **Reasoning** (50-80%): Scientific problem solving, multi-step derivations
4. **Integration** (80-100%): Cross-domain science, full dataset
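These boundaries can be expressed as a simple stage lookup (a sketch of the schedule above; `training/curriculum.py` may implement stage switching differently):

```python
def curriculum_stage(progress):
    """Map training progress in [0, 1] to one of the four curriculum stages,
    using the percentage boundaries listed above."""
    if progress < 0.2:
        return "foundation"
    if progress < 0.5:
        return "domain"
    if progress < 0.8:
        return "reasoning"
    return "integration"

print(curriculum_stage(0.35))  # falls inside the Domain stage
```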

### Science-Aware Loss

```python
total_loss = (
    lm_loss * 1.0              # Standard next token prediction
    + equation_loss * 0.3      # Equation reconstruction accuracy
    + domain_loss * 0.1        # Domain classification head
    + citation_loss * 0.1      # Citation detection accuracy
    + numerical_loss * 0.2     # Numerical reasoning accuracy
)
```

## ⚙️ Configuration

### 7B Config (VORTEX_7B_CONFIG)

- `d_model`: 4096
- `num_layers`: 32
- `num_heads`: 32
- `d_state`: 16
- `ssm_ratio`: 0.6
- `vocab_size`: 50000
- `max_seq_len`: 16384

### 13B Config (VORTEX_13B_CONFIG)

- `d_model`: 5120
- `num_layers`: 40
- `num_heads`: 40
- `d_state`: 32
- `ssm_ratio`: 0.5
- `vocab_size`: 50000
- `max_seq_len`: 16384
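A few quantities follow directly from these numbers; both sizes share a 128-dimensional attention head, for instance. Illustrative dicts mirroring the configurations above (the real `configs/vortex_*_config.py` files are Python modules and may expose these fields differently):

```python
import math

# Field values copied from the two config listings above
VORTEX_7B = dict(d_model=4096, num_layers=32, num_heads=32, d_state=16,
                 ssm_ratio=0.6, vocab_size=50_000, max_seq_len=16_384)
VORTEX_13B = dict(d_model=5120, num_layers=40, num_heads=40, d_state=32,
                  ssm_ratio=0.5, vocab_size=50_000, max_seq_len=16_384)

for name, cfg in [("7B", VORTEX_7B), ("13B", VORTEX_13B)]:
    n_ssm = math.ceil(cfg["num_layers"] * cfg["ssm_ratio"])  # SSM block count (rounded up)
    n_attn = cfg["num_layers"] - n_ssm                       # attention block count
    head_dim = cfg["d_model"] // cfg["num_heads"]            # per-head width
    print(f"{name}: {n_ssm} SSM + {n_attn} attention layers, head_dim={head_dim}")
```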

## 🔧 Hardware Targets

### Nvidia 4060 Laptop (8GB VRAM)

- **7B**: BF16, no quantization, Flash Attention 2, torch.compile
- **13B**: INT8 quantization, Flash Attention 2, torch.compile
- Target TPS: 25-40 (7B), 15-25 (13B)

### Apple Silicon (M2/M3)

- **7B on M3**: FP16 (BF16 falls back to float16 on MPS), SDPA, no torch.compile
- **13B on M3 Max**: BF16, unified memory, SDPA
- Target TPS: 20-35 (7B), 12-20 (13B)

## 🧪 Science Domains

1. **Physics** (`[PHYS]`)
2. **Mathematics** (`[MATH]`)
3. **Chemistry** (`[CHEM]`)
4. **Biology** (`[BIO]`)
5. **Earth Science** (`[EARTH]`)
6. **Space Science** (`[SPACE]`)
7. **Zoology** (`[ZOO]`)

Domain tags can be included in training data to guide the SciGate FFN routing.
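A hypothetical helper for attaching these tags during data preparation (the `tag_sample` name and the prefix placement are assumptions for illustration, not the actual pipeline API):

```python
# Tag strings come from the domain list above; dictionary keys are
# illustrative names chosen here, not identifiers from the codebase.
DOMAIN_TAGS = {
    "physics": "[PHYS]", "mathematics": "[MATH]", "chemistry": "[CHEM]",
    "biology": "[BIO]", "earth_science": "[EARTH]", "space_science": "[SPACE]",
    "zoology": "[ZOO]",
}

def tag_sample(text, domain):
    """Prefix a training sample with its science-domain tag."""
    return f"{DOMAIN_TAGS[domain]} {text}"

print(tag_sample("Newton's second law states F = ma.", "physics"))
# → [PHYS] Newton's second law states F = ma.
```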

## 📝 Tokenizer

Custom BPE tokenizer with:

- 40,000 base BPE tokens trained on scientific corpus
- 10,000 science-specific tokens:
  - 500 LaTeX math symbols (`\alpha`, `\sum`, `\int`, etc.)
  - 118 chemical element symbols
  - 200 SI and derived units
  - 300 scientific abbreviations (DNA, RNA, ATP, etc.)
  - 500 mathematical operators
  - Amino acid codes
  - Greek alphabet (α, β, γ, etc.)
- Special tokens: `[EQUATION]`, `[CITATION]`, `[MOLECULE]`, `[FIGURE]`, `[TABLE]`, domain tags
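As one illustration of how the special tokens might be used, a pre-tokenization pass could wrap detected equations and bracketed citations in their markers. The paired-wrapper convention below is an assumption for the sketch; `vortex_tokenizer.py` may annotate text differently:

```python
import re

def annotate(text):
    """Wrap inline LaTeX spans and numeric [n] citations in the tokenizer's
    special tokens (assumed pairing convention, for illustration only)."""
    text = re.sub(r"\$[^$\n]+\$", r"[EQUATION]\g<0>[EQUATION]", text)
    text = re.sub(r"\[(\d+)\]", r"[CITATION]\1[CITATION]", text)
    return text

print(annotate("Planck's law $E = h\\nu$ was derived in 1900 [1]."))
```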

## 🧪 Evaluation

Science benchmarks across all seven domains are planned:

- **Physics**: Feynman Questions, Physics GRE
- **Math**: MATH dataset, GSM8K
- **Chemistry**: Chemistry problem-solving, molecular property prediction
- **Biology**: PubMed QA, bioinformatics tasks
- **Earth Science**: Climate modeling questions
- **Space Science**: Astronomy problem sets
- **Zoology**: Species classification, ecological reasoning

## 📄 License

Released under the MIT license. This is a school science project, and the code is provided for educational purposes.

## 🙏 Acknowledgments

- **Mamba** (Gu et al.) for SSM architecture inspiration
- **Flash Attention** (Dao et al.) for efficient attention
- **HuggingFace** for transformers library
- All open scientific data sources: arXiv, PubMed, S2ORC, etc.

## 📧 Contact

For questions or issues, please open an issue on GitHub.

---

**Built with ❤️ for scientific AI research**